
Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM


Charles Block∗ (University of Illinois at Urbana-Champaign, IL, USA)
Gerasimos Gerogiannis∗ (University of Illinois at Urbana-Champaign, IL, USA)
Charith Mendis (University of Illinois at Urbana-Champaign, IL, USA)
Ariful Azad (Indiana University, Bloomington, IN, USA)
Josep Torrellas (University of Illinois at Urbana-Champaign, IL, USA)
Abstract

Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead.

We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11x over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.

CCS Concepts: • Computing methodologies → Distributed algorithms.

Keywords: High-performance computing, distributed algorithms, sparse matrices, SpMM

ACM Reference Format:
Charles Block, Gerasimos Gerogiannis, Charith Mendis, Ariful Azad, and Josep Torrellas. 2024. Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), April 27-May 1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3620665.3640427

∗ Both authors contributed equally to this research. Order is alphabetical.

1 Introduction

Sparse matrix dense matrix multiplication (SpMM) is a key kernel in sparse linear algebra. It has applications across a wide range of domains. For example, SpMM is a key operation in Latent Dirichlet Allocation, Non-negative Matrix Factorization, and Alternating Least Squares [11]. It is the bottleneck primitive in various Graph Neural Networks [17, 29, 30] and an integral part of popular graph learning frameworks such as PyTorch Geometric (PyG) [18] and Deep Graph Library (DGL) [55].

The ever-increasing computing and memory demands of sparse matrix computations introduce the need for efficient distributed SpMM. However, designing efficient distributed SpMM algorithms is challenging [7, 8, 48]. Due to the low arithmetic intensity of this kernel, the communication cost, rather than the computation cost, typically dominates the execution time. Such cost depends on how communication is scheduled [8].

Communication can be scheduled in a sparsity-unaware manner, such as with collective communications. For example, assume that each node originally hosts a block of the dense input matrix. This block may be sent to all the other nodes by using collectives or RDMA accesses to fully replicate it [48], or by using shifting algorithms [8] similar to those that would be used for dense computation. With this strategy, execution is often inefficient due to redundant data transfers, since parts of the dense input matrix may not be needed by some nodes.

On the other hand, communication can be scheduled in a fine-grained sparsity-aware manner [3], communicating only the truly necessary data and computing asynchronously. Specifically, when a node processes a sparse element but does not own the necessary dense row for that computation, it gets that row with a fine-grained one-sided request. With this strategy, execution can also be inefficient due to high software overheads and the need for more network round-trips [8].

In this work, we observe that individual sparse matrices often contain regions that are relatively denser and regions that are relatively sparser. Based on this observation, we develop a model that partitions computation and communication into sparsity-unaware and sparsity-aware portions. Relatively denser regions of the sparse matrix are broken down into Synchronous Stripes, and the corresponding parts of the dense input matrix are transferred with Sparsity-Unaware Transfers (SUT). Relatively sparser regions are broken down into Asynchronous Stripes, and the corresponding parts of the dense input matrix are transferred with Sparsity-Aware Transfers (SAT).

Leveraging the partition, we develop a new algorithm that performs collective communication for the synchronous stripes, and fine-grained, one-sided communication and asynchronous computation for the asynchronous stripes. We call the algorithm Two-Face. The synchronous and asynchronous parts of the sparse matrix are processed in parallel, and the model aims at equalizing the runtimes of the two parts.

We evaluate Two-Face on a CPU-based supercomputer using large matrices and compare it to state-of-the-art baselines. For a system with 32 nodes, 128 cores per node, and dense matrices with 128 columns, Two-Face attains an average speedup of 2.11x against dense shifting [8], a high-performing baseline. In addition, Two-Face is a scalable algorithm: its average speedup over dense shifting increases to 2.21x for 64 nodes. Finally, the overhead introduced by the necessary matrix preprocessing step is small enough to make Two-Face suitable for applications that use the same sparse matrix only a few dozen times.

Overall, this paper's contributions are:
• The Two-Face algorithm for distributed SpMM, which is based on a mix of collective and one-sided communication.
• A low-overhead model and method to partition sparse matrices into regions corresponding to the two access types.
• An evaluation of Two-Face on a supercomputer and a comparison with the state-of-the-art.

2 Background

In this section, we provide a background on the memory access patterns in SpMM, the 1D partitioning method for distributing the SpMM data structures in a multi-node system, and the differences in the communication patterns of SUT and SAT.

2.1 SpMM

In SpMM, a sparse matrix A and a dense matrix B are multiplied, and the result is a dense matrix C, as expressed by C = A × B. We refer to the number of columns in the dense matrices as K. Figure 1a illustrates the memory accesses and computation of this kernel. Note that B is shown transposed in the whole paper to help visualization. For each nonzero of the input sparse matrix A, one row of B and one row of C are accessed. Those rows are indexed by the nonzero's coordinates: the row index of the nonzero (r_id) indexes C, while the column index of the nonzero (c_id) indexes B. For example, in the figure, nonzero a triggers read accesses from the dense input row of B in black, and read and write accesses to the dense output row of C in black. The B row is scaled by a and added to the C row, i.e., C[1,0:K-1] = C[1,0:K-1] + a × B[6,0:K-1]. This process is repeated for all the nonzeros of the sparse matrix. When the dense matrices are distributed in a multi-node machine, the nonzero structure of A affects the inter-node data transfer patterns.

Figure 1. SpMM and 1D partitioning: (a) computation in SpMM; (b) 1D partitioning for 4 nodes.

2.2 1D Partitioning

When executing SpMM on a distributed system, the sparse and dense matrices should be partitioned across nodes. In this work, we use 1D partitioning, which in prior work [8] was shown to display good performance for many sparse input matrices. Figure 1b illustrates 1D partitioning for a system consisting of 4 nodes (N0...N3). The matrices are partitioned according to the node colors. Each node is responsible for the nonzeros in a set of consecutive rows of A. It additionally hosts a set of consecutive B and C rows as shown in Figure 1b. The read and write accesses to the dense output matrix C are always local. The accesses to the dense input matrix B are often remote, except for nonzeros with c_ids that index to the local portion of B (e.g., nonzeros with c_id equal to 0 or 1 in Node 1). Overall, with this partitioning scheme, remote data accesses occur only for B. From now on, we will use the term remote transfers to mean transferring remote elements of the B matrix.
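To make the access pattern above concrete, the following is a minimal sketch (ours, not the paper's released code) of the per-node computation under 1D partitioning: the node walks its COO nonzeros and accumulates scaled rows of B into its local rows of C. The row-lookup callback local_B_row and the row_offset parameter are hypothetical names for whatever mechanism returns an already-fetched (local or remote) row of B.

// Minimal sketch of the per-node SpMM computation under 1D partitioning.
#include <cstdint>
#include <vector>

struct CooNZ { int64_t r_id; int64_t c_id; double val; };

void local_spmm(const std::vector<CooNZ>& A_local,            // nonzeros owned by this node
                const double* (*local_B_row)(int64_t c_id),   // hypothetical row lookup (local or fetched)
                std::vector<double>& C_local,                  // rows of C owned by this node, row-major
                int64_t row_offset,                            // first C row owned by this node
                int K) {
  for (const CooNZ& nz : A_local) {
    const double* b = local_B_row(nz.c_id);                    // row of B indexed by c_id
    double* c = &C_local[(nz.r_id - row_offset) * K];          // row of C indexed by r_id (always local)
    for (int k = 0; k < K; ++k) c[k] += nz.val * b[k];         // C[r_id,:] += val * B[c_id,:]
  }
}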

Figure 2. Speedup of Async Fine over a full-replication AllGather collective implementation for K = 32 (left) and K = 128 (right). There is no data for kmer with K = 128 due to the large memory consumption of the full-replication collective algorithm.

2.3 Sparsity-unaware and Sparsity-aware Transfers

We now discuss the communication patterns associated with sparsity-unaware (SUT) and sparsity-aware transfers (SAT).

Figure 3a illustrates one possible SUT pattern. Each node sends its local rows of B to all of the other nodes. This is a conservative approach: all local rows are transferred to all the other nodes regardless of whether the rows are actually useful to all the destination nodes. This pattern can be implemented either with collectives [48] such as AllGather [12, 51], which fully replicates the dense input, or with shifting algorithms [8] that perform the transfers iteratively in a cyclic manner. A shifting algorithm consists of a series of computation and communication steps.

Figure 3. Examples of SUT and SAT patterns: (a) SUT pattern; (b) SAT pattern.

Figure 3b illustrates an example of the SAT approach. Each node traverses its local partition of the sparse matrix A and issues fine-grained read requests for rows of B. For example, Node N0 is responsible for the nonzero a, but the corresponding B row that this nonzero requires is hosted in N3. Hence, as shown in the figure, N0 issues a request to N3 for the row (dashed arrow) and N3 sends the row (solid arrow). Similar requests are issued for nonzeros c and d, but not for b and e, since the required B rows of the latter are already local.

The SATs are fine-grained: only a single row is transferred instead of all the rows of a node as in the SUT approach. In addition, requests are typically initiated by the receiver and are asynchronous (i.e., the set of all nodes does not need to synchronize for the request to be completed). For these reasons, we refer to this particular communication pattern for distributed SpMM as Async Fine. This is in contrast to the execution strategies that use SUTs, where transfers are coarse-grained and require more synchronization.

3 Motivation

We now show that, for some matrices, the SAT pattern is more suitable, while for others the SUT is better. Then, we provide an example to motivate that a combination of the two approaches for the same matrix can yield the best results.

3.1 Choice of SAT & SUT SpMM is Input Dependent

Both the SUT and SAT approaches have pitfalls: SUTs can lead to unnecessary data transfers, while SATs can have high software overhead and more round-trips between nodes. To compare their performance, we profile the execution of distributed SpMM using Async Fine (a SAT approach) and AllGather collectives (a SUT approach). We use a distributed machine with 32 nodes and 128 CPU cores per node, running the two algorithms for 8 large sparse matrices from SuiteSparse [15], and for two different values of K (32 and 128). Section 6 gives more details about our methodology.

Figure 2 displays our findings. The figure shows the speedup of Async Fine over the AllGather implementation (Collectives) for K = 32 (left) and K = 128 (right). We do not include results for the kmer matrix with K = 128 because the single-node memory demand of Collectives exceeds the single-node capacity in our system.

We see that, for half of the matrices, Async Fine outperforms Collectives, while the opposite is the case for the other half. The speedup of Async Fine over Collectives reaches 11.5x, while the speedup of Collectives over Async Fine reaches 10.5x. Clearly, whether SUT or SAT works best depends on the sparsity pattern of the input matrix.
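For reference, the two communication flavors compared in this section map roughly onto standard MPI operations as sketched below. This is a simplified illustration under the assumption that B blocks are contiguous buffers and, for the one-sided case, that an MPI window exposing each node's block has already been created; it is not the implementation evaluated in this paper.

// Sketch of the two transfer flavors (simplified; error handling omitted).
#include <mpi.h>
#include <vector>

// SUT flavor: every node contributes its block of B and receives all others
// (full replication with a collective), whether or not all rows are needed.
void sut_allgather(const std::vector<double>& B_local, std::vector<double>& B_full,
                   int block_elems, MPI_Comm comm) {
  MPI_Allgather(B_local.data(), block_elems, MPI_DOUBLE,
                B_full.data(),  block_elems, MPI_DOUBLE, comm);
}

// SAT flavor: the receiver pulls only one row of B from the owner node with a
// one-sided get on a pre-created window that exposes each node's B block.
void sat_get_row(MPI_Win win, int owner, MPI_Aint row_disp_in_owner, int K,
                 std::vector<double>& row_buf) {
  MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
  MPI_Get(row_buf.data(), K, MPI_DOUBLE, owner, row_disp_in_owner, K, MPI_DOUBLE, win);
  MPI_Win_unlock(owner, win);   // completes the transfer
}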

3.2 Combining SUT & SAT for a Single SpMM

Typically, the nonzeros in a sparse matrix are not evenly distributed. Consequently, combining the SAT and SUT communication flavors for a single SpMM could be beneficial. An example of this idea is shown in Figure 4. We split the nonzeros into three categories: (1) local-input nonzeros (NNZs) are those for which the dense input rows needed are already local to the node with the nonzeros and, therefore, no remote transfers are needed; (2) sync nonzeros are those for which the dense input rows needed are better off transferred through SUTs; and (3) async nonzeros are those for which it is more beneficial to transfer the rows through SATs.

Figure 4. Example of how combining the two communication flavors can be beneficial.

To understand which nonzeros are sync and which are async, consider the example. Columns 4 and 5 of the sparse matrix are quite dense. This means that the corresponding dense input rows of B (shaded in the figure) are useful to many nodes. Specifically, B[4,0:K-1] is needed by N1, N2, and N3, while B[5,0:K-1] is needed by all the nodes. Hence, it is likely beneficial to transfer the whole group of rows hosted by N2 to all the other nodes through a collective broadcast operation. Hence, we classify the nonzeros at (0,5), (2,4), (3,5), (6,4), and (6,5) as sync. On the other hand, B[0,0:K-1] is not needed by any of the remote nodes, and B[1,0:K-1] is only needed by the N2 remote node (N0 also accesses both rows, but they are already local and no transfers are needed). Transferring the whole group of dense rows that N0 hosts through a collective broadcast would lead to many unnecessary data transfers. Thus, the best strategy is likely for N2 to issue a one-sided request to N0 to get B[1,0:K-1], without synchronizing with the rest of the nodes. Therefore, the nonzero at (5,1) is classified as async. A similar situation occurs with the nonzero at (1,6), which is an async nonzero. Note that a collective broadcast operation does not necessarily need to include all nodes as destinations. For example, if the nonzero at (0,5) were absent, then N2 could still issue a multicast transfer directed only to N1 and N3.

4 Overview of Two-Face

In this section, we translate the intuition about combining the SUT and SAT approaches into an algorithm called Two-Face. Two-Face combines sparsity-unaware and sparsity-aware transfers for efficient distributed SpMM. We describe here how sparse matrices are analyzed and partitioned into two types of regions.

4.1 Sparse Matrix Partitioning

Before the SpMM execution begins, the sparse matrix is preprocessed in order to determine which of the data transfers will use coarse-grained multicast operations, and which will use fine-grained one-sided communication. Then, during runtime, both types of transfers and their corresponding computations will proceed in parallel.

Two-Face adopts 1D partitioning (Subsection 2.2). Thus, each node, which contains one MPI rank, is responsible for the nonzeros in a group of consecutive rows of sparse matrix A. In addition, the node hosts the corresponding group of consecutive rows of the dense output matrix C, and a group of consecutive rows of the dense input matrix B. As discussed earlier, the accesses to C and A are always local, while the accesses to B can either be local or require a remote data transfer, depending on the c_ids of the processed nonzeros. The matrices are partitioned in the following way:

Megatile. As shown in Figure 5, we logically divide the matrix A into megatiles (MT). Given the matrix A with N rows and M columns, and given p nodes in the distributed system, a megatile is formed with N/p consecutive rows and M/p consecutive columns. Node i stores the i-th row of megatiles. We logically divide the matrices B and C based on the width and height of a megatile, respectively. Figure 5 shows the breakdown of the matrices for p = 4. The chunks of the B and C matrices are distributed across the nodes as shown in the figure, where node i is labeled Ni.

Figure 5. Two-Face megatiles and sparse and dense stripes.
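The bookkeeping implied by this partitioning reduces to simple index arithmetic. The sketch below is ours; it assumes that N and M are exact multiples of p and that W, the stripe width introduced in the following paragraphs, divides the megatile width.

// Index arithmetic implied by 1D partitioning with megatiles and stripes (sketch).
#include <cstdint>

struct Partition { int64_t N, M; int p; int64_t W; };   // matrix dims, node count, stripe width

int owner_of_A_row(const Partition& pt, int64_t r_id) { return int(r_id / (pt.N / pt.p)); }
int owner_of_B_row(const Partition& pt, int64_t c_id) { return int(c_id / (pt.M / pt.p)); }

// Dense stripes (and the sparse stripes that read them) cover W consecutive columns,
// so a column index maps to a global stripe id as:
int64_t stripe_of_col(const Partition& pt, int64_t c_id) { return c_id / pt.W; }

// A nonzero is local-input when the node owning its A row also owns its B row.
bool is_local_input(const Partition& pt, int64_t r_id, int64_t c_id) {
  return owner_of_A_row(pt, r_id) == owner_of_B_row(pt, c_id);
}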

Sparse stripe. Each megatile is further divided into sparse stripes, which determine the communication patterns. A sparse stripe has the same number of rows as a megatile and a fixed number of columns W. We choose to divide megatiles into sparse stripes to allow for partitioning the sparse matrix into two types of regions at a fine granularity. Sparse stripes are classified as local-input if their corresponding B rows are owned by the local node; otherwise, they are remote-input. A remote-input sparse stripe can either trigger a coarse-grained collective transfer or a fine-grained asynchronous one. At a high level, remote-input stripes requiring many rows of the dense input matrix B will be marked as synchronous as, during execution, they will benefit from the coarse-grained collectives. On the contrary, remote-input stripes requiring few rows of B will be marked as asynchronous as, during execution, they will benefit from fine-grained asynchronous transfers. During execution, multiple nodes containing synchronous stripes that require the same data from B will participate in the same collective multicast operation to receive that data. It is possible that a "multicast" transfers data to only a single destination node.

Dense stripe. All the sparse stripes that have the same range of c_ids in A access the same group of rows in matrix B. We call this group of dense rows a dense stripe. A synchronous sparse stripe will trigger the coarse-grained transfer of a dense stripe using a collective operation with, potentially, additional destination nodes; an asynchronous sparse stripe will trigger the fine-grained one-sided transfer of individual rows (or groups of adjacent rows) within the dense stripe. These asynchronous transfers will only transfer the rows of the dense stripe that are needed for the computation. If a node does not need any of a dense stripe's rows, that dense stripe will not be communicated to it at all.

In the next sections, we use stripe to refer to sparse stripes. Any mention of dense stripes is explicit.

Stripes are classified as asynchronous or synchronous during a preprocessing step using a model that tries to minimize the expected execution time. We present the model in Section 4.2. During the actual execution after the preprocessing step, the local threads in an MPI rank operate in parallel and are split into two groups: (1) synchronous threads, which handle the data transfers for the synchronous stripes as well as the computation for synchronous and local-input stripes, and (2) asynchronous threads, which handle the data transfers and computation for the asynchronous stripes. All synchronous communication is completed before any synchronous computation begins. On the other hand, asynchronous communication and computation overlap: a thread may compute on one asynchronous stripe while another thread transfers data for a second asynchronous stripe.

To optimize computation and communication efficiency, the nonzeros in sparse stripes are ordered in row-major order in synchronous stripes and column-major order in asynchronous stripes. In synchronous stripes, the nonzeros are stored in row-major order because this benefits computation. Specifically, a thread can process the nonzeros of a whole row of the stripe and buffer the results in a thread-local buffer before updating C. Then, the thread uses a single synchronization operation to accumulate the contents of the thread-local buffer into the corresponding C row. In asynchronous stripes, the nonzeros are stored in column-major order to benefit communication. Specifically, a column-major format allows a thread to quickly traverse the nonzeros and determine the unique c_ids of the nonzeros in a stripe, which in turn identify the rows of matrix B that need to be transferred. This comes at the cost of computational inefficiency, since this format makes buffering the output of a thread's computations hard. As a result, the thread must typically use one synchronization operation for each nonzero to accumulate results onto C.

4.2 Preprocessing Model

During execution, Two-Face will process the asynchronous stripes in parallel with the synchronous and local-input stripes. Consequently, the optimal choice to partition the sparse matrix into synchronous and asynchronous stripes is one that equalizes the execution times of asynchronous stripes and synchronous/local-input stripes. To this end, we create a model of execution based on the following ideas.

For the synchronous stripes, the model assumes that the computation time will be negligible compared to the synchronous communication time. The reason is that the row-major format of the nonzeros lends itself to efficient execution: the output of the nonzeros in a row of the stripe is reused through a thread-local buffer and accumulated into the corresponding C row with a single synchronization operation. In addition, we take advantage of this parallelizability by assigning more parallel threads to the computation of synchronous/local-input stripes than to the asynchronous stripes. For the local-input stripes, since they do not need communication, the model neglects both communication and computation time.

In contrast, the computation time for asynchronous stripes may be significant because the column-major format of the nonzeros lends itself to inefficient execution: thread-local buffers are not used and we need to perform a synchronization operation for every nonzero. In addition, because synchronization may be a bottleneck, we assign fewer threads to the computation of asynchronous stripes than to the others. Therefore, for the asynchronous stripes, the model considers both computation and communication.

We model the cost of synchronous communication (Comm_S), asynchronous communication (Comm_A), and asynchronous computation (Comp_A) for a particular node as:

Comm_S = S_S (β_S K W + α_S)
Comm_A = β_A K L_A + α_A S_A
Comp_A = γ_A K N_A + κ_A S_A
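For illustration, the three cost terms translate directly into code. The struct and function names below are ours, not the paper's, and the coefficient values would come from the calibration step described in Section 6.2.

// Direct transcription of the per-node cost terms (a sketch).
struct CostCoeffs { double beta_S, alpha_S, beta_A, alpha_A, gamma_A, kappa_A; };

double comm_sync(const CostCoeffs& c, double S_S, double K, double W) {
  return S_S * (c.beta_S * K * W + c.alpha_S);           // Comm_S
}
double comm_async(const CostCoeffs& c, double S_A, double K, double L_A) {
  return c.beta_A * K * L_A + c.alpha_A * S_A;           // Comm_A
}
double comp_async(const CostCoeffs& c, double S_A, double K, double N_A) {
  return c.gamma_A * K * N_A + c.kappa_A * S_A;          // Comp_A
}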

Consider Comm_S first. S_S is the number of synchronous stripes processed by the node, W is the stripe width, and K is the number of columns in the dense matrices. β_S is the cost of synchronous transfer per element of B (i.e., it is inversely proportional to the bandwidth), and α_S is other per-stripe overheads of synchronous transfers.

Next, for Comm_A, S_A is the number of asynchronous stripes processed by the node, and L_A is the total number of rows of the dense matrix B transferred for these stripes via fine-grained accesses. β_A and α_A represent the same costs as β_S and α_S but for asynchronous accesses.

Finally, for Comp_A, N_A is the total number of nonzeros in the asynchronous stripes processed by the node. γ_A is the computational cost per operation, and κ_A is the additional per-stripe software overhead of asynchronous computation.

The coefficients β_S, α_S, β_A, α_A, γ_A, and κ_A are determined via a linear regression [40] calibration step (details in Section 6.2). These parameters are dependent on the system configuration. For example, a system with a large bisection bandwidth should have small β terms. The α terms may be reduced by reducing the round-trip communication latency, including the latency incurred in software libraries/drivers and in the network.

In the optimal case, Comm_S = Comm_A + Comp_A, so that there is a perfect overlap of the asynchronous and synchronous components. Defining S_T = S_S + S_A to be the total number of non-local-input stripes processed by a particular node and rearranging this equation gives the following:

    S_S (β_S K W + α_S) = β_A K L_A + α_A S_A + γ_A K N_A + κ_A S_A
=>  (S_T - S_A)(β_S K W + α_S) = K (β_A L_A + γ_A N_A) + S_A (α_A + κ_A)
=>  S_T (β_S K W + α_S) = K (β_A L_A + γ_A N_A) + S_A (α_A + κ_A + β_S K W + α_S)    (1)

We now classify the stripes requiring communication as either synchronous or asynchronous. Initially, we assume that all stripes are synchronous, which makes the right-hand side of Equation 1 equal to 0. Then, we take each stripe i and consider classifying it as asynchronous, instead. In this case, the stripe's contribution (call it z_i) to the right-hand side of Equation 1 would be given by:

z_i = v_i + u,  where v_i = K (β_A l_i + γ_A n_i)  and  u = α_A + κ_A + β_S K W + α_S,

where stripe i requires l_i dense rows from matrix B and contains n_i nonzeros. Note that u depends only on the stripe width (W) and other constants, and is therefore constant for all the stripes in the matrix.

To identify the most beneficial stripes to classify as asynchronous, we look for stripes with low values of z_i. This is because a low z_i implies that, if the stripe is classified as asynchronous, it requires few dense rows to be transferred from a remote node and contains few nonzeros. Therefore, its communication and computation costs are relatively low. On the other hand, if the stripe is classified as synchronous, it has a constant communication cost.

Consequently, we sort all of this node's stripes by their z_i in ascending order. Then, we take one stripe at a time, in order, and classify it as asynchronous, until we have taken the first r stripes, where r is the greatest number that satisfies

S_T (β_S K W + α_S) ≥ Σ_{i=0}^{r-1} z_i.

The rest of the stripes are classified as synchronous. With this method, the two sides of Equation 1 are approximately equal, which is a necessary condition to attain optimal execution time. Indeed, following this method, the total runtimes of the synchronous and asynchronous stripes in Two-Face should be nearly equal, assuming the simplified cost model that we use does hold. Additionally, sorting the stripes by their z_i maximizes the number of stripes classified as asynchronous and minimizes the number of synchronous stripes. Because the cost of communication for a synchronous stripe is constant for a given K and W, this strategy minimizes Comm_S and, therefore, the total cost of the operation.

There are other possible methods of classifying stripes. One such method is to analyze columns of stripes in the sparse matrix and classify a stripe as synchronous when its corresponding dense stripe is needed by many nodes and, therefore, is likely to benefit from optimized multicast operations. We leave the investigation of such methods for future work.
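The greedy classification just described can be sketched as follows (our illustration, not the released preprocessing code). Each candidate (remote-input) stripe of the node carries its l_i and n_i; stripes are sorted by z_i and taken as asynchronous while the accumulated right-hand side of Equation 1 stays within the budget S_T(β_S K W + α_S). The CostCoeffs struct repeats the one from the earlier sketch so this block stands alone.

// Greedy stripe classification sketch: sort by z_i, take the cheapest stripes as async.
#include <algorithm>
#include <cstdint>
#include <vector>

struct CostCoeffs { double beta_S, alpha_S, beta_A, alpha_A, gamma_A, kappa_A; };
struct StripeInfo { int64_t l_i; int64_t n_i; bool is_async = false; };

void classify_stripes(std::vector<StripeInfo>& stripes,   // this node's remote-input stripes (S_T of them)
                      const CostCoeffs& c, double K, double W) {
  const double u = c.alpha_A + c.kappa_A + c.beta_S * K * W + c.alpha_S;
  const double budget = double(stripes.size()) * (c.beta_S * K * W + c.alpha_S);  // S_T(beta_S*K*W + alpha_S)

  std::vector<size_t> order(stripes.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = i;
  auto z = [&](size_t i) {                                 // z_i = K(beta_A*l_i + gamma_A*n_i) + u
    return K * (c.beta_A * double(stripes[i].l_i) + c.gamma_A * double(stripes[i].n_i)) + u;
  };
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) { return z(a) < z(b); });

  double running = 0.0;
  for (size_t i : order) {                                 // take stripes while the sum fits the budget
    if (running + z(i) > budget) break;
    running += z(i);
    stripes[i].is_async = true;                            // the remaining stripes stay synchronous
  }
}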

5 The Two-Face Algorithm

In this section, we provide greater details about Two-Face. We discuss the sparse matrix representation, the Two-Face algorithm, its tuning and portability, and its applicability to GNN training.

5.1 Sparse Matrix Representation

Two-Face represents the sparse matrix A in a modified COO format. The nonzeros in asynchronous stripes are extracted from A and are stored in an Asynchronous sparse matrix; the nonzeros in synchronous/local-input stripes are extracted into a Synchronous/Local-Input sparse matrix. Because we use a compressed sparse matrix representation, this format does not significantly increase overall memory use.

Figure 6 shows: (a) an example of an input sparse matrix A, (b) its corresponding synchronous/local-input sparse matrix, and (c) its corresponding asynchronous sparse matrix. The figure assumes that there are four nodes and one sparse stripe for each 2x2 megatile. Assume that, after running the preprocessing step, the stripes have been classified as local-input, synchronous, and asynchronous such that the nonzeros in A end up in the categories shown in Figure 6a.

Figure 6. Sparse matrix representation in Two-Face: (a) an input sparse matrix A, showing an example of nonzeros classified into the local-input, sync, and async categories; (b) the corresponding synchronous/local-input sparse matrix; and (c) the corresponding asynchronous sparse matrix. This figure assumes a 4-node system, a stripe width of 2, and a row panel height of 1.
(b) Sync/Local-Input: Panel Pointers: 0 1 2 4 6 7 8 12 14; Values: a c e f g h j m n o p q r s; Columns: 0 1 3 4 2 5 4 5 4 5 6 7 6 7; Rows: 0 1 2 2 3 3 4 5 6 6 6 6 7 7.
(c) Asynchronous: Stripe Pointers: 0 2 5; Values: d b k i l; Columns: 6 7 0 1 1; Rows: 1 0 5 4 5.

The corresponding synchronous/local-input sparse matrix (Figure 6b) organizes the synchronous/local-input nonzeros in a row-major order structure. The elements in this structure are divided into row panels; e.g., nonzeros e and f in Figure 6 are in one row panel. Row panels are the units of work assigned to threads computing on synchronous/local-input nonzeros. In Figure 6b, these panels are one row tall, and an array of Synchronous/Local-Input Panel Pointers points to the beginning of each panel.

The corresponding asynchronous sparse matrix (Figure 6c) organizes the asynchronous nonzeros within stripes in a column-major order structure. The order of the stripes themselves is row-major, to simplify the distribution of the asynchronous sparse matrix across nodes at runtime. An array of Asynchronous Stripe Pointers points to the beginning of each asynchronous stripe.

At runtime, each node will only store those portions of the synchronous/local-input and asynchronous sparse matrices that are relevant to its computation. In addition, for each dense stripe of B, the preprocessing step generates metadata containing a list of nodes that are destinations of the collective transfer of that stripe. At runtime, this metadata is replicated across all nodes.

5.2 Two-Face Algorithm Description

The algorithm consists of three parts: top-level algorithm, processing synchronous row panels, and processing asynchronous stripes. We describe each part in turn.

5.2.1 Top-Level Algorithm. Algorithm 1 shows the top-level operation of the Two-Face algorithm. All nodes of the distributed system execute Algorithm 1 in parallel. First, the node initializes a flag and two atomic queues for work-sharing (Lines 2-3). These queues provide indices of asynchronous stripes (async_q) and indices of row panels (sync_q). In the example of Figure 6, the sync_q of N2 is {4, 5}, which are the indices of the two pointers in the Sync/Local-Input Panel Pointers array used by N2 (pointing to rows containing nonzeros j and m). The async_q of N2 is {1}, which is the index of the pointer in the Asynchronous Stripe Pointers array that points to the asynchronous stripe assigned to N2.

Algorithm 1 Top-Level Two-Face Pseudo-code
 1: procedure DistSPMM(A, B, C)
 2:   sync_transfer_done ← False
 3:   async_q, sync_q ← InitQueues(A)
 4:   DoParallel
 5:     if tid = 0 then                          ⊲ Sync Transfers
 6:       TransferDenseStripes(A, B)
 7:       sync_transfer_done ← True
 8:     end if
 9:     if tid ∈ AsyncThreads then               ⊲ Async Processing
10:       while async_q.nonempty() do
11:         n ← async_q.pop()
12:         ProcessAsyncStripe(A, B, C, n)
13:       end while
14:     end if
15:     WaitForFlag(sync_transfer_done)
16:     while sync_q.nonempty() do               ⊲ Sync Compute
17:       n ← sync_q.pop()
18:       ProcessSyncRowPanel(A, B, C, n)
19:     end while
20:   EndParallel
21: end procedure

Then, the code starting at Line 4 is executed by all the threads of the node in parallel. Specifically, Thread 0 initiates the transmission/reception of the dense stripes needed for synchronous operations (Lines 5-8). Data transmission is implemented as non-blocking, but data reception is blocking. These transfers are done via a series of calls to MPI_Bcast. The destination nodes are determined via metadata produced by the preprocessing step, and these transfers occur at the stripe granularity.
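A minimal sketch of what these per-stripe transfers could look like is shown below. It assumes a hypothetical DenseStripeMeta record produced by preprocessing (the broadcast root plus a communicator over the stripe's destination nodes) and uses blocking MPI_Bcast for simplicity; the actual implementation uses non-blocking transmission (MPI_Ibcast, per Table 4).

// Sketch of the synchronous dense-stripe transfers (Lines 5-8 of Algorithm 1).
#include <mpi.h>
#include <vector>

struct DenseStripeMeta {            // hypothetical metadata from preprocessing
  MPI_Comm group;                   // owner plus all destination nodes of this dense stripe
  int root;                         // rank of the stripe's owner within 'group'
  double* buffer;                   // points into B (on the owner) or into a receive buffer
  int elems;                        // W * K elements per dense stripe
};

void transfer_dense_stripes(std::vector<DenseStripeMeta>& stripes) {
  for (DenseStripeMeta& s : stripes) {
    // Every participant calls the same collective; the owner supplies the data.
    MPI_Bcast(s.buffer, s.elems, MPI_DOUBLE, s.root, s.group);
  }
}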

In parallel, all the threads assigned to asynchronous stripes begin processing those stripes (Lines 9-14). Once all needed dense input data from collectives has been received (Line 15), all threads (including the asynchronous ones after they have processed the asynchronous stripes) process the synchronous row panels (Lines 16-19).

5.2.2 Processing Synchronous Row Panels. Algorithm 2 describes the processing of a row panel. The operation starts by initializing a thread-local Accumulation Buffer to zero (acc in Line 2) and reading the row panel (panel in Line 3). Then, the algorithm iterates through all of the nonzeros in the row panel, accumulating each result onto acc (Line 10). When we either complete a row of nonzeros (Line 7) or complete the whole row panel (Line 13), we add acc to the corresponding row of C. Atomics are required in this operation because some threads operating on asynchronous stripes may also be writing to the same rows of C.

Algorithm 2 Two-Face Sync Compute Pseudo-code
 1: procedure ProcessSyncRowPanel(A, B, C, n)
 2:   acc ← {0, ..., 0}                          ⊲ Output row buffer
 3:   panel ← A.panel_ptrs[n]
 4:   prev_row ← panel[0].row                    ⊲ Initialize to first row
 5:   for nz ∈ panel do
 6:     if nz.row ≠ prev_row then
 7:       AtomicAdd(C[prev_row], acc)
 8:       acc ← {0, ..., 0}
 9:     end if
10:     acc ← acc + nz.val ∗ B[nz.col]
11:     prev_row ← nz.row
12:   end for
13:   AtomicAdd(C[prev_row], acc)
14: end procedure

5.2.3 Processing Asynchronous Stripes. Algorithm 3 shows the algorithm to process an asynchronous stripe. A thread reads the asynchronous stripe (stripe in Line 2) and iterates over the nonzeros in the stripe to identify the unique c_ids of the nonzeros (Line 3). These determine the indices of the dense rows from B that are required. The asynchronous thread then initiates the remote access of the dense rows by calling GetRemoteRows (Line 4). This procedure uses MPI_Rget and a custom MPI datatype defined with MPI_Type_indexed to select only the rows of interest for the transfer. Once the dense rows arrive, they are stored in drows (Line 4), and multiple threads begin computing on them. Each thread processes a subset of the nonzeros in the sparse stripe. Each nonzero is multiplied with the corresponding row of drows and accumulated into C (Line 6). Atomics are required for correct accumulation into C, just as in the synchronous stripe case. However, since asynchronous nonzeros are stored in column-major order, we cannot easily use thread-local buffers to reduce the number of atomics.

Algorithm 3 Two-Face Async Pseudo-code
 1: procedure ProcessAsyncStripe(A, B, C, n)
 2:   stripe ← A.async_stripe_ptrs[n]
 3:   drow_ids ← stripe.UniqueColIDs()
 4:   drows ← GetRemoteRows(drow_ids)
 5:   for nz ∈ stripe do in parallel
 6:     AtomicAdd(C[nz.row], nz.val ∗ drows[nz.col])
 7:   end for
 8: end procedure

To reduce transfer overheads, inside the GetRemoteRows routine, we coalesce the transfer of nearby rows of B. For example, if a sparse stripe requires B rows {2, 3, 6, 8}, we transfer three groups of rows, with (offset, size) pairs equal to {(2, 2), (6, 1), (8, 1)}. This optimization reduces software overheads. For small K, we also coalesce rows separated by unused rows, potentially reducing the software overhead further, but transferring some useless data. Using the example from before, we might transfer groups of rows {(2, 2), (6, 3)}, retrieving one unnecessary row (row 7).
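The coalescing idea can be sketched as follows (our illustration): given the sorted unique B-row ids needed by a stripe and the maximum number of unused rows that may be bridged, the ids are folded into (offset, size) groups that could then feed a datatype such as MPI_Type_indexed for a single one-sided transfer.

// Coalesce sorted, unique B-row ids into (offset, size) groups, optionally bridging
// gaps of up to max_gap unused rows (sketch of the idea behind GetRemoteRows).
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, int64_t>>            // (first row, number of rows)
coalesce_rows(const std::vector<int64_t>& row_ids, int64_t max_gap) {
  std::vector<std::pair<int64_t, int64_t>> groups;
  for (int64_t r : row_ids) {
    if (!groups.empty() &&
        r - (groups.back().first + groups.back().second) <= max_gap) {
      // Extend the current group up to and including row r (may pull in unused rows).
      groups.back().second = r - groups.back().first + 1;
    } else {
      groups.push_back({r, 1});                     // start a new group
    }
  }
  return groups;
}
// Example: {2,3,6,8} with max_gap = 0 -> {(2,2), (6,1), (8,1)};
//          {2,3,6,8} with max_gap = 1 -> {(2,2), (6,3)}, fetching row 7 unnecessarily.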

5.3 Tuning Knobs and Portability

Two-Face has several parameters that may need to be calibrated for each individual system to achieve maximal performance. Among these are the coefficients used in the preprocessing cost model (β_S, α_S, β_A, α_A, κ_A, γ_A). As mentioned before, in our evaluation, we determine the values of these coefficients via linear regression on a small number of workloads. These parameters only need to be calibrated once for a system, possibly at installation time.

In addition to the preprocessing cost model coefficients, the runtime algorithm is parameterized by the number of threads assigned to sync/async stripe processing, the aggressiveness of row coalescing in async stripe transfers, the height of the row panels used for computation in the sync stripes, and the width of the stripes. The optimal choice for these parameters may vary between systems and workloads, but we show in Sections 6.2 and 7 that choosing reasonable, static values can provide good performance. In practice, these parameters could be determined at installation time similarly to the preprocessing coefficients.

Thus, although Two-Face relies on knowledge of system characteristics to make decisions about how to schedule the work, porting to a new system just requires a one-time profiling step during installation.

5.4 Applicability to GNN Training

While SpMM is used in a variety of domains, one of the most important ones is GNN training. GNN training is often done in relatively small-scale systems, where the amount of memory is a limitation. To alleviate this problem, GNN training algorithms have recently resorted to the use of sampling [25] and mini-batching. While these techniques reduce the memory footprint requirements, they may introduce some inaccuracy [25, 33], and may introduce runtime overhead which can sometimes increase the end-to-end execution time [23]. In Two-Face, like in most of the prior work in distributed GNN training [33, 53], we are less concerned about the lack of memory because we can use the full aggregate memory of a very large cluster [53]. As a result, we do not consider sampling or mini-batching and, instead, support full-graph GNN training.

It is interesting, however, to consider the application of Two-Face to an environment with sampling or mini-batching. In principle, in its current form, Two-Face is incompatible with sampling or mini-batching. This is because, in sampling or mini-batching algorithms, different iterations of the SpMM computation use a different reduced (or sampled) matrix. As a result, Two-Face would have to re-run the preprocessing step every time the reduced matrix changes.

Future work may involve adapting Two-Face to apply to GNNs with sampling. One possible approach may involve making preprocessing decisions offline once, based on the expected stripes' densities, given knowledge of the sampling to be done at runtime. Then, stripes that are expected to be dense enough even after sampling would still be classified as synchronous, and the other stripes would be classified as asynchronous. At runtime, the graph would still be stored as shown in Figure 6, but with the addition of masks to filter nonzeros eliminated by the sampling at each iteration.

In current full-graph GNN training [35, 54], the preprocessing cost can be easily amortized. We will quantify the exact cost of the preprocessing step in Section 7.3. In GNN training, the same sparse matrix is used for hundreds or even thousands of SpMM iterations. Additionally, in many GNN applications, the same graph is used for both training and inference. This is the assumption in most GNNs for semi-supervised node classification applications [35, 54]. Since the sparse matrix does not change, the preprocessing done in training can be reused for inference. For these reasons, the overhead of Two-Face preprocessing in full-graph GNN training is negligible.

6 Methodology

Here, we describe our evaluation methodology. We begin with details about the hardware configuration and software libraries used to evaluate Two-Face, as well as the sparse matrices used as benchmarks. Then, we discuss how we determined the values of various parameters. Finally, we describe the other algorithms which we use as baselines to compare to Two-Face.

6.1 Overview

We evaluate Two-Face and multiple baseline algorithms using large matrices on Delta [2], a supercomputer at the National Center for Supercomputing Applications (NCSA). We use up to 64 CPU nodes, with a default of 32 CPU nodes. Each Delta node is a dual-socket system with two 64-core AMD EPYC 7763 processor chips running at 2.45 GHz and a total of 256 GiB of DRAM. The nodes are connected through a Cray Slingshot interconnect [16].

We build on the code published by Bharadwaj et al. [8], adapting it as necessary to support our algorithms and larger matrices. We use hybrid OpenMP / MPI programming, with one MPI rank and 128 OpenMP threads per node. We use OpenMP 4.5 [43] and Open MPI 4.1.2 [19, 39] with UCX 1.11.2 [49]. All of the baseline algorithms use the Intel Math Kernel Library [14] (version 2022.0.2) for local SpMM computations. These baselines also rely on CombBLAS [6] for I/O. Our implementation of Two-Face handles I/O by way of custom data loaders for our preprocessed sparse matrix format. All algorithms used in these experiments make use of Eigen [46] for handling dense matrices locally.

We use eight large sparse matrices from SuiteSparse [15], described in Table 1. These matrices are derived from a variety of domains, including internet traffic, social networks, web crawls, and scientific applications.

Table 1. Matrices used in the evaluation. All matrices are among the largest in SuiteSparse [15] and are square. Stripe widths are chosen to scale with the number of columns.

Matrix Name (Long)    Short       # Rows (Mill)   # Nonzeros (Mill)   Stripe Width
mawi_201512020030     mawi        68.86           143.41              128K
Queen_4147            queen       4.15            316.55              8K
stokes                stokes      11.45           349.32              32K
kmer_V1r              kmer        214.01          465.41              512K
arabic-2005           arabic      22.74           640.00              64K
twitter7              twitter     41.65           1,468.37            128K
GAP-web               web         50.64           1,930.29            128K
com-Friendster        friendster  65.61           3,612.13            128K

Our evaluation in Section 7 supports the claim that distributed SpMM is typically a communication-bound workload. Thus, we expect that extending Two-Face to other computing hardware would provide similar results. For example, using GPUs in the nodes may accelerate the local computation, but communication will remain a bottleneck. Thus, we expect that Two-Face will still see speedups if used with GPUs. Here, we evaluate a CPU implementation.

6.2 Two-Face Parameterization

Two-Face is a parameterizable algorithm. To determine its parameters for our system, we analyzed several combinations of parameters on a small set of workloads.
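As context for the linear-regression calibration discussed in this subsection, fitting a pair of cost-model coefficients boils down to an ordinary least-squares solve over the sample runs. The sketch below (ours, not the paper's code) fits β_A and α_A from measured asynchronous communication times using Eigen, which the evaluation already links for dense matrices; the other coefficients would be fit analogously from runs that isolate the corresponding terms.

// Sketch of the calibration step: fit Comm_A = beta_A*(K*L_A) + alpha_A*S_A by least squares.
#include <Eigen/Dense>

Eigen::Vector2d fit_async_comm(const Eigen::VectorXd& KLA,   // K*L_A for each sample run
                               const Eigen::VectorXd& SA,    // S_A for each sample run
                               const Eigen::VectorXd& time)  // measured Comm_A for each run
{
  Eigen::MatrixXd X(KLA.size(), 2);
  X.col(0) = KLA;                        // multiplies beta_A
  X.col(1) = SA;                         // multiplies alpha_A
  // Solve min ||X*coef - time||; coef = (beta_A, alpha_A).
  return X.colPivHouseholderQr().solve(time);
}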

To determine the appropriate width of stripes, we analyzed the performance of SpMM using the queen, arabic, and twitter matrices with various choices for W. There was increasing overhead in both the preprocessing and runtime steps as the number of stripes grew, suggesting that the stripe width should not be made too small relative to the size of the matrix. We decided to scale the stripe width proportionally to the dimensions of the matrices, rounding to the nearest power of two. Table 1 shows the stripe widths we chose.

All run-time parameters other than the stripe width are held constant across matrices. Table 2 shows these parameters. Each node runs 128 OpenMP threads. Since a large number of one-sided transfers results in high resource contention, we limit the number of threads communicating asynchronous data to 2 per node. We allow each of these threads to fork up to four ways (for a total of 8 threads) when computing on the asynchronous stripes. We dedicate the remaining 120 threads in the node to computation on the synchronous and local-input stripes. We define the maximum row coalescing distance for asynchronous transfers to be proportional to 1/K, since the cost of transferring unnecessary dense rows grows with K.

Table 2. Constant runtime parameters used in Two-Face.

Parameter Name                                        Value
Async Communication Threads per Node                  2
Async Computation Threads per Node                    8
Sync/Local-Input Computation Threads per Node         120
Max Async Coalescing Distance                         (127/K) + 1
Row Panel Height of Sync/Local-Input Sparse Matrix    32 rows

To determine the values of the preprocessing parameters used in stripe classification (Section 4.2), we employ linear regression [40]. We collect data by processing the twitter matrix [36] using K = 32, p = 32, and nine different combinations of stripe widths and asynchronous/synchronous stripe classifications. The number of samples is kept small to ensure that it is reasonable to calibrate these coefficients when installing Two-Face on a new system. The derived coefficient values, which we use when preprocessing all matrices in our evaluation (unless otherwise specified), are shown in Table 3.

Table 3. Coefficient values used in the preprocessing of matrices. The β parameters relate to the system bandwidth, the α parameters relate to other communication overheads, and the γ & κ terms relate to computational throughput and other overheads.

Coefficient   Experimental Value
β_S           1.95 × 10^-10
α_S           1.36 × 10^-6
β_A           3.61 × 10^-9
α_A           1.02 × 10^-5
γ_A           2.07 × 10^-8
κ_A           8.72 × 10^-9

These coefficients provide some insight into the performance difference between one-sided asynchronous and collective synchronous communication. For example, they suggest that asynchronous transfers are more expensive per transferred element of B than synchronous transfers by a factor of β_A/β_S ≈ 18.5.

In Section 7.4 of our evaluation, we evaluate the impact of different values of these coefficients.

6.3 Algorithms Evaluated

In our evaluation, we compare Two-Face to the other algorithms shown in Table 4. All the algorithms use 1D partitioning. We divide the dense input matrix B into as many equally-sized portions as the number of nodes p, and call each portion a "block". The B matrix is distributed across all nodes, where each node stores a single block.

Table 4. SpMM algorithms being compared.

Algorithm Name          MPI Transfer Operations
Dense Shifting [8]      MPI_Allgather, MPI_Sendrecv
Allgather               MPI_Allgather
Async Coarse-Grained    MPI_Get
Two-Face                MPI_Rget, MPI_Ibcast
Async Fine-Grained      MPI_Rget

Dense Shifting (DS) is a synchronous SpMM algorithm that has been investigated by Bharadwaj et al. [8] and found to be highly competitive compared to other state-of-the-art implementations. We use it as our main baseline. DS begins by using MPI_Allgather to replicate a certain number of blocks in each node, as determined by a replication factor c. It then continues by shifting the replicated blocks cyclically via MPI_Sendrecv after each computation step. For instance, with c = 4, this algorithm replicates each block such that each node holds four blocks at a time. It then performs p/c computation and shifting steps to complete the SpMM operation. In our experiments, we evaluate this algorithm for c = 2, c = 4, and c = 8, and refer to these settings as DS2, DS4, and DS8, respectively.

The next two algorithms replicate all or nearly all of the matrix B before beginning the computation. In Allgather, each node uses MPI_Allgather to broadcast its block of B to all others and receive theirs in turn. In Asynchronous Coarse-Grained, each node uses MPI_Get to obtain the blocks that it needs for its computation. In both cases, substantial memory has to be allocated, creating issues as the problem size scales.
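Returning to the dense shifting baseline described above, a rough sketch of its communication structure is shown below. This is our simplification: the initial MPI_Allgather replication over groups of c nodes is omitted, the whole held set of blocks is shifted each step, and the local computation is abstracted behind a callback.

// Simplified sketch of the dense shifting baseline: alternate local SpMM on the
// blocks currently held with a cyclic shift of those blocks via MPI_Sendrecv.
#include <mpi.h>
#include <algorithm>
#include <vector>

void dense_shifting(std::vector<double>& B_blocks,    // the c blocks currently held
                    int elems_per_step,               // elements shifted each step
                    int p, int c, int rank, MPI_Comm comm,
                    void (*local_spmm_on_held_blocks)(const std::vector<double>&)) {
  const int dst = (rank + 1) % p;                     // cyclic neighbors
  const int src = (rank - 1 + p) % p;
  std::vector<double> recv_buf(elems_per_step);
  for (int step = 0; step < p / c; ++step) {
    local_spmm_on_held_blocks(B_blocks);              // compute on what we have
    MPI_Sendrecv(B_blocks.data(), elems_per_step, MPI_DOUBLE, dst, 0,
                 recv_buf.data(), elems_per_step, MPI_DOUBLE, src, 0,
                 comm, MPI_STATUS_IGNORE);
    std::copy(recv_buf.begin(), recv_buf.end(), B_blocks.begin());  // rotate blocks in
  }
}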

Figure 7. Speedups of various SpMM algorithms over DS2 for K = 32.

Figure 8. Speedups of various SpMM algorithms over DS2 for K = 128.

Figure 9. Speedups of various SpMM algorithms over DS2 for K = 512.

Two-Face is the algorithm we propose. We use the parameters as described before. However, if the preprocessing algorithm determines that the chosen sync/async classification of stripes would result in too much memory consumption in one or more nodes during SpMM execution, it will classify additional stripes as async until the expected memory consumption in those nodes is feasible.

Asynchronous Fine-Grained is implemented in the same way as Two-Face, except that all stripes are asynchronous. This algorithm is used as an extreme example to illustrate the tradeoffs made by a balanced Two-Face implementation. This baseline was used in Section 3.

All algorithms are evaluated by averaging out the time of 5 consecutive SpMM operations. By default, our experiments use p = 32 and K = 128. Some experiments use K = 32 or K = 256, and others use p = 1, 2, 4, 8, 16, 32, or 64.

7 Evaluation

In this section, we evaluate Two-Face. First, we compare the performance of Two-Face to the various baselines and discuss any bottlenecks observed. Next, we discuss the scaling behavior of Two-Face as we vary the number of nodes in the system. Finally, we analyze the preprocessing cost of Two-Face and the sensitivity of Two-Face to the choice of the preprocessing parameters.

7.1 Comparing Two-Face to Various Baselines

Figures 7, 8, and 9 show the speedups of Two-Face and the other SpMM algorithms over DS2 for K = 32, K = 128, and K = 512, respectively. We normalize to DS2 because, unlike DS4 or DS8, DS2 does not run out of memory for any matrices or value of K in our evaluation. From the figures, we see that, on average, across matrices and K values, Two-Face is the fastest algorithm, and delivers substantial speedups.

As K increases, the advantage of Two-Face over the dense shifting algorithms becomes more prominent. This is because the cost of transferring unnecessary rows in the dense shifting algorithms increases with K, providing a greater advantage to the fine-grained one-sided accesses of Two-Face. At K = 32, Two-Face's average speedup over the dense shifting algorithm with the best choice of replication factor for each individual matrix is 1.53x. At K = 128, the same speedup is 2.11x, and at K = 512, it is 2.35x. The average speedup across all values of K shown here is 1.99x.

The Async Fine and dense shifting algorithms are on average faster than the Async Coarse and Allgather algorithms. Dense shifting is sometimes unable to run with higher replication factors due to memory constraints. For example, for K = 512, DS8 fails to run for half of the matrices, and DS4 fails in one matrix.

Table 5. Absolute execution times of DS2 and Two-Face for the experiments in Figures 7, 8, and 9. The numbers are the average of five SpMM operations.

                             web     queen   stokes  arabic  mawi    kmer     twitter  friendster
K=32   DS2 (seconds)         1.97    0.28    0.96    1.32    6.46    6.33     4.17     8.79
       Two-Face (seconds)    0.56    0.08    0.23    0.26    8.50    6.70     5.71     8.41
K=128  DS2 (seconds)         7.198   0.86    2.22    3.85    19.78   35.77    11.57    20.61
       Two-Face (seconds)    1.10    0.19    0.94    0.65    15.18   14.98    20.24    30.08
K=512  DS2 (seconds)         38.86   2.89    9.34    21.46   97.95   136.21   52.77    83.02
       Two-Face (seconds)    4.46    0.634   3.552   2.74    55.40   62.77    86.62    117.31
Figure 10. Breakdown of the total execution times of DS4 and Two-Face for K = 128. Two-Face's time is divided into synchronous and asynchronous components (left and right bars, respectively), which operate in parallel. These are further broken down into computation (Comp) and communication (Comm). DS4 only has a Sync component. The Other category mainly consists of the initial setup of data structures for MPI. Execution times are normalized to DS4.

As a reference, Table 5 provides the absolute execution times of Two-Face and DS2 in these figures.

The figures also show that the speedups (or slowdowns) are highly dependent on the matrix. For example, Two-Face is not the fastest algorithm for twitter and friendster and, for 𝐾 = 32, additionally for mawi and kmer. To understand this behavior, Figure 10 breaks down the total execution time of DS4 and Two-Face for each matrix for 𝐾 = 128. For Two-Face, we break down the execution time into Sync Comp, Sync Comm, Async Comp, and Async Comm. We stack the Sync components in the left bar and the Async components in the right bar, and show the two bars side by side, since the execution time equals the higher of the two. Two-Face also has some Other overheads, which mainly consist of initializing the necessary MPI structures before the main communication and computation begin. For DS4, only Sync Comp and Sync Comm are relevant. For each matrix, the bars are normalized to DS4.

We see that the dominant contributor to DS4's execution time is its communication. Two-Face is able to attain significant speedups over DS4 by reducing the amount of communication through fine-grained accesses. In five of the matrices, the sum of the communication time spent by Two-Face in Sync Comm and Async Comm is significantly less than the time spent by DS4 in its communication.

Two exceptions are twitter and friendster. In these matrices, Two-Face's Sync Comp and Sync Comm have both increased over DS4, despite the fact that less data is being transferred. We note that Two-Face's synchronous broadcast operations are significantly slower than the cyclic shifting operations in DS4 when a large portion of the input dense matrix is required by many nodes. When Two-Face operates on a matrix like friendster, each node participates in many more MPI calls than it does under dense shifting, due to the finer granularity of the transfers.

An interesting case is mawi, where Two-Face is unable to reduce the execution time over DS4 because of the cost of asynchronous computation. The mawi sparse matrix has regions with a relatively high density of nonzeros. Computing on such asynchronous stripes is likely expensive due to the heavy use of atomics, as the nonzeros are organized in column-major order. During this work, we conducted initial tests of storing the nonzeros in row-major order instead. However, this change did not result in faster execution, as the cost of identifying which columns contained nonzeros (and therefore which dense rows were required) became drastically higher.
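To make the atomics cost concrete, the following sketch (for illustration only; it is not the Two-Face kernel, and all names are placeholders) shows an SpMM partial product over a stripe whose nonzeros are stored in column-major (CSC-like) order. Parallelizing over sparse columns means that several threads may accumulate into the same output row of C, so every update must be atomic; a row-major (CSR-like) layout would avoid the atomics but, as noted above, makes it more expensive to identify which dense rows are needed.

#include <omp.h>
#include <vector>

// Sketch: C += A * B for one stripe, with the sparse stripe A given in a
// CSC-like layout (col_ptr, row_idx, val). B and C are dense, row-major,
// with K columns each.
void spmm_csc_stripe(int ncols, int K,
                     const std::vector<int>& col_ptr,    // size ncols + 1
                     const std::vector<int>& row_idx,    // row id of each nonzero
                     const std::vector<double>& val,     // value of each nonzero
                     const std::vector<double>& B,       // dense input rows
                     std::vector<double>& C) {           // dense output rows
  #pragma omp parallel for
  for (int j = 0; j < ncols; ++j) {                      // one sparse column per iteration
    for (int p = col_ptr[j]; p < col_ptr[j + 1]; ++p) {
      const int r = row_idx[p];                          // output row touched by this nonzero
      const double v = val[p];
      for (int k = 0; k < K; ++k) {
        #pragma omp atomic                               // different columns can update the same row r
        C[r * K + k] += v * B[j * K + k];
      }
    }
  }
}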

(Figure 11 plots: one panel per matrix (web, queen, stokes, arabic, mawi, kmer, twitter, friendster); y-axis: Execution Time (s); x-axis: Node Count, from 1 to 64; series: Two-Face, DS1, DS2, DS4, DS8.)

Figure 11. Execution time of Two-Face and the dense shifting algorithm with different replication factors (DS1, DS2, DS4,
and DS8) as the number of nodes changes. The 𝐾 value is 128. Some data points for the dense shifting algorithm are missing
because they need too much memory or take too long to execute. Both axes in the plots use a logarithmic scale.

7.2 Two-Face Strong Scaling
Figure 11 shows the execution times of Two-Face and the dense shifting algorithm with different replication factors (DS1, DS2, DS4, and DS8) as we scale the number of nodes from 1 to 64. There is a plot for each matrix and both axes are in logarithmic scale. Some data points are missing, since some workloads either exceed the memory capacity of one or more nodes (at small node counts or high replication factors) or take too long to run.

The figure shows that, in most of the matrices, Two-Face scales well with the number of nodes and, in fact, as well as or better than the dense shifting algorithm. The exceptions are mawi, twitter, and, to a lesser extent, friendster. With mawi, none of the algorithms scale particularly well due to the high load imbalance across nodes induced by the matrix. With twitter and friendster, we saw in Figure 10 that Two-Face is impacted by inefficient synchronous communication. This is the reason for the worse scaling performance.

To understand the behavior of twitter and friendster better, we profile the collectives in the 64-node runs. We measure the number of recipients of each multicast operation. On average, this number is 35.7 for twitter and 43.5 for friendster. In contrast, the matrix with the next largest average recipient count is kmer, with an average of only 5.7. It appears that the large collectives needed for Two-Face in twitter and friendster are responsible for the inefficient execution and limited scaling. This effect does not appear at low node counts, where the execution is primarily bottlenecked by local computation, but it dominates at high node counts. Future work should investigate methods to reduce the size of collectives in the algorithm or the design of more regular data movement patterns for the synchronous stripes.

Overall, the performance of Two-Face improves as we scale from 1 to 64 nodes by 7.47x on average, with a best-case speedup of 12.12x for queen and a worst-case of 0.76x for twitter. Moreover, compared to the dense shifting algorithm with the optimal replication factor, Two-Face sees an average speedup ranging from 1.25x at 4 nodes to 2.21x at 64 nodes.

7.3 Two-Face Preprocessing Cost
Two-Face requires a preprocessing step that mainly involves: (1) running our model to classify the stripes into synchronous and asynchronous, and (2) creating the asynchronous and the synchronous/local-input sparse matrices. In this section, we give an idea of the execution time of the preprocessing step. Note that we have not fully optimized it; in particular, we have not parallelized it across multiple nodes. Therefore, the numbers reported are a pessimistic bound.

Table 6. The overhead of preprocessing in Two-Face, normalized to the cost of a single SpMM operation.

Matrix        𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂    𝑡𝑛𝑜𝑟𝑚
web           428.74       102.00
queen         302.55       23.60
stokes        116.70       11.18
arabic        180.35       36.57
mawi          2.58         1.50
kmer          6.16         3.25
twitter       17.89        7.29
friendster    19.81        8.79
Average       134.35       24.27

Table 6 shows the overhead of the preprocessing step for 32 nodes and 𝐾=128 for each matrix.

(Figure 12 heatmaps, reconstructed as tables; each cell is Two-Face's execution time relative to the run with the default parameter values.)

(a) Varying αA (rows) and βA (columns):
              0.8·βA0   βA0    1.25·βA0
0.8·αA0       0.91      1.10   1.31
αA0           1.03      1.00   1.52
1.25·αA0      0.94      1.07   1.38

(b) Varying αS (rows) and βS (columns):
              0.8·βS0   βS0    1.25·βS0
0.8·αS0       1.20      1.03   1.03
αS0           1.30      1.00   1.25
1.25·αS0      1.46      1.04   1.17

(c) Varying γA (rows) and κA (columns):
              0.8·κA0   κA0    1.25·κA0
0.8·γA0       1.01      1.01   1.13
γA0           1.15      1.00   1.24
1.25·γA0      1.25      1.48   1.19

Figure 12. Sensitivity of Two-Face’s execution time to the values of the parameters of the execution model used during the
preprocessing step. The default values of these parameters, as set in Section 6.2 and used in all the earlier experiments, are
represented with the 0 subscript (e.g., 𝛼𝐴0 ).

Column 2 shows 𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂, which is the time of the preprocessing step normalized to the time of one SpMM operation. On average, 𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂 is 134.35. However, the preprocessing step is dominated by I/O time, as the original sparse matrix is read from the file system in a textual Matrix Market format [9] and the final asynchronous and synchronous/local-input sparse matrices are written to the file system in a bespoke binary format.

Since in many realistic environments this I/O will not be present, Column 3 shows the more relevant 𝑡𝑛𝑜𝑟𝑚, which is the preprocessing overhead without I/O, normalized to the time of one SpMM operation. In this case, the numbers are substantially lower. We see that 𝑡𝑛𝑜𝑟𝑚 ranges from 1.50 to 102.00, with an average of 24.27. For 𝐾 = 512 (not shown in the table), the average value of 𝑡𝑛𝑜𝑟𝑚 is 6.15.

From these numbers, we see that the cost of the preprocessing step can be easily amortized. For the matrices where Two-Face demonstrates a speedup over dense shifting when 𝐾 = 128, an average of only 15 SpMM operations need to be performed by Two-Face to already see a speedup when including the preprocessing time. For 𝐾 = 512, this decreases to only 3 SpMM operations, on average. In contexts such as GNN training, with hundreds of epochs, we can expect to perform many more SpMM operations with the same matrices than these numbers. In addition, the preprocessing step from training may be reusable during inference.

7.4 Sensitivity to Parameter Values of the Preprocessing Model
The model of execution that we use during preprocessing (Section 4.2) uses parameters 𝛼𝐴, 𝛽𝐴, 𝛼𝑆, 𝛽𝑆, 𝛾𝐴, and 𝜅𝐴. In Section 6.2, we used linear regression to set their default values. We used those values in all the experiments so far.

In this section, we change the values of these parameters, repeat the experiments, and measure the changes in Two-Face's execution time. We perform three sets of changes. In the first one, we vary 𝛼𝐴 and 𝛽𝐴, keeping the other parameters unchanged. Specifically, if 𝛼𝐴0 and 𝛽𝐴0 are the default values of 𝛼𝐴 and 𝛽𝐴, we consider all combinations of {0.8 · 𝛼𝐴0, 𝛼𝐴0, 1.25 · 𝛼𝐴0} × {0.8 · 𝛽𝐴0, 𝛽𝐴0, 1.25 · 𝛽𝐴0}. In the second set of changes, we vary 𝛼𝑆 and 𝛽𝑆 in the same way, keeping the other parameters unchanged. Finally, we vary 𝛾𝐴 and 𝜅𝐴, again keeping the others unchanged.

Figure 12 shows the outcome of the three sets of changes for the average of three representative matrices: web (Two-Face's best case), twitter (Two-Face's worst case), and stokes (Two-Face's median case). For example, Figure 12a corresponds to the experiments varying 𝛼𝐴 and 𝛽𝐴. The number in each box is Two-Face's execution time with the new parameters relative to Two-Face's execution time with the default parameters. For example, if we use 0.8 · 𝛼𝐴0 and 1.25 · 𝛽𝐴0, Two-Face's execution time becomes 1.31x higher than when using the default values.

Overall, the figure shows that using the default parameters obtained with linear regression is a good choice. Changes to the parameter values typically end up increasing Two-Face's execution time. The execution time decreases in only two cases, and the decrease is small.

8 Related Work
Distributed SpMM: Existing work on distributed SpMM is rather limited, but there are recent works exploring the topic. Bharadwaj et al. [8] investigate distributed SpMM, SDDMM, and methods of fusing the two for machine learning applications. The implementations of dense shifting evaluated in our paper originate from their work. Additionally, Bharadwaj et al. [8] present a sparse shifting implementation. In our work, we did not evaluate their approach, since it partitions the dense input and output matrices in a way that requires additional all-to-all communication for GNNs or other applications that interleave SpMM with a row-wise operator. Bharadwaj et al. [8] compare their implementations to the

SpMM provided by PETSc [7]. Selvitopi et al. [48] investigate multiple algorithms for SpMM, including algorithms that use bulk-synchronous collective communication and algorithms that use one-sided asynchronous RDMA communication. They do not, however, investigate combining these communication primitives in a single algorithm.

GNN Training: Prior work has addressed the issue of large graphs in GNN training via sampling techniques [25]. However, the benefits of sampling can come at a cost to accuracy, leading prior work to investigate full-batch distributed GNN training [33, 53]. However, in Tripathy et al. [53], dense matrices used in SpMM operations are only transferred in a coarse-grained, sparsity-unaware fashion. Conversely, Jia et al. [33], using Lux [32], assume a GNN runtime that operates via pushing/pulling node embeddings in a fine-grained manner. This is distinct from Two-Face, which uses a combination of coarse- and fine-grained transfers to leverage the benefits of both approaches.

Non-Distributed Sparse Kernels: SpMM optimization has been the topic of several investigations. Works such as WACO [56], WISE [57], and DDB [58] attempt to optimize sparse computations by using machine learning techniques to predict the performance of various configurations. Many CPU and GPU tiling techniques and implementations have been published [27, 30, 37, 41]. Other sparse kernels have also been the subject of several investigations aiming to tame irregular access patterns [24, 28, 42, 52]. These optimizations for non-distributed kernels may be applicable to the distributed case, but they tend to assume a shared-memory system, and they are largely orthogonal to our work.

Recently, SpMM, SpMV, and SpGEMM kernels for heterogeneous hardware have been proposed [13, 20, 38]. Cheng et al. [13] tackle SpGEMM on asymmetric multicore processors. HotTiles [20] partitions the SpMM sparse input matrix into two types of regions and assigns each region type to a different accelerator by solving an optimization problem.

Other Distributed Sparse Kernels: Other distributed sparse kernels have recently received attention. CombBLAS [6] is a library for distributed sparse kernels such as SpGEMM and SpMV. CombBLAS provides a number of GPU SpMM implementations using different partitioning and communication patterns. All of them use sparsity-unaware collectives. In contrast, Two-Face uses a hybrid approach. Hussain et al. [31] investigate communication-avoiding algorithms for SpGEMM. DGCL [10] is a library for distributed GNN training that partitions graphs and processes GNN computations at the level of nodes in the graphs, without explicitly expressing the computation with SpMM operations.

Domain-specific Architectures and Network Support: Several architectural designs that offer hardware support for SpMM computation have been recently proposed [21, 26, 34, 44, 50]. SPADE [21] is an accelerator for SpMM and SDDMM designed to be tightly coupled with CPU cores. Tensaurus [50] and ExTensor [26] are accelerators for a variety of sparse kernels. We believe that algorithms such as Two-Face can be useful in orchestrating the communication in scaled-up multi-node versions of these accelerators or for other large-scale graph analytics architectures [1, 4, 45]. In addition, we believe that such algorithms can also be beneficial for inter-cube or inter-chip communication in PIM-based architectures for graph analytics [5, 22, 59, 60].

Finally, scheduling algorithms for collectives such as Themis [47] have been proposed to maximize the bandwidth utilization of multidimensional, heterogeneous networks. These works could inspire network hardware support for Two-Face, but one would also require innovations to support the asynchronous communication operations.

9 Conclusion
Sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, this paper presented Two-Face, an algorithm for distributed SpMM that, leveraging a preprocessing model, performs collective communications for the denser regions, and fine-grained one-sided communications for the sparser regions. Two-Face attains an average speedup of 2.11x over dense shifting when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.

Two-Face suggests that distributed sparse algorithms should be input-matrix aware, in that different sections of a sparse input matrix prefer using different communication methods. The algorithms should also be communication-oriented, since minimizing communication is a first-class concern. With simple modifications, the Two-Face algorithm should also be applicable to sparse kernels such as Sampled Dense-Dense Matrix Multiplication (SDDMM), which exhibits very similar patterns to SpMM. Likewise, with proper parameter tuning, Two-Face may also be applied to accelerate SpMV, which is a special case of SpMM. We are investigating these and other algorithms.

Acknowledgments
We thank Fredrik Kjolstad and the reviewers for helping to improve this paper. This research was funded in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; by a grant from the IBM-Illinois Discovery Accelerator Institute; by NSF grants PPoSS CCF 2316233, CNS 1956007, and CCF 2107470; by DOE grant DE-SC0022098; and by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 21-46756. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS230044 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.

References
[1] Sriram Aananthakrishnan, Shamsul Abedin, Vincent Cavé, Fabio Checconi, Kristof Du Bois, Stijn Eyerman, Joshua B. Fryman, Wim Heirman, Jason Howard, Ibrahim Hur, Samkit Jain, Marek M. Landowski, Kevin Ma, Jarrod Nelson, Robert Pawlowski, Fabrizio Petrini, Sebastian Szkoda, Sanjaya Tayal, Jesmin Jahan Tithi, and Yves Vandriessche. 2023. The Intel® Programmable and Integrated Unified Memory Architecture (PIUMA) Graph Analytics Processor. IEEE Micro (2023), 1–11. https://doi.org/10.1109/MM.2023.3295848
[2] Bruno Abreu, Galen Arnold, Gregory Bauer, Brett Bode, Craig Steffan, et al. 2024. Delta User Documentation. National Center for Supercomputing Applications. Retrieved Jan 2024 from https://docs.ncsa.illinois.edu/systems/delta/en/latest/
[3] Seher Acer, Oguz Selvitopi, and Cevdet Aykanat. 2016. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems. Parallel Comput. 59 (2016), 71–96.
[4] Matthew Joseph Adiletta, Jesmin Jahan Tithi, Emmanouil-Ioannis Farsarakis, Gerasimos Gerogiannis, Robert Adolf, Robert Benke, Sidharth Kashyap, Samuel Hsia, Kartik Lakhotia, Fabrizio Petrini, Gu-Yeon Wei, and David Brooks. 2023. Characterizing the Scalability of Graph Convolutional Networks on Intel® PIUMA. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 168–177. https://doi.org/10.1109/ISPASS57527.2023.00025
[5] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 105–117.
[6] Ariful Azad, Oguz Selvitopi, Md Taufique Hussain, John R. Gilbert, and Aydın Buluç. 2022. Combinatorial BLAS 2.0: Scaling Combinatorial Algorithms on Distributed-Memory Systems. IEEE Transactions on Parallel and Distributed Systems 33, 4 (2022), 989–1001. https://doi.org/10.1109/TPDS.2021.3094091
[7] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Steven Benson, Jed Brown, Peter Brune, Kris Buschelman, Emil M. Constantinescu, Lisandro Dalcin, Alp Dener, Victor Eijkhout, Jacob Faibussowitsch, William D. Gropp, Václav Hapla, Tobin Isaac, Pierre Jolivet, Dmitry Karpeev, Dinesh Kaushik, Matthew G. Knepley, Fande Kong, Scott Kruger, Dave A. May, Lois Curfman McInnes, Richard Tran Mills, Lawrence Mitchell, Todd Munson, Jose E. Roman, Karl Rupp, Patrick Sanan, Jason Sarich, Barry F. Smith, Stefano Zampini, Hong Zhang, Hong Zhang, and Junchao Zhang. 2023. PETSc Web page. https://petsc.org/
[8] Vivek Bharadwaj, Aydın Buluc, and James Demmel. 2022. Distributed-Memory Sparse Kernels for Machine Learning. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, 47–58. https://doi.org/10.1109/IPDPS53621.2022.00014
[9] Ronald Boisvert, Roldan Pozo, and K Remington. 1996. The Matrix Market Exchange Formats: Initial Design.
[10] Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. 2021. DGCL: An Efficient Communication Library for Distributed GNN Training. In Proceedings of the Sixteenth European Conference on Computer Systems (Online Event, United Kingdom) (EuroSys ’21). Association for Computing Machinery, 130–144. https://doi.org/10.1145/3447786.3456233
[11] John Canny and Huasha Zhao. 2013. Bidmach: Large-scale learning with zero memory allocation. In BigLearning, NIPS Workshop.
[12] Ernie Chan, Robert Van De Geijn, William Gropp, and Rajeev Thakur. 2006. Collective communication on architectures that support simultaneous communication over multiple links. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. 2–11.
[13] Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (Salt Lake City, UT, USA) (ICPP ’23). Association for Computing Machinery, 807–817. https://doi.org/10.1145/3605573.3605611
[14] Intel Corporation. 2023. Intel® oneAPI Math Kernel Library. Intel Corporation. Retrieved 2023 from https://intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
[15] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec 2011), 25 pages. https://doi.org/10.1145/2049662.2049663
[16] Hewlett Packard Enterprise. 2024. HPE Slingshot interconnect. Hewlett Packard Enterprise. Retrieved Jan 2024 from www.hpe.com/us/en/compute/hpc/slingshot-interconnect.html
[17] Ruibo Fan, Wei Wang, and Xiaowen Chu. 2023. Fast Sparse GPU Kernels for Accelerated Training of Graph Neural Networks. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 501–511. https://doi.org/10.1109/IPDPS54959.2023.00057
[18] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[19] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting. 97–104.
[20] Gerasimos Gerogiannis, Sriram Aananthakrishnan, Josep Torrellas, and Ibrahim Hur. 2024. HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE.
[21] Gerasimos Gerogiannis, Serif Yesil, Damitha Lenadora, Dingyuan Cao, Charith Mendis, and Josep Torrellas. 2023. SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA ’23). Association for Computing Machinery, Article 19, 15 pages. https://doi.org/10.1145/3579371.3589054
[22] Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards efficient sparse matrix vector multiplication on real processing-in-memory architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, 1 (2022), 1–49.
[23] Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W Fletcher, Christopher J Hughes, and Josep Torrellas. 2022. Graphite: Optimizing graph neural networks on CPUs through cooperative software-hardware techniques. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA). 916–931.
[24] Zhixiang Gu, Jose Moreira, David Edelsohn, and Ariful Azad. 2020. Bandwidth Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication Using Propagation Blocking. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (Virtual Event, USA) (SPAA ’20). Association for Computing Machinery, 293–303. https://doi.org/10.1145/3350755.3400216
[25] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., 1025–1035.
[26] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO ’52). Association for Computing Machinery, 319–333. https://doi.org/10.1145/3352460.3358275
[27] Changwan Hong, Aravind Sukumaran-Rajam, Israt Nisa, Kunal Singh, and P. Sadayappan. 2019. Adaptive Sparse Tiling for Sparse Matrix Multiplication. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (Washington, District of Columbia) (PPoPP ’19). Association for Computing Machinery, 300–314. https://doi.org/10.1145/3293883.3295712
[28] Olivia Hsu, Maxwell Strange, Ritvik Sharma, Jaeyeon Won, Kunle Olukotun, Joel S. Emer, Mark A. Horowitz, and Fredrik Kjølstad. 2023. The Sparse Abstract Machine. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 710–726. https://doi.org/10.1145/3582016.3582051
[29] Yuwei Hu, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13. https://doi.org/10.1109/SC41405.2020.00075
[30] Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. 2020. GE-SpMM: General-Purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12. https://doi.org/10.1109/SC41405.2020.00076
[31] Md Taufique Hussain, Oguz Selvitopi, Aydin Buluç, and Ariful Azad. 2021. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 90–100. https://doi.org/10.1109/IPDPS49936.2021.00018
[32] Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken. 2017. A Distributed Multi-GPU System for Fast Graph Processing. Proceedings of the VLDB Endowment 11, 3 (Nov 2017), 297–310. https://doi.org/10.14778/3157794.3157799
[33] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the accuracy, scalability, and performance of graph neural networks with Roc. Proceedings of Machine Learning and Systems 2 (2020), 187–198.
[34] Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, and Onur Mutlu. 2019. Smash: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations. In Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture. 600–614.
[35] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[36] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW ’10: Proc. the 19th Intl. Conf. on World Wide Web (Raleigh, North Carolina, USA). ACM, 591–600. https://doi.org/10.1145/1772690.1772751
[37] Shigang Li, Kazuki Osawa, and Torsten Hoefler. 2022. Efficient Quantized Sparse Matrix Operations on Tensor Cores. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15. https://doi.org/10.1109/SC41404.2022.00042
[38] Wenxuan Li, Helin Cheng, Zhengyang Lu, Yuechen Lu, and Weifeng Liu. 2023. HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors. In 2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE Computer Society, 209–220.
[39] Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard Version 4.0. https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf
[40] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to linear regression analysis. John Wiley & Sons.
[41] NVIDIA. 2024. cuSPARSE. Retrieved Jan 2024 from https://developer.nvidia.com/cusparse
[42] Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, Neal C. Crago, Aamer Jaleel, John D. Owens, Edgar Solomonik, Joel S. Emer, and Christopher W. Fletcher. 2023. Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 18–32. https://doi.org/10.1145/3582016.3582064
[43] OpenMP Architecture Review Board. 2015. OpenMP Application Program Interface Version 4.5. https://openmp.org/wp-content/uploads/openmp-4.5.pdf
[44] Marcelo Orenes-Vera, Aninda Manocha, Jonathan Balkind, Fei Gao, Juan L Aragón, David Wentzlaff, and Margaret Martonosi. 2022. Tiny but mighty: designing and realizing scalable latency tolerance for manycore SOCs. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 817–830.
[45] Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, and Margaret Martonosi. 2023. Dalorex: A data-local program execution and architecture for memory-bound applications. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 718–730.
[46] Eigen Project. 2023. Eigen v3.4. Retrieved Jan 2024 from https://eigen.tuxfamily.org
[47] Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. 2022. Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA ’22). Association for Computing Machinery, 581–596. https://doi.org/10.1145/3470496.3527382
[48] Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2021. Distributed-Memory Parallel Algorithms for Sparse Times Tall-Skinny-Dense Matrix Multiplication. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, 431–442. https://doi.org/10.1145/3447818.3461472
[49] Pavel Shamis, Manjunath Gorentla Venkata, M Graham Lopez, Matthew B Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L Graham, Liran Liss, et al. 2015. UCX: an open source framework for HPC network APIs and beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. IEEE, 40–43.
[50] Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 689–702. https://doi.org/10.1109/HPCA47549.2020.00062
[51] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.
[52] Han D. Tran, Milinda Fernando, Kumar Saurabh, Baskar Ganapathysubramanian, Robert M. Kirby, and Hari Sundar. 2022. A scalable adaptive-matrix SPMV for heterogeneous architectures. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 13–24. https://doi.org/10.1109/IPDPS53621.2022.00011
[53] Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing Communication in Graph Neural Network Training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14. https://doi.org/10.1109/SC41405.2020.00074
[54] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[55] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019).
[56] Jaeyeon Won, Charith Mendis, Joel S. Emer, and Saman Amarasinghe. 2023. WACO: Learning Workload-Aware Co-Optimization of the Format and Schedule of a Sparse Tensor Program. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 920–934. https://doi.org/10.1145/3575693.3575742
[57] Serif Yesil, Azin Heidarshenas, Adam Morrison, and Josep Torrellas. 2023. WISE: Predicting the Performance of Sparse Matrix Vector Multiplication with Machine Learning. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP ’23). Association for Computing Machinery, 329–341. https://doi.org/10.1145/3572848.3577506
[58] Serif Yesil, José E. Moreira, and Josep Torrellas. 2022. Dense Dynamic Blocks: Optimizing SpMM for Processors with Vector and Matrix Units Using Machine Learning Techniques. In Proceedings of the 36th ACM International Conference on Supercomputing (Virtual Event) (ICS ’22). Association for Computing Machinery, Article 27, 14 pages. https://doi.org/10.1145/3524059.3532369
[59] Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 544–557.
[60] Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-based graph processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 712–725.
