fers (SAT). [...] model aims at equalizing the runtimes of the two parts.

We evaluate Two-Face on a CPU-based supercomputer using large matrices and compare it to state-of-the-art baselines. For a system with 32 nodes, 128 cores per node, and dense matrices with 128 columns, Two-Face attains an average speedup of 2.11x against dense shifting [8], a high-performing baseline. In addition, Two-Face is a scalable algorithm: its average speedup over dense shifting increases to 2.21x for 64 nodes. Finally, the overhead introduced by the necessary matrix preprocessing step is small enough to make Two-Face suitable for applications that use the same sparse matrix only a few dozen times.

Overall, this paper's contributions are:
• The Two-Face algorithm for distributed SpMM, which is based on a mix of collective and one-sided communication.
• A low-overhead model and method to partition sparse matrices into regions corresponding to the two access types.
• An evaluation of Two-Face on a supercomputer and a comparison with the state-of-the-art.

2 Background
In this section, we provide a background on the memory access patterns in SpMM, the 1D partitioning method for distributing the SpMM data structures in a multi-node system, and the differences in the communication patterns of SUT and SAT.

[Figure 1 graphic: (a) the computation in SpMM, e.g., C[1,0:K-1] = C[1,0:K-1] + a x B[6,0:K-1]; (b) 1D partitioning for 4 nodes.]
Figure 1. SpMM and 1D partitioning.

2.2 1D Partitioning
When executing SpMM on a distributed system, the sparse and dense matrices should be partitioned across nodes. In this work, we use 1D partitioning, which in prior work [8] was shown to display good performance for many sparse input matrices. Figure 1b illustrates 1D partitioning for a system consisting of 4 nodes (𝑁0...𝑁3). The matrices are partitioned according to the node colors. Each node is responsible for the nonzeros in a set of consecutive rows of 𝐴. It additionally hosts a set of consecutive 𝐵 and 𝐶 rows as shown in Figure 1b. The read and write accesses to the dense output matrix 𝐶 are always local. The accesses to the dense input matrix 𝐵 are often remote, except for nonzeros with 𝑐_𝑖𝑑s that index to the local portion of 𝐵 (e.g., nonzeros with 𝑐_𝑖𝑑 equal to 0 or 1 in Node 1). Overall, with this partitioning scheme, remote data accesses occur only for 𝐵. From now on, we will use the term remote transfers to mean transferring remote elements of the 𝐵 matrix.

2.3 Sparsity-unaware and Sparsity-aware Transfers
We now discuss the communication patterns associated with sparsity-unaware (SUT) and sparsity-aware transfers (SAT).
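To make the access pattern under 1D partitioning concrete before discussing the transfer flavors, the sketch below separates a node's nonzeros by whether the 𝐵 row they read is hosted locally. It is only an illustration under assumed data structures (a COO nonzero list and contiguous ownership bounds), not code from Two-Face.

#include <cstdint>
#include <vector>

// One COO nonzero of the sparse matrix A: value 'val' at (row, c_id).
struct NonZero { int64_t row; int64_t c_id; double val; };

// Under 1D partitioning, a node hosts consecutive rows of A, B, and C.
// Rows [b_first, b_last) of the dense input B are local to this node.
struct NodePartition { int64_t b_first; int64_t b_last; };

// Separate this node's nonzeros by whether the B row they read is local.
// The remote group is exactly what Section 2.2 calls "remote transfers".
void split_by_b_locality(const std::vector<NonZero>& my_nonzeros,
                         const NodePartition& part,
                         std::vector<NonZero>& local_input,
                         std::vector<NonZero>& remote_input) {
  for (const NonZero& nz : my_nonzeros) {
    if (nz.c_id >= part.b_first && nz.c_id < part.b_last)
      local_input.push_back(nz);   // B[nz.c_id, :] already resides here
    else
      remote_input.push_back(nz);  // B[nz.c_id, :] must be transferred (SUT or SAT)
  }
}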
Figure 2. Speedup of Async Fine over a full-replication AllGather collective implementation for 𝐾 = 32 (left) and 𝐾 = 128 (right).
There is no data for kmer with 𝐾 = 128 due to the large memory consumption of the full-replication collective algorithm.
Figure 3a illustrates one possible SUT pattern. Each node sends its local rows of 𝐵 to all of the other nodes. This is a conservative approach: all local rows are transferred to all the other nodes regardless of whether the rows are actually useful to all the destination nodes. This pattern can be implemented either with collectives [48] such as AllGather [12, 51], which fully replicates the dense input, or with shifting algorithms [8] that perform the transfers iteratively in a cyclic manner. A shifting algorithm consists of a series of computation and communication steps.

[...] to synchronize for the request to be completed). For these reasons, we refer to this particular communication pattern for distributed SpMM as Async Fine. This is in contrast to the execution strategies that use SUTs, where transfers are coarse-grained and require more synchronization.

3 Motivation
We now show that, for some matrices, the SAT pattern is more suitable, while for others the SUT is better. Then, we provide an example to motivate that a combination of the two approaches for the same matrix can yield the best results.
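To make the contrast between the two patterns concrete, below is a minimal MPI sketch of the two extremes discussed above: a fully replicating SUT and a per-row SAT fetch. It is illustrative only; the window setup, buffer sizing, and a row-major layout of 𝐵 with 𝐾 doubles per row are assumptions of the sketch, not details from the paper.

#include <mpi.h>
#include <vector>

// Sparsity-unaware transfer (SUT): every node contributes its local block of B
// and receives everyone else's, whether or not those rows will be used.
void sut_full_replication(const std::vector<double>& b_local,
                          std::vector<double>& b_replicated, MPI_Comm comm) {
  int p;  MPI_Comm_size(comm, &p);
  b_replicated.resize(b_local.size() * p);
  MPI_Allgather(b_local.data(), (int)b_local.size(), MPI_DOUBLE,
                b_replicated.data(), (int)b_local.size(), MPI_DOUBLE, comm);
}

// Sparsity-aware transfer (SAT): fetch only the B rows this node needs,
// one-sided, from the node that owns them. 'win' exposes each node's local
// B block; rows_per_node is the 1D partition size.
void sat_fetch_rows(const std::vector<long>& needed_rows, int K,
                    long rows_per_node, MPI_Win win,
                    std::vector<double>& fetched) {
  fetched.resize(needed_rows.size() * K);
  MPI_Win_lock_all(0, win);                        // passive-target epoch
  for (size_t i = 0; i < needed_rows.size(); ++i) {
    long r = needed_rows[i];
    int owner = (int)(r / rows_per_node);          // node holding row r of B
    MPI_Aint disp = (MPI_Aint)(r % rows_per_node) * K;
    MPI_Get(&fetched[i * K], K, MPI_DOUBLE, owner, disp, K, MPI_DOUBLE, win);
  }
  MPI_Win_unlock_all(win);                         // completes all gets
}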
[...] local to the node with the nonzeros and, therefore, no remote transfers are needed; (2) sync nonzeros are those for which the dense input rows needed are better off transferred through SUTs; and (3) async nonzeros are those for which it is more beneficial to transfer the rows through SAT.

[Figure 4 graphic: an 8x8 sparse matrix 𝐴 (rows 0-7) multiplied by 𝐵 to produce 𝐶, with nonzeros marked as local-input, sync, or async; async transfers occur between individual pairs of nodes N0-N3, while the sync transfer broadcasts the dense rows hosted by N2 to N0, N1, and N3.]
Figure 4. Example of how combining the two communication flavors can be beneficial.

To understand which nonzeros are sync and which are async, consider the example. Columns 4 and 5 of the sparse matrix are quite dense. This means that the corresponding dense input rows of 𝐵 (shaded in the figure) are useful to many nodes. Specifically, 𝐵[4,0:𝐾-1] is needed by N1, N2, and N3, while 𝐵[5,0:𝐾-1] is needed by all the nodes. Hence, it is likely beneficial to transfer the whole group of rows hosted by N2 to all the other nodes through a collective broadcast operation. Hence, we classify the nonzeros at (0,5), (2,4), (3,5), [...]

Two-Face combines sparsity-unaware and sparsity-aware transfers for efficient distributed SpMM. In this section, we describe how sparse matrices are analyzed and partitioned into two types of regions.

4.1 Sparse Matrix Partitioning
Before the SpMM execution begins, the sparse matrix is preprocessed in order to determine which of the data transfers will use coarse-grained multicast operations, and which will use fine-grained one-sided communication. Then, during runtime, both types of transfers and their corresponding computations will proceed in parallel.
Two-Face adopts 1D partitioning (Subsection 2.2). Thus, each node, which contains one MPI rank, is responsible for the nonzeros in a group of consecutive rows of sparse matrix 𝐴. In addition, the node hosts the corresponding group of consecutive rows of the dense output matrix 𝐶, and a group of consecutive rows of the dense input matrix 𝐵. As discussed earlier, the accesses to 𝐶 and 𝐴 are always local, while the accesses to 𝐵 can either be local or require a remote data transfer, depending on the 𝑐_𝑖𝑑s of the processed nonzeros.
The matrices are partitioned in the following way:
Megatile. As shown in Figure 5, we logically divide the matrix 𝐴 into megatiles (MT). Given the matrix 𝐴 with 𝑁 rows and 𝑀 columns, and given 𝑝 nodes in the distributed system, a megatile is formed with 𝑁/𝑝 consecutive rows and 𝑀/𝑝 consecutive columns. Node 𝑖 stores the 𝑖th row of megatiles. We logically divide the matrices 𝐵 and 𝐶 based on the width and height of a megatile, respectively. Figure 5 shows the breakdown of the matrices for 𝑝 = 4. The chunks of the 𝐵 and 𝐶 matrices are distributed across the nodes as shown in the figure, where node 𝑖 is labeled N𝑖.
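As an illustration of this partitioning arithmetic, the sketch below computes megatile coordinates and ownership for a nonzero. It assumes 𝑁 and 𝑀 are evenly divisible by 𝑝 for simplicity, which the actual partitioner would not require; the names are ours, not the paper's.

#include <cstdint>

// Megatile grid for a sparse matrix A of size N x M distributed over p nodes
// (Section 4.1): each megatile spans N/p rows and M/p columns, and node i
// owns the i-th row of megatiles.
struct MegatileGrid {
  int64_t N, M;   // rows and columns of A
  int     p;      // number of nodes
  int64_t mt_rows() const { return N / p; }  // megatile height
  int64_t mt_cols() const { return M / p; }  // megatile width

  // Node that owns a nonzero at (row, c_id): ownership follows the row.
  int owner_node(int64_t row) const { return (int)(row / mt_rows()); }

  // Column index of the megatile the nonzero falls into; this also selects
  // the chunk of B (and the node hosting it) that the nonzero reads.
  int megatile_col(int64_t c_id) const { return (int)(c_id / mt_cols()); }

  // The B access is local when the megatile column matches the owner node.
  bool b_access_is_local(int64_t row, int64_t c_id) const {
    return owner_node(row) == megatile_col(c_id);
  }
};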
[...] sparse matrix into two types of regions at a fine granularity. Sparse stripes are classified as local-input if their corresponding 𝐵 rows are owned by the local node; otherwise, they are remote-input. A remote-input sparse stripe can either trigger a coarse-grained collective transfer or a fine-grained asynchronous one. At a high level, remote-input stripes requiring many rows of the dense input matrix 𝐵 will be marked as synchronous as, during execution, they will benefit from the coarse-grained collectives. On the contrary, remote-input stripes requiring few rows of 𝐵 will be marked as asynchronous as, during execution, they will benefit from fine-grained asynchronous transfers. During execution, multiple nodes containing synchronous stripes that require the same data from 𝐵 will participate in the same collective multicast operation to receive that data. It is possible that a "multicast" transfers data to only a single destination node.
Dense stripe. All the sparse stripes that have the same range of 𝑐_𝑖𝑑s in 𝐴 access the same group of rows in matrix 𝐵. We call this group of dense rows a dense stripe. A synchronous sparse stripe will trigger the coarse-grained transfer of a dense stripe using a collective operation with, potentially, additional destination nodes; an asynchronous sparse stripe will trigger the fine-grained one-sided transfer of individual rows (or groups of adjacent rows) within the dense stripe. These asynchronous transfers will only transfer the rows of the dense stripe that are needed for the computation. If a node does not need any of a dense stripe's rows, that dense stripe will not be communicated to it at all.
In the next sections, we use stripe to refer to sparse stripes. Any mention of dense stripes is explicit.
Stripes are classified as asynchronous or synchronous during a preprocessing step using a model that tries to minimize the expected execution time. We present the model in Section 4.2. During the actual execution after the preprocessing step, the local threads in an MPI rank operate in parallel and are split into two groups: (1) synchronous threads, which handle the data transfers for the synchronous stripes as well as the computation for synchronous and local-input stripes, and (2) asynchronous threads, which handle the data transfers and computation for the asynchronous stripes. All synchronous communication is completed before any synchronous computation begins. On the other hand, asynchronous communication and computation overlap: a thread may compute on one asynchronous stripe while another thread transfers data for a second asynchronous stripe.
To optimize computation and communication efficiency, the nonzeros in sparse stripes are ordered in row-major order in synchronous stripes and column-major order in asynchronous stripes. In synchronous stripes, the nonzeros are stored in row-major order because this benefits computation. Specifically, a thread can process the nonzeros of a whole row of the stripe and buffer the results in a thread-local buffer before updating 𝐶. Then, the thread uses a single synchronization operation to accumulate the contents of the thread-local buffer into the corresponding 𝐶 row. In asynchronous stripes, the nonzeros are stored in column-major order to benefit communication. Specifically, a column-major format allows a thread to quickly traverse the nonzeros and determine the unique 𝑐_𝑖𝑑s of the nonzeros in a stripe, which in turn identify the rows of matrix 𝐵 that need to be transferred. This comes at the cost of computational inefficiency since this format makes buffering the output of a thread's computations hard. As a result, the thread must typically use one synchronization operation for each nonzero to accumulate results onto 𝐶.

4.2 Preprocessing Model
During execution, Two-Face will process the asynchronous stripes in parallel with the synchronous and local-input stripes. Consequently, the optimal choice to partition the sparse matrix into synchronous and asynchronous stripes is one that equalizes the execution times of asynchronous stripes and synchronous/local-input stripes. To this end, we create a model of execution based on the following ideas.
For the synchronous stripes, the model assumes that the computation time will be negligible compared to the synchronous communication time. The reason is that the row-major format of the nonzeros lends itself to efficient execution: the output of the nonzeros in a row of the stripe is reused through a thread-local buffer and accumulated into the corresponding 𝐶 row with a single synchronization operation. In addition, we take advantage of this parallelizability by assigning more parallel threads to the computation of synchronous/local-input stripes than for the asynchronous stripes. For the local-input stripes, since they do not need communication, the model neglects both communication and computation time.
In contrast, the computation time for asynchronous stripes may be significant because the column-major format of the nonzeros lends itself to inefficient execution: thread-local buffers are not used and we need to perform a synchronization operation for every nonzero. In addition, because synchronization may be a bottleneck, we assign fewer threads to the computation of asynchronous stripes than to the others. Therefore, for the asynchronous stripes, the model considers both computation and communication.
We model the cost of synchronous communication (𝐶𝑜𝑚𝑚𝑆), asynchronous communication (𝐶𝑜𝑚𝑚𝐴), and asynchronous computation (𝐶𝑜𝑚𝑝𝐴) for a particular node as:

𝐶𝑜𝑚𝑚𝑆 = 𝑆𝑆 (𝛽𝑆 𝐾𝑊 + 𝛼𝑆)
𝐶𝑜𝑚𝑚𝐴 = 𝛽𝐴 𝐾𝐿𝐴 + 𝛼𝐴 𝑆𝐴
𝐶𝑜𝑚𝑝𝐴 = 𝛾𝐴 𝐾𝑁𝐴 + 𝜅𝐴 𝑆𝐴
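The terms in these expressions are defined next. As a quick reference, the following sketch is a direct transcription of the three cost formulas; the coefficient and statistic names are ours, and the coefficient values would come from the calibration step described later.

// Calibrated coefficients of the preprocessing cost model (Section 4.2).
struct ModelCoeffs { double beta_s, alpha_s, beta_a, alpha_a, gamma_a, kappa_a; };

// Per-node statistics that enter the model.
struct NodeStats {
  long S_s;   // synchronous stripes processed by the node
  long S_a;   // asynchronous stripes processed by the node
  long L_a;   // B rows transferred one-sided for the async stripes
  long N_a;   // nonzeros in the async stripes
  long K;     // columns of the dense matrices
  long W;     // stripe width
};

// Comm_S = S_S * (beta_S * K * W + alpha_S)
double comm_sync(const ModelCoeffs& c, const NodeStats& s) {
  return s.S_s * (c.beta_s * s.K * s.W + c.alpha_s);
}
// Comm_A = beta_A * K * L_A + alpha_A * S_A
double comm_async(const ModelCoeffs& c, const NodeStats& s) {
  return c.beta_a * s.K * s.L_a + c.alpha_a * s.S_a;
}
// Comp_A = gamma_A * K * N_A + kappa_A * S_A
double comp_async(const ModelCoeffs& c, const NodeStats& s) {
  return c.gamma_a * s.K * s.N_a + c.kappa_a * s.S_a;
}
// The partitioner aims for Comm_S ≈ Comm_A + Comp_A so the two sides overlap.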
Consider 𝐶𝑜𝑚𝑚𝑆 first. 𝑆𝑆 is the number of synchronous stripes processed by the node, 𝑊 is the stripe width, and 𝐾 is the number of columns in the dense matrices. 𝛽𝑆 is the cost of synchronous transfer per element of 𝐵 (i.e., it is inversely proportional to the bandwidth), and 𝛼𝑆 is other per-stripe overheads of synchronous transfers.
Next, for 𝐶𝑜𝑚𝑚𝐴, 𝑆𝐴 is the number of asynchronous stripes processed by the node, and 𝐿𝐴 is the total number of rows of the dense matrix 𝐵 transferred for these stripes via fine-grained accesses. 𝛽𝐴 and 𝛼𝐴 represent the same costs as 𝛽𝑆 and 𝛼𝑆 but for asynchronous accesses.
Finally, for 𝐶𝑜𝑚𝑝𝐴, 𝑁𝐴 is the total number of nonzeros in the asynchronous stripes processed by the node. 𝛾𝐴 is the computational cost per operation, and 𝜅𝐴 is the additional per-stripe software overhead of asynchronous computation.
The coefficients 𝛽𝑆, 𝛼𝑆, 𝛽𝐴, 𝛼𝐴, 𝛾𝐴, and 𝜅𝐴 are determined via a linear regression [40] calibration step (details in Section 6.2). These parameters are dependent on the system configuration. For example, a system with a large bisection bandwidth should have small 𝛽 terms. The 𝛼 terms may be reduced by reducing the round-trip communication latency, including the latency incurred in software libraries/drivers and in the network.
In the optimal case, 𝐶𝑜𝑚𝑚𝑆 = 𝐶𝑜𝑚𝑚𝐴 + 𝐶𝑜𝑚𝑝𝐴, so that there is a perfect overlap of the asynchronous and synchronous components. Defining 𝑆𝑇 = 𝑆𝑆 + 𝑆𝐴 to be the total number of non-local-input stripes processed by a particular node and rearranging this equation gives the following:

𝑆𝑆 (𝛽𝑆 𝐾𝑊 + 𝛼𝑆) = 𝛽𝐴 𝐾𝐿𝐴 + 𝛼𝐴 𝑆𝐴 + 𝛾𝐴 𝐾𝑁𝐴 + 𝜅𝐴 𝑆𝐴
⟹ (𝑆𝑇 − 𝑆𝐴)(𝛽𝑆 𝐾𝑊 + 𝛼𝑆) = 𝐾 (𝛽𝐴 𝐿𝐴 + 𝛾𝐴 𝑁𝐴) + 𝑆𝐴 (𝛼𝐴 + 𝜅𝐴)
⟹ 𝑆𝑇 (𝛽𝑆 𝐾𝑊 + 𝛼𝑆) = 𝐾 (𝛽𝐴 𝐿𝐴 + 𝛾𝐴 𝑁𝐴) + 𝑆𝐴 (𝛼𝐴 + 𝜅𝐴 + 𝛽𝑆 𝐾𝑊 + 𝛼𝑆)    (1)

We now classify the stripes requiring communication as either synchronous or asynchronous. Initially, we assume that all stripes are synchronous, which makes the right-hand side of Equation 1 equal to 0. Then, we take each stripe 𝑖 and consider classifying it as asynchronous, instead. In this case, the stripe's contribution (call it 𝑧𝑖) to the right-hand side of Equation 1 would be given by:

𝑧𝑖 = 𝑣𝑖 + 𝑢, where 𝑣𝑖 = 𝐾 (𝛽𝐴 𝑙𝑖 + 𝛾𝐴 𝑛𝑖), and 𝑢 = 𝛼𝐴 + 𝜅𝐴 + 𝛽𝑆 𝐾𝑊 + 𝛼𝑆,

where stripe 𝑖 requires 𝑙𝑖 dense rows from matrix 𝐵 and contains 𝑛𝑖 nonzeros. Note that 𝑢 depends only on the stripe width (𝑊) and other constants, and is therefore constant for all the stripes in the matrix.
To identify the most beneficial stripes to classify as asynchronous, we look for stripes with low values of 𝑧𝑖. This is because a low 𝑧𝑖 implies that, if the stripe is classified as asynchronous, it requires few dense rows to be transferred from a remote node and contains few nonzeros. Therefore, its communication and computation costs are relatively low. On the other hand, if the stripe is classified as synchronous, it has a constant communication cost.
Consequently, we sort all of this node's stripes by their 𝑧𝑖 in ascending order. Then, we take one stripe at a time, in order, and classify it as asynchronous, until we have taken the first 𝑟 stripes, where 𝑟 is the greatest number that satisfies

𝑆𝑇 (𝛽𝑆 𝐾𝑊 + 𝛼𝑆) ≥ Σ_{𝑖=0}^{𝑟−1} 𝑧𝑖 .

The rest of the stripes are classified as synchronous. With this method, the two sides of Equation 1 are approximately equal, which is a necessary condition to attain optimal execution time. Indeed, following this method, the total runtimes of the synchronous and asynchronous stripes in Two-Face should be nearly equal, assuming the simplified cost model that we use does hold. Additionally, sorting the stripes by their 𝑧𝑖 maximizes the number of stripes classified as asynchronous and minimizes the number of synchronous stripes. Because the cost of communication for a synchronous stripe is constant for a given 𝐾 and 𝑊, this strategy minimizes 𝐶𝑜𝑚𝑚𝑆 and, therefore, the total cost of the operation.
There are other possible methods of classifying stripes. One such method is to analyze columns of stripes in the sparse matrix and classify a stripe as synchronous when its corresponding dense stripe is needed by many nodes and, therefore, is likely to benefit from optimized multicast operations. We leave the investigation of such methods for future work.

5 The Two-Face Algorithm
In this section, we provide greater details about Two-Face. We discuss the sparse matrix representation, the Two-Face algorithm, its tuning and portability, and its applicability to GNN training.

5.1 Sparse Matrix Representation
Two-Face represents the sparse matrix 𝐴 in a modified COO format. The nonzeros in asynchronous stripes are extracted from 𝐴 and are stored in an Asynchronous sparse matrix; the nonzeros in synchronous/local-input stripes are extracted into a Synchronous/Local-Input sparse matrix. Because we use a compressed sparse matrix representation, this format does not significantly increase overall memory use.
Figure 6 shows: (a) an example of an input sparse matrix 𝐴, (b) its corresponding synchronous/local-input sparse matrix, and (c) its corresponding asynchronous sparse matrix. The figure assumes that there are four nodes and one sparse stripe for each 2x2 megatile. Assume that, after running the preprocessing step, the stripes have been classified as local-input, synchronous, and asynchronous such that the nonzeros in 𝐴 end up in the categories shown in Figure 6a.
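As a concrete picture of this layout, here is a hedged C++ sketch of the two structures that Figure 6 illustrates. The field and array names are our own; the paper only specifies the panel-pointer/stripe-pointer organization and the row-major versus column-major nonzero orderings.

#include <cstdint>
#include <vector>

// Synchronous/Local-Input sparse matrix (Figure 6b): nonzeros in row-major
// order, grouped into row panels. panel_ptrs[i] is the index of the first
// nonzero of panel i, with one extra entry marking the end (as in Figure 6b).
struct SyncLocalMatrix {
  std::vector<double>  values;
  std::vector<int64_t> columns;     // c_ids of the nonzeros
  std::vector<int64_t> rows;
  std::vector<int64_t> panel_ptrs;  // one entry per row panel, plus a sentinel
};

// Asynchronous sparse matrix (Figure 6c): nonzeros in column-major order
// within each stripe, with the stripes themselves ordered row-major.
// stripe_ptrs[i] is the index of the first nonzero of stripe i.
struct AsyncMatrix {
  std::vector<double>  values;
  std::vector<int64_t> columns;
  std::vector<int64_t> rows;
  std::vector<int64_t> stripe_ptrs;
};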
[Figure 6(a): an 8x8 input sparse matrix 𝐴 with nonzeros a-s, classified into local-input, sync, and async nonzeros.]
(b) Synchronous/Local-Input sparse matrix:
    Panel Pointers: 0 1 2 4 6 7 8 12 14
    Values:         a c e f g h j m n o p q r s
    Columns:        0 1 3 4 2 5 4 5 4 5 6 7 6 7
    Rows:           0 1 2 2 3 3 4 5 6 6 6 6 7 7
(c) Asynchronous sparse matrix:
    Stripe Pointers: 0 2 5
    Values:          d b k i l
    Columns:         6 7 0 1 1
    Rows:            1 0 5 4 5
Figure 6. Sparse matrix representation in Two-Face: (a) an input sparse matrix 𝐴, showing an example of nonzeros classified into the local-input, sync, and async categories; (b) the corresponding synchronous/local-input sparse matrix; and (c) the corresponding asynchronous sparse matrix. This figure assumes a 4-node system, a stripe width of 2, and a row panel height of 1.

The corresponding synchronous/local-input sparse matrix (Figure 6b) organizes the synchronous/local-input nonzeros in a row-major order structure. The elements in this structure are divided into row panels—e.g., nonzeros e and f in Figure 6 are in one row panel. Row panels are the units of work assigned to threads computing on synchronous/local-input nonzeros. In Figure 6b, these panels are one row tall, and an array of Synchronous/Local-input Panel Pointers points to the beginning of each panel.
The corresponding asynchronous sparse matrix (Figure 6c) organizes the asynchronous nonzeros within stripes in a column-major order structure. The order of the stripes themselves is row-major, to simplify the distribution of the asynchronous sparse matrix across nodes at runtime. An array of Asynchronous Stripe Pointers points to the beginning of each asynchronous stripe.
At runtime, each node will only store those portions of the synchronous/local-input and asynchronous sparse matrices that are relevant to its computation. In addition, for each dense stripe of 𝐵, the preprocessing step generates metadata containing a list of nodes that are destinations of the collective transfer of that stripe. At runtime, this metadata is replicated across all nodes.

5.2 Two-Face Algorithm Description
The algorithm consists of three parts: top-level algorithm, processing synchronous row panels, and processing asynchronous stripes. We describe each part in turn.

5.2.1 Top-Level Algorithm. Algorithm 1 shows the top-level operation of the Two-Face algorithm. All nodes of the distributed system execute Algorithm 1 in parallel. First, the node initializes a flag and two atomic queues for work-sharing (Lines 2-3). These queues provide indices of asynchronous stripes (𝑎𝑠𝑦𝑛𝑐_𝑞) and indices of row panels (𝑠𝑦𝑛𝑐_𝑞). In the example of Figure 6, the 𝑠𝑦𝑛𝑐_𝑞 of N2 is {4, 5}, which are the indices of the two pointers in the Sync/Local-Input Panel Pointers array used by N2 (pointing to rows containing nonzeros 𝑗 and 𝑚). The 𝑎𝑠𝑦𝑛𝑐_𝑞 of N2 is {1}, which is the index of the pointer in the Asynchronous Stripe Pointers array that points to the asynchronous stripe assigned to N2.

Algorithm 1 Top-Level Two-Face Pseudo-code.
1: procedure DistSPMM(𝐴, 𝐵, 𝐶)
2:   𝑠𝑦𝑛𝑐_𝑡𝑟𝑎𝑛𝑠𝑓𝑒𝑟_𝑑𝑜𝑛𝑒 ← False
3:   𝑎𝑠𝑦𝑛𝑐_𝑞, 𝑠𝑦𝑛𝑐_𝑞 ← InitQueues(𝐴)
4:   DoParallel
5:     if 𝑡𝑖𝑑 = 0 then                          ⊲ Sync Transfers
6:       TransferDenseStripes(𝐴, 𝐵)
7:       𝑠𝑦𝑛𝑐_𝑡𝑟𝑎𝑛𝑠𝑓𝑒𝑟_𝑑𝑜𝑛𝑒 ← True
8:     end if
9:     if 𝑡𝑖𝑑 ∈ 𝐴𝑠𝑦𝑛𝑐𝑇ℎ𝑟𝑒𝑎𝑑𝑠 then               ⊲ Async Processing
10:      while 𝑎𝑠𝑦𝑛𝑐_𝑞.nonempty() do
11:        𝑛 ← 𝑎𝑠𝑦𝑛𝑐_𝑞.pop()
12:        ProcessAsyncStripe(𝐴, 𝐵, 𝐶, 𝑛)
13:      end while
14:    end if
15:    WaitForFlag(𝑠𝑦𝑛𝑐_𝑡𝑟𝑎𝑛𝑠𝑓𝑒𝑟_𝑑𝑜𝑛𝑒)
16:    while 𝑠𝑦𝑛𝑐_𝑞.nonempty() do               ⊲ Sync Compute
17:      𝑛 ← 𝑠𝑦𝑛𝑐_𝑞.pop()
18:      ProcessSyncRowPanel(𝐴, 𝐵, 𝐶, 𝑛)
19:    end while
20:  EndParallel
21: end procedure

Then, the code starting at Line 4 is executed by all the threads of the node in parallel. Specifically, Thread 0 initiates the transmission/reception of the dense stripes needed for synchronous operations (Lines 5-8). Data transmission is implemented as non-blocking, but data reception is blocking. These transfers are done via a series of calls to MPI_Bcast. The destination nodes are determined via metadata produced by the preprocessing step, and these transfers occur at the stripe granularity.
In parallel, all the threads assigned to asynchronous stripes begin processing those stripes (Lines 9-14). Once all needed dense input data from collectives has been received (Line 15), all threads (including the asynchronous ones after they have processed the asynchronous stripes) process the synchronous row panels (Lines 16-19).

5.2.2 Processing Synchronous Row Panels. Algorithm 2 describes the processing of a row panel. The operation starts by initializing a thread-local Accumulation Buffer to zero (acc in Line 2) and reading the row panel (panel in Line 3). Then, the algorithm iterates through all of the nonzeros in the row panel, accumulating each result onto acc (Line 10). When we either complete a row of nonzeros (Line 7) or complete the whole row panel (Line 13), we add acc to the corresponding row of 𝐶. Atomics are required in this operation because some threads operating on asynchronous stripes may also be writing to the same rows of 𝐶.

Algorithm 2 Two-Face Sync Compute Pseudo-code
1: procedure ProcessSyncRowPanel(𝐴, 𝐵, 𝐶, 𝑛)
2:   𝑎𝑐𝑐 ← {0, ..., 0}                          ⊲ Output row buffer
3:   𝑝𝑎𝑛𝑒𝑙 ← 𝐴.𝑝𝑎𝑛𝑒𝑙_𝑝𝑡𝑟𝑠 [𝑛]
4:   𝑝𝑟𝑒𝑣_𝑟𝑜𝑤 ← 𝑝𝑎𝑛𝑒𝑙 [0].𝑟𝑜𝑤                    ⊲ Initialize to first row
5:   for 𝑛𝑧 ∈ 𝑝𝑎𝑛𝑒𝑙 do
6:     if 𝑛𝑧.𝑟𝑜𝑤 ≠ 𝑝𝑟𝑒𝑣_𝑟𝑜𝑤 then
7:       AtomicAdd(𝐶 [𝑝𝑟𝑒𝑣_𝑟𝑜𝑤], 𝑎𝑐𝑐)
8:       𝑎𝑐𝑐 ← {0, ..., 0}
9:     end if
10:    𝑎𝑐𝑐 ← 𝑎𝑐𝑐 + 𝑛𝑧.𝑣𝑎𝑙 ∗ 𝐵 [𝑛𝑧.𝑐𝑜𝑙]
11:    𝑝𝑟𝑒𝑣_𝑟𝑜𝑤 ← 𝑛𝑧.𝑟𝑜𝑤
12:  end for
13:  AtomicAdd(𝐶 [𝑝𝑟𝑒𝑣_𝑟𝑜𝑤], 𝑎𝑐𝑐)
14: end procedure

5.2.3 Processing Asynchronous Stripes. Algorithm 3 shows the algorithm to process an asynchronous stripe. A thread reads the asynchronous stripe (stripe in Line 2) and iterates over the nonzeros in the stripe to identify the unique 𝑐_𝑖𝑑s of the nonzeros (Line 3). These determine the indices of the dense rows from 𝐵 that are required. The asynchronous thread then initiates the remote access of the dense rows by calling GetRemoteRows (Line 4). This procedure uses MPI_Rget and a custom MPI datatype defined with MPI_Type_indexed to select only the rows of interest for the transfer.

Algorithm 3 Two-Face Async Pseudo-code
1: procedure ProcessAsyncStripe(𝐴, 𝐵, 𝐶, 𝑛)
2:   𝑠𝑡𝑟𝑖𝑝𝑒 ← 𝐴.𝑎𝑠𝑦𝑛𝑐_𝑠𝑡𝑟𝑖𝑝𝑒_𝑝𝑡𝑟𝑠 [𝑛]
3:   𝑑𝑟𝑜𝑤_𝑖𝑑𝑠 ← 𝑠𝑡𝑟𝑖𝑝𝑒.UniqueColIDs()
4:   𝑑𝑟𝑜𝑤𝑠 ← GetRemoteRows(𝑑𝑟𝑜𝑤_𝑖𝑑𝑠)
5:   for 𝑛𝑧 ∈ 𝑠𝑡𝑟𝑖𝑝𝑒 do in parallel
6:     AtomicAdd(𝐶 [𝑛𝑧.𝑟𝑜𝑤], 𝑛𝑧.𝑣𝑎𝑙 ∗ 𝑑𝑟𝑜𝑤𝑠 [𝑛𝑧.𝑐𝑜𝑙])
7:   end for
8: end procedure

Once the dense rows arrive, they are stored in drows (Line 4), and multiple threads begin computing on them. Each thread processes a subset of the nonzeros in the sparse stripe. Each nonzero is multiplied with the corresponding row of 𝑑𝑟𝑜𝑤𝑠 and accumulated into 𝐶 (Line 6). Atomics are required for correct accumulation into 𝐶, just as in the synchronous stripe case. However, since asynchronous nonzeros are stored in column-major order, we cannot easily use thread-local buffers to reduce the number of atomics.
To reduce transfer overheads, inside the GetRemoteRows routine, we coalesce the transfer of nearby rows of 𝐵. For example, if a sparse stripe requires 𝐵 rows {2, 3, 6, 8}, we transfer three groups of rows, with (𝑜𝑓𝑓𝑠𝑒𝑡, 𝑠𝑖𝑧𝑒) pairs equal to {(2, 2), (6, 1), (8, 1)}. This optimization reduces software overheads. For small 𝐾, we also coalesce rows separated by unused rows, potentially reducing the software overhead further, but transferring some useless data. Using the example from before, we might transfer groups of rows {(2, 2), (6, 3)}, retrieving one unnecessary row (row 7).

5.3 Tuning Knobs and Portability
Two-Face has several parameters that may need to be calibrated for each individual system to achieve maximal performance. Among these are the coefficients used in the preprocessing cost model (𝛽𝑆, 𝛼𝑆, 𝛽𝐴, 𝛼𝐴, 𝜅𝐴, 𝛾𝐴). As mentioned before, in our evaluation, we determine the values of these coefficients via linear regression on a small number of workloads. These parameters only need to be calibrated once for a system, possibly at installation time.
In addition to the preprocessing cost model coefficients, the runtime algorithm is parameterized by the number of threads assigned to sync/async stripe processing, the aggressiveness of row coalescing in async stripe transfers, the height of the row panels used for computation in the sync stripes, and the width of the stripes. The optimal choice for these parameters may vary between systems and workloads, but we show in Sections 6.2 and 7 that choosing reasonable, static values can provide good performance. In practice, these parameters could be determined at installation time similarly to the preprocessing coefficients.
Thus, although Two-Face relies on knowledge of system characteristics to make decisions about how to schedule the work, porting to a new system just requires a one-time profiling step during installation.

5.4 Applicability to GNN Training
While SpMM is used in a variety of domains, one of the most important ones is GNN training. GNN training is often done [...]
To determine the appropriate width of stripes, we analyzed the performance of SpMM using the queen, arabic, and twitter matrices with various choices for 𝑊. There was increasing overhead in both the preprocessing and runtime steps as the number of stripes grew, suggesting that the stripe width should not be made too small, relative to the size of the matrix. We decided to scale the stripe width proportionally to the dimensions of the matrices, rounding to the nearest power of two. Table 1 shows the stripe widths we chose.
All run-time parameters other than the stripe width are held constant across matrices. Table 2 shows these parameters. Each node runs 128 OpenMP threads. Since a large number of one-sided transfers results in high resource contention, we limit the number of threads communicating asynchronous data to 2 per node. We allow each of these threads to fork up to four ways (for a total of 8 threads) when computing on the asynchronous stripes. We dedicate the remaining 120 threads in the node to computation on the synchronous and local-input stripes. We define the maximum row coalescing distance for asynchronous transfers to be proportional to 1/𝐾, since the cost of transferring unnecessary dense rows grows with 𝐾.

Table 2. Constant runtime parameters used in Two-Face.
Parameter Name                                       | Value
Async Communication Threads per Node                 | 2
Async Computation Threads per Node                   | 8
Sync/Local-Input Computation Threads per Node        | 120
Max Async Coalescing Distance                        | (127/𝐾) + 1
Row Panel Height of Sync/Local-Input Sparse Matrix   | 32 rows

To determine the values of the preprocessing parameters used in stripe classification (Section 4.2), we employ linear regression [40]. We collect data by processing the twitter matrix [36] using 𝐾 = 32, 𝑝 = 32, and nine different combinations of stripe widths and asynchronous/synchronous stripe classifications. The number of samples is kept small to ensure that it is reasonable to calibrate these coefficients when installing Two-Face on a new system. The derived coefficient values, which we use when preprocessing all matrices in our evaluation (unless otherwise specified), are shown in Table 3.

Table 3. Coefficient values used in the preprocessing of matrices. The 𝛽 parameters relate to the system bandwidth, the 𝛼 parameters relate to other communication overheads, and the 𝛾 & 𝜅 terms relate to computational throughput and other overheads.
Coefficient | Experimental Value
𝛽𝑆          | 1.95 × 10^−10
𝛼𝑆          | 1.36 × 10^−6
𝛽𝐴          | 3.61 × 10^−9
𝛼𝐴          | 1.02 × 10^−5
𝛾𝐴          | 2.07 × 10^−8
𝜅𝐴          | 8.72 × 10^−9

These coefficients provide some insight into the performance difference between one-sided asynchronous and collective synchronous communication. For example, they suggest that asynchronous transfers are more expensive per transferred element of 𝐵 than synchronous transfers by a factor of 𝛽𝐴/𝛽𝑆 ≈ 18.5.
In Section 7.4 of our evaluation, we evaluate the impact of different values of these coefficients.

6.3 Algorithms Evaluated
In our evaluation, we compare Two-Face to other algorithms shown in Table 4. All the algorithms use 1D partitioning. We divide the dense input matrix 𝐵 into as many equally-sized portions as the number of nodes 𝑝, and call each portion a "block". The 𝐵 matrix is distributed across all nodes, where each node stores a single block.

Table 4. SpMM algorithms being compared.
Algorithm Name        | MPI Transfer Operations
Dense Shifting [8]    | MPI_Allgather, MPI_Sendrecv
Allgather             | MPI_Allgather
Async Coarse-Grained  | MPI_Get
Two-Face              | MPI_Rget, MPI_Ibcast
Async Fine-Grained    | MPI_Rget

Dense Shifting (DS) is a synchronous SpMM algorithm that has been investigated by Bharadwaj et al. [8] and found to be highly competitive compared to other state-of-the-art implementations. We use it as our main baseline. DS begins by using MPI_Allgather to replicate a certain number of blocks in each node, as determined by a replication factor 𝑐. It then continues by shifting the replicated blocks cyclically via MPI_Sendrecv after each computation step. For instance, with 𝑐 = 4, this algorithm replicates each block such that each node holds four blocks at a time. It then performs 𝑝/𝑐 computation and shifting steps to complete the SpMM operation. In our experiments, we evaluate this algorithm for 𝑐 = 2, 𝑐 = 4, and 𝑐 = 8, and refer to these settings as DS2, DS4, and DS8, respectively.
The next two algorithms replicate all or nearly all of the matrix 𝐵 before beginning the computation. In Allgather, each node uses MPI_Allgather to broadcast its block of 𝐵 to all others and receive theirs in turn. In Asynchronous Coarse-Grained, each node uses MPI_Get to obtain the blocks that it needs for its computation. In both cases, substantial memory has to be allocated, creating issues as the problem size scales.
Two-Face is the algorithm we propose. We use the parameters as described before. However, if the preprocessing algorithm determines that the chosen sync/async classification of stripes would result in too much memory consumption
Figure 7. Speedups of various SpMM algorithms over DS2 for 𝐾 = 32.
Figure 8. Speedups of various SpMM algorithms over DS2 for 𝐾 = 128.
Figure 9. Speedups of various SpMM algorithms over DS2 for 𝐾 = 512.
in one or more nodes during SpMM execution, it will classify additional stripes as async until the expected memory consumption in those nodes is feasible.
Asynchronous Fine-Grained is implemented in the same way as Two-Face, except that all stripes are asynchronous. This algorithm is used as an extreme example to illustrate the tradeoffs made by a balanced Two-Face implementation. This baseline was used in Section 3.
All algorithms are evaluated by averaging the time of 5 consecutive SpMM operations. By default, our experiments use 𝑝 = 32 and 𝐾 = 128. Some experiments use 𝐾 = 32 or 𝐾 = 256, and others use 𝑝 = 1, 2, 4, 8, 16, 32, or 64.

7 Evaluation
In this section, we evaluate Two-Face. First, we compare the performance of Two-Face to the various baselines and discuss any bottlenecks observed. Next, we discuss the scaling behavior of Two-Face as we vary the number of nodes in the system. Finally, we analyze the preprocessing cost of Two-Face and the sensitivity of Two-Face to the choice of the preprocessing parameters.

7.1 Comparing Two-Face to Various Baselines
Figures 7, 8, and 9 show the speedups of Two-Face and the other SpMM algorithms over DS2 for 𝐾 = 32, 𝐾 = 128, and 𝐾 = 512, respectively. We normalize to DS2 because, unlike DS4 or DS8, DS2 does not run out of memory for any matrices or value of 𝐾 in our evaluation. From the figures, we see that, on average, across matrices and 𝐾 values, Two-Face is the fastest algorithm, and delivers substantial speedups.
As 𝐾 increases, the advantage of Two-Face over the dense shifting algorithms becomes more prominent. This is because the cost of transferring unnecessary rows in the dense shifting algorithms increases with 𝐾, providing a greater advantage to the fine-grained one-sided accesses of Two-Face. At 𝐾 = 32, Two-Face's average speedup over the dense shifting algorithm with the best choice of replication factor for each individual matrix is 1.53x. At 𝐾 = 128, the same speedup is 2.11x, and at 𝐾 = 512, it is 2.35x. The average speedup across all values of 𝐾 shown here is 1.99x.
The Async Fine and dense shifting algorithms are on average faster than the Async Coarse and Allgather algorithms. Dense shifting is sometimes unable to run with higher replication factors due to memory constraints. For example, for 𝐾 = 512, DS8 fails to run for half of the matrices, and DS4
Table 5. Absolute execution times of DS2 and Two-Face for the experiments in Figures 7, 8, and 9. The numbers are the average
of five SpMM operations.
Figure 10. Breakdown of the total execution times of DS4 and Two-Face for 𝐾 = 128. Two-Face’s time is divided into synchronous
and asynchronous components (left and right bars, respectively), which operate in parallel. These are further broken down
into computation (Comp) and communication (Comm). DS4 only has a Sync component. The Other category mainly consists of
the initial setup of data structures for MPI. Execution times are normalized to DS4.
fails in one matrix. As a reference, Table 5 provides the absolute execution times of Two-Face and DS2 in these figures.
The figures also show that the speedups (or slowdowns) are highly dependent on the matrix. For example, Two-Face is not the fastest algorithm for twitter and friendster and, for 𝐾 = 32, additionally for mawi and kmer. To understand this behavior, Figure 10 breaks down the total execution time of DS4 and Two-Face for each matrix for 𝐾 = 128. For Two-Face, we break down the execution time into Sync Comp, Sync Comm, Async Comp, and Async Comm. We stack the Sync components in the left bar and the Async components in the right bar, and show both bars side-to-side, since the execution time is equal to the higher of the two bars. Two-Face also has some Other overheads, which mainly consist of initializing necessary MPI structures before the main communication/computation begins. For DS4, only Sync Comp and Sync Comm are relevant. For each matrix, the bars are normalized to DS4.
We see that the dominant contributor to DS4's execution time is its communication. Two-Face is able to attain significant speedups over DS4 by reducing the amount of communication through fine-grained accesses. In five of the matrices, we can see that the sum of the communication time spent by Two-Face in Sync Comm and Async Comm is significantly less than the amount of time spent by DS4 in its communication.
Two exceptions are twitter and friendster. In these matrices, Two-Face's Sync Comp and Sync Comm have both increased over DS4, despite the fact that less data is being transferred. We note that Two-Face's synchronous broadcast operations are significantly slower than the cyclic shifting operations in DS4 when a large portion of the input dense matrix is required by many nodes. When Two-Face operates on a matrix like friendster, each node participates in many more MPI calls than it does if dense shifting is used, due to the finer granularity of the transfers.
An interesting case is mawi, where Two-Face is unable to reduce the execution time over DS4 because of the cost of asynchronous computation. The mawi sparse matrix has regions that have a relatively high density of nonzeros. Computing on such asynchronous stripes is likely expensive due to the heavy use of atomics, as the nonzeros are organized in column-major order. During this work, we conducted initial tests into storing the nonzeros in row-major order instead. However, this change did not result in faster execution, as the cost of identifying which columns contained nonzeros (and therefore which dense rows were required) became drastically higher.

7.2 Two-Face Strong Scaling
Figure 11 shows the execution times of Two-Face and the dense shifting algorithm with different replication factors (DS1, DS2, DS4, and DS8) as we scale the number of nodes
Figure 11. Execution time of Two-Face and the dense shifting algorithm with different replication factors (DS1, DS2, DS4,
and DS8) as the number of nodes changes. The 𝐾 value is 128. Some data points for the dense shifting algorithm are missing
because they need too much memory or take too long to execute. Both axes in the plots use a logarithmic scale.
from 1 to 64. There is a plot for each matrix and both axes are in logarithmic scale. Some data points are missing, since some workloads either exceed the memory capacity of one or more nodes (at small node counts or high replication factors) or take too long to run.
The figure shows that, in most of the matrices, Two-Face scales well with the number of nodes and, in fact, as well or better than the dense shifting algorithm. The exceptions are mawi, twitter, and, to a lesser extent, friendster. With mawi, none of the algorithms scale particularly well due to the high load imbalance across nodes induced by the matrix. With twitter and friendster, we saw in Figure 10 that Two-Face is impacted by inefficient synchronous communication. This is the reason for the worse scaling performance.
To understand the behavior of twitter and friendster better, we profile the collectives in the 64-node runs. We measure the number of recipients of each multicast operation. On average, this number is 35.7 for twitter and 43.5 for friendster. In contrast, the matrix with the next largest average recipient count is kmer, with an average of only 5.7. It appears that the large collectives needed for Two-Face in twitter and friendster are responsible for the inefficient execution and limited scaling. This effect does not appear at low node counts, where the execution is primarily bottlenecked by local computation, but it dominates at high node counts. Future work should investigate methods to reduce the size of collectives in the algorithm or the design of more regular data movement patterns for the synchronous stripes.
Overall, the performance of Two-Face improves as we scale from 1 to 64 nodes by 7.47x on average, with a best-case speedup of 12.12x for queen and a worst-case of 0.76x for twitter. Moreover, compared to the dense shifting algorithm with the optimal replication factor, Two-Face sees an average speedup ranging from 1.25x at 4 nodes to 2.21x at 64 nodes.

7.3 Two-Face Preprocessing Cost
Two-Face requires a preprocessing step that involves, mainly: (1) running our model to classify the stripes into synchronous and asynchronous, and (2) creating the asynchronous and the synchronous/local-input sparse matrices. In this section, we give an idea of the execution time of the preprocessing step. Note that we have not fully optimized it; in particular, we have not parallelized it across multiple nodes. Therefore, the numbers reported are a pessimistic bound.

Table 6. The overhead of preprocessing in Two-Face, normalized to the cost of a single SpMM operation.
Matrix     | 𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂 | 𝑡𝑛𝑜𝑟𝑚
web        | 428.74    | 102.00
queen      | 302.55    | 23.60
stokes     | 116.70    | 11.18
arabic     | 180.35    | 36.57
mawi       | 2.58      | 1.50
kmer       | 6.16      | 3.25
twitter    | 17.89     | 7.29
friendster | 19.81     | 8.79
Average    | 134.35    | 24.27

Table 6 shows the overhead of the preprocessing step for 32 nodes and 𝐾=128 for each matrix. Column 2 shows
[Figure 12 data. Each entry is Two-Face's execution time with the modified parameters, normalized to the execution time with the default parameters. Rows vary the first parameter and columns vary the second, at 0.8x, 1x, and 1.25x of the default (left to right).]
(a) Varying 𝛼𝐴 and 𝛽𝐴:
             0.8·𝛽𝐴0   𝛽𝐴0    1.25·𝛽𝐴0
0.8·𝛼𝐴0      0.91      1.10   1.31
𝛼𝐴0          1.03      1.00   1.52
1.25·𝛼𝐴0     0.94      1.07   1.38
(b) Varying 𝛼𝑆 and 𝛽𝑆:
             0.8·𝛽𝑆0   𝛽𝑆0    1.25·𝛽𝑆0
0.8·𝛼𝑆0      1.20      1.03   1.03
𝛼𝑆0          1.30      1.00   1.25
1.25·𝛼𝑆0     1.46      1.04   1.17
(c) Varying 𝛾𝐴 and 𝜅𝐴:
             0.8·𝜅𝐴0   𝜅𝐴0    1.25·𝜅𝐴0
0.8·𝛾𝐴0      1.01      1.01   1.13
𝛾𝐴0          1.15      1.00   1.24
1.25·𝛾𝐴0     1.25      1.48   1.19
Figure 12. Sensitivity of Two-Face's execution time to the values of the parameters of the execution model used during the preprocessing step. The default values of these parameters, as set in Section 6.2 and used in all the earlier experiments, are represented with the 0 subscript (e.g., 𝛼𝐴0).
𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂, which is the time of the preprocessing step normalized to the time of one SpMM operation. On average, 𝑡𝑛𝑜𝑟𝑚_𝐼/𝑂 is 134.35. However, the preprocessing step is dominated by I/O time, as the original sparse matrix is read from the file system in a textual Matrix Market format [9] and the final asynchronous and synchronous/local-input sparse matrices are written to the file system in a bespoke binary format.
Since in many realistic environments this I/O will not be present, Column 3 shows the more relevant 𝑡𝑛𝑜𝑟𝑚, which is the preprocessing overhead without I/O normalized to the time of one SpMM operation. In this case, the numbers have reduced substantially. We see that 𝑡𝑛𝑜𝑟𝑚 ranges from 1.50 to 102.00, with an average of 24.27. For 𝐾 = 512 (not shown in the table), the average value of 𝑡𝑛𝑜𝑟𝑚 is 6.15.
From these numbers, we see that the cost of the preprocessing step can be easily amortized. For the matrices where Two-Face demonstrates a speedup over dense shifting when 𝐾 = 128, an average of only 15 SpMM operations need to be performed by Two-Face to already see a speedup when including preprocessing time. For 𝐾 = 512, this decreases to only 3 SpMM operations, on average. In contexts such as GNN training, with hundreds of epochs, we can expect to perform many more SpMM operations with the same matrices than these numbers. In addition, the preprocessing step from training may be reusable during inference.

7.4 Sensitivity to Parameter Values of the Preprocessing Model
The model of execution that we use during preprocessing (Section 4.2) uses parameters 𝛼𝐴, 𝛽𝐴, 𝛼𝑆, 𝛽𝑆, 𝛾𝐴, and 𝜅𝐴. In Section 6.2, we used linear regression to set their default values. We used such values in all the experiments so far.
In this section, we change the values of these parameters, repeat the experiments, and measure the changes in Two-Face's execution time. We perform three sets of changes. In the first one, we vary 𝛼𝐴 and 𝛽𝐴, keeping the other parameters unchanged. Specifically, if 𝛼𝐴0 and 𝛽𝐴0 are the default values of 𝛼𝐴 and 𝛽𝐴, we consider all combinations of {0.8 · 𝛼𝐴0, 𝛼𝐴0, 1.25 · 𝛼𝐴0} × {0.8 · 𝛽𝐴0, 𝛽𝐴0, 1.25 · 𝛽𝐴0}. In the second set of changes, we vary 𝛼𝑆 and 𝛽𝑆 in the same way, keeping the other parameters unchanged. Finally, we vary 𝛾𝐴 and 𝜅𝐴, again keeping the others unchanged.
Figure 12 shows the outcome of the three sets of changes for the average of three representative matrices: web (Two-Face's best case), twitter (Two-Face's worst case), and stokes (Two-Face's median case). For example, Figure 12a corresponds to the experiments varying 𝛼𝐴 and 𝛽𝐴. The number in each box is Two-Face's execution time with the new parameters relative to Two-Face's execution time with the default parameters. For example, if we use 0.8 · 𝛼𝐴0 and 1.25 · 𝛽𝐴0, Two-Face's execution time becomes 1.31x higher than when using the default values.
Overall, the figure shows that using the default parameters obtained using linear regression is a good choice. Changes to the parameter values typically end up increasing Two-Face's execution time. The execution time decreases in only two cases, and the decrease is small.

8 Related Work
Distributed SpMM: Existing work on distributed SpMM is rather limited, but there are recent works exploring the topic. Bharadwaj et al. [8] investigate distributed SpMM, SDDMM, and methods of fusing the two for machine learning applications. The implementations of dense shifting evaluated in our paper originate from their work. Additionally, Bharadwaj et al. [8] present a sparse shifting implementation. In our work, we did not evaluate their approach, since it partitions the dense input and output matrices in a way that requires additional all-to-all communication for GNNs or other applications that interleave SpMM with a row-wise operator. Bharadwaj et al. [8] compare their implementations to the
SpMM provided by PETSc [7]. Selvitopi et al. [48] investigate multiple algorithms for SpMM, including algorithms that use bulk-synchronous collective communication and algorithms that use one-sided asynchronous RDMA communication. They do not, however, investigate combining these communication primitives in a single algorithm.
GNN Training: Prior work has addressed the issue of large graphs in GNN training via sampling techniques [25]. However, the benefits of sampling can come at a cost to accuracy, leading prior work to investigate full-batch distributed GNN training [33, 53]. However, in Tripathy et al. [53], dense matrices used in SpMM operations are only transferred in a coarse-grained sparsity-unaware fashion. Conversely, Jia et al. [33], using Lux [32], assume a GNN runtime that operates via pushing/pulling node embeddings in a fine-grained manner. This is distinct from Two-Face, which uses a combination of coarse and fine-grained transfers to leverage the benefits of both approaches.
Non-Distributed Sparse Kernels: SpMM optimization has been the topic of several investigations. Works such as WACO [56], WISE [57], and DDB [58] attempt to optimize sparse computations by using machine learning techniques to predict the performance of various configurations. Many CPU and GPU tiling techniques and implementations have been published [27, 30, 37, 41]. Other sparse kernels have also been subject to several investigations aiming to tame irregular access patterns [24, 28, 42, 52]. These optimizations for non-distributed kernels may be applicable to the distributed case, but they tend to assume a shared memory system, and they are largely orthogonal to our work.
Recently, SpMM, SpMV, and SpGEMM kernels for heterogeneous hardware have been proposed [13, 20, 38]. Cheng et al. [13] tackle SpGEMM on asymmetric multicore processors. HotTiles [20] partitions the SpMM sparse input matrix into two types of regions and assigns each region type to a different accelerator by solving an optimization problem.
Other Distributed Sparse Kernels: Other distributed sparse kernels have recently received attention. CombBLAS [6] is a library for distributed sparse kernels such as SpGEMM and SpMV. CombBLAS provides a number of GPU SpMM implementations using different partitioning and communication patterns. All of them use sparsity-unaware collectives. In contrast, Two-Face uses a hybrid approach. Hussain et al. [31] investigate communication-avoiding algorithms for SpGEMM. DGCL [10] is a library for distributed GNN training that partitions graphs and processes GNN computations at the level of nodes in the graphs, without explicitly expressing the computation with SpMM operations.
Domain-specific Architectures and Network Support: Several architectural designs that offer hardware support for SpMM computation have been recently proposed [21, 26, 34, 44, 50]. SPADE [21] is an accelerator for SpMM and SDDMM designed to be tightly coupled with CPU cores. Tensaurus [50] and ExTensor [26] are accelerators for a variety of sparse kernels. We believe that algorithms such as Two-Face can be useful in orchestrating the communication in scaled-up multi-node versions of these accelerators or for other large-scale graph analytics architectures [1, 4, 45]. In addition, we believe that such algorithms can also be beneficial for inter-cube or inter-chip communication in PIM-based architectures for graph analytics [5, 22, 59, 60].
Finally, scheduling algorithms for collectives such as Themis [47] have been proposed to maximize the bandwidth utilization of multidimensional, heterogeneous networks. These works could inspire network hardware support for Two-Face, but one would also require innovations to support the asynchronous communication operations.

9 Conclusion
Sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, this paper presented Two-Face, an algorithm for distributed SpMM that, leveraging a preprocessing model, performs collective communications for the denser regions, and fine-grained one-sided communications for the sparser regions. Two-Face attains an average speedup of 2.11x over dense shifting when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.
Two-Face suggests that distributed sparse algorithms should be input-matrix aware, in that different sections of a sparse input matrix prefer using different communication methods. The algorithms should also be communication-oriented, since minimizing communication is a first-class concern. With simple modifications, the Two-Face algorithm should also be applicable to sparse kernels such as Sampled Dense-Dense Matrix Multiplication (SDDMM), which exhibits very similar patterns to SpMM. Likewise, with proper parameter tuning, Two-Face may also be applicable to accelerate SpMV, which is a special case of SpMM. We are investigating these and other algorithms.

Acknowledgments
We thank Fredrik Kjolstad and the reviewers for helping to improve this paper. This research was funded in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; by a grant from the IBM-Illinois Discovery Accelerator Institute; by NSF grants PPoSS CCF 2316233, CNS 1956007 and CCF 2107470; by DOE grant DE-SC0022098; and by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 21-46756. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS230044 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.
[27] Changwan Hong, Aravind Sukumaran-Rajam, Israt Nisa, Kunal Singh, and P. Sadayappan. 2019. Adaptive Sparse Tiling for Sparse Matrix Multiplication. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (Washington, District of Columbia) (PPoPP ’19). Association for Computing Machinery, 300–314. https://doi.org/10.1145/3293883.3295712
[28] Olivia Hsu, Maxwell Strange, Ritvik Sharma, Jaeyeon Won, Kunle Olukotun, Joel S. Emer, Mark A. Horowitz, and Fredrik Kjølstad. 2023. The Sparse Abstract Machine. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 710–726. https://doi.org/10.1145/3582016.3582051
[29] Yuwei Hu, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13. https://doi.org/10.1109/SC41405.2020.00075
[30] Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. 2020. GE-SpMM: General-Purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12. https://doi.org/10.1109/SC41405.2020.00076
[31] Md Taufique Hussain, Oguz Selvitopi, Aydin Buluç, and Ariful Azad. 2021. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 90–100. https://doi.org/10.1109/IPDPS49936.2021.00018
[32] Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken. 2017. A Distributed Multi-GPU System for Fast Graph Processing. Proceedings of the VLDB Endowment 11, 3 (Nov 2017), 297–310. https://doi.org/10.14778/3157794.3157799
[33] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the accuracy, scalability, and performance of graph neural networks with Roc. Proceedings of Machine Learning and Systems 2 (2020), 187–198.
[34] Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, and Onur Mutlu. 2019. SMASH: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 600–614.
[35] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[36] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW ’10: Proc. the 19th Intl. Conf. on World Wide Web (Raleigh, North Carolina, USA). ACM, 591–600. https://doi.org/10.1145/1772690.1772751
[37] Shigang Li, Kazuki Osawa, and Torsten Hoefler. 2022. Efficient Quantized Sparse Matrix Operations on Tensor Cores. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15. https://doi.org/10.1109/SC41404.2022.00042
[38] Wenxuan Li, Helin Cheng, Zhengyang Lu, Yuechen Lu, and Weifeng Liu. 2023. HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors. In 2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE Computer Society, 209–220.
[39] Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard Version 4.0. https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf
[40] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to linear regression analysis. John Wiley & Sons.
[41] NVIDIA. 2024. cuSPARSE. Retrieved Jan 2024 from https://developer.nvidia.com/cusparse
[42] Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, Neal C. Crago, Aamer Jaleel, John D. Owens, Edgar Solomonik, Joel S. Emer, and Christopher W. Fletcher. 2023. Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 18–32. https://doi.org/10.1145/3582016.3582064
[43] OpenMP Architecture Review Board. 2015. OpenMP Application Program Interface Version 4.5. https://openmp.org/wp-content/uploads/openmp-4.5.pdf
[44] Marcelo Orenes-Vera, Aninda Manocha, Jonathan Balkind, Fei Gao, Juan L Aragón, David Wentzlaff, and Margaret Martonosi. 2022. Tiny but mighty: designing and realizing scalable latency tolerance for manycore SoCs. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 817–830.
[45] Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, and Margaret Martonosi. 2023. Dalorex: A data-local program execution and architecture for memory-bound applications. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 718–730.
[46] Eigen Project. 2023. Eigen v3.4. Retrieved Jan 2024 from https://eigen.tuxfamily.org
[47] Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. 2022. Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA ’22). Association for Computing Machinery, 581–596. https://doi.org/10.1145/3470496.3527382
[48] Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2021. Distributed-Memory Parallel Algorithms for Sparse Times Tall-Skinny-Dense Matrix Multiplication. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, 431–442. https://doi.org/10.1145/3447818.3461472
[49] Pavel Shamis, Manjunath Gorentla Venkata, M Graham Lopez, Matthew B Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L Graham, Liran Liss, et al. 2015. UCX: an open source framework for HPC network APIs and beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. IEEE, 40–43.
[50] Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 689–702. https://doi.org/10.1109/HPCA47549.2020.00062
[51] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.
[52] Han D. Tran, Milinda Fernando, Kumar Saurabh, Baskar Ganapathysubramanian, Robert M. Kirby, and Hari Sundar. 2022. A scalable adaptive-matrix SPMV for heterogeneous architectures. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 13–24. https://doi.org/10.1109/IPDPS53621.2022.00011
[53] Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing Communication in Graph Neural Network Training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14. https://doi.org/10.1109/SC41405.2020.00074
[54] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[55] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019).
[56] Jaeyeon Won, Charith Mendis, Joel S. Emer, and Saman Amarasinghe. 2023. WACO: Learning Workload-Aware Co-Optimization of the Format and Schedule of a Sparse Tensor Program. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, 920–934. https://doi.org/10.1145/3575693.3575742
[57] Serif Yesil, Azin Heidarshenas, Adam Morrison, and Josep Torrellas. 2023. WISE: Predicting the Performance of Sparse Matrix Vector Multiplication with Machine Learning. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP ’23). Association for Computing Machinery, 329–341. https://doi.org/10.1145/3572848.3577506
[58] Serif Yesil, José E. Moreira, and Josep Torrellas. 2022. Dense Dynamic Blocks: Optimizing SpMM for Processors with Vector and Matrix Units Using Machine Learning Techniques. In Proceedings of the 36th ACM International Conference on Supercomputing (Virtual Event) (ICS ’22). Association for Computing Machinery, Article 27, 14 pages. https://doi.org/10.1145/3524059.3532369
[59] Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 544–557.
[60] Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-based graph processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 712–725.