0% found this document useful (0 votes)
22 views15 pages

MLSys 2022 Gpu Semiring Primitives For Sparse Neighborhood Methods Paper

This document discusses the development of high-performance GPU semiring primitives for sparse neighborhood methods in machine learning and information retrieval. It highlights the challenges of handling sparse data and proposes a flexible design that supports various distance measures while optimizing performance and memory usage. The implementation is open-source and aims to unify computations for critical distance measures on GPUs, providing a foundation for future research in this area.

Uploaded by

dj0398913
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views15 pages

MLSys 2022 Gpu Semiring Primitives For Sparse Neighborhood Methods Paper

This document discusses the development of high-performance GPU semiring primitives for sparse neighborhood methods in machine learning and information retrieval. It highlights the challenges of handling sparse data and proposes a flexible design that supports various distance measures while optimizing performance and memory usage. The implementation is open-source and aims to unify computations for critical distance measures on GPUs, providing a foundation for future research in this area.

Uploaded by

dj0398913
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

GPU S EMIRING P RIMITIVES FOR S PARSE N EIGHBORHOOD M ETHODS

Corey J. Nolet 1 2 Divye Gala 1 Edward Raff 2 3 Joe Eaton 1 Brad Rees 1 John Zedlewski 1 Tim Oates 2

A BSTRACT
High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of
skewed degree distributions and limits on memory consumption that are typically not issues in dense operations.
We demonstrate that a sparse semiring primitive can be flexible enough to support a wide range of critical distance
measures while maintaining performance and memory efficiency on the GPU. We further show that this primitive
is a foundational component for enabling many neighborhood-based information retrieval and machine learning
algorithms to accept sparse input. To our knowledge, this is the first work aiming to unify the computation of
several critical distance measures on the GPU under a single flexible design paradigm and we hope that it provides
a good baseline for future research in this area. Our implementation is fully open source and publicly available
as part of the RAFT library of GPU-accelerated machine learning primitives (https://github.com/rapidsai/raft).

1 I NTRODUCTION Semirings provide a useful paradigm for defining and com-


puting inner product spaces in linear algebra using two op-
Many machine learning and information retrieval tasks erations, as in the MapReduce (Mattson et al., 2013; Emoto
operate on sparse, high-dimensional vectors. Nearest- et al., 2012) paradigm, where a product() function is used
neighbor based queries and algorithms in particular are in- to define a mapping between point-wise corresponding el-
strumental to many common classification, retrieval, and ements of vectors and a sum() function is used to reduce
visualization applications(Schölkopf et al., 2001; Alpay, the products into a scalar. Using semirings to implement
2012; Berlinet and Thomas-Agnan, 2011; Smola et al., algorithms with sparse linear algebra on GPUs is an active
2007; Scholkopf and Smola, 2018). As General-purpose area of research (Fender, 2017; Gildemaster et al., 2020;
GPU computing (GPGPU) has become more popular, the Lettich, 2021) and has been widely studied for helping to
tools for IR and distance computations on GPUs has not consolidate both the representation and execution of op-
kept pace with other tooling on dense representations like erations on graphs and probabilistic graphical models. In
image and signal processing that have contiguous access this paper, we show that semirings can be used for sparse
patterns that are easier to code for (Guo et al., 2020). neighborhood methods in machine learning, extending the
Sparse methods of linear algebra on GPUs have long ex- benefits to all algorithms capable of using them. We de-
isted, though they are often specialized and difficult to fine semirings more formally in subsection 2.2 but use the
adapt to new distance measures. This stems from having more general description above to navigate the benefits and
to account for various hardware and application-specific related work in the following section.
constraints (Jeon et al., 2020; Guo et al., 2020; Gale et al.,
2020; Gray et al., 2017; Bell and Garland, 2008), and as- A common issue for large-scale sparse problems in high-
sumptions on the distribution of non-zeros in the input and performance single-instruction multiple-data (SIMD) envi-
output data (Sedaghati et al., 2015; Mattson et al., 2013). ronments, like the GPU, is load balancing in order to keep
This complexity and specialization has slowed the adop- the processing units constantly moving forward. As we will
tion for sparse data and operations in general purpose tools show in Section 3.1, the imbalanced load and resource re-
like PyTorch and Tensorflow. quirements for a simple and straightforward naive semir-
ing implementation, capable of computing distances like
To develop a more general code base that supports good Manhattan, suffers from large thread divergences within
performance and flexibility for new distance measures on warps, highly uncoalesced global memory accesses, and
sparse data, we develop an approach leveraging Semirings. resource requirements which are unrealistic in many real-
1
NVIDIA 2 University of Maryland, Baltimore County 3 Booz world datasets.
Allen Hamilton. Correspondence to: Corey J. Nolet <cjno-
[email protected]>.
In order to integrate into an end-to-end data science or
scientific computing workflow, such as in the PyData or
Proceedings of the 5 th MLSys Conference, Santa Clara, CA, RAPIDS (Raschka et al., 2020) ecosystems, an efficient
USA, 2022. Copyright 2022 by the author(s).
GPU Semiring Primitives for Sparse Neighborhood Methods

implementation of a primitive for computing pairwise dis- 2.1 Distances


tances on sparse datasets should ideally preserve as many
Sparse matrix-matrix multiplication with a standard dot
of the following characteristics as possible. In this paper,
product semiring is most performant in cases where only
we show that our implementation preserves more of the be-
the intersection is needed between pairs of corresponding
low characteristics than any other known implementation.
nonzero columns in each vector. Because a standard multi-
plication between two terms has an identity of 1 and mul-
1. Maintain uniformity of intra-warp instruction process- tiplicative annihilation (e.g. ai ∗ 0 = 0), the dot product
ing. semiring between two vectors can be computed efficiently
by iterating over the nonzero columns of one vector and
2. Coalesce both reads from and writes to global mem- only computing the product of the corresponding nonzero
ory. columns of the other vector. Many distances can make use
of this property, in table 1 we dervice the semi-ring annihi-
3. Process data inputs without transposition or copying. lators and expansions (as needed) for 15 distances.

4. Use as little memory as necessary. For a distance to define a metric space, it must follow
four properties- implication (d(a, b) = 0 =⇒ a = b),
5. Enable semirings in addition to the simple dot prod- positivity (d(a, b) >= 0), symmetry (d(a, b) = d(b, a)),
uct. and the triangle inequality (d(a, c) ≤ d(a, b) + d(b, c)).
Several metrics, including Chebyshev, Manhattan, and Eu-
clidean, are derived from the generalized Minkowski for-
P 1/p
k p
2 S EMIRINGS AND PAIRWISE D ISTANCES mula i |ai − b i | where p defines a degree. The
absolute value in this equation defines a commutative
We formalize the concepts of semirings and distance mea- semiring which requires commutativity in the difference
sures in this section and describe building blocks required of each vector dimension. Euclidean distance is equiva-
to implement several popular distance measures, often en- Pk
lent to Minkowski with a degree of 2 (( i |ai − bi |2 )1/2 ).
countered in machine learning applications, into the semir-
Because the square of a number is always positive, this
ing framework.
equation can be expanded to (a − b)p for all even degrees
In machine learning applications, a distance measure is of- and still preserve the absolute value, such as (a − b)2 =
ten performed on two row matrices containing data samples a2 − 2ha, bi + b2 in the case of Euclidean distance. While
with columns that represent some number of observations, numerical instabilities can arise from cancellations in these
or features. In this paper, we will refer to these two ma- expanded equations, we will show in section 2.2 that the
trices as A and B in upper-case where A ∈ Rm×k , and expanded form is often preferred in sparse algebras, when
B ∈ Rn×k and a single vector as a and b in lowercase distances can make use of it, because it requires less com-
where a ∈ Rk or b ∈ Rk . As we show in this section, putations than the exhaustive evaluation over the nonzeros
the core of computing pairwise distances between A and of k. By example, the distances which don’t have an ex-
B is a matrix multiplication AB > in a topological space panded form, such as Manhattan (Minkowski with degree
equipped with an inner product semiring that defines dis- 1) and Chebyshev (Minkowski with degree max) distance,
tances between vectors. When this inner product is defined are often non-annihilating (e.g. x∗0 = x) and require com-
to be the dot product semiring, the topological space de- putation over the full union of nonzero columns from both
fines the standard matrix multiply but we can capture many vectors in order to preserve commutativity.
other core distances in machine learning applications by
simply redefining the inner product semiring. 2.2 Semirings
While some distance measures can make use of the sim- A monoid is a semigroup containing an associative bi-
ple dot product semiring from matrix-matrix multiplication nary relation, such as addition (⊕), and an identity el-
routines, we show that a more comprehensive package for ement (id⊕ ). A semiring (Ratti and Lin, 1971), de-
computing pairwise distances requires more flexibility in noted (S, R, {⊕, id⊕ }, {⊗, id⊗ }), is a tuple endowed with
terms of the arithmetic operations supported. Further, the a domain along with additive (⊕) and multiplicative (⊗)
explicit transposition of B which is required in routines monoids where
such as the cuSPARSE csrgemm() requires a full copy of
B, since no elements can be shared between the original 1. ⊕ is commutative, distributive, and has an identity el-
and transposed versions in the CSR data format. This has ement 0
a negative impact on scalability in memory-constrained en-
vironments such as GPUs. 2. ⊗ distributes over ⊕
GPU Semiring Primitives for Sparse Neighborhood Methods

Table 1: Common distances and their semirings. While all distances can be computed with the NAMM (where id⊗ = 0),
the distances in this table which require it have their ⊗ listed. The expansion function and any potential norms are provided
for distances that can be computed in the more efficient expanded form.

Distance Formula NAMM Norm Expansion


Pk
(xi −x̄)(yi −ȳ) khx·yi−kxkkyk
Correlation 1 − √Pk i=0
2 √P 2 L1 ,L2 1− √
i=0 xi −x̄2 2
i=0 yi −ȳ
2 (kkxk2 −kxk2 )(kkyk2 −kyk2 )
Pk
xi yi hx·yi
Cosine √Pk i=0 2
√ Pk 2
L2 1− kxk22 kyk22
i=0 xi i=0 yi
Pk
2| i=0 xi yi | 2hx·yi
Dice-Sorensen ( k
P 2
Pk 2 L0 |x|2 +|y|2
i=0 x) +( i=0 y)
Pk
Dot Product i=0 xi yi hx · yi
qP
k
Euclidean i=0 |xi − yi |2 L2 kxk22 − 2hx · yi + kyk22
Pk |xi −yi | |x−y|
Canberra i=0 |xi |+|yi | { |x|+|y| , 0}
Pk
Chebyshev i=0 max(xi − yi ) {max(x − y), 0}
Pk
i=0 xi 6=yi
Hamming k {x 6= y, 0}
√ q√
√ 2 √
qP
k
√1
Hellinger 2 i=0 ( xi − yi ) 1− h x · yi
Pk
x y hx·yi
Jaccard Pi=0 i2 i Pk L0 1−
( i=0 xi + k
Pk 2
i=0 yi − i=0 xi yi (kxk+kyk−hx·yi)
xi yi
q Pk
i=0 xi log µ +yi log µ
Jensen-Shannon 2
i i
{x log x
µ + y log µy , 0}
Pk
KL-Divergence i=0 xi log( xyii ) hx · log xy i
Pk
Manhattan i=0 |xi − yi | {|x − y|, 0}
Pk
Minkowski ( i=0 |xi − yi |p )1/p {|x − y|p , 0}
k− 2i=0 xi yi
P
k−hx·yi
Russel-Rao k k

Some formal definitions of semirings require that id⊗ = 1. Table 1 uses semirings to construct several commonly used
Given two sparse vectors a, b ∈ Rk , a semiring with distances common in machine learning and data science ap-
(S, R, {⊕, 0}, {⊗, 1}) and annihilator⊗ = 0 has the ef- plications. When an expanded form is possible, an expan-
fect of only requiring ⊗ be computed on columns that sion function can be performed as an element-wise oper-
are both nonzero (e.g. nonzeros(a) ∩ nonzeros(b)). ation on a simple pairwise dot product semiring with ar-
These rules are often relaxed in practice, for example in rays of row-vector norms. While most of the expanded
tropical semirings in Equation 1, which can solve dy- form distances can directly use the dot product semiring,
namic programming problems such as the Viterbi algo- KL-divergence directly replaces the ⊗ with ai log(ai /bi )
rithm. An annihilator is an input that will always cause and makes no further assumption of symmetry. A NAMM
a monoid to evaluate to 0 and the multiplicative annihilator is required for all unexpanded distance measures where
(annihilator⊗ ) is often assumed to be id⊕ . A monoid is id⊗ = 0 and special care must be taken to ensure it is
non-annihilating when it does not have a defined annihila- applied to the full union of the non-zero columns of cor-
tor. When an expanded form is not possible or efficient, ⊗ responding elements from each pair of vectors.
also must be commutative in metric spaces, and thus must
As mentioned in the previous section, the Euclidean dis-
be non-annihilating and id⊗ = 0. We refer to this monoid
tance can be expanded to kAk − 2hAB > i + kBk. This
as a non-annihilating multiplicative monoid (NAMM).
equation can be decomposed into the sum of individual L2
norms, a matrix product, and an element-wise expansion
(S, R ∪ {+∞}, {min, +∞}, {+, 0}) (1) function executed in parallel over the individual dot prod-
GPU Semiring Primitives for Sparse Neighborhood Methods

ucts from the matrix product to combine the parts into a Existing semiring implementations currently require that
single scalar distance. Given vectors Ai , Bj , the expan- the id⊕ be used as annihilator⊗ . For example, the Graph-
sion function for Euclidean distance can be derived by dis- BLAS specification enables the re-interpretation of the ze-
tributing their squared difference over the exponent to pro- roth element, but this is necessary to define the identity of
duce (Ai − Bi ) × (Ai − Bi ) and further expanding it to the ⊕ monoid.
kAki + 2hAi , Bj i − kBkj .
The annihilator⊗ and id⊗ determine the number of times 3 GPU-ACCELERATED S EMIRINGS
the ⊗ monoid must be applied during the computation of
In this section, we briefly introduce GPU architecture be-
pairwise distances. When annihilator⊗ = id⊕ , then
fore discussing some naive designs and the inefficiencies
⊗(ai , 0) = 0 and ⊗(0, bi ) = 0 so ⊗ can be applied only to
that led to the construction of our final design. Our goal
the intersection of columns. When annihilator⊗ is un-
was to preserve as many of the ideal design characteristics
defined and id⊗ = 0, then ⊗ must be applied exhaus-
from section 5 as possible but we found a need to accept
tively over the union of columns because ⊗(ai , 0) = ai
trade offs during implementation.
and ⊗(0, bi ) = bi .
A union between two sets can be decomposed into an in- 3.1 GPU Architecture
tersection between the two sets, along with the union of
the symmetric differences between them. These are shown The largest GPUs today contain hundreds of hardware pro-
in Equation 3, where a complement is denoted with a bar. cessing cores called streaming multiprocessors (SM) which
The nonzero columns of two sparse vectors can be used as execute groups of threads called warps. Each warp can
sets a and b in this equation and the sparse matrix multi- process a single instruction at a time in parallel using a
ply with an ordinary dot product only requires the applica- paradigm called single-instruction multiple data (SIMD).
tion of product() across a ∩ b. The NAMM, however, re- It’s important that threads within a warp minimize condi-
quires the application of the product() across the full union tional branching that will cause the threads to wait for each
of nonzero columns a ∪ b. branch to complete before proceeding. This is called thread
divergence, and can severely limit effective parallel execu-
tion. On the Volta and Ampere architectures, each SM can
track the progress of up to 64 warps concurrently (Tesla,
2018), and rapidly switch between them to fully utilize the
a∩b a∩b SM. Each SM has a set of registers available which allows
a∩b warps to perform collective operations, such as reductions.
Warps can be grouped into blocks and a small amount of
(2) memory can be shared across the threads and warps.
Global, or device, memory can be accessed by all of the
SMs in the GPU. Accesses to contiguous device mem-
a ∪ b = {a ∩ b} ∪ {a ∩ b} ∪ {a ∩ b} (3) ory locations within a warp can be coalesced into a single
blocked transaction so long as the accesses are performed
A common approach to implementing sparse matrix multi- in the same operation. In SIMD architectures, uniform pat-
ply is to iterate over the nonzeros from b in order to lookup terns can be critical to performance unless latencies from
and compute the intersection with the nonzeros from a. non-uniform processing, such as uncoalesced memory ac-
This design will also implicitly compute the symmetric dif- cesses, can be hidden with increased parallelism.
ference between either of the two sets of nonzeros, a ∩ b or Registers provide the fastest storage, and it’s generally
a ∩ b, depending on which vector is chosen in the iteration preferable to perform reductions and arithmetic as intra-
over nonzeros. To compute a full union, the remaining set warp collective operations where possible. Intra-block
difference can be computed in a second pass of the matrix shared memory is also generally preferred over global
multiply by looping over the nonzeros from the vector that memory when a problem can be made small enough to
remains. We will show in subsection 3.1 that we accom- benefit. However, contiguous locations of shared memory
plish this efficiently in our implementation in two passes- are partitioned across contiguous banks and any accesses to
one pass to compute the first two terms and another pass to different addresses in the same bank by the same warp will
compute the third term. Distances which can be computed create a bank conflict and be serialized within the warp,
with an expansion function only need the first pass while causing the threads to diverge.
distances which require the NAMM need both. Please re-
fer to subsection A.1 for an example of using semirings to
compute the Manhattan distance using the NAMM.
GPU Semiring Primitives for Sparse Neighborhood Methods

3.2 Naive Semi-Ring Full-Union CSR Designs creased the potential for coalesced global memory accesses
and created large thread divergences. Further, the exhaus-
3.2.1 Expand-Sort-Contract
tive nature of this design, while it will guarantee the ⊗
Initial implementations tried to minimize the memory foot- monoid is computed on the full union of nonzero columns,
print as much as possible by directly computing the output will end up performing many unnecessary computations
distances from the input CSR format. The CSR format re- when distances can be computed with the rules of a sim-
quires columns to be sorted with respect to row and we ple dot product semiring.
initially attempted to use a modified variant of the expand-
sort-contract (Dalton et al., 2015) pattern on the nonzero Algorithm 2 Semring on CSR inputs. Each thread com-
columns from each pair of row vectors, a, b ∈ Rk , con- putes a single dot product.
catenating the vectors together, sorting them, and applying Input: Ai , Bj , product op, reduce op
the ⊗ monoid on pairs of duplicate columns to contract Result: Cij = d(Ai , Bj )
the sorted array and invoking ⊗ with the identity for all startA = indptrAi , endA = indptrAi+1
other columns. At the row-level of the output matrix, no startB = indptrBj , endB = indptrBj+1
computations would be able to be reused by subsequent icolA = startA, icolB = startB
pairs of vectors so we implemented this pattern on the while icolA < endA —— icolB < endB do
GPU and mapped the nonzero columns and values for each colA = icolA < endA ? indicesicolA : MAX INT
row-vector pair to individual thread-blocks, expanding both colB = i colB < endB ? indicesicolB : MAX INT
vectors by concatenating them in shared memory, perform- valueA = 0, valueB = 0 if colA ≤ colB then
ing a sort-by-key, and compressing them in parallel. We at- valueA = valuesAicolA ++
tempted several efficient sorting algorithms on the GPU in- end
cluding the popular radix sort and bitonic sorting networks if colB ≤ colA then
and, while the use of shared memory in the sort step en- valueB = valuesBicolB ++
abled coalesced reads from global memory for the nonzero end
columns and values, the sorting step dominated the perfor- v = product op(valueA, valueB)
mance of the algorithm. Another downside with this partic- Cij = reduce op(Cij , v)
ular design is that both vectors need to fit in shared mem- end
ory, requiring space for 2 ∗ (nonzeros(a) + nonzeros(b))
elements in order to fit both the columns and correspond-
ing values at the same time. In addition to the need for We found marginal gains in performance by coalescing the
n ∗ m blocks to be scheduled, the shared memory require- reads of the vectors from A into shared memory and shar-
ment became a severe limit to scale, which was further ing it across all threads of each thread-block. We attempted
compounded by the shared memory size limiting the num- to load balance this algorithm by maintaining arrays to look
ber of blocks that could be scheduled concurrently on each up row information for each column but this increased warp
SM. divergence from the overly complicated conditionals re-
quired to maintain state across threads and warp bound-
Algorithm 1 Semiring on CSR inputs using expand-sort- aries.
contract pattern, parallelized across threads in each block.
Input: Ai , Bj , product op, reduce op 3.3 Load Balanced Hybrid CSR+COO
Result: Cij = d(Ai , Bj ) While the CSR format enables algorithms to be parallelized
smem[0..nnzai−1 ] = Ai over threads for individual rows, we found that using a row
smem[nnzai ..nnzbj−1 ] = Bj index array in coordinate format (COO) for B enabled load
sort(smem) balancing, coalescing the loads from each vector from A
Cij = reduce(smem, product op, reduce op) into shared memory, once per block, and threads of each
block parallelizing the application of the semiring over
nonzero elements of B. Since the columns in B are assumed
3.2.2 Iterating Sorted Nonzeros
to be sorted by their respective row, we use a segmented re-
Since columns will often be sorted within their respective duction by key within each warp, bounding the number of
rows in the CSR format, we removed the sort step from potential writes to global memory by the number of active
algorithm 1 by exhaustively iterating over the non-zeros warps over each row of B. Our design extends the work
of each O(m ∗ n) pair of vectors in parallel, one pair per of the COO sparse-matrix dense-vector multiplication de-
thread, as shown in algorithm 2. We found that even when scribed in (Anzt et al., 2020) by storing the vectors from A
the neighboring threads processed rows of similar degree, in dense form in shared memory only when the number of
the differing distributions of nonzeros within each row de- columns are small enough. Our extension enables sparse-
GPU Semiring Primitives for Sparse Neighborhood Methods

matrix sparse-vector multiplication by storing the vectors a max dimensionality of 40K with single-precision. Cou-
in sparse form when their degrees are small enough. We pling the amount of shared memory to the dimensionality
achieve full occupancy on the Volta architecture by trading creates a problem for occupancy as it approaches capacity.
off the size of the L1 cache to double the amount of shared Both of these architectures limit the maximum block sizes
memory per GPU, allowing each SM to use 96KiB. Since to 1024 threads and max concurrent warps per SM to 64
our design uses less than 32 registers, a block size of 32 so anything over 48KB of shared memory per block is go-
warps allows two blocks, the full 64 warps, to be scheduled ing to decrease occupancy. For this reason, the maximum
concurrently on each SM. dimensionality of dense vectors that can be processed with
full occupancy is actually 12K and 20K, respectively.
Algorithm 3 Load-balanced Hybrid CSR+COO SPMV.
This boundary becomes too small for many sparse datasets
Input: Ai , B, product op, reduce op which would instead benefit from coupling the shared
Result: Cij = d(Ai , Bj ) memory size to individual row degrees. Inspired by other
read Ai into shared memory sparse matrix multiplication implementations on the GPU
cur row=rowidx[ind] (Anh et al., 2016; Kunchum, 2017; Liu and Vinter, 2014;
ind = idx of first elem to be processed by this thread Nagasaka et al., 2017), we enhanced the vector insertion
c = product op(A[ind], x[colidx[ind]]) and lookup patterns of the COO SPMV design outlined in
for i ← 1 to nz per chunk ; by warp size do (Anzt et al., 2020) by building a hash table to store these
next row = cur row + warp size
columns in shared memory. Unlike many other hash ta-
if next row != cur row —— is final iter ? then
v = segmented scan(cur row, c, product op) ble implementations on the GPU (Alcantara et al., 2009;
if is segment leader ? then Ashkiani et al., 2018; Alcantara et al., 2012; Pan and
atomic reduce(v, reduce op) Manocha, 2011; Cassee and Wijs, 2017), our implementa-
end tion builds an independent hash table per thread-block and
c=0 so many other designs and concurrency patterns that opti-
end mize the key distribution and collision-resolution strategies
cur row = next row for the GPU are not efficient or cannot be easily ported for
ind += warp size our use-case. For this reason, we used a simple hash table
c = product op(A[ind], x[colidx[ind]]) with a Murmur hash function and linear probing and leave
end the investigation of a better and more optimized design to
future work.

3.3.1 Two-pass execution Hash tables have the best performance when the number of
entries is less than 50% of the capacity. As the hash table
As described in subsection 2.2, a single execution of this size grows beyond 50% capacity, the collision resolution
strategy will compute the intersection and symmetric dif- cycles of linear probing, which are non-uniform, increase
ference a ∩ b between nonzero columns from each vector the serialization of instructions from warp divergences and
a, and b so long as ⊗ is applied to all nonzero columns also increase the number of transactions from global mem-
of b. While only a single pass covers distance measures ory reads of B since they can no longer be coalesced. The
which require only a column intersection (e.g. dot product hash table strategy decreases the amount of shared memory
semiring (S, R, {+, 0}, {∗, 1})), a second pass can com- available, often by a factor of 2, because the nonzeros need
pute the remaining symmetric difference required for the to be stored together as key/value pairs to avoid an addi-
full union between non-zero columns by commuting A and tional costly lookup to global memory, a side-effect which
B and skipping the application of of id⊗ in B for the sec- would only further increase serialized execution from di-
ond pass. verging threads. Our hash table strategy allows for a max
degree of 3K on Volta architectures and 5K on Ampere.
3.3.2 Sparsifying the Vector in Shared Memory
Another unfortunate side-effect from the linear-probing
While we found storing the vectors from A in dense form collision strategy of our hash table is the increase in lookup
in shared memory to have the highest throughput rate and times for columns even for elements that aren’t in the ta-
least amount of thread divergence within each warp, sparse ble. For example, as the hash table approaches capac-
datasets are generally assumed to have high dimensional- ity, the increase in collisions can cause a lookup to probe
ity and the limited amount of shared memory that can be through multiple candidates, sometimes hundreds, before
allocated per SM bounds the size of the vectors that can finding an element doesn’t exist. Bloom filters have been
be stored in it. For example, The 96KiB limit per block used to implement fast list intersection problems for sparse
on Volta allows a max dimensionality of 23K with single- matrix multiplication problems on the GPU (Zhang et al.,
precision and the 163KiB limit per SM on Ampere allows
GPU Semiring Primitives for Sparse Neighborhood Methods

2020; 2011). As an alternative to the hash table approach, 4 E XPERIMENTS


we tried building a bloom filter in shared memory and
used a binary search to perform lookups of nonzeros in We evaluated the runtime performance characteristics and
global memory for positive hits. While we found this tech- generalization of our approach by benchmarking our semir-
nique to yield marginally better performance on the Jensen- ing strategies against several real-world sparse datasets
Shannon distance in one of our benchmarks, likely because with different shapes and degree distributions. We also
it helped hide some of the compute-bound latencies from analyze the GPU memory footprint of the cuSPARSE
the additional arithmetic, we were not able to extract a sim- csrgemm() and our load-balanced COO SPMV.
ple rule from the data shapes or sparsity patterns that would
allow us to know, before starting the computation, when it 4.1 Datasets
should be used. The datasets which we found are often used to benchmark
sparse matrix-matrix and matrix-vector implementations
3.3.3 Handling High Degree Columns on the GPU demonstrate the subtle differences in the ob-
Our hash table implementation shows reasonable perfor- jectives between using semirings for sparse neighborhood
mance up to 50% capacity. Rows with degree greater than methods and using sparse linear algebra more generally for
50% hash table capacity are partitioned uniformly by their things like graph algorithms and eigendecompositions. As
degrees into multiple blocks with subsets of the degrees an example, one such set of datasets which we found com-
that can fit into 50% hash table capacity. Using a similar monly used in papers to benchmark sparse linear algebra
logic to that of blocked sparse techniques, our partitioning implementations (Williams et al., 2007; Bell and Garland,
strategy does extra work in exchange for scale. Further, this 2008) is composed almost entirely of square connectivities
technique requires each thread perform a branching condi- graphs, and these would not provide a useful performance
tional so it can test whether each nonzero column of B is indicator for the objective of creating connectivites graphs
part of the current partition. As we show in section 4, we from bipartite graphs. For this reason, and the lack of prior
do find that this strategy can perform well on some datasets research in our objective, we establish a new baseline us-
when most of the degrees are small enough to fit in the hash ing datasets that our algorithm would be expected to en-
table. For example, we found this strategy spent a minis- counter in practice. Our baseline uses cuSPARSE for all
cule amount of time in this step on the Movielens dataset. the expanded distance measures, along with the naive CSR
full-union semiring implementation as described in section
3.4 Norms and Expansion Functions 3.2.2 for the distances which cuSPARSE does not support.
The MovieLens (Harper and Konstan, 2015) Large dataset
Distances which can be computed in their expanded forms
contains ratings given by 283k users for 194k movies. We
can use the dot product semiring directly and only require a
used a dataset of 70k cells and gene expressions for 26k
single pass through our SPSV. Computing distances in their
genes from the human cell atlas (Travaglini et al., 2020)
expanded form often requires one or more vectors of row
as an example of a single-cell RNA workflow. For nat-
norms as well as an expansion function, which uses some
ural language processing examples, we benchmarked two
arithmetic to combine the norm vectors with the individual
different datasets containing TF-IDF vectors for two dif-
dot products (refer to Table 1 for examples). Row norms
ferent use-cases. We used the NY Times Bag of Words
can be computed over CSR matrices using a row-wise re-
dataset(Newman, 2008) for an example of document simi-
duction on the GPU as each row can be mapped to a sin-
larity and n-grams generated from a list of company names
gle block or warp and the norm computed by a warp-level
from the SEC EDGAR company names database for an ex-
collective reduction. The reduction primitive necessary for
ample of string matching.
computing these row norms is already part of the Graph-
BLAS specification. Table 2: Datasets used in experiments
The actual arithmetic in each expansion function is depen-
dent upon the distance measure, however the kernel to ap- Dataset Size Density Min Deg Max Deg
ply the expansion function can be executed embarrassingly Movielens Large (283K, 194K) 0.05% 0 24K
parallel using an element-wise primitive, also part of the SEC Edgar (663K, 858K) 0.0007% 0 51
scRNA (66K, 26K) 7% 501 9.6K
GraphBLAS specification, to map each entry in the dot NY Times BoW (300K, 102K) 0.2% 0 2K
product matrix to an individual GPU thread to coalesce the
reads and writes.
4.2 Runtime Performance
To get an idea of how each supported distance performed
on data of different shapes and degree distributions, we
GPU Semiring Primitives for Sparse Neighborhood Methods

Table 3: Benchmark Results for all datasets under consideration. All times are in seconds, best result in bold. The first
italicized set of distances can all be computed as dot products, which are already highly optimized for sparse comparisons
today. This easier case we are still competitive, and sometimes faster, than the dot-product based metrics. The Non-trivial
set of distances that are not well supported by existing software are below, and our approach dominates amongst all these
metrics.

MovieLens scRNA NY Times Bag of Words SEC Edgar


Distance Baseline RAFT Baseline RAFT Baseline RAFT Baseline RAFT
Correlation 130.57 111.20 207.00 235.00 257.36 337.11 134.79 87.99
Dot Product Based

Cosine 131.39 110.01 206.00 233.00 257.73 334.86 127.63 87.96


Dice 130.52 110.94 206.00 233.00 130.35 335.49 134.36 88.19
Euclidean 131.93 111.38 206.00 233.00 258.38 336.63 134.75 87.77
Hellinger 129.79 110.82 205.00 232.00 258.22 334.80 134.11 87.83
Jaccard 130.51 110.67 206.00 233.00 258.24 336.01 134.55 87.73
Russel-Rao 130.35 109.68 206.00 232.00 257.58 332.93 134.31 87.94
Canberra 3014.34 268.11 4027.00 598.00 4164.98 819.80 505.71 102.79
Non-Trivial Metrics

Chebyshev 1621.00 336.05 3907.00 546.00 2709.30 1072.35 253.00 146.41


Hamming 1635.30 229.59 3902.00 481.00 2724.86 728.05 258.27 97.65
Jensen-Shannon 7187.27 415.12 4257.00 1052.00 10869.32 1331.37 1248.83 142.96
KL Divergence 5013.65 170.06 4117.00 409.00 7099.08 525.32 753.56 87.72
Manhattan 1632.05 227.98 3904.00 477.00 2699.91 715.78 254.69 98.05
Minkowski 1632.05 367.17 4051.00 838.00 5855.79 1161.31 646.71 129.47

benchmarked all of the supported distances for each of the


datasets, even though some of them may provide irrele-
vant geometries in practice. Benchmarks were performed
on a DGX1 containing dual 20-core Intel Xeon ES-2698
CPUs (80 total threads) at 2.20GHZ and a Volta V100
1
GPU running CUDA 11.0 for both the driver and toolkit.
Each benchmark performs a k-nearest neighbors query to
test our primitives end-to-end and allow scaling to datasets
0.5 where the dense pairwise distance matrix may not other-
wise fit in the memory of the GPU. We used the brute-force
NearestNeighbors estimator from RAPIDS cuML for the
0 GPU benchmarks since it makes direct use of our primitive.
We used Scikit-learn’s corresponding brute-force Nearest-
100 101 102 103 104 Neighbors estimator as a CPU baseline and configured it to
ny times movielens scrna sec edgar use all the available CPU cores. Each experiment trains the
NearestNeighbors estimator on the entire dataset and then
Figure 1: CDFs of Degree Distributions for the datasets queries the entire dataset, timing only the query. Compared
used in our benchmark on the interval 0-99%. We can see to the CPU, we observed an average of 28.78× speedup
that 99% of the degrees in the SEC Edgar datasets are ¡10 for the dot-product-based distances and 29.17× speedup
while 88% of the degrees for Movielens are ¡200. On av- for the distances which require the non-annihilating prod-
erage scRNA has the largest degrees with 98% of the rows uct monoid.
having degree 5k or less. The NY Times dataset has the From the strategies described in Section 3.1, we bench-
highest variance, with 99% of the rows having degree less marked our best performing approach, the Load-balanced
than 1k. Hybrid COO+CSR SPMV described in subsection 3.3, us-
ing the hash table strategy to sparsify the vector in shared
memory.
As evidenced in table 3, our implementation consistently
outperforms the CPU. We also outperform the baseline,
GPU Semiring Primitives for Sparse Neighborhood Methods

from cuml.neighbors import across different sparse datasets.


,→ NearestNeighbors
nn = NearestNeighbors().fit(X) 4.3 Memory Footprint
dists, inds = nn.kneighbors(X)
from cuml.metrics import The density of the dot product matrix that is returned
,→ pairwise_distances from the cuSPARSE csrgemm() is fully dependent upon the
dists = pairwise_distances(X, dataset. Because 2 arrays, each of size nnz, are required to
,→ metric='cosine') represent the cuSPARSE output in CSR format, a density
of 50% would require the same amount of space as the full
Figure 2: Excluding data loading and logging, all the code dense pairwise distance matrix. A density of 100% requires
needed to perform the same GPU accelerated sparse dis- 2x the amount of space as the dense pairwise distance ma-
tance calculations done in this paper are contained within trix. In addition, since the output still needs to be converted
these two snippets. Top shows k-NN search, bottom all to a dense format, this requires an additional allocation of
pairwise distance matrix construction. These are the APIs the dense pairwise distance matrix in a space of contigu-
that most would use. ous memory locations even if the cuSPARSE output was
99.9% dense. We found the density of the cuSPARSE out-
#include put to be at least 57% on average across the batches for
,→ <raft/sparse/distance/coo_spmv.cuh> Movielens, 98% for NY Times BoW and was fully dense
#include <raft/sparse/distance/operators.h> in scRNA. The SEC Edgar datasets had the highest vari-
using namespace raft::sparse::distance ance in density from batch-to-batch and were significantly
different between n-gram sizes. The unigram and bigram
distances_config_t<int, float> conf; dataset ranged from 5% to 25% output density, for exam-
ple, while trigrams ranged from 24% to 43%.
// Use conf to set input data arguments...
This provides further evidence of the subtle but impor-
balanced_coo_pairwise_generalized_spmv( tant differences between the types of data we expect
out_dists, conf, coo_rows_a, to encounter in neighborhood methods, however even
AbsDiff(), Sum(), AtomicSum());
more evident is that the matrix resulting from comput-
balanced_coo_pairwise_generalized_spmv_rev( ing the dot product semiring over the square connectivities
out_dists, conf, coo_rows_b, graphs used in other sparse matrix multiplication research
AbsDiff(), Sum(), AtomicSum()); (Williams et al., 2007; Bell and Garland, 2008) is extremely
sparse. In addition to the output memory, cuSPARSE re-
Figure 3: The C++ API can be used to construct new quired an internal temporary workspace in device memory
semirings. Dot-product-based semirings only need in- with anywhere from 300mb to 550mb of additional mem-
voke the first function while NAMMs can be constructed ory per batch while our dot product semiring required a
by invoking both. While the Python API is part of the workspace buffer of size nnz(B) per batch. Strangely, the
RAPIDS cuML project, the C++ API is provided by the size of this temporary workspace seemed almost identical
RAFT project (http://github.com/rapidsai/raft). RAFT is a even when computed on the square connectivities graphs
header only library that contains fundamental algorithms mentioned above.
and primitives for data science, graph, and machine learn-
ing applications. 5 R ELATED W ORK
5.1 Sparse matrix multiplication
cuSPARSE, for the distances that it supports in two out The task of efficient and performant sparse matrix multipli-
of the four datasets. In addition to maintaining compara- cation is an active area of research, with implementations
ble performance in the remaining two datasets, our design spanning the spectrum of scientific computing. In high
is also flexible enough to provide distances which require performance computing environments, these solutions are
the NAMM outlined in subsection 2.2 while using less designed around both hardware and software constraints
memory. As mentioned in section 5, it is not uncommon (Jeon et al., 2020; Guo et al., 2020; Gale et al., 2020;
to see different sparse implementations performing better Gray et al., 2017; Bell and Garland, 2008), often mak-
on some datasets than others (Sedaghati et al., 2015) and ing use of specialized hardware capabilities and optimiz-
the flexibility of our implementation, as well as our well- ing for specific sparsity patterns, an unfortunate side-effect
defined set of rules for supporting a wide array of distances, that can reduce their potential for reuse. What compli-
will allow us to continue optimizing our execution strate- cates this further are the number of different optimized
gies to support patterns that we find frequently occurring
GPU Semiring Primitives for Sparse Neighborhood Methods

variants of sparse matrix multiplication available in open paradigm in research. Semirings have also been used for
source libraries, each using different concurrency patterns some time to implement more modern machine learning
and available memory to provide speedups based on either methods (Belle and De Raedt, 2020), with the more re-
supported sparse formats or the assumed density of either cent invention of semiring programming attempting to fur-
the inputs or the outputs (Sedaghati et al., 2015; Mattson ther consolidate these concepts under a single framework
et al., 2013). We have compareda against the seminal cuS- and set of symbolic routines. Semirings can be a useful
PARSE (Naumov et al., 2010) that is highly optimized for building-block for linear models (Jananthan et al., 2017),
sparse dot product based k-nearest neighbors (Zhou, 2018), probabilistic models, such as Bayesian networks (Wachter
and found our approach is faster or competitive in all cases, et al., 2007) and the use of Tropical semiring in Markov
but is not limited to dot product based measures. networks (Ilic, 2011). The Tropical semiring is also be-
ing used to implement sparse non-negative matrix factor-
Better able to make use of critical optimizations inherent
izations (Omanović et al., 2020).
in their dense counterparts, block compressed sparse for-
mats have become widely popular for representing sparse
data (Zachariadis et al., 2020), in part because they can 5.3 Neighborhood Methods
improve load balancing by grouping nonzeros into fixed- Our work is positioned to have an impact on numerous
sized tiles and scheduling the tiles more uniformly across down-stream tasks that often depend on sparse nearest-
the processing cores. Enabling sparse formats to be pro- neighbor retrieval. This includes classic Information
cessed more similar to their dense counterparts allows the Retrieval problems where such methods are still highly
use of specialized hardware optimizations such as tensor competitive or preferred (Mitra and Craswell, 2018; Li,
cores. While we do hope to someday support block-sparse 2016; Soboroff, 2018; Voorhees et al., 2017; Bouthillier
formats, it is most often assumed that users will be calling et al., 2021). Dimensional reduction approaches like t-
code that invokes our primitive with matrices in the stan- SNE (Maaten and Hinton, 2008) and UMAP (McInnes
dard compressed sparse row (CSR) format (Williams et al., et al., 2018) that lack sparse input support on GPUs with-
2007) and so a conversion would be necessary in order to out our method (Nolet et al., 2020). ML models based on
use a blocked format. the kernel trick, such as Guassian Process (Lawrence and
Urtasun, 2009) also stand to benefit. The breadth and fre-
5.2 Semirings quency of nearest neighbor methods on high dimensional
data make our work relevant to an especially wide class of
Consolidating seemingly disparate concepts into a
practioners.
lightweight, terse, and abstract set of building-blocks can
increase flexibility and promote reuse (Izbicki, 2013).
This especially benefits fields which require non-trivial 6 C ONCLUSION
and highly-optimized implementations where the design
In this paper, we demonstrated a flexible sparse pairwise
complexities and costs are high, the basic linear-algebra
distance primitive that is able to collectively support, to
subroutines (BLAS) API and GPU-accelerated computing
our knowledge, a larger assortment of widely-used dis-
being common examples. Semirings provide the efficiency
tance measures than any other package on the GPU. We
and flexibility to enable algorithms in which the represen-
consolidated the design of these distance measures using a
tation and assumptions of the typical BLAS API for dense
couple minor enhancements to the rules of classical semir-
linear algebra comes up short (Mattson et al., 2013). NIST
ings, which are traditionally used to implement graph algo-
published a sparse BLAS standard back in 2001 (Duff
rithms, and we discussed the impact of our primitive as a
et al., 2002) and cuSPARSE is one of the most sophisti-
core building block of many important neighborhood meth-
cated implementations of the sparse BLAS standard that
ods for machine learning and data mining. Finally, we pro-
has been built on the GPU, however as mentioned above,
vided a novel implementation as an example of how these
its multiplication routines fix the inner product to the dot
semirings can be implemented on the GPU with a lower
product. GraphBLAS (Davis, 2018) provides a set of
memory footprint and performance comparable to, or bet-
primitives, along with an API, for using semiring algebras
ter than, the current state of the art.
to implement graph algorithms. The GraphBLAST (Yang
et al., 2019) and SuiteSparse (Davis, 2019) libraries
provide implementations of the GraphBLAS that also R EFERENCES
include GPU-accelerated primitives. Dan A Alcantara, Andrei Sharf, Fatemeh Abbasinejad,
The use of semirings in graph theory dates back to the early Shubhabrata Sengupta, Michael Mitzenmacher, John D
1970s (Ratti and Lin, 1971), when ”good old-fashioned Owens, and Nina Amenta. 2009. Real-time parallel
artificial intelligence”, or Symbolic AI, was a dominant hashing on the GPU. In ACM SIGGRAPH Asia 2009
GPU Semiring Primitives for Sparse Neighborhood Methods

papers. 1–9. Steven Dalton, Luke Olson, and Nathan Bell. 2015. Opti-
mizing sparse matrix—matrix multiplication for the gpu.
Dan A Alcantara, Vasily Volkov, Shubhabrata Sengupta, ACM Transactions on Mathematical Software (TOMS)
Michael Mitzenmacher, John D Owens, and Nina 41, 4 (2015), 1–20.
Amenta. 2012. Building an efficient hash table on the
GPU. In GPU Computing Gems Jade Edition. Elsevier, Timothy A Davis. 2018. Algorithm 9xx: SuiteS-
39–53. parse:GraphBLAS: graph algorithms in the language of
sparse linear algebra. Technical Report. 24 pages.
Daniel Alpay. 2012. Reproducing kernel spaces and appli-
cations. Vol. 143. Birkhäuser. Timothy A Davis. 2019. Algorithm 1000: SuiteSparse:
GraphBLAS: Graph algorithms in the language of sparse
Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. linear algebra. ACM Transactions on Mathematical Soft-
2016. Balanced Hashing and Efficient GPU Sparse Gen- ware (TOMS) 45, 4 (2019), 1–25.
eral Matrix-Matrix Multiplication.(2016), 1–12. Google
Scholar Google Scholar Digital Library Digital Library Iain S Duff, Michael A Heroux, and Roldan Pozo. 2002.
(2016). An overview of the sparse basic linear algebra subpro-
grams: The new standard from the BLAS technical
Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Don- forum. ACM Transactions on Mathematical Software
garra, Goran Flegar, Pratik Nayak, Stanimire Tomov, (TOMS) 28, 2 (2002), 239–267.
Yuhsiang M. Tsai, and Weichung Wang. 2020. Load-
Balancing Sparse Matrix Vector Product Kernels on Kento Emoto, Sebastian Fischer, and Zhenjiang Hu. 2012.
GPUs. ACM Trans. Parallel Comput. 7, 1, Article 2 Filter-embedding semiring fusion for programming with
(March 2020), 26 pages. https://doi.org/10. MapReduce. Formal Aspects of Computing 24, 4 (2012),
1145/3380930 623–645.

Saman Ashkiani, Martin Farach-Colton, and John D Alexandre Fender. 2017. Parallel solutions for large-scale
Owens. 2018. A dynamic hash table for the GPU. In eigenvalue problems arising in graph analytics. Ph.D.
2018 IEEE International Parallel and Distributed Pro- Dissertation. Université Paris-Saclay.
cessing Symposium (IPDPS). IEEE, 419–429.
Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen.
Nathan Bell and Michael Garland. 2008. Efficient sparse 2020. Sparse GPU kernels for deep learning. arXiv
matrix-vector multiplication on CUDA. Technical Re- preprint arXiv:2006.10901 (2020).
port. Citeseer.
Brandon Gildemaster, Prerana Ghalsasi, and Sanjay
Vaishak Belle and Luc De Raedt. 2020. Semiring pro- Rajopadhye. 2020. A Tropical Semiring Multiple
gramming: A semantic framework for generalized sum Matrix-Product Library on GPUs: (not just) a step
product problems. International Journal of Approximate towards RNA-RNA Interaction Computations. In 2020
Reasoning 126 (2020), 181–201. IEEE International Parallel and Distributed Pro-
cessing Symposium Workshops (IPDPSW). 160–169.
Alain Berlinet and Christine Thomas-Agnan. 2011. Repro- https://doi.org/10.1109/IPDPSW50202.
ducing kernel Hilbert spaces in probability and statis- 2020.00037
tics. Springer Science & Business Media.
Scott Gray, Alec Radford, and Diederik P Kingma. 2017.
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, As- Gpu kernels for block-sparse weights. arXiv preprint
sya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz arXiv:1711.09224 3 (2017).
Sepah, Edward Raff, Kanika Madan, Vikram Voleti,
Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu,
Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, and Pas- Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li,
cal Vincent. 2021. Accounting for Variance in Machine Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse
Learning Benchmarks. In Machine Learning and Sys- dnn models without hardware-support via tile-wise spar-
tems (MLSys). arXiv:2103.03098 http://arxiv. sity. arXiv preprint arXiv:2008.13006 (2020).
org/abs/2103.03098
F Maxwell Harper and Joseph A Konstan. 2015. The
Nathan Cassee and Anton Wijs. 2017. Analysing the per- movielens datasets: History and context. Acm transac-
formance of GPU hash tables for state space exploration. tions on interactive intelligent systems (tiis) 5, 4 (2015),
arXiv preprint arXiv:1712.09494 (2017). 1–19.
GPU Semiring Primitives for Sparse Neighborhood Methods

Velimir M Ilic. 2011. Entropy semiring forward-backward Bhaskar Mitra and Nick Craswell. 2018. An Introduc-
algorithm for HMM entropy computation. arXiv tion to Neural Information Retrieval t. Foundations and
preprint arXiv:1108.0347 (2011). Trends® in Information Retrieval 13, 1 (2018), 1–126.
https://doi.org/10.1561/1500000061
Michael Izbicki. 2013. Algebraic classifiers: a generic ap-
proach to fast cross-validation, online training, and par- Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka.
allel training. In International Conference on Machine 2017. High-performance and memory-saving sparse

A APPENDIX

A.1 Deriving Distances With Semirings

All of the distances in this paper can be categorized into one of two groups: those which can be computed using the dot product and vector norms, and those which cannot. The non-annihilating multiplicative monoid (NAMM) is used for the latter group, which requires exhaustive computation over the union of non-zeros from each input. The following example derives the semiring for the Manhattan distance, demonstrating why the dot product cannot be used.

Let vector $a = [1, 0, 1]$ and $b = [0, 1, 0]$.

We can compute the L1 distance between these two vectors by taking the sum of the absolute values of their differences:

$\sum(|a - b|) =$  (4)
$\sum([|1 - 0|, |0 - 1|, |1 - 0|]) =$  (5)
$\sum([1, 1, 1]) = 3$  (6)

Semiring standards such as GraphBLAS, for example, often make use of the convention that the multiplicative annihilator is equal to the additive identity. If we follow this convention in our example, any term with a zero on either side evaluates to 0 and we end up with the following result:

$\sum(|a - b|) =$  (7)
$\sum([|1 - 0|, |0 - 1|, |1 - 0|]) =$  (8)
$\sum([0, 0, 0]) = 0$  (9)

What we need here instead is for the multiplicative identity to be non-annihilating, such that it equals the additive identity, so that the difference in our example behaves like an XOR: it evaluates to the non-zero side when either side is zero and evaluates to 0 only when both sides have the same value. For example,

$|1 - 0| = 1 \quad |0 - 1| = 1 \quad |0 - 0| = 0 \quad |1 - 1| = 0$
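To make the distinction concrete, the following Python sketch (an illustration only, not the paper's CUDA implementation) applies both conventions to the example vectors above: the annihilating product collapses every term to zero, while the non-annihilating monoid recovers the correct L1 distance of 3.

def l1_term_annihilating(x, y):
    # Conventional semiring behavior: the additive identity (0) annihilates
    # multiplicatively, so any term with a zero on either side is dropped.
    return 0.0 if x == 0.0 or y == 0.0 else abs(x - y)

def l1_term_namm(x, y):
    # Non-annihilating multiplicative monoid (NAMM): a zero on either side
    # simply passes the other operand through, like an XOR over the sparsity pattern.
    return abs(x - y)

a = [1.0, 0.0, 1.0]
b = [0.0, 1.0, 0.0]

print(sum(l1_term_annihilating(x, y) for x, y in zip(a, b)))  # 0.0, matching (7)-(9)
print(sum(l1_term_namm(x, y) for x, y in zip(a, b)))          # 3.0, matching (4)-(6)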
Now let's perform a sparse-matrix sparse-vector multiply (spmv) where $A = [[1, 0, 1]]$ and $b = [0, 1, 1]$.

We can parallelize this by evaluating the semiring of b over each row vector of A independently, iterating through the nonzero columns from each vector in A and fetching or looking up the corresponding column from b (if it is nonzero). With the standard dot-product semiring, which annihilates multiplicatively over the additive identity, we only need to consider the intersection of columns where both sides are nonzero: column 3 in this example.

Removing the multiplicative annihilator results in the need to consider the union of non-zero columns, and so all columns need to be considered in this example. However, if only the nonzero columns in the vectors of A are visited, the nonzero columns in b which are zero in A will be missed.

Recall that we can decompose a full union across all nonzero columns into the symmetric difference between the nonzero columns of A and b (that is, all columns which are nonzero in A and zero in b), the intersection between the nonzero columns of A and b (where both sides are nonzero), and the symmetric difference between the nonzero columns of b and A (that is, all columns which are nonzero in b and zero in A).

A spmv will often compute the intersection between the nonzero columns of A and b, and the symmetric difference between the nonzero columns of A and b will be computed only as a side-effect. In order to compute the union between the nonzero columns of A and b, the symmetric difference between the nonzero columns of b and A still needs to be computed. We compute this with a second pass of the spmv by flipping the inputs to the spmv and ignoring the intersecting columns in the second pass.
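The following Python sketch is a simplified CPU analogue of this two-pass strategy (not the GPU kernel itself), with sparse vectors represented as dictionaries from column index to value: the first pass covers the intersection and the A-minus-b symmetric difference, and the flipped second pass covers the remaining b-minus-A columns.

# Sparse vectors as {column: value} dictionaries; missing keys are implicit zeros.
a_row = {0: 1.0, 2: 1.0}   # nonzero columns of the single row of A = [[1, 0, 1]]
b_vec = {1: 1.0, 2: 1.0}   # nonzero columns of b = [0, 1, 1]

def l1_two_pass(x, y):
    total = 0.0
    # Pass 1: iterate the nonzeros of x, covering the intersection (both nonzero)
    # and, as a side-effect, the symmetric difference of x with respect to y.
    for col, val in x.items():
        total += abs(val - y.get(col, 0.0))
    # Pass 2: flip the inputs and iterate the nonzeros of y, skipping the columns
    # already handled in pass 1 (the intersection); this covers y minus x.
    for col, val in y.items():
        if col not in x:
            total += abs(val)  # x is implicitly zero in these columns
    return total

print(l1_two_pass(a_row, b_vec))  # 2.0 = |1-0| + |0-1| + |1-1|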
B ARTIFACT APPENDIX

B.2 Artifact check-list (meta-information)

• Algorithm: sparse matrix-vector multiplication, pairwise distance
• Program: RAPIDS, cuML, RAFT
• Compilation: CMake, Python
• Binary: source build
• Data set: MovieLens, NY Times bag-of-words, SEC EDGAR, scRNA
• Run-time environment: Linux, 64-bit, x86_64
• Hardware: GPU, DGX1, V100, Volta
• Metrics: End-to-end runtime performance
• Output: Runtimes printed to screen
• Experiments: Scripts provided to load data and execute the algorithm on the datasets with various distance measures
• How much disk space required (approximately)?: 20 GB
• How much time is needed to prepare workflow (approximately)?: 1-2 hours
• How much time is needed to complete experiments (approximately)?: 2 days
• Publicly available?: Yes
• Code licenses (if publicly available)?: Yes
• Data licenses (if publicly available)?: Yes

B.3 Description

B.3.1 How delivered

The artifact is delivered as a public GitHub repository containing the datasets, Python benchmarking scripts, and instructions to build the source code. The instructions for building the artifacts, along with the detailed specifications of the system used to produce the paper results, are located at https://github.com/cjnolet/sparse_neighborhood_semiring_paper.

B.3.2 Hardware dependencies

An Nvidia DGX1 was used to produce the results in the paper. Similar results can be produced with any Nvidia GPU capable of running Nvidia RAPIDS (Pascal architecture or newer and CUDA 11.0+).

B.3.3 Software dependencies

These benchmarks were executed using a custom branch of RAPIDS cuML version 0.19 and CUDA toolkit 11.0. All dependencies should be installed with Anaconda, and instructions for installing them are provided in the documentation.

B.3.4 Data sets

MovieLens 20M Ratings: https://files.grouplens.org/datasets/movielens/ml-20m.zip

NY Times Bag of Words: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz

SEC EDGAR Company Names: https://www.kaggle.com/dattapiy/sec-edgar-companies-list

scRNA 70k Lung Cell: https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/krasnow_hlca_10x.sparse.h5ad
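As a minimal illustration of how one of these files can be converted into the sparse CSR input the benchmarks operate on, the following sketch parses the UCI bag-of-words format with SciPy; the local file path is hypothetical and this loader is not part of the artifact's scripts.

import gzip
from scipy.sparse import coo_matrix

def load_uci_bow(path):
    # UCI bag-of-words format: the first three lines give the number of documents,
    # the vocabulary size, and the number of nonzeros; every remaining line is
    # "docID wordID count" with 1-based indices.
    with gzip.open(path, "rt") as f:
        n_docs = int(f.readline())
        n_words = int(f.readline())
        f.readline()  # total nonzero count, not needed here
        rows, cols, vals = [], [], []
        for line in f:
            d, w, c = line.split()
            rows.append(int(d) - 1)
            cols.append(int(w) - 1)
            vals.append(float(c))
    return coo_matrix((vals, (rows, cols)), shape=(n_docs, n_words)).tocsr()

X = load_uci_bow("datasets/docword.nytimes.txt.gz")  # hypothetical local path
print(X.shape, X.nnz)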
B.4 Installation

After installing the software dependencies in an Anaconda environment by following the instructions provided in the GitHub repository, the cuML source code needs to be built and installed from the branch-0.19 branch.

1. Create an Anaconda environment with the necessary dependencies: https://github.com/rapidsai/cuml/blob/branch-0.19/BUILD.md#setting-up-your-build-environment

2. Check out the code for both the baseline and our novel implementation outlined here: https://github.com/cjnolet/sparse_neighborhood_semiring_paper

3. Build from source using the build instructions at https://github.com/rapidsai/cuml/blob/branch-0.19/BUILD.md#installing-from-source

B.5 Experiment workflow

1. Clone the repository containing the benchmark scripts and instructions: https://github.com/cjnolet/sparse_neighborhood_semiring_paper

2. Download all of the datasets and place them in a datasets directory in the root of the repository.

3. Clone and build the two provided cuML branches for the baseline and the optimized versions, installing only one at a time.

4. Execute the scripts in the scripts directory for both the baseline and optimized versions of cuML.

5. Execute the scripts in the scripts directory for the CPU benchmarks.
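The artifact's own benchmark scripts drive these steps end to end; purely as a schematic of the end-to-end runtime measurement they report, a timing loop might look like the following Python sketch, where run_distance is a placeholder for whichever implementation is being timed (baseline GPU, optimized GPU, or CPU) and the metric list is only representative, not the artifact's actual interface.

import time

METRICS = ["euclidean", "cosine", "manhattan", "jaccard"]  # representative only

def benchmark(run_distance, X, metrics=METRICS, repeats=3):
    # run_distance is a placeholder callable expected to compute pairwise
    # distances over the sparse matrix X for the requested metric.
    results = {}
    for metric in metrics:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_distance(X, metric=metric)
            timings.append(time.perf_counter() - start)
        results[metric] = min(timings)
        print(f"{metric}: {results[metric]:.3f}s")
    return results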

B.6 Evaluation and expected result

• All of the GPU benchmarks should be consistently faster than the CPU benchmarks.

• For MovieLens, the optimized version should be faster than the baseline for all distances.

• For SEC EDGAR, the optimized version should be faster than the baseline for all distances.

• For NY Times, the optimized version should be faster than the baseline for all of the non-trivial distances. For the dot-product based distances, the baseline should be faster than the optimized version.

• For scRNA, the optimized version should be faster than the baseline for all of the non-trivial distances. For the dot-product based distances, the baseline should be faster than the optimized version.
