GPU Semiring Primitives for Sparse Neighborhood Methods (MLSys 2022)
Corey J. Nolet 1 2 Divye Gala 1 Edward Raff 2 3 Joe Eaton 1 Brad Rees 1 John Zedlewski 1 Tim Oates 2
ABSTRACT
High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of
skewed degree distributions and limits on memory consumption that are typically not issues in dense operations.
We demonstrate that a sparse semiring primitive can be flexible enough to support a wide range of critical distance
measures while maintaining performance and memory efficiency on the GPU. We further show that this primitive
is a foundational component for enabling many neighborhood-based information retrieval and machine learning
algorithms to accept sparse input. To our knowledge, this is the first work aiming to unify the computation of
several critical distance measures on the GPU under a single flexible design paradigm and we hope that it provides
a good baseline for future research in this area. Our implementation is fully open source and publicly available
as part of the RAFT library of GPU-accelerated machine learning primitives (https://github.com/rapidsai/raft).
4. Use as little memory as necessary.

5. Enable semirings in addition to the simple dot product.

2 SEMIRINGS AND PAIRWISE DISTANCES

We formalize the concepts of semirings and distance measures in this section and describe the building blocks required to implement several popular distance measures, often encountered in machine learning applications, within the semiring framework.

In machine learning applications, a distance measure is often computed on two row matrices containing data samples, with columns that represent some number of observations, or features. In this paper, we will refer to these two matrices as A and B in upper case, where A ∈ R^{m×k} and B ∈ R^{n×k}, and to a single vector as a and b in lower case, where a ∈ R^k or b ∈ R^k. As we show in this section, the core of computing pairwise distances between A and B is a matrix multiplication AB^T in a topological space equipped with an inner product semiring that defines distances between vectors. When this inner product is defined to be the dot product semiring, the topological space defines the standard matrix multiply, but we can capture many other core distances in machine learning applications by simply redefining the inner product semiring.

While some distance measures can make use of the simple dot product semiring from matrix-matrix multiplication routines, we show that a more comprehensive package for computing pairwise distances requires more flexibility in the arithmetic operations supported. Further, the explicit transposition of B required by routines such as the cuSPARSE csrgemm() requires a full copy of B, since no elements can be shared between the original and transposed versions in the CSR data format. This has a negative impact on scalability in memory-constrained environments such as GPUs.

For a distance to define a metric space, it must satisfy four properties: implication (d(a, b) = 0 ⟹ a = b), positivity (d(a, b) ≥ 0), symmetry (d(a, b) = d(b, a)), and the triangle inequality (d(a, c) ≤ d(a, b) + d(b, c)). Several metrics, including Chebyshev, Manhattan, and Euclidean, are derived from the generalized Minkowski formula (Σ_{i=1}^{k} |a_i − b_i|^p)^{1/p}, where p defines a degree. The absolute value in this equation defines a commutative semiring which requires commutativity in the difference of each vector dimension. Euclidean distance is equivalent to Minkowski with a degree of 2, (Σ_{i=1}^{k} |a_i − b_i|^2)^{1/2}. Because the square of a number is always positive, this equation can be expanded to (a − b)^p for all even degrees and still preserve the absolute value, such as (a − b)^2 = a^2 − 2⟨a, b⟩ + b^2 in the case of Euclidean distance. While numerical instabilities can arise from cancellations in these expanded equations, we will show in section 2.2 that the expanded form is often preferred in sparse algebras, when distances can make use of it, because it requires fewer computations than the exhaustive evaluation over the nonzeros of k. By example, the distances which do not have an expanded form, such as Manhattan (Minkowski with degree 1) and Chebyshev (Minkowski with degree max), are often non-annihilating (e.g., x ⊗ 0 = x) and require computation over the full union of nonzero columns from both vectors in order to preserve commutativity.

2.2 Semirings

A monoid is a semigroup containing an associative binary relation, such as addition (⊕), and an identity element (id⊕). A semiring (Ratti and Lin, 1971), denoted (S, R, {⊕, id⊕}, {⊗, id⊗}), is a tuple endowed with a domain along with additive (⊕) and multiplicative (⊗) monoids where

1. ⊕ is commutative, distributive, and has an identity element 0

2. ⊗ distributes over ⊕
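To make the definition concrete, the short sketch below (plain Python written for exposition; the helper name semiring_inner and the example inputs are ours, not part of the library) evaluates the inner product of two vectors under a user-supplied (⊕, ⊗) pair, instantiated with the ordinary dot-product semiring and with the tropical (min, +) semiring that appears later in Equation 1.

    # A minimal, illustrative sketch (not the RAFT implementation): an inner
    # product between two dense vectors under a user-supplied semiring.
    import math

    def semiring_inner(a, b, add_op, add_id, mul_op):
        acc = add_id
        for x, y in zip(a, b):
            acc = add_op(acc, mul_op(x, y))
        return acc

    a = [0.0, 2.0, 0.0, 3.0]
    b = [1.0, 4.0, 0.0, 5.0]

    # Dot-product semiring (S, R, {+, 0}, {*, 1}): the standard inner product.
    dot = semiring_inner(a, b, lambda s, v: s + v, 0.0, lambda x, y: x * y)

    # Tropical semiring (S, R ∪ {+inf}, {min, +inf}, {+, 0}).
    trop = semiring_inner(a, b, min, math.inf, lambda x, y: x + y)

    print(dot, trop)   # 23.0 and 0.0 for these inputs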
Table 1: Common distances and their semirings. While all distances can be computed with the NAMM (where id⊗ = 0),
the distances in this table which require it have their ⊗ listed. The expansion function and any potential norms are provided
for distances that can be computed in the more efficient expanded form.
Some formal definitions of semirings require that id⊗ = 1. Given two sparse vectors a, b ∈ R^k, a semiring (S, R, {⊕, 0}, {⊗, 1}) with annihilator⊗ = 0 has the effect of only requiring ⊗ to be computed on columns that are both nonzero (e.g., nonzeros(a) ∩ nonzeros(b)). These rules are often relaxed in practice, for example in the tropical semiring in Equation 1, which can solve dynamic programming problems such as the Viterbi algorithm. An annihilator is an input that will always cause a monoid to evaluate to 0, and the multiplicative annihilator (annihilator⊗) is often assumed to be id⊕. A monoid is non-annihilating when it does not have a defined annihilator. When an expanded form is not possible or efficient, ⊗ must also be commutative in metric spaces, and thus must be non-annihilating with id⊗ = 0. We refer to this monoid as a non-annihilating multiplicative monoid (NAMM).

(S, R ∪ {+∞}, {min, +∞}, {+, 0})    (1)

Table 1 uses semirings to construct several distances commonly used in machine learning and data science applications. When an expanded form is possible, an expansion function can be performed as an element-wise operation on a simple pairwise dot product semiring with arrays of row-vector norms. While most of the expanded-form distances can directly use the dot product semiring, KL-divergence directly replaces the ⊗ with a_i log(a_i / b_i) and makes no further assumption of symmetry. A NAMM is required for all unexpanded distance measures where id⊗ = 0, and special care must be taken to ensure it is applied to the full union of the non-zero columns of corresponding elements from each pair of vectors.

As mentioned in the previous section, the Euclidean distance can be expanded to ||A|| − 2⟨AB^T⟩ + ||B||. This equation can be decomposed into the sum of individual L2 norms, a matrix product, and an element-wise expansion function executed in parallel over the individual dot products from the matrix product to combine the parts into a single scalar distance. Given vectors A_i, B_j, the expansion function for Euclidean distance can be derived by distributing their squared difference over the exponent to produce (A_i − B_j) × (A_i − B_j) and further expanding it to ||A||_i − 2⟨A_i, B_j⟩ + ||B||_j.
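As a concrete illustration of the expansion function, the following CPU-side sketch (NumPy/SciPy, written by us for exposition and not taken from the RAFT implementation) assembles pairwise Euclidean distances from row norms, a sparse matrix product, and an element-wise expansion, then checks the result against the unexpanded definition.

    # Illustrative CPU sketch of the expanded form ||a||^2 - 2<a, b> + ||b||^2
    # using SciPy sparse matrices; the GPU primitive follows the same structure.
    import numpy as np
    import scipy.sparse as sp

    A = sp.random(4, 10, density=0.3, random_state=42, format="csr")
    B = sp.random(5, 10, density=0.3, random_state=43, format="csr")

    dots = (A @ B.T).toarray()                                # pairwise dot products (m x n)
    a_norms = np.asarray(A.multiply(A).sum(axis=1)).ravel()   # ||A_i||^2
    b_norms = np.asarray(B.multiply(B).sum(axis=1)).ravel()   # ||B_j||^2

    # Element-wise expansion function applied over the dot products.
    sq_euclidean = a_norms[:, None] - 2.0 * dots + b_norms[None, :]
    euclidean = np.sqrt(np.maximum(sq_euclidean, 0.0))        # guard tiny negatives from cancellation

    # Reference check against the unexpanded (exhaustive) definition.
    ref = np.sqrt(((A.toarray()[:, None, :] - B.toarray()[None, :, :]) ** 2).sum(-1))
    assert np.allclose(euclidean, ref, atol=1e-6)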
The annihilator⊗ and id⊗ determine the number of times the ⊗ monoid must be applied during the computation of pairwise distances. When annihilator⊗ = id⊕, then ⊗(a_i, 0) = 0 and ⊗(0, b_i) = 0, so ⊗ can be applied only to the intersection of columns. When annihilator⊗ is undefined and id⊗ = 0, then ⊗ must be applied exhaustively over the union of columns because ⊗(a_i, 0) = a_i and ⊗(0, b_i) = b_i.

A union between two sets can be decomposed into an intersection between the two sets, along with the union of the symmetric differences between them. This is shown in Equation 3, where a complement is denoted with a bar. The nonzero columns of two sparse vectors can be used as the sets a and b in this equation: a sparse matrix multiply with an ordinary dot product only requires the application of product() across a ∩ b, whereas the NAMM requires the application of product() across the full union of nonzero columns a ∪ b.

[Figure/Equation 2: Venn diagram of the decomposition of a ∪ b into a ∩ b̄, a ∩ b, and ā ∩ b]    (2)

a ∪ b = {a ∩ b̄} ∪ {ā ∩ b} ∪ {a ∩ b}    (3)

A common approach to implementing sparse matrix multiply is to iterate over the nonzeros from b in order to look up and compute the intersection with the nonzeros from a. This design will also implicitly compute the symmetric difference between either of the two sets of nonzeros, a ∩ b̄ or ā ∩ b, depending on which vector is chosen in the iteration over nonzeros. To compute a full union, the remaining set difference can be computed in a second pass of the matrix multiply by looping over the nonzeros from the vector that remains. We will show in subsection 3.3.1 that we accomplish this efficiently in our implementation in two passes: one pass to compute the first two terms and another pass to compute the third term. Distances which can be computed with an expansion function only need the first pass, while distances which require the NAMM need both. Please refer to subsection A.1 for an example of using semirings to compute the Manhattan distance using the NAMM.
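The two-pass idea can be sketched on the CPU as follows (illustrative Python over dictionaries of nonzeros; the helper is ours). The first pass iterates the nonzeros of b and covers a ∩ b together with ā ∩ b; the second pass iterates the nonzeros of a and covers the remaining a ∩ b̄. The Manhattan NAMM, with ⊗(x, y) = |x − y| and id⊗ = 0, is used as the example.

    # Illustrative two-pass union evaluation of a NAMM-based distance
    # (Manhattan / L1) over two sparse vectors stored as {column: value} dicts.
    def manhattan_two_pass(a, b):
        dist = 0.0
        # Pass 1: iterate nonzeros of b, covering (a ∩ b) and (ā ∩ b).
        for col, b_val in b.items():
            a_val = a.get(col, 0.0)          # id⊗ = 0 for missing columns
            dist += abs(a_val - b_val)
        # Pass 2: iterate nonzeros of a, covering the remaining (a ∩ b̄).
        for col, a_val in a.items():
            if col not in b:
                dist += abs(a_val - 0.0)
        return dist

    a = {1: 2.0, 3: 5.0}
    b = {1: 1.0, 4: 2.0}
    assert manhattan_two_pass(a, b) == abs(2.0 - 1.0) + abs(5.0) + abs(2.0)  # 8.0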
Existing semiring implementations currently require that the id⊕ be used as the annihilator⊗. For example, the GraphBLAS specification enables the re-interpretation of the zeroth element, but this is necessary to define the identity of the ⊕ monoid.

3 GPU-ACCELERATED SEMIRINGS

In this section, we briefly introduce GPU architecture before discussing some naive designs and the inefficiencies that led to the construction of our final design. Our goal was to preserve as many of the ideal design characteristics from section 5 as possible, but we found a need to accept trade-offs during implementation.

3.1 GPU Architecture

The largest GPUs today contain hundreds of hardware processing cores called streaming multiprocessors (SM) which execute groups of threads called warps. Each warp can process a single instruction at a time in parallel using a paradigm called single-instruction multiple data (SIMD). It is important that threads within a warp minimize conditional branching that will cause the threads to wait for each branch to complete before proceeding. This is called thread divergence, and it can severely limit effective parallel execution. On the Volta and Ampere architectures, each SM can track the progress of up to 64 warps concurrently (Tesla, 2018) and rapidly switch between them to fully utilize the SM. Each SM has a set of registers available which allows warps to perform collective operations, such as reductions. Warps can be grouped into blocks, and a small amount of memory can be shared across the threads and warps.

Global, or device, memory can be accessed by all of the SMs in the GPU. Accesses to contiguous device memory locations within a warp can be coalesced into a single blocked transaction so long as the accesses are performed in the same operation. In SIMD architectures, uniform patterns can be critical to performance unless latencies from non-uniform processing, such as uncoalesced memory accesses, can be hidden with increased parallelism.

Registers provide the fastest storage, and it is generally preferable to perform reductions and arithmetic as intra-warp collective operations where possible. Intra-block shared memory is also generally preferred over global memory when a problem can be made small enough to benefit. However, contiguous locations of shared memory are partitioned across contiguous banks, and any accesses to different addresses in the same bank by the same warp will create a bank conflict and be serialized within the warp, causing the threads to diverge.
3.2 Naive Semiring Full-Union CSR Designs

3.2.1 Expand-Sort-Contract

Initial implementations tried to minimize the memory footprint as much as possible by directly computing the output distances from the input CSR format. The CSR format requires columns to be sorted with respect to row, and we initially attempted to use a modified variant of the expand-sort-contract (Dalton et al., 2015) pattern on the nonzero columns from each pair of row vectors a, b ∈ R^k: concatenating the vectors together, sorting them, applying the ⊗ monoid on pairs of duplicate columns to contract the sorted array, and invoking ⊗ with the identity for all other columns. At the row level of the output matrix, no computations would be able to be reused by subsequent pairs of vectors, so we implemented this pattern on the GPU and mapped the nonzero columns and values for each row-vector pair to individual thread-blocks, expanding both vectors by concatenating them in shared memory, performing a sort-by-key, and compressing them in parallel. We attempted several efficient sorting algorithms on the GPU, including the popular radix sort and bitonic sorting networks, and, while the use of shared memory in the sort step enabled coalesced reads from global memory for the nonzero columns and values, the sorting step dominated the performance of the algorithm. Another downside of this particular design is that both vectors need to fit in shared memory, requiring space for 2 ∗ (nonzeros(a) + nonzeros(b)) elements in order to fit both the columns and corresponding values at the same time. In addition to the need for n ∗ m blocks to be scheduled, the shared memory requirement became a severe limit to scale, which was further compounded by the shared memory size limiting the number of blocks that could be scheduled concurrently on each SM.

Algorithm 1 Semiring on CSR inputs using the expand-sort-contract pattern, parallelized across threads in each block.
Input: A_i, B_j, product_op, reduce_op
Result: C_ij = d(A_i, B_j)
    smem[0..nnz_{a_i}−1] = A_i
    smem[nnz_{a_i}..nnz_{b_j}−1] = B_j
    sort(smem)
    C_ij = reduce(smem, product_op, reduce_op)
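A CPU-side sketch of the expand-sort-contract pattern described above (illustrative Python written for exposition; Algorithm 1 performs the same steps per thread-block in shared memory on the GPU):

    # Illustrative expand-sort-contract over two sparse vectors given as
    # (column, value) pairs; product_op is the ⊗ monoid and reduce_op is ⊕.
    def expand_sort_contract(a_nz, b_nz, product_op, reduce_op, identity=0.0):
        expanded = sorted(a_nz + b_nz)         # expand: concatenate, then sort by column
        result = identity
        i = 0
        while i < len(expanded):
            col, val = expanded[i]
            if i + 1 < len(expanded) and expanded[i + 1][0] == col:
                # contract: duplicate column means both vectors are nonzero here
                result = reduce_op(result, product_op(val, expanded[i + 1][1]))
                i += 2
            else:
                # column present in only one vector: apply ⊗ with the identity
                result = reduce_op(result, product_op(val, identity))
                i += 1
        return result

    a_nz = [(1, 2.0), (3, 5.0)]
    b_nz = [(1, 1.0), (4, 2.0)]
    # Manhattan semiring: ⊗ = |x - y|, ⊕ = +
    print(expand_sort_contract(a_nz, b_nz, lambda x, y: abs(x - y), lambda s, v: s + v))  # 8.0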
3.2.2 Iterating Sorted Nonzeros

Since columns will often be sorted within their respective rows in the CSR format, we removed the sort step from Algorithm 1 by exhaustively iterating over the non-zeros of each of the O(m ∗ n) pairs of vectors in parallel, one pair per thread, as shown in Algorithm 2. We found that even when neighboring threads processed rows of similar degree, the differing distributions of nonzeros within each row decreased the potential for coalesced global memory accesses and created large thread divergences. Further, the exhaustive nature of this design, while it will guarantee the ⊗ monoid is computed on the full union of nonzero columns, will end up performing many unnecessary computations when distances can be computed with the rules of a simple dot product semiring.

Algorithm 2 Semiring on CSR inputs. Each thread computes a single dot product.
Input: A_i, B_j, product_op, reduce_op
Result: C_ij = d(A_i, B_j)
    startA = indptrA_i, endA = indptrA_{i+1}
    startB = indptrB_j, endB = indptrB_{j+1}
    icolA = startA, icolB = startB
    while icolA < endA || icolB < endB do
        colA = icolA < endA ? indicesA[icolA] : MAX_INT
        colB = icolB < endB ? indicesB[icolB] : MAX_INT
        valueA = 0, valueB = 0
        if colA ≤ colB then
            valueA = valuesA[icolA++]
        end
        if colB ≤ colA then
            valueB = valuesB[icolB++]
        end
        v = product_op(valueA, valueB)
        C_ij = reduce_op(C_ij, v)
    end

We found marginal gains in performance by coalescing the reads of the vectors from A into shared memory and sharing them across all threads of each thread-block. We attempted to load balance this algorithm by maintaining arrays to look up row information for each column, but this increased warp divergence from the overly complicated conditionals required to maintain state across threads and warp boundaries.
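For reference, a direct Python rendering of the merge loop in Algorithm 2 over the CSR arrays of a single row pair (purely illustrative; the helper name and the toy inputs are ours):

    # Illustrative CPU rendering of Algorithm 2: merge the sorted nonzero columns
    # of row i of A and row j of B, applying product_op over the union.
    import sys

    def semiring_row_pair(indptrA, indicesA, dataA, i,
                          indptrB, indicesB, dataB, j,
                          product_op, reduce_op, init=0.0):
        ia, enda = indptrA[i], indptrA[i + 1]
        ib, endb = indptrB[j], indptrB[j + 1]
        c = init
        while ia < enda or ib < endb:
            colA = indicesA[ia] if ia < enda else sys.maxsize
            colB = indicesB[ib] if ib < endb else sys.maxsize
            valA = valB = 0.0
            if colA <= colB:
                valA = dataA[ia]; ia += 1
            if colB <= colA:
                valB = dataB[ib]; ib += 1
            c = reduce_op(c, product_op(valA, valB))
        return c

    # Row 0 of A has nonzeros {1: 2.0, 3: 5.0}; row 0 of B has {1: 1.0, 4: 2.0}.
    indptrA, indicesA, dataA = [0, 2], [1, 3], [2.0, 5.0]
    indptrB, indicesB, dataB = [0, 2], [1, 4], [1.0, 2.0]
    print(semiring_row_pair(indptrA, indicesA, dataA, 0, indptrB, indicesB, dataB, 0,
                            lambda x, y: abs(x - y), lambda s, v: s + v))  # 8.0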
3.3 Load-Balanced Hybrid CSR+COO

While the CSR format enables algorithms to be parallelized over threads for individual rows, we found that using a row index array in coordinate format (COO) for B enabled load balancing, coalescing the loads of each vector from A into shared memory once per block, with the threads of each block parallelizing the application of the semiring over the nonzero elements of B. Since the columns in B are assumed to be sorted by their respective row, we use a segmented reduction by key within each warp, bounding the number of potential writes to global memory by the number of active warps over each row of B. Our design extends the COO sparse-matrix dense-vector multiplication described in (Anzt et al., 2020) by storing the vectors from A in dense form in shared memory only when the number of columns is small enough. Our extension enables sparse-matrix sparse-vector multiplication by storing the vectors in sparse form when their degrees are small enough. We achieve full occupancy on the Volta architecture by trading off the size of the L1 cache to double the amount of shared memory per SM, allowing each SM to use 96KiB. Since our design uses fewer than 32 registers, a block size of 32 warps allows two blocks, the full 64 warps, to be scheduled concurrently on each SM.

Algorithm 3 Load-balanced Hybrid CSR+COO SPMV.
Input: A_i, B, product_op, reduce_op
Result: C_ij = d(A_i, B_j)
    read A_i into shared memory as x
    ind = idx of first elem to be processed by this thread
    cur_row = rowidx[ind]
    c = product_op(A[ind], x[colidx[ind]])
    for i ← 1 to nz_per_chunk step warp_size do
        next_row = rowidx[ind + warp_size]
        if next_row != cur_row || is_final_iter then
            v = segmented_scan(cur_row, c, reduce_op)
            if is_segment_leader then
                atomic_reduce(v, reduce_op)
            end
            c = 0
        end
        cur_row = next_row
        ind += warp_size
        c = product_op(A[ind], x[colidx[ind]])
    end
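The core idea of this load-balanced design, streaming the COO nonzeros of B against a vector of A held in shared memory and reducing by B's row index, can be emulated on the CPU as follows (illustrative Python; the GPU kernel performs the same segmented reduction with warp-level primitives and atomics, and the helper names here are ours):

    # Illustrative segmented reduction over COO nonzeros of B against a single
    # dense vector a (one row of A, held in shared memory on the GPU).
    import numpy as np

    def coo_semiring_spmv(a, rowidx, colidx, values, n_rows,
                          product_op, reduce_op, init=0.0):
        out = np.full(n_rows, init)
        for r, c, v in zip(rowidx, colidx, values):
            out[r] = reduce_op(out[r], product_op(a[c], v))
        return out

    # a = row of A (dense); B given in COO form with rows sorted.
    a = np.array([0.0, 2.0, 0.0, 5.0, 0.0])
    rowidx = [0, 0, 1, 1]
    colidx = [1, 4, 1, 3]
    values = [1.0, 2.0, 3.0, 4.0]
    # Dot-product semiring: inner products of a against each row of B.
    print(coo_semiring_spmv(a, rowidx, colidx, values, 2,
                            lambda x, y: x * y, lambda s, v: s + v))  # [ 2. 26.]

Note that a single such pass applies ⊗ only to the nonzero columns of B, which is why NAMM-based distances need the second, commuted pass described in subsection 3.3.1.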
3.3.1 Two-Pass Execution

As described in subsection 2.2, a single execution of this strategy will compute the intersection and the symmetric difference ā ∩ b between the nonzero columns of each vector a and b, so long as ⊗ is applied to all nonzero columns of b. While a single pass covers distance measures which require only a column intersection (e.g., the dot product semiring (S, R, {+, 0}, {∗, 1})), a second pass can compute the remaining symmetric difference required for the full union between non-zero columns by commuting A and B and skipping the application of id⊗ in B for the second pass.

3.3.2 Sparsifying the Vector in Shared Memory

While we found storing the vectors from A in dense form in shared memory to have the highest throughput rate and the least amount of thread divergence within each warp, sparse datasets are generally assumed to have high dimensionality, and the limited amount of shared memory that can be allocated per SM bounds the size of the vectors that can be stored in it. For example, the 96KiB limit per block on Volta allows a max dimensionality of 23K with single precision, and the 163KiB limit per SM on Ampere allows a max dimensionality of 40K with single precision. Coupling the amount of shared memory to the dimensionality creates a problem for occupancy as it approaches capacity. Both of these architectures limit the maximum block size to 1024 threads and the maximum concurrent warps per SM to 64, so anything over 48KB of shared memory per block is going to decrease occupancy. For this reason, the maximum dimensionality of dense vectors that can be processed with full occupancy is actually 12K and 20K, respectively.

This boundary becomes too small for many sparse datasets, which would instead benefit from coupling the shared memory size to individual row degrees. Inspired by other sparse matrix multiplication implementations on the GPU (Anh et al., 2016; Kunchum, 2017; Liu and Vinter, 2014; Nagasaka et al., 2017), we enhanced the vector insertion and lookup patterns of the COO SPMV design outlined in (Anzt et al., 2020) by building a hash table to store these columns in shared memory. Unlike many other hash table implementations on the GPU (Alcantara et al., 2009; Ashkiani et al., 2018; Alcantara et al., 2012; Pan and Manocha, 2011; Cassee and Wijs, 2017), our implementation builds an independent hash table per thread-block, and so many other designs and concurrency patterns that optimize the key distribution and collision-resolution strategies for the GPU are not efficient or cannot be easily ported for our use-case. For this reason, we used a simple hash table with a Murmur hash function and linear probing, and we leave the investigation of a better and more optimized design to future work.

Hash tables have the best performance when the number of entries is less than 50% of the capacity. As the hash table size grows beyond 50% capacity, the collision resolution cycles of linear probing, which are non-uniform, increase the serialization of instructions from warp divergences and also increase the number of transactions from global memory reads of B since they can no longer be coalesced. The hash table strategy decreases the amount of shared memory available, often by a factor of 2, because the nonzeros need to be stored together as key/value pairs to avoid an additional costly lookup to global memory, a side-effect which would only further increase serialized execution from diverging threads. Our hash table strategy allows for a max degree of 3K on Volta architectures and 5K on Ampere.

Another unfortunate side-effect of the linear-probing collision strategy of our hash table is the increase in lookup times for columns, even for elements that are not in the table. For example, as the hash table approaches capacity, the increase in collisions can cause a lookup to probe through multiple candidates, sometimes hundreds, before finding that an element does not exist. Bloom filters have been used to implement fast list intersection for sparse matrix multiplication problems on the GPU (Zhang et al., 2020).
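A simplified model of the per-block hash table (illustrative Python with a stand-in integer mixing function rather than MurmurHash; the class and its methods are ours, not the CUDA implementation):

    # Illustrative fixed-capacity open-addressing hash table with linear probing,
    # mirroring the per-thread-block table used to sparsify A_i in shared memory.
    EMPTY = -1

    class ProbingTable:
        def __init__(self, capacity):
            self.keys = [EMPTY] * capacity
            self.vals = [0.0] * capacity
            self.capacity = capacity

        def _hash(self, key):
            # Stand-in integer mix (not MurmurHash); any well-mixing hash works here.
            key = (key ^ (key >> 16)) * 0x45d9f3b & 0xffffffff
            return key % self.capacity

        def insert(self, key, val):
            slot = self._hash(key)
            for _ in range(self.capacity):
                if self.keys[slot] in (EMPTY, key):
                    self.keys[slot] = key
                    self.vals[slot] = val
                    return True
                slot = (slot + 1) % self.capacity      # linear probing
            return False                               # table full

        def lookup(self, key, default=0.0):
            slot = self._hash(key)
            for _ in range(self.capacity):
                if self.keys[slot] == key:
                    return self.vals[slot]
                if self.keys[slot] == EMPTY:
                    return default                     # early exit on an empty slot
                slot = (slot + 1) % self.capacity
            return default

    # Insert the nonzero columns of a row of A, then look columns up while
    # streaming the nonzeros of B.
    table = ProbingTable(capacity=8)                   # keep load factor below ~50%
    for col, val in [(1, 2.0), (3, 5.0), (9, 1.5)]:
        table.insert(col, val)
    print(table.lookup(3), table.lookup(4))            # 5.0 0.0

Keeping the load factor below roughly 50%, as discussed above, keeps the linear-probing chains short for both hits and misses.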
Table 3: Benchmark results for all datasets under consideration. All times are in seconds; the best result is in bold. The first, italicized set of distances can all be computed as dot products, which are already highly optimized for sparse comparisons today. Even in this easier case, we are still competitive with, and sometimes faster than, the dot-product based metrics. The non-trivial set of distances that are not well supported by existing software is below, and our approach dominates on all of these metrics.
variants of sparse matrix multiplication available in open source libraries, each using different concurrency patterns and available memory to provide speedups based on either supported sparse formats or the assumed density of either the inputs or the outputs (Sedaghati et al., 2015; Mattson et al., 2013). We have compared against the seminal cuSPARSE (Naumov et al., 2010), which is highly optimized for sparse dot product based k-nearest neighbors (Zhou, 2018), and found our approach is faster than or competitive with it in all cases, but is not limited to dot product based measures.

Better able to make use of critical optimizations inherent in their dense counterparts, block compressed sparse formats have become widely popular for representing sparse data (Zachariadis et al., 2020), in part because they can improve load balancing by grouping nonzeros into fixed-sized tiles and scheduling the tiles more uniformly across the processing cores. Enabling sparse formats to be processed more like their dense counterparts also allows the use of specialized hardware optimizations such as tensor cores. While we do hope to someday support block-sparse formats, it is most often assumed that users will be calling code that invokes our primitive with matrices in the standard compressed sparse row (CSR) format (Williams et al., 2007), and so a conversion would be necessary in order to use a blocked format.

5.2 Semirings

Consolidating seemingly disparate concepts into a lightweight, terse, and abstract set of building blocks can increase flexibility and promote reuse (Izbicki, 2013). This especially benefits fields which require non-trivial and highly optimized implementations where the design complexities and costs are high, the basic linear-algebra subroutines (BLAS) API and GPU-accelerated computing being common examples. Semirings provide the efficiency and flexibility to enable algorithms for which the representation and assumptions of the typical BLAS API for dense linear algebra come up short (Mattson et al., 2013). NIST published a sparse BLAS standard back in 2001 (Duff et al., 2002), and cuSPARSE is one of the most sophisticated implementations of the sparse BLAS standard built on the GPU; however, as mentioned above, its multiplication routines fix the inner product to the dot product. GraphBLAS (Davis, 2018) provides a set of primitives, along with an API, for using semiring algebras to implement graph algorithms. The GraphBLAST (Yang et al., 2019) and SuiteSparse (Davis, 2019) libraries provide implementations of the GraphBLAS that also include GPU-accelerated primitives.

The use of semirings in graph theory dates back to the early 1970s (Ratti and Lin, 1971), when "good old-fashioned artificial intelligence", or Symbolic AI, was a dominant paradigm in research. Semirings have also been used for some time to implement more modern machine learning methods (Belle and De Raedt, 2020), with the more recent invention of semiring programming attempting to further consolidate these concepts under a single framework and set of symbolic routines. Semirings can be a useful building block for linear models (Jananthan et al., 2017) and probabilistic models, such as Bayesian networks (Wachter et al., 2007) and the use of the Tropical semiring in Markov networks (Ilic, 2011). The Tropical semiring is also being used to implement sparse non-negative matrix factorizations (Omanović et al., 2020).

5.3 Neighborhood Methods

Our work is positioned to have an impact on numerous down-stream tasks that often depend on sparse nearest-neighbor retrieval. This includes classic information retrieval problems where such methods are still highly competitive or preferred (Mitra and Craswell, 2018; Li, 2016; Soboroff, 2018; Voorhees et al., 2017; Bouthillier et al., 2021). Dimensionality reduction approaches like t-SNE (Maaten and Hinton, 2008) and UMAP (McInnes et al., 2018) lack sparse input support on GPUs without our method (Nolet et al., 2020). ML models based on the kernel trick, such as Gaussian processes (Lawrence and Urtasun, 2009), also stand to benefit. The breadth and frequency of nearest neighbor methods on high-dimensional data make our work relevant to an especially wide class of practitioners.

6 CONCLUSION

In this paper, we demonstrated a flexible sparse pairwise distance primitive that is able to collectively support, to our knowledge, a larger assortment of widely used distance measures than any other package on the GPU. We consolidated the design of these distance measures using a couple of minor enhancements to the rules of classical semirings, which are traditionally used to implement graph algorithms, and we discussed the impact of our primitive as a core building block of many important neighborhood methods for machine learning and data mining. Finally, we provided a novel implementation as an example of how these semirings can be implemented on the GPU with a lower memory footprint and performance comparable to, or better than, the current state of the art.
REFERENCES

Dan A Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D Owens, and Nina Amenta. 2009. Real-time parallel hashing on the GPU. In ACM SIGGRAPH Asia 2009 papers. 1–9.

Dan A Alcantara, Vasily Volkov, Shubhabrata Sengupta, Michael Mitzenmacher, John D Owens, and Nina Amenta. 2012. Building an efficient hash table on the GPU. In GPU Computing Gems Jade Edition. Elsevier, 39–53.

Daniel Alpay. 2012. Reproducing kernel spaces and applications. Vol. 143. Birkhäuser.

Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. 2016. Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication. (2016), 1–12.

Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M. Tsai, and Weichung Wang. 2020. Load-Balancing Sparse Matrix Vector Product Kernels on GPUs. ACM Trans. Parallel Comput. 7, 1, Article 2 (March 2020), 26 pages. https://doi.org/10.1145/3380930

Saman Ashkiani, Martin Farach-Colton, and John D Owens. 2018. A dynamic hash table for the GPU. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 419–429.

Nathan Bell and Michael Garland. 2008. Efficient sparse matrix-vector multiplication on CUDA. Technical Report. Citeseer.

Vaishak Belle and Luc De Raedt. 2020. Semiring programming: A semantic framework for generalized sum product problems. International Journal of Approximate Reasoning 126 (2020), 181–201.

Alain Berlinet and Christine Thomas-Agnan. 2011. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media.

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, and Pascal Vincent. 2021. Accounting for Variance in Machine Learning Benchmarks. In Machine Learning and Systems (MLSys). arXiv:2103.03098 http://arxiv.org/abs/2103.03098

Nathan Cassee and Anton Wijs. 2017. Analysing the performance of GPU hash tables for state space exploration. arXiv preprint arXiv:1712.09494 (2017).

Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing sparse matrix-matrix multiplication for the GPU. ACM Transactions on Mathematical Software (TOMS) 41, 4 (2015), 1–20.

Timothy A Davis. 2018. Algorithm 9xx: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra. Technical Report. 24 pages.

Timothy A Davis. 2019. Algorithm 1000: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra. ACM Transactions on Mathematical Software (TOMS) 45, 4 (2019), 1–25.

Iain S Duff, Michael A Heroux, and Roldan Pozo. 2002. An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum. ACM Transactions on Mathematical Software (TOMS) 28, 2 (2002), 239–267.

Kento Emoto, Sebastian Fischer, and Zhenjiang Hu. 2012. Filter-embedding semiring fusion for programming with MapReduce. Formal Aspects of Computing 24, 4 (2012), 623–645.

Alexandre Fender. 2017. Parallel solutions for large-scale eigenvalue problems arising in graph analytics. Ph.D. Dissertation. Université Paris-Saclay.

Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. Sparse GPU kernels for deep learning. arXiv preprint arXiv:2006.10901 (2020).

Brandon Gildemaster, Prerana Ghalsasi, and Sanjay Rajopadhye. 2020. A Tropical Semiring Multiple Matrix-Product Library on GPUs: (not just) a step towards RNA-RNA Interaction Computations. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 160–169. https://doi.org/10.1109/IPDPSW50202.2020.00037

Scott Gray, Alec Radford, and Diederik P Kingma. 2017. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 3 (2017).

Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse DNN models without hardware-support via tile-wise sparsity. arXiv preprint arXiv:2008.13006 (2020).

F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.

Velimir M Ilic. 2011. Entropy semiring forward-backward algorithm for HMM entropy computation. arXiv preprint arXiv:1108.0347 (2011).

Michael Izbicki. 2013. Algebraic classifiers: a generic approach to fast cross-validation, online training, and parallel training. In International Conference on Machine Learning. PMLR, 648–656.

Hayden Jananthan, Suna Kim, and Jeremy Kepner. 2017. Linear systems over join-blank algebras. In 2017 IEEE MIT Undergraduate Research Technology Conference (URTC). IEEE, 1–4.

Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, and Dongsoo Lee. 2020. BiQGEMM: matrix multiplication with lookup table for binary-coding-based quantized DNNs. arXiv preprint arXiv:2005.09904 (2020).

Rakshith Kunchum. 2017. On improving sparse matrix-matrix multiplication on GPUs. Ph.D. Dissertation. The Ohio State University.

Neil D Lawrence and Raquel Urtasun. 2009. Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th annual international conference on machine learning. 601–608.

Richard Lettich. 2021. GALATIC: GPU Accelerated Sparse Matrix Multiplication over Arbitrary Semirings (GALATIC) v1.0. Technical Report. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States).

Hang Li. 2016. Does IR Need Deep Learning? IR and DL. Keynote speech at SIGIR 2016 Neu-IR workshop (2016).

Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 370–381.

Laurens Van Der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.

Tim Mattson, David Bader, Jon Berry, Aydin Buluc, Jack Dongarra, Christos Faloutsos, John Feo, John Gilbert, Joseph Gonzalez, Bruce Hendrickson, et al. 2013. Standards for graph algorithm primitives. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–2.

Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv (2018). arXiv:1802.03426 http://arxiv.org/abs/1802.03426

Bhaskar Mitra and Nick Craswell. 2018. An Introduction to Neural Information Retrieval. Foundations and Trends in Information Retrieval 13, 1 (2018), 1–126. https://doi.org/10.1561/1500000061

Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-performance and memory-saving sparse general matrix-matrix multiplication for NVIDIA Pascal GPU. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, 101–110.

M Naumov, LS Chien, P Vandermersch, and U Kapasi. 2010. cuSPARSE library. In GPU Technology Conference.

David Newman. 2008. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

Corey J Nolet, Victor Lafargue, Edward Raff, Thejaswi Nanditale, Tim Oates, John Zedlewski, and Joshua Patterson. 2020. Bringing UMAP Closer to the Speed of Light with GPU Acceleration. arXiv preprint arXiv:2008.00325 (2020).

Amra Omanović, Hilal Kazan, Polona Oblak, and Tomaž Curk. 2020. Data embedding and prediction by sparse tropical matrix factorization. arXiv:2012.05210 [cs.LG]

Jia Pan and Dinesh Manocha. 2011. Fast GPU-based locality sensitive hashing for k-nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems. 211–220.

Sebastian Raschka, Joshua Patterson, and Corey Nolet. 2020. Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 11, 4 (2020), 193.

JS Ratti and Y-F Lin. 1971. The graphs of semirings. II. Proc. Amer. Math. Soc. 30, 3 (1971), 473–478.

Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. 2001. A generalized representer theorem. In International conference on computational learning theory. Springer, 416–426.

Bernhard Schölkopf and Alexander J Smola. 2018. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning series.

Naser Sedaghati, Arash Ashari, Louis-Noël Pouchet, Srinivasan Parthasarathy, and P Sadayappan. 2015. Characterizing dataset dependence for sparse matrix-vector multiplication on GPUs. In Proceedings of the 2nd workshop on parallel programming for analytics applications. 17–24.

Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. 2007. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory. Springer, 13–31.

Ian Soboroff. 2018. Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). ACM, New York, NY, USA, 713–722. https://doi.org/10.1145/3269206.3271719

Zhekai Zhang, Hanrui Wang, Song Han, and William J Dally. 2020. SpArch: Efficient architecture for sparse matrix multiplication. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 261–274.

Brady Beida Zhou. 2018. GPU accelerated k-nearest neighbor kernel for sparse feature datasets. Ph.D. Dissertation.
We can parallelize this by evaluating the semiring of b over each row vector of A independently, iterating through the nonzero columns of each vector in A and fetching or looking up the corresponding column from b (if it is nonzero). With the standard dot-product semiring, which annihilates multiplicatively over the additive identity, we only need to consider the intersection of columns where both sides are nonzero (column 3 in this example).

B.2 Artifact check-list (meta-information)

• Algorithm: sparse matrix-vector multiplication, pairwise distance

• Program: rapids, cuml, raft

• Compilation: cmake, python

• Binary: source build
• Data set: movielens, ny times bow, sec edgar, scrna

• Run-time environment: linux, 64-bit, x86 64

• Hardware: gpu, dgx1, v100, volta

• Metrics: End-to-end runtime performance

B.3.3 Software dependencies

These benchmarks were executed using a custom branch of RAPIDS cuML version 0.19 and CUDA toolkit 11.0. All dependencies should be installed with Anaconda, and instructions to install them are provided in the documentation.

B.3.4 Data sets

MovieLens 20M Ratings: https://files.grouplens.org/datasets/movielens/ml-20m.zip

NY Times Bag of Words: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
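For reference, a minimal loader sketch for the NY Times file (our own helper, assuming the standard UCI bag-of-words layout of three header lines, the document count, vocabulary size, and nonzero count, followed by docID wordID count triplets; it is not part of the benchmark scripts):

    # Illustrative loader for the UCI bag-of-words format (docID wordID count),
    # producing a SciPy CSR matrix.
    import gzip
    import scipy.sparse as sp

    def load_uci_bow(path):
        with gzip.open(path, "rt") as f:
            n_docs = int(f.readline())
            n_words = int(f.readline())
            n_nnz = int(f.readline())
            rows, cols, vals = [], [], []
            for line in f:
                d, w, c = line.split()
                rows.append(int(d) - 1)    # UCI ids are 1-based
                cols.append(int(w) - 1)
                vals.append(float(c))
        assert len(vals) == n_nnz
        return sp.csr_matrix((vals, (rows, cols)), shape=(n_docs, n_words))

    # X = load_uci_bow("docword.nytimes.txt.gz")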
B.4 Installation

After installing the software dependencies in an Anaconda environment by following the instructions provided in the Github repository, the cuML source code needs to be built and installed from the branch-0.19 branch.

• For MovieLens, the optimized version should be faster than the baseline for all distances.

• For SEC Edgar, the optimized version should be faster than the baseline for all distances.

• For NY Times, the optimized version should be faster than the baseline for all of the non-trivial distances. For the dot-product based distances, the baseline should be faster than the optimized version.

• For scRNA, the optimized version should be faster than the baseline for all of the non-trivial distances. For the dot-product based distances, the baseline should be faster than the optimized version.