
Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores

Mar 19, 2021

By Takuma Yamaguchi and Federico Busato

Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of linear systems, computing eigenvalues through the preconditioned conjugate gradient, and Krylov subspace iterative solvers with multiple right-hand sides. SpMM is also an important kernel used in many domains such as fluid dynamics, deep learning, graph analytics, and economic modeling. In the specific context of deep learning, sparsity has emerged as one of the leading approaches for increasing training and inference performance as well as reducing model size while maintaining accuracy.

Even though sparse linear algebra allows huge matrices to be represented very efficiently, it typically does not provide competitive performance compared to its dense counterparts when sparsity is below 95%, due to irregular computation and scattered memory accesses. In fact, many of the linear algebra applications that benefit from sparsity have over 99% sparsity in their matrices.

To overcome this limitation, the NVIDIA Ampere architecture introduces the concept of fine-grained structured sparsity, which doubles the throughput of dense-matrix multiplication by skipping the computation of zero values in a 2:4 pattern. Recently, NVIDIA introduced the cuSPARSELt library to fully exploit third-generation Sparse Tensor Core capabilities.

The primary alternative to fine-grained sparsity is the organization of matrix entries/network weights in groups, such as vectors or blocks. This coarse-grained sparsity allows regular access patterns and locality, making the computation amenable to GPUs. In deep learning, block sparse matrix multiplication has been successfully adopted to reduce the complexity of the standard self-attention mechanism, as in Sparse Transformer models or in extensions like Longformer.

Starting with cuSPARSE 11.4.0, the CUDA Toolkit provides a new high-performance block sparse matrix multiplication routine that allows exploiting NVIDIA GPU dense Tensor
Cores for nonzero sub-matrices and significantly outperforms dense computations on Volta and newer architecture GPUs.

cuSPARSE Block-SpMM: Efficient, block-wise SpMM


Figure 1 shows the general matrix multiplication (GEMM) operation by using the block sparse format. On the left are the full matrix organized in blocks and its internal memory representation: compressed values and block indices. As in the usual dense GEMM, the computation partitions the output matrix into tiles. The kernel computes an output tile by stepping through the active tiles from left to right and accumulating the results into the C matrix. Unlike classical GEMM, not all values of the dense matrix B are accessed for computing the output. This approach skips the unnecessary computation represented by zero blocks and dramatically improves performance. A reference sketch of this traversal follows.
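To make the traversal concrete, here is a minimal single-threaded host sketch (not from the original post) of this block-sparse GEMM. The array names mirror the cuSPARSE example later in this post, and treating a column index of -1 as a padding block is an assumption made for illustration:

#include <cstddef>

// Reference sketch: C += A * B, with A stored in Blocked-ELL format.
// ell_values is an M x ell_cols row-major array of block entries and
// ell_colidx holds one block-column index per stored block; bs is the
// (square) block size and a column index of -1 marks a padding block.
void block_spmm_reference(const float* ell_values, const int* ell_colidx,
                          const float* B, float* C,
                          int M, int N, int bs, int ell_cols) {
    int blocks_per_row = ell_cols / bs;
    for (int bi = 0; bi < M / bs; ++bi) {           // block row of A = row tile of C
        for (int e = 0; e < blocks_per_row; ++e) {  // step through active tiles
            int bj = ell_colidx[bi * blocks_per_row + e];
            if (bj < 0)
                continue;                           // skip padding blocks entirely
            for (int i = 0; i < bs; ++i)            // row inside the block
                for (int j = 0; j < N; ++j)         // column of B and C
                    for (int k = 0; k < bs; ++k)    // column inside the block
                        C[(std::size_t)(bi * bs + i) * N + j] +=
                            ell_values[(std::size_t)(bi * bs + i) * ell_cols
                                       + e * bs + k]
                            * B[(std::size_t)(bj * bs + k) * N + j];
        }
    }
}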

Figure 1. GEMM using block sparse weights and dense activations.

Blocked-Ellpack format
Figure 2 shows that the Blocked-Ellpack (Blocked-ELL) storage format contains two 2-D arrays. The first array stores the nonzero values in consecutive blocks, while the second array contains the column indices of the corresponding nonzero blocks. All rows in the arrays must have the same number of blocks; non-structural zero blocks are also accepted as padding. These arrays store their components in row-major order, like the compressed sparse row (CSR) format.

Figure 2. Definition of Blocked-ELL format.
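As a concrete illustration (not from the original post), consider a 4×4 matrix with two nonzero 2×2 blocks on its diagonal. Its Blocked-ELL representation consists of the following two arrays, shown here with float values for readability:

// The matrix
//   | 1 2 0 0 |
//   | 3 4 0 0 |
//   | 0 0 5 6 |
//   | 0 0 7 8 |
// in Blocked-ELL format with ell_blocksize = 2 and ell_cols = 2
// (each block row stores exactly one block)
int   ell_blocksize = 2;
int   ell_cols      = 2;        // width of the values array, in elements
int   ell_colidx[]  = { 0, 1 }; // block-column index of each stored block
float ell_values[]  = { 1, 2,   // values array: 4 rows x 2 columns, row-major
                        3, 4,
                        5, 6,
                        7, 8 };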

cuSPARSE SpMM
The cuSPARSE library provides the cusparseSpMM routine for SpMM operations, which computes the following multiplication:

C = α·op(A)·op(B) + β·C

In this operation, A is a sparse matrix of size M×K, while B and C are dense matrices of size K×N and M×N, respectively. The layout of matrix B is denoted by N for row-major order, where op(B) is non-transposed, and by T for column-major order, where op(B) is transposed.

cusparseSpMM selects suitable kernels depending on the storage format, the number of nonzero components, and the matrix layouts. The routine supports the CSR and Coordinate (COO) formats, as well as the new Blocked-ELL storage format. Table 1 shows the supported data types, layouts, and compute types.

| A/B type | C type   | Compute type | op(A) | op(B) | Compute capability | Architecture    |
|----------|----------|--------------|-------|-------|--------------------|-----------------|
| half     | half     | half         | N     | N, T  | ≥ 7.0              | ≥ Volta         |
| half     | half     | float        | N     | N, T  | ≥ 7.0              | ≥ Volta         |
| half     | float    | float        | N     | N, T  | ≥ 7.0              | ≥ Volta         |
| int8     | int8     | int          | N     | N     | ≥ 7.5              | ≥ Turing        |
| bfloat16 | bfloat16 | float        | N     | N, T  | ≥ 8.0              | ≥ NVIDIA Ampere |
| bfloat16 | float    | float        | N     | N, T  | ≥ 8.0              | ≥ NVIDIA Ampere |
| float    | float    | float        | N     | N, T  | ≥ 8.0              | ≥ NVIDIA Ampere |
| double   | double   | double       | N     | N, T  | ≥ 8.0              | ≥ NVIDIA Ampere |
Table 1. Supported data types, layouts, and architectures in cusparseSpMM with Blocked-ELL storage format.

Block-SpMM performance
Here’s a snapshot of the relative performance of dense and sparse-matrix multiplications exploiting NVIDIA GPU Tensor Cores. Figures 3 and 4 show the performance of Block-SpMM on NVIDIA V100 and A100 GPUs with the following settings:

Matrix sizes: M=N=K=4096.
Block sizes: 32 and 16.
Input/output data type: half (fp16).
Computation data type: float (fp32).

The speedup over cuBLAS is nearly linear in the sparsity on both NVIDIA V100 and A100 GPUs. When the block size is 32, the kernel is faster than cuBLAS when the density is less than 40% on the NVIDIA Volta architecture and less than 50% on the NVIDIA Ampere architecture.

For better performance, it is important to satisfy the following conditions:

Use large block sizes, preferably a power of two.
Use 128-byte aligned pointers for matrices for vectorized memory access; a minimal alignment check is sketched below.
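Regarding the second condition: cudaMalloc returns allocations aligned to at least 256 bytes, so freshly allocated matrices already satisfy the requirement; the check matters mainly when offsetting into a larger buffer. A minimal sketch (not from the original post), using the device pointers from the code example below:

#include <cassert>
#include <cstdint>

// Verify 128-byte alignment of the dense matrix pointers before calling SpMM
assert(reinterpret_cast<std::uintptr_t>(dB) % 128 == 0);
assert(reinterpret_cast<std::uintptr_t>(dC) % 128 == 0);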

Figure 3. Speedup of cuSPARSE Block-SpMM over Dense GEMM in cuBLAS on NVIDIA V100, fp16 in/out, fp32 compute, NN layout, CUDA Toolkit 11.2.1.

Figure 4. Speedup of cuSPARSE Block-SpMM over Dense GEMM in cuBLAS on NVIDIA A100, fp16 in/out, fp32 compute, NN layout, CUDA Toolkit 11.2.1.

Block-SpMM code example


For this new storage format, perform similar steps as with the CSR and COO formats in cusparseSpMM. For more information, see the cuSPARSE/spmm_blockedell repo.

First, include the cuSPARSE header, set up some device pointers, and initialize the cuSPARSE handle:

#include <cusparse.h>

// Initialize the cuSPARSE library handle
cusparseHandle_t handle = nullptr;
cusparseCreate(&handle);

// Scalars for C = alpha * op(A) * op(B) + beta * C
float alpha = 1.0f;
float beta  = 0.0f;

// Device pointers: Blocked-ELL arrays of A, and the dense matrices B and C
int*    d_ell_colidx = ...
__half* d_ell_values = ...
__half* dB = ...
__half* dC = ...
int ell_blocksize = 32;
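The snippets here omit error checking for brevity. In real code, you would likely wrap each call with a status check; here is a minimal sketch (not from the original post) using the cusparseGetErrorString utility:

#include <cstdio>
#include <cstdlib>

// Abort with a readable message when a cuSPARSE call fails
#define CHECK_CUSPARSE(call)                                           \
    do {                                                               \
        cusparseStatus_t status_ = (call);                             \
        if (status_ != CUSPARSE_STATUS_SUCCESS) {                      \
            std::printf("cuSPARSE error at line %d: %s\n",             \
                        __LINE__, cusparseGetErrorString(status_));    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Example usage: CHECK_CUSPARSE(cusparseCreate(&handle));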

Next, create the block sparse input matrix A, dense input matrix B, and dense output matrix C descriptors:

cusparseSpMatDescr_t matA;
cusparseDnMatDescr_t matB, matC;
// Blocked-ELL descriptor for the sparse matrix A
cusparseCreateBlockedEll(&matA, A_num_rows, A_num_cols,
                         ell_blocksize, ell_cols,
                         d_ell_colidx, d_ell_values,
                         CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO,
                         AB_type);
// Row-major dense descriptors for B and C
cusparseCreateDnMat(&matB, B_num_rows, B_num_cols, B_ld,
                    dB, AB_type, CUSPARSE_ORDER_ROW);
cusparseCreateDnMat(&matC, C_num_rows, C_num_cols, C_ld,
                    dC, C_type, CUSPARSE_ORDER_ROW);

Then, allocate an external buffer for the multiplication:

// Query the workspace size, then allocate it
void*  dBuffer    = nullptr;
size_t bufferSize = 0;
cusparseSpMM_bufferSize(handle,
                        CUSPARSE_OPERATION_NON_TRANSPOSE,
                        CUSPARSE_OPERATION_NON_TRANSPOSE,
                        &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                        CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);

Now you can execute SpMM:

cusparseSpMM(handle,
             CUSPARSE_OPERATION_NON_TRANSPOSE,  // opA
             CUSPARSE_OPERATION_NON_TRANSPOSE,  // opB
             &alpha, matA, matB, &beta, matC, CUDA_R_32F,
             CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);

Finally, destroy the cuSPARSE descriptors and handle and clean up the used memory:

cusparseDestroySpMat(matA);
cusparseDestroyDnMat(matB);
cusparseDestroyDnMat(matC);
cusparseDestroy(handle);
cudaFree(dBuffer);
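The complete, runnable version of this example is in the cuSPARSE/spmm_blockedell repo linked above. Assuming the snippets are assembled into a single file named spmm_blockedell.cu (a hypothetical name for this sketch), a plausible build line is:

nvcc -arch=sm_70 spmm_blockedell.cu -o spmm_blockedell -lcusparse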

Get started with cuSPARSE Block-SpMM
The cuSPARSE library now provides fast kernels for block SpMM that exploit NVIDIA Tensor Cores. With the Blocked-ELL format, you can compute faster than with dense-matrix multiplication, depending on the sparsity of the matrix. The latest version of cuSPARSE can be found in the CUDA Toolkit.

For more information, see the following resources:

OpenAI Block-Sparse GPU Kernels


Efficient Transformers: A Survey
Generating Long Sequences with Sparse Transformers
Fast Block Sparse Matrices for Pytorch
cuSPARSE documentation

Related resources
DLI course: Fundamentals of Accelerated Computing with CUDA Python
GTC session: Developing Optimal CUDA Kernels on Hopper Tensor Cores (Spring 2023)
GTC session: Recent Developments in NVIDIA Math Libraries (Spring 2023)
SDK: cuBLASMg
SDK: cuBLAS
SDK: cuBLASXt


Tags
Data Center / Cloud | Simulation / Modeling / Design | Ampere | CUDA | CuSPARSE | Featured

About the Authors


About Takuma Yamaguchi
Takuma Yamaguchi is a senior software engineer in the CUDA Math Libraries group at NVIDIA, where he works on the optimization of quantum algorithms
in cuStateVec. He holds a Ph.D. in civil engineering from the University of Tokyo.
View all posts by Takuma Yamaguchi

About Federico Busato


Federico Busato is a senior software engineer in the CUDA Math Libraries group and team lead at NVIDIA since 2018. He primarily works on the cuSPARSE
and cuSPARSELt libraries, focusing on new features and performance optimization. Federico holds a PhD in computer science and his background is in
graph algorithms and sparse computation for GPU architectures.
Follow @fedebusato on Twitter
View all posts by Federico Busato

Comments

Notable Replies
neoblizzz

May 10, 2021


Hi, I am trying to use this new Blocked-ELL SpMM and I am having issues understanding how Blocked-ELL is constructed. Does cuSPARSE or any other library provide a dense-matrix-to-Blocked-ELL conversion (much like CSR or other sparse formats in cuSPARSE)? It is really intuitive to understand what CSR and COO are and do, but the construction of ELL seems implementation-based, and it is not entirely clear to me how that is done from a single paragraph and a picture.
Thanks!

fbusato

May 11, 2021


Hi, my first suggestion is to take a look at the figure in the documentation https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-spmat-create-blockedell that highlights the memory layout. You can also find an example of usage on the GitHub samples page https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmm_blockedell. Finally, in the upcoming release we are going to provide a conversion routine from dense to Blocked-ELL through the DenseToSparse routine https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-dense2sparse
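A rough sketch of what that conversion path looks like with the generic API (treating Blocked-ELL as the destination assumes the upcoming release mentioned above):

// Dense -> Blocked-ELL through the generic DenseToSparse path.
// matDense is an existing cusparseDnMatDescr_t; matEll was created with
// cusparseCreateBlockedEll as in the blog post.
size_t bufferSize = 0;
void*  dBuffer    = nullptr;
cusparseDenseToSparse_bufferSize(handle, matDense, matEll,
                                 CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
                                 &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseDenseToSparse_analysis(handle, matDense, matEll,
                               CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);
cusparseDenseToSparse_convert(handle, matDense, matEll,
                              CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);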

neoblizzz

May 12, 2021


Thank you! I have seen those resources; I guess what I am struggling with is how the column-index array for Blocked-ELL is created. The memory layout of the very small example doesn't seem sufficient to understand how I can take an existing dense matrix, write its values in this blocked format to a values array, and somehow generate the column indices for them.
The most useful thing I found was the figure in this blog. Is there any more context or open-source code available that I can read or look through?
Also looking forward to DenseToSparse support for ELL! Really cool stuff, thank you so much!

neoblizzz

May 12, 2021

Just to further add to that: even a larger example, hand-coded with a visualization to go along with it (like the one in the blog), would be great!
I also had some other questions:
It seems like batched support for the ELL-SpMM version is missing. Are there any plans for that? Is that done through CUDA streams or as complete kernel launches?
Are there any samples available that show batched SpMM for CSR/CSC/COO? The docs say that it is supported, but I can't seem to find how to actually use it.
Again, thanks!

fbusato

May 12, 2021


Thanks for the feedback. Batched ELL-SpMM is currently not supported; we will consider this feature in the future. As for batched SpMM with the standard formats, it is supported, but the sample is not available yet. I will add it in the next few days.

Continue the discussion at forums.developer.nvidia.com


16 more replies


Related posts

Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines

Accelerating ReLu and GeLu Activation Functions, and Batched Sparse GEMM in cuSPARSELt v0.2.0


cuSPARSELt v0.1.0 Now Available: Arm and Windows Support

Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt

CUTLASS: Fast Linear Algebra in CUDA C++

