0% found this document useful (0 votes)

104 views7 pages

Accelerating Matrix Multiplication With Block Sparse Format and NVIDIA Tensor Cores - NVIDIA Technical Blog

This document summarizes research on accelerating sparse matrix multiplication using NVIDIA GPUs and tensor cores. It introduces a blocked sparse format that allows skipping unnecessary computations and exploiting tensor cores. Benchmark results show the blocked sparse method outperforms dense multiplication when sparsity is above 40-50%, achieving near-linear speedups with sparsity. The document provides code examples for performing blocked sparse matrix multiplication using the NVIDIA cuSPARSE library.

Uploaded by

Shrey Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

104 views7 pages

Accelerating Matrix Multiplication With Block Sparse Format and NVIDIA Tensor Cores - NVIDIA Technical Blog

Uploaded by

Shrey Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical

rse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

DEVELOPER

Technical Blog Subscribe

Data Center / Cloud

Accelerating Matrix Multiplication with Block Sparse

Format and NVIDIA Tensor Cores
Mar 19, 2021  +3 Like 􏒦 Discuss (21)


By Takuma Yamaguchi and Federico Busato

Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of
linear systems, computing eigenvalues through the preconditioned conjugate gradient, and multiple right-hand sides Krylov subspace iterative solvers. SpMM is also an important
kernel used in many domains such as fluid dynamics, deep learning, graph analytics, and economic modeling. In the specific context of deep learning, sparsity has emerged as one
of the leading approaches for increasing training and inference performance as well as reducing the model sizes while keeping the accuracy.

Even though sparse linear algebra allows representing huge matrices very efficiently, it typically does not provide competitive
performance compared to dense counterparts in cases when sparsity is below 95%. This is due to irregular computation and
scattered memory accesses. In fact, many of the linear algebra applications that benefit from sparsity have over 99% sparsity in their
matrices.

To overcome this limitation, the NVIDIA Ampere architecture introduces the concept of fine-grained structured sparsity, which doubles
throughput of dense-matrix multiplies by skipping the computation of zero values in a 2:4 pattern. Recently, NVIDIA introduced the
cuSPARSELt library to fully exploit third-generation Sparse Tensor Core capabilities.

The primary alternative to fine-grained sparsity is through the organization of matrix entries/network weights in groups, such as vectors or blocks. This coarse-grained sparsity
allows regular access pattern and locality, making the computation amenable for GPUs. In deep learning, block sparse matrix multiplication is successfully adopted to reduce the
complexity of the standard self-attention mechanism, such as in Sparse Transformer models or in its extensions like Longformer.

Starting with cuSPARSE 11.4.0, the CUDA Toolkit provides a new high-performance block sparse matrix multiplication routine that allows exploiting NVIDIA GPU dense Tensor
Cores for nonzero sub-matrices and significantly outperforms dense computations on Volta and newer architecture GPUs.

cuSPARSE Block-SpMM: Efficient, block-wise SpMM

Figure 1 shows the general matrix multiplication (GEMM) operation by using the block sparse format. On the left are the full matrix organized in blocks and its internal memory
representation: compressed values and block indices. As the usual dense GEMM, the computation partitions the output matrix into tiles. The kernel computes the output tile by
stepping through the active tiles (left to right) and accumulates the results into the C matrix. Differently from classical GEMM, not all values of the dense-matrix B are accessed
for computing the output. This approach allows skipping unnecessary computation represented by nonzero values and dramatically improves the performance.

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 1/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

DEVELOPER

Figure 1. GEMM using block sparse weights and dense activations.

Blocked-Ellpack format
Figure 2 shows that the Blocked-Ellpack (Blocked-ELL) storage format contains two 2-D arrays. The right array stores nonzero values in consecutive blocks, while the second array
contains the column indices of the corresponding nonzero blocks. All rows in the arrays must have the same number of blocks. Non-structural zero blocks are also accepted as
padding. These arrays store components in row-major order, like the compressed sparse row (CSR) format.

Figure 2. Definition of Blocked-ELL format.

cuSPARSE SpMM
The cuSPARSE library provides cusparseSpMM routine for SpMM operations. Compute the following multiplication:

In this operation, A is a sparse matrix of size MxK, while B and C are dense matrices of size KxN MxN, respectively. Denote the layouts of the matrix B with N for row-major order,
where op is non-transposed, and T for column-major order, where op is transposed.

cusparseSpMM selects suitable kernels depending on the storage format, the number of nonzero components, and matrix layouts. This routine supports CSR, Coordinate (COO), as
well as the new Blocked-ELL storage formats. Table 1 shows the supported data types, layouts, and compute types.

A/B type C type Compute type op(A) op(B) Compute capability Architecture

half half half N N, T ≥ 7.0 ≥ Volta

half half float N N, T ≥ 7.0 ≥ Volta

half float float N N, T ≥ 7.0 ≥ Volta

int8 int8 int N N ≥ 7.5 ≥ Turing

bfloat16 bfloat16 float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

bfloat16 float float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

float float float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

double double double N N, T ≥ 8.0 ≥ NVIDIA AMPERE

Table 1. Supported data types, layouts, and architectures in cusparseSpMM with Blocked-ELL storage format.

Block-SpMM performance
Here’s a snapshot of the relative performance of dense and sparse-matrix multiplications exploiting NVIDIA GPU Tensor Cores. Figures 3 and 4 show the performance of Block-
SpMM on NVIDIA V100 and A100 GPUs with the following settings:

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 2/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

Matrix sizes: M=N=K=4096.

Block sizes: 32 and 16. DEVELOPER
Input/output data type: half (fp16).
Computation data type: float (fp32).

The speedup ratio compared to cuBLAS is nearly linear to the sparsity on both NVIDIA V100 and A100 GPUs. When the block size is 32, the kernel is faster than cuBLAS if the
density is less than 40% on NVIDIA Volta and 50% on NVIDIA Ampere architecture.

For better performance, it is important to satisfy the following conditions:

Use large block sizes, preferably a power-of-2.

Use 128-byte aligned pointers for matrices for vectorized memory access.

Figure 3. Speedup of cuSPARSE Block-SpMM over Dense GEMM in cuBLAS on NVIDIA V100, fp16 in/out, fp32 compute, NN layout, CUDA Toolkit 11.2.1.

Figure 4. Speedup of cuSPARSE Block-SpMM over Dense GEMM in cuBLAS on NVIDIA A100, fp16 in/out, fp32 compute, NN layout, CUDA Toolkit 11.2.1.

Block-SpMM code example

For this new storage format, perform similar steps as with CSR and COO cusparseSpMM. For more information, see the cuSPARSE/spmm_blockedell repo.

First, include the cuSPARSE header, set up some device pointers, and initialize the cuSPARSE handle:

#include <cusparse.h>
cusparseHandle_t handle = nullptr;
cusparseCreate(&handle);
float alpha = 1.0f;
float beta = 0.0f;
int* d_ell_colidx = ...
__half* d_ell_values = ...
__half* dB = ...
__half* dC = …
int ell_blocksize = 32;

Next, create the block sparse input matrix A, dense input matrix B, and dense output matrix C descriptors:

cusparseSpMatDescr_t matA;
cusparseDnMatDescr_t matB, matC;
cusparseCreateBlockedEll(&matA, A_num_rows, A_num_cols,
ell_blocksize, ell_cols,
d_ell_colidx, d_ell_values,
CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO,
AB_type);
cusparseCreateDnMat(&matB, B_num_rows, B_num_cols, B_ld,
d_B, AB_type, CUSPARSE_ORDER_ROW);
cusparseCreateDnMat(&matC, C_num_rows, C_num_cols, C_ld,
d_C, C_type, CUSPARSE_ORDER_ROW);

Then, allocate an external buffer for the multiplication:

void* d_buffer;
cusparseSpMM_bufferSize(handle,
CUSPARSE_OPERATION_NON_TRANSPOSE,
CUSPARSE_OPERATION_NON_TRANSPOSE,
&alpha, matA, matB, &beta, matC, CUDA_R_32F,
CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);

Now you can execute SpMM:

cusparseSpMM(handle, opA, opB, alpha, matA, matB,

beta, matC, compute_type,
CUSPARSE_SPMM_ALG_DEFAULT, d_buffer);

Finally, destroy the cuSPARSE descriptors and handle and clean up the used memory:

cusparseDestroySpMat(matA);
cusparseDestroyDnMat(matB);
cusparseDestroyDnMat(matC);
cusparseDestroy(handle);
cudaFree(dBuffer);

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 3/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

Get startedDEVELOPER
with cuSPARSE Block-SpMM
The cuSPARSE library now provides fast kernels for block SpMM exploiting NVIDIA Tensor Cores. With the Blocked-ELL format, you can compute faster than dense-matrix
multiplication depending on the sparsity of the matrix. The latest version of cuSPARSE can be found in the CUDA Toolkit.

For more information, see the following resources:

OpenAI Block-Sparse GPU Kernels

Efficient Transformers: A Survey
Generating Long Sequences with Sparse Transformers
Fast Block Sparse Matrices for Pytorch
cuSPARSE documentation

Related resources
DLI course: Fundamentals of Accelerated Computing with CUDA Python
GTC session: Developing Optimal CUDA Kernels on Hopper Tensor Cores (Spring 2023)
GTC session: Recent Developments in NVIDIA Math Libraries (Spring 2023)
SDK: cuBLASMg
SDK: cuBLAS
SDK: cuBLASXt

􏒦 Discuss (21)
  +3 Like

About the Authors

About Takuma Yamaguchi
Takuma Yamaguchi is a senior software engineer in the CUDA Math Libraries group at NVIDIA, where he works on the optimization of quantum algorithms
in cuStateVec. He holds a Ph.D. in civil engineering from the University of Tokyo.
View all posts by Takuma Yamaguchi

About Federico Busato

Federico Busato is a senior software engineer in the CUDA Math Libraries group and team lead at NVIDIA since 2018. He primarily works on the cuSPARSE
and cuSPARSELt libraries, focusing on new features and performance optimization. Federico holds a PhD in computer science and his background is in
graph algorithms and sparse computation for GPU architectures.
Follow @fedebusato on Twitter
View all posts by Federico Busato

Comments

Notable Replies
neoblizzz

May 10, 2021

Hi, I am trying to use this new blocked-ELL SpMM, having issues understanding how blocked-ELL is constructed. Does cusparse or any other library provide a dense matrix to blocked-ELL
conversion (much like CSR or other sparse-formats in cusparse). It is really intuitive to understand what CSR and COO are and do, but a construction of ELL seems implementation based, and it is
not necessarily clear to me how that is done with a single paragraph and a picture.
Thanks!

fbusato

May 11, 2021

Hi, my first suggestion is to take a look at the figure in the documentation https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-spmat-create-blockedell that highlights the
memory layout. You can also find an example of usage in the Github samples page https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmm_blockedell. Finally, we are going
to provide in the upcoming release a conversion routine from dense to BlockedEll by using the routine DenseToSparse https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-
function-dense2sparse

neoblizzz

May 12, 2021

Thank you! I have seen those resources, I guess what I am struggling with is how the column-idx array for Blocked-ELL is created, and the memory layout or the very small example doesn’t seem
super sufficient to understand how I can take an existing dense matrix, write values in this blocked-format to a values array, and somehow generate the column indices for those.
The most useful thing I found was the figure in this blog, any more context or open-source code that you have available that I can read/look through?
Also looking forward to DenseToSparse support for ELL! Really cool stuff, thank you so much!

neoblizzz

May 12, 2021

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 4/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog
Just to further add to that, even if you had a larger example hand-coded with a visualization to go along with it (like the one in the blog) that will be great!
I also had some other questions: DEVELOPER
It seems like batched support for ELL-SpMM version is missing. Are their any plans for that? Is that done through cuda streams or as complete kernel launches?
Are their any samples available that show batched SpMM for CSR/CSC/COO? The docs say that it is supported but I can’t seem to find how to actually use it.
Again, thanks!

fbusato

May 12, 2021

thanks for the feedback. Batched Ell-SpMM is currently not supported. We will consider this feature in the future. While about bached SpMM for standard format, it is supported but the sample is
not available. I will add it in the next few days.

Continue the discussion at forums.developer.nvidia.com

16 more replies

Participants

Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines

Accelerating ReLu and GeLu Activation Functions, and Batched Sparse GEMM in cuSPARSELt v0.2.0

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 5/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

DEVELOPER

cuSPARSELt v0.1.0 Now Available: Arm and Windows Support

Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt

CUTLASS: Fast Linear Algebra in CUDA C++

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 6/7
8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

DEVELOPER

Follow NVIDIA Developer

https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/ 7/7

BOB Sustainability Report English 2023 09-01-2024 V2
No ratings yet
BOB Sustainability Report English 2023 09-01-2024 V2
196 pages
San Miguel Corporation Business Model Canvas
71% (7)
San Miguel Corporation Business Model Canvas
2 pages
(Viral) Kamal Kaur Viral Video Original Link
No ratings yet
(Viral) Kamal Kaur Viral Video Original Link
5 pages
Biometry and Experimental Design
100% (1)
Biometry and Experimental Design
106 pages
2007 GMC Acadia 3.6L Vin 7 Electric Diagrams 4of5
57% (7)
2007 GMC Acadia 3.6L Vin 7 Electric Diagrams 4of5
1 page
High Performance Computing - Project Report
100% (2)
High Performance Computing - Project Report
58 pages
Nvidia XID - Errors
No ratings yet
Nvidia XID - Errors
12 pages
Salih GÖKMEN - 07.2021
No ratings yet
Salih GÖKMEN - 07.2021
114 pages
5 PDF
No ratings yet
5 PDF
1 page
PSAT Bahasa Inggris Kelas 10
No ratings yet
PSAT Bahasa Inggris Kelas 10
5 pages
Ece408 Lecture19 Sparse Matrix VK SP23
No ratings yet
Ece408 Lecture19 Sparse Matrix VK SP23
28 pages
Yang 2018 Europa R
No ratings yet
Yang 2018 Europa R
16 pages
Waiver
No ratings yet
Waiver
6 pages
Assignment 4 (MAN 001)
No ratings yet
Assignment 4 (MAN 001)
2 pages
Japan - ICFG International Cold Forging Group
No ratings yet
Japan - ICFG International Cold Forging Group
12 pages
Assignment 1 (MAN 001)
No ratings yet
Assignment 1 (MAN 001)
3 pages
Aluminium Foil
0% (1)
Aluminium Foil
45 pages
Mohammed - PMP, ASM - ITIL - Resume For - SAP Project Manager
No ratings yet
Mohammed - PMP, ASM - ITIL - Resume For - SAP Project Manager
5 pages
Zoom System Design PDF
No ratings yet
Zoom System Design PDF
9 pages
AMV in Pharma
No ratings yet
AMV in Pharma
13 pages
Cotton Case Study
No ratings yet
Cotton Case Study
2 pages
Assignment 3 (MAN 001)
No ratings yet
Assignment 3 (MAN 001)
2 pages
Koala - Wikipedia
No ratings yet
Koala - Wikipedia
24 pages
Design Patterns
No ratings yet
Design Patterns
94 pages
Assignment 5 (MAN 001)
No ratings yet
Assignment 5 (MAN 001)
2 pages
Lecture - 8 Steady State Diffusion Equation
No ratings yet
Lecture - 8 Steady State Diffusion Equation
16 pages
Project Africa Now
No ratings yet
Project Africa Now
6 pages
Ankur's Resume
No ratings yet
Ankur's Resume
2 pages
Avr4311 E2
No ratings yet
Avr4311 E2
2 pages
Be 20230428
No ratings yet
Be 20230428
8 pages
Scalar Security Study 2019
No ratings yet
Scalar Security Study 2019
76 pages
NPN Opportunity Registration 2.0: Partner Update
No ratings yet
NPN Opportunity Registration 2.0: Partner Update
13 pages
Sustainability Analysis On NVDA
100% (1)
Sustainability Analysis On NVDA
24 pages
Module 2 Class 1
No ratings yet
Module 2 Class 1
9 pages
NVIDIA Ampere GA102 GPU Architecture Whitepaper V1 PDF
No ratings yet
NVIDIA Ampere GA102 GPU Architecture Whitepaper V1 PDF
44 pages
ResQ Redis Cache ER Diagram
No ratings yet
ResQ Redis Cache ER Diagram
1 page
SWOT Analysis NVIDIA
No ratings yet
SWOT Analysis NVIDIA
7 pages
Educ 102
No ratings yet
Educ 102
3 pages
High Performance Computing
No ratings yet
High Performance Computing
1 page
Versa CSeries Aluminum Solenoid Valves
No ratings yet
Versa CSeries Aluminum Solenoid Valves
24 pages
Network Administrator or Configuration Manager or Application de
No ratings yet
Network Administrator or Configuration Manager or Application de
2 pages
GB922 Addendum 5LR Logical Resource R9 0 V9-3
No ratings yet
GB922 Addendum 5LR Logical Resource R9 0 V9-3
351 pages
Using Ffmpeg With Nvidia Gpu Hardware Acceleration: Application Note
No ratings yet
Using Ffmpeg With Nvidia Gpu Hardware Acceleration: Application Note
20 pages
Introduction To High Performance Scientific Computing
No ratings yet
Introduction To High Performance Scientific Computing
464 pages
DGX Solution Stack Whitepaper
No ratings yet
DGX Solution Stack Whitepaper
24 pages
Nvidia - Rapids
No ratings yet
Nvidia - Rapids
33 pages
Using FFmpeg With NVIDIA GPU Hardware Acceleration
No ratings yet
Using FFmpeg With NVIDIA GPU Hardware Acceleration
22 pages
dgx2 User Guide
No ratings yet
dgx2 User Guide
125 pages
Slurm. Our Way.: Douglas Jacobsen, James Botts, Helen He Nersc
No ratings yet
Slurm. Our Way.: Douglas Jacobsen, James Botts, Helen He Nersc
13 pages
Chemical Catalog
No ratings yet
Chemical Catalog
58 pages
CUDA Installation Guide Windows
100% (1)
CUDA Installation Guide Windows
17 pages
Cap 100M PDF
No ratings yet
Cap 100M PDF
35 pages
Tesla V100 Performance Guide
No ratings yet
Tesla V100 Performance Guide
23 pages
Nvidia Opencl Best Practices Guide: Optimization
No ratings yet
Nvidia Opencl Best Practices Guide: Optimization
49 pages
Motor Learning
No ratings yet
Motor Learning
3 pages
HPC - Unit Test-I (9 July 2020) : Mark Only One Oval
No ratings yet
HPC - Unit Test-I (9 July 2020) : Mark Only One Oval
5 pages
Lecture Week - 1 Introduction 1 - SP-24
No ratings yet
Lecture Week - 1 Introduction 1 - SP-24
51 pages
Dgx1 v100 System Architecture Whitepaper
No ratings yet
Dgx1 v100 System Architecture Whitepaper
43 pages
OpenAI SOC 3 Report
No ratings yet
OpenAI SOC 3 Report
12 pages
Triton X-100-1
No ratings yet
Triton X-100-1
9 pages
High Performance Computing Lecture 2 Parallel Programming With MPI Pub
No ratings yet
High Performance Computing Lecture 2 Parallel Programming With MPI Pub
50 pages
The Right Tools For Professionals: Nvidia Workstation Gpus
No ratings yet
The Right Tools For Professionals: Nvidia Workstation Gpus
4 pages
ACER Altos R3 Server Datasheet
No ratings yet
ACER Altos R3 Server Datasheet
2 pages
Introduction To Algorithms
No ratings yet
Introduction To Algorithms
18 pages
CUDA Installation Guide Windows
No ratings yet
CUDA Installation Guide Windows
28 pages
Bulacan Agricultural State College: Republic of The Philippines
No ratings yet
Bulacan Agricultural State College: Republic of The Philippines
10 pages
FOSDEM14 HPC Devroom 12 Sniper
No ratings yet
FOSDEM14 HPC Devroom 12 Sniper
33 pages
Emerging AI Trends: How AI and Its Applications Impact Healthcare Industry
No ratings yet
Emerging AI Trends: How AI and Its Applications Impact Healthcare Industry
18 pages
CC 1 Unit Notes
No ratings yet
CC 1 Unit Notes
8 pages
Speaker - A02 - 5747 - Best Practices in Networking For AI
No ratings yet
Speaker - A02 - 5747 - Best Practices in Networking For AI
15 pages
Insert Project Title: Business Requirements Specification
No ratings yet
Insert Project Title: Business Requirements Specification
19 pages
344.48 Nvidia Control Panel Quick Start Guide PDF
No ratings yet
344.48 Nvidia Control Panel Quick Start Guide PDF
33 pages
Microsoft in High Performance Computing: An Introduction: Aditya Krishnan Technical Product Manager Microsoft Corp
No ratings yet
Microsoft in High Performance Computing: An Introduction: Aditya Krishnan Technical Product Manager Microsoft Corp
21 pages
CATIAV5 Generative Part Structural Analysis
No ratings yet
CATIAV5 Generative Part Structural Analysis
24 pages
Uncertainty in Modeling
No ratings yet
Uncertainty in Modeling
25 pages
Nvidia-Learning-Training Course-Catalog
No ratings yet
Nvidia-Learning-Training Course-Catalog
27 pages
Nvidia RTX A2000 Datasheet
No ratings yet
Nvidia RTX A2000 Datasheet
1 page
High Performance Network-on-Chip Through MPLS
No ratings yet
High Performance Network-on-Chip Through MPLS
4 pages
UBL Operations Management
No ratings yet
UBL Operations Management
18 pages
2021-02-04 DAIM Company Presentation
No ratings yet
2021-02-04 DAIM Company Presentation
17 pages
TB 04631 001 - v01
No ratings yet
TB 04631 001 - v01
25 pages
Nvidia DGX A100 Datasheet
No ratings yet
Nvidia DGX A100 Datasheet
2 pages
Nvidia Presentation
No ratings yet
Nvidia Presentation
6 pages
CUDA Compute Unified Device Architecture
No ratings yet
CUDA Compute Unified Device Architecture
26 pages
Software Life-Cycle Management: Openup and Architecture Handbook Overview
No ratings yet
Software Life-Cycle Management: Openup and Architecture Handbook Overview
58 pages
High Performance Computing Update 0908 - InCOSE
No ratings yet
High Performance Computing Update 0908 - InCOSE
19 pages
362.00 Nvidia Control Panel Quick Start Guide
No ratings yet
362.00 Nvidia Control Panel Quick Start Guide
33 pages
Iec60364 5 52 (Ed2.0) en - D PDF
50% (2)
Iec60364 5 52 (Ed2.0) en - D PDF
8 pages
NGC Registry Launch Technical Overview
No ratings yet
NGC Registry Launch Technical Overview
11 pages
HPC Datasheet sc23 h200 Datasheet 3002446
No ratings yet
HPC Datasheet sc23 h200 Datasheet 3002446
3 pages
Nvidia DGX Station Print Infographic 738375 Web
No ratings yet
Nvidia DGX Station Print Infographic 738375 Web
1 page
Tutorial #1:the Essential ANSYS.: ME309: Finite Element Analysis in Mechanical Design
No ratings yet
Tutorial #1:the Essential ANSYS.: ME309: Finite Element Analysis in Mechanical Design
9 pages
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
From Everand
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
Robert Johnson
No ratings yet
Hybrid Neural Networks: Fundamentals and Applications for Interacting Biological Neural Networks with Artificial Neuronal Models
From Everand
Hybrid Neural Networks: Fundamentals and Applications for Interacting Biological Neural Networks with Artificial Neuronal Models
Fouad Sabry
No ratings yet

Accelerating Matrix Multiplication With Block Sparse Format and NVIDIA Tensor Cores - NVIDIA Technical Blog

Uploaded by

Accelerating Matrix Multiplication With Block Sparse Format and NVIDIA Tensor Cores - NVIDIA Technical Blog

Uploaded by

8/4/23, 1:45 PM Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores | NVIDIA Technical

rse Format and NVIDIA Tensor Cores | NVIDIA Technical Blog

Technical Blog Subscribe

Data Center / Cloud

Accelerating Matrix Multiplication with Block Sparse

By Takuma Yamaguchi and Federico Busato

cuSPARSE Block-SpMM: Efficient, block-wise SpMM

Figure 1. GEMM using block sparse weights and dense activations.

Figure 2. Definition of Blocked-ELL format.

half half half N N, T ≥ 7.0 ≥ Volta

half half float N N, T ≥ 7.0 ≥ Volta

half float float N N, T ≥ 7.0 ≥ Volta

int8 int8 int N N ≥ 7.5 ≥ Turing

bfloat16 bfloat16 float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

bfloat16 float float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

float float float N N, T ≥ 8.0 ≥ NVIDIA AMPERE

double double double N N, T ≥ 8.0 ≥ NVIDIA AMPERE

Matrix sizes: M=N=K=4096.

For better performance, it is important to satisfy the following conditions:

Use large block sizes, preferably a power-of-2.

Block-SpMM code example

Then, allocate an external buffer for the multiplication:

Now you can execute SpMM:

cusparseSpMM(handle, opA, opB, alpha, matA, matB,

For more information, see the following resources:

OpenAI Block-Sparse GPU Kernels

About the Authors

About Federico Busato

May 10, 2021

May 11, 2021

May 12, 2021

May 12, 2021

May 12, 2021

Continue the discussion at forums.developer.nvidia.com

cuSPARSELt v0.1.0 Now Available: Arm and Windows Support

Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt

CUTLASS: Fast Linear Algebra in CUDA C++

Follow NVIDIA Developer

You might also like