2024/4/17 15:25 Sparsity in INT8: Training Workflow and Best Practices for NVIDIA TensorRT Acceleration | NVIDIA Technical Blog

Sparsity in INT8: Training Workflow and Best Practices for

NVIDIA TensorRT Acceleration
May 16, 2023


By Gwena Cunha Sergio, Sagar Shelke, Jinkyu Koo, Le An and Josh Park

The training stage of deep learning (DL) models consists of learning numerous dense floating-point weight matrices, which results in a massive amount of floating-point
computations during inference. Research has shown that many of those computations can be skipped by forcing some weights to be zero, with little impact on the final accuracy.

In parallel to that, previous posts have shown that lower precision, such as INT8, is often sufficient to obtain similar accuracies to FP32 during inference. Sparsity and quantization
are popular optimization techniques used to tackle these points, improving inference time and reducing memory footprint.

Quantization support has been available in NVIDIA TensorRT for a while (as of the 2.1 release), and support for sparsity was more recently built into NVIDIA Ampere architecture
Tensor Cores and introduced in TensorRT 8.0.

This post is a step-by-step guide on how to accelerate DL models with TensorRT using sparsity and quantization techniques. Although each of these optimizations has been
individually discussed, there’s still a need to demonstrate the end-to-end workflow from training to deployment with TensorRT, considering both optimizations.

In this post, we aim to bridge that gap: we explain what the sparsity-quantization training workflow looks like, advise on best practices for sparsity with regard to
TensorRT acceleration, and present an end-to-end case study with ResNet-34.

Structured sparsity
NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured
sparsity recipe, with no computations being done on zero-values due to the available support directly on the Tensor Cores. This results in more workload being computed in the
same amount of time. In this post, we refer to this process as pruning.

For more information, see Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT.
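
To make the 2:4 pattern concrete, here is a minimal sketch in plain Python. It assumes a magnitude-based rule (keep the two largest-magnitude values in every group of four), which is an illustration only and not necessarily how the pruning toolkit selects weights internally. The function names `two_four_mask` and `apply_mask` are hypothetical.

```python
def two_four_mask(row):
    """Return a 0/1 mask that keeps the 2 largest-magnitude values in each group of 4."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within this group of four.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def apply_mask(row, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w if m else 0.0 for w, m in zip(row, mask)]


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = two_four_mask(weights)
pruned = apply_mask(weights, mask)
print(mask)
print(pruned)
```

Every group of four in `pruned` contains exactly two zeros, which is the layout that Sparse Tensor Cores can exploit.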

Quantization
Quantization refers to the process of mapping values from a continuous (or very large) set to a smaller, finite set of discrete values (for example, FP32 to INT8). There are two main quantization techniques
discussed in this post:

https://developer.nvidia.com/blog/sparsity-in-int8-training-workflow-and-best-practices-for-tensorrt-acceleration/
Post-training quantization (PTQ): Uses an implicit quantization workflow. In implicitly quantized networks, each quantized tensor has an associated scale that is used to implicitly quantize
and dequantize values through calibration. TensorRT then checks in which precision that layer runs faster and executes it accordingly.
Quantization-aware training (QAT): Uses an explicit quantization workflow. Explicitly quantized networks make use of quantize and dequantize (Q/DQ) nodes to explicitly indicate which
layers must be quantized. This means that you have more control over which layers are running in INT8. For more information, see Q/DQ Layer-Placement Recommendations.

For more information about quantization basics, a comparison between PTQ and QAT quantization techniques, insights on when to choose which, and quantization in TensorRT,
see Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT.
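
As a rough illustration of what a scale does, the following sketch quantizes a few FP32 values to INT8 and back. It assumes symmetric quantization with a clipped range of [-127, 127] and a scale derived from the maximum absolute value (amax); the helper names are hypothetical and this is not TensorRT's implementation.

```python
def compute_scale(amax, num_bits=8):
    """Scale that maps [-amax, amax] onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return amax / qmax


def quantize(x, scale, num_bits=8):
    """Round to the nearest integer step and clip to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(-qmax, min(qmax, round(x / scale)))


def dequantize(q, scale):
    """Map the integer back to an approximate floating-point value."""
    return q * scale


values = [0.02, -1.2, 0.731, 3.0]
amax = max(abs(v) for v in values)
scale = compute_scale(amax)
q_values = [quantize(v, scale) for v in values]
dq_values = [dequantize(q, scale) for q in q_values]
for v, q, dq in zip(values, q_values, dq_values):
    print(f"{v:+.4f} -> {q:+4d} -> {dq:+.4f}")
```

The round trip loses at most half a quantization step per value as long as the value lies within [-amax, amax]; values outside that range are clipped, which is why the choice of amax during calibration matters.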

Workflow for deploying sparse-quantized models in TensorRT

The workflow for deploying sparse-quantized models in TensorRT, considering PyTorch as the DL framework, has the following steps:

1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.
2. Quantizing the sparsified model through the PTQ or QAT workflow.
3. Deploying the obtained sparse INT8 engine in TensorRT.

Figure 1 shows all three steps. One distinction in step 2 is that Q/DQ nodes are present in the ONNX graph generated through QAT but absent in the ONNX graph generated
through PTQ. For more information, see Working with INT8.
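
One way to check which kind of graph you have is to count the Q/DQ operators in the exported ONNX file. The sketch below is a hedged example: `count_qdq`, `has_explicit_quantization`, and `onnx_op_types` are hypothetical helpers, the op-type lists are made up for illustration, and only top-level graph nodes are inspected.

```python
from collections import Counter

QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def count_qdq(op_types):
    """Count Q/DQ operators in a sequence of ONNX op-type strings."""
    counts = Counter(op_types)
    return {op: counts.get(op, 0) for op in QDQ_OPS}


def has_explicit_quantization(op_types):
    """True if the graph contains at least one Q or DQ node."""
    return any(n > 0 for n in count_qdq(op_types).values())


def onnx_op_types(onnx_path):
    """List the op types of the top-level nodes of an ONNX model."""
    import onnx  # imported lazily so the helpers above work without the package

    model = onnx.load(onnx_path)
    return [node.op_type for node in model.graph.node]


# Illustrative op-type lists standing in for PTQ-style and QAT-style exports.
ptq_like = ["Conv", "Relu", "MaxPool", "Conv", "Add", "Relu"]
qat_like = ["QuantizeLinear", "DequantizeLinear", "Conv", "Relu",
            "QuantizeLinear", "DequantizeLinear", "Conv", "Add", "Relu"]
print(count_qdq(ptq_like), has_explicit_quantization(ptq_like))
print(count_qdq(qat_like), has_explicit_quantization(qat_like))
```

With a real file, `has_explicit_quantization(onnx_op_types("model.onnx"))` would apply the same check, where the path is a placeholder.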

Given that, here’s the full workflow for QAT:

1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.

2. Quantizing, calibrating, and fine-tuning the sparsified model in PyTorch.
3. Exporting the PyTorch model to ONNX.
4. Generating a TensorRT engine through ONNX.
5. Deploying the obtained sparse INT8 engine in TensorRT.

On the other hand, here’s the full workflow for PTQ:

1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.

2. Exporting the PyTorch model to ONNX.
3. Calibrating and quantizing the sparsified ONNX model through the TensorRT builder, generating a TensorRT engine.
4. Deploying the obtained sparse INT8 engine in TensorRT.

Figure 1. End-to-end workflow for deploying sparse-quantized models in TensorRT

Case study: ResNet-34

This section demonstrates a case study of the Sparsity-Quantization workflow with ResNet-34. For more information, see the full code example on the /SparsityINT8 GitHub repo.

Requirements
Here’s the basic configuration required to follow this case study:

Python 3.8
PyTorch 1.11 (also tested with 2.0.0)
PyTorch vision
apex sparsity toolkit
pytorch-quantization toolkit
TensorRT 8.6
Polygraphy
ONNX opset>=13
NVIDIA Ampere architecture GPU for Tensor Core support

This case study requires the ImageNet 2012 dataset for image classification. For more information about downloading the dataset and converting it to the required format, see
the readme on the GitHub repo.

This dataset is needed for sparsity training, sparse-QAT model fine-tuning, and sparse-PTQ model calibration. It is also used to evaluate the models.

Step 1: Sparsify and fine-tune from the dense model

Load the pretrained dense model and augment the model and the optimizer for sparsity training. For more information, see the NVIDIA/apex/tree/master/apex/contrib/sparsity
folder.

import copy

import torch
from torchvision import models
from apex.contrib.sparsity import ASP

# `criterion`, `data_loader`, `device`, and `epoch` are assumed to be defined by the
# surrounding training script.

# Load dense model
model_dense = models.__dict__["resnet34"](pretrained=True)

# Initialize sparsity mode before starting sparse training.
# `optimizer` must already be created over model_sparse.parameters() at this point.
model_sparse = copy.deepcopy(model_dense)
ASP.prune_trained_model(model_sparse, optimizer)

# Re-train model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_sparse(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save model
torch.save(model_sparse.state_dict(), "sparse_finetuned.pth")
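
After fine-tuning, it can be useful to confirm that the weights still follow the 2:4 pattern. The following sketch checks a single row of weights given as a plain Python list; `is_2to4_sparse` is a hypothetical helper, and it assumes that the groups of four are taken along the dimension you pass in (for example, one row extracted from a layer with `.tolist()`).

```python
def is_2to4_sparse(row):
    """True if every contiguous group of 4 values contains at least 2 zeros."""
    if len(row) % 4 != 0:
        return False
    return all(
        sum(1 for w in row[i:i + 4] if w == 0) >= 2
        for i in range(0, len(row), 4)
    )


sparse_row = [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
dense_row = [0.9, 0.1, 0.0, -0.7, 0.2, 0.3, -0.4, 0.01]
print(is_2to4_sparse(sparse_row))
print(is_2to4_sparse(dense_row))
```

Layers whose shapes cannot be split into groups of four are reported as not sparse by this check, so a `False` result for such a layer does not by itself indicate a training problem.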

Step 2: Quantize the PyTorch model

There are two quantization methods that you can choose for this step: PTQ or QAT.

PTQ through TensorRT calibration

This option exports a PyTorch model to ONNX and calibrates it through the TensorRT Python API. This generates a calibration cache and a TensorRT engine that is ready for
deployment.

Export the sparse PyTorch model to ONNX:

dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")

torch.onnx.export(model_sparse, dummy_input, "sparse_finetuned.onnx", opset_version=13, do_constant_folding=True)

Calibrate the ONNX model, exported in the previous step, with a calibration dataset. The following code example assumes an ONNX model with static input shape and batch size.

from infer_engine import infer

from polygraphy.backend.trt import Calibrator, CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner, SaveEngine
from polygraphy.logger import G_LOGGER

# Data loader argument to `Calibrator`
def calib_data(val_batches, input_name):
    for iteration, (images, labels) in enumerate(val_batches):
        yield {input_name: images.numpy()}

# Set path to ONNX model
onnx_path = "sparse_finetuned.onnx"

# Set calibrator
calibration_cache_path = onnx_path.replace(".onnx", "_calibration.cache")
calibrator = Calibrator(
    data_loader=calib_data(data_loader_calib, args.onnx_input_name),
    cache=calibration_cache_path
)

# Build engine from ONNX model by enabling INT8 and sparsity weights, and providing the calibrator
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath(onnx_path),
    config=CreateConfig(
        int8=True,
        calibrator=calibrator,
        sparse_weights=True
    )
)

# Trigger engine saving
engine_path = onnx_path.replace(".onnx", ".engine")
build_engine = SaveEngine(build_engine, path=engine_path)

# Calibrate engine (activated by the runner)
with G_LOGGER.verbosity(G_LOGGER.VERBOSE), TrtRunner(build_engine) as runner:
    print("Calibrated engine!")

# Infer PTQ engine and evaluate its accuracy
log_file = engine_path.split("/")[-1].replace(".engine", "_accuracy.txt")
infer(
    engine_path,
    data_loader_test,
    batch_size=args.batch_size,
    log_file=log_file
)

QAT through the pytorch-quantization toolkit

This option uses the pytorch-quantization toolkit to add Q/DQ nodes in the Sparse PyTorch model, calibrates it, and fine-tunes it for a few epochs. The fine-tuned model is then
exported to ONNX and converted to a TensorRT engine for deployment.


To ensure that the already-computed sparse floating-point weights are not overwritten, and that the QAT weights also keep the sparse structure, you must again prepare
the model for pruning.

Initialize the QAT model and optimizer for pruning before loading the fine-tuned sparse weights. Sparse mask re-computations must also be disabled as they were already
computed in Step 1. This requires a custom function that is a slight modification of the APEX toolkit’s prune_trained_model function. The modifications are highlighted in the
code example:

import torch
from apex.contrib.sparsity import ASP

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=False):
    asp = ASP()
    # Masks were already computed in Step 1, so recomputation is disabled
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[torch.nn.Linear, torch.nn.Conv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()
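
The reason for preparing the optimizer again can be seen in a small conceptual sketch: if a fixed mask is reapplied after every update, pruned weights stay at zero even though their gradients are nonzero. This is plain-Python pseudo-training under that assumption, not the APEX implementation, and `masked_sgd_step` is a hypothetical name.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD update followed by re-applying a fixed sparsity mask."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


def plain_sgd_step(weights, grads, lr=0.1):
    """One SGD update without any mask, for comparison."""
    return [w - lr * g for w, g in zip(weights, grads)]


weights = [0.9, 0.0, 0.0, -0.7]   # already pruned to a 2:4 pattern
mask = [1, 0, 0, 1]               # mask computed once, then kept fixed
grads = [0.5, -0.3, 0.2, 0.1]     # gradients are generally dense

masked = masked_sgd_step(weights, grads, mask)
unmasked = plain_sgd_step(weights, grads)
print(masked)
print(unmasked)
```

Without the mask, the second and third weights drift away from zero after a single step, which would break the 2:4 structure during QAT fine-tuning.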

For optimal Q/DQ node placement, you must modify the model’s definition to quantize residual branches, as shown in the pytorch-quantization toolkit example. For example, for
ResNet, the modifications needed to add Q/DQ nodes in the residual branch are as follows:

from torch import Tensor, nn
from pytorch_quantization import nn as quant_nn

class BasicBlock(nn.Module):

    def __init__(self, ..., quantize: bool = False) -> None:
        super().__init__()
        ...
        self._quantize = quantize
        if self._quantize:
            # Quantizer for the identity (residual) branch
            self.residual_quantizer = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x: Tensor) -> Tensor:
        identity = x
        ...
        if self._quantize:
            out += self.residual_quantizer(identity)
        else:
            out += identity
        out = self.relu(out)
        return out

The same modification must be repeated for the Bottleneck class and the quantize bool parameter must be propagated through the ResNet, _resnet, and resnet34
functions. After those modifications are done, instantiate the model with quantize=True. For more information, see line 734 in resnet.py.

The first step of quantizing a sparse model through QAT is to enable quantization and pruning in the model. The second step is to load the fine-tuned sparse checkpoint, calibrate
it, and then finally fine-tune that model for some epochs. For more information about the collect_stats and compute_amax functions, see the calibrate_quant_resnet50.ipynb
notebook.
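
Calibration boils down to choosing a clipping value (amax) for each quantized tensor from observed data. The sketch below is a simplified stand-in: it compares a max rule with a nearest-rank percentile rule in plain Python, rather than the entropy method used in this post, and the function names and sample data are hypothetical.

```python
import math


def amax_max(activations):
    """Clip at the largest magnitude ever observed."""
    return max(abs(a) for a in activations)


def amax_percentile(activations, percentile=99.0):
    """Clip at the given percentile of magnitudes (nearest-rank method)."""
    mags = sorted(abs(a) for a in activations)
    rank = max(1, math.ceil(percentile * len(mags) / 100))
    return mags[rank - 1]


# 99 well-behaved values plus one large outlier.
activations = [i / 100 for i in range(99)] + [50.0]
print(amax_max(activations))
print(amax_percentile(activations, 99.0))
```

A single outlier dominates the max rule and stretches the INT8 range over values that almost never occur, while the percentile rule ignores it. Histogram-based methods such as entropy calibration make a similar trade-off in a more principled way.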

# Add Q/DQ nodes to the dense model.
# `models` here must provide the modified ResNet definition that accepts `quantize`.
from pytorch_quantization import quant_modules
quant_modules.initialize()
model_qat = models.__dict__["resnet34"](pretrained=True, quantize=True)

# Initialize sparsity mode before starting Sparse-QAT fine-tuning
prune_trained_model_custom(model_qat, optimizer, compute_sparse_masks=False)

# Load sparse weights (Step 1 saved the state dict directly)
load_dict = torch.load("sparse_finetuned.pth")
model_qat.load_state_dict(load_dict)

# Calibrate model
collect_stats(model_qat, data_loader_calib, num_batches=len(data_loader_calib))
compute_amax(model_qat, method="entropy")

# Fine-tune model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_qat(image)
        ...

# Save model
torch.save(model_qat.state_dict(), "quant_finetuned.pth")

To prepare the TensorRT engine for deployment, you must export the sparse-quantized PyTorch model to ONNX. TensorRT expects QAT ONNX models to indicate which layers
should be quantized through a set of QuantizeLinear and DequantizeLinear ONNX ops. This requirement is fulfilled by enabling fake quantization when exporting a quantized
PyTorch model to ONNX.

from pytorch_quantization import nn as quant_nn

quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")
torch.onnx.export(model_qat, dummy_input, "quant_finetuned.onnx", opset_version=13, do_constant_folding=True)

Finally, build the TensorRT engine:

$ trtexec --onnx=quant_finetuned.onnx --int8 --sparsity=enable --saveEngine=quant_finetuned.engine --skipInference


Step 3: Deploy the TensorRT engine

$ trtexec --loadEngine=quant_finetuned.engine -v

Results
Here are the performance measurements, in terms of classification accuracy and runtime, for ResNet-34 dense-quantized and sparse-quantized models on an NVIDIA A40 GPU
with TensorRT 8.6-GA (8.6.1.6). To reproduce these results, follow the workflow described in the previous section.

Figure 2 shows the dense accuracy compared to sparse accuracy in TensorRT for ResNet-34 in three settings:

Dense vs. sparse in FP32

Dense-PTQ vs. sparse-PTQ in INT8
Dense-QAT vs. sparse-QAT in INT8

As you can see, the sparse variants can mostly preserve accuracy compared to their dense equivalents for all settings.

Figure 2. Top-1 accuracy (%) of the dense and sparse models

Input resolution of 3x224x224, using TensorRT 8.6-GA.

In regard to runtime performance, Figure 3 shows a ~1.4x speedup for sparse-quantized models over dense-quantized for both PTQ and QAT workflows.

Figure 3. TensorRT runtime for dense-quantized and sparse-quantized settings

Batch sizes and input resolutions

A large workload gives sparse kernels more of an opportunity to shine. In this section, we evaluate how different batch sizes (bs) and input resolutions affect speedups between
dense and sparse models running in INT8 precision.

Figure 4 shows a speedup improvement between dense-quantized and sparse-quantized settings in ResNet-34 as batch size increases for both PTQ and QAT workflows.

Speedup for the PTQ workflow ranges from 1.21x for bs=1 up to 1.40x for bs=2048.
Speedups for the QAT workflow are in the same ballpark, ranging from 1.20x for bs=1 up to 1.42x for bs=2048.


Figure 4. Speedup improvement between dense-quantized and sparse-quantized settings

Figure 5 shows a speedup improvement between dense-quantized and sparse-quantized settings in ResNet-34 as input resolution increases for both PTQ and QAT workflows.
Speedups for the PTQ and QAT workflows range from 1.26x and 1.25x for input resolution of 3x224x224, up to 1.66x and 1.65x for input resolution of 3x4096x2048, respectively.
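
To relate these speedup ratios to latency, recall that a speedup of s means the sparse engine takes 1/s of the dense runtime. The short sketch below converts some of the ratios reported above into a relative latency reduction; `latency_reduction` is a hypothetical helper.

```python
def latency_reduction(speedup):
    """Fraction of dense latency saved, given speedup = dense_time / sparse_time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


reported = [
    ("PTQ, bs=1", 1.21),
    ("PTQ, bs=2048", 1.40),
    ("PTQ, 3x4096x2048", 1.66),
]
for label, speedup in reported:
    saved = latency_reduction(speedup) * 100
    print(f"{label}: {speedup:.2f}x speedup, about {saved:.1f}% lower latency")
```

Because the relationship is not linear, going from 1.21x to 1.66x more than doubles the share of latency saved.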

Figure 5. Speedup ratio between dense-quantized and sparse-quantized settings

Here are some additional best practices that we observed during our experiments:

Output channel counts that are a multiple of 32 make it easier to leverage Tensor Cores (IMMA) for INT8. For more information, see Deep Learning Performance Guide.
A high number of output channels (typically >128) makes it more likely that sparse kernels are selected, because of the larger workload.
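
When designing or adjusting a layer, the multiple-of-32 guideline is simple arithmetic. The helper below is a hypothetical sketch that rounds a channel count up to the next multiple of 32; changing channel counts alters the model and requires retraining, so treat it only as a planning aid.

```python
def round_up_to_multiple(channels, multiple=32):
    """Smallest multiple of `multiple` that is greater than or equal to `channels`."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    return ((channels + multiple - 1) // multiple) * multiple


for c in (33, 100, 128, 200):
    print(c, "->", round_up_to_multiple(c))
```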

Conclusion
In this post, we demonstrated that significant latency reduction can be achieved with minimal impact on accuracy through a sparse INT8-based training workflow and TensorRT
deployment strategies. We provided a thorough step-by-step guide with ResNet-34 as a use case, followed up by a discussion on the observed performance with respect to
accuracy and latency.

Experimental results showed that accuracy was mostly maintained when comparing dense and sparse models, while runtime improved by up to ~1.7x for the highest explored
workload (input resolution 3x4096x2048, bs=8). Finally, we shared some best practices for sparsity that we observed during our experiments.

For more information about sparsity, see the following related resources:

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture (GTC session)
Accelerating Sparsity in the NVIDIA Ampere Architecture (GTC session)
Accelerating Sparse Deep Neural Networks (whitepaper)

For more information about quantization, see the following related resources:

Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT
Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT
Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT (GTC session)
Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation (whitepaper)

Related resources
GTC session: Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT
GTC session: Deep Dive into Math Libraries
GTC session: Optimizing Inference Performance and Incorporating New LLM Features in Desktops and Workstations
SDK: FasterTransformer

SDK: TensorRT
SDK: Torch-TensorRT

About the Authors

About Gwena Cunha Sergio
Gwenaelle Cunha Sergio is a senior deep-learning software engineer at NVIDIA. Her research interests include deep learning optimization and inference
acceleration, computer vision, and natural language tasks. She received her Ph.D. degree in electronic and electrical engineering from Kyungpook National
University, South Korea, and her bachelor's degree from the Federal University of Rio Grande do Norte (UFRN), Brazil, during which time she also
participated in the Science Without Borders exchange program at Brown University, USA.

About Sagar Shelke

Sagar Shelke works as a deep learning software engineer at NVIDIA, focusing on autonomous driving applications. His interests include neural network
optimization for deployment and machine learning systems. Sagar holds a master's degree in electrical and computer engineering from San Diego State
University.

About Jinkyu Koo

Jinkyu Koo is a senior deep-learning software engineer at NVIDIA. He has been working on optimization issues of deep learning models, especially for
autonomous vehicles. He received his Ph.D. in electrical and computer engineering from Purdue University.

About Le An
Le An is an engineering manager at NVIDIA who works on machine learning, deep learning, and computer vision techniques and their applications in
autonomous vehicles and beyond. Le received his Ph.D. from the University of California, Riverside, his M.S. from the Eindhoven University of Technology in
the Netherlands, and his B.S. from Zhejiang University in China.

About Josh Park

Josh Park is a senior manager at NVIDIA, where he specializes in the development of deep learning solutions using DL frameworks on multi-GPU and multi-
node servers and embedded systems. His expertise extends to the evaluation and enhancement of training and inference performances across diverse
GPU architectures, including x86_64 and aarch64. He earned his Ph.D. in computer science from Texas A&M University.

From Everand
Mastering Python Scientific Computing: A complete guide for Python programmers to master scientific computing using Python APIs and tools
Hemant Kumar Mehta
4/5 (1)
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
Internet of Things (IoT) A Quick Start Guide: A to Z of IoT Essentials
From Everand
Internet of Things (IoT) A Quick Start Guide: A to Z of IoT Essentials
Chitra Lele
No ratings yet
What's New in .NET 8? A Complete Guide to the Latest Features
From Everand
What's New in .NET 8? A Complete Guide to the Latest Features
Nitika
No ratings yet
Machine Learning with Python: A Comprehensive Guide with a Practical Example
From Everand
Machine Learning with Python: A Comprehensive Guide with a Practical Example
MARTIN NEEL
No ratings yet
Foundation Course for Advanced Computer Studies
From Everand
Foundation Course for Advanced Computer Studies
Franck Ismael Djédjé
No ratings yet
TensorFlow Developer Certificate Exam Practice Tests 2024 Made Easy
From Everand
TensorFlow Developer Certificate Exam Practice Tests 2024 Made Easy
Mr Troy
No ratings yet
Advanced Backend Code Optimization
From Everand
Advanced Backend Code Optimization
Sid Touati
No ratings yet
C Programming for the Pc the Mac and the Arduino Microcontroller System
From Everand
C Programming for the Pc the Mac and the Arduino Microcontroller System
Peter D Minns
No ratings yet


Workflow for deploying sparse-quantized models in TensorRT

Given that, here’s the full workflow for QAT:

1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.

On the other hand, here’s the full workflow for PTQ:

1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.

Figure 1. End-to-end workflow for deploying sparse-quantized models in TensorRT

Case study: ResNet-34

Step 1: Sparsify and fine-tune from the dense model

# Load dense model

# Initialize sparsity mode before starting sparse training
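To make the sparsification step more concrete, here is a minimal sketch of the 2:4 structured sparsity pattern that NVIDIA Ampere Sparse Tensor Cores accelerate: in every group of four consecutive weights, two are kept and two are set to zero. The sketch assumes a simple magnitude-based selection on a flat list of weights. The helper names `two_four_mask` and `apply_mask` are hypothetical and are not part of the ASP API; this is an illustration of the pattern, not the library's implementation.

```python
def two_four_mask(weights):
    """Return a 0/1 mask that keeps the 2 largest-magnitude values in each group of 4.

    Hypothetical helper for illustration only (not part of apex.contrib.sparsity).
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    mask = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Rank positions by absolute value; the sort is stable, so ties keep
        # the earlier position.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
mask = two_four_mask(row)
pruned = apply_mask(row, mask)
print(mask)  # [1, 0, 0, 1, 0, 1, 1, 0]
```

During sparse fine-tuning, a mask like this stays fixed while the surviving weights are updated, so the zero positions do not move.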

Step 2: Quantize the PyTorch model

PTQ through TensorRT calibration

Export the sparse PyTorch model to ONNX:

dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")

from infer_engine import infer

# Data loader argument to `Calibrator`

# Set path to ONNX model

# Trigger engine saving

# Calibrate engine (activated by the runner)

# Infer PTQ engine and evaluate its accuracy
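As a rough illustration of what calibration produces, the sketch below derives a single symmetric INT8 scale from the largest absolute value in a set of sample activations and uses it to quantize and dequantize values. This is a simplified max-based scheme chosen for clarity; it is an assumption of this example and is not necessarily the calibration algorithm TensorRT applies. The function names are hypothetical.

```python
def calibrate_scale(samples):
    """Pick a symmetric INT8 scale so the largest |value| maps to 127."""
    amax = max(abs(x) for x in samples)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(x, scale):
    """Map a float to an integer code in [-127, 127], clipping out-of-range values."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """Map an integer code back to an approximate float value."""
    return q * scale


activations = [-2.54, 0.0, 0.5, 1.0, 2.0]
scale = calibrate_scale(activations)
codes = [quantize(x, scale) for x in activations]
print(codes)  # [-127, 0, 25, 50, 100]
```

Values outside the calibrated range are clipped to ±127, which is why the calibration data should be representative of the inputs seen at inference time.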

QAT through the pytorch-quantization toolkit

from apex.contrib.sparsity import ASP

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=False):

from pytorch_quantization import nn as quant_nn

def __init__(self, ..., quantize: bool = False) -> None:

def forward(self, x: Tensor) -> Tensor:

# Add Q/DQ nodes to the dense model

# Initialize sparsity mode before starting Sparse-QAT fine-tuning

# Load Sparse weights

from pytorch_quantization import nn as quant_nn
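One reason sparsity and quantization combine cleanly is that symmetric quantization maps zero to zero, so a quantize/dequantize (Q/DQ) round trip does not disturb the pruned positions. The sketch below simulates such a round trip on an already-sparse row of weights. It assumes per-tensor, max-based symmetric quantization, and `fake_quantize` is a hypothetical helper, not a function from the pytorch-quantization toolkit.

```python
def fake_quantize(values, num_bits=8):
    """Simulate a Q/DQ pair: quantize to signed integers, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / qmax
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # quantize (Q)
        out.append(q * scale)                        # dequantize (DQ)
    return out


sparse_row = [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
qdq_row = fake_quantize(sparse_row)

zeros_before = [i for i, v in enumerate(sparse_row) if v == 0.0]
zeros_after = [i for i, v in enumerate(qdq_row) if v == 0.0]
print(zeros_before == zeros_after)  # True
```

Very small nonzero weights can still round to zero under this scheme, which adds zeros but never removes the ones introduced by pruning.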

Finally, build the TensorRT engine:

$ trtexec --onnx=quant_finetuned.onnx --int8 --sparsity=enable --saveEngine=quant_finetuned.engine --skipInference

Step 3: Deploy the TensorRT engine

Dense vs. sparse in FP32

Figure 2. Top-1 accuracy (%) of the dense and sparse models

Input resolution of 3x224x224, using TensorRT 8.6-GA.

Figure 3. TensorRT runtime for dense-quantized and sparse-quantized settings

Batch sizes and input resolutions

Figure 4. Speedup improvement between dense-quantized and sparse-quantized settings

Figure 5. Speedup ratio between dense-quantized and sparse-quantized settings

About the Authors

About Sagar Shelke

About Jinkyu Koo

About Josh Park


