Sparsity in INT8 - Training Workflow and Best Practices For NVIDIA TensorRT Acceleration - NVIDIA Technical Blog
By Gwena Cunha Sergio, Sagar Shelke, Jinkyu Koo, Le An and Josh Park
The training stage of deep learning (DL) models consists of learning numerous dense floating-point weight matrices, which results in a massive amount of floating-point
computations during inference. Research has shown that many of those computations can be skipped by forcing some weights to be zero, with little impact on the final accuracy.
In parallel to that, previous posts have shown that lower precision, such as INT8, is often sufficient to obtain similar accuracies to FP32 during inference. Sparsity and quantization
are popular optimization techniques used to tackle these points, improving inference time and reducing memory footprint.
Quantization support has been available in NVIDIA TensorRT for a while (as of the 2.1 release), and support for sparsity was more recently built into NVIDIA Ampere architecture
Tensor Cores and introduced in TensorRT 8.0.
This post is a step-by-step guide on how to accelerate DL models with TensorRT using sparsity and quantization techniques. Although each of these optimizations has been
individually discussed, there’s still a need to demonstrate the end-to-end workflow from training to deployment with TensorRT, considering both optimizations.
In this post, we aim to bridge that gap: we explain what the sparsity-quantization training workflow looks like, advise on best practices for sparsity with regard to
TensorRT acceleration, and present an end-to-end case study with ResNet-34.
Structured sparsity
NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured
sparsity recipe, with no computations being done on zero-values due to the available support directly on the Tensor Cores. This results in more workload being computed in the
same amount of time. In this post, we refer to this process as pruning.
For more information, see Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT.
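To make the 2:4 pattern concrete, here is a minimal plain-Python sketch. It assumes magnitude-based selection (keep the two largest-magnitude values in each block of four), which is a common heuristic and not necessarily what a given toolkit does. The helper names `prune_2_4` and `is_2_4_sparse` are hypothetical and are not part of any NVIDIA library.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in each contiguous block of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        block = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this block (assumed heuristic).
        keep = sorted(range(4), key=lambda i: abs(block[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(block))
    return pruned


def is_2_4_sparse(row):
    """Return True if every contiguous block of four has at least two zeros."""
    return all(
        sum(1 for v in row[start:start + 4] if v == 0.0) >= 2
        for start in range(0, len(row), 4)
    )


# Hypothetical weights for illustration only.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_weights = prune_2_4(weights)
print(sparse_weights)
print(is_2_4_sparse(sparse_weights))
```

Half of the values in every block are zero after pruning, which is the property that Sparse Tensor Cores rely on.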
Quantization
Quantization refers to the process of mapping continuous infinite values to a finite set of discrete values (for example, FP32 to INT8). There are two main quantization techniques
discussed in this post:
https://developer.nvidia.com/blog/sparsity-in-int8-training-workflow-and-best-practices-for-tensorrt-acceleration/ 1/9
2024/4/17 15:25 Sparsity in INT8: Training Workflow and Best Practices for NVIDIA TensorRT Acceleration | NVIDIA Technical Blog
Post-training quantization (PTQ): Uses an implicit quantization workflow. In implicitly quantized networks, each quantized tensor has an associated scale that is used to implicitly quantize
and dequantize values through calibration. TensorRT then checks in which precision that layer runs faster and executes it accordingly.
Quantization-aware training (QAT): Uses an explicit quantization workflow. Explicitly quantized networks make use of quantize and dequantize (Q/DQ) nodes to explicitly indicate which
layers must be quantized. This means that you have more control over which layers are running in INT8. For more information, see Q/DQ Layer-Placement Recommendations.
For more information about quantization basics, a comparison between PTQ and QAT quantization techniques, insights on when to choose which, and quantization in TensorRT,
see Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT.
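As a rough illustration of the FP32-to-INT8 mapping, the following plain-Python sketch applies symmetric quantization with a single scale derived from the largest absolute value (amax). The function names are hypothetical, and clamping to [-127, 127] is a simplifying assumption. Real toolkits also choose amax through calibration instead of taking it from one tensor.

```python
def compute_scale(values, num_bits=8):
    """Symmetric scale: map the largest magnitude (amax) to the largest positive integer."""
    amax = max(abs(v) for v in values)
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return amax / qmax if amax > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Hypothetical activations for illustration only.
activations = [-1.27, -0.5, 0.0, 0.64, 1.27]
scale = compute_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)
print(q)
print(restored)
```

A quantize step followed immediately by a dequantize step is what a Q/DQ pair expresses in an explicitly quantized graph: the values stay in floating point but carry the rounding error of INT8.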
1. Sparsifying and fine-tuning a pretrained dense model in PyTorch.
2. Quantizing the sparsified model through the PTQ or QAT workflow.
3. Deploying the obtained sparse INT8 engine in TensorRT.
Figure 1 shows all three steps. One distinction in step 2 is that Q/DQ nodes are present in the ONNX graph generated through QAT but absent in the ONNX graph generated
through PTQ. For more information, see Working with INT8.
Requirements
Here’s the basic configuration required to follow this case study:
Python 3.8
PyTorch 1.11 (also tested with 2.0.0)
PyTorch vision
apex sparsity toolkit
pytorch-quantization toolkit
TensorRT 8.6
Polygraphy
ONNX opset>=13
NVIDIA Ampere architecture GPU for Tensor Core support
This case study requires the ImageNet 2012 dataset for image classification. For more information about downloading the dataset and converting it to the required format, see
the readme on the GitHub repo.
This dataset is needed for sparsity training, sparse-QAT model fine-tuning, and sparse-PTQ model calibration. It is also used to evaluate the models.
import copy
import torch
from torchvision import models
from apex.contrib.sparsity import ASP

# Assumes model_sparse (the model prepared for 2:4 pruning with ASP), optimizer,
# criterion, data_loader, device, and epoch are already defined.

# Re-train model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_sparse(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save model
torch.save(model_sparse.state_dict(), "sparse_finetuned.pth")
Calibrate the ONNX model, exported in the previous step, with a calibration dataset. The following code example assumes an ONNX model with static input shape and batch size.
from polygraphy.backend.trt import Calibrator, CreateConfig, EngineFromNetwork, NetworkFromOnnxPath

# Set calibrator
calibration_cache_path = onnx_path.replace(".onnx", "_calibration.cache")
calibrator = Calibrator(
    data_loader=calib_data(data_loader_calib, args.onnx_input_name),
    cache=calibration_cache_path
)

# Build engine from ONNX model by enabling INT8 and sparse weights, and providing the calibrator
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath(onnx_path),
    config=CreateConfig(
        int8=True,
        calibrator=calibrator,
        sparse_weights=True
    )
)
To ensure that the already-calculated sparse floating-point weights aren't overwritten, and that the QAT weights also remain structured as sparse, you must again prepare
the model for pruning.
Initialize the QAT model and optimizer for pruning before loading the fine-tuned sparse weights. Sparse mask re-computations must also be disabled as they were already
computed in Step 1. This requires a custom function that is a slight modification of the APEX toolkit’s prune_trained_model function. The modifications are highlighted in the
code example:
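Independently of the APEX implementation, the reason for freezing the masks can be sketched in plain Python. This is a conceptual illustration only: `compute_mask_2_4` and `masked_sgd_step` are hypothetical helpers, the weights and gradients are made-up values, and the magnitude-based mask is an assumption. The mask is computed once from the sparse weights and then re-applied after each update, so pruned positions stay at zero.

```python
def compute_mask_2_4(weights):
    """Build a 0/1 mask that keeps the two largest-magnitude values per block of four."""
    mask = []
    for start in range(0, len(weights), 4):
        block = weights[start:start + 4]
        keep = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(len(block)))
    return mask


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply one SGD update, then re-apply the fixed mask so pruned weights stay zero."""
    return [(w - lr * g) if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# Sparse weights from the pruning step (hypothetical values).
sparse_weights = [0.9, 0.0, 0.0, -0.7]
mask = compute_mask_2_4(sparse_weights)  # computed once, then frozen

grads = [0.5, -20.0, 3.0, 0.5]
updated = masked_sgd_step(sparse_weights, grads, mask)

# For contrast: an unmasked update followed by a mask recomputation
# would select different positions and change the sparsity pattern.
dense_update = [w - 0.1 * g for w, g in zip(sparse_weights, grads)]
recomputed_mask = compute_mask_2_4(dense_update)

print(mask, updated)
print(recomputed_mask)
```

In this toy case the recomputed mask differs from the frozen one, which is the situation that disabling mask re-computation is meant to avoid.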
For optimal Q/DQ node placement, you must modify the model’s definition to quantize residual branches, as shown in the pytorch-quantization toolkit example. For example, for
ResNet, the modifications needed to add Q/DQ nodes in the residual branch are highlighted as follows:
class BasicBlock(nn.Module):
The same modification must be repeated for the Bottleneck class and the quantize bool parameter must be propagated through the ResNet, _resnet, and resnet34
functions. After those modifications are done, instantiate the model with quantize=True. For more information, see line 734 in resnet.py.
The first step of quantizing a sparse model through QAT is to enable quantization and pruning in the model. The second step is to load the fine-tuned sparse checkpoint, calibrate
it, and then finally fine-tune that model for some epochs. For more information about the collect_stats and compute_amax functions, see the calibrate_quant_resnet50.ipynb
notebook.
# Calibrate model
collect_stats(model_qat, data_loader_calib, num_batches=len(data_loader_calib))
compute_amax(model_qat, method="entropy")

# Fine-tune model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_qat(image)
        ...

# Save model
torch.save(model_qat.state_dict(), "quant_finetuned.pth")
To prepare the TensorRT engine for deployment, you must export the sparse-quantized PyTorch model to ONNX. TensorRT expects QAT ONNX models to indicate which layers
should be quantized through a set of QuantizeLinear and DequantizeLinear ONNX ops. This requirement is fulfilled by enabling fake quantization when exporting a quantized
PyTorch model to ONNX.
Results
Here are the performance measurements, in terms of classification accuracy and runtime, for ResNet-34 dense-quantized and sparse-quantized models on an NVIDIA A40 GPU
with TensorRT 8.6-GA (8.6.1.6). To reproduce these results, follow the workflow described in the previous section.
Figure 2 shows the dense accuracy compared to sparse accuracy in TensorRT for ResNet-34 in three settings:
As you can see, the sparse variants can mostly preserve accuracy compared to their dense equivalents for all settings.
In regard to runtime performance, Figure 3 shows a ~1.4x speedup for sparse-quantized models over dense-quantized for both PTQ and QAT workflows.
Figure 4 shows a speedup improvement between dense-quantized and sparse-quantized settings in ResNet-34 as batch size increases for both PTQ and QAT workflows.
Speedup for the PTQ workflow ranges from 1.21x for bs=1 up to 1.40x for bs=2048.
Speedups for the QAT workflow are in the same ballpark, ranging from 1.20x for bs=1 up to 1.42x for bs=2048.
Figure 5 shows a speedup improvement between dense-quantized and sparse-quantized settings in ResNet-34 as input resolution increases for both PTQ and QAT workflows.
Speedups for the PTQ and QAT workflows range from 1.26x and 1.25x for input resolution of 3x224x224, up to 1.66x and 1.65x for input resolution of 3x4096x2048, respectively.
Here are some additional best practices that we observed during our experiments:
Output channel counts that are a multiple of 32 make it easier to leverage Tensor Cores (IMMA) for INT8. For more information, see Deep Learning Performance Guide.
A high number of output channels (typically >128) makes it more likely that sparse kernels are picked up, because the workload is larger.
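These two rules of thumb can be checked before training with a small helper. The sketch below is plain Python and assumes you already have a list of (layer name, output channels) pairs. The function name, the layer names, and the default thresholds (taken from the guidance above) are illustrative assumptions.

```python
def check_conv_channels(layers, multiple=32, min_channels=128):
    """Return (name, out_channels, notes) for layers that miss either rule of thumb."""
    report = []
    for name, out_channels in layers:
        notes = []
        if out_channels % multiple != 0:
            notes.append(f"not a multiple of {multiple}")
        if out_channels <= min_channels:
            notes.append(f"not above {min_channels}")
        if notes:
            report.append((name, out_channels, notes))
    return report


# Hypothetical layer list for illustration only.
layers = [("conv1", 64), ("layer3.0.conv1", 256), ("head", 1000)]
report = check_conv_channels(layers)
for name, out_channels, notes in report:
    print(f"{name} ({out_channels} channels): {', '.join(notes)}")
```

A flagged layer is not necessarily a problem. The report only points at layers where sparse INT8 kernels may be less likely to help.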
Conclusion
In this post, we demonstrated that significant latency reduction can be achieved with minimal impact on accuracy through a sparse INT8-based training workflow and TensorRT
deployment strategies. We provided a thorough step-by-step guide with ResNet-34 as a use case, followed up by a discussion on the observed performance with respect to
accuracy and latency.
Experimental results showed that accuracy was mostly maintained when comparing dense and sparse models, while runtime improved by up to ~1.7x for the highest explored
workload (input resolution 3x4096x2048, bs=8). Finally, we shared some best practices for sparsity that we observed during our experiments.
For more information about sparsity, see the following related resources:
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture (GTC session)
Accelerating Sparsity in the NVIDIA Ampere Architecture (GTC session)
Accelerating Sparse Deep Neural Networks (whitepaper)
For more information about quantization, see the following related resources:
Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT
Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT
Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT (GTC session)
Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation (whitepaper)
About Le An
Le An is an engineering manager at NVIDIA who works on machine learning, deep learning, and computer vision techniques and their applications in
autonomous vehicles and beyond. Le received his Ph.D. from the University of California, Riverside, his M.S. from the Eindhoven University of Technology in
the Netherlands, and his B.S. from Zhejiang University in China.