Easy Quantization in PyTorch Using Fine-Grained FX

Improve Quantization Productivity with Intel Neural Compressor and Hugging Face Optimum-Intel

Sep 22, 2022

Authors: Xin He, Haihao Shen, and Feng Tian, Intel Corporation

In a previous blog, we introduced Intel Neural Compressor, an open-source Python library for model compression:

PyTorch Inference Acceleration with Intel® Neural Compressor

Authors: Feng Tian, Haihao Shen, Huma Abidi, Chandan Damannagari

In this post, we describe one of its new features, which makes quantization easier to apply to a broader range of models.

PyTorch provides an FX toolkit for developers to transform a torch.nn.Module into a torch.fx.GraphModule. With the generated GraphModule, FX can perform static quantization by automatically inserting quantize and dequantize operations.

Converting an imperative model into a graph model is useful because the graph form enables optimizations, such as post-training static quantization, that improve performance. However, FX cannot handle dynamic control flow automatically, and in many models dynamic control flow blocks the transformation.
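As a minimal sketch of this limitation, assuming PyTorch with torch.fx is installed, the code below traces two toy modules. StaticBlock, DynamicBlock, and try_trace are hypothetical names chosen for illustration; they are not part of PyTorch or Intel Neural Compressor.

```python
import torch
import torch.fx


class StaticBlock(torch.nn.Module):
    """Toy module whose forward pass has no data-dependent branching."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


class DynamicBlock(torch.nn.Module):
    """Toy module whose forward pass branches on tensor values."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Dynamic control flow: the branch taken depends on the input data.
        if x.sum() > 0:
            return self.linear(x)
        return x


def try_trace(module):
    """Return a GraphModule if symbolic tracing succeeds, otherwise None."""
    try:
        return torch.fx.symbolic_trace(module)
    except torch.fx.proxy.TraceError:
        # Raised when a traced value is used as the condition of an `if`.
        return None


static_graph = try_trace(StaticBlock())
dynamic_graph = try_trace(DynamicBlock())

print("StaticBlock traced:", static_graph is not None)
print("DynamicBlock traced:", dynamic_graph is not None)
```

The first module is converted to a GraphModule, while the second is rejected because its branch condition is only known at run time.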

Our Solution

Our solution, called fine-grained FX, makes quantization easy to apply to models with dynamic control flow. It is integrated into the pytorch_fx backend of Intel Neural Compressor and supports three popular quantization methods: post-training dynamic quantization, post-training static quantization, and quantization-aware training.

PyTorch recommends post-training dynamic quantization for NLP models because its scales and zero-points are computed at run time, which gives stable accuracy after quantization. Post-training static quantization instead uses fixed scales and zero-points. Because consecutive quantized modules can be chained, it avoids redundant quantize and dequantize operations between them. In theory, static quantization therefore performs better than dynamic quantization. Quantization-aware training for static quantization requires an additional training process that adjusts the model weights to reduce quantization loss. It can deliver high accuracy while keeping the performance of static quantization.
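To make the difference between fixed and run-time scales concrete, here is a minimal pure-Python sketch of affine quantization to unsigned 8-bit values. It only illustrates the arithmetic and is not how PyTorch or Intel Neural Compressor implements quantization; the helper names and the sample data are assumptions made for this example.

```python
def compute_qparams(values, qmin=0, qmax=255):
    """Derive a scale and zero-point that map the observed range onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # the range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers, saturating at the ends of the integer range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in qvalues]


# Static: scale and zero-point are fixed ahead of time from calibration data.
calibration = [0.0, 1.0, 2.55]
static_scale, static_zp = compute_qparams(calibration)

# A run-time input whose range is wider than anything seen during calibration.
runtime_input = [0.0, 1.0, 5.1]

static_q = quantize(runtime_input, static_scale, static_zp)
static_out = dequantize(static_q, static_scale, static_zp)

# Dynamic: scale and zero-point are recomputed from the run-time input itself.
dynamic_scale, dynamic_zp = compute_qparams(runtime_input)
dynamic_q = quantize(runtime_input, dynamic_scale, dynamic_zp)
dynamic_out = dequantize(dynamic_q, dynamic_scale, dynamic_zp)

print("static :", static_out)
print("dynamic:", dynamic_out)
```

With the fixed parameters, the out-of-range value 5.1 saturates at the top of the calibrated range, whereas the run-time parameters adapt to it. The price of the dynamic approach is the extra work of computing the parameters on every call, which is the overhead static quantization avoids.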

Because an imperative model consists of several blocks, fine-grained FX aggressively and recursively detects the blocks that are suitable for module transformation. Two examples are shown below: natural language processing (Figure 1) and object detection (Figure 2). The dark green blocks are detected as suitable for module transformation because they are the largest blocks without any control flow. We apply the FX toolkit to these blocks and quantize them automatically. We then reassemble the processed blocks using the original control flow, so the resulting model keeps the same behavior while gaining performance from INT8.

Figure 1. Fine-grained FX for BERT natural language processing

Figure 2. Fine-grained FX for YOLO-V2 object detection
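The recursive detection described above can be sketched in pure Python on a toy module tree. This is an illustration of the idea only, not the Intel Neural Compressor implementation: the Node class, the own_control_flow flag, and the module names are all hypothetical and stand in for a real attempt to trace each module with FX.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a module in the model hierarchy."""
    name: str
    own_control_flow: bool = False  # True if this module's forward() branches on data
    children: List["Node"] = field(default_factory=list)


def is_traceable(node):
    """A module is traceable only if neither it nor any descendant has control flow."""
    return not node.own_control_flow and all(is_traceable(c) for c in node.children)


def find_blocks(node, prefix=""):
    """Return the paths of the largest sub-modules that contain no control flow."""
    path = f"{prefix}.{node.name}" if prefix else node.name
    if is_traceable(node):
        # The whole sub-tree can be transformed as one block; stop descending.
        return [path]
    blocks = []
    for child in node.children:
        # Control flow lives somewhere in this sub-tree, so look one level deeper.
        blocks.extend(find_blocks(child, path))
    # A leaf with its own control flow yields no block and is left untouched.
    return blocks


model = Node("model", own_control_flow=True, children=[
    Node("embeddings"),
    Node("encoder", own_control_flow=True, children=[
        Node("layer_0"),
        Node("layer_1"),
    ]),
    Node("pooler"),
])

blocks = find_blocks(model)
print(blocks)
```

In this toy tree the top-level module and the encoder both contain control flow, so the search descends past them and returns the embeddings, the two encoder layers, and the pooler as the blocks to transform.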

Adopting Our Solution

Intel Neural Compressor Examples

We provide two kinds of examples for natural language processing models based on Hugging Face Transformers. You can easily replace the input model with your own and quantize it based on fine-grained FX:

Hugging Face Optimum-Intel Examples

Optimum-Intel is an extension of Transformers that enables the use of popular compression techniques such as quantization and pruning via Intel Neural Compressor. All tasks in Optimum-Intel support fine-grained FX: language modeling, multiple choice, question answering, summarization, text classification, token classification, and translation. We have also uploaded several INT8 models to the Hugging Face model hub that can be easily initialized and used with Intel Neural Compressor, e.g.:

Future Work

The vision of fine-grained FX is to improve the productivity of PyTorch quantization, especially of the static quantization approach. We are continuously uploading INT8 models to the Hugging Face model hub for quick deployment. We invite you to try Intel Neural Compressor and Hugging Face Optimum-Intel and to share your models on the model hub. We also encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

Intel Analytics Software