Module 2: Optimization & Quantization of AI Models for Improved Performance
Performance varies by use, configuration, and other factors. Learn more at intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available
updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel® technologies may require enabled hardware, software, or service activation.
Intel® optimizations, for Intel® compilers or other products, may not optimize to the same degree for non-Intel products.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Results have been estimated or simulated.
Intel is committed to respecting human rights and avoiding complicity in human rights abuses.
See Intel’s Global Human Rights Principles. Intel® products and software are intended only to be used in
applications that do not cause or contribute to a violation of an internationally recognized human right.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
Module 2
Optimization & Quantization of AI Models for Improved Performance
Table of Contents
• Model Optimizer
• Setting Inputs Shape of a Model
• Cutting Off Parts of a Model
• Model Optimizer Optimization Techniques
• Generic Optimization
• Framework- or topology-specific optimization
• Model Quantization
• Compression of a Model
• Post-Training Optimization Tool (POT)
• Benchmark Tool
• Hands-on Labs
• Exercise 1: Download a model from OMZ using OpenVINO™ Notebooks - 104-model-tools
• Exercise 2: Tiny YOLO* V3 to IR conversion using OpenVINO™ toolkit
Module 2: Learning Objective
• Recognize the importance of optimizing and tuning pre-trained models for AI Inference.
• Understand the Model Optimizer, Post-training Optimization Tool, and their functions.
• Learn about the OpenVINO™ Intermediate Representation (IR).
• Implement the model optimization strategies: quantization and topology optimization.
• Understand the workflow and the factors to consider when quantizing Deep Learning models.
• Work on practical projects to understand the difference between pre- and post-optimization model performance.
Module 2: Learning Outcomes
• Explain why optimizing and tuning Deep Learning models for inference is necessary.
• Use the Model Optimizer and POT tools from the OpenVINO™ toolkit.
• Make informed technical decisions to select the best optimization strategy.
• Describe the advantages and disadvantages of various model optimization strategies.
Module 2: Key Questions Addressed
• Why do pre-trained Deep Learning models need further optimization?
• What exactly is a Model Optimizer? What roles does it play?
• What is the Intermediate Representation (IR) used by the OpenVINO™ toolkit?
• What is quantization, and what factors need to be kept in mind while using this optimization method?
• What are the different optimization strategies available with the OpenVINO™ toolkit?
• What is the Post-training Optimization tool? How is it useful?
• How can you benchmark model performance with the OpenVINO™ toolkit?
Model Optimizer
Convert model with Model Optimizer
▪ A Python*-based tool that reads trained models and converts them to the Intermediate Representation (IR) format
▪ Optimizes for performance or space with conservative topology transformations
▪ Hardware-agnostic optimizations

The simplest way to convert a model is:
> mo --input_model <INPUT_MODEL>
To get the full list of conversion parameters available in Model Optimizer, run the following command:
> mo --help

Workflow: Run Model Optimizer → Intermediate Representation (IR), stored as .xml and .bin files
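For instance, converting a local ONNX model would look like the following (the file name and output directory are illustrative):
> mo --input_model model.onnx --output_dir ir_out
This writes ir_out/model.xml (the topology) and ir_out/model.bin (the weights).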
Discussion Points
• What are some other reasons for pre-trained models requiring optimization?
• What are some tradeoffs that need to be kept in mind while optimizing a pre-trained
model?
Model Optimizer: Generic Optimization
• Operations pruning: drop unused operations that only matter for training
• Linear operations fusion (an example is shown in the slide's figure); see the sketch below
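The NumPy sketch below illustrates the algebra behind linear-operations fusion by folding a BatchNorm into a preceding fully connected layer. The tensors and values are made up for illustration, and this is not the Model Optimizer's implementation.

import numpy as np

# Random "trained" parameters for a fully connected layer followed by BatchNorm (illustrative only)
x = np.random.randn(1, 8).astype(np.float32)
W = np.random.randn(8, 4).astype(np.float32)
b = np.random.randn(4).astype(np.float32)
gamma, beta = np.random.randn(4), np.random.randn(4)
mean, var, eps = np.random.randn(4), np.abs(np.random.randn(4)), 1e-5

# Unfused: FC followed by BatchNorm (two operations at inference time)
y_ref = gamma * ((x @ W + b) - mean) / np.sqrt(var + eps) + beta

# Fused: the BatchNorm scale and shift are folded into the FC weights and bias
s = gamma / np.sqrt(var + eps)
W_fused = W * s                      # per-output-channel scaling
b_fused = (b - mean) * s + beta
y_fused = x @ W_fused + b_fused      # a single operation with identical output

print(np.allclose(y_ref, y_fused, atol=1e-5))  # True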
Setting Input Shapes
Use the CLI option --input_shape. Model Optimizer supports conversion of models with dynamic input shapes that contain undefined dimensions.
However, if the shape of the data is fixed, it is recommended to set a fully defined shape for the inputs: this can benefit both performance and memory consumption.
• Example 1: Run the Model Optimizer for the TensorFlow* MobileNet model with a single input and specify the input shape [2,300,300,3].
mo --input_model MobileNet.pb --input_shape [2,300,300,3]
• Example 2: Run the Model Optimizer for the ONNX* OCR model with a pair of inputs, data and seq_len, and specify the shapes [3,150,200,1] and [3] respectively.
mo --input_model ocr.onnx --input data,seq_len --input_shape [3,150,200,1],[3]
You can read about other strategies in depth in the Model Optimizer documentation.
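As a quick sanity check, you can read the converted IR back with the OpenVINO Runtime API and confirm the shape that was baked in. This is a hypothetical snippet; the file name assumes the output of Example 1 above.

from openvino.runtime import Core

core = Core()
model = core.read_model("MobileNet.xml")
print(model.input(0).partial_shape)  # expected: [2,300,300,3] after the conversion in Example 1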
Discussion Points
• What are some changes you noticed between the original model's IR and the optimized
OpenVINO™ IR?
• Why is it essential for MO's model optimizations to be hardware agnostic?
Model Quantization
Model Quantization is a method of representing Deep Learning models using less memory.
Most Deep Learning models are trained using full precision (FP32) representation, but research has shown that inference can be performed at lower numerical precision with minimal change in model accuracy.
At lower numerical precision, INT8 or a similar data format is used to store the weights and biases of the deep learning model.
In addition to accuracy considerations, you need to ensure that the target hardware platform supports the data format used for quantization (the slide's figure summarizes this compatibility).
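To make the idea concrete, here is a generic NumPy sketch of affine INT8 quantization of an FP32 weight tensor; it only illustrates the principle and is not the exact scheme used by the OpenVINO™ tools.

import numpy as np

w = np.random.randn(4, 4).astype(np.float32)           # FP32 weights (illustrative)
scale = (w.max() - w.min()) / 255.0                     # map the FP32 range onto 256 INT8 levels
zero_point = np.round(-w.min() / scale) - 128           # INT8 value that represents 0.0
w_int8 = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
w_restored = (w_int8.astype(np.float32) - zero_point) * scale   # dequantized approximation
print(np.abs(w - w_restored).max())                     # small reconstruction error (roughly scale/2)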
Compression of a Model to FP16
Use the CLI option --data_type.
• Model Optimizer can convert all floating-point weights to 16-bit precision.
• The resulting model occupies about half the disk space and runtime memory.
• FP16 is the recommended data type for GPU optimizations and is the only supported data type for MYRIAD VPUs.
Note: FP16 compression may cause a small accuracy drop, although for the majority of models the degradation is negligible.
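For example, a conversion with FP16 compression would look like the following (the model file name is illustrative):
mo --input_model model.onnx --data_type FP16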
Post-Training Optimization Tool (POT)
Overview of Post-Training Optimization Tool
The POT uses a conversion technique that reduces the model size to low precision without retraining.
• Improves latency with little degradation in model accuracy
• Different optimization approaches are supported: quantization algorithms, etc.
• The Default Quantization algorithm is designed to perform a fast and, in many cases, accurate quantization. It does not directly control the accuracy metric, but it provides a lot of knobs that can be used to improve it.

Workflow: an OpenVINO™ IR model (FP32 or FP16, .xml & .bin), a representative dataset, and a POT configuration (supplied via the CLI or the API) are fed to the Post-Training Optimization Tool, which produces an optimized INT8 OpenVINO™ IR model.
Sample Accuracy Checker Configuration (.yaml) Used by POT

models:
  - name: mobilenet-ssd
    launchers:
      - framework: openvino          # backend framework for Accuracy Checker
        adapter: ssd                 # converts raw framework output to a high-level, problem-specific representation (e.g., ClassificationPrediction, DetectionPrediction)
    datasets:
      - name: VOC2007_detection
        data_source: <DATASET_PATH>
        preprocessing:               # list of preprocessing steps applied to input data
          - type: resize
            size: 300
        postprocessing:              # list of postprocessing steps
          - type: resize_prediction_boxes
        metrics:                     # list of metrics that should be computed
          - type: map
            integral: 11point
            ignore_difficult: True
            presenter: print_scalar
            reference: 0.67

Command:
pot -c mobilenet-ssd.json

Workflow: the POT configuration (.json, read by the CLI), the Accuracy Checker configuration (.yaml), and the dataset are supplied to the Post-Training Optimization Tool.
Sample Configuration File of POT
Logically, all parameters are divided into three groups:
• Model parameters are related to the model definition
• Engine parameters define parameters of the engine that are responsible for model inference and for the data preparation used for optimization and evaluation
• Compression parameters are related to the optimization algorithm

{
    "model": {
        "model_name": "mobilenet-ssd",
        "model": "./public/mobilenet-ssd/FP32/mobilenet-ssd.xml",
        "weights": "./public/mobilenet-ssd/FP32/mobilenet-ssd.bin"
    },
    "engine": {
        "config": "./mobilenet-ssd.yaml"
    },
    "compression": {
        "algorithms": [
            {
                "name": "AccuracyAwareQuantization",
                "params": {
                    "preset": "performance",
                    "stat_subset_size": 300,
                    "maximal_drop": 0.01
                }
            }
        ]
    }
}
POT Python* API
Default Quantization algorithm using an unannotated dataset
To use this method, you need to create a Python* script that implements a data loader and the quantization pipeline:
1. Prepare the data and dataset interface using openvino.tools.pot.DataLoader
2. Select quantization parameters (the same parameters as in the configuration .json file, but defined in your Python code)
3. Define and run the quantization process using: from openvino.tools.pot import IEEngine, load_model, save_model, compress_model_weights, create_pipeline
The slide's diagram shows the user's implementation (configuration and DataLoader) feeding the existing API helpers, with save_model() writing the INT8 OpenVINO IR model.
Learn more about the POT Python API for the Default Quantization algorithm: https://intel.ly/DE4bwYp
Code Example of Defining DataLoader for an Image Dataset
In most cases, it is only required to implement the openvino.tools.pot.DataLoader interface, which allows acquiring data from a dataset and applying model-specific pre-processing, providing access by index. Any implementation should override the __len__() and __getitem__() methods, as sketched below.
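A minimal sketch of such a DataLoader is shown below, assuming a folder of images resized to a 300x300 model input. The class name, folder handling, use of OpenCV, and the (data, annotation) return format are illustrative assumptions; check the POT documentation for your release.

import os
import cv2
import numpy as np
from openvino.tools.pot import DataLoader

class ImageFolderLoader(DataLoader):
    # Loads images from a folder and applies simple model-specific pre-processing.
    def __init__(self, data_dir, input_size=(300, 300)):
        self._files = sorted(os.path.join(data_dir, f) for f in os.listdir(data_dir)
                             if f.lower().endswith((".jpg", ".jpeg", ".png")))
        self._input_size = input_size

    def __len__(self):
        return len(self._files)

    def __getitem__(self, index):
        image = cv2.imread(self._files[index])
        image = cv2.resize(image, self._input_size)
        image = image.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)  # HWC -> NCHW
        return image, None  # (data, annotation); the annotation is unused by Default Quantization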
Building the Pipeline
The quantization parameters are the same as in the configuration .json file, but are defined directly in Python; a minimal end-to-end sketch follows.
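A minimal Default Quantization sketch using the POT Python API is shown below. The model paths, the DataLoader class from the previous slide, and the parameter values are illustrative assumptions rather than a definitive implementation.

from openvino.tools.pot import IEEngine, load_model, save_model, compress_model_weights, create_pipeline

model_config = {
    "model_name": "mobilenet-ssd",
    "model": "mobilenet-ssd.xml",
    "weights": "mobilenet-ssd.bin",
}
engine_config = {"device": "CPU"}
algorithms = [{
    "name": "DefaultQuantization",
    "params": {"target_device": "CPU", "preset": "performance", "stat_subset_size": 300},
}]

data_loader = ImageFolderLoader("./calibration_images")   # DataLoader subclass from the previous slide
model = load_model(model_config)                          # load the FP32/FP16 OpenVINO IR model
engine = IEEngine(config=engine_config, data_loader=data_loader)
pipeline = create_pipeline(algorithms, engine)            # build the quantization pipeline
compressed_model = pipeline.run(model)                    # run Default Quantization
compress_model_weights(compressed_model)                  # optional: reduce the size of the .bin file
save_model(compressed_model, save_path="optimized_model", model_name="mobilenet-ssd_int8")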
Sample Application
Quantizing Object Detection Model with Accuracy Control
OpenVINO™ Notebooks - 105-language-quantize-bert
This tutorial demonstrates how to apply INT8 quantization to the BERT Natural Language Processing model using the Post-Training Optimization Tool API (part of OpenVINO). We will use a HuggingFace BERT PyTorch* model fine-tuned for the Microsoft Research Paraphrase Corpus (MRPC) task. The code of the tutorial is designed to be extendable to custom models and datasets.
https://intel.ly/DE3HQXJ
Benchmark Tool
The benchmark app allows you to benchmark your model's throughput and latency. Performance for
a particular application can also be evaluated virtually using Intel® DevCloud for the Edge Workloads,
a remote development environment with access to Intel® hardware and the latest versions of the
Intel® Distribution of the OpenVINO™ Toolkit.
Basic Usage
The Python benchmark_app is automatically installed when you install OpenVINO Developer Tools
using PyPI. Before running benchmark_app, make sure the openvino_env virtual environment is
activated, and navigate to the directory where your model is located.
The benchmarking application works with models in the OpenVINO IR (model.xml and model.bin)
and ONNX* (model.onnx) formats. Make sure to convert your models if necessary.
To run benchmarking with default options on a model, use the following command:
benchmark_app -m model.xml
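Useful optional flags include the target device, run duration, and performance hint; for example (device names and values are illustrative, and the exact set of flags can vary between releases):
benchmark_app -m model.xml -d CPU -t 15
benchmark_app -m model.xml -d GPU -hint latency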
Summary
• We learned about optimizing Deep Learning Models in this module.
• To begin, we learned why pre-trained deep learning models must be optimized for
inference.
• Then we delved deep into the OpenVINO™ toolkit's optimization tools, including the
Model Optimizer, Post-Training Optimization Tool, and other supporting software.
Hands-on Lab
Hands-on Labs
Exercise 1: Download a model from OMZ using OpenVINO™ Notebooks - 104-model-tools
https://intel.ly/104-model-tools
Hands-on Labs
Exercise 2: Tiny YOLO* V3 to IR conversion using OpenVINO™ toolkit
https://intel.ly/tinyyolov3-IR
System configuration (two test systems)
System 1:
• System board: Intel prototype, TGL U DDR4 SODIMM RVP
• CPU: 11th Gen Intel® Core™ i5-1145G7 @ 2.6 GHz
• Software: Intel® Distribution of OpenVINO™ toolkit 2021.1.075
• BIOS setting: Load default settings
• Precision and batch size: CPU: int8, GPU: FP16-int8, batch size: 1
System 2:
• System board: ASUSTeK COMPUTER INC. / Prime Z370-A
• CPU: 8th Gen Intel® Core™ i5-8500T @ 3.0 GHz
• Software: Intel® Distribution of OpenVINO™ toolkit 2021.1.075
• BIOS setting: Load default settings, set XMP to 2667
• Precision and batch size: CPU: int8, GPU: FP16-int8, batch size: 1
Notes:
1) Memory is installed such that all primary memory slots are populated.
2) Testing by Intel as of September 9, 2020.
Compounding effect of hardware and software configuration
System 1:
• System board: Purley E63448-400, Intel® Internal Reference System
• CPU: Intel® Xeon® Silver 4116 @ 2.1 GHz
• Memory: 12x 16 GB DDR4 2400 MHz
System 2:
• System board: Intel® Server Board S2600STB
• CPU: Intel® Xeon® Silver 4216 @ 2.10 GHz
• Memory: 12x 64 GB DDR4 2400 MHz
System 3:
• System board: Intel® Server Board S2600STB
• CPU: Intel® Xeon® Silver 4216R @ 2.20 GHz
• Memory: 12x 32 GB DDR4 2666 MHz
System 4:
• System board: Intel® Internal Reference System
• CPU: Intel® Xeon® Silver 4316 @ 2.30 GHz
• Memory: 16x 32 GB DDR4 2666 MHz