Paper Survey - Training With Quantization Noise For Extreme Model Compression

This document outlines a method for training neural networks with quantization noise for extreme model compression. It introduces quantization as a lossy compression technique that maps floating-point values to integers, resulting in information loss. Prior work on model compression through techniques like pruning and knowledge distillation is discussed. The proposed method trains networks with quantization noise applied to a subset of weights, making high compression rates more stable than post-training quantization. Results show the method can compress models like EfficientNet-B3 and RoBERTa to 3.3 MB and 14 MB respectively while maintaining high accuracy.


Training with Quantization Noise for Extreme Model Compression

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, Armand Joulin
Facebook AI
ICLR 2021
Outline

Note: quantization is lossy. Mapping a range of float32 values to int8 loses information, since int8 can represent only 256 distinct values.

• Introduction - Why Quantization for Model Compression?
• Related Work – Model Compression (Pruning, …), Quantization Methods
• Method - Training Networks with Quantization Noise
• Result - The impact of Quant-Noise on the performance of NLP & CV tasks
• Conclusion – The proposed method achieves high compression rates with little to no loss in accuracy
Models are Large

• Many of the best-performing neural network architectures in real-world applications have a large number of parameters
  • A single Transformer layer has millions of parameters
• Even models designed to jointly optimize performance and parameter efficiency, such as EfficientNet, still require dozens to hundreds of megabytes

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
AI Model Performance on ImageNet

[Figure: Top-1 accuracy (%) vs. FLOPs (G) on ImageNet; bubble size ∝ number of parameters (M). Labeled networks (parameters in M) include AlexNet (61), VGG-19 (143), SqueezeNet-v1.1 (1.2), ShuffleNet-v2.1 (0.34), MnasNet (3.9), MobileNetV2 (3.4), NASNet-Mobile (5.3), DenseNet-161 (29), Xception (23), Inception-v4 (43), ResNet-152 (60), PNASNet (86).]

Accuracy and inference time are in a hyperbolic relationship: a small increment in accuracy costs a lot of computational time [*]

[*] Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of deep neural network models for practical applications." arXiv preprint arXiv:1605.07678 (2016).
Model Compression

• Reduce the memory footprint of overparametrized models
• Pruning and distillation remove parameters by reducing the number of network weights
• Quantization instead focuses on reducing the number of bits per weight
• Whereas deleting weights or whole hidden units inevitably leads to a drop in performance, the authors demonstrate that quantizing the weights can be performed with little to no loss in accuracy (by applying quantization noise during training)

LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in Neural Information Processing Systems. 1990.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Quantization Network Principle

• Quantize float32 -> int8, compute in fixed point (int8 * int8 products fit in int16), then de-quantize
• Original calculation: use the float value x_float
• After quantization: x_quantized = round(x_float / S) + z
• Scale S = a / 2^N (a is the width of the float range), z is the zero point; an N-bit fixed-point representation gives a compression rate of 32 / N
• De-quantize: x_float ≈ S * (x_quantized - z) (see the sketch below)
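A minimal sketch of this affine quantize/de-quantize step, assuming NumPy; the variable names (a, S, z) follow the slide's notation, and the uint8 layout and example matrix are illustrative, not from the paper.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization of float32 values to 8-bit integers (N = 8)."""
    a = float(x.max() - x.min())           # dynamic range a covered by the values
    S = a / 2**8                           # scale S = a / 2^N
    z = np.round(-x.min() / S)             # zero point z, so that x.min() maps near 0
    q = np.clip(np.round(x / S) + z, 0, 255).astype(np.uint8)
    return q, S, z

def dequantize(q, S, z):
    """Recover approximate float values: x ≈ S * (q - z)."""
    return (S * (q.astype(np.float32) - z)).astype(np.float32)

w = np.random.randn(4, 4).astype(np.float32)
q, S, z = quantize_int8(w)
w_hat = dequantize(q, S, z)
print(np.abs(w - w_hat).max())             # quantization error, on the order of the step S
```

Storing one byte per weight instead of four gives the 32 / 8 = 4x compression rate implied by the formula above.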
Problem of Scalar Quantization

• Post-processing quantization methods: scalar quantization
  • Replace the floating-point weights of a trained network with a lower-precision representation (e.g., 32-bit -> 8-bit quantization)
• Pros: achieves a good compression rate, with the additional benefit of accelerating inference on supporting hardware (CPUs, …)
• Cons: a significant drop in performance

Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. "Improving the speed of neural networks on CPUs." (2011).
Stock, Pierre, et al. "And the bit goes down: Revisiting the quantization of neural networks." arXiv preprint arXiv:1907.05686 (2019).
Quantize the Network during Training!

• Quantize the weights in the forward pass: x̂ = f_quant(x)
• Challenge: the discretization operators have a null gradient, i.e. the derivative with respect to the input is zero almost everywhere

Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Straight Through Estimator (STE)

• In the backward pass, treat the quantization operator as the identity (see the sketch below)
• This works when the error introduced by STE is small

Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).
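A minimal sketch of the straight-through estimator, assuming PyTorch; the uniform quantization step is an illustrative stand-in for the real quantizer.

```python
import torch

def quantize_ste(w, step=0.1):
    """Fake-quantize w with a uniform step; round() has zero gradient almost everywhere,
    so the backward pass treats quantization as the identity (straight-through)."""
    w_q = torch.round(w / step) * step
    return w + (w_q - w).detach()   # forward value: w_q; gradient w.r.t. w: identity

w = torch.randn(3, 3, requires_grad=True)
y = quantize_ste(w).sum()
y.backward()
print(w.grad)   # all ones: STE ignores the gap between w_q and w, which is the source of its bias
```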
This Work: Training with Quantization Noise for Extreme Model Compression

• Quantizing only a subset of weights, instead of the entire network, during training is more stable for high compression schemes

Model           | Task     | Original Size | Quantized Size | Original Acc | Quantized Acc
EfficientNet-B3 | ImageNet | 50 MB         | 3.3 MB         | 81.7%        | 80%
RoBERTa Base    | MNLI     | 480 MB        | 14 MB          | 84.8%        | 82.5%
Related Work – Model Compression

• Weight Pruning: 3-stage pruning
  • Unstructured: remove individual weights
    • LeCun et al., 1990; Molchanov et al., 2017
  • Structured: follow the structure of the weights
    • Li et al., 2016; Luo et al., 2017; Fan et al., 2019
Related Work – Model Compression

• Lightweight Architectures
  • MobileNet
  • ShuffleNet
  • EfficientNet

Howard, Andrew, et al. "Searching for MobileNetV3." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
Related Work – Model Compression

• Knowledge Distillation has been applied to sentence representations (e.g., DistilBERT, TinyBERT)

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
Related Work – Quantization

• Scalar, Vector, and Product Quantization
• Vector quantization techniques split the matrix W into its p columns and learn a codebook on the resulting p vectors (see the sketch below)
• Codebook C = { c[1], …, c[K] }

[Figure: Vector Quantization with a Codebook]
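A toy sketch of this column-wise vector quantization, assuming NumPy and a hand-rolled k-means; the codebook size K and the matrix shape are illustrative, not from the paper.

```python
import numpy as np

def learn_codebook(vectors, K=8, iters=20):
    """Plain k-means: learn K codewords c[1..K] for a set of vectors (one per row)."""
    C = vectors[np.random.choice(len(vectors), K, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)                     # index of the nearest codeword
        for k in range(K):
            if np.any(assign == k):
                C[k] = vectors[assign == k].mean(0)  # move codeword to its cluster mean
    return C, assign

W = np.random.randn(64, 32).astype(np.float32)       # weight matrix with p = 32 columns
C, assign = learn_codebook(W.T, K=8)                  # quantize the columns
W_hat = C[assign].T                                   # each column replaced by its codeword
# storage: 8 codewords of length 64 plus one small index per column,
# instead of 64 * 32 full-precision weights
```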
Quantizing Neural Networks – Product Quantization

• Product Quantization (PQ) splits each column into m subvectors and learns the same codebook for all of the resulting m × p subvectors
• However, PQ induces a quantization drift as the reconstruction error accumulates
• Iterative PQ (iPQ): update the codebooks iteratively to compensate (see the sketch below)
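A sketch of product quantization along the lines described above, assuming NumPy and SciPy's k-means; the number of subvectors m and codebook size K are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def product_quantize(W, m=4, K=64):
    """Split each column of W (n x p) into m subvectors and learn one shared codebook."""
    n, p = W.shape
    assert n % m == 0
    d = n // m                                     # subvector dimension
    subvecs = W.T.reshape(p * m, d)                # the m * p subvectors
    codebook, codes = kmeans2(subvecs, K, minit='points')
    W_hat = codebook[codes].reshape(p, n).T        # PQ reconstruction of W
    return codebook, codes, W_hat

W = np.random.randn(64, 128)
codebook, codes, W_hat = product_quantize(W)
print(np.linalg.norm(W - W_hat))                   # reconstruction error; iPQ re-tunes the
                                                   # codebooks to keep it from accumulating
```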


Method - Training Networks with Quantization Noise

Note: N-bit fixed-point quantization gives a compression rate of 32 / N (e.g., 4x for float32 -> int8).

• STE introduces a bias in the gradients that depends on the level of quantization of the weights, and thus on the compression ratio
• They propose a simple modification to control this induced bias: a stochastic amelioration of QAT called Quant-Noise
• They quantize a randomly selected fraction of the weights instead of the full network as in QAT, letting unbiased gradients flow through the unquantized weights
Method - Adding Noise to Specific Quantization Methods

Notes: the distortion (noise) function φ is the quantization operator; for scalar quantization, a block b is a single weight.

• Forward pass: consider a real weight matrix W ∈ R^(n×p), viewed as a grid of blocks
  W = ( b_11 … b_1q ; … ; b_m1 … b_mq )
• Draw a set of index tuples J ⊂ { (k, l) : 1 ≤ k ≤ m, 1 ≤ l ≤ q }
• Noise function: ψ(b_kl | J) = φ(b_kl) if (k, l) ∈ J, and b_kl otherwise
• Noisy matrix: W_noise = ( ψ(b_kl | J) )_kl, with output y_noise = x W_noise (see the sketch below)
• Backward pass: STE
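A minimal sketch of the Quant-Noise forward/backward pass for the simplest case, where each block b_kl is a single weight and φ is int8-style scalar quantization, assuming PyTorch; the selection rate p and the quantization step are illustrative hyperparameters.

```python
import torch

def fake_int8(w, step=0.05):
    """Noise function φ: simulate int8 scalar quantization of a weight block."""
    return torch.clamp(torch.round(w / step), -128, 127) * step

def quant_noise_linear(x, W, p=0.5, step=0.05):
    """Distort only a random subset J of weight blocks; the rest stay full precision."""
    J = (torch.rand_like(W) < p).float()             # indicator of the selected tuples (k, l)
    W_noise = J * fake_int8(W, step) + (1 - J) * W   # ψ(b_kl | J)
    W_noise = W + (W_noise - W).detach()             # STE: only the distorted blocks get biased gradients
    return x @ W_noise                               # y_noise = x · W_noise

x = torch.randn(2, 16)
W = torch.randn(16, 8, requires_grad=True)
quant_noise_linear(x, W).sum().backward()            # unquantized weights receive exact gradients
```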
Result - Improving Compression with Quant-Noise

• Two network settings:
  • a Transformer network trained for language modeling on WikiText-103
  • an EfficientNet-B3 convolutional network trained for image classification on ImageNet-1k
Result - Improving Compression with Quant-Noise

PPL: a low perplexity indicates that the probability distribution is good at predicting the sample (see the definition below).
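For reference, perplexity is the exponentiated average negative log-likelihood of the evaluated tokens (the standard definition, not specific to this paper):

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
```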
Result - Comparison with State of the Art

• Datasets
  • MNLI: 433k sentence pairs annotated with textual entailment information
  • ImageNet
Result - Comparison with State of the Art - NLP

• Trade-off: performance vs. model size
  • Their RoBERTa
  • Share: weight sharing and pruning
  • TinyBERT
  • MobileBERT
  • AdaBERT
Result - Comparison with State of the Art - Image

• Their quantized EfficientNet-B3
• Share: weight sharing and pruning
• MobileNet-v2
• ShuffleNet-v2 x1
Result - Finetuning with Quant-Noise for Post-Processing Quantization

• Take existing trained models and fine-tune them with Quant-Noise before post-processing quantization
• For language modeling and RoBERTa, they train for 10 additional epochs
Conclusion

• Training with Quantization Noise for Extreme Model Compression
  • Quantizing a random subset of weights during training maintains performance
  • They validate that Quant-Noise works with a variety of quantization schemes on several applications in text and vision
  • It can be applied to a combination of iPQ and int8 to benefit from an extreme compression ratio and fixed-point arithmetic
  • It can also be used as a post-processing step to prepare already-trained networks for subsequent quantization, improving the performance of the compressed model
Thanks for Listening!

