Paper Survey - Training With Quantization Noise For Extreme Model Compression

This document outlines a method for training neural networks with quantization noise for extreme model compression. It introduces quantization as a lossy compression technique that maps floating-point values to integers, resulting in information loss. Prior work on model compression through techniques like pruning and knowledge distillation is discussed. The proposed method trains networks with quantization noise applied to a subset of weights, making high compression rates more stable than post-training quantization. Results show the method can compress models like EfficientNet-B3 and RoBERTa to 3.3 MB and 14 MB respectively while maintaining high accuracy.


Training with Quantization Noise for Extreme Model Compression

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, Armand Joulin
Facebook AI
ICLR 2021
Outline

Note: quantization is lossy. Mapping a range of float32 values to int8 loses information, since int8 can represent only 256 distinct values.

• Introduction - Why Quantization for Model Compression?
• Related Work – Model Compression (Pruning, …), Quantization Methods
• Method - Training Networks with Quantization Noise
• Result - The impact of Quant-Noise on the performance of NLP & CV tasks
• Conclusion – The proposed method achieves high compression rates with little to no loss in accuracy
Models are Large

• Many of the best-performing neural network architectures in real-world applications have a large number of parameters
  • A single Transformer layer has millions of parameters
• Even models designed to jointly optimize performance and parameter efficiency, such as EfficientNet, still require dozens to hundreds of megabytes

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
AI Model Performance on ImageNet

[Figure: Top-1 accuracy (%) vs. FLOPs (G) on ImageNet; bubble size ∝ number of parameters (M). Labeled networks (parameters in M) include AlexNet (61), VGG-19 (143), SqueezeNet-v1.1 (1.2), ShuffleNet-v2.1 (0.34), MnasNet (3.9), MobileNetV2 (3.4), NASNet-Mobile (5.3), DenseNet-161 (29), Xception (23), Inception-v4 (43), ResNet-152 (60), PNASNet (86).]

Accuracy and inference time are in a hyperbolic relationship: a small increment in accuracy costs a lot of computational time [*]

[*] Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of deep neural network models for practical applications." arXiv preprint arXiv:1605.07678 (2016).
Model Compression

• Reduce the memory footprint of overparametrized models
• Pruning and distillation remove parameters by reducing the number of network weights
• Quantization instead focuses on reducing the number of bits per weight
• Whereas deleting weights or whole hidden units inevitably leads to a drop in performance, the authors demonstrate that quantizing the weights can be performed with little to no loss in accuracy (by applying quantization noise during training)

LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in Neural Information Processing Systems. 1990.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Quantization Network Principle

• Quantize float32 -> int8, compute in fixed point (int8 * int8 products fit in int16), then de-quantize
• Original calculation: use the float value x_float
• After quantization: x_quantized = round(x_float / S) + z
• Scale S = a / 2^N (a is the width of the float range), z is the zero point; an N-bit fixed-point representation gives a compression rate of 32 / N
• De-quantize: x_float ≈ S * (x_quantized - z) (see the sketch below)
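A minimal sketch of this affine quantize/de-quantize step, assuming NumPy; the variable names (a, S, z) follow the slide's notation, and the uint8 layout and example matrix are illustrative, not from the paper.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization of float32 values to 8-bit integers (N = 8)."""
    a = float(x.max() - x.min())           # dynamic range a covered by the values
    S = a / 2**8                           # scale S = a / 2^N
    z = np.round(-x.min() / S)             # zero point z, so that x.min() maps near 0
    q = np.clip(np.round(x / S) + z, 0, 255).astype(np.uint8)
    return q, S, z

def dequantize(q, S, z):
    """Recover approximate float values: x ≈ S * (q - z)."""
    return (S * (q.astype(np.float32) - z)).astype(np.float32)

w = np.random.randn(4, 4).astype(np.float32)
q, S, z = quantize_int8(w)
w_hat = dequantize(q, S, z)
print(np.abs(w - w_hat).max())             # quantization error, on the order of the step S
```

Storing one byte per weight instead of four gives the 32 / 8 = 4x compression rate implied by the formula above.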
Problem of Scalar Quantization

• Post-processing quantization methods: scalar quantization
  • Replace the floating-point weights of a trained network with a lower-precision representation (e.g., 32-bit -> 8-bit quantization)
• Pros: achieves a good compression rate, with the additional benefit of accelerating inference on supporting hardware (CPUs, …)
• Cons: a significant drop in performance

Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. "Improving the speed of neural networks on CPUs." (2011).
Stock, Pierre, et al. "And the bit goes down: Revisiting the quantization of neural networks." arXiv preprint arXiv:1907.05686 (2019).
Quantize the Network during Training!

• Quantize the weights in the forward pass: x̂ = f_quant(x)
• Challenge: the discretization operators have a null gradient, i.e. the derivative with respect to the input is zero almost everywhere

Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Straight Through Estimator (STE)

• In the backward pass, treat the quantization operator as the identity (see the sketch below)
• This works when the error introduced by STE is small

Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).
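A minimal sketch of the straight-through estimator, assuming PyTorch; the uniform quantization step is an illustrative stand-in for the real quantizer.

```python
import torch

def quantize_ste(w, step=0.1):
    """Fake-quantize w with a uniform step; round() has zero gradient almost everywhere,
    so the backward pass treats quantization as the identity (straight-through)."""
    w_q = torch.round(w / step) * step
    return w + (w_q - w).detach()   # forward value: w_q; gradient w.r.t. w: identity

w = torch.randn(3, 3, requires_grad=True)
y = quantize_ste(w).sum()
y.backward()
print(w.grad)   # all ones: STE ignores the gap between w_q and w, which is the source of its bias
```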
This Work: Training with Quantization Noise for Extreme Model Compression

• Quantizing only a subset of weights, instead of the entire network, during training is more stable for high compression schemes

Model           | Task     | Original Size | Quantized Size | Original Acc | Quantized Acc
EfficientNet-B3 | ImageNet | 50 MB         | 3.3 MB         | 81.7%        | 80%
RoBERTa Base    | MNLI     | 480 MB        | 14 MB          | 84.8%        | 82.5%
Related Work – Model Compression

• Weight Pruning: 3-stage pruning
  • Unstructured: remove individual weights
    • LeCun et al., 1990; Molchanov et al., 2017
  • Structured: follow the structure of the weights
    • Li et al., 2016; Luo et al., 2017; Fan et al., 2019
Related Work – Model Compression

• Lightweight Architectures
  • MobileNet
  • ShuffleNet
  • EfficientNet

Howard, Andrew, et al. "Searching for MobileNetV3." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
Related Work – Model Compression

• Knowledge Distillation has been applied to sentence representations (e.g., DistilBERT, TinyBERT)

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
Related Work – Quantization

• Scalar, Vector, and Product Quantization
• Vector quantization techniques split the matrix W into its p columns and learn a codebook on the resulting p vectors (see the sketch below)
• Codebook C = { c[1], …, c[K] }

[Figure: Vector Quantization with a Codebook]
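A toy sketch of this column-wise vector quantization, assuming NumPy and a hand-rolled k-means; the codebook size K and the matrix shape are illustrative, not from the paper.

```python
import numpy as np

def learn_codebook(vectors, K=8, iters=20):
    """Plain k-means: learn K codewords c[1..K] for a set of vectors (one per row)."""
    C = vectors[np.random.choice(len(vectors), K, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)                     # index of the nearest codeword
        for k in range(K):
            if np.any(assign == k):
                C[k] = vectors[assign == k].mean(0)  # move codeword to its cluster mean
    return C, assign

W = np.random.randn(64, 32).astype(np.float32)       # weight matrix with p = 32 columns
C, assign = learn_codebook(W.T, K=8)                  # quantize the columns
W_hat = C[assign].T                                   # each column replaced by its codeword
# storage: 8 codewords of length 64 plus one small index per column,
# instead of 64 * 32 full-precision weights
```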
Quantizing Neural Networks – Product Quantization

• Product Quantization (PQ) splits each column into m subvectors and learns the same codebook for all of the resulting m × p subvectors
• However, PQ induces a quantization drift as the reconstruction error accumulates
• Iterative PQ (iPQ): update the codebooks iteratively to compensate (see the sketch below)
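A sketch of product quantization along the lines described above, assuming NumPy and SciPy's k-means; the number of subvectors m and codebook size K are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def product_quantize(W, m=4, K=64):
    """Split each column of W (n x p) into m subvectors and learn one shared codebook."""
    n, p = W.shape
    assert n % m == 0
    d = n // m                                     # subvector dimension
    subvecs = W.T.reshape(p * m, d)                # the m * p subvectors
    codebook, codes = kmeans2(subvecs, K, minit='points')
    W_hat = codebook[codes].reshape(p, n).T        # PQ reconstruction of W
    return codebook, codes, W_hat

W = np.random.randn(64, 128)
codebook, codes, W_hat = product_quantize(W)
print(np.linalg.norm(W - W_hat))                   # reconstruction error; iPQ re-tunes the
                                                   # codebooks to keep it from accumulating
```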


Method - Training Networks with Quantization Noise

Note: N-bit fixed-point quantization gives a compression rate of 32 / N (e.g., 4x for float32 -> int8).

• STE introduces a bias in the gradients that depends on the level of quantization of the weights, and thus on the compression ratio
• They propose a simple modification to control this induced bias: a stochastic amelioration of QAT called Quant-Noise
• They quantize a randomly selected fraction of the weights instead of the full network as in QAT, letting unbiased gradients flow through the unquantized weights
Method - Adding Noise to Specific Quantization Methods

Notes: the distortion (noise) function φ is the quantization operator; for scalar quantization, a block b is a single weight.

• Forward pass: consider a real weight matrix W ∈ R^(n×p), viewed as a grid of blocks
  W = ( b_11 … b_1q ; … ; b_m1 … b_mq )
• Draw a set of index tuples J ⊂ { (k, l) : 1 ≤ k ≤ m, 1 ≤ l ≤ q }
• Noise function: ψ(b_kl | J) = φ(b_kl) if (k, l) ∈ J, and b_kl otherwise
• Noisy matrix: W_noise = ( ψ(b_kl | J) )_kl, with output y_noise = x W_noise (see the sketch below)
• Backward pass: STE
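A minimal sketch of the Quant-Noise forward/backward pass for the simplest case, where each block b_kl is a single weight and φ is int8-style scalar quantization, assuming PyTorch; the selection rate p and the quantization step are illustrative hyperparameters.

```python
import torch

def fake_int8(w, step=0.05):
    """Noise function φ: simulate int8 scalar quantization of a weight block."""
    return torch.clamp(torch.round(w / step), -128, 127) * step

def quant_noise_linear(x, W, p=0.5, step=0.05):
    """Distort only a random subset J of weight blocks; the rest stay full precision."""
    J = (torch.rand_like(W) < p).float()             # indicator of the selected tuples (k, l)
    W_noise = J * fake_int8(W, step) + (1 - J) * W   # ψ(b_kl | J)
    W_noise = W + (W_noise - W).detach()             # STE: only the distorted blocks get biased gradients
    return x @ W_noise                               # y_noise = x · W_noise

x = torch.randn(2, 16)
W = torch.randn(16, 8, requires_grad=True)
quant_noise_linear(x, W).sum().backward()            # unquantized weights receive exact gradients
```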
Result - Improving Compression with Quant-Noise

• Two network settings:
  • a Transformer network trained for language modeling on WikiText-103
  • an EfficientNet-B3 convolutional network trained for image classification on ImageNet-1k
Result - Improving Compression with Quant-Noise

PPL: a low perplexity indicates that the probability distribution is good at predicting the sample (see the definition below).
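For reference, perplexity is the exponentiated average negative log-likelihood of the evaluated tokens (the standard definition, not specific to this paper):

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
```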
Result - Comparison with State of the Art

• Datasets
  • MNLI: 433k sentence pairs annotated with textual entailment information
  • ImageNet
Result - Comparison with State of the Art - NLP

• Trade-off: performance vs. model size
  • Their RoBERTa
  • Share: weight sharing and pruning
  • TinyBERT
  • MobileBERT
  • AdaBERT
Result - Comparison with State of the Art - Image

• Their quantized EfficientNet-B3
• Share: weight sharing and pruning
• MobileNet-v2
• ShuffleNet-v2 x1
Result - Finetuning with Quant-Noise for Post-Processing Quantization

• Take existing trained models and fine-tune them with Quant-Noise before post-processing quantization
• For language modeling and RoBERTa, they train for 10 additional epochs
Conclusion

• Training with Quantization Noise for Extreme Model Compression
  • Quantizing a random subset of weights during training maintains performance
  • They validate that Quant-Noise works with a variety of quantization schemes on several applications in text and vision
  • It can be applied to a combination of iPQ and int8 to benefit from an extreme compression ratio and fixed-point arithmetic
  • It can also be used as a post-processing step to prepare already-trained networks for subsequent quantization, improving the performance of the compressed model
Thanks for Listening!

