Paper Survey - Training With Quantization Noise For Extreme Model Compression
Outline
• Introduction - Why Quantization for Model Compression?

Quantization is lossy: mapping the range of float32 values to int8 is a lossy conversion, since int8 has only 256 representable values.
• Even models that are designed to jointly optimize performance and parameter efficiency, such as EfficientNets, still require dozens to hundreds of megabytes
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
[Figure] AI Model Performance on ImageNet: Top-1 accuracy (%) vs. FLOPs (G). Each bubble is labeled "network name, # of parameters (M)" and its size is proportional to the parameter count; networks shown include PNASNet (86M), Inception-v4 (43M), ResNet-152 (60M), Xception (23M), DenseNet-161 (29M), MnasNet (3.9M), SqueezeNet-v1.1 (1.2M), and AlexNet (61M).
Accuracy and inference time follow a hyperbolic relationship: a small gain in accuracy costs a large amount of computation [*]
[*] Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of deep neural network models for practical applications." arXiv preprint arXiv:1605.07678 (2016).
Model Compression
LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in Neural Information Processing Systems. 1990.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
• Original calculation: $x_{float}$
• After quantization: $x_{quantized} = \mathrm{round}(x_{float} / S) + z$, where $S = a / 2^N$ is the scale ($a$ is the range of the float values) and $z$ is the zero point
• Storing an N-bit fixed-point number in place of a 32-bit float gives a compression rate of $32 / N$
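As a concrete illustration, here is a minimal Python/NumPy sketch of this scalar quantization; the helper names (quantize, dequantize) and the choice of the weights' minimum as the zero-point reference are my own assumptions, not from the paper:

import numpy as np

def quantize(x_float, num_bits=8):
    """Map float32 values to N-bit fixed point: x_q = round(x_f / S) + z."""
    a = float(x_float.max() - x_float.min())   # a: range of the float values
    S = a / 2 ** num_bits                      # scale, as defined on the slide
    z = int(round(-x_float.min() / S))         # zero point
    x_q = np.clip(np.round(x_float / S) + z, 0, 2 ** num_bits - 1)
    return x_q.astype(np.uint8), S, z

def dequantize(x_q, S, z):
    """Approximately invert the mapping; the rounding error is irrecoverable."""
    return (x_q.astype(np.float32) - z) * S

w = np.random.randn(4, 4).astype(np.float32)
w_q, S, z = quantize(w)
print(np.abs(w - dequantize(w_q, S, z)).max())   # max quantization error
# 32-bit floats stored as 8-bit integers: compression rate 32/8 = 4x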
Problem of Scalar Quantization
• Post-processing quantization methods: scalar quantization
• Replace the floating-point weights of a trained network with a lower-precision representation (e.g., 32-bit -> 8-bit quantization)
Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Straight Through Estimator (STE)
Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).
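The STE trick: apply the non-differentiable rounding in the forward pass, but treat it as the identity in the backward pass so gradients can flow through the quantization step. A minimal PyTorch sketch (the ste_round helper name is mine):

import torch

def ste_round(x):
    # Forward: round(x). Backward: d(round)/dx is treated as 1 (identity),
    # because (x.round() - x).detach() contributes no gradient.
    return x + (x.round() - x).detach()

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = ste_round(x).sum()
y.backward()
print(x.grad)  # tensor([1., 1., 1.]) -- gradient passes straight through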
This Work: Training with Quantization Noise for Extreme Model Compression
• Quantizing only a subset of weights instead of the entire network during training is more stable for high-compression schemes
• Lightweight Architectures
• MobileNet
• ShuffleNet
• EfficientNet
Howard, Andrew, et al. "Searching for MobileNetV3." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.
Related Work – Model Compression
• Knowledge distillation has been applied to sentence representation models such as DistilBERT and TinyBERT (a sketch of the distillation loss follows the citations below)
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
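For context on the distillation approach behind DistilBERT and TinyBERT: a small student is trained to match the temperature-softened output distribution of a large teacher. A minimal sketch of the loss from Hinton et al., with the usual temperature T and mixing weight alpha (variable names are mine):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher
    # and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard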
Related Work - Quantization
• Forward pass: consider a real weight matrix $W \in \mathbb{R}^{n \times p}$, viewed as $m \times q$ blocks:
$$W = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mq} \end{pmatrix}$$
• A set of index tuples $J \subset \{(k, l) : 1 \le k \le m,\; 1 \le l \le q\}$ selects the blocks to distort
• Noise function: $\psi(b_{kl} \mid J) = \begin{cases} \varphi(b_{kl}) & \text{if } (k, l) \in J \\ b_{kl} & \text{otherwise} \end{cases}$
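Putting the pieces together, here is a minimal PyTorch sketch of training-time quantization noise in the spirit of ψ: each forward pass quantizes a random subset J of weights and passes gradients through with the STE. For simplicity this treats each weight as its own block and draws J i.i.d. Bernoulli(p); the real method operates on blocks matched to the target quantizer, and the function and parameter names here are mine:

import torch

def quant_noise(W, p=0.1, num_bits=8):
    """psi(b | J): quantize a random subset J of weights, keep the rest."""
    a = W.max() - W.min()
    S = a / (2 ** num_bits - 1)                   # scale
    z = W.min()                                   # zero point (float form)
    phi = torch.round((W - z) / S) * S + z        # quantized weights phi(b)
    J = torch.rand_like(W) < p                    # random subset of blocks
    W_noised = torch.where(J, phi, W)             # psi(b | J)
    return W + (W_noised - W).detach()            # straight-through estimator

W = torch.randn(16, 16, requires_grad=True)
out = quant_noise(W, p=0.1).sum()
out.backward()   # gradients reach all weights, quantized or not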
• An EfficientNet-B3 convolutional network trained for image classification on ImageNet-1k
PPL (perplexity): a low perplexity indicates that the probability distribution is good at predicting the sample.
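Concretely, perplexity is the exponential of the average per-token negative log-likelihood; a quick sketch:

import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum log p(token_i)); lower = better prediction."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model assigning probability 0.25 to every token has PPL 4.
print(perplexity([math.log(0.25)] * 10))  # 4.0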
Result - Improving Compression with Quant-Noise
Result - Comparison with State of the Art
• Dataset
• MNLI: 433k sentence pairs annotated with textual entailment information
• ImageNet
Result - Comparison with State of the Art - NLP
• TinyBERT
• MobileBERT
• AdaBERT
Result - Comparison with State of the Art - Image
• Their quantized EfficientNet-B3
• Share: weight sharing and pruning
• MobileNet-v2
• ShuffleNet-v2 ×1
Result - Finetuning with Quant-Noise for Post-Processing Quantization
• Existing models can be finetuned with Quant-Noise and then quantized as a post-processing step
Conclusion