Lec06 Quantization II

Lecture 06
Quantization Part II

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Lecture Plan
Today we will:
1. Review Linear Quantization.
Neural Network Quantization

Example (first row of weights): 2.09 -0.98 1.48 0.09 (32-bit float) → K-Means: indices 3 0 2 1 with codebook entry 3: 2.00 → Linear: integers 1 -2 0 -1

                Floating-Point      K-Means-based Quantization     Linear Quantization
Storage         Floating-point      Integer weights;               Integer weights
                weights             floating-point codebook
Computation     Floating-point      Floating-point                 Integer
                arithmetic          arithmetic                     arithmetic
K-Means-based Weight Quantization

[Figure: accuracy loss (from -1.0% to -4.5%) vs. model size ratio after compression (2% to 20%).]

Deep Compression [Han et al., ICLR 2016]
Linear Quantization

An affine mapping of integers to real numbers: r = S(q − Z)

2-bit signed integers (two's complement):

Binary    Decimal
01         1
00         0
11        -1
10        -2
Linear Quantization

An affine mapping of integers to real numbers: r = S(q − Z)

• r: floating-point real number in the range [rmin, rmax]
• q: N-bit integer in the range [qmin, qmax]
• S: floating-point scale
• Z: integer zero point (the integer that maps to the real value 0)

Bit Width N    qmin          qmax
2              -2            1
3              -4            3
4              -8            7
N              -2^(N-1)      2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
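As a concrete sketch of this affine mapping, the NumPy snippet below quantizes a tensor to N-bit integers and dequantizes it back; the function names and the symmetric/asymmetric switch are our own illustration, not code from the course labs.

    import numpy as np

    def linear_quantize(r, num_bits=8, symmetric=False):
        # Affine mapping r ≈ S * (q - Z) with q in [qmin, qmax]
        qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
        if symmetric:
            S = np.abs(r).max() / qmax           # |r|max maps to qmax
            Z = 0                                # symmetric: zero point is 0
        else:
            S = (r.max() - r.min()) / (qmax - qmin)
            Z = int(round(qmin - r.min() / S))   # integer representing real value 0
        q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32)
        return q, S, Z

    def linear_dequantize(q, S, Z):
        return S * (q.astype(np.float32) - Z)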
Linear Quantized Fully-Connected Layer

Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)

• Consider the following fully-connected layer: Y = WX + b
• Use symmetric weight quantization (Z_W = 0), and for the bias choose Z_b = 0, S_b = S_W S_X, so the bias folds into a precomputed term:

    q_bias = q_b − Z_X q_W

• The quantized output is then

    q_Y = (S_W S_X / S_Y) (q_W q_X + q_bias) + Z_Y

• q_W q_X + q_bias uses N-bit integer multiplication with 32-bit integer addition; multiplying by S_W S_X / S_Y rescales the 32-bit accumulator back to an N-bit integer, and adding Z_Y is an N-bit integer addition.
• Note: both q_b and q_bias are 32 bits.
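A NumPy sketch of this integer-only forward pass under the slide's assumptions (Z_W = 0, Z_b = 0, S_b = S_W S_X); the rescaling is done in floating point here, whereas real deployments typically use a fixed-point multiplier.

    import numpy as np

    def quantized_linear(q_X, q_W, q_b, S_X, Z_X, S_W, S_Y, Z_Y, num_bits=8):
        qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
        # Precomputed at deployment: fold the input zero point into the bias (32-bit).
        q_bias = q_b.astype(np.int32) - Z_X * q_W.astype(np.int32).sum(axis=1)
        # N-bit integer multiply, 32-bit integer accumulate.
        acc = q_W.astype(np.int32) @ q_X.astype(np.int32) + q_bias
        # Rescale the 32-bit accumulator back to N bits and add the output zero point.
        q_Y = np.round((S_W * S_X / S_Y) * acc) + Z_Y
        return np.clip(q_Y, qmin, qmax).astype(np.int8)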
Linear Quantized Convolution Layer

Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)

• Consider the following convolution layer: Y = Conv(W, X) + b
• With Z_W = 0 and Z_b = 0, S_b = S_W S_X:

    q_bias = q_b − Conv(q_W, Z_X)

    q_Y = (S_W S_X / S_Y) (Conv(q_W, q_X) + q_bias) + Z_Y
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?
Post-Training Quantization
Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding
Quantization Granularity
• Per-Tensor Quantization
• Per-Channel Quantization
• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
Symmetric Linear Quantization on Weights

• Symmetric quantization sets Z = 0 and uses | r |max = | W |max.
• Using a single scale S for the whole weight tensor (Per-Tensor Quantization):
  • works well for large models
  • accuracy drops for small models
• A common failure results from large differences (more than 100×) in the ranges of weights for different output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).
• Solution: Per-Channel Quantization.

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Per-Channel Weight Quantization

Example: 2-bit linear quantization of a 4×4 weight tensor W (rows = output channels oc, columns = input channels ic):

     2.09  -0.98   1.48   0.09
     0.05  -0.14  -1.08   2.12
    -0.91   1.92   0     -1.03
     1.87   0      1.53   1.49

Per-Tensor Quantization: a single scale for the whole tensor.

    | r |max = 2.12,   S = | r |max / qmax = 2.12 / (2^(2−1) − 1) = 2.12

    Quantized q_W        Reconstructed S · q_W
     1  0  1  0           2.12  0     2.12  0
     0  0 -1  1           0     0    -2.12  2.12
     0  1  0  0           0     2.12  0     0
     1  0  1  1           2.12  0     2.12  2.12

    ‖W − S q_W‖_F = 2.28

Per-Channel Quantization: one scale per output channel.

    Row 0: | r |max = 2.09, S0 = 2.09     Row 1: | r |max = 2.12, S1 = 2.12
    Row 2: | r |max = 1.92, S2 = 1.92     Row 3: | r |max = 1.87, S3 = 1.87

    Quantized q_W        Reconstructed S · q_W
     1  0  1  0           2.09  0     2.09  0
     0  0 -1  1           0     0    -2.12  2.12
     0  1  0 -1           0     1.92  0    -1.92
     1  0  1  1           1.87  0     1.87  1.87

    ‖W − S q_W‖_F = 2.08

The per-channel reconstruction error (2.08) is smaller than the per-tensor one (2.28).
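The example can be checked numerically; this short sketch reproduces both reconstruction errors (the variable names are ours).

    import numpy as np

    W = np.array([[ 2.09, -0.98,  1.48,  0.09],
                  [ 0.05, -0.14, -1.08,  2.12],
                  [-0.91,  1.92,  0.00, -1.03],
                  [ 1.87,  0.00,  1.53,  1.49]])
    qmax = 2 ** (2 - 1) - 1                # 2-bit signed: q in [-2, 1]

    def reconstruct(W, S):
        q = np.clip(np.round(W / S), -2, 1)
        return S * q

    W_pt = reconstruct(W, np.abs(W).max() / qmax)                        # per-tensor
    W_pc = reconstruct(W, np.abs(W).max(axis=1, keepdims=True) / qmax)   # per-channel

    print(np.linalg.norm(W - W_pt))   # ≈ 2.28
    print(np.linalg.norm(W - W_pc))   # ≈ 2.08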
Group Quantization

Achieve a balance between quantization accuracy and hardware efficiency:
• Blackwell GPUs support "micro-tensor scaling" to optimize accuracy for FP4 AI.
• The FP4 tensor core provides 2× higher theoretical throughput than the FP8/FP6/INT8 tensor cores.

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai et al.]
Group Quantization

Multi-level scaling scheme: generalize the single scale factor to a product of scale factors at different levels,

    r = (q − z) · s   →   r = (q − z) · s_l0 · s_l1 · ⋯

• r: real number value
• q: quantized value (e.g., INT4)
• z: zero point (z = 0 is symmetric quantization)
• s: scale factors of different levels
  • Single-level (L0): one FP16 scale s_l0 per group of INT4 weights.
  • Two-level (L0 + L1): a UINT4 per-vector scale s_l0 at level 0, combined with a coarser FP16 scale s_l1 at level 1.
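A minimal NumPy sketch of the single-level (L0) scheme: symmetric INT4 values with one scale per group. The group_size and the function names are illustrative choices.

    import numpy as np

    def group_quantize(w, group_size=32, num_bits=4):
        # One scale per group of `group_size` values (assumes w.size % group_size == 0).
        qmax = 2 ** (num_bits - 1) - 1
        groups = w.reshape(-1, group_size)
        s_l0 = np.maximum(np.abs(groups).max(axis=1, keepdims=True), 1e-8) / qmax
        q = np.clip(np.round(groups / s_l0), -qmax - 1, qmax).astype(np.int8)
        return q, s_l0.astype(np.float16)      # INT4-range values + FP16 per-group scale

    def group_dequantize(q, s_l0, shape):
        return (q.astype(np.float32) * s_l0.astype(np.float32)).reshape(shape)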
Linear Quantization on Activations

• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range [rmin, rmax], activation statistics are gathered before deploying the model.
Dynamic Range for Activation Quantization

Collect activation statistics before deploying the model.

• Type 1: During training
  • Exponential moving averages (EMA):

      r̂_max,min^(t) = α · r_max,min^(t) + (1 − α) · r̂_max,min^(t−1)

  • Observed ranges are smoothed across thousands of training steps.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
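A small sketch of such an EMA range observer (the class name and the value of alpha are our own choices):

    class EMARangeObserver:
        """Track the activation range [rmin, rmax] with an exponential moving average."""
        def __init__(self, alpha=0.01):
            self.alpha = alpha
            self.rmin = self.rmax = None

        def update(self, x):
            batch_min, batch_max = float(x.min()), float(x.max())
            if self.rmin is None:          # first batch initializes the range
                self.rmin, self.rmax = batch_min, batch_max
            else:                          # r̂(t) = α · r(t) + (1 − α) · r̂(t−1)
                self.rmin = self.alpha * batch_min + (1 - self.alpha) * self.rmin
                self.rmax = self.alpha * batch_max + (1 - self.alpha) * self.rmax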
Dynamic Range for Activation Quantization

Collect activation statistics before deploying the model.

• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Spending dynamic range on outliers hurts the representation ability.
  • Option: use the mean (𝔼) of the min/max of each sample in the batches.
  • Option: analytical calculation (see next slide).

Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
Dynamic Range for Activation Quantization

Collect activation statistics before deploying the model.

• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Minimize the loss of information, so that the integer model encodes (nearly) the same information as the original floating-point model.
  • Loss of information is measured by the Kullback-Leibler divergence (relative entropy, or information divergence). For two discrete probability distributions P and Q:

      D_KL(P ‖ Q) = Σ_i^N P(x_i) log( P(x_i) / Q(x_i) )

  • Intuition: KL divergence measures the amount of information lost when approximating a given encoding.

[Figure: activation histograms with clip thresholds for AlexNet (Pool 2), GoogleNet (inception_5a/5x5, inception_3a/pool), and ResNet-152 (res4b8_branch2a).]

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
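A TensorRT-style calibration sketch (simplified; the production algorithm differs in detail): sweep candidate clip thresholds over the activation histogram and keep the one that minimizes KL(P ‖ Q).

    import numpy as np
    from scipy.stats import entropy   # entropy(p, q) computes KL(P ‖ Q)

    def kl_calibrate(activations, num_bins=2048, num_levels=128):
        hist, edges = np.histogram(np.abs(activations), bins=num_bins)
        best_kl, best_t = np.inf, edges[-1]
        for i in range(num_levels, num_bins + 1):
            p = hist[:i].astype(np.float64).copy()
            p[-1] += hist[i:].sum()                 # clip: fold outlier mass into last bin
            # Quantize the clipped histogram to num_levels bins, then expand back.
            chunks = np.array_split(p, num_levels)
            q = np.concatenate([np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
                                for c in chunks])
            if q.sum() > 0:
                kl = entropy(p / p.sum(), q / q.sum())
                if kl < best_kl:
                    best_kl, best_t = kl, edges[i]
        return best_t    # quantize activations to the range [-t, t]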
Dynamic Range for Quantization

Minimize the mean squared error (MSE) between the FP32 values and their quantized counterparts, solving for the optimal clipping value with the Newton-Raphson method.
Adaptive Rounding for Weight Quantization

Rounding-to-nearest is not optimal.

• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other. The best rounding for each individual weight (to nearest) is not necessarily the best rounding for the whole tensor.

      rounding-to-nearest:                 0.3 0.5 0.7 0.2  →  0 1 1 0
      AdaRound (one potential result):     0.3 0.5 0.7 0.2  →  0 0 1 0

  • What is optimal? The rounding that best reconstructs the original activation, which may be very different.
• AdaRound is for weight quantization only.
• With only short-term tuning, it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization

Rounding-to-nearest is not optimal.

• Method:
  • Instead of ⌊w⌉, choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • Take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • Optimize the following objective (derivation omitted), where h(V) ∈ [0, 1] is a function of the learned variable V (in the paper, a rectified sigmoid), and f_reg(V) is a regularizer that pushes h(V) toward 0 or 1:

      argmin_V ‖Wx − W̃x‖²_F + λ f_reg(V)
      →  argmin_V ‖Wx − ⌊⌊W⌋ + h(V)⌉ x‖²_F + λ f_reg(V)

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
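A simplified PyTorch sketch of this objective, assuming the paper's rectified sigmoid for h(V) and its regularizer; λ, β, and the function names here are illustrative.

    import torch

    def h(V, zeta=1.1, gamma=-0.1):
        # Rectified sigmoid: stretched so h(V) can saturate at exactly 0 or 1.
        return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)

    def adaround_loss(W, x, V, S, lam=0.01, beta=10.0):
        W_soft = S * (torch.floor(W / S) + h(V))           # soft-quantized weights
        recon = ((W @ x - W_soft @ x) ** 2).sum()          # ‖Wx − W̃x‖²_F
        f_reg = (1 - (2 * h(V) - 1).abs() ** beta).sum()   # pushes h(V) toward 0 or 1
        return recon + lam * f_reg

    # After optimizing V, the hard rounding is ⌊W/S⌋ + (h(V) >= 0.5).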
Neural Network Quantization

Example weights (32-bit float)     K-Means indices + codebook     Linear (S = 1.07, Z = −1)
 2.09 -0.98  1.48  0.09             3 0 2 1    3:  2.00             1 -2  0 -1
 0.05 -0.14 -1.08  2.12             1 1 0 3    2:  1.50            -1 -1 -2  1
-0.91  1.92  0    -1.03             0 3 1 0    1:  0.00            -2  1 -1 -2
 1.87  0     1.53  1.49             3 1 2 2    0: -1.00             1 -1  0  0

                K-Means-based Quantization              Linear Quantization
Storage         Integer weights;                        Integer weights
                floating-point codebook
Computation     Floating-point arithmetic               Integer arithmetic

• Zero Point: Asymmetric / Symmetric
• Scaling Granularity: Per-Tensor / Per-Channel / Group Quantization
• Range Clipping: Exponential Moving Average / Minimizing KL Divergence / Minimizing Mean-Square-Error
• Rounding: Round-to-Nearest / AdaRound
Post-Training INT8 Linear Quantization

              Configuration A                    Configuration B
Activation    Symmetric, Per-Tensor;             Asymmetric, Per-Tensor;
              Minimize KL-Divergence             Exponential Moving Average (EMA)
Weight        Symmetric, Per-Tensor              Symmetric, Per-Channel

Accuracy change after INT8 post-training quantization:
GoogleNet     -0.45%                             0%
ResNet-50     -0.13%                             -0.6%
ResNet-152    -0.08%                             -1.8%
MobileNetV1   -                                  -11.8%
MobileNetV2   -                                  -2.1%

Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity. How should we improve the performance of quantized models?

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Quantization-Aware Training
How should we improve performance of quantized models?
Quantization-Aware Training

• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths, the neural network is trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.

Example (K-Means-based quantization): weights (32-bit float) → cluster indices (2-bit int) → centroids, which are then fine-tuned (e.g., centroid 3: 2.00 → 1.96).

[Figure: weight quantization inserted into a Conv → Batch Norm → ReLU pipeline.]
Quantization-Aware Training

• A full precision copy of the weights W is maintained throughout the training: the forward pass uses the "simulated/fake quantized" weights, while the backward pass updates the full-precision copy W.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
Linear Quantization

An affine mapping of integers to real numbers: r = S(q − Z). The full-precision weights W are quantized to integers q_W, and the reconstruction Q(W) = S_W q_W is what the forward pass sees.
Straight-Through Estimator (STE)

• Quantization is discrete-valued, and thus its derivative is zero almost everywhere; naively, no gradient would flow back to the weights.
• The STE treats the quantizer as the identity function in the backward pass, passing the gradient straight through:

    g_W = ∂L/∂W ≈ ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
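A minimal PyTorch sketch of the STE wrapped around rounding (RoundSTE and fake_quantize are illustrative names, not library APIs):

    import torch

    class RoundSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)              # discrete-valued forward

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output                 # identity gradient: straight-through

    def fake_quantize(w, S, Z, qmin, qmax):
        # Simulated/fake quantization Q(W) = S * (q - Z), differentiable via the STE.
        q = torch.clamp(RoundSTE.apply(w / S) + Z, qmin, qmax)
        return S * (q - Z)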
Quantization-Aware Training

Train the model taking quantization into consideration.

• A full precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/fake quantization": in the forward pass, weights and activations are quantized and reconstructed,

    W → S_W q_W = Q(W),    Y → S_Y (q_Y − Z_Y) = Q(Y)

and in the backward pass the gradients flow to the full-precision copies through the STE:

    g_W ← ∂L/∂Q(W),    g_Y ← ∂L/∂Q(Y)

Layer N−1 → Q(X) → Layer N (e.g., Conv → Batch Norm → ReLU) → Y → Q(Y) → Layer N+1

The quantization ops at the layer boundaries ensure discrete-valued weights and activations; the operations in between still run in full precision.
INT8 Linear Quantization-Aware Training

[Table: ImageNet accuracy with INT8 quantization-aware training for various networks.]

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
Neural Network Quantization

Can the computation go below integer arithmetic? Binarized weights (e.g., 1 0 1 1) add a fourth column:

                Floating-Point      K-Means-based             Linear              Binary/Ternary
                                    Quantization              Quantization        Quantization
Storage         Floating-point      Integer weights;          Integer weights     Binary/ternary weights
                weights             floating-point codebook
Computation     Floating-point      Floating-point            Integer             Bit operations
                arithmetic          arithmetic                arithmetic
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?
Can quantization bit width go even lower?

    y_i = Σ_j W_ij · x_j

Example with weights W_i = [8, -3, 5, -1] and input x = [5, 2, 0, 1]:

    y_i = 8×5 + (-3)×2 + 5×0 + (-1)×1 = 33
If weights are quantized to +1 and -1

    y_i = Σ_j W_ij · x_j

With binary weights W_i = [1, -1, 1, -1] and the same input x = [5, 2, 0, 1], every multiplication becomes an addition or subtraction:

    y_i = 5 - 2 + 0 - 1 = 2

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Binarization

• Deterministic Binarization
  • Directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

      q = sign(r) = { +1, r ≥ 0
                      −1, r < 0

• Stochastic Binarization
  • Uses global statistics or the value of the input data to determine the probability of being -1 or +1.
  • E.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid function σ(r).
  • Harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
Minimizing Quantization Error in Binarization

weights W (32-bit float) → binary weights W_𝔹 = sign(W) (1-bit) → scaled binary weights α W_𝔹, with the scale (32-bit float)

    α = ‖W‖₁ / n

For the 4×4 example tensor: ‖W − W_𝔹‖²_F = 9.28, while with α = ‖W‖₁ / 16 = 1.05 the error drops to ‖W − α W_𝔹‖²_F = 9.24.

AlexNet-based ImageNet Top-1 accuracy delta: BinaryConnect -21.2%

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized

    y_i = Σ_j W_ij · x_j

With W, x ∈ {−1, +1}, each product W_ij · x_j is +1 when the two signs agree and −1 when they differ, which is exactly an XNOR of the corresponding bits.

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized

Encode −1 as bit 0 and +1 as bit 1. The dot product then reduces to an XNOR followed by a population count:

    y_i = −n + popcount(xnor(W_i, x)) ≪ 1

Example with n = 4: if the XNOR result is 1000₂, then y_i = −4 + (popcount(1000₂) ≪ 1) = −4 + 2 = −2.

Accuracy cost of binarization (AlexNet-based ImageNet Top-1 delta):
Network       W bits   A bits   Accuracy delta
BWN           1        32       +0.2%
XNOR-Net      1        1        -12.4%

[Table: ImageNet Top-1 accuracy for full precision, 1-bit (BWN), and 2-bit ternary (TWN, TTQ) weights.]

Computation: floating-point arithmetic → integer arithmetic → bit operations.
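A pure-Python sketch of this XNOR + popcount dot product on bit-packed vectors (the packing convention here is our own):

    def binary_dot(w_bits, x_bits, n):
        # Dot product of two {-1,+1} vectors packed as n-bit ints (bit 1 = +1, bit 0 = -1).
        agree = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR: 1 where the signs agree
        return -n + (bin(agree).count("1") << 1)      # popcount, shifted left by 1

    w = 0b1010   # [+1, -1, +1, -1]
    x = 0b1101   # [+1, +1, -1, +1]
    print(binary_dot(w, x, 4))   # -4 + 2 = -2, matching the slide's example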
Mixed-Precision Quantization
Uniform Quantization

Every layer uses the same bit widths for weights and activations:

Layer 1: weight 8 bits / activation 8 bits
Layer 2: weight 8 bits / activation 8 bits
Layer 3: weight 8 bits / activation 8 bits
...
Mixed-Precision Quantization

Each layer gets its own weight/activation bit widths:

Layer 1: weight 4 bits / activation 5 bits
Layer 2: weight 6 bits / activation 7 bits
Layer 3: weight 5 bits / activation 4 bits
...
Challenge: Huge Design Space

With 8 choices for the weight bit width and 8 for the activation bit width, each layer has 8 × 8 = 64 options:

Layer 1: 4 bits / 5 bits    Choices: 8 × 8 = 64
Layer 2: 6 bits / 7 bits    Choices: 8 × 8 = 64
Layer 3: 5 bits / 4 bits    Choices: 8 × 8 = 64
...

The design space is 64^n for an n-layer network, far too large for exhaustive search.
Solution: Design Automation

HAQ formulates mixed-precision quantization as a reinforcement learning problem: an actor-critic agent observes the State of each layer, takes an Action (the layer's weight/activation bit widths, e.g., 4 bits / 5 bits for Layer 1), and receives a Reward.

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Solution: Design Automation

Hardware in the loop: the chosen bit widths are mapped onto a hardware accelerator (e.g., BitFusion on the edge), and the measured cost provides direct feedback to the RL agent, instead of a proxy signal.

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Outperforms Uniform Quantization

[Figure: per-layer #weight bits and #activation bits (pointwise and depthwise) of the mixed-precision quantized MobileNetV2 found by HAQ, for an edge and a cloud accelerator, across layer indices 2-50.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Summary of Today's Lecture

In this lecture, we:
• Reviewed linear quantization: r = S(q − Z).
• Covered post-training quantization: quantization granularity (per-tensor, per-channel, group), dynamic range clipping (EMA, KL divergence, MSE), and rounding (round-to-nearest, AdaRound).
• Covered quantization-aware training: simulated/fake quantization and the straight-through estimator.
• Covered binary/ternary quantization and mixed-precision quantization.
References

1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]
12. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13. WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14. Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]