
EfficientML.ai Lecture 06
Quantization Part II

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Lecture Plan
Today we will:

1. Review Linear Quantization.

2. Introduce Post-Training Quantization (PTQ), which quantizes a floating-point neural network model, including: per-channel quantization, group quantization, and range clipping.

3. Introduce Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning and recovers the accuracy.

4. Introduce binary and ternary quantization.

5. Introduce automatic mixed-precision quantization.
Neural Network Quantization

(Figure: the same 4×4 floating-point weight matrix quantized two ways. K-Means-based quantization stores 2-bit cluster indices plus a floating-point codebook {3: 2.00, 2: 1.50, 1: 0.00, 0: -1.00}; linear quantization stores 2-bit integer weights with zero point Z = -1 and scale S = 1.07.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic
K-Means-based Weight Quantization

weights (32-bit float):          cluster index (2-bit int):
 2.09  -0.98   1.48   0.09        3  0  2  1
 0.05  -0.14  -1.08   2.12        1  1  0  3
-0.91   1.92   0     -1.03        0  3  1  0
 1.87   0      1.53   1.49        3  1  2  2

centroids:        fine-tuned centroids:
3:  2.00           1.96
2:  1.50           1.48
1:  0.00          -0.04
0: -1.00          -0.97
Deep Compression [Han et al., ICLR 2016]
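To make the codebook idea concrete, here is a minimal sketch that clusters the example weight matrix into a 2-bit codebook. It uses numpy and scikit-learn's KMeans purely for illustration (it is not the original Deep Compression code), and the centroid fine-tuning step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

n_bits = 2
kmeans = KMeans(n_clusters=2 ** n_bits, n_init=10, random_state=0)
indices = kmeans.fit_predict(W.reshape(-1, 1))     # 2-bit cluster index per weight
codebook = kmeans.cluster_centers_.flatten()       # floating-point centroids (the codebook)

W_dequant = codebook[indices].reshape(W.shape)     # weights reconstructed from the codebook
print(np.round(codebook, 2))                       # roughly {-1.0, 0.0, 1.5, 2.0}, in some order
print(np.abs(W - W_dequant).max())                 # largest per-weight quantization error
```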


K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

(Figure: Accuracy Loss (about +0.5% to -4.5%) vs. Model Size Ratio after Compression (2% to 20%), comparing Pruning + Quantization, Pruning Only, and Quantization Only.)

Deep Compression [Han et al., ICLR 2016]
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

weights (32-bit float)        quantized weights (2-bit signed int)     zero point Z = -1 (2-bit signed int), scale S = 1.07 (32-bit float)
 2.09 -0.98  1.48  0.09        1 -2  0 -1
 0.05 -0.14 -1.08  2.12       -1 -1 -2  1
-0.91  1.92  0    -1.03       -2  1 -1 -2
 1.87  0     1.53  1.49        1 -1  0  0

reconstructed S·(q − Z):
 2.14 -1.07  1.07  0
 0     0    -1.07  2.14
-1.07  2.14  0    -1.07
 2.14  0     1.07  1.07

2-bit signed integer encoding:
Binary   Decimal
01        1
00        0
11       -1
10       -2
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

(Figure: the floating-point range [rmin, rmax] (containing 0) is mapped by the scale S onto the integer range [qmin, qmax]; the zero point Z is the integer that the real value 0 maps to.)

Bit Width N    qmin          qmax
2              -2            1
3              -4            3
4              -8            7
N              -2^(N-1)      2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
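As a concrete sketch of how S and Z can be derived from these ranges, here is a minimal numpy illustration of r = S(q − Z) for an N-bit signed integer (a sketch, not any particular library's implementation):

```python
import numpy as np

def quant_params(rmin, rmax, n_bits=8):
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # make sure real 0 is representable
    S = (rmax - rmin) / (qmax - qmin)             # scale (floating point)
    Z = int(round(qmin - rmin / S))               # zero point (integer)
    return S, int(np.clip(Z, qmin, qmax)), qmin, qmax

def quantize(r, S, Z, qmin, qmax):
    return np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32)

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

x = np.random.randn(4, 4).astype(np.float32)
S, Z, qmin, qmax = quant_params(x.min(), x.max(), n_bits=8)
q = quantize(x, S, Z, qmin, qmax)
print(np.abs(x - dequantize(q, S, Z)).max())      # worst-case quantization error
```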
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers r = S(q − Z)
• Consider the following fully-connected layer:

  Y = WX + b

• With symmetric weight quantization ZW = 0 and bias quantization Zb = 0, Sb = SW·SX, the input zero point can be folded into the bias:

  qbias = qb − ZX·qW

  qY = (SW·SX / SY) · (qW·qX + qbias) + ZY

  where (SW·SX / SY) is a rescale to N-bit int, qW·qX is an N-bit int multiplication accumulated in 32 bits, adding qbias is a 32-bit int addition, and adding ZY is an N-bit int addition.
• Note: both qb and qbias are 32 bits.
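A minimal numpy sketch of this integer-arithmetic fully-connected layer follows; the function name and shapes are illustrative, and the float rescale (SW·SX/SY) is typically implemented on hardware as a fixed-point multiply plus shift rather than the float multiply used here.

```python
import numpy as np

def quantized_linear(q_W, S_W, q_X, S_X, Z_X, q_b, S_Y, Z_Y, n_bits=8):
    """q_W: (out, in) int8 weights, q_X: (in,) int8 input, q_b: (out,) int32 bias."""
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    # fold the input zero point into the 32-bit bias (can be precomputed offline)
    q_bias = q_b.astype(np.int32) - Z_X * q_W.sum(axis=1, dtype=np.int32)
    # N-bit integer multiply, accumulated in a 32-bit accumulator
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32) + q_bias
    # rescale the accumulator to the output scale and add the output zero point
    q_Y = np.round((S_W * S_X / S_Y) * acc) + Z_Y
    return np.clip(q_Y, qmin, qmax).astype(np.int8)
```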

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers r = S(q − Z)
• Consider the following convolution layer:

  Y = Conv(W, X) + b

• With ZW = 0 and Zb = 0, Sb = SW·SX:

  qbias = qb − Conv(qW, ZX)

  qY = (SW·SX / SY) · (Conv(qW, qX) + qbias) + ZY

  where (SW·SX / SY) is a rescale to N-bit int, Conv(qW, qX) is an N-bit int multiplication accumulated in 32 bits, adding qbias is a 32-bit int addition, and adding ZY is an N-bit int addition.
• Note: both qb and qbias are 32 bits.
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Post-Training Quantization
Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Symmetric Linear Quantization on Weights
• |r|max = |W|max: the floating-point range is [−|r|max, |r|max], so the zero point is Z = 0.
• Using a single scale S for the whole weight tensor (Per-Tensor Quantization)
  • works well for large models
  • accuracy drops for small models
• A common failure results from large differences (more than 100×) in the ranges of weights for different output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).
• Solution: Per-Channel Quantization

(Figure: a convolution layer with input X (ci × hi × wi), weight tensor W (co × ci × kh × kw), and output Y (co × ho × wo).)

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Per-Channel Weight Quantization
Example: 2-bit linear quantization of a 4×4 weight matrix W (rows = output channels oc, columns = input channels ic):

 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49

Per-Tensor Quantization (a single scale for the whole tensor):
|r|max = 2.12, S = |r|max / qmax = 2.12 / (2^(2-1) − 1) = 2.12

Quantized qW:         Reconstructed S·qW:
 1  0  1  0            2.12  0     2.12  0
 0  0 -1  1            0     0    -2.12  2.12
 0  1  0  0            0     2.12  0     0
 1  0  1  1            2.12  0     2.12  2.12

‖W − S·qW‖F = 2.28

Per-Channel Quantization (one scale per output channel):
Row 0: |r|max = 2.09 → S0 = 2.09
Row 1: |r|max = 2.12 → S1 = 2.12
Row 2: |r|max = 1.92 → S2 = 1.92
Row 3: |r|max = 1.87 → S3 = 1.87

Quantized qW:         Reconstructed S ⊙ qW:
 1  0  1  0            2.09  0     2.09  0
 0  0 -1  1            0     0    -2.12  2.12
 0  1  0 -1            0     1.92  0    -1.92
 1  0  1  1            1.87  0     1.87  1.87

‖W − S ⊙ qW‖F = 2.08 < ‖W − S·qW‖F = 2.28, so per-channel quantization reconstructs the weights more accurately than per-tensor quantization.
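A small numpy sketch that reproduces the numbers in this example (symmetric 2-bit quantization, per-tensor vs. per-channel scales):

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

n_bits = 2
qmax = 2 ** (n_bits - 1) - 1                      # = 1 for 2-bit signed integers

# Per-tensor: a single scale for the whole matrix
S_t = np.abs(W).max() / qmax
qW_t = np.clip(np.round(W / S_t), -qmax - 1, qmax)
err_t = np.linalg.norm(W - S_t * qW_t)            # ~2.28

# Per-channel: one scale per output channel (row)
S_c = np.abs(W).max(axis=1, keepdims=True) / qmax
qW_c = np.clip(np.round(W / S_c), -qmax - 1, qmax)
err_c = np.linalg.norm(W - S_c * qW_c)            # ~2.08

print(err_t, err_c)
```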
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Why do we need group quantization?

Group Quantization
Achieve a balance between quantization accuracy and hardware efficiency
• Blackwell GPUs support "micro-tensor scaling" to optimize accuracy for FP4 AI.
• The FP4 tensor core provides 2× higher theoretical throughput than the FP8/FP6/INT8 tensor cores.

(Image: Blackwell Architecture for Generative AI. Image Credit: NVIDIA)
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factor
• r = S(q − Z) → r = γ · Sq · (q − Z)
• γ is a floating-point coarse-grained scale factor (one per tensor)
• Sq is an integer per-vector scale factor
• achieves a balance between accuracy and hardware efficiency by using
  • less expensive integer scale factors at finer granularity
  • more expensive floating-point scale factors at coarser granularity
• Memory overhead of two-level scaling:
  • Given 4-bit quantization with a 4-bit per-vector scale for every 16 elements, the effective bit width is 4 + 4/16 = 4.25 bits.

(Figure: an M×K by K×N matrix multiplication with a per-vector scale factor Sq for each vector and a scale factor γ for each tensor.)

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
Group Quantization
Multi-level scaling scheme

r = (q − z) · s  →  r = (q − z) · sl0 · sl1 · ⋯

r: real number value
q: quantized value
z: zero point (z = 0 is symmetric quantization)
s: scale factors of different levels

(Figure: single-level (L0) scaling applies one FP16 scale sl0 per channel to INT4 values q; two-level scaling as in VSQ adds a UINT4 per-vector scale sl0 for every 16 INT4 values under a per-channel FP16 scale sl1; MX formats use an E1M0 scale sl0 shared by 2 elements and an E8M0 scale sl1 shared by 16 elements over S1Mx sign-magnitude data.)

Quantization        Data    L0 Group      L0 Scale     L1 Group      L1 Scale     Effective
Approach            Type    Size          Data Type    Size          Data Type    Bit Width
Per-Channel Quant   INT4    Per Channel   FP16         -             -            4
VSQ                 INT4    16            UINT4        Per Channel   FP16         4 + 4/16 = 4.25
MX4                 S1M2    2             E1M0         16            E8M0         3 + 1/2 + 8/16 = 4
MX6                 S1M4    2             E1M0         16            E8M0         5 + 1/2 + 8/16 = 6
MX9                 S1M7    2             E1M0         16            E8M0         8 + 1/2 + 8/16 = 9

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]
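The "Effective Bit Width" column is simply the per-element data bits plus each level's scale bits amortized over its group size. A tiny helper (illustrative, not taken from the cited papers) makes the arithmetic explicit:

```python
def effective_bit_width(data_bits, levels):
    """levels: list of (scale_bits, group_size) pairs, finest level (L0) first."""
    return data_bits + sum(bits / group for bits, group in levels)

print(effective_bit_width(4, [(4, 16)]))            # VSQ:  4 + 4/16       = 4.25
print(effective_bit_width(3, [(1, 2), (8, 16)]))    # MX4:  3 + 1/2 + 8/16 = 4.0
print(effective_bit_width(5, [(1, 2), (8, 16)]))    # MX6:  = 6.0
print(effective_bit_width(8, [(1, 2), (8, 16)]))    # MX9:  = 9.0
```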
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Linear Quantization on Activations
• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range [rmin, rmax], activation statistics are gathered before deploying the model.

(Figure: the same [rmin, rmax] → [qmin, qmax] mapping as before, now applied to activations.)
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 1: During training
  • Exponential moving averages (EMA):
    r̂(t)max,min = α · r(t)max,min + (1 − α) · r̂(t−1)max,min
  • observed ranges are smoothed across thousands of training steps

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
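A minimal sketch of an EMA range observer that follows the update rule above; the class name and the default α are assumptions made for illustration.

```python
class EMARangeObserver:
    """Tracks r_min / r_max with exponential moving averages during training."""

    def __init__(self, alpha=0.01):
        # alpha weights the current batch, as in the update rule above;
        # a small alpha smooths the range over many training steps.
        self.alpha = alpha
        self.r_min = None
        self.r_max = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.r_min is None:                      # first observation
            self.r_min, self.r_max = batch_min, batch_max
        else:
            a = self.alpha
            self.r_min = a * batch_min + (1 - a) * self.r_min
            self.r_max = a * batch_max + (1 - a) * self.r_max
        return self.r_min, self.r_max
```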
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • spending dynamic range on the outliers hurts the representation ability
  • use the mean of the min/max of each sample in the batches
  • analytical calculation (see next slide)

(Figure: an activation histogram over [rmin, rmax]; a long tail of outliers stretches the quantization range.)

Neural Network Distiller
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • minimize the mean-square-error between the inputs X and the (reconstructed) quantized inputs Q(X):

    min over |r|max of E[(X − Q(X))²]

  • assume the inputs follow a Gaussian or Laplace distribution. For a Laplace(0, b) distribution, the optimal clipping values can be solved numerically:
    |r|max = 2.83b, 3.89b, 5.03b for 2, 3, 4 bits.
  • the Laplace parameter b can be estimated from the calibration input distribution.

Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • minimize the loss of information, so that the integer model encodes (nearly) the same information as the original floating-point model
  • the loss of information is measured by the Kullback-Leibler divergence (relative entropy or information divergence): for two discrete probability distributions P, Q,

    DKL(P ∥ Q) = Σ_i^N P(x_i) · log(P(x_i) / Q(x_i))

  • intuition: KL divergence measures the amount of information lost when approximating a given encoding.

8-bit Inference with TensorRT [Szymon Migacz, 2017]
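A simplified sketch of the idea: sweep candidate clipping thresholds on calibration data and keep the one whose quantized histogram stays closest, in KL divergence, to the original. This is not the exact TensorRT histogram algorithm, and the bin/candidate counts are arbitrary choices.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def fake_quant(x, clip, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    s = clip / qmax
    return np.clip(np.round(x / s), -qmax - 1, qmax) * s

def kl_calibrate(x, n_bits=8, n_candidates=64, n_bins=2048):
    ref_hist, edges = np.histogram(x, bins=n_bins, range=(x.min(), x.max()))
    abs_max = np.abs(x).max()
    best_clip, best_kl = abs_max, float("inf")
    for clip in np.linspace(abs_max / n_candidates, abs_max, n_candidates):
        q_hist, _ = np.histogram(fake_quant(x, clip, n_bits), bins=edges)
        kl = kl_divergence(ref_hist, q_hist)
        if kl < best_kl:
            best_clip, best_kl = clip, kl
    return best_clip

x = np.random.laplace(scale=1.0, size=100_000).astype(np.float32)
print(kl_calibrate(x, n_bits=4))   # chosen clipping threshold
```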


Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

8-bit Inference with TensorRT [Szymon Migacz, 2017]


Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

(Figure: activation distributions and the chosen clipping thresholds for four layers: GoogleNet inception_5a/5x5, AlexNet Pool 2, ResNet-152 res4b8_branch2a, and GoogleNet inception_3a/pool.)

8-bit Inference with TensorRT [Szymon Migacz, 2017]
Dynamic Range for Quantization
Minimize mean-square-error (MSE) using the Newton-Raphson method

(Figure: max-scaled quantization clips nothing but suffers large quantization noise; clipped quantization trades clipping the low-density tail of the data against a finer quantization step.)

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Dynamic Range for Quantization
Minimize mean-square-error (MSE) using the Newton-Raphson method

Network         FP32 Accuracy    OCTAV int4
ResNet-50       76.07            75.84
MobileNet-V2    71.71            70.88
Bert-Large      91.00            87.09

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other. The best rounding for each weight individually (to the nearest value) is not necessarily the best rounding for the whole tensor.

    rounding-to-nearest:              0.3 0.5 0.7 0.2  →  0 1 1 0
    AdaRound (one potential result):  0.3 0.5 0.7 0.2  →  0 0 1 0

• What is optimal? The rounding that reconstructs the original activation the best, which may be very different.
• For weight quantization only.
• With short-term tuning, it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • We optimize the following objective (omitting the derivation):

    argmin_V ‖Wx − W̃x‖²_F + λ·f_reg(V)
    → argmin_V ‖Wx − ⌊⌊W⌋ + h(V)⌉x‖²_F + λ·f_reg(V)

  • x is the input to the layer, V is a random variable of the same shape as W
  • h() is a function that maps the range to (0, 1), such as the rectified sigmoid
  • f_reg(V) is a regularization that encourages h(V) to be binary

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
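A compact PyTorch sketch of this optimization; the constants ζ, γ, λ, β, the layer sizes, and the optimizer settings are illustrative assumptions rather than the paper's full recipe.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 32)                    # weights of one linear layer (illustrative sizes)
x = torch.randn(512, 32)                   # calibration inputs to that layer
s = W.abs().max() / 127                    # 8-bit symmetric per-tensor scale

W_floor = torch.floor(W / s)               # the "round down" grid point for every weight
V = torch.zeros_like(W, requires_grad=True)
opt = torch.optim.Adam([V], lr=1e-2)
zeta, gamma, lam, beta = 1.1, -0.1, 0.01, 2.0

def h(V):                                  # rectified sigmoid: maps V into [0, 1]
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0.0, 1.0)

for step in range(500):
    W_soft = s * torch.clamp(W_floor + h(V), -128, 127)       # soft-quantized weights
    recon = ((x @ W.t() - x @ W_soft.t()) ** 2).sum()          # ||Wx - W~x||_F^2
    f_reg = (1 - (2 * h(V) - 1).abs().pow(beta)).sum()         # pushes h(V) toward {0, 1}
    loss = recon + lam * f_reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# final hard rounding: each weight goes down or up according to the learned h(V)
W_q = s * torch.clamp(W_floor + (h(V) >= 0.5).float(), -128, 127)
```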
Neural Network Quantization

(Figure: the same 4×4 example quantized with K-Means-based quantization and with linear quantization, as shown earlier.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic

Design choices for linear quantization:
• Zero Point
  • Asymmetric
  • Symmetric
• Scaling Granularity
  • Per-Tensor
  • Per-Channel
  • Group Quantization
• Range Clipping
  • Exponential Moving Average
  • Minimizing KL Divergence
  • Minimizing Mean-Square-Error
• Rounding
  • Round-to-Nearest
  • AdaRound
Post-Training INT8 Linear Quantization
Accuracy change relative to the floating-point model for two PTQ recipes:

                     Activation: Symmetric, Per-Tensor,     Activation: Asymmetric, Per-Tensor,
                     Minimize KL-Divergence                 Exponential Moving Average (EMA)
Neural Network       Weight: Symmetric, Per-Tensor          Weight: Symmetric, Per-Channel
GoogleNet            -0.45%                                 0%
ResNet-50            -0.13%                                 -0.6%
ResNet-152           -0.08%                                 -1.8%
MobileNetV1          -                                      -11.8%
MobileNetV2          -                                      -2.1%

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Post-Training INT8 Linear Quantization
(Same table as above.)

"Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity."

How should we improve the performance of quantized models?
Quantization-Aware Training
How should we improve performance of quantized models?

Quantization-Aware Training
• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths, the neural network is trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.

(Figure: the K-Means example from Deep Compression: weights (32-bit float) are clustered into 2-bit indices and centroids; during fine-tuning, the gradients are grouped by cluster index, reduced, scaled by the learning rate, and used to update the centroids, giving the fine-tuned centroids.)

Deep Compression [Han et al., ICLR 2016]
Quantization-Aware Training
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

(Figure: forward and backward passes through Layers N-1, N, N+1, with a weight quantization node inserted on the weights of Layer N; an example layer consists of Conv, Batch Norm, and ReLU operations.)
Quantization-Aware Training
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W)

(Figure: the weight quantization node produces Q(W) for the forward pass of Layer N, and an activation quantization node follows the layer; these nodes ensure discrete-valued weights and activations at the layer boundaries, but the layer operations (Conv, Batch Norm, ReLU) still run in full precision.)
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

W (32-bit float)              qW (2-bit signed int)        Q(W) = S·(qW − Z), with Z = -1, S = 1.07
 2.09 -0.98  1.48  0.09        1 -2  0 -1                   2.14 -1.07  1.07  0
 0.05 -0.14 -1.08  2.12       -1 -1 -2  1                   0     0    -1.07  2.14
-0.91  1.92  0    -1.03       -2  1 -1 -2                  -1.07  2.14  0    -1.07
 1.87  0     1.53  1.49        1 -1  0  0                   2.14  0     1.07  1.07
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W) for weights, Y → SY·(qY − ZY) = Q(Y) for activations.

(Figure: Layer N takes quantized inputs Q(X) and quantized weights Q(W), produces Y, which is quantized to Q(Y) before the next layer; the layer operations still run in full precision.)

How should gradients back-propagate through the (simulated) quantization?
Straight-Through Estimator (STE)
• Quantization is discrete-valued (e.g., Q(w) = round(w) is a staircase function), and thus the derivative is 0 almost everywhere:

  ∂Q(W)/∂W = 0

• The neural network would learn nothing, since the gradients become 0 and the weights won't get updated:

  gW = ∂L/∂W = ∂L/∂Q(W) · ∂Q(W)/∂W = 0

• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function:

  gW = ∂L/∂W = ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
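A minimal PyTorch sketch of the STE applied to symmetric "fake" weight quantization (a sketch, not any library's FakeQuantize module):

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)            # forward: real rounding (derivative 0 almost everywhere)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output               # backward: pass the gradient straight through

def fake_quantize(w, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    s = (w.detach().abs().max() / qmax).clamp(min=1e-8)   # per-tensor scale, treated as a constant here
    q = torch.clamp(RoundSTE.apply(w / s), -qmax - 1, qmax)
    return s * q                                          # "simulated/fake" quantized weights

w = torch.randn(4, 4, requires_grad=True)                 # full-precision copy of the weights
loss = fake_quantize(w).sum()
loss.backward()
print(w.grad)   # ≈ all ones: gradients reach the full-precision weights despite the rounding
```

An equivalent trick often seen in practice is `w + (w_hard - w).detach()`, where `w_hard` is the hard-quantized value computed without any custom backward.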
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W), Y → SY·(qY − ZY) = Q(Y)
With the STE, the gradients are taken with respect to the quantized values: gW ← ∂L/∂Q(W), gY ← ∂L/∂Q(Y).

(Figure: the same forward/backward diagram as before, now with the gradient paths flowing straight through the weight and activation quantization nodes.)
INT8 Linear Quantization-Aware Training
Top-1 accuracy:

                                 Post-Training Quantization            Quantization-Aware Training
Neural Network   Floating-Point  Asymmetric     Symmetric              Asymmetric     Symmetric
                                 Per-Tensor     Per-Channel            Per-Tensor     Per-Channel
MobileNetV1      70.9%           0.1%           59.1%                  70.0%          70.7%
MobileNetV2      71.9%           0.1%           69.8%                  70.9%          71.1%
NASNet-Mobile    74.9%           72.2%          72.1%                  73.0%          73.0%

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
Neural Network Quantization
Can the storage and computation go even lower than integer arithmetic?

(Figure: the same 4×4 example quantized with K-Means-based, linear, and binary/ternary quantization.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization    Binary/Ternary Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights        Binary/Ternary Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic     Bit Operations
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?

Can quantization bit width go even lower?

yi = Σj Wij · xj

Example: yi = 8×5 + (-3)×2 + 5×0 + (-1)×1, with weights Wi = [8, -3, 5, -1] and inputs x = [5, 2, 0, 1].

input   weight   operations   memory   computation
R       R        +, ×         1×       1×
If weights are quantized to +1 and -1

yi = Σj Wij · xj = 5 - 2 + 0 - 1, with binary weights Wi = [1, -1, 1, -1] and inputs x = [5, 2, 0, 1].

input   weight   operations   memory       computation
R       R        +, ×         1×           1×
R       B        +, -         ~32× less    ~2× less

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Binarization
• Deterministic Binarization
  • directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

    q = sign(r) = +1 if r ≥ 0, −1 if r < 0

• Stochastic Binarization
  • uses global statistics or the value of the input data to determine the probability of being −1 or +1
  • e.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid σ(r):

    q = +1 with probability p = σ(r), −1 with probability 1 − p, where σ(r) = min(max((r + 1)/2, 0), 1)

  • harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
Minimizing Quantization Error in Binarization

weights W (32-bit float)      binary weights W𝔹 = sign(W) (1-bit)
 2.09  -0.98   1.48   0.09     1 -1  1  1
 0.05  -0.14  -1.08   2.12     1 -1 -1  1
-0.91   1.92   0     -1.03    -1  1  1 -1
 1.87   0      1.53   1.49     1  1  1  1

Without a scale: ‖W − W𝔹‖²F = 9.28
With a per-tensor scale (32-bit float) α = ‖W‖1 / n = (1/16)·‖W‖1 = 1.05: ‖W − αW𝔹‖²F = 9.24

AlexNet-based ImageNet Top-1 accuracy delta:
Network                        Accuracy Delta
BinaryConnect                  -21.2%
Binary Weight Network (BWN)    0.2%

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
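A minimal numpy sketch of BWN-style binarization with the L1 scale, reproducing the numbers above:

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

W_b = np.where(W >= 0, 1.0, -1.0)        # binarized weights, sign(W) with sign(0) = +1
alpha = np.abs(W).sum() / W.size         # scale alpha = ||W||_1 / n  (= 1.05 here)

print(np.sum((W - W_b) ** 2))            # ||W - W_B||_F^2          ~ 9.28
print(np.sum((W - alpha * W_b) ** 2))    # ||W - alpha * W_B||_F^2  ~ 9.24 (smaller error with the scale)
```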
If both activations and weights are binarized

yi = Σj Wij · xj = 1×1 + (-1)×1 + 1×(-1) + (-1)×1 = 1 + (-1) + (-1) + (-1) = -2,
with binary weights Wi = [1, -1, 1, -1] and binary inputs x = [1, 1, -1, 1].

Encode +1 as bit 1 and -1 as bit 0. Then the product of two binary values is the XNOR of their bits:

W     X     Y = WX     bW    bX    XNOR(bW, bX)
1     1     1          1     1     1
1    -1    -1          1     0     0
-1   -1     1          0     0     1
-1    1    -1          0     1     0

Counting bits instead of summing ±1 values gives:

1 xnor 1 + 0 xnor 1 + 1 xnor 0 + 0 xnor 1 = 1 + 0 + 0 + 0 = 1

Each xnor result of 1 contributes +1 and each 0 contributes -1 to the original sum (if all products were -1, i.e., all xnor bits were 0, the sum would be -n = -4), so:

yi = -n + 2 · Σj (Wij xnor xj) = -4 + 2×1 = -2

With popcount (which returns the number of 1 bits) and a left shift by 1 replacing the multiply-by-2:

yi = -n + popcount(Wi xnor x) ≪ 1
   = -4 + popcount(1010 xnor 1101) ≪ 1
   = -4 + popcount(1000) ≪ 1 = -4 + 2 = -2

input   weight   operations        memory       computation
R       R        +, ×              1×           1×
R       B        +, -              ~32× less    ~2× less
B       B        xnor, popcount    ~32× less    ~58× less

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
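A tiny sketch of the same arithmetic using bit operations on Python integers (the bit packing, function name, and n-bit mask are illustrative):

```python
def binary_dot(w_bits, x_bits, n):
    """Binary dot product: +1 encoded as bit 1, -1 as bit 0, packed into n-bit ints."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ x_bits) & mask           # XNOR, limited to n bits
    return -n + (bin(xnor).count("1") << 1)    # -n + 2 * popcount

# The example above: W = [+1, -1, +1, -1] -> 0b1010, X = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1010, 0b1101, 4))           # -4 + 2 * popcount(0b1000) = -2
```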
Accuracy Degradation of Binarization

                                  Bit-Width       ImageNet Top-1
Neural Network   Quantization     W       A       Accuracy Delta
AlexNet          BWN              1       32      0.2%
AlexNet          BNN              1       1       -28.7%
AlexNet          XNOR-Net         1       1       -12.4%

* BWN: Binary Weight Network, with a scale for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Ternary Weight Networks (TWN)
Weights are quantized to +1, -1, and 0:

  q =  rt   if r > Δ
       0    if |r| ≤ Δ
      -rt   if r < -Δ
  where Δ = 0.7 × E(|r|) and rt = E_{|r|>Δ}(|r|)

Example:
weights W (32-bit float)      ternary weights W𝕋 (2-bit)
 2.09  -0.98   1.48   0.09     1 -1  1  0        Δ = 0.7 × (1/16)·‖W‖1 = 0.73
 0.05  -0.14  -1.08   2.12     0  0 -1  1
-0.91   1.92   0     -1.03    -1  1  0 -1        rt = (1/11)·‖W_{W≠0}‖1 = 1.5
 1.87   0      1.53   1.49     1  0  1  1

ImageNet Top-1 Accuracy    Full Precision    1 bit (BWN)    2 bit (TWN)
ResNet-18                  69.6              60.8           65.3

Ternary Weight Networks [Li et al., Arxiv 2016]
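A minimal numpy sketch of TWN quantization that reproduces Δ ≈ 0.73 and rt = 1.5 for the example above:

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

delta = 0.7 * np.abs(W).mean()                   # Delta = 0.7 * ||W||_1 / 16 ~ 0.73
W_t = np.where(W > delta, 1.0, np.where(W < -delta, -1.0, 0.0))   # ternary {-1, 0, +1}
r_t = np.abs(W[np.abs(W) > delta]).mean()        # scale over the non-zero positions ~ 1.5

print(delta, r_t)
print(r_t * W_t)                                 # reconstructed ternary weights
```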
Trained Ternary Quantization (TTQ)
• Instead of using a fixed scale rt, TTQ introduces two trainable parameters wp and wn to represent the positive and negative scales in the quantization:

  q =  wp   if r > Δ
       0    if |r| ≤ Δ
      -wn   if r < -Δ

(Figure: the full-precision weights are normalized to [-1, 1], quantized with threshold ±t to an intermediate ternary weight, and then scaled by the trained parameters Wp and Wn to give the final ternary weight with values {-Wn, 0, Wp}.)

ImageNet Top-1 Accuracy    Full Precision    1 bit (BWN)    2 bit (TWN)    TTQ
ResNet-18                  69.6              60.8           65.3           66.6

Trained Ternary Quantization [Zhu et al., ICLR 2017]
Mixed-Precision Quantization

Uniform Quantization
Every layer uses the same weight/activation bit widths:

Layer 1: weights 8 bits / activations 8 bits
Layer 2: weights 8 bits / activations 8 bits
Layer 3: weights 8 bits / activations 8 bits
…

(Bit Widths → Quantized Model)
Mixed-Precision Quantization
Each layer gets its own weight/activation bit widths:

Layer 1: weights 4 bits / activations 5 bits
Layer 2: weights 6 bits / activations 7 bits
Layer 3: weights 5 bits / activations 4 bits
…

(Bit Widths → Quantized Model)
Challenge: Huge Design Space

Layer 1: 4 bits / 5 bits    Choices: 8 × 8 = 64
Layer 2: 6 bits / 7 bits    Choices: 8 × 8 = 64
Layer 3: 5 bits / 4 bits    Choices: 8 × 8 = 64
…

With 8 choices for the weight bit width and 8 for the activation bit width per layer, the design space for an n-layer network is 64^n.
Solution: Design Automation

(Figure: a reinforcement learning loop. An actor-critic agent observes the State of each layer, takes an Action that sets that layer's weight/activation bit widths, and receives a Reward for the resulting quantized model.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Solution: Design Automation

(Figure: the agent's per-layer bit-width actions are mapped onto hardware accelerators: BitFusion (Edge) and BISMO (Edge/Cloud) bit-serial processing-element arrays that combine the weight bits wn…w0 and activation bits an…a0 over cycles with shift-and-add. The hardware accelerator provides direct feedback that is used as the agent's reward.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Outperforms Uniform Quantization

(Figure: mixed-precision quantized MobileNetV1, comparing HAQ (Ours) against the PACT uniform-quantization baseline.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Supports Multiple Objectives

(Figure: mixed-precision quantized MobileNetV1 under model-size-constrained, latency-constrained, and energy-constrained settings, comparing HAQ against uniform quantization in each case.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Quantization Policy for Edge and Cloud

(Figure: the per-layer bit widths that HAQ chooses for mixed-precision quantized MobileNet-V2 on edge and cloud hardware, plotted against layer index together with each layer's OPs per Byte: #weight bits and #activation bits for depthwise and pointwise layers. Depthwise layers get fewer bits; pointwise layers get more bits.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Summary of Today's Lecture
In this lecture, we
1. Reviewed Linear Quantization: r = S(q − Z), mapping the floating-point range [rmin, rmax] onto the integer range [qmin, qmax] with scale S and zero point Z.
2. Introduced Post-Training Quantization (PTQ), which quantizes an already-trained floating-point neural network model.
   • Per-tensor vs. per-channel vs. group quantization
   • How to determine the dynamic range for quantization
3. Introduced Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning.
   • Straight-Through Estimator (STE)
4. Introduced binary and ternary quantization.
5. Introduced automatic mixed-precision quantization.
References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]
12. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13. WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14. Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]