
EfficientML.ai Lecture 06
Quantization Part II

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Lecture Plan
Today we will:

1. Review Linear Quantization.

2. Introduce Post-Training Quantization (PTQ), which quantizes a floating-point neural network model, including: per-channel quantization, group quantization, and range clipping.

3. Introduce Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning and recovers the accuracy.

4. Introduce binary and ternary quantization.

5. Introduce automatic mixed-precision quantization.
Neural Network Quantization

(Figure: the same 4×4 floating-point weight matrix quantized two ways. K-Means-based quantization stores 2-bit cluster indices plus a floating-point codebook {3: 2.00, 2: 1.50, 1: 0.00, 0: -1.00}; linear quantization stores 2-bit integer weights with zero point Z = -1 and scale S = 1.07.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic
K-Means-based Weight Quantization

weights (32-bit float):          cluster index (2-bit int):
 2.09  -0.98   1.48   0.09        3  0  2  1
 0.05  -0.14  -1.08   2.12        1  1  0  3
-0.91   1.92   0     -1.03        0  3  1  0
 1.87   0      1.53   1.49        3  1  2  2

centroids:        fine-tuned centroids:
3:  2.00           1.96
2:  1.50           1.48
1:  0.00          -0.04
0: -1.00          -0.97
Deep Compression [Han et al., ICLR 2016]
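To make the codebook idea concrete, here is a minimal sketch that clusters the example weight matrix into a 2-bit codebook. It uses numpy and scikit-learn's KMeans purely for illustration (it is not the original Deep Compression code), and the centroid fine-tuning step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

n_bits = 2
kmeans = KMeans(n_clusters=2 ** n_bits, n_init=10, random_state=0)
indices = kmeans.fit_predict(W.reshape(-1, 1))     # 2-bit cluster index per weight
codebook = kmeans.cluster_centers_.flatten()       # floating-point centroids (the codebook)

W_dequant = codebook[indices].reshape(W.shape)     # weights reconstructed from the codebook
print(np.round(codebook, 2))                       # roughly {-1.0, 0.0, 1.5, 2.0}, in some order
print(np.abs(W - W_dequant).max())                 # largest per-weight quantization error
```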


K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

(Figure: Accuracy Loss (about +0.5% to -4.5%) vs. Model Size Ratio after Compression (2% to 20%), comparing Pruning + Quantization, Pruning Only, and Quantization Only.)

Deep Compression [Han et al., ICLR 2016]
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

weights (32-bit float)        quantized weights (2-bit signed int)     zero point Z = -1 (2-bit signed int), scale S = 1.07 (32-bit float)
 2.09 -0.98  1.48  0.09        1 -2  0 -1
 0.05 -0.14 -1.08  2.12       -1 -1 -2  1
-0.91  1.92  0    -1.03       -2  1 -1 -2
 1.87  0     1.53  1.49        1 -1  0  0

reconstructed S·(q − Z):
 2.14 -1.07  1.07  0
 0     0    -1.07  2.14
-1.07  2.14  0    -1.07
 2.14  0     1.07  1.07

2-bit signed integer encoding:
Binary   Decimal
01        1
00        0
11       -1
10       -2
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

(Figure: the floating-point range [rmin, rmax] (containing 0) is mapped by the scale S onto the integer range [qmin, qmax]; the zero point Z is the integer that the real value 0 maps to.)

Bit Width N    qmin          qmax
2              -2            1
3              -4            3
4              -8            7
N              -2^(N-1)      2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
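As a concrete sketch of how S and Z can be derived from these ranges, here is a minimal numpy illustration of r = S(q − Z) for an N-bit signed integer (a sketch, not any particular library's implementation):

```python
import numpy as np

def quant_params(rmin, rmax, n_bits=8):
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # make sure real 0 is representable
    S = (rmax - rmin) / (qmax - qmin)             # scale (floating point)
    Z = int(round(qmin - rmin / S))               # zero point (integer)
    return S, int(np.clip(Z, qmin, qmax)), qmin, qmax

def quantize(r, S, Z, qmin, qmax):
    return np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32)

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

x = np.random.randn(4, 4).astype(np.float32)
S, Z, qmin, qmax = quant_params(x.min(), x.max(), n_bits=8)
q = quantize(x, S, Z, qmin, qmax)
print(np.abs(x - dequantize(q, S, Z)).max())      # worst-case quantization error
```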
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers r = S(q − Z)
• Consider the following fully-connected layer:

  Y = WX + b

• With symmetric weight quantization ZW = 0 and bias quantization Zb = 0, Sb = SW·SX, the input zero point can be folded into the bias:

  qbias = qb − ZX·qW

  qY = (SW·SX / SY) · (qW·qX + qbias) + ZY

  where (SW·SX / SY) is a rescale to N-bit int, qW·qX is an N-bit int multiplication accumulated in 32 bits, adding qbias is a 32-bit int addition, and adding ZY is an N-bit int addition.
• Note: both qb and qbias are 32 bits.
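A minimal numpy sketch of this integer-arithmetic fully-connected layer follows; the function name and shapes are illustrative, and the float rescale (SW·SX/SY) is typically implemented on hardware as a fixed-point multiply plus shift rather than the float multiply used here.

```python
import numpy as np

def quantized_linear(q_W, S_W, q_X, S_X, Z_X, q_b, S_Y, Z_Y, n_bits=8):
    """q_W: (out, in) int8 weights, q_X: (in,) int8 input, q_b: (out,) int32 bias."""
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    # fold the input zero point into the 32-bit bias (can be precomputed offline)
    q_bias = q_b.astype(np.int32) - Z_X * q_W.sum(axis=1, dtype=np.int32)
    # N-bit integer multiply, accumulated in a 32-bit accumulator
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32) + q_bias
    # rescale the accumulator to the output scale and add the output zero point
    q_Y = np.round((S_W * S_X / S_Y) * acc) + Z_Y
    return np.clip(q_Y, qmin, qmax).astype(np.int8)
```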

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers r = S(q − Z)
• Consider the following convolution layer:

  Y = Conv(W, X) + b

• With ZW = 0 and Zb = 0, Sb = SW·SX:

  qbias = qb − Conv(qW, ZX)

  qY = (SW·SX / SY) · (Conv(qW, qX) + qbias) + ZY

  where (SW·SX / SY) is a rescale to N-bit int, Conv(qW, qX) is an N-bit int multiplication accumulated in 32 bits, adding qbias is a 32-bit int addition, and adding ZY is an N-bit int addition.
• Note: both qb and qbias are 32 bits.
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Post-Training Quantization
Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Symmetric Linear Quantization on Weights
• |r|max = |W|max: the floating-point range is [−|r|max, |r|max], so the zero point is Z = 0.
• Using a single scale S for the whole weight tensor (Per-Tensor Quantization)
  • works well for large models
  • accuracy drops for small models
• A common failure results from large differences (more than 100×) in the ranges of weights for different output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).
• Solution: Per-Channel Quantization

(Figure: a convolution layer with input X (ci × hi × wi), weight tensor W (co × ci × kh × kw), and output Y (co × ho × wo).)

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Per-Channel Weight Quantization
Example: 2-bit linear quantization of a 4×4 weight matrix W (rows = output channels oc, columns = input channels ic):

 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49

Per-Tensor Quantization (a single scale for the whole tensor):
|r|max = 2.12, S = |r|max / qmax = 2.12 / (2^(2-1) − 1) = 2.12

Quantized qW:         Reconstructed S·qW:
 1  0  1  0            2.12  0     2.12  0
 0  0 -1  1            0     0    -2.12  2.12
 0  1  0  0            0     2.12  0     0
 1  0  1  1            2.12  0     2.12  2.12

‖W − S·qW‖F = 2.28

Per-Channel Quantization (one scale per output channel):
Row 0: |r|max = 2.09 → S0 = 2.09
Row 1: |r|max = 2.12 → S1 = 2.12
Row 2: |r|max = 1.92 → S2 = 1.92
Row 3: |r|max = 1.87 → S3 = 1.87

Quantized qW:         Reconstructed S ⊙ qW:
 1  0  1  0            2.09  0     2.09  0
 0  0 -1  1            0     0    -2.12  2.12
 0  1  0 -1            0     1.92  0    -1.92
 1  0  1  1            1.87  0     1.87  1.87

‖W − S ⊙ qW‖F = 2.08 < ‖W − S·qW‖F = 2.28, so per-channel quantization reconstructs the weights more accurately than per-tensor quantization.
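A small numpy sketch that reproduces the numbers in this example (symmetric 2-bit quantization, per-tensor vs. per-channel scales):

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

n_bits = 2
qmax = 2 ** (n_bits - 1) - 1                      # = 1 for 2-bit signed integers

# Per-tensor: a single scale for the whole matrix
S_t = np.abs(W).max() / qmax
qW_t = np.clip(np.round(W / S_t), -qmax - 1, qmax)
err_t = np.linalg.norm(W - S_t * qW_t)            # ~2.28

# Per-channel: one scale per output channel (row)
S_c = np.abs(W).max(axis=1, keepdims=True) / qmax
qW_c = np.clip(np.round(W / S_c), -qmax - 1, qmax)
err_c = np.linalg.norm(W - S_c * qW_c)            # ~2.08

print(err_t, err_c)
```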
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type

Why do we need group quantization?

Group Quantization
Achieve a balance between quantization accuracy and hardware efficiency
• Blackwell GPUs support "micro-tensor scaling" to optimize accuracy for FP4 AI.
• The FP4 tensor core provides 2× higher theoretical throughput than the FP8/FP6/INT8 tensor cores.

(Image: Blackwell Architecture for Generative AI. Image Credit: NVIDIA)
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factor
• r = S(q − Z) → r = γ · Sq · (q − Z)
• γ is a floating-point coarse-grained scale factor (one per tensor)
• Sq is an integer per-vector scale factor
• achieves a balance between accuracy and hardware efficiency by using
  • less expensive integer scale factors at finer granularity
  • more expensive floating-point scale factors at coarser granularity
• Memory overhead of two-level scaling:
  • Given 4-bit quantization with a 4-bit per-vector scale for every 16 elements, the effective bit width is 4 + 4/16 = 4.25 bits.

(Figure: an M×K by K×N matrix multiplication with a per-vector scale factor Sq for each vector and a scale factor γ for each tensor.)

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
Group Quantization
Multi-level scaling scheme

r = (q − z) · s  →  r = (q − z) · sl0 · sl1 · ⋯

r: real number value
q: quantized value
z: zero point (z = 0 is symmetric quantization)
s: scale factors of different levels

(Figure: single-level (L0) scaling applies one FP16 scale sl0 per channel to INT4 values q; two-level scaling as in VSQ adds a UINT4 per-vector scale sl0 for every 16 INT4 values under a per-channel FP16 scale sl1; MX formats use an E1M0 scale sl0 shared by 2 elements and an E8M0 scale sl1 shared by 16 elements over S1Mx sign-magnitude data.)

Quantization        Data    L0 Group      L0 Scale     L1 Group      L1 Scale     Effective
Approach            Type    Size          Data Type    Size          Data Type    Bit Width
Per-Channel Quant   INT4    Per Channel   FP16         -             -            4
VSQ                 INT4    16            UINT4        Per Channel   FP16         4 + 4/16 = 4.25
MX4                 S1M2    2             E1M0         16            E8M0         3 + 1/2 + 8/16 = 4
MX6                 S1M4    2             E1M0         16            E8M0         5 + 1/2 + 8/16 = 6
MX9                 S1M7    2             E1M0         16            E8M0         8 + 1/2 + 8/16 = 9

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]
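The "Effective Bit Width" column is simply the per-element data bits plus each level's scale bits amortized over its group size. A tiny helper (illustrative, not taken from the cited papers) makes the arithmetic explicit:

```python
def effective_bit_width(data_bits, levels):
    """levels: list of (scale_bits, group_size) pairs, finest level (L0) first."""
    return data_bits + sum(bits / group for bits, group in levels)

print(effective_bit_width(4, [(4, 16)]))            # VSQ:  4 + 4/16       = 4.25
print(effective_bit_width(3, [(1, 2), (8, 16)]))    # MX4:  3 + 1/2 + 8/16 = 4.0
print(effective_bit_width(5, [(1, 2), (8, 16)]))    # MX6:  = 6.0
print(effective_bit_width(8, [(1, 2), (8, 16)]))    # MX9:  = 9.0
```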
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Linear Quantization on Activations
• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range [rmin, rmax], activation statistics are gathered before deploying the model.

(Figure: the same [rmin, rmax] → [qmin, qmax] mapping as before, now applied to activations.)
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 1: During training
  • Exponential moving averages (EMA):
    r̂(t)max,min = α · r(t)max,min + (1 − α) · r̂(t−1)max,min
  • observed ranges are smoothed across thousands of training steps

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
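A minimal sketch of an EMA range observer that follows the update rule above; the class name and the default α are assumptions made for illustration.

```python
class EMARangeObserver:
    """Tracks r_min / r_max with exponential moving averages during training."""

    def __init__(self, alpha=0.01):
        # alpha weights the current batch, as in the update rule above;
        # a small alpha smooths the range over many training steps.
        self.alpha = alpha
        self.r_min = None
        self.r_max = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.r_min is None:                      # first observation
            self.r_min, self.r_max = batch_min, batch_max
        else:
            a = self.alpha
            self.r_min = a * batch_min + (1 - a) * self.r_min
            self.r_max = a * batch_max + (1 - a) * self.r_max
        return self.r_min, self.r_max
```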
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • spending dynamic range on the outliers hurts the representation ability
  • use the mean of the min/max of each sample in the batches
  • analytical calculation (see next slide)

(Figure: an activation histogram over [rmin, rmax]; a long tail of outliers stretches the quantization range.)

Neural Network Distiller
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • minimize the mean-square-error between the inputs X and the (reconstructed) quantized inputs Q(X):

    min over |r|max of E[(X − Q(X))²]

  • assume the inputs follow a Gaussian or Laplace distribution. For a Laplace(0, b) distribution, the optimal clipping values can be solved numerically:
    |r|max = 2.83b, 3.89b, 5.03b for 2, 3, 4 bits.
  • the Laplace parameter b can be estimated from the calibration input distribution.

Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • minimize the loss of information, so that the integer model encodes (nearly) the same information as the original floating-point model
  • the loss of information is measured by the Kullback-Leibler divergence (relative entropy or information divergence): for two discrete probability distributions P, Q,

    DKL(P ∥ Q) = Σ_i^N P(x_i) · log(P(x_i) / Q(x_i))

  • intuition: KL divergence measures the amount of information lost when approximating a given encoding.

8-bit Inference with TensorRT [Szymon Migacz, 2017]
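A simplified sketch of the idea: sweep candidate clipping thresholds on calibration data and keep the one whose quantized histogram stays closest, in KL divergence, to the original. This is not the exact TensorRT histogram algorithm, and the bin/candidate counts are arbitrary choices.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def fake_quant(x, clip, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    s = clip / qmax
    return np.clip(np.round(x / s), -qmax - 1, qmax) * s

def kl_calibrate(x, n_bits=8, n_candidates=64, n_bins=2048):
    ref_hist, edges = np.histogram(x, bins=n_bins, range=(x.min(), x.max()))
    abs_max = np.abs(x).max()
    best_clip, best_kl = abs_max, float("inf")
    for clip in np.linspace(abs_max / n_candidates, abs_max, n_candidates):
        q_hist, _ = np.histogram(fake_quant(x, clip, n_bits), bins=edges)
        kl = kl_divergence(ref_hist, q_hist)
        if kl < best_kl:
            best_clip, best_kl = clip, kl
    return best_clip

x = np.random.laplace(scale=1.0, size=100_000).astype(np.float32)
print(kl_calibrate(x, n_bits=4))   # chosen clipping threshold
```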


Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

8-bit Inference with TensorRT [Szymon Migacz, 2017]


Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

(Figure: activation distributions and the chosen clipping thresholds for four layers: GoogleNet inception_5a/5x5, AlexNet Pool 2, ResNet-152 res4b8_branch2a, and GoogleNet inception_3a/pool.)

8-bit Inference with TensorRT [Szymon Migacz, 2017]
Dynamic Range for Quantization
Minimize mean-square-error (MSE) using the Newton-Raphson method

(Figure: max-scaled quantization clips nothing but suffers large quantization noise; clipped quantization trades clipping the low-density tail of the data against a finer quantization step.)

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Dynamic Range for Quantization
Minimize mean-square-error (MSE) using the Newton-Raphson method

Network         FP32 Accuracy    OCTAV int4
ResNet-50       76.07            75.84
MobileNet-V2    71.71            70.88
Bert-Large      91.00            87.09

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding

Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other. The best rounding for each weight individually (to the nearest value) is not necessarily the best rounding for the whole tensor.

    rounding-to-nearest:              0.3 0.5 0.7 0.2  →  0 1 1 0
    AdaRound (one potential result):  0.3 0.5 0.7 0.2  →  0 0 1 0

• What is optimal? The rounding that reconstructs the original activation the best, which may be very different.
• For weight quantization only.
• With short-term tuning, it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • We optimize the following objective (omitting the derivation):

    argmin_V ‖Wx − W̃x‖²_F + λ·f_reg(V)
    → argmin_V ‖Wx − ⌊⌊W⌋ + h(V)⌉x‖²_F + λ·f_reg(V)

  • x is the input to the layer, V is a random variable of the same shape as W
  • h() is a function that maps the range to (0, 1), such as the rectified sigmoid
  • f_reg(V) is a regularization that encourages h(V) to be binary

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
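A compact PyTorch sketch of this optimization; the constants ζ, γ, λ, β, the layer sizes, and the optimizer settings are illustrative assumptions rather than the paper's full recipe.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 32)                    # weights of one linear layer (illustrative sizes)
x = torch.randn(512, 32)                   # calibration inputs to that layer
s = W.abs().max() / 127                    # 8-bit symmetric per-tensor scale

W_floor = torch.floor(W / s)               # the "round down" grid point for every weight
V = torch.zeros_like(W, requires_grad=True)
opt = torch.optim.Adam([V], lr=1e-2)
zeta, gamma, lam, beta = 1.1, -0.1, 0.01, 2.0

def h(V):                                  # rectified sigmoid: maps V into [0, 1]
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0.0, 1.0)

for step in range(500):
    W_soft = s * torch.clamp(W_floor + h(V), -128, 127)       # soft-quantized weights
    recon = ((x @ W.t() - x @ W_soft.t()) ** 2).sum()          # ||Wx - W~x||_F^2
    f_reg = (1 - (2 * h(V) - 1).abs().pow(beta)).sum()         # pushes h(V) toward {0, 1}
    loss = recon + lam * f_reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# final hard rounding: each weight goes down or up according to the learned h(V)
W_q = s * torch.clamp(W_floor + (h(V) >= 0.5).float(), -128, 127)
```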
Neural Network Quantization

(Figure: the same 4×4 example quantized with K-Means-based quantization and with linear quantization, as shown earlier.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic

Design choices for linear quantization:
• Zero Point
  • Asymmetric
  • Symmetric
• Scaling Granularity
  • Per-Tensor
  • Per-Channel
  • Group Quantization
• Range Clipping
  • Exponential Moving Average
  • Minimizing KL Divergence
  • Minimizing Mean-Square-Error
• Rounding
  • Round-to-Nearest
  • AdaRound
Post-Training INT8 Linear Quantization
Accuracy change relative to the floating-point model for two PTQ recipes:

                     Activation: Symmetric, Per-Tensor,     Activation: Asymmetric, Per-Tensor,
                     Minimize KL-Divergence                 Exponential Moving Average (EMA)
Neural Network       Weight: Symmetric, Per-Tensor          Weight: Symmetric, Per-Channel
GoogleNet            -0.45%                                 0%
ResNet-50            -0.13%                                 -0.6%
ResNet-152           -0.08%                                 -1.8%
MobileNetV1          -                                      -11.8%
MobileNetV2          -                                      -2.1%

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Post-Training INT8 Linear Quantization
(Same table as above.)

"Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity."

How should we improve the performance of quantized models?
Quantization-Aware Training
How should we improve performance of quantized models?

Quantization-Aware Training
• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths, the neural network is trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.

(Figure: the K-Means example from Deep Compression: weights (32-bit float) are clustered into 2-bit indices and centroids; during fine-tuning, the gradients are grouped by cluster index, reduced, scaled by the learning rate, and used to update the centroids, giving the fine-tuned centroids.)

Deep Compression [Han et al., ICLR 2016]
Quantization-Aware Training
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

(Figure: forward and backward passes through Layers N-1, N, N+1, with a weight quantization node inserted on the weights of Layer N; an example layer consists of Conv, Batch Norm, and ReLU operations.)
Quantization-Aware Training
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W)

(Figure: the weight quantization node produces Q(W) for the forward pass of Layer N, and an activation quantization node follows the layer; these nodes ensure discrete-valued weights and activations at the layer boundaries, but the layer operations (Conv, Batch Norm, ReLU) still run in full precision.)
Linear Quantization
An affine mapping of integers to real numbers r = S(q − Z)

W (32-bit float)              qW (2-bit signed int)        Q(W) = S·(qW − Z), with Z = -1, S = 1.07
 2.09 -0.98  1.48  0.09        1 -2  0 -1                   2.14 -1.07  1.07  0
 0.05 -0.14 -1.08  2.12       -1 -1 -2  1                   0     0    -1.07  2.14
-0.91  1.92  0    -1.03       -2  1 -1 -2                  -1.07  2.14  0    -1.07
 1.87  0     1.53  1.49        1 -1  0  0                   2.14  0     1.07  1.07
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W) for weights, Y → SY·(qY − ZY) = Q(Y) for activations.

(Figure: Layer N takes quantized inputs Q(X) and quantized weights Q(W), produces Y, which is quantized to Q(Y) before the next layer; the layer operations still run in full precision.)

How should gradients back-propagate through the (simulated) quantization?
Straight-Through Estimator (STE)
• Quantization is discrete-valued (e.g., Q(w) = round(w) is a staircase function), and thus the derivative is 0 almost everywhere:

  ∂Q(W)/∂W = 0

• The neural network would learn nothing, since the gradients become 0 and the weights won't get updated:

  gW = ∂L/∂W = ∂L/∂Q(W) · ∂Q(W)/∂W = 0

• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function:

  gW = ∂L/∂W = ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
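A minimal PyTorch sketch of the STE applied to symmetric "fake" weight quantization (a sketch, not any library's FakeQuantize module):

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)            # forward: real rounding (derivative 0 almost everywhere)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output               # backward: pass the gradient straight through

def fake_quantize(w, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    s = (w.detach().abs().max() / qmax).clamp(min=1e-8)   # per-tensor scale, treated as a constant here
    q = torch.clamp(RoundSTE.apply(w / s), -qmax - 1, qmax)
    return s * q                                          # "simulated/fake" quantized weights

w = torch.randn(4, 4, requires_grad=True)                 # full-precision copy of the weights
loss = fake_quantize(w).sum()
loss.backward()
print(w.grad)   # ≈ all ones: gradients reach the full-precision weights despite the rounding
```

An equivalent trick often seen in practice is `w + (w_hard - w).detach()`, where `w_hard` is the hard-quantized value computed without any custom backward.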
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization": W → SW·qW = Q(W), Y → SY·(qY − ZY) = Q(Y)
With the STE, the gradients are taken with respect to the quantized values: gW ← ∂L/∂Q(W), gY ← ∂L/∂Q(Y).

(Figure: the same forward/backward diagram as before, now with the gradient paths flowing straight through the weight and activation quantization nodes.)
INT8 Linear Quantization-Aware Training
Top-1 accuracy:

                                 Post-Training Quantization            Quantization-Aware Training
Neural Network   Floating-Point  Asymmetric     Symmetric              Asymmetric     Symmetric
                                 Per-Tensor     Per-Channel            Per-Tensor     Per-Channel
MobileNetV1      70.9%           0.1%           59.1%                  70.0%          70.7%
MobileNetV2      71.9%           0.1%           69.8%                  70.9%          71.1%
NASNet-Mobile    74.9%           72.2%          72.1%                  73.0%          73.0%

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
Neural Network Quantization
Can the storage and computation go even lower than integer arithmetic?

(Figure: the same 4×4 example quantized with K-Means-based, linear, and binary/ternary quantization.)

                Floating-Point               K-Means-based Quantization                  Linear Quantization    Binary/Ternary Quantization
Storage         Floating-Point Weights       Integer Weights; Floating-Point Codebook    Integer Weights        Binary/Ternary Weights
Computation     Floating-Point Arithmetic    Floating-Point Arithmetic                   Integer Arithmetic     Bit Operations
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?

Can quantization bit width go even lower?

yi = Σj Wij · xj

Example: yi = 8×5 + (-3)×2 + 5×0 + (-1)×1, with weights Wi = [8, -3, 5, -1] and inputs x = [5, 2, 0, 1].

input   weight   operations   memory   computation
R       R        +, ×         1×       1×
If weights are quantized to +1 and -1

yi = Σj Wij · xj = 5 - 2 + 0 - 1, with binary weights Wi = [1, -1, 1, -1] and inputs x = [5, 2, 0, 1].

input   weight   operations   memory       computation
R       R        +, ×         1×           1×
R       B        +, -         ~32× less    ~2× less

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Binarization
• Deterministic Binarization
  • directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

    q = sign(r) = +1 if r ≥ 0, −1 if r < 0

• Stochastic Binarization
  • uses global statistics or the value of the input data to determine the probability of being −1 or +1
  • e.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid σ(r):

    q = +1 with probability p = σ(r), −1 with probability 1 − p, where σ(r) = min(max((r + 1)/2, 0), 1)

  • harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
Minimizing Quantization Error in Binarization

weights W (32-bit float)      binary weights W𝔹 = sign(W) (1-bit)
 2.09  -0.98   1.48   0.09     1 -1  1  1
 0.05  -0.14  -1.08   2.12     1 -1 -1  1
-0.91   1.92   0     -1.03    -1  1  1 -1
 1.87   0      1.53   1.49     1  1  1  1

Without a scale: ‖W − W𝔹‖²F = 9.28
With a per-tensor scale (32-bit float) α = ‖W‖1 / n = (1/16)·‖W‖1 = 1.05: ‖W − αW𝔹‖²F = 9.24

AlexNet-based ImageNet Top-1 accuracy delta:
Network                        Accuracy Delta
BinaryConnect                  -21.2%
Binary Weight Network (BWN)    0.2%

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
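A minimal numpy sketch of BWN-style binarization with the L1 scale, reproducing the numbers above:

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

W_b = np.where(W >= 0, 1.0, -1.0)        # binarized weights, sign(W) with sign(0) = +1
alpha = np.abs(W).sum() / W.size         # scale alpha = ||W||_1 / n  (= 1.05 here)

print(np.sum((W - W_b) ** 2))            # ||W - W_B||_F^2          ~ 9.28
print(np.sum((W - alpha * W_b) ** 2))    # ||W - alpha * W_B||_F^2  ~ 9.24 (smaller error with the scale)
```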
If both activations and weights are binarized

yi = Σj Wij · xj = 1×1 + (-1)×1 + 1×(-1) + (-1)×1 = 1 + (-1) + (-1) + (-1) = -2,
with binary weights Wi = [1, -1, 1, -1] and binary inputs x = [1, 1, -1, 1].

Encode +1 as bit 1 and -1 as bit 0. Then the product of two binary values is the XNOR of their bits:

W     X     Y = WX     bW    bX    XNOR(bW, bX)
1     1     1          1     1     1
1    -1    -1          1     0     0
-1   -1     1          0     0     1
-1    1    -1          0     1     0

Counting bits instead of summing ±1 values gives:

1 xnor 1 + 0 xnor 1 + 1 xnor 0 + 0 xnor 1 = 1 + 0 + 0 + 0 = 1

Each xnor result of 1 contributes +1 and each 0 contributes -1 to the original sum (if all products were -1, i.e., all xnor bits were 0, the sum would be -n = -4), so:

yi = -n + 2 · Σj (Wij xnor xj) = -4 + 2×1 = -2

With popcount (which returns the number of 1 bits) and a left shift by 1 replacing the multiply-by-2:

yi = -n + popcount(Wi xnor x) ≪ 1
   = -4 + popcount(1010 xnor 1101) ≪ 1
   = -4 + popcount(1000) ≪ 1 = -4 + 2 = -2

input   weight   operations        memory       computation
R       R        +, ×              1×           1×
R       B        +, -              ~32× less    ~2× less
B       B        xnor, popcount    ~32× less    ~58× less

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
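A tiny sketch of the same arithmetic using bit operations on Python integers (the bit packing, function name, and n-bit mask are illustrative):

```python
def binary_dot(w_bits, x_bits, n):
    """Binary dot product: +1 encoded as bit 1, -1 as bit 0, packed into n-bit ints."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ x_bits) & mask           # XNOR, limited to n bits
    return -n + (bin(xnor).count("1") << 1)    # -n + 2 * popcount

# The example above: W = [+1, -1, +1, -1] -> 0b1010, X = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1010, 0b1101, 4))           # -4 + 2 * popcount(0b1000) = -2
```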
Accuracy Degradation of Binarization

                                  Bit-Width       ImageNet Top-1
Neural Network   Quantization     W       A       Accuracy Delta
AlexNet          BWN              1       32      0.2%
AlexNet          BNN              1       1       -28.7%
AlexNet          XNOR-Net         1       1       -12.4%

* BWN: Binary Weight Network, with a scale for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Ternary Weight Networks (TWN)
Weights are quantized to +1, -1, and 0:

  q =  rt   if r > Δ
       0    if |r| ≤ Δ
      -rt   if r < -Δ
  where Δ = 0.7 × E(|r|) and rt = E_{|r|>Δ}(|r|)

Example:
weights W (32-bit float)      ternary weights W𝕋 (2-bit)
 2.09  -0.98   1.48   0.09     1 -1  1  0        Δ = 0.7 × (1/16)·‖W‖1 = 0.73
 0.05  -0.14  -1.08   2.12     0  0 -1  1
-0.91   1.92   0     -1.03    -1  1  0 -1        rt = (1/11)·‖W_{W≠0}‖1 = 1.5
 1.87   0      1.53   1.49     1  0  1  1

ImageNet Top-1 Accuracy    Full Precision    1 bit (BWN)    2 bit (TWN)
ResNet-18                  69.6              60.8           65.3

Ternary Weight Networks [Li et al., Arxiv 2016]
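A minimal numpy sketch of TWN quantization that reproduces Δ ≈ 0.73 and rt = 1.5 for the example above:

```python
import numpy as np

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

delta = 0.7 * np.abs(W).mean()                   # Delta = 0.7 * ||W||_1 / 16 ~ 0.73
W_t = np.where(W > delta, 1.0, np.where(W < -delta, -1.0, 0.0))   # ternary {-1, 0, +1}
r_t = np.abs(W[np.abs(W) > delta]).mean()        # scale over the non-zero positions ~ 1.5

print(delta, r_t)
print(r_t * W_t)                                 # reconstructed ternary weights
```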
Trained Ternary Quantization (TTQ)
• Instead of using a fixed scale rt, TTQ introduces two trainable parameters wp and wn to represent the positive and negative scales in the quantization:

  q =  wp   if r > Δ
       0    if |r| ≤ Δ
      -wn   if r < -Δ

(Figure: the full-precision weights are normalized to [-1, 1], quantized with threshold ±t to an intermediate ternary weight, and then scaled by the trained parameters Wp and Wn to give the final ternary weight with values {-Wn, 0, Wp}.)

ImageNet Top-1 Accuracy    Full Precision    1 bit (BWN)    2 bit (TWN)    TTQ
ResNet-18                  69.6              60.8           65.3           66.6

Trained Ternary Quantization [Zhu et al., ICLR 2017]
Mixed-Precision Quantization

Uniform Quantization
Every layer uses the same weight/activation bit widths:

Layer 1: weights 8 bits / activations 8 bits
Layer 2: weights 8 bits / activations 8 bits
Layer 3: weights 8 bits / activations 8 bits
…

(Bit Widths → Quantized Model)
Mixed-Precision Quantization
Each layer gets its own weight/activation bit widths:

Layer 1: weights 4 bits / activations 5 bits
Layer 2: weights 6 bits / activations 7 bits
Layer 3: weights 5 bits / activations 4 bits
…

(Bit Widths → Quantized Model)
Challenge: Huge Design Space

Layer 1: 4 bits / 5 bits    Choices: 8 × 8 = 64
Layer 2: 6 bits / 7 bits    Choices: 8 × 8 = 64
Layer 3: 5 bits / 4 bits    Choices: 8 × 8 = 64
…

With 8 choices for the weight bit width and 8 for the activation bit width per layer, the design space for an n-layer network is 64^n.
Solution: Design Automation

(Figure: a reinforcement learning loop. An actor-critic agent observes the State of each layer, takes an Action that sets that layer's weight/activation bit widths, and receives a Reward for the resulting quantized model.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Solution: Design Automation

(Figure: the agent's per-layer bit-width actions are mapped onto hardware accelerators: BitFusion (Edge) and BISMO (Edge/Cloud) bit-serial processing-element arrays that combine the weight bits wn…w0 and activation bits an…a0 over cycles with shift-and-add. The hardware accelerator provides direct feedback that is used as the agent's reward.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Outperforms Uniform Quantization

(Figure: mixed-precision quantized MobileNetV1, comparing HAQ (Ours) against the PACT uniform-quantization baseline.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Supports Multiple Objectives

(Figure: mixed-precision quantized MobileNetV1 under model-size-constrained, latency-constrained, and energy-constrained settings, comparing HAQ against uniform quantization in each case.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Quantization Policy for Edge and Cloud

(Figure: the per-layer bit widths that HAQ chooses for mixed-precision quantized MobileNet-V2 on edge and cloud hardware, plotted against layer index together with each layer's OPs per Byte: #weight bits and #activation bits for depthwise and pointwise layers. Depthwise layers get fewer bits; pointwise layers get more bits.)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Summary of Today's Lecture
In this lecture, we
1. Reviewed Linear Quantization: r = S(q − Z), mapping the floating-point range [rmin, rmax] onto the integer range [qmin, qmax] with scale S and zero point Z.
2. Introduced Post-Training Quantization (PTQ), which quantizes an already-trained floating-point neural network model.
   • Per-tensor vs. per-channel vs. group quantization
   • How to determine the dynamic range for quantization
3. Introduced Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning.
   • Straight-Through Estimator (STE)
4. Introduced binary and ternary quantization.
5. Introduced automatic mixed-precision quantization.
References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]
12. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13. WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14. Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]