ML System Optimization, Lecture 11: Quantization
Low Bit Operations are Cheaper
Less Bit-Width → Less Energy
Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example (8-bit): 0011 0001 = 2^5 + 2^4 + 2^0 = 49
• Signed Integer, Sign-Magnitude Representation
  • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
  • Both 000…00 and 100…00 represent 0
  • Example (8-bit): 1011 0001 = −(2^5 + 2^4 + 2^0) = −49
• Signed Integer, Two’s Complement Representation
  • n-bit range: [−2^(n−1), 2^(n−1) − 1]
  • 000…00 represents 0; 100…00 represents −2^(n−1)
  • Example (8-bit): 1100 1111 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49
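To make the two’s complement example concrete, here is a minimal sketch (mine, not from the slides) that evaluates an n-bit pattern by giving the sign bit a negative weight:

```python
# A minimal sketch: interpret an n-bit two's complement pattern.
# The MSB carries weight -2^(n-1); the remaining bits carry positive powers of two.
def twos_complement_value(bits: str) -> int:
    n = len(bits)
    value = -int(bits[0]) * 2 ** (n - 1)                       # sign bit, negative weight
    value += sum(int(b) * 2 ** (n - 2 - i) for i, b in enumerate(bits[1:]))
    return value

assert twos_complement_value("00110001") == 49
assert twos_complement_value("11001111") == -49                # -2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0
```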
Fixed-Point Number
A fixed-point number splits its bits into an integer part and a fraction part around an implied "decimal" point.
• Example (8-bit word, 4 integer bits and 4 fraction bits, two’s complement): 0011.0001
  • Reading the bits in place: 2^1 + 2^0 + 2^-4 = 3.0625
  • Equivalently, treat the whole word as a two’s complement integer and scale it:
    (2^5 + 2^4 + 2^0) × 2^-4 = 49 × 0.0625 = 3.0625
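The second reading above (integer times a power-of-two scale) is exactly how fixed-point hardware treats the word. A small sketch of that interpretation (mine, assuming an 8-bit word with 4 fraction bits):

```python
# A small sketch: a fixed-point value is a two's complement integer scaled by 2^-frac_bits.
def fixed_point_value(bits: str, frac_bits: int = 4) -> float:
    n = len(bits)
    as_int = int(bits, 2)
    if bits[0] == "1":                      # negative pattern in two's complement
        as_int -= 1 << n
    return as_int * 2.0 ** (-frac_bits)

assert fixed_point_value("00110001") == 3.0625      # 49 * 0.0625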
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
• Layout: 1 sign bit, 8 exponent bits, 23 fraction bits.
• Normal numbers (exponent field not all 0s and not all 1s): value = (−1)^sign × (1 + fraction) × 2^(exponent − 127)
  • 0 01111101 00010000000000000000000 = (1 + 0.0625) × 2^(125 − 127) = 1.0625 × 2^−2 = 0.265625
• Zero: 0 00000000 00000000000000000000000 = 0
• Subnormal numbers (exponent field = 0): value = (−1)^sign × fraction × 2^−126
  • Smallest normal: 0 00000001 00000000000000000000000 = (1 + 0) × 2^(1 − 127) = 2^−126
  • Smallest subnormal: 0 00000000 00000000000000000000001 = 2^−23 × 2^−126 = 2^−149
  • Largest subnormal: 0 00000000 11111111111111111111111 = (1 − 2^−23) × 2^−126
• Exponent field all 1s (11111111): fraction = 0 encodes ±∞ (positive/negative infinity); fraction ≠ 0 encodes NaN (Not a Number).
  • This is much waste; we will revisit it with FP8.
• Number line: subnormal values cover ±[2^−149, (1 − 2^−23) × 2^−126]; normal values cover ±[2^−126, (1 + 1 − 2^−23) × 2^127].
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

Format                                              Exponent (bits)   Fraction (bits)   Total (bits)
IEEE 754 Single Precision 32-bit Float (IEEE FP32)  8                 23                32
IEEE 754 Half Precision 16-bit Float (IEEE FP16)    5                 10                16
Brain Float (BF16)                                  8                 7                 16
Numeric Data Types
• Question: What is the following IEEE half precision (IEEE FP16) number in decimal?
  1 10001 1100000000   (Sign: 1 bit, Exponent: 5 bits, Fraction: 10 bits; Exponent Bias = 15)
  • Sign: − (sign bit is 1)
  • Exponent: 10001 = 17, giving 2^(17 − 15) = 2^2
  • Fraction: 1100000000 gives 1 + 0.5 + 0.25 = 1.75
  • Value: −1.75 × 2^2 = −7
Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?
  (Sign: 1 bit, Exponent: 8 bits, Fraction: 7 bits; Exponent Bias = 127)
  • Sign: + (sign bit is 0)
  • 2.5 = 1.25 × 2^1, so the exponent field is 1 + 127 = 128 = 10000000 and the fraction field is 0.25 → 0100000
  • Answer: 0 10000000 0100000 = 0100000000100000
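Both answers can be checked programmatically. A small sketch (mine, not from the slides), assuming NumPy is available; it reinterprets the FP16 bit pattern directly and derives the BF16 pattern from the FP32 encoding, since BF16 keeps the top 16 bits of FP32:

```python
import struct
import numpy as np

# FP16: reinterpret the 16-bit pattern 1100011100000000 as an IEEE half float.
bits16 = 0b1100011100000000
fp16 = np.frombuffer(np.uint16(bits16).tobytes(), dtype=np.float16)[0]
print(fp16)                      # -7.0

# BF16: a bfloat16 encoding is the upper 16 bits of the IEEE FP32 encoding,
# so encoding 2.5 as FP32 and keeping the top half reproduces 0100000000100000.
fp32_bits = struct.unpack(">I", struct.pack(">f", 2.5))[0]
bf16_bits = fp32_bits >> 16
print(f"{bf16_bits:016b}")       # 0100000000100000
```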
INT4 and FP4
Exponent Width → Range; Fraction Width → Precision

INT4: S plus 3 value bits (two’s complement)
• Representable values: −8, −7, …, −1, 0, 1, …, 7
• Examples: 0001 = 1;  0111 = 7

FP4 (E1M2): S E M M, exponent bias 0 (values are 0.5 × the integer grid 0, 1, …, 7)
• Representable values: ±{0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5}
• Examples: 0001 = 0.25 × 2^(1−0) = 0.5;  0111 = (1 + 0.75) × 2^(1−0) = 3.5

FP4 (E2M1): S E E M, exponent bias 1 (values are 0.5 × {0, 1, 2, 3, 4, 6, 8, 12})
• Representable values: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}
• Examples: 0001 = 0.5 × 2^(1−1) = 0.5;  0111 = (1 + 0.5) × 2^(3−1) = 6
• No inf, no NaN

FP4 (E3M0): S E E E, exponent bias 3 (values are 0.25 × {0, 1, 2, 4, 8, 16, 32, 64})
• Representable values: ±{0, 0.25, 0.5, 1, 2, 4, 8, 16}
• Examples: 0001 = (1 + 0) × 2^(1−3) = 0.25;  0111 = (1 + 0) × 2^(7−3) = 16
• No inf, no NaN
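As a concrete check of the E2M1 variant, here is a short enumeration sketch (my own helper, following the convention above: 1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, and subnormals when the exponent field is 0):

```python
# Enumerate the positive FP4 (E2M1) codes 0000..0111.
def fp4_e2m1(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                                        # subnormal: 0.m * 2^(1 - bias)
        return sign * (man / 2) * 2.0 ** (1 - 1)
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)      # normal: 1.m * 2^(e - bias)

print([fp4_e2m1(c) for c in range(8)])   # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```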
What is Quantization?
Quantization is the process of constraining values from a continuous or otherwise large set to a smaller, discrete set.
(Figure: a continuous signal over time and its quantized version [Wikipedia])
Neural Network Quantization: Agenda
We will cover three families of neural network quantization, illustrated on the same 4×4 weight matrix: K-Means-based Quantization, Linear Quantization, and Binary/Ternary Quantization.
• Starting point: the weights are stored as floating-point values, and all computation uses floating-point arithmetic.
• With K-Means-based Quantization: storage becomes Integer Weights plus a Floating-Point Codebook, while computation remains Floating-Point Arithmetic.
Neural Network Quantization
Weight Quantization
weights (32-bit float):
   2.09  -0.98   1.48   0.09
   0.05  -0.14  -1.08   2.12
  -0.91   1.92   0     -1.03
   1.87   0      1.53   1.49
Notice that several weights are close in value (e.g., 2.09, 2.12, 1.92, 1.87); such values can share a single representative value.
K-Means-based Weight Quantization
Cluster the 32-bit floating-point weights above into a small number of shared values.
(Figure from Deep Compression: accuracy loss vs. model size ratio (2%–20%) for quantization only, plus a histogram of the weight value distribution, count vs. weight value.)
Deep Compression [Han et al., ICLR 2016]
After Quantization: Discrete Weight
(Figure: histogram of the weight values after quantization, count vs. weight value.)
Deep Compression [Han et al., ICLR 2016]

After Quantization: Discrete Weight after Retraining
(Figure: histogram of the weight values after retraining the quantized model, count vs. weight value.)
Deep Compression [Han et al., ICLR 2016]
How Many Bits do We Need?
Deep Compression on SqueezeNet

Network      Approach           Size     Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      Deep Compression   6.9MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47MB   510x    57.5%            80.3%

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
K-Means-based Weight Quantization
weights (32-bit float)          cluster index (2-bit int)      centroids (float)
   2.09  -0.98   1.48   0.09       3  0  2  1                   3 :  2.00
   0.05  -0.14  -1.08   2.12       1  1  0  3                   2 :  1.50
  -0.91   1.92   0     -1.03       0  3  1  0                   1 :  0.00
   1.87   0      1.53   1.49       3  1  2  2                   0 : -1.00

In storage: the quantized weights (uint indices) plus the codebook (float). During computation: the indices are decoded back into floating-point weights, which feed the usual floating-point layer (float inputs, weights, and bias into Conv, then ReLU, producing float outputs).
• The weights are decompressed using a lookup table (i.e., the codebook) during runtime inference.
• K-Means-based weight quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point.
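A minimal sketch of this idea (mine; Deep Compression’s actual implementation differs and also fine-tunes the centroids), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the weights into 2^2 = 4 centroids, store 2-bit indices plus a tiny float codebook,
# and decode through the codebook (lookup table) at inference time.
weights = np.array([[ 2.09, -0.98,  1.48,  0.09],
                    [ 0.05, -0.14, -1.08,  2.12],
                    [-0.91,  1.92,  0.00, -1.03],
                    [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
codebook = kmeans.cluster_centers_.ravel()           # 4 float32 centroids
indices = kmeans.labels_.reshape(weights.shape)      # 2-bit indices (stored here as small ints)

decoded = codebook[indices]                          # codebook lookup at runtime
print(np.abs(weights - decoded).max())               # small reconstruction error
```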
Neural Network Quantization
(Figure: the 4×4 weight matrix under K-Means-based, Linear, and Binary/Ternary Quantization.)

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Linear Quantization
What is Linear Quantization?
weights (32-bit float):
   2.09  -0.98   1.48   0.09
   0.05  -0.14  -1.08   2.12
  -0.91   1.92   0     -1.03
   1.87   0      1.53   1.49
What is Linear Quantization?
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float)          quantized weights (2-bit signed int)   reconstructed weights (32-bit float)
   2.09  -0.98   1.48   0.09       1  -2   0  -1                          2.14  -1.07   1.07   0
   0.05  -0.14  -1.08   2.12      -1  -1  -2   1                          0      0     -1.07   2.14
  -0.91   1.92   0     -1.03      -2   1  -1  -2                         -1.07   2.14   0     -1.07
   1.87   0      1.53   1.49       1  -1   0   0                          2.14   0      1.07   1.07

reconstructed = S × (quantized − Z), with zero point Z = −1 (a 2-bit signed int) and scale S = 1.07 (a 32-bit float).
S maps the floating-point range of the weights onto the integer range; Z aligns the floating-point zero with an integer value.
Scale of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• The extreme values must map onto each other: rmax = S(qmax − Z) and rmin = S(qmin − Z).
• Subtracting gives the scale: S = (rmax − rmin) / (qmax − qmin).
• Example with 2-bit signed integers (binary 01 = 1, 00 = 0, 11 = −1, 10 = −2, so qmin = −2 and qmax = 1):
  for the weight matrix above, rmin = −1.08 and rmax = 2.12, so S = (2.12 − (−1.08)) / (1 − (−2)) = 3.2 / 3 ≈ 1.07.
Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• From rmin = S(qmin − Z), the zero point is Z = qmin − rmin / S, rounded to an integer: Z = round(qmin − rmin / S).
• Example (2-bit signed, qmin = −2): Z = round(−2 − (−1.08) / 1.07) = round(−0.99) = −1.
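Putting the scale and zero-point formulas together, a small sketch (mine, not the lecture’s code), assuming NumPy:

```python
import numpy as np

# Asymmetric linear quantization: S = (rmax - rmin) / (qmax - qmin), Z = round(qmin - rmin / S).
def linear_quantize(r, n_bits=2):
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(round(qmin - r.min() / S))
    q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
q, S, Z = linear_quantize(W)
print(S, Z)            # roughly 1.07 and -1
print(S * (q - Z))     # reconstructed weights, r = S (q - Z)
```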
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the matrix multiplication Y = WX. Substituting r = S(q − Z) for Y, W, and X:
  SY(qY − ZY) = SW(qW − ZW) · SX(qX − ZX)
  qY = (SW SX / SY)(qW − ZW)(qX − ZX) + ZY
     = (SW SX / SY)(qW qX − ZW qX − ZX qW + ZW ZX) + ZY
• Precompute: the terms ZX qW and ZW ZX depend only on the weights and the zero points, so they can be computed ahead of time; only the terms involving qX must be computed at runtime.
• The rescale factor SW SX / SY is empirically always in (0, 1), so it can be applied with a fixed-point multiplication and a bit shift rather than floating-point arithmetic.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Symmetric Linear Quantization
Zero point Z = 0 and a symmetric floating-point range
• If the floating-point range is chosen symmetrically, rmax = −rmin = |r|max, then the zero point is Z = 0 and the scale becomes S = |r|max / qmax.
• With a symmetric weight quantizer (ZW = 0), the cross terms involving ZW vanish, leaving less to precompute:
  qY = (SW SX / SY)(qW qX − ZX qW) + ZY
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far we ignored the bias. Now consider the fully-connected layer Y = WX + b.
• Use a symmetric weight quantizer (ZW = 0) and quantize the bias with Zb = 0 and Sb = SW SX. Then:
  qY = (SW SX / SY)(qW qX + qbias) + ZY,  where qbias = qb − ZX qW is precomputed.
• Dataflow: N-bit Int Mult. (qW qX) → 32-bit Int Add. (+ qbias) → Rescale to N-bit Int (× SW SX / SY) → N-bit Int Add (+ ZY).
• Note: both qb and qbias are 32 bits.
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b. With ZW = 0, Zb = 0, and Sb = SW SX:
  qY = (SW SX / SY)(Conv(qW, qX) + qbias) + ZY
• Dataflow: quantized int inputs and quantized int weights → Conv (N-bit Int Mult.) → + quantized bias (32-bit Int Add.) → × scale factor (rescale to N-bit Int) → + output zero point (N-bit Int Add) → quantized int outputs.
• Note: both qb and qbias are 32 bits.
INT8 Linear Quantization
An affine mapping of integers to real numbers
• For two example networks, floating-point accuracy is 76.4% and 78.4%, while the 8-bit integer-quantized accuracy is 74.9% and 75.4%, respectively.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Neural Network Quantization
(Figure: the 4×4 weight matrix under K-Means-based and Linear Quantization.)

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic

Can we go further? (The Binary/Ternary Quantization column is still marked with a "?" here.)
Summary of Today’s Lecture
Today, we reviewed and learned
• the numeric data types used in modern computing systems, including integers and floating-point numbers
  (e.g., two’s complement: 1100 1111 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49).
Lecture Plan
Today we will:
1. Review Linear Quantization.

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
K-Means-based Weight Quantization
weights (32-bit float)          cluster index (2-bit int)      centroids      fine-tuned centroids
   2.09  -0.98   1.48   0.09       3  0  2  1                   3 :  2.00       1.96
   …                               …                            …               …
During fine-tuning, the weight gradients (e.g., -0.01, 0.01, -0.02, 0.12, …) are grouped by cluster index and reduced into one gradient per centroid, which updates the shared centroids (e.g., 2.00 → 1.96).
(Figure: accuracy loss vs. model size ratio, 2%–20%.)
Binary Decimal
01 1
00 0
11 -1
10 -2
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)
(Figure: the floating-point range [rmin, rmax], containing r = 0, maps onto the integer range [qmin, qmax] through the floating-point scale S and the integer zero point Z.)

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the fully-connected layer Y = WX + b, with ZW = 0 and Zb = 0, Sb = SW SX:
  qY = (SW SX / SY)(qW qX + qbias) + ZY,  where qbias = qb − ZX qW
• Dataflow: N-bit Int Mult. → 32-bit Int Add. → Rescale to N-bit Int → N-bit Int Add.
• Note: both qb and qbias are 32 bits.
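A small sketch of this integer-only dataflow (mine, with hypothetical helper names; real kernels fold the rescale into a fixed-point multiply and shift), assuming NumPy:

```python
import numpy as np

# Symmetric weights (ZW = 0), bias quantized with Sb = SW * SX.
def quantized_linear(qX, ZX, SX, qW, SW, qb, SY, ZY):
    # qbias = qb - ZX * sum_j qW[i, j]  (can be precomputed offline)
    qbias = qb - qW.astype(np.int32) @ np.full((qW.shape[1],), ZX, dtype=np.int32)
    acc = qW.astype(np.int32) @ qX.astype(np.int32) + qbias      # 32-bit integer accumulate
    qY = np.round((SW * SX / SY) * acc) + ZY                     # rescale, add output zero point
    return np.clip(qY, -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
qW = rng.integers(-8, 8, size=(3, 4)).astype(np.int8)
qX = rng.integers(-8, 8, size=4).astype(np.int8)
qb = rng.integers(-100, 100, size=3).astype(np.int32)
print(quantized_linear(qX, ZX=2, SX=0.1, qW=qW, SW=0.05, qb=qb, SY=0.2, ZY=0))
```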
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b, with ZW = 0 and Zb = 0, Sb = SW SX:
  qY = (SW SX / SY)(Conv(qW, qX) + qbias) + ZY
Quantization Granularity
• Per-Tensor Quantization
• Per-Channel Quantization
• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
Symmetric Linear Quantization on Weights
• Per-Tensor Quantization uses a single scale for the whole weight tensor, with |r|max = |W|max.
• It works well for large models, but accuracy drops for small models.
• A key reason is that the weight ranges of different output channels can differ dramatically (e.g., in the first depthwise-separable layer in MobileNetV2).
Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Per-Channel Weight Quantization
Example: 2-bit linear quantization, symmetric, S = |r|max / (2^(2−1) − 1) = |r|max

Per-Tensor Quantization: one scale for the whole tensor, |r|max = 2.12, so S = 2.12.
  Quantized qW            Reconstructed S qW
   1   0   1   0           2.12   0      2.12   0
   0   0  -1   1           0      0     -2.12   2.12
   0   1   0   0           0      2.12   0      0
   1   0   1   1           2.12   0      2.12   2.12
  ∥W − S qW∥F = 2.28

Per-Channel Quantization: one scale per output channel (row): |r|max = 2.09, 2.12, 1.92, 1.87, so S0 = 2.09, S1 = 2.12, S2 = 1.92, S3 = 1.87.
  Quantized qW            Reconstructed S ⊙ qW
   1   0   1   0           2.09   0      2.09   0
   0   0  -1   1           0      0     -2.12   2.12
   0   1   0  -1           0      1.92   0     -1.92
   1   0   1   1           1.87   0      1.87   1.87
  ∥W − S ⊙ qW∥F ≈ 2.08 < 2.28

Per-channel scales track the per-row ranges more tightly, so the reconstruction error is smaller than with a single per-tensor scale.
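A short sketch reproducing this comparison (mine, assuming NumPy; rounding conventions may shift individual entries slightly):

```python
import numpy as np

# Symmetric 2-bit quantization: one scale per tensor vs. one scale per output channel (row).
W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
qmax = 2 ** (2 - 1) - 1                                  # 2-bit signed: qmax = 1

S_tensor = np.abs(W).max() / qmax                        # single scale: 2.12
q_tensor = np.clip(np.round(W / S_tensor), -2, qmax)
S_channel = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row
q_channel = np.clip(np.round(W / S_channel), -2, qmax)

print(np.linalg.norm(W - S_tensor * q_tensor))           # ~2.28
print(np.linalg.norm(W - S_channel * q_channel))         # ~2.08, smaller error
```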
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factors
• r = S(q − Z)  →  r = γ · Sq (q − Z)
• γ is a floating-point, coarse-grained scale factor (shared widely).
• Sq is an integer per-vector scale factor (one for each small vector of elements).
• This achieves a balance between accuracy and hardware efficiency: the less expensive integer scale factors are applied at a finer granularity, while the expensive floating-point scale factor is amortized over many elements.
VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
Group Quantization
Multi-level scaling scheme: r = (q − z) · s  →  r = (q − z) · s_l0 · s_l1 · ⋯
(Figure: INT4 weights W11, W21, W31, … share a fine-grained L0 scale s_l0 (e.g., UINT4 per group) and a coarse-grained L1 scale s_l1 (e.g., FP16 per channel).)

Quantization Approach   Data Type   L0 Group Size   L0 Scale Data Type   L1 Group Size   L1 Scale Data Type   Effective Bit Width
Per-Channel Quant       INT4        Per Channel     FP16                 -               -                    4
VSQ                     INT4        16              UINT4                Per Channel     FP16                 4 + 4/16 = 4.25
MX4                     S1M2        2               E1M0                 16              E8M0                 3 + 1/2 + 8/16 = 4
Dynamic Range for Activation Quantization
Unlike weights, activations depend on the input, so collect activation statistics before deploying the model.
• Type 1: during training, track the activation range with an exponential moving average (EMA).
• Type 2: run a few "calibration" batches of samples through the trained FP32 model.
(Figure: activation distributions and clipping points, e.g., ResNet-152 res4b8_branch2a and GoogleNet inception_3a/pool.)
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
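A sketch of the EMA-style range tracking (my own class and decay value, not from the lecture), assuming NumPy:

```python
import numpy as np

# Track an exponential moving average of the observed activation range,
# then derive scale and zero point for asymmetric linear quantization.
class EmaRangeObserver:
    def __init__(self, decay=0.99):
        self.decay, self.rmin, self.rmax = decay, None, None

    def update(self, x: np.ndarray):
        lo, hi = float(x.min()), float(x.max())
        if self.rmin is None:
            self.rmin, self.rmax = lo, hi
        else:
            self.rmin = self.decay * self.rmin + (1 - self.decay) * lo
            self.rmax = self.decay * self.rmax + (1 - self.decay) * hi

    def scale_zero_point(self, n_bits=8):
        qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
        S = (self.rmax - self.rmin) / (qmax - qmin)
        Z = int(round(qmin - self.rmin / S))
        return S, Z

obs = EmaRangeObserver()
for _ in range(100):                      # e.g., activations from calibration batches
    obs.update(np.random.randn(64, 128))
print(obs.scale_zero_point())
```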
Dynamic Range for Quantization
Minimize the mean-square-error (MSE) between the clipped/quantized values and the original values; the optimal clipping value can be found with the Newton-Raphson method.

AdaRound: Adaptive Rounding for Weight Quantization
• Rounding-to-nearest is not optimal. For weights such as 0.3, 0.5, 0.7, 0.2, the best rounding may be 0, 0, 1, 0 (one potential result) rather than the nearest grid points.
• What is optimal? The rounding that reconstructs the original activation the best, which may be very different from rounding each weight to its nearest value.
• AdaRound is for weight quantization only.
• With only short-term tuning, it is (almost) post-training quantization.
Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of rounding to nearest, ⌊w⌉, choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • Take a learning-based method to find the quantized value: w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • Optimize the following objective (derivation omitted):
      argmin_V ‖Wx − W̃x‖²_F + λ f_reg(V)
    → argmin_V ‖Wx − ⌊⌊W⌋ + h(V)⌉ x‖²_F + λ f_reg(V)
  • x is the input to the layer; V is a variable of the same shape as W that we optimize.
  • h(·) is a function mapping V to the range (0, 1), such as the rectified sigmoid.
  • f_reg(V) is a regularization term that encourages h(V) to be binary:
      f_reg(V) = Σ_{i,j} (1 − |2 h(V_{i,j}) − 1|^β)
Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
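Below is a rough, simplified sketch of this optimization (my own single-layer version with plain gradient descent and a fixed β; the paper uses a different optimizer, an annealing schedule for β, and folds in the quantization scale). It is meant only to make the objective concrete:

```python
import numpy as np

ZETA, GAMMA = 1.1, -0.1                      # stretch parameters for the rectified sigmoid

def rect_sigmoid(V):
    # rectified sigmoid h(V): stretched sigmoid, clipped to [0, 1]
    return np.clip(1.0 / (1.0 + np.exp(-V)) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))                # float weights (grid step folded in, so grid = integers)
x = rng.normal(size=(32, 256))               # calibration inputs to this layer
V = rng.normal(size=W.shape) * 0.01          # continuous variable being optimized
lam, beta, lr = 0.01, 2.0, 1e-2

for _ in range(2000):
    h = rect_sigmoid(V)
    err = (np.floor(W) + h - W) @ x                      # (W~ - W) x, reconstruction error
    grad_h = 2.0 * (err @ x.T) / x.shape[1]              # gradient of the averaged MSE term w.r.t. h
    grad_h += lam * (-2.0 * beta * np.sign(2 * h - 1) * np.abs(2 * h - 1) ** (beta - 1))
    sig = 1.0 / (1.0 + np.exp(-V))
    inside = (h > 0.0) & (h < 1.0)                       # the clipped region contributes no gradient
    V -= lr * grad_h * (ZETA - GAMMA) * sig * (1 - sig) * inside

W_rounded = np.floor(W) + (rect_sigmoid(V) > 0.5)        # final up-or-down decision per weight
```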
Neural Network Quantization: Design Choices
• Zero point: asymmetric vs. symmetric
• Scaling granularity: per-tensor, per-channel, group quantization
• Range clipping: exponential moving average, minimizing KL divergence, minimizing mean-square-error
• Rounding: round-to-nearest, AdaRound

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Post-Training INT8 Linear Quantization
Two common recipes and their ImageNet accuracy change relative to the floating-point model:

             Recipe A                          Recipe B
Activation   Symmetric, Per-Tensor,            Asymmetric, Per-Tensor,
             Minimize KL-Divergence            Exponential Moving Average (EMA)
Weight       Symmetric, Per-Tensor             Symmetric, Per-Channel

GoogleNet     -0.45%      0%
ResNet-50     -0.13%     -1.8%
ResNet-152    -0.08%     -0.6%
MobileNetV1    -        -11.8%
MobileNetV2    -         -2.1%

"Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity."
How should we improve the performance of quantized models?
Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Quantization-Aware Training
Train the model taking quantization into consideration
• To minimize the loss of accuracy, especially for aggressive quantization at 4 bits and below, the neural network is trained or fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.
(Figure: forward and backward passes of a Conv → Batch Norm → ReLU block with weight quantization inserted before the Conv; compare the K-Means centroid fine-tuning shown earlier.)
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights W is maintained throughout the training ("simulated" or "fake" quantization): the forward pass uses the quantized weights qW = Q(W), while the backward pass updates the full-precision copy W.
• The small gradients are accumulated in full precision, without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
Quantization-Aware Training: Straight-Through Estimator (STE)
• The quantization function Q(W) is a staircase (piecewise-constant) function, so its derivative is zero almost everywhere:
  ∂Q(W) / ∂W = 0
• Taken literally, the neural network would learn nothing, since the gradients become 0 and the weights would never get updated:
  gW = ∂L/∂W = (∂L/∂Q(W)) · (∂Q(W)/∂W) = 0
• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function:
  gW = ∂L/∂W = ∂L/∂Q(W)
(Figure: the staircase quantization function Q(w) plotted against w.)
Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
Quantization-Aware Training
• In the forward pass, both weights and activations pass through (simulated) quantization:
  W → SW qW = Q(W),   Y → SY (qY − ZY) = Q(Y)
• In the backward pass, STE passes the gradients straight through both quantizers:
  gW ← ∂L/∂Q(W),   gY ← ∂L/∂Q(Y)
(Figure: Layer N receives quantized inputs Q(X), uses quantized weights Q(W), and its outputs Y go through activation quantization Q(Y) before Layer N+1.)
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
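A minimal sketch of simulated/fake quantization with a straight-through estimator, assuming PyTorch (this is an illustration, not the exact recipe of the referenced whitepaper):

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # quantize-dequantize: round to the integer grid, clamp, map back to float
        return torch.round(w / scale).clamp(-128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient through as if quantization were the identity
        return grad_output, None

w = torch.randn(8, 8, requires_grad=True)     # full-precision master copy of the weights
x = torch.randn(8)
y = FakeQuantize.apply(w, 0.05) @ x           # forward pass uses the quantized weights
y.sum().backward()
print(w.grad.abs().sum())                     # non-zero: gradients reach the fp32 copy
```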
Neural Network Quantization
(Figure: the 4×4 weight matrix under K-Means-based, Linear, and Binary/Ternary Quantization.)

              K-Means-based Quantization                  Linear Quantization     Binary/Ternary Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic      Bit Operations
Binary/Ternary Quantization
Can we push the quantization bit width even lower, down to 1 bit?
• Consider the dot product yi = Σj Wij · xj. Example with weights [8, -3, 5, -1] and inputs [5, 2, 0, 1]:
  yi = 8×5 + (-3)×2 + 5×0 + (-1)×1 = 33
BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Binarization
• Deterministic Binarization
  • Directly computes the bit value based on a threshold (usually 0), resulting in a sign function:
    q = sign(r) = +1 if r ≥ 0, −1 if r < 0
• Stochastic Binarization
  • Uses global statistics or the value of the input data to determine the probability of being −1 or +1.
  • E.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid σ(r):
    q = +1 with probability p = σ(r), −1 with probability 1 − p, where σ(r) = min(max((r + 1)/2, 0), 1)
  • Harder to implement, as it requires the hardware to generate random bits when quantizing.
BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
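A small sketch of the two binarization rules (my own helper names), assuming NumPy:

```python
import numpy as np

def binarize_deterministic(r):
    return np.where(r >= 0, 1.0, -1.0)                 # sign function, with sign(0) = +1

def binarize_stochastic(r, rng=np.random.default_rng()):
    p = np.clip((r + 1) / 2, 0.0, 1.0)                 # hard sigmoid sigma(r)
    return np.where(rng.random(r.shape) < p, 1.0, -1.0)

r = np.array([2.09, -0.98, 1.48, 0.09])
print(binarize_deterministic(r))                       # [ 1. -1.  1.  1.]
print(binarize_stochastic(r))                          # random; +1 with probability sigma(r)
```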
Minimizing Quantization Error in Binarization
• Plain binarization costs a lot of accuracy: BinaryConnect loses 21.2% AlexNet-based ImageNet Top-1 accuracy.
• Adding a per-tensor floating-point scale factor α (the Binary Weight Network approach) reduces the quantization error:
  W ≈ α W^B, where W^B = sign(W) and α = (1/n) ‖W‖₁
• Example: for the 4×4 weight matrix, W^B is the sign pattern of W and α = ‖W‖₁ / 16 ≈ 1.05; the quantization error ‖W − αW^B‖²_F drops from 9.28 (no scale) to 9.24.
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized
• yi = Σj Wij · xj, now with Wij, xj ∈ {−1, +1}.
• Example: W = [1, −1, 1, −1], x = [1, 1, −1, 1]:
  yi = 1×1 + (−1)×1 + 1×(−1) + (−1)×1 = 1 + (−1) + (−1) + (−1) = −2
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized
• Multiplications become XNOR and the summation becomes a popcount. Encoding −1 as bit 0 and +1 as bit 1 (so W = 1010 and x = 1101 for the example above):
  yi = Σj Wij · xj = −n + 2 · popcount(xnor(W, x)) = −4 + (popcount(1000) ≪ 1) = −4 + 2 = −2

ImageNet Top-1 accuracy delta for binarized networks:

Network     Approach    W bits   A bits   Accuracy Delta
AlexNet     BWN         1        32       +0.2%
AlexNet     XNOR-Net    1        1        -12.4%
GoogleNet   BWN         1        32       -5.80%
GoogleNet   BNN         1        1        -24.20%
ResNet-18   BWN         1        32       -8.5%
ResNet-18   XNOR-Net    1        1        -18.1%

* BWN: Binary Weight Network, with a scale factor for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
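A sketch of the bit-level arithmetic (mine; plain Python integers stand in for hardware registers):

```python
# Encode -1 as bit 0 and +1 as bit 1, then y = -n + 2 * popcount(xnor(w, x)).
def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)     # keep only the n valid bits
    return -n + (bin(xnor).count("1") << 1)        # -n + 2 * popcount

# W = [1, -1, 1, -1] -> 1010, x = [1, 1, -1, 1] -> 1101
assert binary_dot(0b1010, 0b1101, n=4) == -2
```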
Ternary Weight Networks (TWN)
Weights are quantized to +1, 0, and −1 (times a scale rt):
  q = rt if r > Δ;  0 if |r| ≤ Δ;  −rt if r < −Δ
  where Δ = 0.7 × E(|r|) and rt = E_{|r|>Δ}(|r|)
• Example: for the 4×4 weight matrix, Δ = 0.7 × ‖W‖₁ / 16 = 0.73, and the first row [2.09, −0.98, 1.48, 0.09] maps to the ternary (2-bit) weights [1, −1, 1, 0].
• ImageNet Top-1 accuracy is compared for Full Precision, 1-bit (BWN), 2-bit (TWN), and TTQ (Trained Ternary Quantization) models.
• Computation across the three quantization families: Floating-Point Arithmetic (K-Means-based), Integer Arithmetic (Linear), Bit Operations (Binary/Ternary).
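A sketch of the TWN rule above (mine), assuming NumPy:

```python
import numpy as np

# Threshold at Delta = 0.7 * E|W|; the surviving weights share the scale r_t = mean |W| above Delta.
def ternarize(W):
    delta = 0.7 * np.abs(W).mean()
    mask = np.abs(W) > delta
    r_t = np.abs(W[mask]).mean() if mask.any() else 0.0
    return np.sign(W) * mask, r_t, delta               # ternary codes in {-1, 0, +1}

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
codes, r_t, delta = ternarize(W)
print(round(delta, 2))        # ~0.73
print(codes[0])               # [ 1. -1.  1.  0.]
```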
Mixed-Precision Quantization
• Uniform quantization uses the same bit widths everywhere, e.g., 8-bit weights / 8-bit activations for Layer 1, Layer 2, Layer 3, …
• Mixed-precision quantization picks a separate weight/activation bit width per layer, e.g., Layer 1: 4 bits / 5 bits; Layer 2: 6 bits / 7 bits; Layer 3: 5 bits / 4 bits.
• The design space explodes: with 8 weight bit-width choices and 8 activation bit-width choices, each layer has 8 × 8 = 64 choices.
• HAQ treats the search as a reinforcement-learning problem: an actor observes the layer state and emits the bit-width action, a critic scores it, and the reward reflects the accuracy of the resulting quantized model together with hardware feedback.
HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Solution: Design Automation
• The candidate per-layer bit widths are evaluated on a hardware accelerator (e.g., BitFusion for the edge), which composes low-bit multiply-accumulate operations cycle by cycle from LSB to MSB, so the search receives direct hardware feedback rather than proxy estimates.
HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Outperforms Uniform Quantization
(Figure: accuracy vs. latency/model size curves, HAQ vs. uniform quantization.)
HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

HAQ Supports Multiple Objectives
(Figure: HAQ results under different optimization objectives and constraints.)
HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Quantization Policy for Edge and Cloud
(Figure: per-layer #weight bits and #activation bits chosen for the pointwise and depthwise layers of a mixed-precision quantized MobileNetV2, on edge vs. cloud hardware.)
HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Summary of Today’s Lecture
In this lecture, we
1. Reviewed linear quantization: mapping the floating-point range [rmin, rmax] onto integers with a scale and zero point.
2. Covered post-training quantization (quantization granularity, dynamic range clipping, rounding), quantization-aware training with the straight-through estimator, binary/ternary quantization, and mixed-precision quantization.
References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al.,
CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi,
arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10.Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1.
[Courbariaux et al., Arxiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv
2016]
12.PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13.WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14.Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15.Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16.HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]