
Machine Learning Systems
QUANTIZATION

Lecture slides inspired by: Prof. Song Han
Associate Professor, MIT; Distinguished Scientist, NVIDIA

Agenda
1. Review the numeric data types, including integers and floating-point numbers.
2. Learn the basic concept of neural network quantization.
3. Learn three common types of neural network quantization:
   1. K-Means-based Quantization
   2. Linear Quantization
   3. Binary and Ternary Quantization

Low Bit Operations are Cheaper
Less Bit-Width → Less Energy

Rough energy cost for various operations in 45nm, 0.9V:

| Operation         | Energy [pJ] |
| 8-bit int ADD     | 0.03        |
| 32-bit int ADD    | 0.1         |
| 16-bit float ADD  | 0.4         |
| 32-bit float ADD  | 0.9         |
| 8-bit int MULT    | 0.2         |
| 32-bit int MULT   | 3.1         |
| 16-bit float MULT | 1.1         |
| 32-bit float MULT | 3.7         |

An 8-bit integer ADD uses roughly 30× less energy than a 32-bit float ADD, and an 8-bit integer MULT uses roughly 16× less energy than a 32-bit integer MULT.

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]

How should we make deep learning more efficient?

Numeric Data Types
How is numeric data represented in modern computing systems?

Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example: 0011 0001_2 = 2^5 + 2^4 + 2^0 = 49
• Signed Integer, Sign-Magnitude Representation
  • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
  • Both 000…00 and 100…00 represent 0
  • Example: 1011 0001_2 = −(2^5 + 2^4 + 2^0) = −49
• Signed Integer, Two's Complement Representation
  • n-bit range: [−2^(n−1), 2^(n−1) − 1]
  • 000…00 represents 0; 100…00 represents −2^(n−1)
  • Example: 1100 1111_2 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49

Fixed-Point Number
• Split the bits into an integer part and a fraction part around an implicit "decimal" point.
• Example (4 integer bits, 4 fraction bits, two's complement): 0011.0001_2 = 2^1 + 2^0 + 2^(−4) = 3.0625
• Equivalently, interpret the bits as the two's complement integer 49 and scale by 2^(−4): 49 × 0.0625 = 3.0625

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
• Layout: 1 sign bit | 8-bit Exponent | 23-bit Fraction (significand / mantissa)
• Value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127), with Exponent Bias = 127 = 2^(8−1) − 1

How to represent 0.265625?
• 0.265625 = 1.0625 × 2^(−2) = (1 + 0.0625) × 2^(125−127)
• Encoding: 0 01111101 00010000000000000000000 (sign = 0, Exponent = 125, Fraction = 0.0625)

How should we represent 0?
• Normal numbers (Exponent ≠ 0): value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
• Subnormal numbers (Exponent = 0): the implicit leading 1 should have given 2^(0−127), but we force the exponent to 1 − 127 and drop the leading 1, so value = (−1)^sign × Fraction × 2^(1−127)
• Zero: Exponent = 0 and Fraction = 0, giving +0 (0 00000000 000…0) and −0 (1 00000000 000…0)

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

What is the smallest positive subnormal value?
• Subnormal encoding 0 00000000 00000000000000000000001: value = 2^(−23) × 2^(−126) = 2^(−149)
• For comparison, the smallest positive normal value is 0 00000001 000…0 = (1 + 0) × 2^(1−127) = 2^(−126)

What is the largest positive subnormal value?
• Subnormal encoding 0 00000000 11111111111111111111111: Fraction = 2^(−1) + 2^(−2) + … + 2^(−23) = 1 − 2^(−23)
• Value = (1 − 2^(−23)) × 2^(−126), just below the smallest positive normal value 2^(−126)

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
• Exponent = 11111111 and Fraction = 0: ±∞ (0 11111111 000…0 is +∞, 1 11111111 000…0 is −∞)
• Exponent = 11111111 and Fraction ≠ 0: NaN (Not a Number)
• This wastes many encodings; we will revisit this in FP8.

Summary of IEEE 754 encodings:

| Exponent            | Fraction = 0 | Fraction ≠ 0 | Equation                                      |
| 00H = 0             | ±0           | subnormal    | (−1)^sign × Fraction × 2^(1−127)              |
| 01H … FEH = 1 … 254 | normal       | normal       | (−1)^sign × (1 + Fraction) × 2^(Exponent−127) |
| FFH = 255           | ±INF         | NaN          |                                               |

• Subnormal values cover ±[2^(−149), (1 − 2^(−23)) × 2^(−126)]; normal values cover ±[2^(−126), (1 + 1 − 2^(−23)) × 2^127].

Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

| Format                              | Exponent (bits) | Fraction (bits) | Total (bits) |
| IEEE 754 Single Precision (FP32)    | 8               | 23              | 32           |
| IEEE 754 Half Precision (IEEE FP16) | 5               | 10              | 16           |
| Google Brain Float (BF16)           | 8               | 7               | 16           |

Numeric Data Types
• Question: What is the following IEEE half precision (IEEE FP16) number in decimal?

  1100011100000000   (1 sign bit | 5-bit Exponent | 10-bit Fraction, Exponent Bias = 15)

• Sign: −
• Exponent: 10001_2 − 15 = 17 − 15 = 2
• Fraction: 1100000000_2 = 0.75
• Decimal Answer = −(1 + 0.75) × 2^2 = −1.75 × 4 = −7.0

Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?

• 2.5 = 1.25 × 2^1, Exponent Bias = 127
• Sign: +
• Exponent Binary: 1 + 127 = 128 = 10000000_2
• Fraction Binary: 0.25 = 0100000_2
• Binary Answer: 0 10000000 0100000 = 0100000000100000   (1 sign bit | 8-bit Exponent | 7-bit Fraction)

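To check exercises like these mechanically, here is a minimal sketch (not from the slides; helper names are my own) that prints the sign / exponent / fraction fields of a float in FP16 and FP32 with NumPy, and emulates BF16 by dropping the low 16 fraction bits of FP32:

```python
import numpy as np

def bit_fields(x, dtype, exp_bits, frac_bits):
    """Print the sign / exponent / fraction fields of x in the given float dtype."""
    v = np.array(x, dtype=dtype)
    # View the float's raw bits as an unsigned integer of the same width.
    bits = int(v.view(np.uint16 if v.itemsize == 2 else np.uint32))
    total = 1 + exp_bits + frac_bits
    s = format(bits, f"0{total}b")
    sign, exp, frac = s[0], s[1:1 + exp_bits], s[1 + exp_bits:]
    bias = 2 ** (exp_bits - 1) - 1
    print(f"{dtype.__name__:>9}: {sign} {exp} {frac}  (exponent={int(exp, 2) - bias:+d}, value={float(v)})")

def to_bf16_bits(x):
    """BF16 is FP32 with the low 16 fraction bits dropped; keep only the upper 16 bits."""
    bits = int(np.float32(x).view(np.uint32)) & 0xFFFF0000
    return format(bits >> 16, "016b")

bit_fields(-7.0, np.float16, exp_bits=5, frac_bits=10)   # -> 1 10001 1100000000
bit_fields(2.5, np.float32, exp_bits=8, frac_bits=23)
print("bf16(2.5):", to_bf16_bits(2.5))                   # -> 0100000000100000
```
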
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

| Format                                            | Exponent (bits) | Fraction (bits) | Total (bits) |
| IEEE 754 Single Precision (FP32)                  | 8               | 23              | 32           |
| IEEE 754 Half Precision (IEEE FP16)               | 5               | 10              | 16           |
| Nvidia FP8 (E4M3)                                 | 4               | 3               | 8            |
| Nvidia FP8 (E5M2), for gradients in the backward  | 5               | 2               | 8            |

* FP8 E4M3 does not have INF, and S.1111.111_2 is used for NaN.
* The largest FP8 E4M3 normal value is S.1111.110_2 = 448.
* FP8 E5M2 has INF (S.11111.00_2) and NaN (S.11111.XX_2).
* The largest FP8 E5M2 normal value is S.11110.11_2 = 57344.

INT4 and FP4
Exponent Width → Range; Fraction Width → Precision

• INT4 (S + 3 magnitude bits): representable values −8 … 7
  • e.g., 0001_2 = 1, 0111_2 = 7
• FP4 E1M2 (S E M M, bias = 0): representable magnitudes 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5
  • e.g., 0001_2 = 0.25 × 2^(1−0) = 0.5 (subnormal); 0111_2 = (1 + 0.75) × 2^(1−0) = 3.5
• FP4 E2M1 (S E E M, bias = 1): representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6; no inf, no NaN
  • e.g., 0001_2 = 0.5 × 2^(1−1) = 0.5 (subnormal); 0111_2 = (1 + 0.5) × 2^(3−1) = 6
• FP4 E3M0 (S E E E, bias = 3): representable magnitudes 0, 0.25, 0.5, 1, 2, 4, 8, 16; no inf, no NaN
  • e.g., 0001_2 = (1 + 0) × 2^(1−3) = 0.25; 0111_2 = (1 + 0) × 2^(7−3) = 16

What is Quantization?

Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set.

• Figure: a continuous signal quantized into a discrete signal, and an original image reduced to a 16-color image ("palettization"). Images are in the public domain.
• The difference between an input value and its quantized value is referred to as the quantization error.

Quantization [Wikipedia]

Neural Network Quantization: Agenda

We will study three families of neural network quantization, illustrated on the same 4×4 floating-point weight matrix:
1. K-Means-based Quantization: weights become integer cluster indices plus a floating-point codebook of centroids.
2. Linear Quantization: weights become integers related to the real values by an affine mapping (a scale and a zero point).
3. Binary/Ternary Quantization: weights are constrained to two or three values.

The starting point is a model stored as floating-point weights and computed with floating-point arithmetic; each scheme changes what is stored and/or how the arithmetic is performed.

Neural Network Quantization: Weight Quantization

weights (32-bit float):
  2.09  -0.98   1.48   0.09
  0.05  -0.14  -1.08   2.12
 -0.91   1.92   0     -1.03
  1.87   0      1.53   1.49

Observation: many weights are close to each other; for example, 2.09, 2.12, 1.92, and 1.87 can all be represented by a single shared value of about 2.0.

K-Means-based Weight Quantization

Cluster the 16 weights into 4 groups (2-bit) and store only the per-weight cluster index plus a 4-entry codebook of centroids:

weights (32-bit float):          cluster index (2-bit int):   centroids (32-bit float):
  2.09  -0.98   1.48   0.09        3  0  2  1                   3:  2.00
  0.05  -0.14  -1.08   2.12        1  1  0  3                   2:  1.50
 -0.91   1.92   0     -1.03        0  3  1  0                   1:  0.00
  1.87   0      1.53   1.49        3  1  2  2                   0: -1.00

reconstructed weights (32-bit float):        quantization error:
  2.00  -1.00   1.50   0.00                    0.09   0.02  -0.02   0.09
  0.00   0.00  -1.00   2.00                    0.05  -0.14  -0.08   0.12
 -1.00   2.00   0.00  -1.00                    0.09  -0.08   0     -0.03
  2.00   0.00   1.50   1.50                   -0.13   0      0.03  -0.01

Storage:
• Original: 32 bit × 16 = 512 bit = 64 B
• Quantized: index 2 bit × 16 = 32 bit = 4 B, codebook 32 bit × 4 = 128 bit = 16 B → 20 B in total, 3.2× smaller.

Assume N-bit quantization and #parameters = M (with M >> 2^N):
• Original: 32 bit × M = 32M bit
• Quantized: N bit × M (indices) + 2^N × 32 bit (codebook) = NM + 2^(N+5) bit ≈ NM bit → roughly 32/N × smaller.

Deep Compression [Han et al., ICLR 2016]

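A minimal sketch of this codebook construction (not the authors' code; it assumes scikit-learn and the 4×4 example weights):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_bits=2):
    """Cluster weights into 2**n_bits centroids; return codebook and integer indices."""
    w = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** n_bits, n_init=10).fit(w)
    codebook = km.cluster_centers_.flatten()           # float32 centroids
    indices = km.labels_.astype(np.uint8)               # n_bits-wide index per weight
    return codebook, indices.reshape(weights.shape)

def kmeans_dequantize(codebook, indices):
    """Reconstruct the (lossy) weight tensor by codebook lookup."""
    return codebook[indices]

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

codebook, idx = kmeans_quantize(W, n_bits=2)
W_hat = kmeans_dequantize(codebook, idx)
print(np.round(codebook, 2))        # roughly [-1.0, 0.0, 1.5, 2.0] (order may differ)
print(np.abs(W - W_hat).max())      # quantization error
```
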
K-Means-based Weight Quantization
Fine-tuning Quantized Weights

To recover accuracy, the centroids are fine-tuned: the weight gradients are grouped by cluster index, reduced (summed) within each cluster, scaled by the learning rate, and used to update the centroids.

centroids (before → after fine-tuning):
  3:  2.00 →  1.96
  2:  1.50 →  1.48
  1:  0.00 → -0.04
  0: -1.00 → -0.97

gradient (32-bit float):
 -0.03  -0.01   0.03   0.02
 -0.01   0.01  -0.02   0.12
 -0.01   0.02   0.04   0.01
 -0.07  -0.02   0.01  -0.02

Deep Compression [Han et al., ICLR 2016]

K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (0.5% to -4.5%) vs. model size ratio after compression (2% to 20%), with three curves: Quantization Only, Pruning Only, and Pruning + Quantization. Combining pruning and quantization compresses the model the most at a given accuracy loss.]

Deep Compression [Han et al., ICLR 2016]

Weight Distributions During Quantization
• Before quantization: the weight values form a continuous distribution.
• After quantization: the weights collapse onto a small set of discrete centroid values.
• After retraining: the discrete centroids shift slightly to recover accuracy.
[Figures: histograms of weight value vs. count for each stage.]

How Many Bits do We Need?
[Figure: accuracy vs. number of quantization bits.]

Deep Compression [Han et al., ICLR 2016]

Huffman Coding
• Infrequent weights: use more bits to represent.
• Frequent weights: use fewer bits to represent.

Summary of Deep Compression
[Figure: the Deep Compression pipeline of pruning, quantization (weight sharing), and Huffman coding.]

Deep Compression [Han et al., ICLR 2016]

Deep Compression Results

| Network   | Original Size | Compressed Size | Compression Ratio | Original Accuracy | Compressed Accuracy |
| LeNet-300 | 1070 KB       | 27 KB           | 40x               | 98.36%            | 98.42%              |
| LeNet-5   | 1720 KB       | 44 KB           | 39x               | 99.20%            | 99.26%              |
| AlexNet   | 240 MB        | 6.9 MB          | 35x               | 80.27%            | 80.30%              |
| VGGNet    | 550 MB        | 11.3 MB         | 49x               | 88.68%            | 89.09%              |
| GoogleNet | 28 MB         | 2.8 MB          | 10x               | 88.90%            | 88.92%              |
| ResNet-18 | 44.6 MB       | 4.0 MB          | 11x               | 89.24%            | 89.28%              |

Can we make compact models to begin with?

Deep Compression [Han et al., ICLR 2016]

SqueezeNet
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]

Deep Compression on SqueezeNet

| Network    | Approach         | Size    | Ratio | Top-1 Accuracy | Top-5 Accuracy |
| AlexNet    | -                | 240 MB  | 1x    | 57.2%          | 80.3%          |
| AlexNet    | SVD              | 48 MB   | 5x    | 56.0%          | 79.4%          |
| AlexNet    | Deep Compression | 6.9 MB  | 35x   | 57.2%          | 80.3%          |
| SqueezeNet | -                | 4.8 MB  | 50x   | 57.5%          | 80.3%          |
| SqueezeNet | Deep Compression | 0.47 MB | 510x  | 57.5%          | 80.3%          |

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]

K-Means-based Weight Quantization: Inference

In storage, the model keeps the quantized integer weights (cluster indices) and the floating-point codebook. During computation, the weights are decoded back to floating point through a codebook lookup, and the layer (e.g., Conv, bias add, ReLU) runs in floating point on floating-point inputs.

• The weights are decompressed using a lookup table (i.e., the codebook) during runtime inference.
• K-Means-based weight quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point.

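A sketch of the runtime decode step, continuing the hypothetical helpers above: the stored integer indices are expanded through the codebook before an ordinary floating-point layer is applied (a dense layer with ReLU stands in for the Conv block on the slide).

```python
import numpy as np

def decode_layer(indices, codebook, x):
    """Dequantize weights by codebook lookup, then run the layer in floating point."""
    W = codebook[indices]            # gather: uint index -> float32 centroid
    return np.maximum(W @ x, 0.0)    # floating-point matmul + ReLU

codebook = np.array([-1.0, 0.0, 1.5, 2.0], dtype=np.float32)
indices = np.array([[3, 0, 2, 1],
                    [1, 1, 0, 3],
                    [0, 3, 1, 0],
                    [3, 1, 2, 2]], dtype=np.uint8)
x = np.ones(4, dtype=np.float32)
print(decode_layer(indices, codebook, x))
```
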
Neural Network Quantization

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |

Linear Quantization

What is Linear Quantization?
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float):          quantized weights (2-bit signed int):
  2.09  -0.98   1.48   0.09        1  -2   0  -1
  0.05  -0.14  -1.08   2.12       -1  -1  -2   1
 -0.91   1.92   0     -1.03       -2   1  -1  -2
  1.87   0      1.53   1.49        1  -1   0   0

zero point Z = -1 (2-bit signed int), scale S = 1.07 (32-bit float)

reconstructed weights S(q − Z) (32-bit float):     quantization error:
  2.14  -1.07   1.07   0                            -0.05   0.09   0.41   0.09
  0      0     -1.07   2.14                          0.05  -0.14  -0.01  -0.02
 -1.07   2.14   0     -1.07                          0.16  -0.22   0      0.04
  2.14   0      1.07   1.07                         -0.27   0      0.46   0.42

2-bit signed integer codes: 01 = 1, 00 = 0, 11 = -1, 10 = -2

We will learn how to determine these quantization parameters (S, Z).

Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

• q: the quantized integer (e.g., the 2-bit signed weights above)
• Z: the zero point, an integer quantization parameter; it allows the real number r = 0 to be exactly representable by a quantized integer
• S: the scale, a floating-point quantization parameter

The floating-point range [r_min, r_max] is mapped onto the integer range [q_min, q_max]:

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

• The scale S stretches the integer grid [q_min, q_max] to cover the floating-point range [r_min, r_max]:

  S = (r_max − r_min) / (q_max − q_min)

• The zero point Z aligns the two ranges:

  Z = round(q_min − r_min / S)

For the example weight matrix (r_min = −1.08, r_max = 2.12) quantized to 2 bits (q_min = −2, q_max = 1):

  S = (2.12 − (−1.08)) / (1 − (−2)) = 1.07,   Z = round(−2 − (−1.08) / 1.07) = −1

Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the matrix multiplication Y = WX. Substituting r = S(q − Z) for Y, W, and X:

  S_Y (q_Y − Z_Y) = S_W (q_W − Z_W) · S_X (q_X − Z_X)
  q_Y = (S_W S_X / S_Y) (q_W q_X − Z_W q_X − Z_X q_W + Z_W Z_X) + Z_Y

• The terms that involve only the weights and the zero points (−Z_X q_W + Z_W Z_X) can be precomputed offline.
• The products q_W q_X are N-bit integer multiplications, the sums are 32-bit integer additions/subtractions, and adding Z_Y is an N-bit integer addition.
• Empirically, the rescaling factor S_W S_X / S_Y is always in the interval (0, 1), so it can be applied as a fixed-point multiplication and shift that rescales the 32-bit accumulator back to an N-bit integer.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Symmetric Linear Quantization
Zero point Z = 0 and a symmetric floating-point range

• Choose the floating-point range as [−|r|_max, |r|_max] so that the zero point is exactly Z = 0.
• Restricted range: S = |r|_max / q_max, with q_max = 2^(N−1) − 1.
• Full range mode: the scale is chosen so that the full integer range, including q_min = −2^(N−1), is used, which makes the scale slightly smaller.

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• With symmetric weight quantization (Z_W = 0), the expression simplifies to:

  q_Y = (S_W S_X / S_Y) (q_W q_X − Z_X q_W) + Z_Y

• The term Z_X q_W can be precomputed; q_W q_X uses N-bit integer multiplications with 32-bit integer accumulation; the rescale back to N-bit and the addition of Z_Y finish the computation.

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far we ignored bias. Now consider the fully-connected layer Y = WX + b.
• Use symmetric weight quantization (Z_W = 0) and quantize the bias with Z_b = 0 and S_b = S_W S_X. Then:

  q_Y = (S_W S_X / S_Y) (q_W q_X + q_b − Z_X q_W) + Z_Y

• Precompute q_bias = q_b − Z_X q_W (we will discuss how to compute the activation zero point Z_X in the next lecture), so that

  q_Y = (S_W S_X / S_Y) (q_W q_X + q_bias) + Z_Y

• q_W q_X uses N-bit integer multiplications, the accumulation and the q_bias addition are 32-bit integer additions, the rescale brings the result back to N-bit, and adding Z_Y is an N-bit integer addition. Note: both q_b and q_bias are 32 bits.

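A small numerical sketch of this integer pipeline (my own illustration, not the paper's code): it quantizes W, X, and b, performs the integer matmul with the precomputed bias term, rescales, and compares against the floating-point result.

```python
import numpy as np

def quantize_sym(r, n_bits):
    """Symmetric linear quantization: Z = 0, S = |r|max / qmax."""
    qmax = 2 ** (n_bits - 1) - 1
    S = np.abs(r).max() / qmax
    return np.clip(np.round(r / S), -qmax - 1, qmax).astype(np.int32), S

def quantize_asym(r, n_bits):
    """Asymmetric linear quantization: r = S(q - Z)."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(round(qmin - r.min() / S))
    return np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32), S, Z

rng = np.random.default_rng(0)
W, X, b = rng.normal(size=(4, 8)), rng.normal(size=(8,)), rng.normal(size=(4,))
Y = W @ X + b

qW, SW = quantize_sym(W, 8)                    # Z_W = 0
qX, SX, ZX = quantize_asym(X, 8)
qb = np.round(b / (SW * SX)).astype(np.int32)  # Z_b = 0, S_b = S_W S_X (32-bit)
_, SY, ZY = quantize_asym(Y, 8)                # output scale/zero point from calibration

qbias = qb - qW @ np.full_like(qX, ZX)         # precomputed: q_b - Z_X q_W
acc = qW @ qX + qbias                          # 32-bit integer accumulation
qY = np.round(SW * SX / SY * acc) + ZY         # rescale to N-bit, add Z_Y

print(np.abs(SY * (qY - ZY) - Y).max())        # small dequantization error vs. float Y
```
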
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b. With Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Conv(q_W, Z_X)
  q_Y = (S_W S_X / S_Y) (Conv(q_W, q_X) + q_bias) + Z_Y

• Dataflow: the quantized integer inputs and weights enter an integer convolution; the precomputed int32 quantized bias is added to the int32 accumulator; the result is rescaled by S_W S_X / S_Y and the output zero point is added to produce the quantized N-bit outputs.
• Conv(q_W, q_X) uses N-bit integer multiplications with 32-bit integer accumulation; the rescale brings the accumulator back to an N-bit integer. Note: both q_b and q_bias are 32 bits.

INT8 Linear Quantization
An affine mapping of integers to real numbers

| Neural Network                   | ResNet-50 | Inception-V3 |
| Floating-point Accuracy          | 76.4%     | 78.4%        |
| 8-bit Integer-quantized Accuracy | 74.9%     | 75.4%        |

[Figure: latency-vs-accuracy tradeoff of float vs. integer-only MobileNets on ImageNet using Snapdragon 835 big cores.]

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Neural Network Quantization

| Quantization |                           | K-Means-based                            | Linear             | ? |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | ? |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | ? |

Summary of Today's Lecture
Today, we reviewed and learned:
• The numeric data types used in modern computing systems, including integers and floating-point numbers (e.g., the two's complement byte 1100 1111_2 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49).
• The basic concept of neural network quantization: converting the weights and activations of neural networks into a limited discrete set of numbers.
• Two types of common neural network quantization:
  • K-Means-based Quantization
  • Linear Quantization (floating-point range, scale, and zero point)

References
1. Model Compression and Hardware Acceleration for Neural Networks: A
Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE
ISSCC 2014]
3. Deep Compression [Han et al., ICLR 2016]
4. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
5. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only
Inference [Jacob et al., CVPR 2018]
6. BinaryConnect: Training Deep Neural Networks with Binary Weights during
Propagations [Courbariaux et al., NeurIPS 2015]
7. Binarized Neural Networks: Training Deep Neural Networks with Weights and
Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]
8. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks
[Rastegari et al., ECCV 2016]
9. Ternary Weight Networks [Li et al., Arxiv 2016]
10.Trained Ternary Quantization [Zhu et al., ICLR 2017]

Lecture Plan
Today we will:
1. Review Linear Quantization.
2. Introduce Post-Training Quantization (PTQ), which quantizes a trained floating-point neural network model, including: per-channel quantization, group quantization, and range clipping.
3. Introduce Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning to recover the accuracy.
4. Introduce binary and ternary quantization.
5. Introduce automatic mixed-precision quantization.


Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |
K-Means-based Weight Quantization
weights cluster index fine-tuned
(32-bit float) (2-bit int) centroids centroids
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1.96

0.05 -0.14 -1.08 2.12 cluster 1 1 0 3 2: 1.50 1.48

-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -0.04

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 ×lr -0.97

gradient

-0.03 -0.01 0.03 0.02 -0.03 0.12 0.02 -0.07 0.04

-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02

-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04

-0.07 -0.02 0.01 -0.02 -0.01 -0.02 -0.01 0.01 -0.03

Deep Compression [Han et al., ICLR 2016]


K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (0.5% to -4.5%) vs. model size ratio after compression (2% to 20%) for Pruning + Quantization, Pruning Only, and Quantization Only.]

Deep Compression [Han et al., ICLR 2016]

Linear Quantization
An affine mapping of integers to real numbers r = S(q −
Z)
weights quantized weights zero point scale
(32-bit float) (2-bit signed int) (2-bit signed int) (32-bit float)

2.09 -0.98 1.48 0.09 1 -2 0 -1 2.14 -1.07 1.07 0

0.05 -0.14 -1.08 2.12 -1 -1 -2 1 0 0 -1.07 2.14

-0.91 1.92 0 -1.03


( -2 1 -1 -2
-1 ) 1.07 = -1.07 2.14 0 -1.07

1.87 0 1.53 1.49 1 -1 0 0 2.14 0 1.07 1.07

Binary Decimal
01 1
00 0
11 -1
10 -2
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

• The floating-point range [r_min, r_max] (containing r = 0) is mapped by the scale S onto the integer range [q_min, q_max], with the zero point Z marking where r = 0 lands.

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the fully-connected layer Y = WX + b, with Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Z_X q_W
  q_Y = (S_W S_X / S_Y) (q_W q_X + q_bias) + Z_Y

  (Rescale to N-bit Int | N-bit Int Mult. + 32-bit Int Add. | N-bit Int Add)
  Note: both q_b and q_bias are 32 bits.

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b, with Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Conv(q_W, Z_X)
  q_Y = (S_W S_X / S_Y) (Conv(q_W, q_X) + q_bias) + Z_Y

  (Rescale to N-bit Int | N-bit Int Mult. + 32-bit Int Add. | N-bit Int Add)
  Note: both q_b and q_bias are 32 bits.

Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Asymmetric Linear Quantization (floating-point range [r_min, r_max]):

  S = (r_max − r_min) / (q_max − q_min)
  Z = round(q_min − r_min / S)

Example (the 4×4 weight matrix, 2-bit quantization with q_min = −2, q_max = 1):

  S = (2.12 − (−1.08)) / (1 − (−2)) = 1.07
  Z = round(−2 − (−1.08) / 1.07) = −1

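A compact sketch of the asymmetric quantize/dequantize round-trip (helper names are my own), reproducing the S = 1.07, Z = −1 example:

```python
import numpy as np

def linear_quantize(r, n_bits=2):
    """Asymmetric linear quantization: returns (q, S, Z) such that r ≈ S * (q - Z)."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(round(qmin - r.min() / S))
    q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

q, S, Z = linear_quantize(W, n_bits=2)
print(round(float(S), 2), Z)          # -> 1.07 -1
print(S * (q - Z))                    # reconstructed weights, e.g. 2.14, -1.07, ...
```
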
Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Symmetric Linear Quantization (floating-point range [−|r|_max, |r|_max]):

  Z = 0
  S = |r|_max / q_max

Example (the same weight matrix, 2-bit quantization): S = 2.12 / 1 = 2.12

Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

• Topic I: Quantization Granularity
• Topic II: Dynamic Range Clipping
• Topic III: Rounding

Quantization Granularity
• Per-Tensor Quantization
• Per-Channel Quantization
• Group Quantization
  • Per-Vector Quantization
  • Shared Micro-exponent (MX) data type

Symmetric Linear Quantization on Weights
• |r|_max = |W|_max, so S = |W|_max / q_max with Z = 0.
• Using a single scale for the whole weight tensor (Per-Tensor Quantization):
  • works well for large models
  • accuracy drops for small models
• A common failure results from large differences (more than 100×) in the ranges of weights for different output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).
• Solution: Per-Channel Quantization.

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
Per-Channel Weight Quantization
Example: 2-bit linear quantization of the 4×4 weight matrix (rows = output channels oc, columns = input channels ic)

  2.09  -0.98   1.48   0.09
  0.05  -0.14  -1.08   2.12
 -0.91   1.92   0     -1.03
  1.87   0      1.53   1.49

Per-Tensor Quantization: one scale for the whole tensor
  |r|_max = 2.12,  S = |r|_max / q_max = 2.12 / (2^(2−1) − 1) = 2.12

  quantized q_W:      reconstructed S q_W:
   1  0  1  0           2.12   0     2.12   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0  0           0      2.12  0      0
   1  0  1  1           2.12   0     2.12   2.12

  ∥W − S q_W∥_F = 2.28

Per-Channel Quantization: one scale per output channel (row)
  |r|_max per row: 2.09, 2.12, 1.92, 1.87  →  S_0 = 2.09, S_1 = 2.12, S_2 = 1.92, S_3 = 1.87

  quantized q_W:      reconstructed S ⊙ q_W:
   1  0  1  0           2.09   0     2.09   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0 -1           0      1.92  0     -1.92
   1  0  1  1           1.87   0     1.87   1.87

  ∥W − S ⊙ q_W∥_F = 2.08 < 2.28, so per-channel quantization reconstructs the weights more accurately.

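A short sketch comparing the two granularities numerically (helper name is mine):

```python
import numpy as np

def quantize_sym(W, n_bits, per_channel=False):
    """Symmetric quantization; per_channel=True uses one scale per output channel (row)."""
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel:
        S = np.abs(W).max(axis=1, keepdims=True) / qmax   # shape (oc, 1)
    else:
        S = np.abs(W).max() / qmax                        # scalar
    q = np.clip(np.round(W / S), -qmax - 1, qmax)
    return q, S

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

for per_channel in (False, True):
    q, S = quantize_sym(W, n_bits=2, per_channel=per_channel)
    err = np.linalg.norm(W - S * q)                       # Frobenius norm of the error
    print("per-channel" if per_channel else "per-tensor ", round(float(err), 2))
# -> per-tensor 2.28, per-channel 2.08
```
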
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factor
• r = S(q − Z)  →  r = γ · S_q (q − Z)
  • γ is a floating-point coarse-grained scale factor (e.g., per channel or per tensor)
  • S_q is an integer per-vector scale factor (one per small vector of elements)
• Achieves a balance between accuracy and hardware efficiency:
  • less expensive integer scale factors at finer granularity
  • more expensive floating-point scale factors at coarser granularity
• Memory overhead of two-level scaling: given 4-bit quantization with a 4-bit per-vector scale for every 16 elements, the effective bit width is 4 + 4/16 = 4.25 bits.

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]

Group Quantization
Multi-level scaling scheme

  r = S(q − Z)  →  r = (q − z) · s_l0 · s_l1 · ⋯

• r: real number value
• q: quantized value
• z: zero point (z = 0 is symmetric quantization)
• s_l0, s_l1, …: scale factors of different levels: a fine-grained level-0 scale shared by a small group of elements (e.g., 2 or 16) and a coarser level-1 scale (e.g., per 16 elements or per channel)

| Quantization Approach | Data Type | L0 Group Size | L0 Scale Data Type | L1 Group Size | L1 Scale Data Type | Effective Bit Width |
| Per-Channel Quant     | INT4      | Per Channel   | FP16               | -             | -                  | 4                   |
| VSQ                   | INT4      | 16            | UINT4              | Per Channel   | FP16               | 4 + 4/16 = 4.25     |
| MX4                   | S1M2      | 2             | E1M0               | 16            | E8M0               | 3 + 1/2 + 8/16 = 4  |
| MX6                   | S1M4      | 2             | E1M0               | 16            | E8M0               | 5 + 1/2 + 8/16 = 6  |
| MX9                   | S1M7      | 2             | E1M0               | 16            | E8M0               | 8 + 1/2 + 8/16 = 9  |

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]

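A toy sketch of two-level (VSQ-style) scaling under the assumptions above: an FP16 per-channel scale plus an unsigned 4-bit per-vector scale for every 16 elements. The decomposition heuristic below is mine, not the papers' exact algorithm.

```python
import numpy as np

def vsq_quantize(w, vec=16, n_bits=4, scale_bits=4):
    """Two-level scaling: per-channel FP16 scale * per-vector UINT scale * INT4 value."""
    qmax = 2 ** (n_bits - 1) - 1                                # 7 for INT4
    smax = 2 ** scale_bits - 1                                  # 15 for UINT4
    groups = w.reshape(-1, vec)
    per_vec = np.abs(groups).max(axis=1, keepdims=True)         # ideal per-vector range
    s1 = np.float16(per_vec.max() / (qmax * smax))              # coarse per-channel scale (FP16)
    s0 = np.clip(np.ceil(per_vec / (qmax * s1)), 1, smax)       # integer per-vector scale (UINT4)
    q = np.clip(np.round(groups / (s0 * s1)), -qmax - 1, qmax)  # INT4 values
    return (q * s0 * s1).reshape(w.shape)                       # dequantized reconstruction

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
w_hat = vsq_quantize(w)
print(float(np.abs(w - w_hat).max()))   # reconstruction error at ~4.25 effective bits/element
```
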
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding
Linear Quantization on Activations
• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range, activation statistics are gathered before deploying the model.

Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 1: During training, use Exponential Moving Averages (EMA):

  r̂_max,min^(t) = α · r_max,min^(t) + (1 − α) · r̂_max,min^(t−1)

• The observed ranges are smoothed across thousands of training steps.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

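A minimal sketch of such an EMA range tracker (the class and attribute names are my own):

```python
import numpy as np

class EMARange:
    """Track smoothed activation range r̂_min, r̂_max with an exponential moving average."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.r_min = None
        self.r_max = None

    def update(self, x):
        lo, hi = float(x.min()), float(x.max())
        if self.r_min is None:                       # first batch initializes the range
            self.r_min, self.r_max = lo, hi
        else:                                        # r̂(t) = α·r(t) + (1−α)·r̂(t−1)
            self.r_min = self.alpha * lo + (1 - self.alpha) * self.r_min
            self.r_max = self.alpha * hi + (1 - self.alpha) * self.r_max
        return self.r_min, self.r_max

tracker = EMARange(alpha=0.01)
rng = np.random.default_rng(0)
for _ in range(1000):                                            # e.g., one update per step
    tracker.update(np.maximum(rng.normal(size=1024), 0))         # ReLU-like activations
print(tracker.r_min, tracker.r_max)
```
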
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Spending dynamic range on outliers hurts the representation ability of the remaining values.
  • Use the mean of the min/max of each sample in the calibration batches, or
  • use an analytical calculation (see next slide).

Neural Network Distiller
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Minimize the loss of information, so that the integer model encodes the same information as the original floating-point model.
  • The loss of information is measured by the Kullback-Leibler divergence (relative entropy or information divergence). For two discrete probability distributions P and Q:

    D_KL(P ∥ Q) = Σ_i P(x_i) log( P(x_i) / Q(x_i) )

  • Intuition: KL divergence measures the amount of information lost when approximating a given encoding.

8-bit Inference with TensorRT [Szymon Migacz, 2017]

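A simplified sketch of KL-based clipping-threshold search in the spirit of the TensorRT calibrator (the histogram sizes and the search grid are my simplifications, not the tool's exact algorithm):

```python
import numpy as np
from scipy.stats import entropy   # entropy(P, Q) = KL(P || Q)

def kl_calibrate(activations, n_bins=2048, n_quant_bins=128):
    """Search the clipping threshold that minimizes KL(reference || quantized)."""
    hist, edges = np.histogram(np.abs(activations), bins=n_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(n_quant_bins, n_bins + 1, 16):          # candidate clipping points
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                            # fold clipped outliers into last bin
        # Requantize the clipped histogram into n_quant_bins levels, then expand it back.
        chunks = np.array_split(p, n_quant_bins)
        q = np.concatenate([np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
                            for c in chunks])
        kl = entropy(p + 1e-12, q + 1e-12)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t                                          # use [-t, t] as the clipped range

acts = np.abs(np.random.default_rng(0).standard_t(df=4, size=100_000)).astype(np.float32)
print("clip threshold:", kl_calibrate(acts))               # smaller than the raw |r|max
```
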
Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

[Figures: activation distributions and the chosen clipping thresholds for several layers, e.g., AlexNet Pool 2, GoogleNet inception_5a/5x5, ResNet-152 res4b8_branch2a, GoogleNet inception_3a/pool.]

8-bit Inference with TensorRT [Szymon Migacz, 2017]

Dynamic Range for Quantization
Minimize the mean-square-error (MSE) using the Newton-Raphson method (OCTAV)

• Max-scaled quantization keeps the full range but suffers large quantization noise; clipped quantization trades a small clipping error on low-density outliers for much finer resolution of the dense region.

| Network      | FP32 Accuracy | OCTAV int4 |
| ResNet-50    | 76.07         | 75.84      |
| MobileNet-V2 | 71.71         | 70.88      |
| Bert-Large   | 91.00         | 87.09      |

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]

Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other; the best rounding for each individual weight (to the nearest value) is not necessarily the best rounding for the whole tensor.

  rounding-to-nearest:             0.3  0.5  0.7  0.2  →  0  1  1  0
  AdaRound (one potential result): 0.3  0.5  0.7  0.2  →  0  0  1  0

• What is optimal? The rounding that reconstructs the original activation the best, which may be very different.
• Applies to weight quantization only.
• With short-term tuning, it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]

Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of rounding to nearest, ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • We optimize the following objective (derivation omitted):

    argmin_V ∥Wx − W̃x∥_F^2 + λ f_reg(V)
    → argmin_V ∥Wx − ⌊⌊W⌋ + h(V)⌉ x∥_F^2 + λ f_reg(V)

  • x is the input to the layer, and V is a random variable of the same shape as W.
  • h(·) is a function that maps its input into the range (0, 1), such as the rectified sigmoid.
  • f_reg(V) is a regularization that encourages h(V) to be binary:

    f_reg(V) = Σ_{i,j} (1 − | 2 h(V_{i,j}) − 1 |^β)

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]

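A toy sketch of this learnable-rounding idea in PyTorch (a simplification of the objective above, not the official implementation; the rectified-sigmoid constants are common practice and are assumptions):

```python
import torch

def rectified_sigmoid(V, zeta=1.1, gamma=-0.1):
    """h(V): maps V into (0, 1) with a stretched, clipped sigmoid."""
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0.0, 1.0)

def adaround_layer(W, x, scale, steps=2000, lam=0.01, beta=2.0, lr=1e-2):
    """Learn per-weight rounding offsets so that (quantized W) @ x reconstructs W @ x."""
    W_floor = torch.floor(W / scale)
    V = torch.zeros_like(W, requires_grad=True)            # soft rounding variable
    opt = torch.optim.Adam([V], lr=lr)
    target = W @ x
    for _ in range(steps):
        W_soft = (W_floor + rectified_sigmoid(V)) * scale  # soft-quantized weights
        recon = (W_soft @ x - target).pow(2).sum()         # ||Wx - W~x||^2
        f_reg = (1 - (2 * rectified_sigmoid(V) - 1).abs().pow(beta)).sum()
        loss = recon + lam * f_reg                         # push h(V) toward {0, 1}
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Final hard rounding: floor + {0 or 1} depending on the learned offset.
    return (W_floor + (rectified_sigmoid(V) > 0.5).float()) * scale

W = torch.randn(8, 16)
x = torch.randn(16, 32)
W_q = adaround_layer(W, x, scale=0.1)
print((W @ x - W_q @ x).abs().mean())   # typically smaller than with round-to-nearest
```
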
Neural Network Quantization: Design Choices

• Zero Point: Asymmetric vs. Symmetric
• Scaling Granularity: Per-Tensor, Per-Channel, Group Quantization
• Range Clipping: Exponential Moving Average, Minimizing KL Divergence, Minimizing Mean-Square-Error
• Rounding: Round-to-Nearest, AdaRound

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |

Post-Training INT8 Linear Quantization

Accuracy deltas for two post-training INT8 configurations:

| Activation        | Symmetric, Per-Tensor      | Asymmetric, Per-Tensor           |
| Activation range  | Minimize KL-Divergence     | Exponential Moving Average (EMA) |
| Weight            | Symmetric, Per-Tensor      | Symmetric, Per-Channel           |
| GoogleNet         | -0.45%                     | 0%                               |
| ResNet-50         | -0.13%                     | -0.6%                            |
| ResNet-152        | -0.08%                     | -1.8%                            |
| MobileNetV1       | -                          | -11.8%                           |
| MobileNetV2       | -                          | -2.1%                            |

Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity. How should we improve the performance of quantized models?

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]

Quantization-Aware Training
How should we improve performance of quantized models?
Quantization-Aware Training
Train the model taking quantization into consideration
• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths, the neural network is trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.
• Recall K-Means-based quantization fine-tuning: the centroids are updated with gradients grouped by cluster index (Deep Compression [Han et al., ICLR 2016]).
• In the forward pass, weight quantization is applied between layers (Layer N-1 → Layer N → Layer N+1); an example block of operations inside a layer is Conv → Batch Norm → ReLU.



Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
[Diagram: forward/backward pass through Layer N-1 → Layer N → Layer N+1, with weight quantization applied to W before Layer N; example operations: Conv → Batch Norm → ReLU.]

Quantization-Aware Training
Train the model taking quantization into consideration
• "Simulated/Fake Quantization": the weights are quantized and immediately dequantized in the forward pass,

  W → S_W q_W = Q(W)

  which ensures discrete-valued weights at the layer boundaries, while the operations themselves (Conv, Batch Norm, ReLU) still run in full precision.

Linear Quantization
An affine mapping of integers to real numbers r = S(q −
Z)
weights quantized weights zero point scale
(32-bit float) (2-bit signed int) (2-bit signed int) (32-bit float)

2.09 -0.98 1.48 0.09 1 -2 0 -1 2.14 -1.07 1.07 0

0.05 -0.14 -1.08 2.12 -1 -1 -2 1 0 0 -1.07 2.14

-0.91 1.92 0 -1.03


( -2 1 -1 -2
-1 ) 1.07 = -1.07 2.14 0 -1.07

1.87 0 1.53 1.49 1 -1 0 0 2.14 0 1.07 1.07

W qW Q(W
)
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
• "Simulated/Fake Quantization" is applied to both weights and activations:

  W → S_W q_W = Q(W)
  Y → S_Y (q_Y − Z_Y) = Q(Y)

• Discrete-valued weights and activations are enforced at the layer boundaries, while the operations still run in full precision.
• Question: how should gradients back-propagate through the (simulated) quantization?

Straight-Through Estimator (STE)
• Quantization is discrete-valued (e.g., Q(w) = round(w)), so its derivative is 0 almost everywhere:

  ∂Q(W) / ∂W = 0

• The neural network would learn nothing, since the gradients become 0 and the weights would not get updated:

  g_W = ∂L/∂W = (∂L/∂Q(W)) · (∂Q(W)/∂W) = 0

• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function:

  g_W = ∂L/∂W := ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]

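In PyTorch this is often written with the detach trick; a minimal sketch of the idiom (my own wording, not the papers' code):

```python
import torch

def fake_quantize_ste(w, scale, qmin=-128, qmax=127):
    """Simulated quantization with a straight-through gradient."""
    q = torch.clamp(torch.round(w / scale), qmin, qmax) * scale   # forward: quantize + dequantize
    # Backward: (q - w).detach() contributes no gradient, so dL/dw = dL/dq (identity), i.e. STE.
    return w + (q - w).detach()

w = torch.randn(4, requires_grad=True)
y = fake_quantize_ste(w, scale=0.1).sum()
y.backward()
print(w.grad)   # all ones: the gradient passed straight through the rounding
```
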
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
• Forward: "Simulated/Fake Quantization" W → S_W q_W = Q(W) and Y → S_Y (q_Y − Z_Y) = Q(Y).
• Backward (with STE): g_W ← ∂L/∂Q(W) and g_Y ← ∂L/∂Q(Y).
• Discrete-valued weights and activations are enforced at the layer boundaries; the operations (Conv, Batch Norm, ReLU) still run in full precision.

INT8 Linear Quantization-Aware Training

| Neural Network | Floating-Point | PTQ Asymmetric Per-Tensor | PTQ Symmetric Per-Channel | QAT Asymmetric Per-Tensor | QAT Symmetric Per-Channel |
| MobileNetV1    | 70.9%          | 0.1%                      | 59.1%                     | 70.0%                     | 70.7%                     |
| MobileNetV2    | 71.9%          | 0.1%                      | 69.8%                     | 70.9%                     | 71.1%                     |
| NASNet-Mobile  | 74.9%          | 72.2%                     | 72.1%                     | 73.0%                     | 73.0%                     |

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]

Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0

| Quantization |                           | K-Means-based                            | Linear             | ? |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | ? |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | ? |
Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1 1 0 1 1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1 1 0 0 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2 0 1 1 0

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0 1 1 1 1

| Quantization |                           | K-Means-based                            | Linear             | Binary/Ternary         |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | Binary/Ternary Weights |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | Bit Operations         |
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?

y_i = Σ_j W_ij · x_j

Example: W_i = [8, -3, 5, -1], x = [5, 2, 0, 1]:
  y_i = 8×5 + (-3)×2 + 5×0 + (-1)×1 = 33

If the weights are quantized to +1 and -1 (W_i = [1, -1, 1, -1]):
  y_i = 5 - 2 + 0 - 1 = 2, so the multiplications become additions and subtractions.

| input | weight | operations | memory    | computation |
| R     | R      | + ×        | 1×        | 1×          |
| R     | B      | + −        | ~32× less | ~2× less    |

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

Binarization
• Deterministic Binarization
  • Directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

    q = sign(r) = { +1, r ≥ 0;  −1, r < 0 }

• Stochastic Binarization
  • Uses global statistics or the value of the input data to determine the probability of being -1 or +1.
  • e.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid σ(r):

    q = { +1 with probability p = σ(r);  −1 with probability 1 − p },  where σ(r) = min(max((r + 1)/2, 0), 1)

  • Harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]

Minimizing Quantization Error in Binarization

weights W (32-bit float):        binary weights W^B (1-bit):
  2.09  -0.98   1.48   0.09        1  -1   1   1
  0.05  -0.14  -1.08   2.12        1  -1  -1   1
 -0.91   1.92   0     -1.03       -1   1   1  -1
  1.87   0      1.53   1.49        1   1   1   1

• Plain sign binarization, W^B = sign(W):  ∥W − W^B∥_F^2 = 9.28
• Binary Weight Network (BWN) adds a 32-bit float scale α = (1/n) ∥W∥_1 = 1.05:  ∥W − α W^B∥_F^2 = 9.24

AlexNet-based ImageNet Top-1 accuracy delta:
| Network                     | Accuracy Delta |
| BinaryConnect               | -21.2%         |
| Binary Weight Network (BWN) | 0.2%           |

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

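A short sketch of this scaled binarization (helper name is mine), reproducing the numbers above:

```python
import numpy as np

def binarize_bwn(W):
    """Binary Weight Network: W ≈ alpha * sign(W), with alpha = mean(|W|)."""
    alpha = np.abs(W).mean()                 # (1/n) * ||W||_1
    Wb = np.where(W >= 0, 1.0, -1.0)         # sign(W), mapping 0 to +1 as on the slide
    return alpha, Wb

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

alpha, Wb = binarize_bwn(W)
print(round(float(alpha), 2))                              # -> 1.05
print(round(float(((W - Wb) ** 2).sum()), 2))              # -> 9.28 (no scale)
print(round(float(((W - alpha * Wb) ** 2).sum()), 2))      # -> 9.24 (with scale)
```
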
If both activations and weights are binarized

y_i = Σ_j W_ij · x_j, with W_i = [1, -1, 1, -1] and x = [1, 1, -1, 1]:
  y_i = 1×1 + (-1)×1 + 1×(-1) + (-1)×1 = 1 + (-1) + (-1) + (-1) = -2

Encode -1 as bit 0 and +1 as bit 1. Then the element-wise product is an XNOR:

| W  | X  | Y = WX | b_W | b_X | XNOR(b_W, b_X) |
| 1  | 1  | 1      | 1   | 1   | 1              |
| 1  | -1 | -1     | 1   | 0   | 0              |
| -1 | -1 | 1      | 0   | 0   | 1              |
| -1 | 1  | -1     | 0   | 1   | 0              |

Summing the XNOR bits gives 1 xnor 1 + 0 xnor 1 + 1 xnor 0 + 0 xnor 1 = 1 + 0 + 0 + 0 = 1, not -2. Starting from the all-(-1) sum of -n = -4, each XNOR bit equal to 1 adds +2, so

  y_i = -n + 2 · Σ_j (W_ij xnor x_j) = -4 + 2×1 = -2

which can be computed with an xnor and a population count:

  y_i = -n + popcount(W_i xnor x) << 1

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

If both activations and weights are binarized

  y_i = -n + popcount(W_i xnor x) << 1
      = -4 + popcount(1010 xnor 1101) << 1
      = -4 + popcount(1000) << 1 = -4 + 2 = -2

| input | weight | operations      | memory    | computation |
| R     | R      | + ×             | 1×        | 1×          |
| R     | B      | + −             | ~32× less | ~2× less    |
| B     | B      | xnor, popcount  | ~32× less | ~58× less   |

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

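A small sketch of this bit-packed dot product in Python (the packing layout and helper names are mine):

```python
def binary_dot(w_bits, x_bits, n):
    """y = -n + 2 * popcount(xnor(w, x)) over n packed bits (+1 -> 1, -1 -> 0)."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ x_bits) & mask          # bitwise xnor, truncated to n bits
    return -n + (bin(xnor).count("1") << 1)   # popcount, then << 1 (i.e., *2)

def pack(values):
    """Pack a +1/-1 vector into an integer, most significant bit first."""
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

w = [1, -1, 1, -1]        # packs to 0b1010
x = [1, 1, -1, 1]         # packs to 0b1101
print(binary_dot(pack(w), pack(x), n=4))             # -> -2
print(sum(wi * xi for wi, xi in zip(w, x)))          # reference full-precision result: -2
```
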
Accuracy Degradation of Binarization

| Neural Network | Quantization | Bit-Width (W) | Bit-Width (A) | ImageNet Top-1 Accuracy Delta |
| AlexNet        | BWN          | 1             | 32            | 0.2%                          |
| AlexNet        | BNN          | 1             | 1             | -28.7%                        |
| AlexNet        | XNOR-Net     | 1             | 1             | -12.4%                        |
| GoogleNet      | BWN          | 1             | 32            | -5.80%                        |
| GoogleNet      | BNN          | 1             | 1             | -24.20%                       |
| ResNet-18      | BWN          | 1             | 32            | -8.5%                         |
| ResNet-18      | XNOR-Net     | 1             | 1             | -18.1%                        |

* BWN: Binary Weight Network, with a scale factor for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

Ternary Weight Networks (TWN)
Weights are quantized to +1, -1, and 0:

  q = { r_t,  r > Δ;   0,  |r| ≤ Δ;   −r_t,  r < −Δ }
  where Δ = 0.7 × E(|r|) and r_t = E_{|r|>Δ}(|r|)

Example (the 4×4 weight matrix):
  Δ = 0.7 × (1/16) ∥W∥_1 = 0.73
  r_t = mean of |W| over the entries with |W| > Δ = 1.5

weights W (32-bit float):        ternary weights W^T (2-bit):
  2.09  -0.98   1.48   0.09        1  -1   1   0
  0.05  -0.14  -1.08   2.12        0   0  -1   1
 -0.91   1.92   0     -1.03       -1   1   0  -1
  1.87   0      1.53   1.49        1   0   1   1

| ImageNet Top-1 Accuracy | Full Precision | 1 bit (BWN) | 2 bit (TWN) |
| ResNet-18               | 69.6           | 60.8        | 65.3        |

Ternary Weight Networks [Li et al., Arxiv 2016]

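A small sketch of this ternarization rule (helper name is mine), reproducing the slide's numbers:

```python
import numpy as np

def ternarize_twn(W):
    """Ternary Weight Networks: q in {-r_t, 0, +r_t} with Delta = 0.7 * E|W|."""
    delta = 0.7 * np.abs(W).mean()
    mask = np.abs(W) > delta                   # weights that stay non-zero
    r_t = np.abs(W[mask]).mean()               # shared magnitude of the non-zero weights
    return delta, r_t, np.sign(W) * mask       # ternary codes in {-1, 0, +1}

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

delta, r_t, Wt = ternarize_twn(W)
print(round(float(delta), 2), round(float(r_t), 2))   # -> 0.73 1.5
print(Wt.astype(int))                                  # matches the ternary matrix on the slide
print(np.abs(W - r_t * Wt).max())                      # quantization error
```
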
Trained Ternary Quantization (TTQ)
• Instead of using a fixed scale r_t, TTQ introduces two trainable parameters w_p and w_n to represent the positive and negative scales in the quantization:

  q = { w_p,  r > Δ;   0,  |r| ≤ Δ;   −w_n,  r < −Δ }

• Pipeline: full-precision weight → normalize → intermediate ternary weight (quantize with threshold ±t) → scale with the trained w_p, w_n → final ternary weight.

| ImageNet Top-1 Accuracy | Full Precision | 1 bit (BWN) | 2 bit (TWN) | TTQ  |
| ResNet-18               | 69.6           | 60.8        | 65.3        | 66.6 |

Trained Ternary Quantization [Zhu et al., ICLR 2017]


Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1 1 0 1 1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1 1 0 0 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2 0 1 1 0

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0 1 1 1 1

| Quantization |                           | K-Means-based                            | Linear             | Binary/Ternary         |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | Binary/Ternary Weights |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | Bit Operations         |
Mixed-Precision Quantization

• Uniform quantization uses the same bit widths for every layer, e.g., 8-bit weights / 8-bit activations for Layer 1, Layer 2, Layer 3, …
• Mixed-precision quantization assigns each layer its own weight and activation bit widths, e.g., Layer 1: 4-bit weights / 5-bit activations, Layer 2: 6 / 7 bits, Layer 3: 5 / 4 bits, …

Challenge: Huge Design Space
• With 8 choices of weight bit width and 8 choices of activation bit width per layer, each layer has 8 × 8 = 64 choices, so an n-layer network has a design space of 64^n.


Solution: Design Automation (HAQ)
• A reinforcement learning agent (actor-critic) chooses the per-layer bit widths: the actor proposes an action (the bit widths), the quantized model is evaluated, and the reward is fed back to update the actor and critic through the observed state.
• The hardware accelerator is in the loop: the chosen bit widths are mapped onto real hardware (e.g., BitFusion on the edge, BISMO on the edge and cloud), and the measured feedback is returned directly to the agent instead of proxy measures.

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

HAQ Outperforms Uniform Quantization
[Figure: accuracy vs. model size for mixed-precision quantized MobileNetV1, comparing HAQ (Ours) against PACT and the uniform-quantization baseline.]

HAQ Supports Multiple Objectives
[Figure: HAQ vs. uniform quantization under model-size-constrained, latency-constrained, and energy-constrained settings, on mixed-precision quantized MobileNetV1.]

Quantization Policy for Edge and Cloud
[Figure: per-layer weight and activation bit widths (pointwise vs. depthwise layers) chosen by HAQ for mixed-precision quantized MobileNetV2 on edge and cloud hardware.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

Summary of Today's Lecture
In this lecture, we:
1. Reviewed Linear Quantization (floating-point range, scale, and zero point).
2. Introduced Post-Training Quantization (PTQ), which quantizes an already-trained floating-point neural network model:
   • Per-tensor vs. per-channel vs. group quantization
   • How to determine the dynamic range for quantization
3. Introduced Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning:
   • Straight-Through Estimator (STE)
4. Introduced binary and ternary quantization.
5. Introduced automatic mixed-precision quantization.

References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al.,
CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi,
arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10.Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1.
[Courbariaux et al., Arxiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv
2016]
12.PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13.WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14.Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15.Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16.HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
