ML System Optimization Lecture 11: Quantization

Machine Learning Systems
QUANTIZATION

Lecture slides inspired by: Prof. Song Han
Associate Professor, MIT; Distinguished Scientist, NVIDIA
Agenda
1. Review the numeric data types, including integers and floating-point numbers.
2. Learn the basic concept of neural network quantization.
3. Learn three types of common neural network quantization:
   1. K-Means-based Quantization
   2. Linear Quantization
   3. Binary and Ternary Quantization
Low Bit Operations are Cheaper
Less Bit-Width → Less Energy

Rough energy cost for various operations in 45nm, 0.9V:

Operation           Energy [pJ]
8-bit int ADD       0.03
32-bit int ADD      0.1
16-bit float ADD    0.4
32-bit float ADD    0.9
8-bit int MULT      0.2
32-bit int MULT     3.1
16-bit float MULT   1.1
32-bit float MULT   3.7

Moving from 32-bit float to 8-bit int saves roughly 30x energy for ADD and roughly 16x for MULT.

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Low Bit Operations are Cheaper
Less Bit-Width → Less Energy

Given that lower bit-width operations cost far less energy (see the table above), how should we make deep learning more efficient?

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Numeric Data Types
How is numeric data represented in modern computing systems?
Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example (8 bits): 0 0 1 1 0 0 0 1, with place values 2^7, 2^6, …, 2^0, equals 49
• Signed Integer: Sign-Magnitude Representation
  • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
  • Both 000…00 and 100…00 represent 0
  • Example (8 bits): 1 0 1 1 0 0 0 1 equals −(2^5 + 2^4 + 2^0) = −49
• Signed Integer: Two's Complement Representation
  • n-bit range: [−2^(n−1), 2^(n−1) − 1]
  • 000…00 represents 0; 100…00 represents −2^(n−1)
  • Example (8 bits): 1 1 0 0 1 1 1 1, with place values −2^7, 2^6, …, 2^0, equals −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49
Fixed-Point Number
The word is split into an integer part and a fraction part around an implicit "decimal" point.

Example (8 bits, 4 fraction bits, two's complement): 0 0 1 1 . 0 0 0 1
• Interpreting the bits with place values −2^3, 2^2, 2^1, 2^0, 2^−1, 2^−2, 2^−3, 2^−4: value = 2^1 + 2^0 + 2^−4 = 3.0625
• Equivalently, interpret the same bits as an 8-bit two's complement integer and scale: 49 × 2^−4 = 49 × 0.0625 = 3.0625
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Sign (1 bit) | Exponent (8 bits) | Fraction / mantissa (23 bits)
value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127), with exponent bias = 127 = 2^(8−1) − 1

How to represent 0.265625?
0.265625 = 1.0625 × 2^−2 = (1 + 0.0625) × 2^(125 − 127)
→ 0 01111101 00010000000000000000000 (exponent field = 125, fraction = 0.0625)
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Sign (1 bit) | Exponent (8 bits) | Fraction / mantissa (23 bits)
value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127), exponent bias = 127 = 2^(8−1) − 1

How should we represent 0?
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Normal numbers (Exponent ≠ 0):    value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal numbers (Exponent = 0): value = (−1)^sign × Fraction × 2^(1 − 127)
(The formula should have been (−1)^sign × (1 + Fraction) × 2^(0 − 127), but we force the exponent to 2^(1 − 127) and drop the implicit leading 1.)

0 01111101 00010000000000000000000 = (1 + 0.0625) × 2^(125 − 127) = 0.265625
0 00000000 00000000000000000000000 = 0 × 2^−126 = +0
1 00000000 00000000000000000000000 = −0
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Normal numbers (Exponent ≠ 0):    value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal numbers (Exponent = 0): value = (−1)^sign × Fraction × 2^(1 − 127)

What is the smallest positive subnormal value?
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Smallest positive normal value:    0 00000001 00000000000000000000000 = (1 + 0) × 2^(1 − 127) = 2^−126
Smallest positive subnormal value: 0 00000000 00000000000000000000001 = 2^−23 × 2^−126 = 2^−149
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Normal numbers (Exponent ≠ 0):    value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal numbers (Exponent = 0): value = (−1)^sign × Fraction × 2^(1 − 127)

What is the largest positive subnormal value?
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Smallest positive normal value:   0 00000001 00000000000000000000000 = (1 + 0) × 2^(1 − 127) = 2^−126
Largest positive subnormal value: 0 00000000 11111111111111111111111 = (2^−1 + 2^−2 + … + 2^−23) × 2^−126 = (1 − 2^−23) × 2^−126
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Exponent = 11111111, Fraction = 0: infinity
  0 11111111 00000000000000000000000 = +∞ (positive infinity)
  1 11111111 00000000000000000000000 = −∞ (negative infinity)
Exponent = 11111111, Fraction ≠ 0: NaN (Not a Number)

This wastes much of the encoding space; we will revisit this in FP8.
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Exponent               Fraction = 0   Fraction ≠ 0   Equation
00H (= 0)              ±0             subnormal      (−1)^sign × Fraction × 2^(1 − 127)
01H … FEH (= 1 … 254)  normal         normal         (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
FFH (= 255)            ±INF           NaN            -

Subnormal values cover ±[2^−149, (1 − 2^−23) × 2^−126]; normal values cover ±[2^−126, (1 + 1 − 2^−23) × 2^127].
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

Format                                 Exponent (bits)  Fraction (bits)  Total (bits)
IEEE 754 Single Precision (FP32)       8                23               32
IEEE 754 Half Precision (IEEE FP16)    5                10               16
Brain Float (BF16)                     8                7                16
Numeric Data Types
• Question: What is the following IEEE half precision (IEEE FP16) number in decimal?
  1 10001 1100000000   (Sign | 5-bit Exponent | 10-bit Fraction), exponent bias = 15
• Sign: −
• Exponent: 10001 (binary) − 15 = 17 − 15 = 2
• Fraction: 1100000000 (binary) = 0.75
• Decimal answer: −(1 + 0.75) × 2^2 = −1.75 × 4 = −7.0
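A quick way to check decodes like this is to unpack the bit fields directly. The sketch below is an illustrative helper (not from the slides) that decodes an IEEE FP16 pattern with plain integer arithmetic, covering the normal and subnormal cases from the earlier slides and ignoring infinity/NaN for brevity.

```python
def decode_fp16(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision pattern (normal/subnormal only)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F      # 5-bit exponent field
    fraction = bits & 0x3FF             # 10-bit fraction field
    if exponent == 0:                   # subnormal: (-1)^s * (f / 2^10) * 2^(1 - 15)
        value = (fraction / 2**10) * 2 ** (1 - 15)
    else:                               # normal: (-1)^s * (1 + f / 2^10) * 2^(e - 15)
        value = (1 + fraction / 2**10) * 2 ** (exponent - 15)
    return -value if sign else value

print(decode_fp16(0b1100011100000000))  # -7.0, matching the worked example
```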
Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?
  2.5 = 1.25 × 2^1, exponent bias = 127
• Sign: +
• Exponent binary: 1 + 127 = 128 = 10000000
• Fraction binary: 0.25 = 0100000
• Binary answer: 0 10000000 0100000   (Sign | 8-bit Exponent | 7-bit Fraction)
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

Format                                                 Exponent (bits)  Fraction (bits)  Total (bits)
IEEE 754 Single Precision (FP32)                       8                23               32
IEEE 754 Half Precision (IEEE FP16)                    5                10               16
NVIDIA FP8 (E4M3)                                      4                3                8
NVIDIA FP8 (E5M2), for gradients in the backward pass  5                2                8

* FP8 E4M3 does not have INF, and S.1111.111 is used for NaN.
* The largest FP8 E4M3 normal value is S.1111.110 = 448.
* FP8 E5M2 has INF (S.11111.00) and NaN (S.11111.XX).
* The largest FP8 E5M2 normal value is S.11110.11 = 57344.
INT4 and FP4
Exponent Width → Range; Fraction Width → Precision

INT4 (S + 3 magnitude bits):
• representable values: −8, −7, …, −1, 0, 1, …, 7
• e.g., 0 001 = 1; 0 111 = 7

FP4 E1M2 (S E M M), exponent bias 0:
• representable values: ±{0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5}, i.e., the integer grid 0…7 scaled by 0.5
• e.g., 0 0 01 = 0.25 × 2^(1−0) = 0.5 (subnormal); 0 1 11 = (1 + 0.75) × 2^(1−0) = 3.5

FP4 E2M1 (S E E M), exponent bias 1:
• representable values: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}, i.e., {0, 1, 2, 3, 4, 6, 8, 12} scaled by 0.5
• e.g., 0 00 1 = 0.5 × 2^(1−1) = 0.5 (subnormal); 0 11 1 = (1 + 0.5) × 2^(3−1) = 6
• no inf, no NaN

FP4 E3M0 (S E E E), exponent bias 3:
• representable values: ±{0, 0.25, 0.5, 1, 2, 4, 8, 16}, i.e., {0, 1, 2, 4, 8, 16, 32, 64} scaled by 0.25
• e.g., 0 001 = (1 + 0) × 2^(1−3) = 0.25; 0 111 = (1 + 0) × 2^(7−3) = 16
• no inf, no NaN
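To make the FP4 value grids concrete, here is a small sketch (my own illustration, not from the slides) that enumerates every non-negative FP4 E2M1 code under the convention used above: exponent bias 1, subnormal when the exponent field is 0, and no inf/NaN.

```python
def fp4_e2m1_value(code: int) -> float:
    """Decode a 4-bit S|EE|M pattern with exponent bias 1 (assumed convention)."""
    sign = (code >> 3) & 0x1
    exponent = (code >> 1) & 0x3   # 2-bit exponent field
    mantissa = code & 0x1          # 1-bit mantissa field
    if exponent == 0:              # subnormal: (m/2) * 2^(1 - 1)
        value = (mantissa / 2) * 2 ** (1 - 1)
    else:                          # normal: (1 + m/2) * 2^(e - 1)
        value = (1 + mantissa / 2) * 2 ** (exponent - 1)
    return -value if sign else value

print([fp4_e2m1_value(c) for c in range(8)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```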
What is Quantization?

Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set.

[Figure: a continuous signal and its quantized version over time; an original image and its 16-color ("palettization") version. Images are in the public domain.]

The difference between an input value and its quantized value is referred to as the quantization error.

Quantization [Wikipedia]
Neural Network Quantization: Agenda

[Figure: an example 4x4 floating-point weight matrix quantized three ways: K-Means-based Quantization (2-bit cluster indices plus a floating-point codebook), Linear Quantization (2-bit signed integers with scale 1.07 and zero point −1), and Binary/Ternary Quantization.]

Before quantization, the model stores floating-point weights and computes with floating-point arithmetic:

              K-Means-based Quantization   Linear Quantization   Binary/Ternary Quantization
Storage
Computation
Neural Network Quantization: Agenda

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   Binary/Ternary Quantization
Storage       Integer Weights; Floating-Point Codebook
Computation   Floating-Point Arithmetic
Neural Network Quantization
Weight Quantization

weights (32-bit float):
 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49

Weights that are close in value (e.g., 2.09, 2.12, 1.92, 1.87) can all be represented by a single shared value (e.g., 2.00).
K-Means-based Weight Quantization

weights (32-bit float):
 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49

Deep Compression [Han et al., ICLR 2016]
K-Means-based Weight Quantization

weights (32-bit float):        cluster index (2-bit int):   centroids (32-bit float):
 2.09  -0.98   1.48   0.09      3  0  2  1                   3:  2.00
 0.05  -0.14  -1.08   2.12      1  1  0  3                   2:  1.50
-0.91   1.92   0     -1.03      0  3  1  0                   1:  0.00
 1.87   0      1.53   1.49      3  1  2  2                   0: -1.00

reconstructed weights (32-bit float):    quantization error (weights − reconstructed):
 2.00  -1.00   1.50   0.00                0.09   0.02  -0.02   0.09
 0.00   0.00  -1.00   2.00                0.05  -0.14  -0.08   0.12
-1.00   2.00   0.00  -1.00                0.09  -0.08   0     -0.03
 2.00  -1.00   1.50   1.50               -0.13   0      0.03  -0.01

Storage:
• original weights: 32 bit × 16 = 512 bit = 64 B
• index: 2 bit × 16 = 32 bit = 4 B; codebook: 32 bit × 4 = 128 bit = 16 B; total = 20 B → 3.2x smaller

In general, assume N-bit quantization and #parameters = M, with M >> 2^N:
• original: 32 bit × M = 32M bit
• quantized: N bit × M + 32 bit × 2^N = (NM + 2^(N+5)) bit ≈ 32/N x smaller

Deep Compression [Han et al., ICLR 2016]
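As a rough sketch of how such a codebook could be built (using scikit-learn's KMeans as one possible tool; the slides do not prescribe an implementation), assuming a 2-bit codebook with 4 centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

weights = np.array([[ 2.09, -0.98,  1.48,  0.09],
                    [ 0.05, -0.14, -1.08,  2.12],
                    [-0.91,  1.92,  0.00, -1.03],
                    [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

n_bits = 2
kmeans = KMeans(n_clusters=2**n_bits, n_init=10).fit(weights.reshape(-1, 1))

index = kmeans.labels_.reshape(weights.shape)      # 2-bit cluster indices (stored)
codebook = kmeans.cluster_centers_.flatten()       # 2^N float centroids (stored)
reconstructed = codebook[index]                    # codebook lookup at inference time

print(np.abs(weights - reconstructed).max())       # per-weight quantization error
```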
K-Means-based Weight Quantization
Fine-tuning Quantized Weights

weights (32-bit float):        cluster index (2-bit int):   centroids:   fine-tuned centroids:
 2.09  -0.98   1.48   0.09      3  0  2  1                   3:  2.00      1.96
 0.05  -0.14  -1.08   2.12      1  1  0  3                   2:  1.50      1.48
-0.91   1.92   0     -1.03      0  3  1  0                   1:  0.00     -0.04
 1.87   0      1.53   1.49      3  1  2  2                   0: -1.00     -0.97

gradient:                       gradients grouped by cluster index:          reduced (summed) gradient:
-0.03  -0.01   0.03   0.02      cluster 3: -0.03  0.12  0.02 -0.07            0.04
-0.01   0.01  -0.02   0.12      cluster 2:  0.03  0.01 -0.02                  0.02
-0.01   0.02   0.04   0.01      cluster 1:  0.02 -0.01  0.01  0.04 -0.02      0.04
-0.07  -0.02   0.01  -0.02      cluster 0: -0.01 -0.02 -0.01  0.01           -0.03

The reduced gradient of each cluster is scaled by the learning rate and subtracted from its centroid to obtain the fine-tuned centroids.

Deep Compression [Han et al., ICLR 2016]
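A minimal numpy sketch of this centroid fine-tuning step (my own illustration of the procedure described above, with a learning rate of 1 so the numbers match the slide):

```python
import numpy as np

index = np.array([[3, 0, 2, 1],
                  [1, 1, 0, 3],
                  [0, 3, 1, 0],
                  [3, 1, 2, 2]])
centroids = np.array([-1.00, 0.00, 1.50, 2.00])          # indexed by cluster id 0..3
gradient = np.array([[-0.03, -0.01,  0.03,  0.02],
                     [-0.01,  0.01, -0.02,  0.12],
                     [-0.01,  0.02,  0.04,  0.01],
                     [-0.07, -0.02,  0.01, -0.02]])

lr = 1.0
# group the dense gradient by cluster index and reduce (sum) within each group
reduced = np.array([gradient[index == k].sum() for k in range(len(centroids))])
centroids -= lr * reduced

print(centroids)  # [-0.97, -0.04, 1.48, 1.96]
```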
K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (from +0.5% down to -4.5%) versus model size ratio after compression (2% to 20%), with three curves: Quantization Only, Pruning Only, and Pruning + Quantization. Combining pruning and quantization compresses the model furthest before accuracy starts to drop.]

Deep Compression [Han et al., ICLR 2016]
Before Quantization: Continuous Weight
[Figure: histogram of weight values (count vs. weight value), a continuous distribution.]

After Quantization: Discrete Weight
[Figure: the same histogram after quantization; weights collapse onto a small set of discrete centroid values.]

After Quantization: Discrete Weight after Retraining
[Figure: the discrete weight histogram after retraining; the centroids shift slightly to recover accuracy.]

Deep Compression [Han et al., ICLR 2016]
How Many Bits do We Need?

[Figure: accuracy loss versus number of quantization bits, from Deep Compression.]

Deep Compression [Han et al., ICLR 2016]
Huffman Coding

• Infrequent weights: use more bits to represent
• Frequent weights: use fewer bits to represent

Deep Compression [Han et al., ICLR 2016]
Summary of Deep Compression

[Figure: the Deep Compression pipeline: pruning, then k-means quantization with fine-tuning, then Huffman coding.]

Deep Compression [Han et al., ICLR 2016]
Deep Compression Results

Network     Original Size  Compressed Size  Compression Ratio  Original Accuracy  Compressed Accuracy
LeNet-300   1070KB         27KB             40x                98.36%             98.42%
LeNet-5     1720KB         44KB             39x                99.20%             99.26%
AlexNet     240MB          6.9MB            35x                80.27%             80.30%
VGGNet      550MB          11.3MB           49x                88.68%             89.09%
GoogleNet   28MB           2.8MB            10x                88.90%             88.92%
ResNet-18   44.6MB         4.0MB            11x                89.24%             89.28%

Can we make compact models to begin with?

Deep Compression [Han et al., ICLR 2016]
SqueezeNet

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
Deep Compression on SqueezeNet

Network     Approach          Size    Ratio  Top-1 Accuracy  Top-5 Accuracy
AlexNet     -                 240MB   1x     57.2%           80.3%
AlexNet     SVD               48MB    5x     56.0%           79.4%
AlexNet     Deep Compression  6.9MB   35x    57.2%           80.3%
SqueezeNet  -                 4.8MB   50x    57.5%           80.3%
SqueezeNet  Deep Compression  0.47MB  510x   57.5%           80.3%

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
K-Means-based Weight Quantization

[Figure: inference dataflow. In storage, the quantized weights are 2-bit integer cluster indices plus a floating-point codebook. During computation, the indices are decoded through the codebook into floating-point weights, which feed a floating-point Conv with floating-point inputs, bias add, and ReLU.]

• The weights are decompressed using a lookup table (i.e., codebook) during runtime inference.
• K-Means-based Weight Quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point.
Neural Network Quantization

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   Binary/Ternary Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Linear Quantization
What is Linear Quantization?

weights (32-bit float):
 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49
What is Linear Quantization?
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float):         quantized weights (2-bit signed int):    reconstructed weights (32-bit float):
 2.09  -0.98   1.48   0.09       1  -2   0  -1                            2.14  -1.07   1.07   0
 0.05  -0.14  -1.08   2.12      -1  -1  -2   1                            0      0     -1.07   2.14
-0.91   1.92   0     -1.03      -2   1  -1  -2                           -1.07   2.14   0     -1.07
 1.87   0      1.53   1.49       1  -1   0   0                            2.14   0      1.07   1.07

reconstructed = scale × (quantized − zero point) = 1.07 × (q − (−1)),
with zero point Z = −1 (2-bit signed int) and scale S = 1.07 (32-bit float).
We will learn how to determine these quantization parameters.

2-bit signed int encoding: 01 = 1, 00 = 0, 11 = −1, 10 = −2

quantization error (weights − reconstructed):
-0.05   0.09   0.41   0.09
 0.05  -0.14  -0.01  -0.02
 0.16  -0.22   0      0.04
-0.27   0      0.46   0.42
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float):         quantized weights (2-bit signed int):
 2.09  -0.98   1.48   0.09       1  -2   0  -1
 0.05  -0.14  -1.08   2.12      -1  -1  -2   1        zero point Z = −1 (2-bit signed int), scale S = 1.07 (32-bit float)
-0.91   1.92   0     -1.03      -2   1  -1  -2
 1.87   0      1.53   1.49       1  -1   0   0

• r (floating-point) ≈ S × (q − Z), where q is an integer
• S is a floating-point quantization parameter (scale); Z is an integer quantization parameter (zero point)
• The zero point allows the real number r = 0 to be exactly representable by a quantized integer

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

[Figure: the floating-point range [rmin, rmax] is mapped onto the integer range [qmin, qmax]; S is the floating-point scale and Z is the integer zero point.]

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Scale of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

From rmax = S(qmax − Z) and rmin = S(qmin − Z):

  S = (rmax − rmin) / (qmax − qmin)

Example (2-bit, the weight matrix above with rmin = −1.08 and rmax = 2.12):
  S = (2.12 − (−1.08)) / (1 − (−2)) ≈ 1.07

2-bit signed int encoding: 01 = 1, 00 = 0, 11 = −1, 10 = −2
Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

From rmin = S(qmin − Z):

  Z = round(qmin − rmin / S)

Example (2-bit, the weight matrix above): Z = round(−2 − (−1.08) / 1.07) = round(−0.99) = −1
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the matrix multiplication Y = WX. Substituting r = S(q − Z) for each tensor:

  SY(qY − ZY) = SW(qW − ZW) · SX(qX − ZX)

  qY = (SW SX / SY) (qW qX − ZW qX − qW ZX + ZW ZX) + ZY

• The terms qW ZX and ZW ZX do not depend on the runtime input and can be precomputed.
• qW qX is an N-bit integer multiplication; the accumulations and subtractions are 32-bit integer operations; adding ZY is an N-bit integer addition.
• Empirically, the scale SW SX / SY is always in the interval (0, 1), so the 32-bit accumulation is rescaled back to an N-bit integer by this fixed factor.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Symmetric Linear Quantization
Symmetric floating-point range → zero point Z = 0

• The floating-point range is made symmetric, ±|r|max, so the zero point is fixed at Z = 0 and r = S·q.
• Restricted-range mode maps ±|r|max onto [−qmax, qmax], i.e., S = |r|max / qmax.
• Full-range mode instead spreads ±|r|max over the entire integer range [qmin, qmax].

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1) - 1
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the same matrix multiplication when the weights are symmetrically quantized, ZW = 0:

  qY = (SW SX / SY) (qW qX − qW ZX) + ZY

• The term qW ZX can be precomputed; qW qX is an N-bit integer multiplication, the subtraction and accumulation are 32-bit integer operations, the result is rescaled to N bits, and ZY is added as an N-bit integer.
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far, we ignored bias. Now consider the fully-connected layer with bias: Y = WX + b.
• Use symmetric quantization for the weights and bias, ZW = 0 and Zb = 0, and choose Sb = SW SX. Then:

  qY = (SW SX / SY) (qW qX + qb − ZX qW) + ZY

• Precompute qbias = qb − ZX qW. (We will discuss how to compute the activation zero point in the next lecture.)

  qY = (SW SX / SY) (qW qX + qbias) + ZY

• qW qX is an N-bit integer multiplication, the accumulations are 32-bit integer additions, the result is rescaled to an N-bit integer, and ZY is added as an N-bit integer. Note: both qb and qbias are 32 bits.
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b.
• With ZW = 0, Zb = 0, and Sb = SW SX, precompute qbias = qb − Conv(qW, ZX). Then:

  qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY

• Conv(qW, qX) uses N-bit integer multiplications, the accumulations are 32-bit integer additions, the result is rescaled to an N-bit integer, and ZY is added as an N-bit integer. Note: both qb and qbias are 32 bits.

[Figure: integer-only inference dataflow. Quantized N-bit integer inputs and weights feed an integer Conv; the int32 accumulator adds the precomputed int32 quantized bias; the result is rescaled by the scale factor SW SX / SY, the output zero point is added, and the quantized N-bit integer outputs are produced.]
INT8 Linear Quantization
An affine mapping of integers to real numbers

Neural Network                     ResNet-50   Inception-V3
Floating-point Accuracy            76.4%       78.4%
8-bit Integer-quantized Accuracy   74.9%       75.4%

[Figure: latency-vs-accuracy tradeoff of float vs. integer-only MobileNets on ImageNet using Snapdragon 835 big cores.]

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Neural Network Quantization

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   ?
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Summary of Today's Lecture
Today, we reviewed and learned
• the numeric data types used in modern computing systems, including integers and floating-point numbers
  (e.g., the two's complement pattern 1 1 0 0 1 1 1 1 = −49).
• the basic concept of neural network quantization: converting the weights and activations of neural
  networks into a limited discrete set of numbers.
• two types of common neural network quantization:
  • K-Means-based Quantization
  • Linear Quantization (floating-point range → scale S and zero point Z)
References
1. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
3. Deep Compression [Han et al., ICLR 2016]
4. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
5. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
6. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
7. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
8. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
9. Ternary Weight Networks [Li et al., arXiv 2016]
10. Trained Ternary Quantization [Zhu et al., ICLR 2017]
Lecture Plan
Today we will:
1. Review Linear Quantization.

2. Introduce Post-Training Quantization (PTQ), which quantizes a floating-point neural network model, including: per-channel quantization, group quantization, and range clipping.

3. Introduce Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning and recovers the accuracy.

4. Introduce binary and ternary quantization.

5. Introduce automatic mixed-precision quantization.


Neural Network Quantization

weights (32-bit float)       cluster index + centroids (K-Means)    quantized weights, S = 1.07, Z = −1 (Linear)
 2.09 -0.98  1.48  0.09       3 0 2 1    3:  2.00                    1 -2  0 -1
 0.05 -0.14 -1.08  2.12       1 1 0 3    2:  1.50                   -1 -1 -2  1
-0.91  1.92  0    -1.03       0 3 1 0    1:  0.00                   -2  1 -1 -2
 1.87  0     1.53  1.49       3 1 2 2    0: -1.00                    1 -1  0  0

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
K-Means-based Weight Quantization

weights (32-bit float)       cluster index (2-bit int)   centroids    fine-tuned centroids
 2.09 -0.98  1.48  0.09       3 0 2 1                     3:  2.00      1.96
 0.05 -0.14 -1.08  2.12       1 1 0 3                     2:  1.50      1.48
-0.91  1.92  0    -1.03       0 3 1 0                     1:  0.00     -0.04
 1.87  0     1.53  1.49       3 1 2 2                     0: -1.00     -0.97   (updated by −lr × reduced gradient)

gradient                      grouped by cluster index                   reduced
-0.03 -0.01  0.03  0.02       cluster 3: -0.03  0.12  0.02 -0.07          0.04
-0.01  0.01 -0.02  0.12       cluster 2:  0.03  0.01 -0.02                0.02
-0.01  0.02  0.04  0.01       cluster 1:  0.02 -0.01  0.01  0.04 -0.02    0.04
-0.07 -0.02  0.01 -0.02       cluster 0: -0.01 -0.02 -0.01  0.01         -0.03

Deep Compression [Han et al., ICLR 2016]
K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (+0.5% to -4.5%) versus model size ratio after compression (2% to 20%) for Pruning + Quantization, Pruning Only, and Quantization Only.]

Deep Compression [Han et al., ICLR 2016]
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float)       quantized weights (2-bit signed int)                       reconstructed weights (32-bit float)
 2.09 -0.98  1.48  0.09       1 -2  0 -1                                                 2.14 -1.07  1.07  0
 0.05 -0.14 -1.08  2.12      -1 -1 -2  1       with Z = −1 and S = 1.07:                 0     0    -1.07  2.14
-0.91  1.92  0    -1.03      -2  1 -1 -2       reconstructed = 1.07 × (q − (−1))        -1.07  2.14  0    -1.07
 1.87  0     1.53  1.49       1 -1  0  0                                                 2.14  0     1.07  1.07

2-bit signed int encoding: 01 = 1, 00 = 0, 11 = −1, 10 = −2
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

[Figure: the floating-point range [rmin, rmax] (containing 0) is mapped by the scale S onto the integer range [qmin, qmax], with the zero point Z marking where r = 0 lands.]

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1) - 1

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following fully-connected layer:

  Y = WX + b
  ZW = 0
  Zb = 0, Sb = SW SX

  qbias = qb − ZX qW
  qY = (SW SX / SY) (qW qX + qbias) + ZY

• qW qX: N-bit integer multiplication; accumulation: 32-bit integer addition; then rescale to an N-bit integer and add ZY (N-bit integer addition).
  Note: both qb and qbias are 32 bits.
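As a rough numpy sketch of this integer pipeline (an illustration under the assumptions above: symmetric per-tensor weight quantization with Z_W = 0, S_b = S_W·S_X, and a known output scale and zero point; not the exact kernel an inference library would ship):

```python
import numpy as np

def quantized_linear(q_X, Z_X, S_X, q_W, S_W, q_b, S_Y, Z_Y, n_bits=8):
    """Integer-only fully-connected layer: Y = W X + b with Z_W = 0 and S_b = S_W * S_X."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    # precomputed once per layer: fold the input zero point into the bias
    q_bias = q_b.astype(np.int32) - Z_X * q_W.sum(axis=1, dtype=np.int32)
    # N-bit multiplies accumulated in int32
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32) + q_bias
    # rescale by S_W * S_X / S_Y (a fixed-point multiply/shift in real kernels), then add Z_Y
    q_Y = np.round(acc * (S_W * S_X / S_Y)).astype(np.int32) + Z_Y
    return np.clip(q_Y, q_min, q_max)
```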
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following convolution layer:

  Y = Conv(W, X) + b
  ZW = 0
  Zb = 0, Sb = SW SX

  qbias = qb − Conv(qW, ZX)
  qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY

• Conv(qW, qX): N-bit integer multiplications; accumulation: 32-bit integer addition; then rescale to an N-bit integer and add ZY (N-bit integer addition).
  Note: both qb and qbias are 32 bits.
Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Asymmetric Linear Quantization
[Figure: the floating-point range [rmin, rmax] of the example weight matrix is mapped onto [qmin, qmax].]

  S = (rmax − rmin) / (qmax − qmin)            Z = round(qmin − rmin / S)

Example (2-bit, the weight matrix with rmin = −1.08 and rmax = 2.12):
  S = (2.12 − (−1.08)) / (1 − (−2)) = 1.07     Z = round(−2 − (−1.08) / 1.07) = −1
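A minimal numpy sketch of asymmetric linear quantization with these formulas (an illustration, not a particular library's API); running it on the slide's weight matrix reproduces S ≈ 1.07 and Z = −1:

```python
import numpy as np

def linear_quantize(r, n_bits=2):
    """Asymmetric linear quantization: returns (q, S, Z) such that r ≈ S * (q - Z)."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    r_min, r_max = float(r.min()), float(r.max())
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    q = np.clip(np.round(r / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])

q, S, Z = linear_quantize(W, n_bits=2)
W_hat = S * (q - Z)               # dequantize / reconstruct
print(S, Z)                       # ~1.07, -1
```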
Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Symmetric Linear Quantization
[Figure: the symmetric floating-point range [−|r|max, |r|max] of the example weight matrix is mapped onto the integer range.]

  S = |r|max / qmax            Z = 0

Example (2-bit, the weight matrix with |r|max = 2.12):
  S = 2.12 / 1 = 2.12          Z = 0
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding
Quantization Granularity
• Per-Tensor Quantization
• Per-Channel Quantization
• Group Quantization
  • Per-Vector Quantization
  • Shared Micro-exponent (MX) data type
Symmetric Linear Quantization on Weights
• |r|max = |W|max

• Using a single scale for the whole weight tensor (Per-Tensor Quantization)
  • works well for large models
  • accuracy drops for small models

• A common failure results from large differences (more than 100x) in the ranges of weights for different
  output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).

• Solution: Per-Channel Quantization

[Figure: a convolution weight tensor W of shape co × ci × kh × kw applied to input X (ci × hi × wi) to produce output Y (co × ho × wo); each output channel can receive its own scale.]

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
Per-Channel Weight Quantization
Example: 2-bit linear quantization of the 4x4 (oc × ic) weight matrix

 2.09  -0.98   1.48   0.09
 0.05  -0.14  -1.08   2.12
-0.91   1.92   0     -1.03
 1.87   0      1.53   1.49

Per-Tensor Quantization: one scale for the whole tensor
  |r|max = 2.12, S = |r|max / qmax = 2.12 / (2^(2−1) − 1) = 2.12

  quantized qW:        reconstructed S·qW:
   1  0  1  0           2.12   0     2.12   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0  0           0      2.12  0      0
   1  0  1  1           2.12   0     2.12   2.12

  ||W − S·qW||F = 2.28

Per-Channel Quantization: one scale per output channel (row)
  |r|max per row: 2.09, 2.12, 1.92, 1.87 → S0 = 2.09, S1 = 2.12, S2 = 1.92, S3 = 1.87

  quantized qW:        reconstructed S ⊙ qW:
   1  0  1  0           2.09   0     2.09   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0 -1           0      1.92  0     -1.92
   1  0  1  1           1.87   0     1.87   1.87

  ||W − S ⊙ qW||F = 2.08

Per-channel quantization reconstructs the weights more accurately: 2.08 < 2.28.
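A small numpy sketch of symmetric per-channel (per-output-row) weight quantization as described above (illustrative only); it reproduces the per-row scales 2.09, 2.12, 1.92, 1.87 and the reconstruction error of about 2.08 for the example matrix:

```python
import numpy as np

def quantize_per_channel(W, n_bits=2):
    """Symmetric per-output-channel quantization: W ≈ S[:, None] * q."""
    q_max = 2 ** (n_bits - 1) - 1
    S = np.abs(W).max(axis=1) / q_max                   # one scale per output channel (row)
    q = np.clip(np.round(W / S[:, None]), -q_max - 1, q_max).astype(np.int8)
    return q, S

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])

q, S = quantize_per_channel(W, n_bits=2)
err = np.linalg.norm(W - S[:, None] * q)                # Frobenius reconstruction error
print(S, err)                                           # [2.09 2.12 1.92 1.87], ~2.08
```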
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factors
• r = S(q − Z) → r = γ · Sq · (q − Z)
  • γ is a floating-point coarse-grained scale factor
  • Sq is an integer per-vector scale factor (one for each small vector of elements)
• achieves a balance between accuracy and hardware efficiency by using
  • less expensive integer scale factors at finer granularity
  • more expensive floating-point scale factors at coarser granularity
• Memory overhead of two-level scaling: given 4-bit quantization with a 4-bit per-vector scale for
  every 16 elements, the effective bit width is 4 + 4/16 = 4.25 bits.

[Figure: an M×K by K×N matrix multiplication in which each vector along the K dimension gets its own integer scale factor, plus another (floating-point) scale factor per tensor.]

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
Group Quantization
Multi-level scaling scheme

  r = (q − z) · s → r = (q − z) · s_l0 · s_l1 · ⋯

• r: real number value
• q: quantized value
• z: zero point (z = 0 is symmetric quantization)
• s_l0, s_l1, …: scale factors of different levels

[Figure: a weight row is split into small groups. Each group of INT4 values shares a fine-grained level-0 (L0) scale (e.g., UINT4 or E1M0), and a coarser level-1 (L1) scale (e.g., FP16 or E8M0) is shared by a larger group or the whole channel.]

Quantization Approach  Data Type  L0 Group Size  L0 Scale Data Type  L1 Group Size  L1 Scale Data Type  Effective Bit Width
Per-Channel Quant      INT4       Per Channel    FP16                -              -                   4
VSQ                    INT4       16             UINT4               Per Channel    FP16                4 + 4/16 = 4.25
MX4                    S1M2       2              E1M0                16             E8M0                3 + 1/2 + 8/16 = 4
MX6                    S1M4       2              E1M0                16             E8M0                5 + 1/2 + 8/16 = 6
MX9                    S1M7       2              E1M0                16             E8M0                8 + 1/2 + 8/16 = 9

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]
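A toy numpy sketch of two-level scaled quantization in the spirit of the table above (a floating-point per-channel scale plus an integer per-group scale); the group size, bit widths, and the way the integer scale is chosen are my illustrative choices, not the exact VSQ or MX definitions:

```python
import numpy as np

def two_level_quantize(w_row, group_size=16, n_bits=4, scale_bits=4):
    """Quantize one channel: a per-channel FP scale (s_l1) and per-group integer scales (s_l0)."""
    q_max = 2 ** (n_bits - 1) - 1
    s_max = 2 ** scale_bits - 1
    groups = w_row.reshape(-1, group_size)
    s_l1 = np.abs(w_row).max() / (q_max * s_max)             # coarse FP16-like channel scale
    # integer per-group scale chosen so each group's max maps near q_max
    s_l0 = np.clip(np.ceil(np.abs(groups).max(axis=1) / (q_max * s_l1)), 1, s_max)
    q = np.clip(np.round(groups / (s_l0[:, None] * s_l1)), -q_max - 1, q_max)
    return q, s_l0, s_l1

w = np.random.randn(64).astype(np.float32)
q, s_l0, s_l1 = two_level_quantize(w)
w_hat = (q * s_l0[:, None] * s_l1).reshape(-1)               # reconstruction r = q * s_l0 * s_l1
print(np.abs(w - w_hat).max())
```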
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding
Linear Quantization on Activations

• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range, activation statistics are gathered before deploying the model.

[Figure: the activation range [rmin, rmax] is mapped by the scale S and zero point Z onto [qmin, qmax].]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model

• Type 1: During training
  • Exponential moving averages (EMA):

    r̂_{max,min}^(t) = α · r_{max,min}^(t) + (1 − α) · r̂_{max,min}^(t−1)

  • The observed ranges are smoothed across thousands of training steps.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
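A minimal sketch of EMA range tracking (illustrative; the smoothing constant alpha and the observer structure are my own choices):

```python
class EMARangeObserver:
    """Track a smoothed [min, max] activation range with exponential moving averages."""
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.r_min = None
        self.r_max = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.r_min is None:                 # first batch initializes the range
            self.r_min, self.r_max = batch_min, batch_max
        else:                                  # EMA: r_hat = a * r + (1 - a) * r_hat_prev
            self.r_min = self.alpha * batch_min + (1 - self.alpha) * self.r_min
            self.r_max = self.alpha * batch_max + (1 - self.alpha) * self.r_max
        return self.r_min, self.r_max
```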
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model

• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Outliers: spending dynamic range on the outliers hurts the representation ability.
  • Use the mean of the min/max of each sample in the batches.
  • Or use an analytical calculation (see next slide).

Neural Network Distiller
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model

• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Minimize the loss of information, so that the integer model encodes (nearly) the same information as
    the original floating-point model.
  • The loss of information is measured by the Kullback-Leibler divergence (relative entropy or information
    divergence). For two discrete probability distributions P, Q:

    D_KL(P || Q) = Σ_{i=1}^{N} P(x_i) log( P(x_i) / Q(x_i) )

  • Intuition: KL divergence measures the amount of information lost when approximating a given encoding.

8-bit Inference with TensorRT [Szymon Migacz, 2017]
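A simplified sketch of KL-based clipping calibration in the spirit of the TensorRT procedure (my own condensed illustration; the real algorithm works on merged histogram bins and handles several edge cases). It sweeps candidate clipping thresholds, fake-quantizes the calibration data at each, and keeps the threshold with the lowest KL divergence between the FP32 and quantized distributions.

```python
import numpy as np
from scipy.stats import entropy          # entropy(p, q) gives D_KL(p || q) after normalization

def kl_calibrate(x, n_bits=8, n_candidates=100, n_bins=2048):
    """Pick a symmetric clipping threshold |r|max that minimizes D_KL(P_fp32 || P_quantized)."""
    q_max = 2 ** (n_bits - 1) - 1
    edges = np.linspace(x.min(), x.max(), n_bins)
    p, _ = np.histogram(x, bins=edges)
    best_t, best_kl = np.abs(x).max(), np.inf
    for t in np.linspace(np.abs(x).max() / n_candidates, np.abs(x).max(), n_candidates):
        s = t / q_max                                              # candidate scale
        x_hat = np.clip(np.round(x / s), -q_max - 1, q_max) * s    # fake-quantize with threshold t
        q, _ = np.histogram(x_hat, bins=edges)
        kl = entropy(p + 1, q + 1)                                 # +1 smoothing avoids empty bins
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t

acts = np.concatenate([np.random.randn(100_000), 20 * np.random.randn(100)])  # mostly small, a few outliers
print(kl_calibrate(acts))   # typically well below max(|acts|): the outliers get clipped
```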
Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

[Figure: the KL-divergence-based calibration procedure from the TensorRT talk.]

8-bit Inference with TensorRT [Szymon Migacz, 2017]
Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

[Figure: calibrated activation histograms and clipping thresholds for four layers: AlexNet Pool 2, GoogleNet inception_5a/5x5, ResNet-152 res4b8_branch2a, and GoogleNet inception_3a/pool.]

8-bit Inference with TensorRT [Szymon Migacz, 2017]
Dynamic Range for Quantization
Minimize the mean-square-error (MSE) using the Newton-Raphson method

[Figure: max-scaled quantization vs. clipped quantization. Scaling to the maximum value leaves large quantization noise on the densely populated small values; clipping sacrifices the low-density outlier data to reduce noise where most values lie.]

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Dynamic Range for Quantization
Minimize the mean-square-error (MSE) using the Newton-Raphson method

Network        FP32 Accuracy  OCTAV INT4
ResNet-50      76.07          75.84
MobileNet-V2   71.71          70.88
Bert-Large     91.00          87.09

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity
Topic II: Dynamic Range Clipping
Topic III: Rounding
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other. The best rounding for each individual weight (to the nearest
    value) is not necessarily the best rounding for the whole tensor.

    rounding-to-nearest:             0.3 0.5 0.7 0.2 → 0 1 1 0
    AdaRound (one potential result): 0.3 0.5 0.7 0.2 → 0 0 1 0

• What is optimal? The rounding that reconstructs the original activation the best, which may be very different.
• Applies to weight quantization only.
• Requires only short-term tuning, so it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • We optimize the following objective (omitting the derivation):

    argmin_V ||Wx − W̃x||²_F + λ f_reg(V)
    → argmin_V ||Wx − ⌊⌊W⌋ + h(V)⌉ x||²_F + λ f_reg(V)

  • x is the input to the layer; V is a random variable of the same shape as W.
  • h() is a function that maps V to the range (0, 1), such as a rectified sigmoid.
  • f_reg(V) is a regularization that encourages h(V) to be binary:

    f_reg(V) = Σ_{i,j} ( 1 − | 2 h(V_{i,j}) − 1 |^β )

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]
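A heavily simplified PyTorch sketch of this optimization for one layer (fixed β, no annealing schedule, toy optimizer settings; meant only to show the shape of the method, not the paper's full recipe):

```python
import torch

def rectified_sigmoid(V, zeta=1.1, gamma=-0.1):
    """h(V): maps V into (0, 1) with saturation."""
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)

def f_reg(V, beta=2.0):
    """Regularizer that pushes h(V) toward 0 or 1."""
    return (1 - (2 * rectified_sigmoid(V) - 1).abs().pow(beta)).sum()

def adaround_layer(W, x, scale, n_steps=1000, lam=0.01, lr=1e-2):
    """Learn a per-weight up/down rounding choice that best reconstructs the layer output W @ x."""
    W_floor = torch.floor(W / scale)
    V = torch.zeros_like(W, requires_grad=True)
    opt = torch.optim.Adam([V], lr=lr)
    target = W @ x
    for _ in range(n_steps):
        W_soft = (W_floor + rectified_sigmoid(V)) * scale            # soft-rounded weights
        loss = ((target - W_soft @ x) ** 2).sum() + lam * f_reg(V)
        opt.zero_grad(); loss.backward(); opt.step()
    return (W_floor + (rectified_sigmoid(V) > 0.5).float()) * scale  # hard rounding at the end
```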
Neural Network Quantization

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic

Linear quantization design choices covered so far:
• Zero Point
  • Asymmetric
  • Symmetric
• Scaling Granularity
  • Per-Tensor
  • Per-Channel
  • Group Quantization
• Range Clipping
  • Exponential Moving Average
  • Minimizing KL Divergence
  • Minimizing Mean-Square-Error
• Rounding
  • Round-to-Nearest
  • AdaRound
Post-Training INT8 Linear Quantization

                 Configuration A                       Configuration B
Activation       Symmetric, Per-Tensor,                Asymmetric, Per-Tensor,
                 Minimize KL-Divergence                Exponential Moving Average (EMA)
Weight           Symmetric, Per-Tensor                 Symmetric, Per-Channel

Accuracy change vs. floating point:
GoogleNet        -0.45%                                0%
ResNet-50        -0.13%                                -0.6%
ResNet-152       -0.08%                                -1.8%
MobileNetV1      -                                     -11.8%
MobileNetV2      -                                     -2.1%

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Post-Training INT8 Linear Quantization

Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller
representational capacity (e.g., MobileNetV1 at -11.8% and MobileNetV2 at -2.1% in the table above).

How should we improve the performance of quantized models?

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]
Quantization-Aware Training
How should we improve performance of quantized models?
Quantization-Aware Training
Train the model taking quantization into consideration
• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths,
  the neural network will be trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.

[Figure: left, the K-Means fine-tuning recap (cluster-wise gradient accumulation updating the centroids);
right, the QAT dataflow: weights pass through weight quantization before Layer N, the forward pass uses the
quantized weights, and the backward pass updates the full-precision copy. Example operations in Layer N:
Conv → Batch Norm → ReLU.]

Deep Compression [Han et al., ICLR 2016]


Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

[Figure: forward/backward dataflow. Full-precision weights go through weight quantization before Layer N
(example operations: Conv → Batch Norm → ReLU); inputs come from Layer N−1 and outputs go to Layer N+1.]
Quantization-Aware Training
Train the model taking quantization into consideration
• A full-precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization":
  W → SW · qW = Q(W)

[Figure: the full-precision weights W pass through weight quantization Q(W) before Layer N, and an
activation quantization node follows Layer N. These quantization nodes ensure discrete-valued weights and
activations within the integer-grid boundaries, but the operations themselves still run in full precision.]
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

weights W (32-bit float)     quantized weights qW (2-bit signed int)   reconstruction Q(W) = SW (qW − ZW) (32-bit float)
 2.09 -0.98  1.48  0.09       1 -2  0 -1                                2.14 -1.07  1.07  0
 0.05 -0.14 -1.08  2.12      -1 -1 -2  1        with ZW = −1            0     0    -1.07  2.14
-0.91  1.92  0    -1.03      -2  1 -1 -2        and SW = 1.07          -1.07  2.14  0    -1.07
 1.87  0     1.53  1.49       1 -1  0  0                                2.14  0     1.07  1.07
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization":
  W → SW · qW = Q(W)
  Y → SY (qY − ZY) = Q(Y)

[Figure: Layer N takes quantized inputs Q(X) and quantized weights Q(W); its output Y passes through
activation quantization to Q(Y) before Layer N+1. The quantization nodes keep weights and activations on
the discrete grid, but the operations still run in full precision.]

How should gradients back-propagate through the (simulated) quantization?
Straight-Through Estimator (STE)

• Quantization is discrete-valued (e.g., Q(w) = round(w) is a staircase function), and thus the derivative
  is 0 almost everywhere:

    ∂Q(W)/∂W = 0

• The neural network will learn nothing, since the gradients become 0 and the weights won't get updated:

    gW = ∂L/∂W = ∂L/∂Q(W) · ∂Q(W)/∂W = 0

• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had
  been the identity function:

    gW = ∂L/∂W = ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
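In PyTorch, the STE is commonly implemented with the detach trick, so the forward pass sees the quantized value while the backward pass treats quantization as identity (a generic sketch, not a specific library's API):

```python
import torch

def fake_quantize_ste(w: torch.Tensor, scale: float, zero_point: int,
                      q_min: int = -128, q_max: int = 127) -> torch.Tensor:
    """Simulated quantization with a straight-through estimator for the gradient."""
    q = torch.clamp(torch.round(w / scale) + zero_point, q_min, q_max)
    w_q = (q - zero_point) * scale          # dequantized (discrete-valued) weights
    # forward: w_q ; backward: gradient flows to w as if this op were the identity
    return w + (w_q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
y = fake_quantize_ste(w, scale=0.1, zero_point=0).sum()
y.backward()
print(w.grad)                               # all ones: gradients passed straight through
```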
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.

"Simulated/Fake Quantization" with STE gradients:
  W → SW · qW = Q(W)         gW ← ∂L/∂Q(W)
  Y → SY (qY − ZY) = Q(Y)    gY ← ∂L/∂Q(Y)

[Figure: Layer N consumes Q(X) and Q(W) and produces Y, which is quantized to Q(Y) for Layer N+1; gradients
flow back through the quantization nodes via the STE, while the underlying operations (e.g.,
Conv → Batch Norm → ReLU) still run in full precision.]
INT8 Linear Quantization-Aware Training

                                  Post-Training Quantization       Quantization-Aware Training
Neural Network   Floating-Point   Asymmetric     Symmetric         Asymmetric     Symmetric
                                  Per-Tensor     Per-Channel       Per-Tensor     Per-Channel
MobileNetV1      70.9%            0.1%           59.1%             70.0%          70.7%
MobileNetV2      71.9%            0.1%           69.8%             70.9%          71.1%
NASNet-Mobile    74.9%            72.2%          72.1%             73.0%          73.0%

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
Neural Network Quantization

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   ?
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Neural Network Quantization

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   Binary/Ternary Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights       Binary/Ternary Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic    Bit Operations

[Example: the 4x4 weight matrix quantized to binary weights
 1 0 1 1 / 1 0 0 1 / 0 1 1 0 / 1 1 1 1, with bit 1 encoding +1 and bit 0 encoding −1.]
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?

Can the quantization bit width go even lower?

  yi = Σj Wij · xj = [8 -3 5 -1] · [5 2 0 1] = 8×5 + (-3)×2 + 5×0 + (-1)×1

If the weights are quantized to +1 and -1:

  yi = Σj Wij · xj = [1 -1 1 -1] · [5 2 0 1] = 5 - 2 + 0 - 1

  input  weight  operations   memory      computation
  R      R       +, ×         1x          1x
  R      B       +, −         ~32x less   ~2x less

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Binarization
• Deterministic Binarization
  • Directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

    q = sign(r) = { +1, r ≥ 0 ;  −1, r < 0 }

• Stochastic Binarization
  • Uses global statistics or the value of the input data to determine the probability of being −1 or +1.
  • E.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid function σ(r):

    q = { +1, with probability p = σ(r) ;  −1, with probability 1 − p },
    where σ(r) = min(max((r + 1)/2, 0), 1)

  • Harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
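Both flavors are easy to express in a few lines (an illustrative sketch):

```python
import torch

def binarize_deterministic(r: torch.Tensor) -> torch.Tensor:
    """q = sign(r), with sign(0) treated as +1."""
    return torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))

def binarize_stochastic(r: torch.Tensor) -> torch.Tensor:
    """q = +1 with probability hard_sigmoid(r) = clip((r + 1) / 2, 0, 1), else -1."""
    p = torch.clamp((r + 1) / 2, 0, 1)
    return torch.where(torch.rand_like(r) < p, torch.ones_like(r), -torch.ones_like(r))

w = torch.randn(4, 4)
print(binarize_deterministic(w))
print(binarize_stochastic(w))
```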
Minimizing Quantization Error in Binarization

weights W (32-bit float)     binary weights Wᴮ = sign(W) (1-bit)
 2.09 -0.98  1.48  0.09       1 -1  1  1
 0.05 -0.14 -1.08  2.12       1 -1 -1  1        ||W − Wᴮ||²F = 9.28
-0.91  1.92  0    -1.03      -1  1  1 -1
 1.87  0     1.53  1.49       1  1  1  1

Adding a scale factor (32-bit float):

  α = (1/n) ||W||₁ = 16.78 / 16 ≈ 1.05,   W ≈ α Wᴮ,   ||W − α Wᴮ||²F = 9.24

AlexNet-based ImageNet Top-1 accuracy delta:
  BinaryConnect                 -21.2%
  Binary Weight Network (BWN)    0.2%

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized

  yi = Σj Wij · xj = [1 -1 1 -1] · [1 1 -1 1] = 1×1 + (-1)×1 + 1×(-1) + (-1)×1 = 1 + (-1) + (-1) + (-1) = -2

Encode +1 as bit 1 and −1 as bit 0; the element-wise product then becomes XNOR:

  W    X    Y = W·X    bW   bX   XNOR(bW, bX)
  1    1     1         1    1    1
  1   -1    -1         1    0    0
 -1   -1     1         0    0    1
 -1    1    -1         0    1    0

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized

In the binary encoding, the same dot product becomes

  1 xnor 1 + 0 xnor 1 + 1 xnor 0 + 0 xnor 1 = 1 + 0 + 0 + 0 = 1

How does this relate to the true result of −2? Each xnor term is 1 where the signed product is +1 and 0
where it is −1, so starting from the all-(−1) case (which sums to −n), every 1 adds 2:

  yi = Σj Wij · xj = −n + 2 · Σj (Wij xnor xj) = −4 + 2 × 1 = −2

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
If both activations and weights are binarized

  yi = −n + 2 · Σj (Wij xnor xj)  →  yi = −n + popcount(Wi xnor x) ≪ 1

  = −4 + popcount(1010 xnor 1101) ≪ 1 = −4 + popcount(1000) ≪ 1 = −4 + (1 ≪ 1) = −2

  (popcount returns the number of 1 bits)

  input  weight  operations        memory      computation
  R      R       +, ×              1x          1x
  R      B       +, −              ~32x less   ~2x less
  B      B       xnor, popcount    ~32x less   ~58x less

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
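A bit-level sketch of this trick in Python (illustrative; real kernels pack weights and activations into machine words and use hardware popcount instructions):

```python
def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Dot product of two ±1 vectors packed as n-bit integers (bit 1 = +1, bit 0 = -1)."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR, masked to n bits
    return -n + (bin(xnor).count("1") << 1)      # -n + popcount << 1

# W = [+1, -1, +1, -1] -> 0b1010 ; X = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1010, 0b1101, n=4))           # -2, matching the worked example
```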
Accuracy Degradation of Binarization

Neural Network  Quantization  Bit-Width (W)  Bit-Width (A)  ImageNet Top-1 Accuracy Delta
AlexNet         BWN           1              32              0.2%
AlexNet         BNN           1              1              -28.7%
AlexNet         XNOR-Net      1              1              -12.4%
GoogleNet       BWN           1              32             -5.80%
GoogleNet       BNN           1              1              -24.20%
ResNet-18       BWN           1              32             -8.5%
ResNet-18       XNOR-Net      1              1              -18.1%

* BWN: Binary Weight Network, with a scale factor for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
Ternary Weight Networks (TWN)
Weights are quantized to +1, −1, and 0

  q = { r_t,  r > Δ ;  0, |r| ≤ Δ ;  −r_t, r < −Δ },
  where Δ = 0.7 × E(|r|) and r_t = E_{|r|>Δ}(|r|)

Example (the 4x4 weight matrix W):
  Δ = 0.7 × (1/16) ||W||₁ = 0.73
  r_t = ||W_{|W|>Δ}||₁ / ||W_{|W|>Δ}||₀ = 1.5

weights W (32-bit float)     ternary weights Wᵀ (2-bit)
 2.09 -0.98  1.48  0.09       1 -1  1  0
 0.05 -0.14 -1.08  2.12       0  0 -1  1
-0.91  1.92  0    -1.03      -1  1  0 -1
 1.87  0     1.53  1.49       1  0  1  1

ImageNet Top-1 Accuracy    Full Precision   1 bit (BWN)   2 bit (TWN)
ResNet-18                  69.6             60.8          65.3

Ternary Weight Networks [Li et al., arXiv 2016]
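A short numpy sketch of this thresholding rule (illustrative); on the example matrix it gives Δ ≈ 0.73 and r_t ≈ 1.5:

```python
import numpy as np

def ternarize(W):
    """Ternary Weight Networks: threshold at 0.7*E|W|, scale nonzeros by the mean surviving magnitude."""
    delta = 0.7 * np.abs(W).mean()
    mask = np.abs(W) > delta
    r_t = np.abs(W[mask]).mean()            # E_{|w| > delta}(|w|)
    q = np.sign(W) * mask                   # entries in {-1, 0, +1}
    return q, r_t, delta

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])

q, r_t, delta = ternarize(W)
W_hat = r_t * q                             # reconstructed ternary weights
print(delta, r_t)                           # ~0.73, ~1.5
```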
Trained Ternary Quantization (TTQ)
• Instead of using a fixed scale r_t, TTQ introduces two trainable parameters wp and wn to represent the
  positive and negative scales in the quantization:

  q = { wp,  r > Δ ;  0, |r| ≤ Δ ;  −wn, r < −Δ }

[Figure: pipeline from the full-precision weights: normalize, quantize to intermediate ternary weights
{−1, 0, +1} with threshold ±t, then scale by the trained Wn, Wp to obtain the final ternary weights
{−Wn, 0, Wp}.]

ImageNet Top-1 Accuracy   Full Precision   1 bit (BWN)   2 bit (TWN)   TTQ
ResNet-18                 69.6             60.8          65.3          66.6

Trained Ternary Quantization [Zhu et al., ICLR 2017]


Neural Network Quantization

Compared with the original model (floating-point weights, floating-point arithmetic):

              K-Means-based Quantization                  Linear Quantization   Binary/Ternary Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights       Binary/Ternary Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic    Bit Operations
Mixed-Precision Quantization
Uniform Quantization

[Figure: every layer uses the same bit widths, e.g., Layer 1: 8-bit weights / 8-bit activations,
Layer 2: 8 bits / 8 bits, Layer 3: 8 bits / 8 bits, and so on, producing one quantized model.]


Mixed-Precision Quantization

[Figure: each layer gets its own bit widths, e.g., Layer 1: 4-bit weights / 5-bit activations,
Layer 2: 6 bits / 7 bits, Layer 3: 5 bits / 4 bits, and so on, producing one quantized model.]


Challenge: Huge Design Space

[Figure: with per-layer weight and activation bit widths chosen from 8 options each, every layer has
8 × 8 = 64 choices (e.g., Layer 1: 4 bits / 5 bits, Layer 2: 6 bits / 7 bits, Layer 3: 5 bits / 4 bits, …),
so the overall design space is 64^n for n layers.]


Solution: Design Automation

[Figure: a reinforcement-learning agent (actor-critic) chooses the per-layer bit widths: the actor outputs
an action (the weight/activation bit widths for the current layer) given the state, the quantized model is
evaluated, and the reward feeds back to the critic and actor.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Solution: Design Automation

[Figure: the same actor-critic loop, but the reward comes as direct feedback from hardware accelerators
running the mixed-precision model: BitFusion (edge) and BISMO (edge and cloud) bit-serial
processing-element arrays that map the chosen weight/activation bit widths onto cycles of the hardware.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Outperforms Uniform Quantization

[Figure: accuracy vs. cost for mixed-precision quantized MobileNetV1; HAQ (ours) outperforms the PACT and
uniform-quantization baselines.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
HAQ Supports Multiple Objectives

[Figure: HAQ vs. uniform quantization for mixed-precision quantized MobileNetV1 under three kinds of
constraints: model size, latency, and energy; HAQ achieves better accuracy than uniform quantization under
each constraint.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Quantization Policy for Edge and Cloud

[Figure: the per-layer bit widths HAQ chooses for mixed-precision quantized MobileNetV2: the number of
weight and activation bits for pointwise and depthwise layers, which differ between edge and cloud hardware.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
Summary of Today's Lecture
In this lecture, we
1. Reviewed Linear Quantization (floating-point range [rmin, rmax] → scale S and zero point Z).
2. Introduced Post-Training Quantization (PTQ), which quantizes an already-trained floating-point neural
   network model:
   • Per-tensor vs. per-channel vs. group quantization
   • How to determine the dynamic range for quantization
3. Introduced Quantization-Aware Training (QAT), which emulates inference-time quantization during
   training/fine-tuning:
   • Straight-Through Estimator (STE)
4. Introduced binary and ternary quantization.
5. Introduced automatic mixed-precision quantization.
References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]
12. PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13. WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14. Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16. HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
