
Machine Learning Systems
QUANTIZATION

Lecture slides inspired by: Prof. Song Han
Associate Professor, MIT; Distinguished Scientist, NVIDIA

Agenda
1. Review the numeric data types, including integers and floating-point numbers.
2. Learn the basic concept of neural network quantization.
3. Learn three common types of neural network quantization:
   1. K-Means-based Quantization
   2. Linear Quantization
   3. Binary and Ternary Quantization

Low Bit Operations are Cheaper
Less Bit-Width → Less Energy

Rough energy cost for various operations in 45nm, 0.9V:

| Operation         | Energy [pJ] |
| 8-bit int ADD     | 0.03        |
| 32-bit int ADD    | 0.1         |
| 16-bit float ADD  | 0.4         |
| 32-bit float ADD  | 0.9         |
| 8-bit int MULT    | 0.2         |
| 32-bit int MULT   | 3.1         |
| 16-bit float MULT | 1.1         |
| 32-bit float MULT | 3.7         |

An 8-bit integer ADD uses roughly 30× less energy than a 32-bit float ADD, and an 8-bit integer MULT uses roughly 16× less energy than a 32-bit integer MULT.

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]

How should we make deep learning more efficient?

Numeric Data Types
How is numeric data represented in modern computing systems?

Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example: 0011 0001_2 = 2^5 + 2^4 + 2^0 = 49
• Signed Integer, Sign-Magnitude Representation
  • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
  • Both 000…00 and 100…00 represent 0
  • Example: 1011 0001_2 = −(2^5 + 2^4 + 2^0) = −49
• Signed Integer, Two's Complement Representation
  • n-bit range: [−2^(n−1), 2^(n−1) − 1]
  • 000…00 represents 0; 100…00 represents −2^(n−1)
  • Example: 1100 1111_2 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49

Fixed-Point Number
• Split the bits into an integer part and a fraction part around an implicit "decimal" point.
• Example (4 integer bits, 4 fraction bits, two's complement): 0011.0001_2 = 2^1 + 2^0 + 2^(−4) = 3.0625
• Equivalently, interpret the bits as the two's complement integer 49 and scale by 2^(−4): 49 × 0.0625 = 3.0625

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
• Layout: 1 sign bit | 8-bit Exponent | 23-bit Fraction (significand / mantissa)
• Value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127), with Exponent Bias = 127 = 2^(8−1) − 1

How to represent 0.265625?
• 0.265625 = 1.0625 × 2^(−2) = (1 + 0.0625) × 2^(125−127)
• Encoding: 0 01111101 00010000000000000000000 (sign = 0, Exponent = 125, Fraction = 0.0625)

How should we represent 0?
• Normal numbers (Exponent ≠ 0): value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
• Subnormal numbers (Exponent = 0): the implicit leading 1 should have given 2^(0−127), but we force the exponent to 1 − 127 and drop the leading 1, so value = (−1)^sign × Fraction × 2^(1−127)
• Zero: Exponent = 0 and Fraction = 0, giving +0 (0 00000000 000…0) and −0 (1 00000000 000…0)

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

What is the smallest positive subnormal value?
• Subnormal encoding 0 00000000 00000000000000000000001: value = 2^(−23) × 2^(−126) = 2^(−149)
• For comparison, the smallest positive normal value is 0 00000001 000…0 = (1 + 0) × 2^(1−127) = 2^(−126)

What is the largest positive subnormal value?
• Subnormal encoding 0 00000000 11111111111111111111111: Fraction = 2^(−1) + 2^(−2) + … + 2^(−23) = 1 − 2^(−23)
• Value = (1 − 2^(−23)) × 2^(−126), just below the smallest positive normal value 2^(−126)

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
• Exponent = 11111111 and Fraction = 0: ±∞ (0 11111111 000…0 is +∞, 1 11111111 000…0 is −∞)
• Exponent = 11111111 and Fraction ≠ 0: NaN (Not a Number)
• This wastes many encodings; we will revisit this in FP8.

Summary of IEEE 754 encodings:

| Exponent            | Fraction = 0 | Fraction ≠ 0 | Equation                                      |
| 00H = 0             | ±0           | subnormal    | (−1)^sign × Fraction × 2^(1−127)              |
| 01H … FEH = 1 … 254 | normal       | normal       | (−1)^sign × (1 + Fraction) × 2^(Exponent−127) |
| FFH = 255           | ±INF         | NaN          |                                               |

• Subnormal values cover ±[2^(−149), (1 − 2^(−23)) × 2^(−126)]; normal values cover ±[2^(−126), (1 + 1 − 2^(−23)) × 2^127].

Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

| Format                              | Exponent (bits) | Fraction (bits) | Total (bits) |
| IEEE 754 Single Precision (FP32)    | 8               | 23              | 32           |
| IEEE 754 Half Precision (IEEE FP16) | 5               | 10              | 16           |
| Google Brain Float (BF16)           | 8               | 7               | 16           |

Numeric Data Types
• Question: What is the following IEEE half precision (IEEE FP16) number in decimal?

  1100011100000000   (1 sign bit | 5-bit Exponent | 10-bit Fraction, Exponent Bias = 15)

• Sign: −
• Exponent: 10001_2 − 15 = 17 − 15 = 2
• Fraction: 1100000000_2 = 0.75
• Decimal Answer = −(1 + 0.75) × 2^2 = −1.75 × 4 = −7.0

Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?

• 2.5 = 1.25 × 2^1, Exponent Bias = 127
• Sign: +
• Exponent Binary: 1 + 127 = 128 = 10000000_2
• Fraction Binary: 0.25 = 0100000_2
• Binary Answer: 0 10000000 0100000 = 0100000000100000   (1 sign bit | 8-bit Exponent | 7-bit Fraction)

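To check exercises like these mechanically, here is a minimal sketch (not from the slides; helper names are my own) that prints the sign / exponent / fraction fields of a float in FP16 and FP32 with NumPy, and emulates BF16 by dropping the low 16 fraction bits of FP32:

```python
import numpy as np

def bit_fields(x, dtype, exp_bits, frac_bits):
    """Print the sign / exponent / fraction fields of x in the given float dtype."""
    v = np.array(x, dtype=dtype)
    # View the float's raw bits as an unsigned integer of the same width.
    bits = int(v.view(np.uint16 if v.itemsize == 2 else np.uint32))
    total = 1 + exp_bits + frac_bits
    s = format(bits, f"0{total}b")
    sign, exp, frac = s[0], s[1:1 + exp_bits], s[1 + exp_bits:]
    bias = 2 ** (exp_bits - 1) - 1
    print(f"{dtype.__name__:>9}: {sign} {exp} {frac}  (exponent={int(exp, 2) - bias:+d}, value={float(v)})")

def to_bf16_bits(x):
    """BF16 is FP32 with the low 16 fraction bits dropped; keep only the upper 16 bits."""
    bits = int(np.float32(x).view(np.uint32)) & 0xFFFF0000
    return format(bits >> 16, "016b")

bit_fields(-7.0, np.float16, exp_bits=5, frac_bits=10)   # -> 1 10001 1100000000
bit_fields(2.5, np.float32, exp_bits=8, frac_bits=23)
print("bf16(2.5):", to_bf16_bits(2.5))                   # -> 0100000000100000
```
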
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

| Format                                            | Exponent (bits) | Fraction (bits) | Total (bits) |
| IEEE 754 Single Precision (FP32)                  | 8               | 23              | 32           |
| IEEE 754 Half Precision (IEEE FP16)               | 5               | 10              | 16           |
| Nvidia FP8 (E4M3)                                 | 4               | 3               | 8            |
| Nvidia FP8 (E5M2), for gradients in the backward  | 5               | 2               | 8            |

* FP8 E4M3 does not have INF, and S.1111.111_2 is used for NaN.
* The largest FP8 E4M3 normal value is S.1111.110_2 = 448.
* FP8 E5M2 has INF (S.11111.00_2) and NaN (S.11111.XX_2).
* The largest FP8 E5M2 normal value is S.11110.11_2 = 57344.

INT4 and FP4
Exponent Width → Range; Fraction Width → Precision

• INT4 (S + 3 magnitude bits): representable values −8 … 7
  • e.g., 0001_2 = 1, 0111_2 = 7
• FP4 E1M2 (S E M M, bias = 0): representable magnitudes 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5
  • e.g., 0001_2 = 0.25 × 2^(1−0) = 0.5 (subnormal); 0111_2 = (1 + 0.75) × 2^(1−0) = 3.5
• FP4 E2M1 (S E E M, bias = 1): representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6; no inf, no NaN
  • e.g., 0001_2 = 0.5 × 2^(1−1) = 0.5 (subnormal); 0111_2 = (1 + 0.5) × 2^(3−1) = 6
• FP4 E3M0 (S E E E, bias = 3): representable magnitudes 0, 0.25, 0.5, 1, 2, 4, 8, 16; no inf, no NaN
  • e.g., 0001_2 = (1 + 0) × 2^(1−3) = 0.25; 0111_2 = (1 + 0) × 2^(7−3) = 16

What is Quantization?

Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set.

• Figure: a continuous signal quantized into a discrete signal, and an original image reduced to a 16-color image ("palettization"). Images are in the public domain.
• The difference between an input value and its quantized value is referred to as the quantization error.

Quantization [Wikipedia]

Neural Network Quantization: Agenda

We will study three families of neural network quantization, illustrated on the same 4×4 floating-point weight matrix:
1. K-Means-based Quantization: weights become integer cluster indices plus a floating-point codebook of centroids.
2. Linear Quantization: weights become integers related to the real values by an affine mapping (a scale and a zero point).
3. Binary/Ternary Quantization: weights are constrained to two or three values.

The starting point is a model stored as floating-point weights and computed with floating-point arithmetic; each scheme changes what is stored and/or how the arithmetic is performed.

Neural Network Quantization: Weight Quantization

weights (32-bit float):
  2.09  -0.98   1.48   0.09
  0.05  -0.14  -1.08   2.12
 -0.91   1.92   0     -1.03
  1.87   0      1.53   1.49

Observation: many weights are close to each other; for example, 2.09, 2.12, 1.92, and 1.87 can all be represented by a single shared value of about 2.0.

K-Means-based Weight Quantization

Cluster the 16 weights into 4 groups (2-bit) and store only the per-weight cluster index plus a 4-entry codebook of centroids:

weights (32-bit float):          cluster index (2-bit int):   centroids (32-bit float):
  2.09  -0.98   1.48   0.09        3  0  2  1                   3:  2.00
  0.05  -0.14  -1.08   2.12        1  1  0  3                   2:  1.50
 -0.91   1.92   0     -1.03        0  3  1  0                   1:  0.00
  1.87   0      1.53   1.49        3  1  2  2                   0: -1.00

reconstructed weights (32-bit float):        quantization error:
  2.00  -1.00   1.50   0.00                    0.09   0.02  -0.02   0.09
  0.00   0.00  -1.00   2.00                    0.05  -0.14  -0.08   0.12
 -1.00   2.00   0.00  -1.00                    0.09  -0.08   0     -0.03
  2.00   0.00   1.50   1.50                   -0.13   0      0.03  -0.01

Storage:
• Original: 32 bit × 16 = 512 bit = 64 B
• Quantized: index 2 bit × 16 = 32 bit = 4 B, codebook 32 bit × 4 = 128 bit = 16 B → 20 B in total, 3.2× smaller.

Assume N-bit quantization and #parameters = M (with M >> 2^N):
• Original: 32 bit × M = 32M bit
• Quantized: N bit × M (indices) + 2^N × 32 bit (codebook) = NM + 2^(N+5) bit ≈ NM bit → roughly 32/N × smaller.

Deep Compression [Han et al., ICLR 2016]

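A minimal sketch of this codebook construction (not the authors' code; it assumes scikit-learn and the 4×4 example weights):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_bits=2):
    """Cluster weights into 2**n_bits centroids; return codebook and integer indices."""
    w = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** n_bits, n_init=10).fit(w)
    codebook = km.cluster_centers_.flatten()           # float32 centroids
    indices = km.labels_.astype(np.uint8)               # n_bits-wide index per weight
    return codebook, indices.reshape(weights.shape)

def kmeans_dequantize(codebook, indices):
    """Reconstruct the (lossy) weight tensor by codebook lookup."""
    return codebook[indices]

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

codebook, idx = kmeans_quantize(W, n_bits=2)
W_hat = kmeans_dequantize(codebook, idx)
print(np.round(codebook, 2))        # roughly [-1.0, 0.0, 1.5, 2.0] (order may differ)
print(np.abs(W - W_hat).max())      # quantization error
```
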
K-Means-based Weight Quantization
Fine-tuning Quantized Weights

To recover accuracy, the centroids are fine-tuned: the weight gradients are grouped by cluster index, reduced (summed) within each cluster, scaled by the learning rate, and used to update the centroids.

centroids (before → after fine-tuning):
  3:  2.00 →  1.96
  2:  1.50 →  1.48
  1:  0.00 → -0.04
  0: -1.00 → -0.97

gradient (32-bit float):
 -0.03  -0.01   0.03   0.02
 -0.01   0.01  -0.02   0.12
 -0.01   0.02   0.04   0.01
 -0.07  -0.02   0.01  -0.02

Deep Compression [Han et al., ICLR 2016]

K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (0.5% to -4.5%) vs. model size ratio after compression (2% to 20%), with three curves: Quantization Only, Pruning Only, and Pruning + Quantization. Combining pruning and quantization compresses the model the most at a given accuracy loss.]

Deep Compression [Han et al., ICLR 2016]

Weight Distributions During Quantization
• Before quantization: the weight values form a continuous distribution.
• After quantization: the weights collapse onto a small set of discrete centroid values.
• After retraining: the discrete centroids shift slightly to recover accuracy.
[Figures: histograms of weight value vs. count for each stage.]

How Many Bits do We Need?
[Figure: accuracy vs. number of quantization bits.]

Deep Compression [Han et al., ICLR 2016]

Huffman Coding
• Infrequent weights: use more bits to represent.
• Frequent weights: use fewer bits to represent.

Summary of Deep Compression
[Figure: the Deep Compression pipeline of pruning, quantization (weight sharing), and Huffman coding.]

Deep Compression [Han et al., ICLR 2016]

Deep Compression Results

| Network   | Original Size | Compressed Size | Compression Ratio | Original Accuracy | Compressed Accuracy |
| LeNet-300 | 1070 KB       | 27 KB           | 40x               | 98.36%            | 98.42%              |
| LeNet-5   | 1720 KB       | 44 KB           | 39x               | 99.20%            | 99.26%              |
| AlexNet   | 240 MB        | 6.9 MB          | 35x               | 80.27%            | 80.30%              |
| VGGNet    | 550 MB        | 11.3 MB         | 49x               | 88.68%            | 89.09%              |
| GoogleNet | 28 MB         | 2.8 MB          | 10x               | 88.90%            | 88.92%              |
| ResNet-18 | 44.6 MB       | 4.0 MB          | 11x               | 89.24%            | 89.28%              |

Can we make compact models to begin with?

Deep Compression [Han et al., ICLR 2016]

SqueezeNet
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]

Deep Compression on SqueezeNet

| Network    | Approach         | Size    | Ratio | Top-1 Accuracy | Top-5 Accuracy |
| AlexNet    | -                | 240 MB  | 1x    | 57.2%          | 80.3%          |
| AlexNet    | SVD              | 48 MB   | 5x    | 56.0%          | 79.4%          |
| AlexNet    | Deep Compression | 6.9 MB  | 35x   | 57.2%          | 80.3%          |
| SqueezeNet | -                | 4.8 MB  | 50x   | 57.5%          | 80.3%          |
| SqueezeNet | Deep Compression | 0.47 MB | 510x  | 57.5%          | 80.3%          |

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]

K-Means-based Weight Quantization: Inference

In storage, the model keeps the quantized integer weights (cluster indices) and the floating-point codebook. During computation, the weights are decoded back to floating point through a codebook lookup, and the layer (e.g., Conv, bias add, ReLU) runs in floating point on floating-point inputs.

• The weights are decompressed using a lookup table (i.e., the codebook) during runtime inference.
• K-Means-based weight quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point.

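A sketch of the runtime decode step, continuing the hypothetical helpers above: the stored integer indices are expanded through the codebook before an ordinary floating-point layer is applied (a dense layer with ReLU stands in for the Conv block on the slide).

```python
import numpy as np

def decode_layer(indices, codebook, x):
    """Dequantize weights by codebook lookup, then run the layer in floating point."""
    W = codebook[indices]            # gather: uint index -> float32 centroid
    return np.maximum(W @ x, 0.0)    # floating-point matmul + ReLU

codebook = np.array([-1.0, 0.0, 1.5, 2.0], dtype=np.float32)
indices = np.array([[3, 0, 2, 1],
                    [1, 1, 0, 3],
                    [0, 3, 1, 0],
                    [3, 1, 2, 2]], dtype=np.uint8)
x = np.ones(4, dtype=np.float32)
print(decode_layer(indices, codebook, x))
```
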
Neural Network Quantization

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |

Linear Quantization

What is Linear Quantization?
An affine mapping of integers to real numbers: r = S(q − Z)

weights (32-bit float):          quantized weights (2-bit signed int):
  2.09  -0.98   1.48   0.09        1  -2   0  -1
  0.05  -0.14  -1.08   2.12       -1  -1  -2   1
 -0.91   1.92   0     -1.03       -2   1  -1  -2
  1.87   0      1.53   1.49        1  -1   0   0

zero point Z = -1 (2-bit signed int), scale S = 1.07 (32-bit float)

reconstructed weights S(q − Z) (32-bit float):     quantization error:
  2.14  -1.07   1.07   0                            -0.05   0.09   0.41   0.09
  0      0     -1.07   2.14                          0.05  -0.14  -0.01  -0.02
 -1.07   2.14   0     -1.07                          0.16  -0.22   0      0.04
  2.14   0      1.07   1.07                         -0.27   0      0.46   0.42

2-bit signed integer codes: 01 = 1, 00 = 0, 11 = -1, 10 = -2

We will learn how to determine these quantization parameters (S, Z).

Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

• q: the quantized integer (e.g., the 2-bit signed weights above)
• Z: the zero point, an integer quantization parameter; it allows the real number r = 0 to be exactly representable by a quantized integer
• S: the scale, a floating-point quantization parameter

The floating-point range [r_min, r_max] is mapped onto the integer range [q_min, q_max]:

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

• The scale S stretches the integer grid [q_min, q_max] to cover the floating-point range [r_min, r_max]:

  S = (r_max − r_min) / (q_max − q_min)

• The zero point Z aligns the two ranges:

  Z = round(q_min − r_min / S)

For the example weight matrix (r_min = −1.08, r_max = 2.12) quantized to 2 bits (q_min = −2, q_max = 1):

  S = (2.12 − (−1.08)) / (1 − (−2)) = 1.07,   Z = round(−2 − (−1.08) / 1.07) = −1

Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the matrix multiplication Y = WX. Substituting r = S(q − Z) for Y, W, and X:

  S_Y (q_Y − Z_Y) = S_W (q_W − Z_W) · S_X (q_X − Z_X)
  q_Y = (S_W S_X / S_Y) (q_W q_X − Z_W q_X − Z_X q_W + Z_W Z_X) + Z_Y

• The terms that involve only the weights and the zero points (−Z_X q_W + Z_W Z_X) can be precomputed offline.
• The products q_W q_X are N-bit integer multiplications, the sums are 32-bit integer additions/subtractions, and adding Z_Y is an N-bit integer addition.
• Empirically, the rescaling factor S_W S_X / S_Y is always in the interval (0, 1), so it can be applied as a fixed-point multiplication and shift that rescales the 32-bit accumulator back to an N-bit integer.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Symmetric Linear Quantization
Zero point Z = 0 and a symmetric floating-point range

• Choose the floating-point range as [−|r|_max, |r|_max] so that the zero point is exactly Z = 0.
• Restricted range: S = |r|_max / q_max, with q_max = 2^(N−1) − 1.
• Full range mode: the scale is chosen so that the full integer range, including q_min = −2^(N−1), is used, which makes the scale slightly smaller.

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• With symmetric weight quantization (Z_W = 0), the expression simplifies to:

  q_Y = (S_W S_X / S_Y) (q_W q_X − Z_X q_W) + Z_Y

• The term Z_X q_W can be precomputed; q_W q_X uses N-bit integer multiplications with 32-bit integer accumulation; the rescale back to N-bit and the addition of Z_Y finish the computation.

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far we ignored bias. Now consider the fully-connected layer Y = WX + b.
• Use symmetric weight quantization (Z_W = 0) and quantize the bias with Z_b = 0 and S_b = S_W S_X. Then:

  q_Y = (S_W S_X / S_Y) (q_W q_X + q_b − Z_X q_W) + Z_Y

• Precompute q_bias = q_b − Z_X q_W (we will discuss how to compute the activation zero point Z_X in the next lecture), so that

  q_Y = (S_W S_X / S_Y) (q_W q_X + q_bias) + Z_Y

• q_W q_X uses N-bit integer multiplications, the accumulation and the q_bias addition are 32-bit integer additions, the rescale brings the result back to N-bit, and adding Z_Y is an N-bit integer addition. Note: both q_b and q_bias are 32 bits.

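A small numerical sketch of this integer pipeline (my own illustration, not the paper's code): it quantizes W, X, and b, performs the integer matmul with the precomputed bias term, rescales, and compares against the floating-point result.

```python
import numpy as np

def quantize_sym(r, n_bits):
    """Symmetric linear quantization: Z = 0, S = |r|max / qmax."""
    qmax = 2 ** (n_bits - 1) - 1
    S = np.abs(r).max() / qmax
    return np.clip(np.round(r / S), -qmax - 1, qmax).astype(np.int32), S

def quantize_asym(r, n_bits):
    """Asymmetric linear quantization: r = S(q - Z)."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(round(qmin - r.min() / S))
    return np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32), S, Z

rng = np.random.default_rng(0)
W, X, b = rng.normal(size=(4, 8)), rng.normal(size=(8,)), rng.normal(size=(4,))
Y = W @ X + b

qW, SW = quantize_sym(W, 8)                    # Z_W = 0
qX, SX, ZX = quantize_asym(X, 8)
qb = np.round(b / (SW * SX)).astype(np.int32)  # Z_b = 0, S_b = S_W S_X (32-bit)
_, SY, ZY = quantize_asym(Y, 8)                # output scale/zero point from calibration

qbias = qb - qW @ np.full_like(qX, ZX)         # precomputed: q_b - Z_X q_W
acc = qW @ qX + qbias                          # 32-bit integer accumulation
qY = np.round(SW * SX / SY * acc) + ZY         # rescale to N-bit, add Z_Y

print(np.abs(SY * (qY - ZY) - Y).max())        # small dequantization error vs. float Y
```
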
Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b. With Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Conv(q_W, Z_X)
  q_Y = (S_W S_X / S_Y) (Conv(q_W, q_X) + q_bias) + Z_Y

• Dataflow: the quantized integer inputs and weights enter an integer convolution; the precomputed int32 quantized bias is added to the int32 accumulator; the result is rescaled by S_W S_X / S_Y and the output zero point is added to produce the quantized N-bit outputs.
• Conv(q_W, q_X) uses N-bit integer multiplications with 32-bit integer accumulation; the rescale brings the accumulator back to an N-bit integer. Note: both q_b and q_bias are 32 bits.

INT8 Linear Quantization
An affine mapping of integers to real numbers

| Neural Network                   | ResNet-50 | Inception-V3 |
| Floating-point Accuracy          | 76.4%     | 78.4%        |
| 8-bit Integer-quantized Accuracy | 74.9%     | 75.4%        |

[Figure: latency-vs-accuracy tradeoff of float vs. integer-only MobileNets on ImageNet using Snapdragon 835 big cores.]

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Neural Network Quantization

| Quantization |                           | K-Means-based                            | Linear             | ? |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | ? |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | ? |

Summary of Today's Lecture
Today, we reviewed and learned:
• The numeric data types used in modern computing systems, including integers and floating-point numbers (e.g., the two's complement byte 1100 1111_2 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49).
• The basic concept of neural network quantization: converting the weights and activations of neural networks into a limited discrete set of numbers.
• Two types of common neural network quantization:
  • K-Means-based Quantization
  • Linear Quantization (floating-point range, scale, and zero point)

References
1. Model Compression and Hardware Acceleration for Neural Networks: A
Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE
ISSCC 2014]
3. Deep Compression [Han et al., ICLR 2016]
4. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
5. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only
Inference [Jacob et al., CVPR 2018]
6. BinaryConnect: Training Deep Neural Networks with Binary Weights during
Propagations [Courbariaux et al., NeurIPS 2015]
7. Binarized Neural Networks: Training Deep Neural Networks with Weights and
Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]
8. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks
[Rastegari et al., ECCV 2016]
9. Ternary Weight Networks [Li et al., Arxiv 2016]
10.Trained Ternary Quantization [Zhu et al., ICLR 2017]

Lecture Plan
Today we will:
1. Review Linear Quantization.
2. Introduce Post-Training Quantization (PTQ), which quantizes a trained floating-point neural network model, including: per-channel quantization, group quantization, and range clipping.
3. Introduce Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning to recover the accuracy.
4. Introduce binary and ternary quantization.
5. Introduce automatic mixed-precision quantization.


Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |
K-Means-based Weight Quantization
weights cluster index fine-tuned
(32-bit float) (2-bit int) centroids centroids
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1.96

0.05 -0.14 -1.08 2.12 cluster 1 1 0 3 2: 1.50 1.48

-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -0.04

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 ×lr -0.97

gradient

-0.03 -0.01 0.03 0.02 -0.03 0.12 0.02 -0.07 0.04

-0.01 0.01 -0.02 0.12 group by 0.03 0.01 -0.02 reduce 0.02

-0.01 0.02 0.04 0.01 0.02 -0.01 0.01 0.04 -0.02 0.04

-0.07 -0.02 0.01 -0.02 -0.01 -0.02 -0.01 0.01 -0.03

Deep Compression [Han et al., ICLR 2016]


K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset

[Figure: accuracy loss (0.5% to -4.5%) vs. model size ratio after compression (2% to 20%) for Pruning + Quantization, Pruning Only, and Quantization Only.]

Deep Compression [Han et al., ICLR 2016]

Linear Quantization
An affine mapping of integers to real numbers r = S(q −
Z)
weights quantized weights zero point scale
(32-bit float) (2-bit signed int) (2-bit signed int) (32-bit float)

2.09 -0.98 1.48 0.09 1 -2 0 -1 2.14 -1.07 1.07 0

0.05 -0.14 -1.08 2.12 -1 -1 -2 1 0 0 -1.07 2.14

-0.91 1.92 0 -1.03


( -2 1 -1 -2
-1 ) 1.07 = -1.07 2.14 0 -1.07

1.87 0 1.53 1.49 1 -1 0 0 2.14 0 1.07 1.07

Binary Decimal
01 1
00 0
11 -1
10 -2
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

• The floating-point range [r_min, r_max] (containing r = 0) is mapped by the scale S onto the integer range [q_min, q_max], with the zero point Z marking where r = 0 lands.

| Bit Width N | q_min    | q_max       |
| 2           | -2       | 1           |
| 3           | -4       | 3           |
| 4           | -8       | 7           |
| N           | -2^(N-1) | 2^(N-1) - 1 |

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the fully-connected layer Y = WX + b, with Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Z_X q_W
  q_Y = (S_W S_X / S_Y) (q_W q_X + q_bias) + Z_Y

  (Rescale to N-bit Int | N-bit Int Mult. + 32-bit Int Add. | N-bit Int Add)
  Note: both q_b and q_bias are 32 bits.

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the convolution layer Y = Conv(W, X) + b, with Z_W = 0, Z_b = 0, and S_b = S_W S_X:

  q_bias = q_b − Conv(q_W, Z_X)
  q_Y = (S_W S_X / S_Y) (Conv(q_W, q_X) + q_bias) + Z_Y

  (Rescale to N-bit Int | N-bit Int Mult. + 32-bit Int Add. | N-bit Int Add)
  Note: both q_b and q_bias are 32 bits.

Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Asymmetric Linear Quantization (floating-point range [r_min, r_max]):

  S = (r_max − r_min) / (q_max − q_min)
  Z = round(q_min − r_min / S)

Example (the 4×4 weight matrix, 2-bit quantization with q_min = −2, q_max = 1):

  S = (2.12 − (−1.08)) / (1 − (−2)) = 1.07
  Z = round(−2 − (−1.08) / 1.07) = −1

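A compact sketch of the asymmetric quantize/dequantize round-trip (helper names are my own), reproducing the S = 1.07, Z = −1 example:

```python
import numpy as np

def linear_quantize(r, n_bits=2):
    """Asymmetric linear quantization: returns (q, S, Z) such that r ≈ S * (q - Z)."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(round(qmin - r.min() / S))
    q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

q, S, Z = linear_quantize(W, n_bits=2)
print(round(float(S), 2), Z)          # -> 1.07 -1
print(S * (q - Z))                    # reconstructed weights, e.g. 2.14, -1.07, ...
```
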
Scale and Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Symmetric Linear Quantization (floating-point range [−|r|_max, |r|_max]):

  Z = 0
  S = |r|_max / q_max

Example (the same weight matrix, 2-bit quantization): S = 2.12 / 1 = 2.12

Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

• Topic I: Quantization Granularity
• Topic II: Dynamic Range Clipping
• Topic III: Rounding

Quantization Granularity
• Per-Tensor Quantization
• Per-Channel Quantization
• Group Quantization
  • Per-Vector Quantization
  • Shared Micro-exponent (MX) data type

Symmetric Linear Quantization on Weights
• |r|_max = |W|_max, so S = |W|_max / q_max with Z = 0.
• Using a single scale for the whole weight tensor (Per-Tensor Quantization):
  • works well for large models
  • accuracy drops for small models
• A common failure results from large differences (more than 100×) in the ranges of weights for different output channels, i.e., outlier weights (e.g., the first depthwise-separable layer in MobileNetV2).
• Solution: Per-Channel Quantization.

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
Per-Channel Weight Quantization
Example: 2-bit linear quantization of the 4×4 weight matrix (rows = output channels oc, columns = input channels ic)

  2.09  -0.98   1.48   0.09
  0.05  -0.14  -1.08   2.12
 -0.91   1.92   0     -1.03
  1.87   0      1.53   1.49

Per-Tensor Quantization: one scale for the whole tensor
  |r|_max = 2.12,  S = |r|_max / q_max = 2.12 / (2^(2−1) − 1) = 2.12

  quantized q_W:      reconstructed S q_W:
   1  0  1  0           2.12   0     2.12   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0  0           0      2.12  0      0
   1  0  1  1           2.12   0     2.12   2.12

  ∥W − S q_W∥_F = 2.28

Per-Channel Quantization: one scale per output channel (row)
  |r|_max per row: 2.09, 2.12, 1.92, 1.87  →  S_0 = 2.09, S_1 = 2.12, S_2 = 1.92, S_3 = 1.87

  quantized q_W:      reconstructed S ⊙ q_W:
   1  0  1  0           2.09   0     2.09   0
   0  0 -1  1           0      0    -2.12   2.12
   0  1  0 -1           0      1.92  0     -1.92
   1  0  1  1           1.87   0     1.87   1.87

  ∥W − S ⊙ q_W∥_F = 2.08 < 2.28, so per-channel quantization reconstructs the weights more accurately.

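A short sketch comparing the two granularities numerically (helper name is mine):

```python
import numpy as np

def quantize_sym(W, n_bits, per_channel=False):
    """Symmetric quantization; per_channel=True uses one scale per output channel (row)."""
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel:
        S = np.abs(W).max(axis=1, keepdims=True) / qmax   # shape (oc, 1)
    else:
        S = np.abs(W).max() / qmax                        # scalar
    q = np.clip(np.round(W / S), -qmax - 1, qmax)
    return q, S

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

for per_channel in (False, True):
    q, S = quantize_sym(W, n_bits=2, per_channel=per_channel)
    err = np.linalg.norm(W - S * q)                       # Frobenius norm of the error
    print("per-channel" if per_channel else "per-tensor ", round(float(err), 2))
# -> per-tensor 2.28, per-channel 2.08
```
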
Quantization Granularity
• Per-Tensor Quantization

• Per-Channel Quantization

• Group Quantization
• Per-Vector Quantization
• Shared Micro-exponent (MX) data type
VS-Quant: Per-Vector Scaled Quantization
Hierarchical scaling factor
• r = S(q − Z)  →  r = γ · S_q (q − Z)
  • γ is a floating-point coarse-grained scale factor (e.g., per channel or per tensor)
  • S_q is an integer per-vector scale factor (one per small vector of elements)
• Achieves a balance between accuracy and hardware efficiency:
  • less expensive integer scale factors at finer granularity
  • more expensive floating-point scale factors at coarser granularity
• Memory overhead of two-level scaling: given 4-bit quantization with a 4-bit per-vector scale for every 16 elements, the effective bit width is 4 + 4/16 = 4.25 bits.

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]

Group Quantization
Multi-level scaling scheme

  r = S(q − Z)  →  r = (q − z) · s_l0 · s_l1 · ⋯

• r: real number value
• q: quantized value
• z: zero point (z = 0 is symmetric quantization)
• s_l0, s_l1, …: scale factors of different levels: a fine-grained level-0 scale shared by a small group of elements (e.g., 2 or 16) and a coarser level-1 scale (e.g., per 16 elements or per channel)

| Quantization Approach | Data Type | L0 Group Size | L0 Scale Data Type | L1 Group Size | L1 Scale Data Type | Effective Bit Width |
| Per-Channel Quant     | INT4      | Per Channel   | FP16               | -             | -                  | 4                   |
| VSQ                   | INT4      | 16            | UINT4              | Per Channel   | FP16               | 4 + 4/16 = 4.25     |
| MX4                   | S1M2      | 2             | E1M0               | 16            | E8M0               | 3 + 1/2 + 8/16 = 4  |
| MX6                   | S1M4      | 2             | E1M0               | 16            | E8M0               | 5 + 1/2 + 8/16 = 6  |
| MX9                   | S1M7      | 2             | E1M0               | 16            | E8M0               | 8 + 1/2 + 8/16 = 9  |

VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [Steve Dai, et al.]
With Shared Microexponents, A Little Shifting Goes a Long Way [Bita Rouhani et al.]

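A toy sketch of two-level (VSQ-style) scaling under the assumptions above: an FP16 per-channel scale plus an unsigned 4-bit per-vector scale for every 16 elements. The decomposition heuristic below is mine, not the papers' exact algorithm.

```python
import numpy as np

def vsq_quantize(w, vec=16, n_bits=4, scale_bits=4):
    """Two-level scaling: per-channel FP16 scale * per-vector UINT scale * INT4 value."""
    qmax = 2 ** (n_bits - 1) - 1                                # 7 for INT4
    smax = 2 ** scale_bits - 1                                  # 15 for UINT4
    groups = w.reshape(-1, vec)
    per_vec = np.abs(groups).max(axis=1, keepdims=True)         # ideal per-vector range
    s1 = np.float16(per_vec.max() / (qmax * smax))              # coarse per-channel scale (FP16)
    s0 = np.clip(np.ceil(per_vec / (qmax * s1)), 1, smax)       # integer per-vector scale (UINT4)
    q = np.clip(np.round(groups / (s0 * s1)), -qmax - 1, qmax)  # INT4 values
    return (q * s0 * s1).reshape(w.shape)                       # dequantized reconstruction

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
w_hat = vsq_quantize(w)
print(float(np.abs(w - w_hat).max()))   # reconstruction error at ~4.25 effective bits/element
```
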
Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding
Linear Quantization on Activations
• Unlike weights, the activation range varies across inputs.
• To determine the floating-point range, activation statistics are gathered before deploying the model.

Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 1: During training, use Exponential Moving Averages (EMA):

  r̂_max,min^(t) = α · r_max,min^(t) + (1 − α) · r̂_max,min^(t−1)

• The observed ranges are smoothed across thousands of training steps.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

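A minimal sketch of such an EMA range tracker (the class and attribute names are my own):

```python
import numpy as np

class EMARange:
    """Track smoothed activation range r̂_min, r̂_max with an exponential moving average."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.r_min = None
        self.r_max = None

    def update(self, x):
        lo, hi = float(x.min()), float(x.max())
        if self.r_min is None:                       # first batch initializes the range
            self.r_min, self.r_max = lo, hi
        else:                                        # r̂(t) = α·r(t) + (1−α)·r̂(t−1)
            self.r_min = self.alpha * lo + (1 - self.alpha) * self.r_min
            self.r_max = self.alpha * hi + (1 - self.alpha) * self.r_max
        return self.r_min, self.r_max

tracker = EMARange(alpha=0.01)
rng = np.random.default_rng(0)
for _ in range(1000):                                            # e.g., one update per step
    tracker.update(np.maximum(rng.normal(size=1024), 0))         # ReLU-like activations
print(tracker.r_min, tracker.r_max)
```
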
Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Spending dynamic range on outliers hurts the representation ability of the remaining values.
  • Use the mean of the min/max of each sample in the calibration batches, or
  • use an analytical calculation (see next slide).

Neural Network Distiller
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]

Dynamic Range for Activation Quantization
Collect activation statistics before deploying the model
• Type 2: By running a few "calibration" batches of samples on the trained FP32 model
  • Minimize the loss of information, so that the integer model encodes the same information as the original floating-point model.
  • The loss of information is measured by the Kullback-Leibler divergence (relative entropy or information divergence). For two discrete probability distributions P and Q:

    D_KL(P ∥ Q) = Σ_i P(x_i) log( P(x_i) / Q(x_i) )

  • Intuition: KL divergence measures the amount of information lost when approximating a given encoding.

8-bit Inference with TensorRT [Szymon Migacz, 2017]

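A simplified sketch of KL-based clipping-threshold search in the spirit of the TensorRT calibrator (the histogram sizes and the search grid are my simplifications, not the tool's exact algorithm):

```python
import numpy as np
from scipy.stats import entropy   # entropy(P, Q) = KL(P || Q)

def kl_calibrate(activations, n_bins=2048, n_quant_bins=128):
    """Search the clipping threshold that minimizes KL(reference || quantized)."""
    hist, edges = np.histogram(np.abs(activations), bins=n_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(n_quant_bins, n_bins + 1, 16):          # candidate clipping points
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                            # fold clipped outliers into last bin
        # Requantize the clipped histogram into n_quant_bins levels, then expand it back.
        chunks = np.array_split(p, n_quant_bins)
        q = np.concatenate([np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
                            for c in chunks])
        kl = entropy(p + 1e-12, q + 1e-12)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t                                          # use [-t, t] as the clipped range

acts = np.abs(np.random.default_rng(0).standard_t(df=4, size=100_000)).astype(np.float32)
print("clip threshold:", kl_calibrate(acts))               # smaller than the raw |r|max
```
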
Dynamic Range for Activation Quantization
Minimize loss of information by minimizing the KL divergence

[Figures: activation distributions and the chosen clipping thresholds for several layers, e.g., AlexNet Pool 2, GoogleNet inception_5a/5x5, ResNet-152 res4b8_branch2a, GoogleNet inception_3a/pool.]

8-bit Inference with TensorRT [Szymon Migacz, 2017]

Dynamic Range for Quantization
Minimize the mean-square-error (MSE) using the Newton-Raphson method (OCTAV)

• Max-scaled quantization keeps the full range but suffers large quantization noise; clipped quantization trades a small clipping error on low-density outliers for much finer resolution of the dense region.

| Network      | FP32 Accuracy | OCTAV int4 |
| ResNet-50    | 76.07         | 75.84      |
| MobileNet-V2 | 71.71         | 70.88      |
| Bert-Large   | 91.00         | 87.09      |

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [Sakr et al., ICML 2022]

Post-Training Quantization
How should we get the optimal linear quantization parameters (S, Z)?

Topic I: Quantization Granularity


Topic II: Dynamic Range Clipping
Topic III: Rounding
Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Philosophy
  • Rounding-to-nearest is not optimal.
  • Weights are correlated with each other; the best rounding for each individual weight (to the nearest value) is not necessarily the best rounding for the whole tensor.

  rounding-to-nearest:             0.3  0.5  0.7  0.2  →  0  1  1  0
  AdaRound (one potential result): 0.3  0.5  0.7  0.2  →  0  0  1  0

• What is optimal? The rounding that reconstructs the original activation the best, which may be very different.
• Applies to weight quantization only.
• With short-term tuning, it is (almost) post-training quantization.

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]

Adaptive Rounding for Weight Quantization
Rounding-to-nearest is not optimal
• Method:
  • Instead of rounding to nearest, ⌊w⌉, we want to choose from {⌊w⌋, ⌈w⌉} to get the best reconstruction.
  • We take a learning-based method to find the quantized value w̃ = ⌊⌊w⌋ + δ⌉, δ ∈ [0, 1].
  • We optimize the following objective (derivation omitted):

    argmin_V ∥Wx − W̃x∥_F^2 + λ f_reg(V)
    → argmin_V ∥Wx − ⌊⌊W⌋ + h(V)⌉ x∥_F^2 + λ f_reg(V)

  • x is the input to the layer, and V is a random variable of the same shape as W.
  • h(·) is a function that maps its input into the range (0, 1), such as the rectified sigmoid.
  • f_reg(V) is a regularization that encourages h(V) to be binary:

    f_reg(V) = Σ_{i,j} (1 − | 2 h(V_{i,j}) − 1 |^β)

Up or Down? Adaptive Rounding for Post-Training Quantization [Nagel et al., PMLR 2020]

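A toy sketch of this learnable-rounding idea in PyTorch (a simplification of the objective above, not the official implementation; the rectified-sigmoid constants are common practice and are assumptions):

```python
import torch

def rectified_sigmoid(V, zeta=1.1, gamma=-0.1):
    """h(V): maps V into (0, 1) with a stretched, clipped sigmoid."""
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0.0, 1.0)

def adaround_layer(W, x, scale, steps=2000, lam=0.01, beta=2.0, lr=1e-2):
    """Learn per-weight rounding offsets so that (quantized W) @ x reconstructs W @ x."""
    W_floor = torch.floor(W / scale)
    V = torch.zeros_like(W, requires_grad=True)            # soft rounding variable
    opt = torch.optim.Adam([V], lr=lr)
    target = W @ x
    for _ in range(steps):
        W_soft = (W_floor + rectified_sigmoid(V)) * scale  # soft-quantized weights
        recon = (W_soft @ x - target).pow(2).sum()         # ||Wx - W~x||^2
        f_reg = (1 - (2 * rectified_sigmoid(V) - 1).abs().pow(beta)).sum()
        loss = recon + lam * f_reg                         # push h(V) toward {0, 1}
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Final hard rounding: floor + {0 or 1} depending on the learned offset.
    return (W_floor + (rectified_sigmoid(V) > 0.5).float()) * scale

W = torch.randn(8, 16)
x = torch.randn(16, 32)
W_q = adaround_layer(W, x, scale=0.1)
print((W @ x - W_q @ x).abs().mean())   # typically smaller than with round-to-nearest
```
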
Neural Network Quantization: Design Choices

• Zero Point: Asymmetric vs. Symmetric
• Scaling Granularity: Per-Tensor, Per-Channel, Group Quantization
• Range Clipping: Exponential Moving Average, Minimizing KL Divergence, Minimizing Mean-Square-Error
• Rounding: Round-to-Nearest, AdaRound

| Quantization |                           | K-Means-based                            | Linear             |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic |

Post-Training INT8 Linear Quantization

Accuracy deltas for two post-training INT8 configurations:

| Activation        | Symmetric, Per-Tensor      | Asymmetric, Per-Tensor           |
| Activation range  | Minimize KL-Divergence     | Exponential Moving Average (EMA) |
| Weight            | Symmetric, Per-Tensor      | Symmetric, Per-Channel           |
| GoogleNet         | -0.45%                     | 0%                               |
| ResNet-50         | -0.13%                     | -0.6%                            |
| ResNet-152        | -0.08%                     | -1.8%                            |
| MobileNetV1       | -                          | -11.8%                           |
| MobileNetV2       | -                          | -2.1%                            |

Smaller models seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity. How should we improve the performance of quantized models?

Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]
8-bit Inference with TensorRT [Szymon Migacz, 2017]

Quantization-Aware Training
How should we improve performance of quantized models?
Quantization-Aware Training
Train the model taking quantization into consideration
• To minimize the loss of accuracy, especially for aggressive quantization with 4-bit and lower bit widths, the neural network is trained/fine-tuned with quantized weights and activations.
• Usually, fine-tuning a pre-trained floating-point model provides better accuracy than training from scratch.
• Recall K-Means-based quantization fine-tuning: the centroids are updated with gradients grouped by cluster index (Deep Compression [Han et al., ICLR 2016]).
• In the forward pass, weight quantization is applied between layers (Layer N-1 → Layer N → Layer N+1); an example block of operations inside a layer is Conv → Batch Norm → ReLU.



Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
[Diagram: forward/backward pass through Layer N-1 → Layer N → Layer N+1, with weight quantization applied to W before Layer N; example operations: Conv → Batch Norm → ReLU.]

Quantization-Aware Training
Train the model taking quantization into consideration
• "Simulated/Fake Quantization": the weights are quantized and immediately dequantized in the forward pass,

  W → S_W q_W = Q(W)

  which ensures discrete-valued weights at the layer boundaries, while the operations themselves (Conv, Batch Norm, ReLU) still run in full precision.

Linear Quantization
An affine mapping of integers to real numbers r = S(q −
Z)
weights quantized weights zero point scale
(32-bit float) (2-bit signed int) (2-bit signed int) (32-bit float)

2.09 -0.98 1.48 0.09 1 -2 0 -1 2.14 -1.07 1.07 0

0.05 -0.14 -1.08 2.12 -1 -1 -2 1 0 0 -1.07 2.14

-0.91 1.92 0 -1.03


( -2 1 -1 -2
-1 ) 1.07 = -1.07 2.14 0 -1.07

1.87 0 1.53 1.49 1 -1 0 0 2.14 0 1.07 1.07

W qW Q(W
)
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights W is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
• "Simulated/Fake Quantization" is applied to both weights and activations:

  W → S_W q_W = Q(W)
  Y → S_Y (q_Y − Z_Y) = Q(Y)

• Discrete-valued weights and activations are enforced at the layer boundaries, while the operations still run in full precision.
• Question: how should gradients back-propagate through the (simulated) quantization?

Straight-Through Estimator (STE)
• Quantization is discrete-valued (e.g., Q(w) = round(w)), so its derivative is 0 almost everywhere:

  ∂Q(W) / ∂W = 0

• The neural network would learn nothing, since the gradients become 0 and the weights would not get updated:

  g_W = ∂L/∂W = (∂L/∂Q(W)) · (∂Q(W)/∂W) = 0

• The Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function:

  g_W = ∂L/∂W := ∂L/∂Q(W)

Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]

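In PyTorch this is often written with the detach trick; a minimal sketch of the idiom (my own wording, not the papers' code):

```python
import torch

def fake_quantize_ste(w, scale, qmin=-128, qmax=127):
    """Simulated quantization with a straight-through gradient."""
    q = torch.clamp(torch.round(w / scale), qmin, qmax) * scale   # forward: quantize + dequantize
    # Backward: (q - w).detach() contributes no gradient, so dL/dw = dL/dq (identity), i.e. STE.
    return w + (q - w).detach()

w = torch.randn(4, requires_grad=True)
y = fake_quantize_ste(w, scale=0.1).sum()
y.backward()
print(w.grad)   # all ones: the gradient passed straight through the rounding
```
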
Quantization-Aware Training
Train the model taking quantization into consideration
• A full precision copy of the weights is maintained throughout the training.
• The small gradients are accumulated without loss of precision.
• Once the model is trained, only the quantized weights are used for inference.
• Forward: "Simulated/Fake Quantization" W → S_W q_W = Q(W) and Y → S_Y (q_Y − Z_Y) = Q(Y).
• Backward (with STE): g_W ← ∂L/∂Q(W) and g_Y ← ∂L/∂Q(Y).
• Discrete-valued weights and activations are enforced at the layer boundaries; the operations (Conv, Batch Norm, ReLU) still run in full precision.

INT8 Linear Quantization-Aware Training

| Neural Network | Floating-Point | PTQ Asymmetric Per-Tensor | PTQ Symmetric Per-Channel | QAT Asymmetric Per-Tensor | QAT Symmetric Per-Channel |
| MobileNetV1    | 70.9%          | 0.1%                      | 59.1%                     | 70.0%                     | 70.7%                     |
| MobileNetV2    | 71.9%          | 0.1%                      | 69.8%                     | 70.9%                     | 71.1%                     |
| NASNet-Mobile  | 74.9%          | 72.2%                     | 72.1%                     | 73.0%                     | 73.0%                     |

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi, arXiv 2018]

Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0

| Quantization |                           | K-Means-based                            | Linear             | ? |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | ? |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | ? |
Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1 1 0 1 1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1 1 0 0 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2 0 1 1 0

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0 1 1 1 1

| Quantization |                           | K-Means-based                            | Linear             | Binary/Ternary         |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | Binary/Ternary Weights |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | Bit Operations         |
Binary/Ternary Quantization
Can we push the quantization precision to 1 bit?

y_i = Σ_j W_ij · x_j

Example: W_i = [8, -3, 5, -1], x = [5, 2, 0, 1]:
  y_i = 8×5 + (-3)×2 + 5×0 + (-1)×1 = 33

If the weights are quantized to +1 and -1 (W_i = [1, -1, 1, -1]):
  y_i = 5 - 2 + 0 - 1 = 2, so the multiplications become additions and subtractions.

| input | weight | operations | memory    | computation |
| R     | R      | + ×        | 1×        | 1×          |
| R     | B      | + −        | ~32× less | ~2× less    |

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

Binarization
• Deterministic Binarization
  • Directly computes the bit value based on a threshold, usually 0, resulting in a sign function:

    q = sign(r) = { +1, r ≥ 0;  −1, r < 0 }

• Stochastic Binarization
  • Uses global statistics or the value of the input data to determine the probability of being -1 or +1.
  • e.g., in BinaryConnect (BC), the probability is determined by the hard sigmoid σ(r):

    q = { +1 with probability p = σ(r);  −1 with probability 1 − p },  where σ(r) = min(max((r + 1)/2, 0), 1)

  • Harder to implement, as it requires the hardware to generate random bits when quantizing.

BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]

Minimizing Quantization Error in Binarization

weights W (32-bit float):        binary weights W^B (1-bit):
  2.09  -0.98   1.48   0.09        1  -1   1   1
  0.05  -0.14  -1.08   2.12        1  -1  -1   1
 -0.91   1.92   0     -1.03       -1   1   1  -1
  1.87   0      1.53   1.49        1   1   1   1

• Plain sign binarization, W^B = sign(W):  ∥W − W^B∥_F^2 = 9.28
• Binary Weight Network (BWN) adds a 32-bit float scale α = (1/n) ∥W∥_1 = 1.05:  ∥W − α W^B∥_F^2 = 9.24

AlexNet-based ImageNet Top-1 accuracy delta:
| Network                     | Accuracy Delta |
| BinaryConnect               | -21.2%         |
| Binary Weight Network (BWN) | 0.2%           |

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

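A short sketch of this scaled binarization (helper name is mine), reproducing the numbers above:

```python
import numpy as np

def binarize_bwn(W):
    """Binary Weight Network: W ≈ alpha * sign(W), with alpha = mean(|W|)."""
    alpha = np.abs(W).mean()                 # (1/n) * ||W||_1
    Wb = np.where(W >= 0, 1.0, -1.0)         # sign(W), mapping 0 to +1 as on the slide
    return alpha, Wb

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

alpha, Wb = binarize_bwn(W)
print(round(float(alpha), 2))                              # -> 1.05
print(round(float(((W - Wb) ** 2).sum()), 2))              # -> 9.28 (no scale)
print(round(float(((W - alpha * Wb) ** 2).sum()), 2))      # -> 9.24 (with scale)
```
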
If both activations and weights are binarized

y_i = Σ_j W_ij · x_j, with W_i = [1, -1, 1, -1] and x = [1, 1, -1, 1]:
  y_i = 1×1 + (-1)×1 + 1×(-1) + (-1)×1 = 1 + (-1) + (-1) + (-1) = -2

Encode -1 as bit 0 and +1 as bit 1. Then the element-wise product is an XNOR:

| W  | X  | Y = WX | b_W | b_X | XNOR(b_W, b_X) |
| 1  | 1  | 1      | 1   | 1   | 1              |
| 1  | -1 | -1     | 1   | 0   | 0              |
| -1 | -1 | 1      | 0   | 0   | 1              |
| -1 | 1  | -1     | 0   | 1   | 0              |

Summing the XNOR bits gives 1 xnor 1 + 0 xnor 1 + 1 xnor 0 + 0 xnor 1 = 1 + 0 + 0 + 0 = 1, not -2. Starting from the all-(-1) sum of -n = -4, each XNOR bit equal to 1 adds +2, so

  y_i = -n + 2 · Σ_j (W_ij xnor x_j) = -4 + 2×1 = -2

which can be computed with an xnor and a population count:

  y_i = -n + popcount(W_i xnor x) << 1

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

If both activations and weights are binarized

  y_i = -n + popcount(W_i xnor x) << 1
      = -4 + popcount(1010 xnor 1101) << 1
      = -4 + popcount(1000) << 1 = -4 + 2 = -2

| input | weight | operations      | memory    | computation |
| R     | R      | + ×             | 1×        | 1×          |
| R     | B      | + −             | ~32× less | ~2× less    |
| B     | B      | xnor, popcount  | ~32× less | ~58× less   |

XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

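A small sketch of this bit-packed dot product in Python (the packing layout and helper names are mine):

```python
def binary_dot(w_bits, x_bits, n):
    """y = -n + 2 * popcount(xnor(w, x)) over n packed bits (+1 -> 1, -1 -> 0)."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ x_bits) & mask          # bitwise xnor, truncated to n bits
    return -n + (bin(xnor).count("1") << 1)   # popcount, then << 1 (i.e., *2)

def pack(values):
    """Pack a +1/-1 vector into an integer, most significant bit first."""
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

w = [1, -1, 1, -1]        # packs to 0b1010
x = [1, 1, -1, 1]         # packs to 0b1101
print(binary_dot(pack(w), pack(x), n=4))             # -> -2
print(sum(wi * xi for wi, xi in zip(w, x)))          # reference full-precision result: -2
```
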
Accuracy Degradation of Binarization

| Neural Network | Quantization | Bit-Width (W) | Bit-Width (A) | ImageNet Top-1 Accuracy Delta |
| AlexNet        | BWN          | 1             | 32            | 0.2%                          |
| AlexNet        | BNN          | 1             | 1             | -28.7%                        |
| AlexNet        | XNOR-Net     | 1             | 1             | -12.4%                        |
| GoogleNet      | BWN          | 1             | 32            | -5.80%                        |
| GoogleNet      | BNN          | 1             | 1             | -24.20%                       |
| ResNet-18      | BWN          | 1             | 32            | -8.5%                         |
| ResNet-18      | XNOR-Net     | 1             | 1             | -18.1%                        |

* BWN: Binary Weight Network, with a scale factor for weight binarization
* BNN: Binarized Neural Network, without scale factors
* XNOR-Net: scale factors for both activation and weight binarization

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. [Courbariaux et al., Arxiv 2016]
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]

Ternary Weight Networks (TWN)
Weights are quantized to +1, -1, and 0:

  q = { r_t,  r > Δ;   0,  |r| ≤ Δ;   −r_t,  r < −Δ }
  where Δ = 0.7 × E(|r|) and r_t = E_{|r|>Δ}(|r|)

Example (the 4×4 weight matrix):
  Δ = 0.7 × (1/16) ∥W∥_1 = 0.73
  r_t = mean of |W| over the entries with |W| > Δ = 1.5

weights W (32-bit float):        ternary weights W^T (2-bit):
  2.09  -0.98   1.48   0.09        1  -1   1   0
  0.05  -0.14  -1.08   2.12        0   0  -1   1
 -0.91   1.92   0     -1.03       -1   1   0  -1
  1.87   0      1.53   1.49        1   0   1   1

| ImageNet Top-1 Accuracy | Full Precision | 1 bit (BWN) | 2 bit (TWN) |
| ResNet-18               | 69.6           | 60.8        | 65.3        |

Ternary Weight Networks [Li et al., Arxiv 2016]

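A small sketch of this ternarization rule (helper name is mine), reproducing the slide's numbers:

```python
import numpy as np

def ternarize_twn(W):
    """Ternary Weight Networks: q in {-r_t, 0, +r_t} with Delta = 0.7 * E|W|."""
    delta = 0.7 * np.abs(W).mean()
    mask = np.abs(W) > delta                   # weights that stay non-zero
    r_t = np.abs(W[mask]).mean()               # shared magnitude of the non-zero weights
    return delta, r_t, np.sign(W) * mask       # ternary codes in {-1, 0, +1}

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

delta, r_t, Wt = ternarize_twn(W)
print(round(float(delta), 2), round(float(r_t), 2))   # -> 0.73 1.5
print(Wt.astype(int))                                  # matches the ternary matrix on the slide
print(np.abs(W - r_t * Wt).max())                      # quantization error
```
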
Trained Ternary Quantization (TTQ)
• Instead of using a fixed scale r_t, TTQ introduces two trainable parameters w_p and w_n to represent the positive and negative scales in the quantization:

  q = { w_p,  r > Δ;   0,  |r| ≤ Δ;   −w_n,  r < −Δ }

• Pipeline: full-precision weight → normalize → intermediate ternary weight (quantize with threshold ±t) → scale with the trained w_p, w_n → final ternary weight.

| ImageNet Top-1 Accuracy | Full Precision | 1 bit (BWN) | 2 bit (TWN) | TTQ  |
| ResNet-18               | 69.6           | 60.8        | 65.3        | 66.6 |

Trained Ternary Quantization [Zhu et al., ICLR 2017]


Neural Network Quantization
2.09 -0.98 1.48 0.09 3 0 2 1 3: 2.00 1 -2 0 -1 1 0 1 1

0.05 -0.14 -1.08 2.12 1 1 0 3 2: 1.50 -1 -1 -2 1 1 0 0 1


( -1) 1.07
-0.91 1.92 0 -1.03 0 3 1 0 1: 0.00 -2 1 -1 -2 0 1 1 0

1.87 0 1.53 1.49 3 1 2 2 0: -1.00 1 -1 0 0 1 1 1 1

| Quantization |                           | K-Means-based                            | Linear             | Binary/Ternary         |
| Storage      | Floating-Point Weights    | Integer Weights; Floating-Point Codebook | Integer Weights    | Binary/Ternary Weights |
| Computation  | Floating-Point Arithmetic | Floating-Point Arithmetic                | Integer Arithmetic | Bit Operations         |
Mixed-Precision Quantization

• Uniform quantization uses the same bit widths for every layer, e.g., 8-bit weights / 8-bit activations for Layer 1, Layer 2, Layer 3, …
• Mixed-precision quantization assigns each layer its own weight and activation bit widths, e.g., Layer 1: 4-bit weights / 5-bit activations, Layer 2: 6 / 7 bits, Layer 3: 5 / 4 bits, …

Challenge: Huge Design Space
• With 8 choices of weight bit width and 8 choices of activation bit width per layer, each layer has 8 × 8 = 64 choices, so an n-layer network has a design space of 64^n.


Solution: Design Automation (HAQ)
• A reinforcement learning agent (actor-critic) chooses the per-layer bit widths: the actor proposes an action (the bit widths), the quantized model is evaluated, and the reward is fed back to update the actor and critic through the observed state.
• The hardware accelerator is in the loop: the chosen bit widths are mapped onto real hardware (e.g., BitFusion on the edge, BISMO on the edge and cloud), and the measured feedback is returned directly to the agent instead of proxy measures.

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

HAQ Outperforms Uniform Quantization
[Figure: accuracy vs. model size for mixed-precision quantized MobileNetV1, comparing HAQ (Ours) against PACT and the uniform-quantization baseline.]

HAQ Supports Multiple Objectives
[Figure: HAQ vs. uniform quantization under model-size-constrained, latency-constrained, and energy-constrained settings, on mixed-precision quantized MobileNetV1.]

Quantization Policy for Edge and Cloud
[Figure: per-layer weight and activation bit widths (pointwise vs. depthwise layers) chosen by HAQ for mixed-precision quantized MobileNetV2 on edge and cloud hardware.]

HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]

Summary of Today's Lecture
In this lecture, we:
1. Reviewed Linear Quantization (floating-point range, scale, and zero point).
2. Introduced Post-Training Quantization (PTQ), which quantizes an already-trained floating-point neural network model:
   • Per-tensor vs. per-channel vs. group quantization
   • How to determine the dynamic range for quantization
3. Introduced Quantization-Aware Training (QAT), which emulates inference-time quantization during training/fine-tuning:
   • Straight-Through Estimator (STE)
4. Introduced binary and ternary quantization.
5. Introduced automatic mixed-precision quantization.

References
1. Deep Compression [Han et al., ICLR 2016]
2. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al.,
CVPR 2018]
4. Data-Free Quantization Through Weight Equalization and Bias Correction [Markus et al., ICCV 2019]
5. Post-Training 4-Bit Quantization of Convolution Networks for Rapid-Deployment [Banner et al., NeurIPS 2019]
6. 8-bit Inference with TensorRT [Szymon Migacz, 2017]
7. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper [Raghuraman Krishnamoorthi,
arXiv 2018]
8. Neural Networks for Machine Learning [Hinton et al., Coursera Video Lecture, 2012]
9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation [Bengio, arXiv 2013]
10.Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1.
[Courbariaux et al., Arxiv 2016]
11. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv
2016]
12.PACT: Parameterized Clipping Activation for Quantized Neural Networks [Choi et al., arXiv 2018]
13.WRPN: Wide Reduced-Precision Networks [Mishra et al., ICLR 2018]
14.Towards Accurate Binary Convolutional Neural Network [Lin et al., NeurIPS 2017]
15.Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights [Zhou et al., ICLR 2017]
16.HAQ: Hardware-Aware Automated Quantization with Mixed Precision [Wang et al., CVPR 2019]
