ML System Optimization Lecture 11: Pruning Again
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
[Figure: a neural network before pruning and after pruning]
[Figure: number of pruning and sparsity publications per year, 1989 to 2019, spanning Optimal Brain Damage (1989) through Deep Compression to EIE]
• 2:4 sparsity in the A100 GPU: 2x peak performance, 1.5x measured BERT speedup.
Source: https://github.com/mit-han-lab/pruning-sparsity-publications
Neural Network Pruning
• In general, we can formulate pruning as follows:

  arg min_{W_P} L(x; W_P)   subject to   ∥W_P∥0 ≤ N

  where L is the loss, x is the input, W_P are the weights that remain after pruning, and the L0 norm ∥W_P∥0 (the number of nonzero weights) must not exceed the target budget N.
• Some commonly used pruning granularities, from irregular to regular:
[Figure: sparsity patterns on a weight tensor with c_i = 2, c_o = 3, and 3 × 3 kernels, ranging from irregular fine-grained sparsity to regular channel-level sparsity]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Selection of Synapses to Prune
• When removing parameters from a neural network model, the less important the removed parameters are, the better the performance of the pruned network.
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning: Importance = |W| (a minimal sketch follows below).
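To make the criterion concrete, here is a minimal PyTorch sketch of element-wise magnitude pruning; the function name and interface are ours, not from the lecture. It ranks weights by |W| and zeroes out the smallest fraction in place.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of `weight` in place and
    return the binary mask that was applied (Importance = |W|)."""
    num_zeros = round(weight.numel() * sparsity)
    if num_zeros == 0:
        return torch.ones_like(weight)
    importance = weight.abs()
    # Threshold at the num_zeros-th smallest magnitude (kthvalue is 1-indexed).
    threshold = importance.flatten().kthvalue(num_zeros).values
    mask = (importance > threshold).to(weight.dtype)
    weight.mul_(mask)
    return mask

# Example: prune roughly 60% of a random 4x4 weight tensor.
w = torch.randn(4, 4)
mask = magnitude_prune(w, sparsity=0.6)
print((w == 0).float().mean())  # ~0.6
```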
• Example: neuron pruning in a linear layer and channel pruning in a convolution layer.
[Figure: pruning a neuron in a linear layer removes a row of the weight matrix; pruning a channel in a convolution layer removes an entire filter]
Recap
• Non-uniform pruning (AMC) is better than uniform shrinking.
[Figure: latency (ms) vs. accuracy for uniform shrinking and channel pruning with AMC]
AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Finding Pruning Ratios
Analyze the sensitivity of each layer
• We need different pruning ratios for each layer, since different layers have different sensitivity:
  • Some layers are more sensitive (e.g., the first layer).
  • Some layers are more redundant.
[Figure: accuracy (%) vs. pruning rate (percentage of weights pruned away) for layer L0]
Finding Pruning Ratios
Analyze the sensitivity of each layer
• The process of sensitivity analysis (* VGG-11 on the CIFAR-10 dataset):
  • Prune one layer at a time at increasing pruning rates and observe the accuracy drop ΔAcc: the higher the pruning rate, the larger the accuracy loss.
  • Repeat for every layer (L0, L1, ..., L5 in the figure); more sensitive layers degrade faster.
  • Pick an accuracy degradation threshold; each layer's pruning rate is the largest rate whose accuracy stays above the threshold (see the sketch below).
[Figure: accuracy (%) vs. pruning rate (percentage of weights pruned away) for layers L0 to L5, with the threshold line and the selected per-layer pruning rates marked]
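A minimal sketch of this scan; the helper callables (`prune_layer`, `evaluate`) are assumptions, not the lecture's code.

```python
import copy

def sensitivity_scan(model, layers, prune_layer, evaluate,
                     rates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Prune one layer at a time at each rate (all other layers stay dense)
    and record the accuracy, yielding one sensitivity curve per layer."""
    curves = {}
    for name in layers:
        accs = []
        for rate in rates:
            pruned = prune_layer(copy.deepcopy(model), name, rate)
            accs.append(evaluate(pruned))  # no fine-tuning during the scan
        curves[name] = accs
    return curves

def pick_rates(curves, rates, threshold):
    """Largest pruning rate per layer whose accuracy stays above the threshold."""
    return {name: max([r for r, a in zip(rates, accs) if a >= threshold], default=0.0)
            for name, accs in curves.items()}
```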
Finding Pruning Ratios
Analyze the sensitivity of each layer
• Is this optimal? Maybe not: we do not consider the interaction between layers.
• Can we go beyond such heuristics? Yes!
Automatic Pruning
• Given an overall compression ratio, how do we choose per-layer pruning ratios?
• Sensitivity analysis ignores the interaction between layers, so it is sub-optimal.
AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Automatic Pruning
• Can we develop a push-the-button solution?
[Figure: a reinforcement-learning loop in which an agent proposes a pruning ratio (?%) for each layer, turning the original NN into a compressed NN, with Reward = -Error × log(FLOP)]
AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
AMC: AutoML for Model Compression
• Model compression by humans: labor-consuming and sub-optimal.
• AMC instead treats pruning as a reinforcement learning problem.
[Figure: the same RL loop as above, with Reward = -Error × log(FLOP)]
An Analysis of Deep Neural Network Models for Practical Applications [Canziani et al., 2016]
AMC: AutoML for Model Compression
• AMC uses the following setup for the reinforcement learning problem:
  • State: 11 features (including layer indices, channel numbers, kernel sizes, FLOPs, ...)
  • Action: a continuous number (the pruning ratio) a ∈ [0, 1)
  • Agent: DDPG agent, since it supports continuous action output
  • Reward: R = -Error if the compressed model satisfies the constraints, and R = -∞ if not
• We can also optimize for latency constraints with a pre-built lookup table (LUT); a sketch of the reward follows below.
AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
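A minimal sketch of the two rewards mentioned above; the function names and the `meets_constraints` flag are ours, not AMC's code.

```python
import math

def amc_reward(error: float, meets_constraints: bool) -> float:
    """Constrained reward: -Error when the compressed model satisfies the
    resource constraints (e.g., FLOPs or latency from a lookup table), and
    -infinity otherwise, so the agent never prefers infeasible actions."""
    return -error if meets_constraints else -math.inf

def flop_aware_reward(error: float, flops: float) -> float:
    """FLOP-aware variant from the earlier slide: Reward = -Error * log(FLOP)."""
    return -error * math.log(flops)
```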
AMC: AutoML for Model Compression
[Figure 14: the pruning policy (sparsity ratio per layer) given by the reinforcement learning agent for ResNet-50, human expert* vs. AutoML]
* Measured with TF-Lite on a Samsung Galaxy S7 Edge with a Qualcomm Snapdragon SoC, single core, batch size = 1 (mobile, latency-oriented)

NetAdapt
• The goal of NetAdapt is to find per-layer pruning ratios that meet a global resource constraint (e.g., latency, energy, ...).
• The process is done iteratively; we take the latency constraint as an example.
NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
NetAdapt
• For each iteration, we aim to reduce the latency by a certain amount Δ (manually defined), starting from the original model.
• For each layer Lk (k in A-Z in the figure):
  • Prune the layer so that the latency reduction meets Δ (based on a pre-built lookup table).
  • Short-term fine-tune the resulting candidate and measure its accuracy.
• Keep the candidate with the highest short-term fine-tuned accuracy, then move on to the next iteration; after the last iteration, long-term fine-tune to recover accuracy.
• The iterative nature allows us to obtain a series of models with different costs (#models = #iterations); see the sketch after this list.
NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
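One possible sketch of the NetAdapt outer loop. All helper callables (`prune_layer`, `latency`, the fine-tuning and accuracy functions) are assumptions standing in for the paper's components, not its actual code.

```python
from typing import Callable, List

def netadapt_loop(
    model,
    target_latency: float,
    delta: float,
    layers: List[str],
    prune_layer: Callable,        # prune_layer(model, name, delta) -> candidate
    latency: Callable,            # latency(model) -> float, e.g. from a lookup table
    short_term_finetune: Callable,
    long_term_finetune: Callable,
    accuracy: Callable,
):
    """Each iteration reduces latency by at least `delta`, keeping the
    per-layer candidate with the highest short-term fine-tuned accuracy."""
    series = []  # one model per iteration: a series of models with different costs
    while latency(model) > target_latency:
        candidates = []
        for name in layers:
            cand = prune_layer(model, name, delta)  # meet delta via the latency LUT
            cand = short_term_finetune(cand)
            candidates.append((accuracy(cand), cand))
        _, model = max(candidates, key=lambda t: t[0])  # keep the best candidate
        series.append(model)
    return long_term_finetune(model), series            # long-term fine-tune at the end
```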
Neural Network Pruning
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

  arg min_{W_P} L(x; W_P)   s.t.   ∥W_P∥0 ≤ N

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 2: Fine-tuning / Training
How should we improve the performance of sparse models?
[Figure: accuracy loss (from -0.5% to -3.5%) vs. fraction of connections pruned (30%, 50%, 70% pruned): pruning alone loses more and more accuracy as sparsity grows, while pruning followed by fine-tuning recovers most of it]
Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
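A minimal PyTorch sketch of fine-tuning a pruned model (ours, assuming a dict of binary masks from an earlier pruning step): the key detail is re-applying the masks after every optimizer step so pruned weights stay zero.

```python
import torch

def finetune_pruned(model, mask_dict, loader, epochs=1, lr=1e-3):
    """Fine-tune a pruned model while keeping pruned weights at zero by
    re-applying the binary masks after every optimizer step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            with torch.no_grad():  # keep the sparsity pattern fixed
                for name, param in model.named_parameters():
                    if name in mask_dict:
                        param.mul_(mask_dict[name])
```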
The Lottery Ticket Hypothesis
"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations."
— The Lottery Ticket Hypothesis
[Figure: a winning ticket subnetwork inside the dense network]
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
Iterative Magnitude Pruning
• Init → Train → Prune, then reset the surviving weights to their initial values and repeat; the final sparse subnetwork is the winning ticket. A sketch follows below.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
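A sketch (ours) of iterative magnitude pruning in the lottery-ticket style; `train_fn` and `prune_fn` are assumed helpers, not code from the paper.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_fn, rounds=5, rate=0.2):
    """Train, prune `rate` of the remaining weights by magnitude, rewind the
    survivors to their initial values, and repeat for `rounds` rounds."""
    init_state = copy.deepcopy(model.state_dict())  # theta_0, saved before training
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                      # train with the current masks
        masks = prune_fn(model, masks, rate)        # drop e.g. 20% of remaining weights
        with torch.no_grad():                       # rewind survivors to initialization
            for n, p in model.named_parameters():
                p.copy_(init_state[n] * masks[n])
    return model, masks                             # the winning ticket
```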
System Support for Sparsity
• Conventional pipeline: Training → Inference (slow, power hungry).
• Proposed pipeline: Training → Model Compression → Accelerated Inference (fast, power efficient).
Han et al., NeurIPS 2015; Han et al., ICLR 2016 (best paper award); Han et al., ISCA 2016; Han et al., ICLR 2017; Han et al., FPGA 2017 (best paper award)
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity
• Interleave the rows of the sparse weight matrix across processing elements (PE0 to PE3): PE0 holds rows 0 and 4, PE1 rows 1 and 5, PE2 rows 2 and 6, PE3 rows 3 and 7. A central control unit broadcasts the nonzero input activations to all PEs.
• Running example, with input activations ã = (0, a1, 0, a3)ᵀ:

$$
\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3}\\
0 & 0 & w_{1,2} & 0\\
0 & w_{2,1} & 0 & w_{2,3}\\
0 & 0 & 0 & 0\\
0 & 0 & w_{4,2} & w_{4,3}\\
w_{5,0} & 0 & 0 & 0\\
0 & 0 & 0 & w_{6,3}\\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
\tilde{a}
=
\begin{pmatrix} b_0\\ b_1\\ -b_2\\ b_3\\ -b_4\\ b_5\\ b_6\\ -b_7 \end{pmatrix}
\xrightarrow{\text{ReLU}}
\begin{pmatrix} b_0\\ b_1\\ 0\\ b_3\\ 0\\ b_5\\ b_6\\ 0 \end{pmatrix}
$$

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity
• Logically, each PE stores only the nonzero weights of its own rows in a compressed (CSC-like) layout. For PE0, which owns rows 0 and 4:

  Virtual Weight:  w0,0  w0,1  w4,2  w0,3  w4,3
  Column Pointer:  0     1     2     3     5

• The column pointers mark where each column's nonzeros start: columns 0 to 2 hold one weight each, while column 3 holds two (w0,3 and w4,3), so the final pointer jumps from 3 to 5. Each stored weight also carries a relative row index that is later accumulated into an absolute address. A software sketch follows below.
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
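A software sketch (ours) of how one PE consumes this format, using PE0's slice from above; the `row_of` array stands in for the hardware's relative row indices, and the weight values are placeholders.

```python
import numpy as np

virtual_weight = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # placeholders for w0,0 w0,1 w4,2 w0,3 w4,3
col_ptr = np.array([0, 1, 2, 3, 5])                   # start of each column's nonzeros
row_of = np.array([0, 0, 1, 0, 1])                    # local output row (0 -> b0, 1 -> b4)

def pe_matvec(a: np.ndarray) -> np.ndarray:
    """Accumulate b_local += w * a[j] for every nonzero weight in column j,
    skipping columns whose activation is zero (0 * A = 0, W * 0 = 0)."""
    b_local = np.zeros(2)
    for j in range(len(col_ptr) - 1):
        if a[j] == 0.0:                    # exploit activation sparsity
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b_local[row_of[k]] += virtual_weight[k] * a[j]
    return b_local

print(pe_matvec(np.array([0.0, 1.0, 0.0, 2.0])))  # a = (0, a1, 0, a3)
```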
Dataflow
• Rule of thumb: 0 × A = 0 and W × 0 = 0, so any multiplication with a zero operand can be skipped.
• Each nonzero activation a_j is broadcast to all PEs; each PE walks the nonzero weights in its slice of column j and accumulates w_{i,j} × a_j into output b_i, skipping zero activations and zero weights entirely.
[Figure: the W × ã example above, stepped through one nonzero activation at a time]
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Micro Architecture for Each PE
[Figure: the PE pipeline: Pointer Read (even/odd pointer SRAM banks), Sparse Matrix Access (sparse matrix SRAM with start/end address decoding), Arithmetic Unit (bypass path and an address accumulator that turns relative indices into absolute addresses), and Act R/W (source/destination activation registers), followed by ReLU]
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Load Balance
• Each PE has an activation queue holding (act value, act index) pairs; the queue buffers the broadcast activations so that a PE with more nonzero weights in the current column does not stall the others.
[Figure: a 4 × 4 array of PEs with central control, fed through per-PE activation queues]
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
What's Special in the Pipeline
• Activation sparsity: only nonzero activations enter the activation queue, and pointer reads are triggered once per nonzero activation.
• Weight sparsity: the pointer read and sparse matrix access stages fetch only the nonzero weights.
• Weight sharing: weights are stored as short indices into a shared codebook and decoded before the multiply.
• Arithmetic & write back: multiply-accumulate into the destination activation registers.
• ReLU and non-zero detection prepare the sparse activations for the next layer.
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Post-Layout Results of EIE
• Accuracy vs. ALU width:
  • 32-bit float: no loss
  • 32-bit int: 0.3% loss
  • 16-bit int: 0.5% loss
  • 8-bit int: 27% loss
• Other swept design parameters: FIFO queue depth, number of PEs.

  Technology:        40 nm
  # PEs:             64
  On-chip SRAM:      8 MB
  Max model size:    84 million parameters
  Static sparsity:   10x
  Dynamic sparsity:  3x
  Quantization:      4-bit
  Power:             586 mW

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Benchmark
• CPU: Intel Core i7-5930K
• GPU: NVIDIA TitanX
• Mobile GPU: NVIDIA Jetson TK1

  Layer            Size          Weight Density  Activation Density  FLOP Reduction  Description
  AlexNet-6        4096 × 9216   9%              35%                 33x             AlexNet for image classification
  AlexNet-7        4096 × 4096   9%              35%                 33x
  AlexNet-8        1000 × 4096   25%             38%                 10x
  VGG-6            4096 × 25088  4%              18%                 100x            VGG-16 for image classification
  VGG-7            4096 × 4096   4%              37%                 50x
  VGG-8            1000 × 4096   23%             41%                 10x
  NeuralTalk-We    600 × 4096    10%             100%                10x             RNN and LSTM for image caption
  NeuralTalk-Wd    8791 × 600    11%             100%                10x
  NeuralTalk-LSTM  2400 × 1201   10%             100%                10x

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Throughput
[Figure: throughput on a log scale for Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm); EIE is orders of magnitude above the CPU, GPU, and mobile GPU]
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Energy Efficiency
[Figure: energy efficiency (layers/J, log scale) across the same platforms; EIE again leads by orders of magnitude]
EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Top-5 Most Cited Papers in 50 Years of ISCA
(★ marks papers that won the ISCA Influential Paper Award)

  Rank  Citations  Year  Title                                                                           First Author + HOF Authors           Type   Topic
  1     5351       1995  The SPLASH-2 programs: Characterization and methodological considerations       Steven Woo, Anoop Gupta              Tool   Benchmark
  2     4214       2017  In-datacenter performance analysis of a Tensor Processing Unit                  Norm Jouppi, David Patterson         Arch   Machine Learning
  3     3834       2000  ★ Wattch: A framework for architectural-level power analysis and optimizations  David Brooks, Margaret Martonosi     Tool   Power
  4     3386       1993  ★ Transactional memory: Architectural support for lock-free data structures     Maurice Herlihy                      Micro  Parallelism
  5     2690       2016  EIE: Efficient inference engine on compressed deep neural network               Song Han, Bill Dally, Mark Horowitz  Arch   Machine Learning
We envision that future AI models will be sparse at various granularities and structures. Co-designed with specialized accelerators, sparse models will become more efficient and accessible.

Fine-Grained Structured-Sparse Matrix Format
• Two weights are nonzero out of every four consecutive weights (2:4 sparsity).
• Push all the nonzero elements to the left in memory to save storage and computation: an R × C matrix becomes R × C/2 nonzero values plus R × C/2 two-bit metadata indices.
[Figure: an R × C matrix with 2:4 sparsity (zero entries shaded) compressed into an R × C/2 array of nonzero values and an R × C/2 array of 2-bit indices]
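A sketch (ours) of packing a 2:4-sparse matrix into this value-plus-metadata layout:

```python
import numpy as np

def compress_2to4(w: np.ndarray):
    """For every group of four consecutive weights along a row, exactly two
    are nonzero; keep those two values plus their 2-bit positions within the
    group as metadata (R x C -> R x C/2 values + R x C/2 2-bit indices)."""
    rows, cols = w.shape
    assert cols % 4 == 0
    values = np.zeros((rows, cols // 2), dtype=w.dtype)
    indices = np.zeros((rows, cols // 2), dtype=np.uint8)
    for r in range(rows):
        for g in range(cols // 4):
            group = w[r, 4 * g : 4 * g + 4]
            nz = np.nonzero(group)[0]               # positions of the 2 nonzeros
            assert len(nz) == 2, "matrix must satisfy 2:4 sparsity"
            values[r, 2 * g : 2 * g + 2] = group[nz]  # push nonzeros to the left
            indices[r, 2 * g : 2 * g + 2] = nz        # each position fits in 2 bits
    return values, indices
```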
Sparse Convolution Computation
• A sparse convolution is a sparse set of dense matrix multiplications, with the work described by maps of (In, Out, Wgt) tuples.
• Computation: f_Out = f_Out + f_In × W_Wgt for each entry in the maps.
[Figure: conventional dense matmul (an M × K matrix times K × N into an M × N accumulator) vs. the case where the A matrix is sparse]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
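Before looking at GPU implementations, here is a direct (unoptimized) sketch of the map-driven rule f_Out = f_Out + f_In × W_Wgt. The names and the dict-of-weights layout are ours; stride 1 is assumed, so inputs and outputs share the same point count.

```python
import numpy as np

def sparse_conv(features, weights, maps):
    """features: (N, C_in); weights: dict wgt_idx -> (C_in, C_out) matrix;
    maps: list of (in_idx, out_idx, wgt_idx) tuples describing the work."""
    c_out = next(iter(weights.values())).shape[1]
    out = np.zeros((features.shape[0], c_out), dtype=features.dtype)
    for in_idx, out_idx, wgt_idx in maps:
        # f_Out = f_Out + f_In x W_Wgt, one map entry at a time
        out[out_idx] += features[in_idx] @ weights[wgt_idx]
    return out
```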
Existing GPU Implementation of Sparse Convolution
Weight-stationary computation: a separate matmul for each weight.
• Workload maps (In, Out, Wgt): (P0, Q1, W-1,-1), (P3, Q4, W-1,-1), (P1, Q3, W-1,0), (P0, Q0, W0,0), (P1, Q1, W0,0), (P2, Q2, W0,0), (P3, Q3, W0,0), (P4, Q4, W0,0), (P3, Q1, W1,0), (P1, Q0, W1,1), (P4, Q3, W1,1)
• W-1,-1: gather P0 and P3 into a 2 × Cin input buffer; f1 = f1 + f0 · W-1,-1 and f4 = f4 + f3 · W-1,-1
• W-1,0: a 1 × Cin matmul; f3 = f3 + f1 · W-1,0
• W0,0: the center weight maps every point to itself (input and output features coincide), a dense 5 × Cin matmul; fi = fi + fi · W0,0 for i = 0, 1, 2, 3, 4
• W1,0: a 1 × Cin matmul; f1 = f1 + f3 · W1,0
• W1,1: gather P1 and P4; f0 = f0 + f1 · W1,1 and f3 = f3 + f4 · W1,1
• Each weight's partial sums are scattered back into the output features Q0 to Q4, one kernel launch per weight.
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
TorchSparse Optimization Overview
• Gather the input features needed by each off-center weight (W-1,-1, W-1,0, W1,0, W1,1), pad the gathered groups to a common size, run one batched matmul (BMM), and scatter the partial sums back into the output features F0 to F4.
• The center weight W0,0 involves every point, so it is applied as a single dense matmul (MM) over all input features.
[Figure: gather → pad → apply BMM → scatter for the off-center weights; apply MM for W0,0]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
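A sketch (ours) of the gather → pad → BMM → scatter pattern. Padded rows stay zero and their partial sums are never scattered, so a little redundant compute buys regularity.

```python
import numpy as np

def gathered_bmm(features, weight_stack, maps_per_weight, out):
    """features: (N, C_in); weight_stack: (num_weights, C_in, C_out);
    maps_per_weight: one list of (in_idx, out_idx) pairs per weight."""
    num_w, c_in, c_out = weight_stack.shape
    max_len = max(len(m) for m in maps_per_weight)
    buf = np.zeros((num_w, max_len, c_in))           # padded input buffers
    for w, entries in enumerate(maps_per_weight):
        for i, (in_idx, _) in enumerate(entries):
            buf[w, i] = features[in_idx]             # gather
    psum = buf @ weight_stack                        # one batched matmul (BMM)
    for w, entries in enumerate(maps_per_weight):
        for i, (_, out_idx) in enumerate(entries):
            out[out_idx] += psum[w, i]               # scatter-accumulate
    return out
```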
Trading Computation for Regularity
• Separate computation (baseline): one small matmul per weight, i.e., many kernel calls and low device utilization. On the overhead vs. regularity spectrum it has no extra computation but the worst regularity.
[Figure: seven separate MMs of different sizes, placed on a spectrum from worst to best computation regularity]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Trading Computation for Regularity
• Dense convolution: the other extreme, with the best regularity (a single BMM with batch = 7) but a large computation overhead from padding every group to the same size.
[Figure: the seven separate MMs merged into one BMM (batch = 7)]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Trading Computation for Regularity
• Computation with grouping balances overhead and regularity: batch only similarly sized workloads together. In this example the extra computation is 2/28 (a small overhead).
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Trading Computation for Regularity
• The best grouping strategy differs across models and datasets, so a customized strategy is searched for each.
[Figure: speedup vs. number of groups, and map size per weight index for SemanticKITTI and nuScenes; the map-size distributions differ between datasets, so the optimal grouping differs too]
Results on Matrix Multiplication Optimizations
• On SemanticKITTI, adaptive grouping achieves the best normalized speedup (about 1.39x over the baseline), even though fixed grouping reaches higher raw TFLOP/s.
[Figure: TFLOP/s (roughly 8 to 12) and normalized speedup for the baseline, fixed grouping, and adaptive grouping on SemanticKITTI]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Results on Matrix Multiplication Optimizations
• On nuScenes, fixed grouping has the best TFLOP/s (21.1), but adaptive grouping is faster end-to-end (1.54x normalized speedup). This is because fixed grouping introduces a large amount of redundant computation.
[Figure: TFLOP/s and normalized speedup for the baseline, fixed grouping, and adaptive grouping on nuScenes]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
TorchSparse++: Overlapped Memory and Computation
• Applications: autonomous vehicles, 3D segmentation, 3D detection, 3D scene reconstruction, and on-device vision (iPhone 15 Pro).
[Figure: normalized performance of MinkowskiEngine, SpConv 1.2.1 (FP16), TorchSparse (FP16), SpConv 2.3.5 (FP16), and TorchSparse++ (FP16) on A100, 3090, Orin, 2080 Ti, 3090-TF32, and 1080 Ti-FP32, plus A100 and 2080 Ti training workloads]
Mapping Unit
• Merge sort can be used to find the mappings in sparse convolution.
• Example (stride = 1): the input points P0 to P4 and the output points Q0 to Q4 share the coordinates (1,1), (2,2), (2,4), (3,2), (4,3). To find the maps for weight W1,1, shift every input coordinate by (-1, -1), merge-sort the shifted inputs together with the output coordinates, and emit a map entry wherever the coordinates match: here P1 matches Q0 and P4 matches Q3. A sketch follows below.
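A sketch (ours) of the merge-based mapping for a single weight offset, reproducing the example above:

```python
def build_maps_for_offset(in_coords, out_coords, offset):
    """Shift the input coordinates by `offset`, sort both coordinate lists,
    and sweep them together; equal coordinates yield (input, output) pairs."""
    shifted = sorted((tuple(c + o for c, o in zip(xy, offset)), i)
                     for i, xy in enumerate(in_coords))
    outputs = sorted((tuple(xy), j) for j, xy in enumerate(out_coords))
    maps, a, b = [], 0, 0
    while a < len(shifted) and b < len(outputs):      # merge step
        if shifted[a][0] == outputs[b][0]:
            maps.append((shifted[a][1], outputs[b][1]))  # (P_in, Q_out)
            a += 1
            b += 1
        elif shifted[a][0] < outputs[b][0]:
            a += 1
        else:
            b += 1
    return maps

# Example from the slide: stride-1 points, weight W1,1 with shift (-1, -1).
pts = [(1, 1), (2, 2), (2, 4), (3, 2), (4, 3)]
print(build_maps_for_offset(pts, pts, (-1, -1)))  # [(1, 0), (4, 3)] -> (P1, Q0), (P4, Q3)
```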
[Figure: speedup and energy saving across the benchmark layers and platforms]
Summary
• In this lecture, we revisited pruning: granularities, criteria, per-layer ratios, fine-tuning, and system support for sparsity.
• We will cover in the next lecture:
  • Numeric data types in modern computer systems
  • Basic concepts of neural network quantization
  • Common neural network quantization methods
References
1. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
2. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W]
3. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
4. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
5. A Systematic DNN Weight Pruning Framework Using Alternating Direction Method of Multipliers [Zhang et al., ECCV 2018]
6. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
7. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
8. EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
9. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA [Han et al., FPGA 2017]
10. Block Sparse Format [NVIDIA, 2021]
11. Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]