ML System Optimization Lecture 11: Pruning (Part II)
Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

[Figure: a network before pruning vs. after pruning synapses and pruning neurons]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning

Make neural networks smaller by removing synapses and neurons.

[Figure: #publications on pruning and sparse neural networks per year, 1989–2019 (y-axis up to 3,200), with Optimal Brain Damage, Deep Compression, and EIE highlighted]

2:4 sparsity in the A100 GPU: 2× peak performance, 1.5× measured BERT speedup.

Source: https://github.com/mit-han-lab/pruning-sparsity-publications
Neural Network Pruning

• In general, we can formulate pruning as follows:

  argmin_{W_P} L(x; W_P)   subject to   ‖W_P‖₀ ≤ N

• L represents the objective function for neural network training;
• x is the input, W is the original weights, and W_P is the pruned weights;
• ‖W_P‖₀ counts the number of nonzeros in W_P, and N is the target number of nonzeros.
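A minimal sketch (assuming PyTorch) of enforcing the ‖W_P‖₀ ≤ N constraint by magnitude: keep the N largest-magnitude weights and zero out the rest.

```python
import torch

def l0_project(weight: torch.Tensor, n_nonzeros: int) -> torch.Tensor:
    """Keep the n_nonzeros largest-magnitude entries of `weight` and zero
    the rest, so that ||W_P||_0 <= n_nonzeros (ties may keep slightly fewer)."""
    if n_nonzeros >= weight.numel():
        return weight
    flat = weight.abs().flatten()
    # The (numel - N)-th smallest magnitude is the pruning threshold
    threshold = flat.kthvalue(weight.numel() - n_nonzeros).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask
```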


Pruning at Different Granularities

The case of convolutional layers: a weight tensor with c_o output channels, c_i input channels, and kernel size k_h × k_w (e.g., c_o = 3, c_i = 2, k_h = k_w = 3).

• Some of the commonly used pruning granularities, from irregular to regular:

  Fine-grained Pruning → Pattern-based Pruning → Vector-level Pruning → Kernel-level Pruning → Channel-level Pruning

[Figure: preserved vs. pruned weights under each granularity; pattern-based pruning looks like Tetris :)]

Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W 2017]
Selection of Synapses to Prune

• When removing parameters from a neural network model,
  • the less important the removed parameters are,
  • the better the performance of the pruned neural network is.
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning:

  Importance = |W|

• Example (element-wise L1-norm, pruning 50%):

  Weight            Importance         Pruned Weight
  [ 3  -2 ]    →    [ |3|  |-2| ]  =  [ 3  2 ]    →    [ 3   0 ]
  [ 1  -5 ]         [ |1|  |-5| ]     [ 1  5 ]         [ 0  -5 ]
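A minimal sketch (PyTorch assumed) of this element-wise magnitude criterion on the 2×2 example above:

```python
import torch

w = torch.tensor([[3., -2.],
                  [1., -5.]])
importance = w.abs()                   # Importance = |W|
k = w.numel() // 2                     # prune 50%: keep 2 of 4 weights
threshold = importance.flatten().kthvalue(w.numel() - k).values
pruned = w * (importance > threshold)  # tensor([[ 3.,  0.], [ 0., -5.]])
```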


Selection of Neurons to Prune

• When removing neurons from a neural network model,
  • the less useful the removed neurons are,
  • the better the performance of the pruned neural network is.
• Recall: neuron pruning is in fact coarse-grained pruning.

[Figure: weight matrix view of neuron pruning in a linear layer and channel pruning in a convolution layer]


Lecture Plan

Today we will:
1. Go through all steps of pruning, and introduce how to select the pruning ratio and how to fine-tune in neural network pruning.
2. Introduce the Lottery Ticket Hypothesis in neural network pruning, which shows that training a sparse neural network from scratch is sometimes possible.
3. Introduce the system and hardware support for sparsity, and elaborate on how to translate the computation reduction into measured speedup on general-purpose and specialized hardware by utilizing weight sparsity (NVIDIA Tensor Core), activation sparsity (TorchSparse, PointAcc), and weight & activation sparsity (EIE).


Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer? (prune 30%? 50%? 70%?)
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 1: Pruning Ratio

How should we find per-layer pruning ratios?


Recap

Non-uniform pruning is better than uniform shrinking.

[Figure: ImageNet accuracy (%) vs. latency (ms); channel pruning with per-layer ratios (AMC) dominates uniform scaling / uniform shrinking of the network]

Question: how should we find the ratios for each layer?

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Finding Pruning Ratios

Analyze the sensitivity of each layer:
• We need different pruning ratios for each layer since different layers have different sensitivity:
  • Some layers are more sensitive (e.g., the first layer).
  • Some layers are more redundant.
• We can perform sensitivity analysis to determine the per-layer pruning ratio.


Finding Pruning Ratios

Analyze the sensitivity of each layer. The process of sensitivity analysis (*VGG-11 on the CIFAR-10 dataset):
• Pick a layer Li in the model.
• Prune the layer Li with pruning ratio r ∈ {0, 0.1, 0.2, ..., 0.9} (or other strides).
• Observe the accuracy degradation ΔAcc_i^r for each pruning ratio: the higher the pruning rate, the more the accuracy loss.
• Repeat the process for all layers. Some layers (e.g., L1) are less sensitive to pruning, while others (e.g., L0) are more sensitive.
• Pick a degradation threshold such that the overall pruning rate is as desired; the intersection of each layer's sensitivity curve with the threshold gives that layer's pruning rate.

[Figure: accuracy (%) vs. pruning rate (percentage of weights pruned away, 10%–90%) for layers L0–L5 of VGG-11 on CIFAR-10; a horizontal accuracy-degradation threshold selects a per-layer pruning rate from each curve]

Is this optimal?
• Maybe not: we do not consider the interaction between layers.
Can we go beyond the heuristics?
• Yes!
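A minimal sketch of this sensitivity scan (PyTorch assumed; `evaluate` is a placeholder that returns validation accuracy):

```python
import torch

@torch.no_grad()
def sensitivity_scan(model, evaluate,
                     ratios=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Prune one layer at a time at each ratio and record its accuracy curve."""
    curves = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                       # skip biases / norm parameters
            continue
        backup = param.detach().clone()
        accs = []
        for r in ratios:
            k = max(int(param.numel() * r), 1)    # number of weights to remove
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_((param.abs() > threshold).to(param.dtype))
            accs.append(evaluate(model))          # accuracy with only this layer pruned
            param.copy_(backup)                   # restore before the next ratio
        curves[name] = accs
    return curves
```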
Automatic Pruning

• Given an overall compression ratio, how do we choose per-layer pruning ratios?
• Sensitivity analysis ignores the interaction between layers -> sub-optimal.
• Conventionally, this process relies on human expertise and trial and error.

[Figure: engineers iterating by hand to meet customers' requirements]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Automatic Pruning

• Can we develop a push-the-button solution?

[Figure: hardware-centric AutoML bridges the gap; instead of requiring a machine learning expert plus a hardware expert, even a non-expert can obtain an efficient neural net]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
AMC: AutoML for Model Compression

Pruning as a reinforcement learning problem.

• Model compression by human: labor-consuming, sub-optimal.
• Model compression by AI: automated, higher compression rate, faster.

[Figure: the AMC engine, a DDPG actor-critic agent, processes the network layer by layer (layer t-1, t, t+1). State: layer embedding s_t = [N, C, H, W, i, ...]. Action: compress layer t with sparsity ratio a_t (e.g., 50%). Reward: R = -Error * log(FLOP). Environment: channel pruning, turning the original NN into a compressed NN.]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
AMC: AutoML for Model Compression

• AMC uses the following setups for the reinforcement learning problem:
  • State: 11 features (including layer indices, channel numbers, kernel sizes, FLOPs, ...)
  • Action: a continuous number (pruning ratio) a ∈ [0, 1)
  • Agent: DDPG agent, since it supports continuous action output
  • Reward: R = -Error if the compressed model satisfies the constraints, and -∞ if not
• We can also optimize for latency constraints with a pre-built lookup table (LUT).

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
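A minimal sketch of this hard-constrained reward (the figure on the previous slide shows the FLOPs-regularized alternative, R = -Error * log(FLOP)):

```python
def amc_reward(error: float, satisfies_constraints: bool) -> float:
    """Hard-constrained AMC reward: R = -Error if the compressed model
    meets the resource budget, otherwise -infinity."""
    return -error if satisfies_constraints else float("-inf")
```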
AMC: AutoML for Model Compression

[Figure: compression results of AutoML (AMC) vs. human expert*, smaller is better]

*Efficient Methods and Hardware for Deep Learning [Han, thesis]
AMC: AutoML for Model Compression

• Peaks: our RL agent automatically learns that 1×1 convolutions have less redundancy and can be pruned less.
• Crests: our RL agent automatically learns that 3×3 convolutions have more redundancy and can be pruned more.

[Figure: the pruning policy (sparsity ratio per layer) given by the reinforcement learning agent for ResNet-50, across residual blocks 1–4]


AMC: AutoML for Model Compression

Model              MACs   Top-1   Latency*   Speedup   Memory
1.0 MobileNet      569M   70.6%   119.0 ms   1.0×      20.1 MB
AMC (50% FLOPs)    285M   70.5%    64.4 ms   1.8×      14.3 MB
AMC (50% Time)     272M   70.2%    59.7 ms   2.0×      13.2 MB
0.75 MobileNet     325M   68.4%    69.5 ms   1.7×      14.8 MB

*Measured with TF-Lite on a Samsung Galaxy S7 Edge (Qualcomm Snapdragon SoC), single core, batch size = 1 (mobile, latency-oriented).


NetAdapt

A rule-based iterative/progressive method.

• The goal of NetAdapt is to find a per-layer pruning ratio that meets a global resource constraint (e.g., latency, energy, ...).
• The process is done iteratively.
• We take the latency constraint as an example.

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
NetAdapt

• For each iteration, we aim to reduce the latency by a certain amount Δ (manually defined).
• For each layer Lk (k in A–Z in the figure):
  • Prune the layer s.t. the latency reduction meets Δ (based on a pre-built lookup table).
  • Short-term fine-tune the model (10k iterations); measure the accuracy (Acc_A, ..., Acc_Z) after fine-tuning.
• Choose and prune the layer with the highest accuracy.
• Repeat until the total latency reduction satisfies the constraint.
• Long-term fine-tune to recover accuracy. (A sketch of the loop appears after this slide.)

[Figure: original model → prune each layer to reduce Δ → short-term fine-tune each candidate → pick the best → ... → long-term fine-tune → final model]

• The iterative nature allows us to obtain a series of models with different costs: #models = #iterations.

[Figure: the model series produced across iterations]

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
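A minimal sketch of the NetAdapt loop; all helpers (`prune_layer_to_meet`, `short_term_finetune`, `evaluate`, `latency`, `long_term_finetune`) are illustrative placeholders:

```python
import copy

def netadapt(model, delta, budget, layers, prune_layer_to_meet,
             short_term_finetune, evaluate, latency, long_term_finetune):
    """Each iteration: prune each candidate layer just enough to cut latency
    by delta, keep the candidate with the highest accuracy, repeat until the
    latency budget is met, then long-term fine-tune."""
    while latency(model) > budget:
        candidates = []
        for layer in layers:
            cand = copy.deepcopy(model)
            prune_layer_to_meet(cand, layer, delta)   # via the latency lookup table
            short_term_finetune(cand)                 # e.g., 10k iterations
            candidates.append((evaluate(cand), cand))
        _, model = max(candidates, key=lambda t: t[0])  # highest-accuracy candidate
    long_term_finetune(model)
    return model
```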
Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 2: Fine-tuning / Training

How should we improve the performance of sparse models?


Fine-tuning Pruned Neural Networks

• After pruning, the model accuracy may decrease, especially at larger pruning ratios.
• Fine-tuning the pruned neural network helps recover the accuracy and push the pruning ratio higher.
• The learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.

[Figure: accuracy loss vs. pruning ratio (40%–100% of parameters pruned away). Pruning alone degrades accuracy steeply (down to about -4.5%), while pruning + fine-tuning stays near 0% loss up to much higher ratios. Flow: Train Connectivity → Prune Connections → Train Weights]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Iterative Pruning

• Consider pruning followed by fine-tuning as one iteration.
• Iterative pruning gradually increases the target sparsity in each iteration (e.g., 30% pruned → 50% pruned → 70% pruned, fine-tuning after each step).
• This boosts the pruning ratio from 5× to 9× on AlexNet compared to single-step aggressive pruning.

[Figure: accuracy loss vs. pruning ratio (40%–100% of parameters pruned away). Iterative pruning and fine-tuning keeps near-zero accuracy loss at ratios where one-shot pruning, with or without fine-tuning, degrades sharply. Flow: Train Connectivity → Prune Connections → Train Weights, repeated]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
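A minimal sketch of this schedule; `magnitude_prune` and `finetune` are placeholder helpers:

```python
def iterative_prune(model, magnitude_prune, finetune,
                    sparsity_schedule=(0.3, 0.5, 0.7)):
    """One iteration = prune to the next target sparsity, then fine-tune."""
    for sparsity in sparsity_schedule:
        magnitude_prune(model, sparsity)  # zero the smallest-magnitude weights
        finetune(model)                   # recover accuracy before pruning further
    return model
```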
Regularization

• When training neural networks or fine-tuning pruned neural networks, a regularization term is added to the loss to
  • penalize non-zero parameters, and
  • encourage smaller parameters.
• The most common regularizers for improving pruning performance are L1/L2 regularization:
  • L1 regularization: L′ = L(x; W) + λ|W|
  • L2 regularization: L′ = L(x; W) + λ‖W‖²
• Examples:
  • Magnitude-based fine-grained pruning applies L2 regularization on weights.
  • Network Slimming applies smooth-L1 regularization on channel scaling factors.

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
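A minimal sketch (PyTorch assumed) of adding an L1 penalty to the task loss:

```python
import torch

def loss_with_l1(task_loss: torch.Tensor, model, lam: float = 1e-5) -> torch.Tensor:
    """L' = L(x; W) + lambda * sum(|W|): pushes weights toward zero so more
    of them fall below the magnitude-pruning threshold."""
    l1 = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return task_loss + lam * l1
```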
Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
The Lottery Ticket Hypothesis

"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations."
— The Lottery Ticket Hypothesis

[Figure: an initialized subnetwork (the winning ticket), trained for at most T epochs, matches the accuracy of the original dense network]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
Iterative Magnitude Pruning

Init → Train → Prune → (reset the surviving weights to their initialization) → Train → Prune → ... → Winning Ticket

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [Frankle et al., ICLR 2019]
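A minimal sketch of iterative magnitude pruning with weight rewinding (PyTorch assumed; `train` is a placeholder):

```python
import copy
import torch

def find_winning_ticket(model, train, rounds=3, prune_frac=0.2):
    """Each round: train, prune the smallest-magnitude fraction of surviving
    weights, then rewind the survivors to their original initialization."""
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p)
             for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name not in masks:
                    continue
                alive = p[masks[name].bool()].abs()
                k = max(int(prune_frac * alive.numel()), 1)
                threshold = alive.kthvalue(k).values
                masks[name] *= (p.abs() > threshold).float()  # prune smallest survivors
            model.load_state_dict(init_state)                 # rewind to init
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])                       # re-apply the ticket mask
    return model, masks
```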
System Support for Sparsity


System & Hardware Support for Sparsity

• EIE: weight sparsity + activation sparsity for GEMM
• NVIDIA Tensor Core: M:N weight sparsity
• TorchSparse & PointAcc: activation sparsity for sparse convolution


Proposed Paradigm

• Conventional: Training → Inference (slow, power-hungry).
• Proposed: Training → Model Compression → Accelerated Inference (fast, power-efficient).

Model compression: Han et al., NeurIPS'15; Han et al., ICLR'16 (best paper award). Accelerated inference: Han et al., ISCA'16; Han et al., FPGA'17 (best paper award); Han et al., ICLR'17.


EIE: Efficient Inference Engine

The first DNN accelerator for sparse, compressed models.

• 0 * A = 0: sparse weights, 90% static sparsity → 10× less computation, 5× less memory footprint.
• W * 0 = 0: sparse activations, 70% dynamic sparsity → 3× less computation.
• 2.09, 1.92 → 2: weight sharing, 4-bit weights → 8× less memory footprint.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity

• Logically: distribute the rows of the sparse weight matrix W across processing elements (PE0–PE3, interleaved), and compute b = ReLU(W a) on the sparse input a = (0, a1, 0, a3):

          col0   col1   col2   col3        PE
  row 0 [ w0,0   w0,1   0      w0,3 ]      PE0
  row 1 [ 0      0      w1,2   0    ]      PE1
  row 2 [ 0      w2,1   0      w2,3 ]      PE2
  row 3 [ 0      0      0      0    ]      PE3
  row 4 [ 0      0      w4,2   w4,3 ]      PE0
  row 5 [ w5,0   0      0      0    ]      PE1
  row 6 [ 0      0      0      w6,3 ]      PE2
  row 7 [ 0      w7,1   0      0    ]      PE3

  The outputs b0, ..., b7 pass through ReLU, which clips negative entries to 0.

• Physically: each PE stores only its own nonzeros in a compressed sparse column (CSC-like) format. For PE0 (rows 0 and 4):

  Virtual Weight:   W0,0  W0,1  W4,2  W0,3  W4,3
  Relative Index:   0     1     2     0     0
  Column Pointer:   0     1     2     3     5

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
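A minimal sketch (pure Python, illustrative) of this per-PE encoding; the relative index counts the zeros skipped since the previous nonzero in a column-major scan of the PE's rows:

```python
def encode_csc(rows):
    """Encode one PE's rows into EIE's virtual-weight / relative-index /
    column-pointer format (column-major scan over this PE's rows)."""
    values, rel_index, col_ptr = [], [], [0]
    zeros_since_last = 0
    for c in range(len(rows[0])):          # walk columns left to right
        for r in range(len(rows)):
            w = rows[r][c]
            if w == 0:
                zeros_since_last += 1
            else:
                values.append(w)
                rel_index.append(zeros_since_last)
                zeros_since_last = 0
        col_ptr.append(len(values))        # nonzeros seen up to this column
    return values, rel_index, col_ptr

# PE0 holds rows 0 and 4 of the example matrix:
vals, rel, ptr = encode_csc([["w00", "w01", 0, "w03"],
                             [0, 0, "w42", "w43"]])
# vals = ['w00','w01','w42','w03','w43'], rel = [0,1,2,0,0], ptr = [0,1,2,3,5]
```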
Dataflow

The same example, b = ReLU(W a) with a = (0, a1, 0, a3): the nonzero activations a1 and a3 are broadcast one at a time, and each PE multiplies the broadcast activation against only its own nonzero weights in that column, accumulating into the output b. Zero activations and zero weights are skipped entirely.

Rule of thumb: 0 * A = 0 and W * 0 = 0.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Micro-Architecture for Each PE

[Figure: PE pipeline. An activation queue (value + index) feeds pointer reads from even/odd pointer SRAM banks, which give the column start/end addresses for sparse-matrix SRAM access. A weight decoder expands the encoded (4-bit) weights, address accumulation turns relative indices into absolute addresses, and the arithmetic unit accumulates into destination/source activation registers with a bypass path, followed by ReLU, activation read/write, and leading-nonzero detection.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
What's Special: Walking Through the PE Pipeline

• Load balance: a per-PE activation queue (value + index) absorbs imbalance in the number of nonzeros assigned to each PE.
• Activation sparsity: only nonzero activations enter the queue; each one triggers a pointer read from the even/odd pointer SRAM banks.
• Weight sparsity: the sparse-matrix SRAM is accessed only between the column start/end addresses of the active column.
• Weight sharing: 4-bit encoded weights are decoded by the weight decoder, and the address accumulator turns relative indices into absolute addresses.
• Arithmetic & write-back: the arithmetic unit multiply-accumulates into the destination/source activation registers, with a bypass path for back-to-back updates to the same address.
• ReLU & non-zero detection: ReLU is applied on write-back to the activation SRAM, and leading non-zero detection produces the sparse activations for the next layer.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Post-Layout Results of EIE

• ALU width vs. accuracy: 32-bit float: no loss; 32-bit int: 0.3% loss; 16-bit int: 0.5% loss; 8-bit int: 27% loss.
• Other design parameters: FIFO queue depth, number of PEs.

Technology         40 nm
# PEs              64
On-chip SRAM       8 MB
Max model size     84 million parameters
Static sparsity    10×
Dynamic sparsity   3×
Quantization       4-bit
ALU width          16-bit
Area               40.8 mm²
MxV throughput     81,967 layers/s²
Power              586 mW

1. Post-layout result.
2. Throughput measured on AlexNet FC-7.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Benchmark

• CPU: Intel Core i7-5930K
• GPU: NVIDIA TitanX
• Mobile GPU: NVIDIA Jetson TK1

Layer            Size          Weight Density  Activation Density  FLOP Reduction  Description
AlexNet-6        4096 × 9216   9%              35%                 33×             AlexNet for image classification
AlexNet-7        4096 × 4096   9%              35%                 33×
AlexNet-8        1000 × 4096   25%             38%                 10×
VGG-6            4096 × 25088  4%              18%                 100×            VGG-16 for image classification
VGG-7            4096 × 4096   4%              37%                 50×
VGG-8            1000 × 4096   23%             41%                 10×
NeuralTalk-We    600 × 4096    10%             100%                10×             RNN and LSTM for image caption
NeuralTalk-Wd    8791 × 600    11%             100%                10×
NeuralTalk-LSTM  2400 × 1201   10%             100%                10×

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Throughput and Energy Efficiency

[Figure: throughput (layers/s, log scale) and energy efficiency (layers/J, log scale) across platforms: Core i7-5930K (22 nm CPU), TitanX (28 nm GPU), Tegra K1 (28 nm mobile GPU), A-Eye (28 nm FPGA), DaDianNao (28 nm ASIC), TrueNorth (28 nm ASIC), EIE (45 nm ASIC, 64 PEs), and EIE (28 nm ASIC, 256 PEs). EIE is orders of magnitude faster and more energy-efficient than the CPU, GPU, and mobile GPU.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Top-5 Most-Cited Papers in 50 Years of ISCA

Rank  Citations  Year  Title (★ = ISCA Influential Paper Award)                                         First Author + HOF Authors           Type   Topic
1     5351       1995  The SPLASH-2 programs: Characterization and methodological considerations        Stephen Woo, Anoop Gupta             Tool   Benchmark
2     4214       2017  In-datacenter performance analysis of a Tensor Processing Unit                   Norm Jouppi, David Patterson         Arch   Machine Learning
3     3834       2000  ★ Wattch: A framework for architectural-level power analysis and optimizations   David Brooks, Margaret Martonosi     Tool   Power
4     3386       1993  ★ Transactional memory: Architectural support for lock-free data structures      Maurice Herlihy                      Micro  Parallelism
5     2690       2016  EIE: Efficient inference engine on compressed deep neural network                Song Han, Bill Dally, Mark Horowitz  Arch   Machine Learning


Pros:
• EIE demonstrated that special-purpose hardware can make it cost-effective to do sparse operations with matrices that are up to 50% dense.
• EIE exploits both weight sparsity and activation sparsity: it not only saves energy by skipping zero weights, but also saves cycles by not computing on zero activations.
• EIE supports fine-grained sparsity, which allows pruning to achieve a higher pruning ratio.
• Aggressive weight quantization (4-bit) saves memory footprint. To maintain accuracy, EIE decodes the weights to 16-bit and uses 16-bit arithmetic. This W4A16 approach is reborn in LLMs: GPTQ, AWQ, llama.cpp, MLC LLM.

Cons:
• EIE isn't as easily applied to arrays of vector processors. Improvement: structured sparsity (N:M sparsity).
• EIE has control-flow overhead and storage overhead. Improvement: coarse-grained sparsity.
• EIE only supports FC layers (which are actually reborn in LLMs).
• EIE fits everything in SRAM: practical for TinyML, not for LLMs.
The first principle of efficient AI computing is to be lazy: avoid redundant computation, quickly reject the work, or delay the work.

• Generative AI: spatial sparsity [SIGE, NeurIPS'22]
• Transformer: token sparsity, progressive quantization [SpAtten, HPCA'21]
• Video: temporal sparsity [TSM, ICCV'19]
• Point cloud: spatial sparsity [TorchSparse, MLSys'22 & PointAcc, MICRO'21]

We envision that future AI models will be sparse at various granularities and structures. Co-designed with specialized accelerators, sparse models will become more efficient and accessible.


System & Hardware Support for Sparsity

• EIE: weight sparsity + activation sparsity for GEMM
• NVIDIA Tensor Core: M:N weight sparsity
• TorchSparse & PointAcc: activation sparsity for sparse convolution


M:N Sparsity

• Two weights are nonzero out of every four consecutive weights (2:4 sparsity).
• Push all the nonzero elements to the left in memory to save storage and computation: an R × C structured-sparse matrix W is compressed into R × C/2 nonzero values plus R × C/2 2-bit indices as metadata.

[Figure: fine-grained structured-sparse matrix format; a structured-sparse matrix W (R × C, with zero entries) is compressed into nonzero data values (R × C/2) plus 2-bit indices (R × C/2)]

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
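A minimal sketch (PyTorch assumed) of imposing a 2:4 pattern by keeping the two largest magnitudes in every aligned group of four:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in each group of 4 along the
    last dimension (assumes numel is divisible by 4)."""
    groups = weight.reshape(-1, 4)
    topk = groups.abs().topk(2, dim=1).indices          # 2 largest |w| per group
    mask = torch.zeros_like(groups).scatter_(1, topk, 1.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 8)
w_24 = prune_2_to_4(w)   # every aligned group of 4 now has exactly 2 zeros
```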
System Support for M:N Sparsity

Mapping M:N sparse matrices onto NVIDIA Tensor Cores:
• Dense operation on a Tensor Core: A (M × K, dense) times B (K × N, dense), accumulated into C (M × N): a dense M × N × K GEMM.
• Sparse operation on a Tensor Core: A is stored compressed (M × K/2 nonzero data values + 2-bit indices). A Select unit uses the indices to choose the matching K/2 elements out of every K elements of B, so only the selected inputs are multiplied: a sparse M × N × K GEMM.

The indices are used to mask out the inputs; only 2 multiplications are done out of every 4.

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
System Support for M:N Sparsity

[Figure: INT8 (TN) cuSPARSELt vs. cuBLAS GEMM performance on NVIDIA A100 Tensor Cores, GEMM-M = GEMM-N = 10240, GEMM-K from 1280 to 20480: larger GEMMs achieve nearly a 2× sparse-over-dense speedup.]

Accuracy:

Network                  Dense FP16  Sparse FP16  Sparse INT8
ResNet-34                73.7        73.9         73.7
ResNet-50                76.1        76.2         76.2
ResNet-50 (SWSL)         81.1        80.9         80.9
ResNet-101               77.7        78.0         77.9
ResNeXt-50-32x4          77.6        77.7         77.7
ResNeXt-101-32x16        79.7        79.9         79.9
ResNeXt-101-32x16 (WSL)  84.2        84.0         84.2
DenseNet-121             75.5        75.3         75.3
DenseNet-161             78.8        78.8         78.9
Wide ResNet-50           78.5        78.6         78.5
Wide ResNet-101          78.9        79.2         79.1
Inception v3             77.1        77.1         77.1
Xception                 79.2        79.2         79.2
VGG-11                   70.9        70.9         70.8
VGG-16                   74.0        74.1         74.1
VGG-19                   75.0        75.0         75.0
SUNet-128                75.6        76.0         75.4
SUNet-7-128              76.4        76.5         76.3
DRN-26                   75.2        75.3         75.3
DRN-105                  79.4        79.5         79.4

Pruning CNNs with 2:4 sparsity brings a large speedup for GEMM workloads without incurring an accuracy drop for DNN models.

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
System & Hardware Support for Sparsity

• EIE: weight sparsity + activation sparsity for GEMM
• NVIDIA Tensor Core: M:N weight sparsity
• TorchSparse & PointAcc: activation sparsity for sparse convolution
  • TorchSparse: sparse convolution library
  • PointAcc: hardware accelerator for sparse convolution


Sparse Convolution on Sparse Inputs

• Conventional convolution: input sparsity (e.g., ~0.01% nonzeros from the distribution in physical space, or sparsity from ReLU) dilates, since nonzeros spread to their neighbors after each layer.
• Sparse (submanifold) convolution: outputs are computed only at the input's nonzero sites, so the nonzeros will not dilate.

Submanifold Sparse Convolutional Neural Networks [Graham, BMVC 2015]
Sparse Convolution Computation

A sparse convolution is a sparse set of dense matrix multiplications, with the rules defined by maps of (In, Out, Wgt) tuples. Computation: f_Out = f_Out + f_In · W_Wgt for each entry in the maps.

For input point P0 in the example (3×3 kernel), the conventional convolution visits all nine weight offsets, but only two produce work under sparse convolution:

  Conventional map      Sparse map
  (P0, Q0,  W1,1)       no compute
  (P0, Q1,  W1,0)       no compute
  (P0, Q2,  W1,-1)      no compute
  (P0, Q3,  W0,1)       no compute
  (P0, Q4,  W0,0)       (P0, Q0, W0,0)
  (P0, Q5,  W0,-1)      no compute
  (P0, Q8,  W-1,1)      no compute
  (P0, Q9,  W-1,0)      no compute
  (P0, Q10, W-1,-1)     (P0, Q1, W-1,-1)

9 matrix multiplications → 2 matrix multiplications.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
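A minimal sketch (pure Python, illustrative) of building these maps with a coordinate hash table; only neighbors that are also nonzero sites produce entries:

```python
def build_maps(coords, kernel_offsets):
    """For each nonzero input coordinate and each kernel offset, emit an
    (input_idx, output_idx, offset) tuple only if the neighbor coordinate
    is also a nonzero site (submanifold: outputs live on input sites)."""
    index = {c: i for i, c in enumerate(coords)}   # hash table over coordinates
    maps = []
    for i, (x, y) in enumerate(coords):
        for (dx, dy) in kernel_offsets:
            j = index.get((x + dx, y + dy))
            if j is not None:
                maps.append((i, j, (dx, dy)))      # f_out[j] += f_in[i] @ W[dx,dy]
    return maps

offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]  # 3x3 kernel
print(build_maps([(0, 0), (2, 1)], offsets))  # only matching neighbors produce work
```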
Existing GPU implementation of sparse convolution
Weight-stationary computation: a separate matmul for each weight offset

Workload maps (In, Out, Wgt):
(P0, Q1, W-1,-1)
(P3, Q4, W-1,-1)
(P1, Q3, W-1,0)
(P0, Q0, W0,0)
(P1, Q1, W0,0)
(P2, Q2, W0,0)
(P3, Q3, W0,0)
(P4, Q4, W0,0)
(P3, Q1, W1,0)
(P1, Q0, W1,1)
(P4, Q3, W1,1)

For each weight offset, the engine gathers the mapped input features into a contiguous buffer, runs one matmul against that offset's Cin × Cout weight matrix, and scatter-accumulates the partial sums into the output features:
• W-1,-1: gather P0, P3 into a 2×Cin buffer; f1 = f1 + f0 · W-1,-1 and f4 = f4 + f3 · W-1,-1
• W-1,0: gather P1 into a 1×Cin buffer; f3 = f3 + f1 · W-1,0
• W0,0: the maps for W0,0 contain all entries, so the full 5×Cin feature matrix is used; fi = fi + fi · W0,0 for i = 0, 1, 2, 3, 4
• W1,0: gather P3 into a 1×Cin buffer; f1 = f1 + f3 · W1,0
• W1,1: gather P1, P4 into a 2×Cin buffer; f0 = f0 + f1 · W1,1 and f3 = f3 + f4 · W1,1
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 108
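As a sketch, this weight-stationary baseline is one gather → matmul → scatter-accumulate loop per weight offset. Grouping the maps by offset and using index_add_ for the scatter are my illustrative choices, not the library's internals.

```python
import torch
from collections import defaultdict

def weight_stationary_sparse_conv(in_feats, weights, maps, n_out):
    """Baseline: one gather -> matmul -> scatter-accumulate per weight offset."""
    c_out = next(iter(weights.values())).shape[1]
    out_feats = torch.zeros(n_out, c_out)

    # Group map entries by weight offset (weight-stationary order).
    by_offset = defaultdict(list)
    for in_idx, out_idx, offset in maps:
        by_offset[offset].append((in_idx, out_idx))

    for offset, entries in by_offset.items():
        in_idx = torch.tensor([i for i, _ in entries])
        out_idx = torch.tensor([o for _, o in entries])
        buf = in_feats[in_idx]                  # gather into the input buffer
        psum = buf @ weights[offset]            # one separate matmul per offset
        out_feats.index_add_(0, out_idx, psum)  # scatter-accumulate partial sums
    return out_feats
```

Each offset launches its own, often tiny, matmul kernel, which is what the next slides improve on.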
TorchSparse optimization overview
Gather → Matrix-Matrix Multiplication → Scatter-Accumulate, with locality-aware access on both ends and adaptive grouping in between

[Figure: input features F0-F4 are gathered per weight offset. Offsets with similarly sized workloads are zero-padded to a common size and batched into a single BMM:
(F0, F3) × W-1,-1 → (PSUM1, PSUM4)
(F1, pad) × W-1,0 → (PSUM3, unused)
(F3, pad) × W1,0 → (PSUM1, unused)
(F1, F4) × W1,1 → (PSUM0, PSUM3)
The dense W0,0 maps (F0-F4) are applied as one plain MM → (PSUM0 … PSUM4). All partial sums are then scatter-accumulated into the output features.]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 109
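A minimal sketch of the padded, batched matmul idea follows. The essence is padding each offset's gathered buffer to a common row count so torch.bmm can run them in one kernel; the exact grouping policy and the names here are assumptions for illustration.

```python
import torch

def grouped_bmm(in_feats, weights, by_offset, bucket_rows):
    """Pad each offset's gathered rows to bucket_rows, then run one batched matmul.

    by_offset: dict offset -> list of (in_idx, out_idx) map entries,
               each with len(entries) <= bucket_rows.
    Returns (psums, out_indices) for a later scatter-accumulate.
    """
    c_in = in_feats.shape[1]
    bufs, w_stack, out_indices = [], [], []
    for offset, entries in by_offset.items():
        buf = torch.zeros(bucket_rows, c_in)
        idx = torch.tensor([i for i, _ in entries])
        buf[: len(entries)] = in_feats[idx]  # gather, then zero-pad the tail
        bufs.append(buf)
        w_stack.append(weights[offset])
        out_indices.append([o for _, o in entries])
    # One BMM replaces len(by_offset) separate small matmuls.
    psums = torch.bmm(torch.stack(bufs), torch.stack(w_stack))
    return psums, out_indices
```

Padded rows multiply against the weights but their partial sums are never scattered; that wasted work is exactly the small amount of extra computation grouping trades for regularity.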
Trading computation for regularity

• Separate computation (baseline): seven separate matmuls, one per weight offset. Many kernel calls and low device utilization: no computation overhead, but the worst regularity.
• Dense convolution: pad every workload to the same size and run one BMM (batch = 7). The best regularity, but a large computation overhead.
• Computation with grouping: one plain MM plus a BMM (batch = 4) and a BMM (batch = 2). The extra computation is only 2 / 28 (a small overhead), while most of the regularity is recovered.

Each strategy trades computation overhead against computation regularity; grouping balances the two.
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 112
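The 2 / 28 figure is easy to verify: padding each group to its largest member adds padded rows, and the overhead is padded rows over total padded work. The per-offset workload sizes below are assumptions chosen so the arithmetic reproduces the slide's example.

```python
# Checking the padding overhead of a grouping strategy.
# The workload sizes are assumptions that reproduce the 2 / 28 example.
groups = [
    [8],           # plain MM
    [4, 4, 4, 3],  # BMM (batch=4), each workload padded to 4 rows
    [2, 1],        # BMM (batch=2), each workload padded to 2 rows
]

real = sum(sum(g) for g in groups)                 # 26 useful rows
padded = sum(len(g) * max(g) for g in groups)      # 28 rows actually computed
print(f"extra computation = {padded - real} / {padded}")  # -> 2 / 28
```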
Trading computation for regularity
Searching a customized grouping strategy for each model and dataset

[Figure: speedup over the separate-computation baseline vs. number of groups, from 26 groups (= 3³ − 1, assuming the offset (0,0,0) is computed separately) down to 1. Increasing regularity helps improve latency at first, with 13, 6, and 3 groups all beating the 26-group baseline and 6 groups giving the largest speedup in this example; padding overhead hurts latency as grouping becomes coarser, and a single group is the slowest.]

[Figure: map size per weight index (1-27) on SemanticKITTI and nuScenes, log scale from about 10² to 10⁵. Workload sizes vary by orders of magnitude across offsets and datasets, so the best grouping strategy is model- and dataset-dependent.]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 113
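The search over grouping strategies can be sketched as a small interval dynamic program over the workloads sorted by size. The cost model here (padded rows plus a fixed per-kernel launch penalty) and the launch_cost value are assumptions for illustration, not TorchSparse's actual autotuner.

```python
def best_grouping(sizes, launch_cost=3):
    """O(n^2) interval DP: group cost = padded rows (group size x largest
    member) + an assumed per-kernel launch penalty."""
    sizes = sorted(sizes, reverse=True)
    n = len(sizes)
    INF = float("inf")
    dp = [0.0] + [INF] * n          # dp[i]: best cost of grouping sizes[:i]
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):          # last group is sizes[j:i], padded to sizes[j]
            cost = dp[j] + (i - j) * sizes[j] + launch_cost
            if cost < dp[i]:
                dp[i], cut[i] = cost, j
    groups, i = [], n               # backtrack the chosen split points
    while i > 0:
        groups.append(sizes[cut[i]:i])
        i = cut[i]
    return dp[n], groups[::-1]

print(best_grouping([8, 4, 4, 4, 3, 2, 1]))
# -> (37.0, [[8], [4, 4, 4, 3], [2, 1]]), the grouping from the previous slide
```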
Results on matrix multiplication optimizations
SemanticKITTI

[Figure: TFLOP/s and normalized speedup for Baseline, Fixed Grouping, and Adaptive Grouping. Fixed grouping achieves the highest raw throughput (11.9 TFLOP/s vs. 8.1 for the baseline) yet only a 0.87× speedup, while adaptive grouping reaches 8.7 TFLOP/s and the best speedup of 1.39×.]
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 114
Results on matrix multiplication optimizations
nuScenes: fixed grouping has the best TFLOP/s, but adaptive grouping is faster

[Figure: Baseline 10.4 TFLOP/s (1.00×); Fixed Grouping 21.1 TFLOP/s (1.50×); Adaptive Grouping 16.9 TFLOP/s (1.54×).]

This is because fixed grouping introduces a large amount of redundant computation: its higher TFLOP/s counts padded work that never contributes to the output, so adaptive grouping finishes faster despite lower raw throughput.
TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 115
TorchSparse++: Overlapped memory and computation
Target applications: autonomous vehicles, 3D segmentation, 3D detection, 3D scene reconstruction, and on-device vision (iPhone 15 Pro).

TorchSparse++ has two components:
• Sparse Kernel Generator: dense-to-sparse adaptation and static-to-dynamic adaptation
• Sparse Autotuner: design space augmentation and group-based config tuning

[Figure: geomean speed of MinkowskiEngine, SpConv 1.2.1 (FP16), TorchSparse (FP16), and SpConv 2.3.5 (FP16), normalized to TorchSparse++ (FP16) = 1.00, on A100, 3090, Orin, 2080 Ti, 3090-TF32, and 1080 Ti-FP32 for inference, plus A100 and 2080 Ti for training. All baselines land between roughly 0.19 and 0.86, i.e., TorchSparse++ is the fastest on every platform.]

TorchSparse++ [Tang and Yang et al., MICRO 2023]


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 116
TorchSparse++: Overlapped memory and computation

[Figure: eight output points, each annotated with a 9-bit mask over the kernel offsets W-1,-1 … W1,1 marking which neighboring inputs exist. Outputs 0-3 form thread block 1 and outputs 4-7 form thread block 2; a block executes the union of its members' masks, so every member pays for every offset in the union. With this grouping the redundant computation is 34 weight applications. The mask values (25, 58, 52, 464, 17, 20, 272, 80) and their sorted ranks (1st-8th) shown in the figure suggest reordering outputs by mask similarity before grouping, which shrinks the redundancy.]

TorchSparse++ [Tang and Yang et al., MICRO 2023]


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 117
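As a small check on the figure's "redundant computation = 34", the sketch below reproduces the count from the eight masks: each thread block pays for the union of its members' offset masks. The block size of 4 and the sort-then-group variant are read off the figure, so treat them as assumptions.

```python
# Masks from the figure, one 9-bit set of kernel offsets per output point.
masks = [25, 58, 52, 464, 17, 20, 272, 80]  # outputs 0..7; 25 = 0b000011001

def redundancy(masks, block=4):
    """A thread block computes the union of its members' offset masks,
    so every member pays for every offset in the union, used or not."""
    extra = 0
    for b in range(0, len(masks), block):
        grp = masks[b:b + block]
        union = 0
        for m in grp:
            union |= m
        paid = len(grp) * bin(union).count("1")     # work the block executes
        used = sum(bin(m).count("1") for m in grp)  # work actually needed
        extra += paid - used
    return extra

print(redundancy(masks))          # naive grouping (outputs 0-3, 4-7) -> 34
print(redundancy(sorted(masks)))  # sort by mask value before grouping -> 26
```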
System & Hardware Support for Sparsity
• EIE: weight sparsity + activation sparsity for GEMM
• NVIDIA Tensor Core: M:N weight sparsity
• TorchSparse & PointAcc: activation sparsity for sparse convolution
  • TorchSparse: sparse convolution library
  • PointAcc: hardware accelerator for sparse convolution
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 118
Mapping Unit
Merge sort can be used to find the maps in sparse convolution

Input point cloud (stride = 1): P0 (1,1), P1 (2,2), P2 (2,4), P3 (3,2), P4 (4,3)
Output point cloud: Q0 (1,1), Q1 (2,2), Q2 (2,4), Q3 (3,2), Q4 (4,3)

To find the maps for W-1,-1 of the 3×3 kernel, shift the input coordinates by +(1, 1):
shifted inputs: P0 (2,2), P1 (3,3), P2 (3,5), P3 (4,3), P4 (5,4)

Merge sort interleaves the shifted inputs with the outputs so that matching coordinates become adjacent:
Q0 (1,1) | Q1 (2,2) = P0 | Q2 (2,4) | Q3 (3,2) | P1 (3,3) | P2 (3,5) | Q4 (4,3) = P3 | P4 (5,4)

The intersection (Q1 = P0, Q4 = P3) gives the map entries (In, Out, Wgt):
(P0, Q1, W-1,-1)
(P3, Q4, W-1,-1)

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
https://efficientml.ai 119
Mapping Unit
Merge sort can be used to find the maps in sparse convolution

To find the maps for W1,1, shift the input coordinates by (-1, -1) instead:
shifted inputs: P0 (0,0), P1 (1,1), P2 (1,3), P3 (2,1), P4 (3,2)

Merge sort:
P0 (0,0) | Q0 (1,1) = P1 | P2 (1,3) | P3 (2,1) | Q1 (2,2) | Q2 (2,4) | Q3 (3,2) = P4 | Q4 (4,3)

The intersection (Q0 = P1, Q3 = P4) adds two more entries, giving the accumulated maps (In, Out, Wgt):
(P0, Q1, W-1,-1)
(P3, Q4, W-1,-1)
(P1, Q0, W1,1)
(P4, Q3, W1,1)

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
https://efficientml.ai 120
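A minimal sketch of this mapping step, assuming unique integer 2D coordinates, is below: shift the inputs by each kernel offset, sort, and sweep the two sorted lists to collect matches. PointAcc performs this with dedicated merge-sort hardware; the Python here is only a functional model, and all names are illustrative.

```python
def build_maps(in_coords, out_coords, offsets):
    """Functional model of merge-sort-based map construction.

    in_coords, out_coords: lists of unique (x, y) integer coordinates.
    offsets: kernel offsets, e.g. [(-1, -1), ..., (1, 1)].
    Returns a list of (in_idx, out_idx, offset) map entries.
    """
    maps = []
    out_sorted = sorted(range(len(out_coords)), key=lambda j: out_coords[j])
    for dx, dy in offsets:
        # An input at p contributes to the output at p - (dx, dy) via W(dx, dy).
        shifted = sorted(range(len(in_coords)),
                         key=lambda i: (in_coords[i][0] - dx, in_coords[i][1] - dy))
        a = b = 0
        while a < len(shifted) and b < len(out_sorted):  # merge-style sweep
            i, j = shifted[a], out_sorted[b]
            p = (in_coords[i][0] - dx, in_coords[i][1] - dy)
            q = out_coords[j]
            if p == q:
                maps.append((i, j, (dx, dy)))
                a += 1
                b += 1
            elif p < q:
                a += 1
            else:
                b += 1
    return maps

pts = [(1, 1), (2, 2), (2, 4), (3, 2), (4, 3)]
print(build_maps(pts, pts, [(-1, -1), (1, 1)]))
# -> [(0, 1, (-1, -1)), (3, 4, (-1, -1)), (1, 0, (1, 1)), (4, 3, (1, 1))]
```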
PointAcc: Speedup and Energy Saving

[Figure: speedup and energy saving of PointAcc over an NVIDIA RTX 2080 Ti, an Intel Xeon Skylake + TPU V3, and an Intel Xeon Gold 6130, across point-cloud workloads (PointNet, PointNet++ variants, DGCNN, F-PointNet, MinkowskiNet, and their geomean). Speedups range from a few times over the GPU to up to 269× over the CPU baselines, and energy savings reach up to 1,319×.]

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
https://efficientml.ai 121
Summary of Today’s Lecture
In this lecture, we introduced:
• Automated ways to find pruning ratios
• The lottery ticket hypothesis
• System and hardware support for sparsity at different granularities

We will cover in the next lecture:
• Numeric data types in modern computer systems
• Basic concepts of neural network quantization
• Common neural network quantization methods
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 122
References
1. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
2. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W 2017]
3. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
4. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
5. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [Zhang et al., ECCV 2018]
6. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
7. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
8. EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
9. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA [Han et al., FPGA 2017]
10. Block Sparse Format [NVIDIA, 2021]
11. Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 123
