
AutoML for TinyML
with Once-for-All Network

Song Han
Massachusetts Institute of Technology

Once-for-All, ICLR'20

AutoML for TinyML with Once-for-All Network

[Diagram: conventional model design requires many engineers, a large model, and a lot of computation; the goal is fewer engineers (AutoML), a small model, and less computation (TinyML).]

Less Engineer Resources: AutoML
Less Computational Resources: TinyML

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Cloud AI: 32 GB memory, TFLOPS compute
Mobile AI: 4 GB memory, GFLOPS compute
Tiny AI (AIoT): <100 KB memory, <MFLOPS compute

• Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Design Cost (GPU hours): 200

for training iterations:
    forward-backward();

The design cost is calculated under the assumption of using MobileNet-v2.
Challenge: Efficient Inference on Diverse Hardware Platforms

Design Cost (GPU hours): 40K

(1) for search episodes:
        for training iterations:
            forward-backward();
        if good_model: break;
    for post-search training iterations:
        forward-backward();

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms [mobile devices from 2013, 2015, 2017, 2019]

(2) for devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours): 40K per device, 160K for four devices

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

(2) for many devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours): 40K → 160K → 1600K as the number of target devices grows

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

(2) for many devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours → CO2 emission):
    40K → 11.4k lbs CO2
    160K → 45.4k lbs CO2
    1600K → 454.4k lbs CO2

1 GPU hour translates to 0.284 lbs of CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

Once-for-All, ICLR'20
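
As a sanity check on these numbers, here is a minimal sketch of the cost model in Python (the 40K GPU hours per device and the 0.284 lbs/GPU-hour factor come from the slides; the linear scaling with device count is the point being illustrated):

GPU_HOURS_PER_DEVICE = 40_000     # MnasNet-style search + retrain, per platform
LBS_CO2_PER_GPU_HOUR = 0.284      # Strubell et al., ACL 2019

def conventional_nas_cost(num_devices: int):
    # Every new device repeats the full search, so cost grows linearly.
    gpu_hours = GPU_HOURS_PER_DEVICE * num_devices
    return gpu_hours, gpu_hours * LBS_CO2_PER_GPU_HOUR

for n in (1, 4, 40):
    hours, co2 = conventional_nas_cost(n)
    print(f"{n:>2} device(s): {hours:>9,} GPU hours, {co2:>9,.0f} lbs CO2")
# 1 device(s):     40,000 GPU hours,  11,360 lbs CO2  (~11.4k)
# 4 device(s):    160,000 GPU hours,  45,440 lbs CO2  (~45.4k)
# 40 device(s): 1,600,000 GPU hours, 454,400 lbs CO2  (~454.4k)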
Problem:
TinyML (inference) comes at the cost of BigML (training/search).

We need Green AI:
Solve the Environmental Problem of NAS

Evolved Transformer (ICML'19, ACL'19): 626,155 lbs CO2
Ours, Hardware-Aware Transformer (ACL'20): 52 lbs CO2, 4 orders of magnitude less
OFA: Decouple Training and Search

Conventional NAS:
    (2) for devices:
    (1)     for search episodes:
                for training iterations:
                    forward-backward();
                if good_model: break;
            for post-search training iterations:
                forward-backward();

Once-for-All (decouple training from search):
    # training
    for OFA training iterations:
        forward-backward();
    # search
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
        direct deploy without training;

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

# training (paid once)
for OFA training iterations:
    forward-backward();
# search (cheap, per device)
for devices:
    for search episodes:
        sample from OFA;
        if good_model: break;
    direct deploy without training;

Conventional design cost: 40K GPU hours → 11.4k lbs CO2; 160K → 45.4k lbs; 1600K → 454.4k lbs.
With a Once-for-All Network, the training cost no longer scales with the number of devices.

1 GPU hour translates to 0.284 lbs of CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

Once-for-All, ICLR'20
Once-for-All Network:
Decouple Model Training and Architecture Design

[Animation over four frames: a single once-for-all network is trained once; specialized sub-networks are then directly extracted from it for different deployment scenarios.]

Once-for-All, ICLR'20
Challenge: how to prevent different subnetworks
from interfering with each other?

Once-for-All, ICLR'20
Solution: Progressive Shrinking

• More than 10^19 different sub-networks are contained in a single once-for-all network, covering four dimensions: resolution, kernel size, depth, and width.
• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.

Progressive Shrinking pipeline:
Train the full model → Shrink the model (4 dimensions) → Jointly fine-tune both large and small sub-networks → Once-for-all network

• Small sub-networks are nested inside large sub-networks.
• The training process of the once-for-all network is cast as a progressive shrinking and joint fine-tuning process.

Once-for-All, ICLR'20
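
A minimal sketch of the training schedule in Python (the stage ordering and value lists follow the slides; the sampling helper and the loop lengths are illustrative, not the paper's exact recipe):

import random

# Progressive shrinking: unlock one elastic dimension at a time, always
# sampling sub-networks nested inside the already-trained larger ones.
# Resolution is elastic from the start: the input size is sampled per batch.
STAGES = [
    dict(ks=[7],       depth=[4],       width=[6]),        # train the full model
    dict(ks=[7, 5, 3], depth=[4],       width=[6]),        # + elastic kernel size
    dict(ks=[7, 5, 3], depth=[4, 3, 2], width=[6]),        # + elastic depth
    dict(ks=[7, 5, 3], depth=[4, 3, 2], width=[6, 4, 3]),  # + elastic width
]

def sample_subnet_config(stage: dict) -> dict:
    # Pick one sub-network from the currently unlocked space.
    return {dim: random.choice(choices) for dim, choices in stage.items()}

for stage in STAGES:
    for step in range(3):  # illustrative; real training runs many iterations
        cfg = sample_subnet_config(stage)
        # activate the sub-network, then run one forward-backward pass
        print(cfg)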
Connection to Network Pruning

Network Pruning:
Train the full model → Shrink the model (only width) → Fine-tune the small net → single pruned network

Progressive Shrinking:
Train the full model → Shrink the model (4 dimensions) → Fine-tune both large and small sub-nets → once-for-all network

• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across four dimensions.

Once-for-All, ICLR'20
Progressive Shrinking

[Diagram, repeated over many animation frames: each of the four dimensions (Resolution, Kernel Size, Depth, Width) starts Full and becomes Elastic, shrinking from full to partial one dimension at a time.]

Once-for-All, ICLR'20
Progressive Shrinking: Elastic Resolution

[Diagram: Resolution is shrunk from Full to Partial while Kernel Size, Depth, and Width remain full.]

Randomly sample the input image size for each batch.

Once-for-All, ICLR'20
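
A minimal sketch of elastic resolution in PyTorch (the resolution range follows the paper's R ∈ [128, 132, ..., 224]; resizing the batch with interpolation is one common way to implement per-batch resampling, not necessarily the released code's data-pipeline approach):

import random
import torch
import torch.nn.functional as F

RESOLUTIONS = list(range(128, 225, 4))  # R in [128, 132, ..., 224]

def resize_batch(images: torch.Tensor) -> torch.Tensor:
    # Randomly resample the input resolution for this batch.
    r = random.choice(RESOLUTIONS)
    return F.interpolate(images, size=(r, r), mode="bilinear", align_corners=False)

# inside the training loop:
#   images = resize_batch(images)
#   loss = criterion(model(images), labels); loss.backward()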
Progressive Shrinking: Elastic Kernel Size

[Diagram: a 7x7 kernel shrinks to 5x5 via a 25x25 transform matrix, and the 5x5 shrinks to 3x3 via a 9x9 transform matrix.]

• Start with the full kernel size.
• A smaller kernel takes the centered weights of the larger one via a transformation matrix.

Once-for-All, ICLR'20
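
A minimal sketch of the kernel transformation for a depthwise convolution (the module and parameter names are illustrative; the centered crop plus the 25x25 and 9x9 linear transforms follow the slide's diagram):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernelConv(nn.Module):
    # Depthwise conv whose 5x5 and 3x3 kernels are derived from the 7x7 weights.
    def __init__(self, channels: int):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(channels, 1, 7, 7) * 0.01)
        self.transform5 = nn.Parameter(torch.eye(25))  # maps flattened 5x5 crops
        self.transform3 = nn.Parameter(torch.eye(9))   # maps flattened 3x3 crops

    def get_kernel(self, ks: int) -> torch.Tensor:
        if ks == 7:
            return self.weight7
        w = self.weight7[:, :, 1:6, 1:6].reshape(-1, 25) @ self.transform5  # centered 5x5
        if ks == 5:
            return w.reshape(-1, 1, 5, 5)
        w = w.reshape(-1, 5, 5)[:, 1:4, 1:4].reshape(-1, 9) @ self.transform3  # centered 3x3
        return w.reshape(-1, 1, 3, 3)

    def forward(self, x: torch.Tensor, ks: int = 7) -> torch.Tensor:
        return F.conv2d(x, self.get_kernel(ks), padding=ks // 2, groups=x.shape[1])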
Progressive Shrinking: Elastic Depth

[Diagram, unit i: train with full depth (layers O1, O2, O3, ...) → shrink the depth by skipping the last layer → shrink the depth further by skipping the last two layers.]

• Gradually allow later layers in each unit to be skipped to reduce the depth.
• The first D layers of each unit are kept, so shallow sub-networks share weights with deep ones.

Once-for-All, ICLR'20
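
A minimal sketch of an elastic-depth unit in PyTorch (the class and attribute names are illustrative; the key point, running only the first D layers, matches the "keep the first D layers at each unit" rule from the paper's figure):

import torch.nn as nn

class ElasticDepthUnit(nn.Module):
    # A unit whose later layers can be skipped: a depth-D sub-network
    # runs only the first D layers, so weights are shared across depths.
    def __init__(self, layers, max_depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.active_depth = max_depth

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)
        return x

# usage: unit.active_depth = 2  ->  the later layers are skipped at inference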
Progressive Shrinking: Elastic Width

[Diagram: compute a channel importance score per channel (e.g., 0.82, 0.46, 0.11), sort the channels, reorganize the weights, then progressively shrink the width by keeping only the top channels.]

• Train with full width, then gradually shrink the width.
• Keep the most important channels when shrinking, via channel sorting.

Once-for-All, ICLR'20
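
A minimal sketch of channel sorting in PyTorch (the paper scores each channel by the L1 norm of its weights; the helper below is illustrative and skips details such as reordering batch-norm statistics):

import torch
import torch.nn as nn

@torch.no_grad()
def sort_channels_by_importance(conv: nn.Conv2d, next_conv: nn.Conv2d) -> None:
    # Importance = L1 norm of each output channel's weights; after sorting,
    # keeping the first k channels keeps the k most important ones.
    importance = conv.weight.abs().sum(dim=(1, 2, 3))
    order = torch.argsort(importance, descending=True)
    conv.weight.copy_(conv.weight[order])
    if conv.bias is not None:
        conv.bias.copy_(conv.bias[order])
    next_conv.weight.copy_(next_conv.weight[:, order])  # keep the next layer consistent
    # (a real implementation must also reorder any batch-norm parameters in between)

# A width-W sub-network then simply uses the first W channels.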
Progressive Shrinking: putting it together

Train full network (K = 7, D = 4, W = 6)
→ Elastic Resolution: sample R ∈ [128, 132, ..., 224] for each batch
→ Elastic Kernel Size (D = 4, W = 6, K ∈ [7, 5, 3]): sample K at each layer; generate kernel weights (Fig. 3); fine-tune weights & transformation matrices
→ Elastic Depth (D ∈ [4, 3, 2], K ∈ [7, 5, 3]): sample D at each unit, keeping the first D layers and skipping the top 4−D (Fig. 3); sample K; fine-tune weights
→ Elastic Width (W ∈ [6, 4, 3]): channel sorting (Fig. 4); sample W at each layer; sample K, D; fine-tune weights
→ Once-for-all Network

Figure 2: Illustration of the progressive shrinking process to support different depth D, width W, kernel size K and resolution R. It leads to a large space comprising diverse sub-networks (> 10^19).

Once-for-All, ICLR'20
Performances of Sub-networks on ImageNet

[Bar chart: ImageNet top-1 accuracy of sub-networks trained without vs. with progressive shrinking (w/o PS vs. w/ PS), across eight configurations spanning D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7}. Progressive shrinking improves top-1 accuracy by 2.5% to 3.7% in every configuration.]

D: depth, W: width, K: kernel size

• Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

Once-for-All, ICLR'20
How about search?

# training
for OFA training iterations:
    forward-backward();
# decouple, then search
for devices:
    for search episodes:
        sample from OFA;  # with evolutionary search
        if good_model: break;
    direct deploy without training;

Once-for-All, ICLR'20
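
A minimal sketch of the search stage (the function names and predictor interfaces are illustrative; the idea, an evolutionary search scored by cheap accuracy and latency predictors instead of training, is what makes the per-device cost negligible):

import random

def evolutionary_search(sample_cfg, mutate, acc_of, latency_of,
                        latency_limit_ms: float,
                        generations: int = 100, population: int = 100):
    # Search the OFA space for the most accurate sub-network under a
    # latency budget. No training happens here: candidates are scored
    # by predictors, so each device's search costs almost nothing.
    pop = [c for c in (sample_cfg() for _ in range(population))
           if latency_of(c) <= latency_limit_ms]
    for _ in range(generations):
        pop.sort(key=acc_of, reverse=True)
        parents = pop[: max(1, population // 4)]  # keep the top quartile
        children = [mutate(random.choice(parents)) for _ in range(population)]
        pop = parents + [c for c in children if latency_of(c) <= latency_limit_ms]
    return max(pop, key=acc_of)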
2.6x faster than EfficientNet, 1.5x faster than MobileNetV3

[Plots: ImageNet top-1 accuracy vs. Google Pixel1 latency. OFA reaches 80.1% top-1 and matches EfficientNet's accuracy at 2.6x lower latency, or delivers 3.8% higher accuracy at the same latency. Against MobileNetV3, OFA is 1.5x faster at the same accuracy, or 4% more accurate at the same latency.]

• Training from scratch cannot achieve the same level of accuracy.

Once-for-All, ICLR'20
More accurate than training from scratch

[Same plots, with an added "OFA - Train from scratch" curve: the same architectures trained from scratch consistently reach lower accuracy than the sub-networks directly extracted from the once-for-all network.]

• Training from scratch cannot achieve the same level of accuracy.

Once-for-All, ICLR'20
OFA: 80% Top-1 Accuracy on ImageNet

[Scatter plot: ImageNet top-1 accuracy vs. MACs (billions; lower is better), with marker size indicating model size (2M-64M parameters). Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, up to 14x less computation than prior models of similar accuracy. Baselines include handcrafted models (MobileNetV1/V2, ShuffleNet, ResNet-50/101, DenseNet, InceptionV2/V3, Xception, ResNeXt-50/101, DPN-92) and AutoML models (EfficientNet, MobileNetV3, NASNet-A, AmoebaNet, PNASNet, DARTS, ProxylessNAS, IGCV3-D).]

• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile vision setting (< 600M MACs).

Once-for-All, ICLR'20
OFA Enables Fast Specialization on Diverse Hardware Platforms

[Six plots: ImageNet top-1 accuracy vs. latency for OFA, MobileNetV3, and MobileNetV2 on Samsung S7 Edge, Google Pixel2, LG G8, NVIDIA 1080Ti (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized). On every platform, OFA's specialized sub-networks dominate the accuracy-latency trade-off of both baselines.]

Once-for-All, ICLR'20
Diverse Hardware Platforms: 50+ Pretrained Models are Released

Once-for-All, ICLR'20
OFA for FPGA Accelerators

[Bar charts, measured on the Xilinx ZU3EG FPGA: compared with MobileNetV2 and MnasNet, the OFA model achieves 40% higher arithmetic intensity (OPS/Byte) and 57% higher throughput (GOPS/s).]

• Non-specialized neural networks do not fully utilize the hardware resources. There is large room for improvement via neural network specialization.

Once-for-All, ICLR'20
We need Green AI:
Solve the Environmental Problem of NAS

How to save CO2 emission:

1. Once-for-All: amortize the search cost across many sub-networks and deployment scenarios. (Once-for-All, ICLR'20)
2. Lite Transformer: human-in-the-loop design; apply human insights into hardware & ML rather than "just search it". (Lite Transformer, ICLR'20)


OFA has broad applications

• Efficient Transformer

• Efficient Video Recognition

• Efficient 3D Vision

• Efficient GAN Compression


OFA's Application: Hardware-Aware Transformer (HAT)

Efficient NLP on mobile devices enables real-time conversation between speakers using different languages:
"Nice to meet you" → "Encantada de conocerte" / "Freut mich, dich kennenzulernen"

Figure 9: The design cost, measured in pounds of CO2 emission:
    Human Life (avg. 1 year): 11,023
    American Life (avg. 1 year): 36,156
    US Car w/ Fuel (avg. 1 lifetime): 126,000
    Evolved Transformer: 626,155
    HAT (Ours): 52 (12,041x less CO2 than Evolved Transformer)
Our framework for searching HAT reduces the search cost by four orders of magnitude compared with Evolved Transformer (So et al., 2019).

Table 5: K-means quantization of HAT models on WMT'14 En-Fr:

    Model                 BLEU    Model Size    Reduction
    Transformer Float32   41.2    705 MB        –
    HAT Float32           41.8    227 MB        3x
    HAT 8 bits            41.9    57 MB         12x
    HAT 4 bits            41.1    28 MB         25x

4-bit quantization reduces model size by 25x with only a 0.1 BLEU loss against the transformer baseline; 8-bit quantization even increases BLEU by 0.1 over the float version. On WMT'14 En-De, HAT matches the Transformer baseline's performance while being 3.7x smaller and 3x, 1.6x, 1.5x faster on Raspberry Pi, CPU, and GPU.

HAT, ACL'20
OFA's Application: Efficient Video Recognition

[Plot: Kinetics top-1 accuracy vs. computation (GFLOPs) for OFA + TSM (large and small), ResNet50 + TSM, ResNet50 + I3D, and MobileNetV2 + TSM.]

• 7x less computation at the same accuracy as TSM + ResNet50.
• 3% higher accuracy at the same computation as TSM + MobileNetV2.

TSM, ICCV'19
OFA's Application: Efficient 3D Recognition

Today's 3D workloads are compute-hungry: AR/VR can require a whole backpack of computers; self-driving, a whole trunk of GPUs.

[Plot: accuracy vs. latency trade-off.]

• 4x FLOPs reduction and 2x speedup over MinkowskiNet.
• 3.6% better accuracy under the same computation budget.

Follow-up of PVCNN, NeurIPS'19 (spotlight)
OFA's Application: GAN Compression

• 8-21x FLOPs reduction on CycleGAN, Pix2pix, and GauGAN.
• 1.7x-18.5x speedup on CPU/GPU & mobile CPU/GPU.

GAN Compression, CVPR'20


Summary: Once-for-All Network
• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.

Progressive Shrinking:
Train the full model → Shrink the model in 4 dimensions → Fine-tune both large and small sub-nets → once-for-all network

• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19.
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19, both classification & detection.
• Released 50+ different pre-trained OFA models for diverse hardware platforms (CPU/GPU/FPGA/DSP):
  net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & a pre-trained OFA network that provides diverse sub-networks without training (see the usage sketch below):
  ofa_network = ofa_net(net_id, pretrained=True)

Project Page: https://ofa.mit.edu
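
For reference, a short usage sketch (based on the mit-han-lab/once-for-all repository's README; the net_id string and the sub-network setter arguments are assumptions drawn from that repo, so check the project page for the exact interface):

import torch
from ofa.model_zoo import ofa_net  # from the released once-for-all package

# Load the pretrained once-for-all network.
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Activate one sub-network: kernel size 5, depth 3, expand ratio 4.
ofa_network.set_active_subnet(ks=5, d=3, e=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)

# Direct deploy: the extracted sub-network needs no retraining.
logits = subnet(torch.randn(1, 3, 224, 224))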


References
Model Compression & NAS
- Once-For-All: Train One Network and Specialize It for Efficient Deployment, ICLR’20
- ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, ICLR’19
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR’20
- HAQ: Hardware-Aware Automated Quantization with Mixed Precision, CVPR’19
- Defensive Quantization: When Efficiency Meets Robustness, ICLR’19
- AMC: AutoML for Model Compression and Acceleration on Mobile Devices, ECCV’18

Efficient Vision:
- GAN Compression: Learning Efficient Architectures for Conditional GANs, CVPR’20
- TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
- PVCNN: Point Voxel CNN for Efficient 3D Deep Learning, NeurIPS’19

Efficient NLP:
- Lite Transformer with Long Short Term Attention, ICLR’20
- HAT: Hardware-aware Transformer, ACL’20

Hardware & EDA:


- SpArch: Efficient Architecture for Sparse Matrix Multiplication, HPCA’20
- Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning, DAC’20

Make AI Efficient:
Tiny Computational Resources
Tiny Human Resources

Media Coverage:

Website: songhan.mit.edu github.com/mit-han-lab youtube.com/c/MITHANLab
