
AutoML for TinyML
with Once-for-All Network

Song Han
Massachusetts Institute of Technology

Once-for-All, ICLR'20

AutoML for TinyML with Once-for-All Network

[Diagram: conventional model design requires many engineers, a large model, and a lot of computation; the goal is fewer engineers (AutoML), a small model, and less computation (TinyML).]

Less Engineer Resources: AutoML
Less Computational Resources: TinyML

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Cloud AI: 32 GB memory, TFLOPS compute
Mobile AI: 4 GB memory, GFLOPS compute
Tiny AI (AIoT): <100 KB memory, <MFLOPS compute

• Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Design Cost (GPU hours): 200

for training iterations:
    forward-backward();

The design cost is calculated under the assumption of using MobileNet-v2.
Challenge: Efficient Inference on Diverse Hardware Platforms

Design Cost (GPU hours): 40K

(1) for search episodes:
        for training iterations:
            forward-backward();
        if good_model: break;
    for post-search training iterations:
        forward-backward();

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms [mobile devices from 2013, 2015, 2017, 2019]

(2) for devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours): 40K per device, 160K for four devices

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

(2) for many devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours): 40K → 160K → 1600K as the number of target devices grows

The design cost is calculated under the assumption of using MnasNet [1].

Once-for-All, ICLR'20
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

(2) for many devices:
(1)     for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Design Cost (GPU hours → CO2 emission):
    40K → 11.4k lbs CO2
    160K → 45.4k lbs CO2
    1600K → 454.4k lbs CO2

1 GPU hour translates to 0.284 lbs of CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

Once-for-All, ICLR'20
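
As a sanity check on these numbers, here is a minimal sketch of the cost model in Python (the 40K GPU hours per device and the 0.284 lbs/GPU-hour factor come from the slides; the linear scaling with device count is the point being illustrated):

GPU_HOURS_PER_DEVICE = 40_000     # MnasNet-style search + retrain, per platform
LBS_CO2_PER_GPU_HOUR = 0.284      # Strubell et al., ACL 2019

def conventional_nas_cost(num_devices: int):
    # Every new device repeats the full search, so cost grows linearly.
    gpu_hours = GPU_HOURS_PER_DEVICE * num_devices
    return gpu_hours, gpu_hours * LBS_CO2_PER_GPU_HOUR

for n in (1, 4, 40):
    hours, co2 = conventional_nas_cost(n)
    print(f"{n:>2} device(s): {hours:>9,} GPU hours, {co2:>9,.0f} lbs CO2")
# 1 device(s):     40,000 GPU hours,  11,360 lbs CO2  (~11.4k)
# 4 device(s):    160,000 GPU hours,  45,440 lbs CO2  (~45.4k)
# 40 device(s): 1,600,000 GPU hours, 454,400 lbs CO2  (~454.4k)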
Problem:
TinyML (inference) comes at the cost of BigML (training/search).

We need Green AI:
Solve the Environmental Problem of NAS

Evolved Transformer (ICML'19, ACL'19): 626,155 lbs CO2
Ours, Hardware-Aware Transformer (ACL'20): 52 lbs CO2, 4 orders of magnitude less
OFA: Decouple Training and Search

Conventional NAS:
    (2) for devices:
    (1)     for search episodes:
                for training iterations:
                    forward-backward();
                if good_model: break;
            for post-search training iterations:
                forward-backward();

Once-for-All (decouple training from search):
    # training
    for OFA training iterations:
        forward-backward();
    # search
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
        direct deploy without training;

Once-for-All, ICLR'20
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)

# training (paid once)
for OFA training iterations:
    forward-backward();
# search (cheap, per device)
for devices:
    for search episodes:
        sample from OFA;
        if good_model: break;
    direct deploy without training;

Conventional design cost: 40K GPU hours → 11.4k lbs CO2; 160K → 45.4k lbs; 1600K → 454.4k lbs.
With a Once-for-All Network, the training cost no longer scales with the number of devices.

1 GPU hour translates to 0.284 lbs of CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

Once-for-All, ICLR'20
Once-for-All Network:
Decouple Model Training and Architecture Design

[Animation over four frames: a single once-for-all network is trained once; specialized sub-networks are then directly extracted from it for different deployment scenarios.]

Once-for-All, ICLR'20
Challenge: how to prevent different subnetworks
from interfering with each other?

Once-for-All, ICLR'20
Solution: Progressive Shrinking

• More than 10^19 different sub-networks are contained in a single once-for-all network, covering four dimensions: resolution, kernel size, depth, and width.
• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.

Progressive Shrinking pipeline:
Train the full model → Shrink the model (4 dimensions) → Jointly fine-tune both large and small sub-networks → Once-for-all network

• Small sub-networks are nested inside large sub-networks.
• The training process of the once-for-all network is cast as a progressive shrinking and joint fine-tuning process.

Once-for-All, ICLR'20
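
A minimal sketch of the training schedule in Python (the stage ordering and value lists follow the slides; the sampling helper and the loop lengths are illustrative, not the paper's exact recipe):

import random

# Progressive shrinking: unlock one elastic dimension at a time, always
# sampling sub-networks nested inside the already-trained larger ones.
# Resolution is elastic from the start: the input size is sampled per batch.
STAGES = [
    dict(ks=[7],       depth=[4],       width=[6]),        # train the full model
    dict(ks=[7, 5, 3], depth=[4],       width=[6]),        # + elastic kernel size
    dict(ks=[7, 5, 3], depth=[4, 3, 2], width=[6]),        # + elastic depth
    dict(ks=[7, 5, 3], depth=[4, 3, 2], width=[6, 4, 3]),  # + elastic width
]

def sample_subnet_config(stage: dict) -> dict:
    # Pick one sub-network from the currently unlocked space.
    return {dim: random.choice(choices) for dim, choices in stage.items()}

for stage in STAGES:
    for step in range(3):  # illustrative; real training runs many iterations
        cfg = sample_subnet_config(stage)
        # activate the sub-network, then run one forward-backward pass
        print(cfg)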
Connection to Network Pruning

Network Pruning:
Train the full model → Shrink the model (only width) → Fine-tune the small net → single pruned network

Progressive Shrinking:
Train the full model → Shrink the model (4 dimensions) → Fine-tune both large and small sub-nets → once-for-all network

• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across four dimensions.

Once-for-All, ICLR'20
Progressive Shrinking

[Diagram, repeated over many animation frames: each of the four dimensions (Resolution, Kernel Size, Depth, Width) starts Full and becomes Elastic, shrinking from full to partial one dimension at a time.]

Once-for-All, ICLR'20
Progressive Shrinking: Elastic Resolution

[Diagram: Resolution is shrunk from Full to Partial while Kernel Size, Depth, and Width remain full.]

Randomly sample the input image size for each batch.

Once-for-All, ICLR'20
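
A minimal sketch of elastic resolution in PyTorch (the resolution range follows the paper's R ∈ [128, 132, ..., 224]; resizing the batch with interpolation is one common way to implement per-batch resampling, not necessarily the released code's data-pipeline approach):

import random
import torch
import torch.nn.functional as F

RESOLUTIONS = list(range(128, 225, 4))  # R in [128, 132, ..., 224]

def resize_batch(images: torch.Tensor) -> torch.Tensor:
    # Randomly resample the input resolution for this batch.
    r = random.choice(RESOLUTIONS)
    return F.interpolate(images, size=(r, r), mode="bilinear", align_corners=False)

# inside the training loop:
#   images = resize_batch(images)
#   loss = criterion(model(images), labels); loss.backward()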
Progressive Shrinking: Elastic Kernel Size

[Diagram: a 7x7 kernel shrinks to 5x5 via a 25x25 transform matrix, and the 5x5 shrinks to 3x3 via a 9x9 transform matrix.]

• Start with the full kernel size.
• A smaller kernel takes the centered weights of the larger one via a transformation matrix.

Once-for-All, ICLR'20
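
A minimal sketch of the kernel transformation for a depthwise convolution (the module and parameter names are illustrative; the centered crop plus the 25x25 and 9x9 linear transforms follow the slide's diagram):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernelConv(nn.Module):
    # Depthwise conv whose 5x5 and 3x3 kernels are derived from the 7x7 weights.
    def __init__(self, channels: int):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(channels, 1, 7, 7) * 0.01)
        self.transform5 = nn.Parameter(torch.eye(25))  # maps flattened 5x5 crops
        self.transform3 = nn.Parameter(torch.eye(9))   # maps flattened 3x3 crops

    def get_kernel(self, ks: int) -> torch.Tensor:
        if ks == 7:
            return self.weight7
        w = self.weight7[:, :, 1:6, 1:6].reshape(-1, 25) @ self.transform5  # centered 5x5
        if ks == 5:
            return w.reshape(-1, 1, 5, 5)
        w = w.reshape(-1, 5, 5)[:, 1:4, 1:4].reshape(-1, 9) @ self.transform3  # centered 3x3
        return w.reshape(-1, 1, 3, 3)

    def forward(self, x: torch.Tensor, ks: int = 7) -> torch.Tensor:
        return F.conv2d(x, self.get_kernel(ks), padding=ks // 2, groups=x.shape[1])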
Progressive Shrinking: Elastic Depth

[Diagram, unit i: train with full depth (layers O1, O2, O3, ...) → shrink the depth by skipping the last layer → shrink the depth further by skipping the last two layers.]

• Gradually allow later layers in each unit to be skipped to reduce the depth.
• The first D layers of each unit are kept, so shallow sub-networks share weights with deep ones.

Once-for-All, ICLR'20
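
A minimal sketch of an elastic-depth unit in PyTorch (the class and attribute names are illustrative; the key point, running only the first D layers, matches the "keep the first D layers at each unit" rule from the paper's figure):

import torch.nn as nn

class ElasticDepthUnit(nn.Module):
    # A unit whose later layers can be skipped: a depth-D sub-network
    # runs only the first D layers, so weights are shared across depths.
    def __init__(self, layers, max_depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.active_depth = max_depth

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)
        return x

# usage: unit.active_depth = 2  ->  the later layers are skipped at inference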
Progressive Shrinking: Elastic Width

[Diagram: compute a channel importance score per channel (e.g., 0.82, 0.46, 0.11), sort the channels, reorganize the weights, then progressively shrink the width by keeping only the top channels.]

• Train with full width, then gradually shrink the width.
• Keep the most important channels when shrinking, via channel sorting.

Once-for-All, ICLR'20
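
A minimal sketch of channel sorting in PyTorch (the paper scores each channel by the L1 norm of its weights; the helper below is illustrative and skips details such as reordering batch-norm statistics):

import torch
import torch.nn as nn

@torch.no_grad()
def sort_channels_by_importance(conv: nn.Conv2d, next_conv: nn.Conv2d) -> None:
    # Importance = L1 norm of each output channel's weights; after sorting,
    # keeping the first k channels keeps the k most important ones.
    importance = conv.weight.abs().sum(dim=(1, 2, 3))
    order = torch.argsort(importance, descending=True)
    conv.weight.copy_(conv.weight[order])
    if conv.bias is not None:
        conv.bias.copy_(conv.bias[order])
    next_conv.weight.copy_(next_conv.weight[:, order])  # keep the next layer consistent
    # (a real implementation must also reorder any batch-norm parameters in between)

# A width-W sub-network then simply uses the first W channels.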
Progressive Shrinking: putting it together

Train full network (K = 7, D = 4, W = 6)
→ Elastic Resolution: sample R ∈ [128, 132, ..., 224] for each batch
→ Elastic Kernel Size (D = 4, W = 6, K ∈ [7, 5, 3]): sample K at each layer; generate kernel weights (Fig. 3); fine-tune weights & transformation matrices
→ Elastic Depth (D ∈ [4, 3, 2], K ∈ [7, 5, 3]): sample D at each unit, keeping the first D layers and skipping the top 4−D (Fig. 3); sample K; fine-tune weights
→ Elastic Width (W ∈ [6, 4, 3]): channel sorting (Fig. 4); sample W at each layer; sample K, D; fine-tune weights
→ Once-for-all Network

Figure 2: Illustration of the progressive shrinking process to support different depth D, width W, kernel size K and resolution R. It leads to a large space comprising diverse sub-networks (> 10^19).

Once-for-All, ICLR'20
Performances of Sub-networks on ImageNet

[Bar chart: ImageNet top-1 accuracy of sub-networks trained without vs. with progressive shrinking (w/o PS vs. w/ PS), across eight configurations spanning D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7}. Progressive shrinking improves top-1 accuracy by 2.5% to 3.7% in every configuration.]

D: depth, W: width, K: kernel size

• Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

Once-for-All, ICLR'20
How about search?

# training
for OFA training iterations:
    forward-backward();
# decouple, then search
for devices:
    for search episodes:
        sample from OFA;  # with evolutionary search
        if good_model: break;
    direct deploy without training;

Once-for-All, ICLR'20
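
A minimal sketch of the search stage (the function names and predictor interfaces are illustrative; the idea, an evolutionary search scored by cheap accuracy and latency predictors instead of training, is what makes the per-device cost negligible):

import random

def evolutionary_search(sample_cfg, mutate, acc_of, latency_of,
                        latency_limit_ms: float,
                        generations: int = 100, population: int = 100):
    # Search the OFA space for the most accurate sub-network under a
    # latency budget. No training happens here: candidates are scored
    # by predictors, so each device's search costs almost nothing.
    pop = [c for c in (sample_cfg() for _ in range(population))
           if latency_of(c) <= latency_limit_ms]
    for _ in range(generations):
        pop.sort(key=acc_of, reverse=True)
        parents = pop[: max(1, population // 4)]  # keep the top quartile
        children = [mutate(random.choice(parents)) for _ in range(population)]
        pop = parents + [c for c in children if latency_of(c) <= latency_limit_ms]
    return max(pop, key=acc_of)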
2.6x faster than EfficientNet, 1.5x faster than MobileNetV3

[Plots: ImageNet top-1 accuracy vs. Google Pixel1 latency. OFA reaches 80.1% top-1 and matches EfficientNet's accuracy at 2.6x lower latency, or delivers 3.8% higher accuracy at the same latency. Against MobileNetV3, OFA is 1.5x faster at the same accuracy, or 4% more accurate at the same latency.]

• Training from scratch cannot achieve the same level of accuracy.

Once-for-All, ICLR'20
More accurate than training from scratch

[Same plots, with an added "OFA - Train from scratch" curve: the same architectures trained from scratch consistently reach lower accuracy than the sub-networks directly extracted from the once-for-all network.]

• Training from scratch cannot achieve the same level of accuracy.

Once-for-All, ICLR'20
OFA: 80% Top-1 Accuracy on ImageNet

[Scatter plot: ImageNet top-1 accuracy vs. MACs (billions; lower is better), with marker size indicating model size (2M-64M parameters). Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, up to 14x less computation than prior models of similar accuracy. Baselines include handcrafted models (MobileNetV1/V2, ShuffleNet, ResNet-50/101, DenseNet, InceptionV2/V3, Xception, ResNeXt-50/101, DPN-92) and AutoML models (EfficientNet, MobileNetV3, NASNet-A, AmoebaNet, PNASNet, DARTS, ProxylessNAS, IGCV3-D).]

• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile vision setting (< 600M MACs).

Once-for-All, ICLR'20
OFA Enables Fast Specialization on Diverse Hardware Platforms

[Six plots: ImageNet top-1 accuracy vs. latency for OFA, MobileNetV3, and MobileNetV2 on Samsung S7 Edge, Google Pixel2, LG G8, NVIDIA 1080Ti (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized). On every platform, OFA's specialized sub-networks dominate the accuracy-latency trade-off of both baselines.]

Once-for-All, ICLR'20
Diverse Hardware Platforms: 50+ Pretrained Models are Released

Once-for-All, ICLR'20
OFA for FPGA Accelerators

[Bar charts, measured on the Xilinx ZU3EG FPGA: compared with MobileNetV2 and MnasNet, the OFA model achieves 40% higher arithmetic intensity (OPS/Byte) and 57% higher throughput (GOPS/s).]

• Non-specialized neural networks do not fully utilize the hardware resources. There is large room for improvement via neural network specialization.

Once-for-All, ICLR'20
We need Green AI:
Solve the Environmental Problem of NAS

How to save CO2 emission:

1. Once-for-All: amortize the search cost across many sub-networks and deployment scenarios. (Once-for-All, ICLR'20)
2. Lite Transformer: human-in-the-loop design; apply human insights into hardware & ML rather than "just search it". (Lite Transformer, ICLR'20)


OFA has broad applications

• Efficient Transformer

• Efficient Video Recognition

• Efficient 3D Vision

• Efficient GAN Compression


OFA's Application: Hardware-Aware Transformer (HAT)

Efficient NLP on mobile devices enables real-time conversation between speakers using different languages:
"Nice to meet you" → "Encantada de conocerte" / "Freut mich, dich kennenzulernen"

Figure 9: The design cost, measured in pounds of CO2 emission:
    Human Life (avg. 1 year): 11,023
    American Life (avg. 1 year): 36,156
    US Car w/ Fuel (avg. 1 lifetime): 126,000
    Evolved Transformer: 626,155
    HAT (Ours): 52 (12,041x less CO2 than Evolved Transformer)
Our framework for searching HAT reduces the search cost by four orders of magnitude compared with Evolved Transformer (So et al., 2019).

Table 5: K-means quantization of HAT models on WMT'14 En-Fr:

    Model                 BLEU    Model Size    Reduction
    Transformer Float32   41.2    705 MB        –
    HAT Float32           41.8    227 MB        3x
    HAT 8 bits            41.9    57 MB         12x
    HAT 4 bits            41.1    28 MB         25x

4-bit quantization reduces model size by 25x with only a 0.1 BLEU loss against the transformer baseline; 8-bit quantization even increases BLEU by 0.1 over the float version. On WMT'14 En-De, HAT matches the Transformer baseline's performance while being 3.7x smaller and 3x, 1.6x, 1.5x faster on Raspberry Pi, CPU, and GPU.

HAT, ACL'20
OFA's Application: Efficient Video Recognition

[Plot: Kinetics top-1 accuracy vs. computation (GFLOPs) for OFA + TSM (large and small), ResNet50 + TSM, ResNet50 + I3D, and MobileNetV2 + TSM.]

• 7x less computation at the same accuracy as TSM + ResNet50.
• 3% higher accuracy at the same computation as TSM + MobileNetV2.

TSM, ICCV'19
OFA's Application: Efficient 3D Recognition

Today's 3D workloads are compute-hungry: AR/VR can require a whole backpack of computers; self-driving, a whole trunk of GPUs.

[Plot: accuracy vs. latency trade-off.]

• 4x FLOPs reduction and 2x speedup over MinkowskiNet.
• 3.6% better accuracy under the same computation budget.

Follow-up of PVCNN, NeurIPS'19 (spotlight)
OFA's Application: GAN Compression

• 8-21x FLOPs reduction on CycleGAN, Pix2pix, and GauGAN.
• 1.7x-18.5x speedup on CPU/GPU & mobile CPU/GPU.

GAN Compression, CVPR'20


Summary: Once-for-All Network
• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.

Progressive Shrinking:
Train the full model → Shrink the model in 4 dimensions → Fine-tune both large and small sub-nets → once-for-all network

• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19.
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19, both classification & detection.
• Released 50+ different pre-trained OFA models for diverse hardware platforms (CPU/GPU/FPGA/DSP):
  net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & a pre-trained OFA network that provides diverse sub-networks without training (see the usage sketch below):
  ofa_network = ofa_net(net_id, pretrained=True)

Project Page: https://ofa.mit.edu
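
For reference, a short usage sketch (based on the mit-han-lab/once-for-all repository's README; the net_id string and the sub-network setter arguments are assumptions drawn from that repo, so check the project page for the exact interface):

import torch
from ofa.model_zoo import ofa_net  # from the released once-for-all package

# Load the pretrained once-for-all network.
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Activate one sub-network: kernel size 5, depth 3, expand ratio 4.
ofa_network.set_active_subnet(ks=5, d=3, e=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)

# Direct deploy: the extracted sub-network needs no retraining.
logits = subnet(torch.randn(1, 3, 224, 224))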


References
Model Compression & NAS
- Once-For-All: Train One Network and Specialize It for Efficient Deployment, ICLR’20
- ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, ICLR’19
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR’20
- HAQ: Hardware-Aware Automated Quantization with Mixed Precision, CVPR’19
- Defensive Quantization: When Efficiency Meets Robustness, ICLR’19
- AMC: AutoML for Model Compression and Acceleration on Mobile Devices, ECCV’18

Efficient Vision:
- GAN Compression: Learning Efficient Architectures for Conditional GANs, CVPR’20
- TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
- PVCNN: Point Voxel CNN for Efficient 3D Deep Learning, NeurIPS’19

Efficient NLP:
- Lite Transformer with Long Short Term Attention, ICLR’20
- HAT: Hardware-aware Transformer, ACL’20

Hardware & EDA:


- SpArch: Efficient Architecture for Sparse Matrix Multiplication, HPCA’20
- Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning, DAC’20

Make AI Efficient:
Tiny Computational Resources
Tiny Human Resources

Media Coverage:

Website: songhan.mit.edu github.com/mit-han-lab youtube.com/c/MITHANLab
