OFA CVPR Tutorial
AutoML for TinyML with Once-for-All Network
Song Han
Massachusetts Institute of Technology
Once-for-All, ICLR'20
Background: AutoML requires a lot of computation.
Challenge: Efficient Inference on Diverse Hardware Platforms
[Figure: hardware platforms with progressively less resource.]
Conventional NAS designs and trains a model for each deployment scenario:

    for devices:
        (1) for search episodes:
                for training iterations:
                    forward-backward();
                if good_model: break;
        (2) for post-search training iterations:
                forward-backward();

Design cost (GPU hours): 40K for one deployment scenario, growing to 160K and then 1600K as the number of target devices increases.
The design cost is calculated under the assumption of using MnasNet.
[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms
Diverse hardware platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS).
Repeating the search and training for many devices multiplies the design cost, and with it the carbon footprint:

    40K GPU hours   → 11.4k lbs CO2 emission
    160K GPU hours  → 45.4k lbs CO2 emission
    1600K GPU hours → 454.4k lbs CO2 emission

1 GPU hour translates to 0.284 lbs CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.
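To make the footprint numbers concrete, here is a minimal Python sketch of the conversion used on this slide (0.284 lbs CO2 per GPU hour, from Strubell et al.):

    # Convert NAS design cost in GPU hours to CO2 emission in lbs,
    # using the 0.284 lbs/GPU-hour factor cited above.
    LBS_CO2_PER_GPU_HOUR = 0.284

    for gpu_hours in (40_000, 160_000, 1_600_000):
        co2_lbs = gpu_hours * LBS_CO2_PER_GPU_HOUR
        print(f"{gpu_hours:>9} GPU hours -> {co2_lbs / 1000:.1f}k lbs CO2")
    # prints 11.4k, 45.4k, and 454.4k lbs CO2, matching the slide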
Problem: TinyML (inference) comes at the cost of BigML (training/search).
Once-for-All Network: Decouple Model Training and Architecture Design
Train one once-for-all network; specialized sub-networks are then derived from it for each deployment scenario without repeating the training.
Challenge: how to prevent different sub-networks from interfering with each other?
Solution: Progressive Shrinking
• More than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network.
Connection to Network Pruning
Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → a single pruned network.
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → a once-for-all network.
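A minimal PyTorch-style sketch of the joint fine-tuning step, assuming a supernet whose forward accepts the sampled sub-network configuration (the interface is illustrative, not the released API):

    import random

    def finetune_step(supernet, images, labels, loss_fn, optimizer):
        # Sample one sub-network configuration for this step; over many
        # steps both large and small sub-networks get trained, so they
        # all stay accurate on the shared weights.
        cfg = {
            "ks": random.choice([7, 5, 3]),      # kernel size
            "depth": random.choice([4, 3, 2]),   # layers per unit
            "width": random.choice([6, 4, 3]),   # width expand ratio
        }
        optimizer.zero_grad()
        loss = loss_fn(supernet(images, **cfg), labels)
        loss.backward()          # gradients flow into the shared weights
        optimizer.step()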
Progressive Shrinking: Elastic Kernel Size (7x7 → 5x5 → 3x3)
Small kernels share weights with large ones: the center 5x5 of the 7x7 kernel passes through a 25x25 transformation matrix to produce the 5x5 kernel, and the center 3x3 of that result passes through a 9x9 transformation matrix to produce the 3x3 kernel.
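A minimal PyTorch sketch of elastic kernel size with kernel transformation matrices; the class and parameter names are illustrative, not the released implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ElasticKernelConv(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Full 7x7 depthwise kernel; smaller kernels are derived from it.
            self.weight7 = nn.Parameter(torch.randn(channels, 1, 7, 7))
            # Learned transformation matrices shared across channels:
            # 25x25 maps the center 5x5 of the 7x7 kernel to a 5x5 kernel,
            # 9x9 maps the center 3x3 of the 5x5 kernel to a 3x3 kernel.
            self.transform5 = nn.Parameter(torch.eye(25))
            self.transform3 = nn.Parameter(torch.eye(9))

        def get_kernel(self, ks):
            w = self.weight7
            if ks <= 5:  # take the center 5x5 and transform it
                w = w[:, :, 1:6, 1:6].reshape(-1, 25) @ self.transform5
                w = w.reshape(-1, 1, 5, 5)
            if ks == 3:  # take the center 3x3 and transform it
                w = w[:, :, 1:4, 1:4].reshape(-1, 9) @ self.transform3
                w = w.reshape(-1, 1, 3, 3)
            return w

        def forward(self, x, ks=7):
            w = self.get_kernel(ks)
            return F.conv2d(x, w, padding=ks // 2, groups=x.size(1))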
Progressive Shrinking: Elastic Depth
Train with full depth first, then shrink the depth step by step. Within unit i, a sub-network of depth D keeps the first D layers and skips the rest (the figure's O1, O2, O3 are the outputs at the different depths), so shallow sub-networks share weights with deep ones.
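A minimal PyTorch sketch of elastic depth; the class name and interface are illustrative assumptions:

    import torch.nn as nn

    class ElasticUnit(nn.Module):
        def __init__(self, layers):
            super().__init__()
            self.layers = nn.ModuleList(layers)  # trained at full depth first

        def forward(self, x, depth=None):
            depth = depth if depth is not None else len(self.layers)
            # A depth-D sub-network runs only the first D layers; the skipped
            # tail means shallow sub-nets share weights with the full network.
            for layer in self.layers[:depth]:
                x = layer(x)
            return x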
Progressive Shrinking: Elastic Width
Gradually shrink the width: train with full width, then progressively shrink it.
Keep the most important channels when shrinking via channel sorting: compute each channel's importance (the figure shows example values such as 0.82, 0.63, 0.46, 0.15, 0.11, 0.02), reorganize the channels so the most important come first, then drop the tail.
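A minimal PyTorch sketch of channel sorting, using the L1 norm of each output channel's weights as the importance measure described in the paper (the function name is illustrative):

    import torch

    def shrink_width(conv_weight, num_keep):
        # conv_weight: (out_channels, in_channels, kH, kW)
        importance = conv_weight.abs().sum(dim=(1, 2, 3))   # L1 norm per output channel
        order = torch.argsort(importance, descending=True)  # channel sorting
        reorganized = conv_weight[order]                    # channel reorg.
        return reorganized[:num_keep]                       # keep the most important

    w = torch.randn(8, 16, 3, 3)           # a layer with width 8
    w_small = shrink_width(w, num_keep=4)  # shrink the width to 4
    # Note: the next layer's input channels must be permuted with the same order.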
Progressive Shrinking: Putting It Together
(Published as a conference paper at ICLR 2020.)
Elastic Resolution: R ∈ [128, 132, ..., 224]
Elastic Kernel Size: K ∈ [7, 5, 3] (with D = 4, W = 6)
Elastic Depth: D ∈ [4, 3], then [4, 3, 2] (with W = 6, K ∈ [7, 5, 3])
Elastic Width: W ∈ [6, 4], then [6, 4, 3] (with D ∈ [4, 3, 2], K ∈ [7, 5, 3])
Pipeline: train the full network (K = 7, D = 4, W = 6) → sample K at each layer, generating small-kernel weights with the transformation matrices (Fig. 3) → keep the first D layers at each unit (Fig. 3), sampling D and K → channel-sort (Fig. 4) and sample W at each layer together with K and D → fine-tune the weights (and transformation matrices) after each stage, yielding the once-for-all network.
Figure 2: Illustration of the progressive shrinking process to support different depth D, width W, kernel size K and resolution R. It leads to a large space comprising diverse sub-networks (> 10^19).
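A hedged Python sketch of the staged schedule in Figure 2; the stage list mirrors the figure, while train_stage is a placeholder stub standing in for real fine-tuning:

    import random

    def train_stage(space, steps=3):
        # Placeholder: in real training, each step samples a sub-network
        # from `space` and runs forward-backward on the shared weights.
        for _ in range(steps):
            cfg = {dim: random.choice(opts) for dim, opts in space.items()}
            print("sampled sub-network:", cfg)

    stages = [
        {"ks": [7],       "depth": [4],       "width": [6]},        # full network
        {"ks": [7, 5, 3], "depth": [4],       "width": [6]},        # elastic kernel size
        {"ks": [7, 5, 3], "depth": [4, 3],    "width": [6]},        # elastic depth, phase 1
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6]},        # elastic depth, phase 2
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4]},     # elastic width, phase 1
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]},  # elastic width, phase 2
    ]
    for space in stages:
        train_stage(space)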
Performances of Sub-networks on ImageNet
[Figure: ImageNet top-1 accuracy (67-78%) of sub-networks under various architecture configurations (D: depth, W: width, K: kernel size), trained with and without progressive shrinking (PS). Across the eight configurations with D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7}, progressive shrinking improves top-1 accuracy by 2.5% to 3.7%.]
How about search?
OFA is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3.
[Figure: ImageNet top-1 accuracy vs. Google Pixel 1 latency. Left, OFA vs. EfficientNet: 80.1% vs. 79.8% top-1 at 2.6x lower latency, and 3.8% higher accuracy at the same latency (80.1% vs. 76.3%). Right, OFA vs. MobileNetV3: 76.4% vs. 74.9% top-1 at the high end, and up to 4% higher accuracy at the same latency toward the low end (points at 70.4% and 67.4% shown).]
[Figure: top-1 accuracy vs. MACs (billions, lower is better) in the mobile setting, comparing handcrafted models such as MobileNetV1 with AutoML-designed models.]
OFA Enables Fast Specialization on Diverse Hardware Platforms
[Figure: ImageNet top-1 accuracy vs. latency for OFA, MobileNetV3, and MobileNetV2 on three platforms: NVIDIA 1080Ti GPU (10-30 ms, batch size 64), Intel Xeon CPU (9-19 ms, batch size 1), and Xilinx ZU3EG FPGA (3-8 ms, batch size 1, quantized). On all three, OFA dominates the accuracy-latency trade-off, e.g. reaching 76.3-76.4% top-1 where the baselines reach roughly 72-73%.]
Diverse Hardware Platforms: 50+ Pretrained Models Are Released
OFA for FPGA Accelerators
We need Green AI: solve the environmental problem of NAS (e.g. the Evolved Transformer search).
How to save CO2 emission:
• Efficient Transformer
• Efficient 3D Vision
OFA's Application: Efficient 3D Recognition
[Figure: video recognition accuracy (69-73%) vs. computation (0-40 GFLOPs). OFA + TSM (small) achieves +3.0% accuracy over MobileNetV2 + TSM at the same computation, and approaches ResNet50 + I3D at a fraction of the GFLOPs. TSM: ICCV'19.]
Progressive Shrinking
Train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network.
• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19.
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19, both classification & detection.
• Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP):
    net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & pre-trained OFA network that provides diverse sub-networks without training:
    ofa_network = ofa_net(net_id, pretrained=True)
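A hedged usage sketch of the two released entry points above; the import path, the example net_id string, and the sub-network selection methods follow the mit-han-lab/once-for-all repository, but treat them as assumptions rather than a spec:

    from ofa.model_zoo import ofa_net, ofa_specialized

    # Load the once-for-all network; sub-networks come out without training.
    # The id string is one example from the released model zoo.
    ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)
    ofa_network.set_active_subnet(ks=7, e=6, d=4)  # choose kernel/width/depth
    subnet = ofa_network.get_active_subnet(preserve_weight=True)

    # Or load a model already specialized for a target platform
    # (net_id is one of the released specialized-model identifiers).
    net, image_size = ofa_specialized(net_id, pretrained=True)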
Efficient Vision:
- GAN Compression: Learning Efficient Architectures for Conditional GANs, CVPR'20
- TSM: Temporal Shift Module for Efficient Video Understanding, ICCV'19
- PVCNN: Point-Voxel CNN for Efficient 3D Deep Learning, NeurIPS'19
Efficient NLP:
- Lite Transformer with Long-Short Range Attention, ICLR'20
- HAT: Hardware-Aware Transformers, ACL'20
Make AI Efficient: Tiny Computational Resources, Tiny Human Resources
Media Coverage: