
CV801: Week 3 Lecture 1 and 2

CNN Architectures
Project Discussion
ImageNet Classification Challenge
[Chart: top-5 error rate (%) of the ILSVRC winner by year]
2010  Lin et al                     28.2  (shallow)
2011  Sanchez & Perronnin           25.8  (shallow)
2012  Krizhevsky et al (AlexNet)    16.4  (8 layers)
2013  Zeiler & Fergus               11.7  (8 layers)
2014  Simonyan & Zisserman (VGG)     7.3  (19 layers)
2014  Szegedy et al (GoogLeNet)      6.7  (22 layers)
2015  He et al (ResNet)              3.6  (152 layers)
2016  Shao et al                     3.0  (152 layers)
2017  Hu et al (SENet)               2.3  (152 layers)
Human (Russakovsky et al)            5.1
AlexNet

Layer   | In C  | In H/W | filters | kernel | stride | pad | Out C | Out H/W | memory (KB) | params (K) | flop (M)
conv1   | 3     | 227    | 64      | 11     | 4      | 2   | 64    | 56      | 784         | 23         | 73
pool1   | 64    | 56     |         | 3      | 2      | 0   | 64    | 27      | 182         | 0          | 0
conv2   | 64    | 27     | 192     | 5      | 1      | 2   | 192   | 27      | 547         | 307        | 224
pool2   | 192   | 27     |         | 3      | 2      | 0   | 192   | 13      | 127         | 0          | 0
conv3   | 192   | 13     | 384     | 3      | 1      | 1   | 384   | 13      | 254         | 664        | 112
conv4   | 384   | 13     | 256     | 3      | 1      | 1   | 256   | 13      | 169         | 885        | 145
conv5   | 256   | 13     | 256     | 3      | 1      | 1   | 256   | 13      | 169         | 590        | 100
pool5   | 256   | 13     |         | 3      | 2      | 0   | 256   | 6       | 36          | 0          | 0
flatten | 256   | 6      |         |        |        |     | 9216  |         | 36          | 0          | 0
fc6     | 9216  |        | 4096    |        |        |     | 4096  |         | 16          | 37,749     | 38
fc7     | 4096  |        | 4096    |        |        |     | 4096  |         | 16          | 16,777     | 17
fc8     | 4096  |        | 1000    |        |        |     | 1000  |         | 4           | 4,096      | 4
AlexNet

• 224 x 224 inputs
• 5 convolutional layers
• 3 max pooling layers
• 3 fully-connected layers
• ReLU nonlinearities applied to the output of every conv and fc layer
• Used “Local response normalization”; not used anymore
• Trained on two GTX 580 GPUs – only 3GB of memory each! The model was split over the two GPUs.
• The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.
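To make the cost columns above concrete, here is a minimal Python sketch (mine, not from the slides) that computes the output size, activation memory, parameter count and FLOPs of a single conv layer under the same conventions as the table; plugging in conv1's configuration reproduces the 784 KB / ~23K params / ~73 MFLOP row.

```python
def conv_layer_cost(c_in, h_in, c_out, kernel, stride, pad):
    """Output size, activation memory, params, and FLOPs for one conv layer.

    Conventions follow the lecture table: memory counts the float32 output
    activations, and FLOPs counts one multiply-accumulate per weight per
    output position.
    """
    h_out = (h_in + 2 * pad - kernel) // stride + 1
    memory_kb = c_out * h_out * h_out * 4 / 1024            # 4 bytes per float
    params = c_out * (c_in * kernel * kernel + 1)            # weights + biases
    flops = (c_out * h_out * h_out) * (c_in * kernel * kernel)
    return h_out, memory_kb, params, flops

# AlexNet conv1: 3 x 227 x 227 input, 64 filters of 11x11, stride 4, pad 2
h, mem, p, f = conv_layer_cost(3, 227, 64, 11, 4, 2)
print(h, mem, p / 1e3, f / 1e6)   # ~56, ~784 KB, ~23K params, ~73 MFLOPs
```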
ImageNet Classification Challenge

[Recap of the ImageNet error-rate-by-year chart shown above.]
VGG: Deeper Networks, Regular Design

VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

[Diagram: layer stacks of AlexNet, VGG-16 and VGG-19, from Input through the conv/pool stages to FC 4096, FC 4096, FC 1000, Softmax.]

Simonyan and Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015
VGG: Deeper Networks, Regular Design
VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels
Network has 5 convolutional stages:
Stage 1: conv-conv-pool
Stage 2: conv-conv-pool
Stage 3: conv-conv-conv-[conv]-pool
Stage 4: conv-conv-conv-[conv]-pool
Stage 5: conv-conv-conv-[conv]-pool
(VGG-16 has 3 conv in stages 3, 4 and 5)
(VGG-19 has 4 conv in stages 3, 4 and 5)
VGG: Deeper Networks, Regular Design
VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

Option 1:
Conv(5x5, C -> C)

Params: 25C²
FLOPs: 25C²HW
VGG: Deeper Networks, Regular Design

VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

Option 1:            Option 2:
Conv(5x5, C -> C)    Conv(3x3, C -> C)
                     Conv(3x3, C -> C)

Params: 25C²         Params: 18C²
FLOPs: 25C²HW        FLOPs: 18C²HW

Two 3x3 conv have the same receptive field as a single 5x5 conv, but have fewer parameters and take less computation!

[Diagram: the AlexNet layer stack (Input, 11x11 conv 96, pool, 5x5 conv 256, ..., FC layers, Softmax) shown alongside for reference.]
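A quick way to verify these parameter counts is to build both options and count weights; a small PyTorch sketch of mine (the channel count C = 64 is just an example, not from the slide):

```python
import torch.nn as nn

C = 64  # example channel count

# Option 1: a single 5x5 conv
opt1 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

# Option 2: two stacked 3x3 convs (same 5x5 receptive field)
opt2 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(opt1), 25 * C ** 2)   # 102400 = 25C²
print(count(opt2), 18 * C ** 2)   # 73728  = 18C²
```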
VGG: Deeper Networks, Regular Design
VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

Input: C x 2H x 2W
Layer: Conv(3x3, C->C)

FLOPs: 36HWC²
VGG: Deeper Networks, Regular Design
VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

Input: C x 2H x 2W Input: 2C x H x W
Layer: Conv(3x3, C->C) Conv(3x3, 2C -> 2C)

FLOPs: 36HWC²            FLOPs: 36HWC²


VGG: Deeper Networks, Regular Design

VGG Design rules:
All conv are 3x3 stride 1 pad 1
All max pool are 2x2 stride 2
After pool, double #channels

Most of the Conv layers at EACH spatial resolution take the same amount of computation!

Input: C x 2H x 2W         Input: 2C x H x W
Layer: Conv(3x3, C->C)     Conv(3x3, 2C -> 2C)

FLOPs: 36HWC²              FLOPs: 36HWC²

Simonyan and Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015
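A quick check of this invariance (worked out here, not on the slide): before the pool, Conv(3x3, C->C) on a C x 2H x 2W input costs (3·3·C)·C·(2H)·(2W) = 36HWC² multiply-adds; after the pool, Conv(3x3, 2C->2C) on a 2C x H x W input costs (3·3·2C)·(2C)·H·W = 36HWC². Halving the resolution while doubling the channels leaves the per-layer cost unchanged.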
AlexNet vs VGG-16: Much bigger network!

Memory (KB)                     Params (M)                    MFLOPs
AlexNet total: 1.9 MB           AlexNet total: 61M            AlexNet total: 0.7 GFLOP
VGG-16 total: 48.6 MB (25x)     VGG-16 total: 138M (2.3x)     VGG-16 total: 13.6 GFLOP (19.4x)
ImageNet Classification Challenge

[Recap of the ImageNet error-rate-by-year chart shown above.]
GoogLeNet: Focus on Efficiency

Many innovations for efficiency: reduce parameter count, memory usage, and computation.

Szegedy et al, “Going deeper with convolutions”, CVPR 2015


GoogLeNet: Aggressive Stem
Stem network at the start aggressively downsamples input
(In VGG-16: Most of the compute was at the start)

Szegedy et al, “Going deeper with convolutions”, CVPR 2015


GoogLeNet: Aggressive Stem
Stem network at the start aggressively downsamples input

Layer    | In C | In H/W | filters | kernel | stride | pad | Out C | Out H/W | memory (KB) | params (K) | flop (M)
conv     | 3    | 224    | 64      | 7      | 2      | 3   | 64    | 112     | 3136        | 9          | 118
max-pool | 64   | 112    |         | 3      | 2      | 1   | 64    | 56      | 784         | 0          | 2
conv     | 64   | 56     | 64      | 1      | 1      | 0   | 64    | 56      | 784         | 4          | 13
conv     | 64   | 56     | 192     | 3      | 1      | 1   | 192   | 56      | 2352        | 111        | 347
max-pool | 192  | 56     |         | 3      | 2      | 1   | 192   | 28      | 588         | 0          | 1

Total from 224 to 28 spatial resolution:


Memory: 7.5 MB
Params: 124K
MFLOP: 418
Szegedy et al, “Going deeper with convolutions”, CVPR 2015
GoogLeNet: Aggressive Stem
Stem network at the start aggressively downsamples input
(Recall in VGG-16: Most of the compute was at the start)
Layer    | In C | In H/W | filters | kernel | stride | pad | Out C | Out H/W | memory (KB) | params (K) | flop (M)
conv     | 3    | 224    | 64      | 7      | 2      | 3   | 64    | 112     | 3136        | 9          | 118
max-pool | 64   | 112    |         | 3      | 2      | 1   | 64    | 56      | 784         | 0          | 2
conv     | 64   | 56     | 64      | 1      | 1      | 0   | 64    | 56      | 784         | 4          | 13
conv     | 64   | 56     | 192     | 3      | 1      | 1   | 192   | 56      | 2352        | 111        | 347
max-pool | 192  | 56     |         | 3      | 2      | 1   | 192   | 28      | 588         | 0          | 1

Total from 224 to 28 spatial resolution: Compare VGG-16:


Memory: 7.5 MB Memory: 42.9 MB (5.7x)
Params: 124K Params: 1.1M (8.9x)
MFLOP: 418 MFLOP: 7485 (17.8x)
Szegedy et al, “Going deeper with convolutions”, CVPR 2015
GoogLeNet: Global Average Pooling
No large FC layers at the end! Instead uses global average pooling to
collapse spatial dimensions, and one linear layer to produce class scores
(Recall VGG-16: Most parameters were in the FC layers!)

Layer    | In C | In H/W | filters | kernel | stride | pad | Out C | Out H/W | memory (KB) | params (k) | flop (M)
avg-pool | 1024 | 7      |         | 7      | 1      | 0   | 1024  | 1       | 4           | 0          | 0
fc       | 1024 |        | 1000    |        |        |     | 1000  |         | 0           | 1025       | 1

Compare with VGG-16:
Layer    | In C  | In H/W | filters | kernel | stride | pad | Out C | Out H/W | memory (KB) | params (K) | flop (M)
flatten  | 512   | 7      |         |        |        |     | 25088 |         | 98          |            |
fc6      | 25088 |        | 4096    |        |        |     | 4096  |         | 16          | 102760     | 103
fc7      | 4096  |        | 4096    |        |        |     | 4096  |         | 16          | 16777      | 17
fc8      | 4096  |        | 1000    |        |        |     | 1000  |         | 4           | 4096       | 4
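The parameter gap between the two heads is easy to reproduce; a hedged PyTorch sketch (the channel counts 512x7x7 and 1024 come from the tables above, everything else is my own illustration):

```python
import torch.nn as nn

# VGG-16-style head: flatten 512x7x7, then three big FC layers
vgg_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

# GoogLeNet-style head: global average pool, then one small linear layer
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # collapse the 7x7 spatial dimensions to 1x1
    nn.Flatten(),
    nn.Linear(1024, 1000),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(vgg_head) / 1e6, count(gap_head) / 1e6)   # ~123.6M vs ~1.0M params
```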
AlexNet vs VGG-16

[Bar charts: per-layer Memory (KB), MFLOPs and Params (M) for AlexNet vs VGG-16, over each conv and fc layer.]

AlexNet total: 1.9 MB           AlexNet total: 0.7 GFLOP             AlexNet total: 61M
VGG-16 total: 48.6 MB (25x)     VGG-16 total: 13.6 GFLOP (19.4x)     VGG-16 total: 138M (2.3x)
GoogLeNet: Global Average Pooling

No large FC layers at the end! Instead uses “global average pooling” to collapse spatial dimensions, and one linear layer to produce class scores. (Recall VGG-16: Most parameters were in the FC layers!)

[Repeated from the earlier slide: the avg-pool + fc head table, compared with the VGG-16 flatten/fc6/fc7/fc8 head.]
GoogLeNet: Inception Module

Inception module: a local unit with parallel branches. This local structure is repeated many times throughout the network.

Uses 1x1 “bottleneck” layers to reduce the channel dimension before expensive conv (also used in ResNet!).

Szegedy et al, “Going deeper with convolutions”, CVPR 2015
GoogLeNet: Auxiliary Classifiers

Training using only a loss at the end of the network didn’t work well: the network is too deep, and gradients don’t propagate cleanly.

As a hack, attach “auxiliary classifiers” at several intermediate points in the network that also try to classify the image and receive a loss.

GoogLeNet was before batch normalization! With BatchNorm there is no longer any need for this trick.
ImageNet Classification Challenge

[Recap of the ImageNet error-rate-by-year chart shown above.]
Residual Networks
Outline

• Residual Networks (ResNet)


ImageNet Classification Challenge

[Recap of the ImageNet error-rate-by-year chart shown above.]
Residual Networks
Once we have Batch Normalization, we can train networks with 10+ layers.
What happens as we go deeper?

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?

[Plot: test error vs. iterations for a 20-layer and a 56-layer network; the 56-layer curve sits above the 20-layer curve.]

The deeper model does worse than the shallow model!

Initial guess: the deep model is overfitting, since it is much bigger than the other model.

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016

Residual Networks

Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?

[Plots: training error and test error vs. iterations; the 56-layer network is worse than the 20-layer network on both.]

In fact, the deep model seems to be underfitting, since it also performs worse than the shallow model on the training set!

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

A deeper model can emulate a shallower model: copy the layers from the shallower model and set the extra layers to the identity.

Thus deeper models should do at least as well as shallow models.

Hypothesis: This is an optimization problem. Deeper models are harder to optimize, and in particular they don’t learn identity functions to emulate shallow models.

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

Solution: Change the network so learning identity functions with extra layers is easy!

“Plain” block: x -> conv -> relu -> conv -> H(x)
Residual block: x -> conv -> relu -> conv -> F(x), then output = relu(F(x) + x) via an additive “shortcut” that adds the input x back in.

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Residual Networks

Solution: Change the network so learning identity functions with extra layers is easy!

[Same diagram as above: the “Plain” block computes H(x); the Residual block computes relu(F(x) + x) via the additive shortcut.]

If you set the conv weights to 0, the whole residual block computes the identity function!

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
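A minimal PyTorch sketch of the residual block just described (my own; the BatchNorm placement follows the common post-activation design and is an assumption, not something shown on this slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual block: out = relu(F(x) + x), with F = conv-bn-relu-conv-bn."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # additive "shortcut"

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```

With the conv weights set to zero, F(x) vanishes and the block passes x straight through, which is exactly the easy-to-learn identity the slide argues for.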
Residual Networks

A residual network is a stack of many residual blocks.

Regular design, like VGG: each residual block has two 3x3 conv.

The network is divided into stages: the first block of each stage halves the resolution (with a stride-2 conv) and doubles the number of channels.

[Diagram: full ResNet column from Input through 7x7 conv /2 and pool, then stages of 3x3 conv 64 / 128 / 256 / 512 residual blocks, global pool, FC 1000, Softmax.]

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

Similar to GoogLeNet, downsample the input 4x before applying residual blocks:

Layer    | In C | In H/W | filters | kernel | stride | pad | Out C | Out H/W
conv     | 3    | 224    | 64      | 7      | 2      | 3   | 64    | 112
max-pool | 64   | 112    |         | 3      | 2      | 1   | 64    | 56

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

Like GoogLeNet, no big fully-connected layers: instead use global average pooling and a single linear layer at the end.

[Diagram: the same ResNet column as above, with the Pool / FC 1000 / Softmax head at the top.]

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks

ResNet-18:
Stem: 1 conv layer
Stage 1 (C=64):  2 res. blocks = 4 conv
Stage 2 (C=128): 2 res. blocks = 4 conv
Stage 3 (C=256): 2 res. blocks = 4 conv
Stage 4 (C=512): 2 res. blocks = 4 conv
Linear

ImageNet top-5 error: 10.92
GFLOP: 1.8

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision
Residual Networks

ResNet-18:                                 ResNet-34:
Stem: 1 conv layer                         Stem: 1 conv layer
Stage 1 (C=64):  2 res. blocks = 4 conv    Stage 1: 3 res. blocks = 6 conv
Stage 2 (C=128): 2 res. blocks = 4 conv    Stage 2: 4 res. blocks = 8 conv
Stage 3 (C=256): 2 res. blocks = 4 conv    Stage 3: 6 res. blocks = 12 conv
Stage 4 (C=512): 2 res. blocks = 4 conv    Stage 4: 3 res. blocks = 6 conv
Linear                                     Linear

ImageNet top-5 error: 10.92                ImageNet top-5 error: 8.58
GFLOP: 1.8                                 GFLOP: 3.6

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision
Residual Networks

ResNet-18:                                 ResNet-34:
Stem: 1 conv layer                         Stem: 1 conv layer
Stage 1 (C=64):  2 res. blocks = 4 conv    Stage 1: 3 res. blocks = 6 conv
Stage 2 (C=128): 2 res. blocks = 4 conv    Stage 2: 4 res. blocks = 8 conv
Stage 3 (C=256): 2 res. blocks = 4 conv    Stage 3: 6 res. blocks = 12 conv
Stage 4 (C=512): 2 res. blocks = 4 conv    Stage 4: 3 res. blocks = 6 conv
Linear                                     Linear

ImageNet top-5 error: 10.92                ImageNet top-5 error: 8.58
GFLOP: 1.8                                 GFLOP: 3.6

VGG-16 (for comparison):
ImageNet top-5 error: 9.62
GFLOP: 13.6

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision


Residual Networks: Basic Block

Conv(3x3, C->C)

Conv(3x3, C->C)

“Basic”
Residual block

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks: Basic Block

Conv(3x3, C->C)   FLOPs: 9HWC²
Conv(3x3, C->C)   FLOPs: 9HWC²

“Basic” residual block. Total FLOPs: 18HWC²

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Residual Networks: Bottleneck Block

“Basic” residual block:               “Bottleneck” residual block:
Conv(3x3, C->C)   FLOPs: 9HWC²        Conv(1x1, C->4C)
Conv(3x3, C->C)   FLOPs: 9HWC²        Conv(3x3, C->C)
Total FLOPs: 18HWC²                   Conv(1x1, 4C->C)

(The bottleneck diagram reads bottom-to-top: the 4C-channel input is first reduced to C channels by a 1x1 conv, processed by a 3x3 conv, then expanded back to 4C channels.)

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Residual Networks: Bottleneck Block

More layers, less computational cost!

“Basic” residual block:               “Bottleneck” residual block:
Conv(3x3, C->C)   FLOPs: 9HWC²        Conv(1x1, C->4C)   FLOPs: 4HWC²
Conv(3x3, C->C)   FLOPs: 9HWC²        Conv(3x3, C->C)    FLOPs: 9HWC²
                                      Conv(1x1, 4C->C)   FLOPs: 4HWC²
Total FLOPs: 18HWC²                   Total FLOPs: 17HWC²

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
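A PyTorch sketch of the bottleneck block (mine; normalization and activation placement are assumptions in the spirit of the original ResNet, not copied from the slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand, plus shortcut.
    The block operates on 4*c channels; the 3x3 conv only sees c channels."""
    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Conv2d(4 * c, c, 1, bias=False)        # 4C -> C
        self.conv = nn.Conv2d(c, c, 3, padding=1, bias=False)   # C -> C
        self.expand = nn.Conv2d(c, 4 * c, 1, bias=False)        # C -> 4C
        self.bn1 = nn.BatchNorm2d(c)
        self.bn2 = nn.BatchNorm2d(c)
        self.bn3 = nn.BatchNorm2d(4 * c)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)

x = torch.randn(1, 256, 56, 56)   # 4C = 256, i.e. C = 64
print(Bottleneck(64)(x).shape)    # torch.Size([1, 256, 56, 56])
```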
Residual Networks

[Diagram: the full ResNet column, as before.]

           | Block  | Stem   | Stage 1        | Stage 2        | Stage 3        | Stage 4        | FC     |       | ImageNet
           | type   | layers | Blocks  Layers | Blocks  Layers | Blocks  Layers | Blocks  Layers | layers | GFLOP | top-5 error
ResNet-18  | Basic  | 1      | 2  4           | 2  4           | 2  4           | 2  4           | 1      | 1.8   | 10.92
ResNet-34  | Basic  | 1      | 3  6           | 4  8           | 6  12          | 3  6           | 1      | 3.6   | 8.58
ResNet-50  | Bottle | 1      | 3  9           | 4  12          | 6  18          | 3  9           | 1      | 3.8   | 7.13
ResNet-101 | Bottle | 1      | 3  9           | 4  12          | 23 69          | 3  9           | 1      | 7.6   | 6.44
ResNet-152 | Bottle | 1      | 3  9           | 8  24          | 36 108         | 3  9           | 1      | 11.3  | 5.94

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision


Residual Networks

ResNet-50 is the same as ResNet-34, but replaces Basic blocks with Bottleneck blocks. This is a great baseline architecture for many tasks even today!

[Same ResNet-18/34/50/101/152 table as above.]

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision


Residual Networks

Deeper ResNet-101 and ResNet-152 models are more accurate, but also more computationally heavy.

[Same ResNet-18/34/50/101/152 table as above.]

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision


Residual Networks

Ø The spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv with stride 2.

Ø What about the shortcut connection of such a downsampling residual block?
Ø (A 1x1 conv with stride 2 is used at the shortcut connection.)
Residual Networks

- Able to train very deep networks


- Deeper networks do better than
shallow networks (as expected)
- Swept 1st place in all ILSVRC and
COCO 2015 competitions
- Still widely used today!

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016


Improving Residual Networks: Block Design

Original ResNet block:                         “Pre-Activation” ResNet block:
Conv -> BatchNorm -> ReLU -> Conv ->           BatchNorm -> ReLU -> Conv ->
BatchNorm -> (+ shortcut) -> ReLU              BatchNorm -> ReLU -> Conv -> (+ shortcut)

Note the ReLU after the residual addition in the original block: it cannot actually learn the identity function, since its outputs are nonnegative!

With the ReLU inside the residual branch (pre-activation), the block can learn a true identity function by setting the conv weights to zero!

He et al, ”Identity mappings in deep residual networks”, ECCV 2016


Improving Residual Networks: Block Design

Original ResNet block vs. “Pre-Activation” ResNet block:

Slight improvement in accuracy (ImageNet top-1 error):
ResNet-152: 21.3 vs 21.1
ResNet-200: 21.8 vs 20.7

Not actually used that much in practice.

He et al, ”Identity mappings in deep residual networks”, ECCV 2016


Improving ResNets

Conv(1x1, C->4C)   FLOPs: 4HWC²
Conv(3x3, C->C)    FLOPs: 9HWC²
Conv(1x1, 4C->C)   FLOPs: 4HWC²

“Bottleneck” residual block. Total FLOPs: 17HWC²
Grouped Convolution (recap)

Convolution with groups=1: normal convolution.
Input: Cin x H x W    Weight: Cout x Cin x K x K    Output: Cout x H’ x W’
FLOPs: Cout·Cin·K²·H’·W’
All convolutional kernels touch all Cin channels of the input.

Convolution with groups=G: G parallel conv layers; each “sees” Cin/G input channels and produces Cout/G output channels.
Input: Cin x H x W, split into G x [(Cin/G) x H x W]
Weight: G x (Cout/G) x (Cin/G) x K x K (G parallel convolutions)
Output: G x [(Cout/G) x H’ x W’], concatenated to Cout x H’ x W’
FLOPs: Cout·Cin·K²·H’·W’ / G
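A small PyTorch sketch (mine) showing the factor-G saving in weights, and hence FLOPs, when switching to grouped convolution:

```python
import torch.nn as nn

Cin, Cout, K = 256, 256, 3

normal = nn.Conv2d(Cin, Cout, K, padding=1, bias=False)             # groups=1
grouped = nn.Conv2d(Cin, Cout, K, padding=1, groups=8, bias=False)  # groups=8

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(normal))    # Cout*Cin*K*K       = 589,824
print(count(grouped))   # Cout*(Cin/8)*K*K   = 73,728  (8x fewer weights and FLOPs)
```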


Improving ResNets: ResNeXt

“Bottleneck” residual block:           G parallel pathways, each:
Conv(1x1, C->4C)   FLOPs: 4HWC²        Conv(1x1, c->4C)
Conv(3x3, C->C)    FLOPs: 9HWC²        Conv(3x3, c->c)
Conv(1x1, 4C->C)   FLOPs: 4HWC²        Conv(1x1, 4C->c)
Total FLOPs: 17HWC²

Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
Improving ResNets: ResNeXt

“Bottleneck” residual block:           G parallel pathways, each:
Conv(1x1, C->4C)   FLOPs: 4HWC²        Conv(1x1, c->4C)   FLOPs: 4HWCc
Conv(3x3, C->C)    FLOPs: 9HWC²        Conv(3x3, c->c)    FLOPs: 9HWc²
Conv(1x1, 4C->C)   FLOPs: 4HWC²        Conv(1x1, 4C->c)   FLOPs: 4HWCc
Total FLOPs: 17HWC²                    Total FLOPs: (8Cc + 9c²)·HWG

Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
Improving ResNets: ResNeXt

“Bottleneck” residual block (Total FLOPs: 17HWC²) vs. G parallel pathways (Total FLOPs: (8Cc + 9c²)·HWG).

Equal cost when 9Gc² + 8GCc – 17C² = 0
Example: C=64, G=4, c=24;  C=64, G=32, c=4

Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
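A quick numerical check of the equal-cost condition (my own sketch; the quadratic in c is solved with the standard formula):

```python
import math

def equal_cost_width(C, G):
    """Positive root of 9*G*c^2 + 8*G*C*c - 17*C^2 = 0: the per-pathway width c
    at which G parallel pathways cost the same as one bottleneck block."""
    a, b, k = 9 * G, 8 * G * C, -17 * C ** 2
    return (-b + math.sqrt(b * b - 4 * a * k)) / (2 * a)

print(equal_cost_width(64, 4))    # ~23.9 -> matches the slide's example c=24
print(equal_cost_width(64, 32))   # ~3.97 -> matches the slide's example c=4
```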
Improving ResNets: ResNeXt

Equivalent formulation with grouped convolution:

ResNeXt block (grouped convolution):     G parallel pathways, each:
Conv(1x1, Gc->4C)                        Conv(1x1, c->4C)   FLOPs: 4HWCc
Conv(3x3, Gc->Gc, groups=G)              Conv(3x3, c->c)    FLOPs: 9HWc²
Conv(1x1, 4C->Gc)                        Conv(1x1, 4C->c)   FLOPs: 4HWCc
                                         Total FLOPs: (8Cc + 9c²)·HWG

Equal cost when 9Gc² + 8GCc – 17C² = 0
Example: C=64, G=4, c=24;  C=64, G=32, c=4

Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
ResNeXt: Maintain computation by adding groups!

Model Groups Group width Top-1 Error Model Groups Group width Top-1 Error
ResNet-50 1 64 23.9 ResNet-101 1 64 22.0
ResNeXt-50 2 40 23 ResNeXt-101 2 40 21.7
ResNeXt-50 4 24 22.6 ResNeXt-101 4 24 21.4
ResNeXt-50 8 14 22.3 ResNeXt-101 8 14 21.3
ResNeXt-50 32 4 22.2 ResNeXt-101 32 4 21.2

Adding groups improves performance with same computational complexity!

Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
ImageNet Classification Challenge

[Recap of the ImageNet error-rate-by-year chart shown above.]
Squeeze-and-Excitation Networks

Adds a “Squeeze-and-excite” branch to each residual block that performs global pooling and fully-connected layers, and multiplies the result back onto the feature map.

Adds global context to each residual block!

Won ILSVRC 2017 with ResNeXt-152-SE.

Hu et al, “Squeeze-and-Excitation networks”, CVPR 2018
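A minimal PyTorch sketch of such a squeeze-and-excite branch (mine; the reduction ratio r=16 and other details are assumptions):

```python
import torch
import torch.nn as nn

class SEBranch(nn.Module):
    """Squeeze-and-excite: global pool -> FC -> ReLU -> FC -> sigmoid,
    then rescale the feature map channel-wise with the result."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                  # squeeze: global average pool -> (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excite: per-channel weights in [0, 1]
        return x * s.view(n, c, 1, 1)                           # multiply back onto the feature map

x = torch.randn(2, 256, 14, 14)
print(SEBranch(256)(x).shape)   # torch.Size([2, 256, 14, 14])
```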


ImageNet Classification Challenge

Completion of the challenge: the annual ImageNet competition was no longer held after 2017; it has now moved to Kaggle.

[ImageNet error-rate-by-year chart, as above.]
Densely Connected Neural Networks

Dense blocks, where each layer is connected to every other layer in a feedforward fashion.

Alleviates vanishing gradients, strengthens feature propagation, encourages feature reuse.

[Diagram: a Dense Block (Conv layers whose outputs are repeatedly concatenated with their inputs) and the full network: Input -> Conv -> Dense Block 1 -> Pool -> Dense Block 2 -> Pool -> Dense Block 3 -> Pool -> FC -> Softmax.]

Huang et al, “Densely connected neural networks”, CVPR 2017
MobileNets: Tiny Networks (For Mobile Devices)

Standard Convolution Block                Depthwise Separable Convolution
Total cost: 9C²HW                         Total cost: (9C + C²)HW

Conv(3x3, C->C)           9C²HW           Conv(3x3, C->C, groups=C)   9CHW    “Depthwise Convolution”
Batch Norm                                Batch Norm
ReLU                                      ReLU
                                          Conv(1x1, C->C)             C²HW    “Pointwise Convolution”
                                          Batch Norm
                                          ReLU

Speedup = 9C²/(9C + C²) = 9C/(9 + C) => 9 (as C >> 9)

Howard et al, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
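A PyTorch sketch of the two blocks (mine), confirming the parameter and FLOP saving:

```python
import torch.nn as nn

def standard_block(C):
    """Standard 3x3 convolution block: ~9C²HW FLOPs."""
    return nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False),
                         nn.BatchNorm2d(C), nn.ReLU(inplace=True))

def depthwise_separable_block(C):
    """Depthwise separable convolution: 3x3 depthwise (groups=C, ~9CHW FLOPs)
    followed by a 1x1 pointwise conv (~C²HW FLOPs)."""
    return nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),   # depthwise
        nn.BatchNorm2d(C), nn.ReLU(inplace=True),
        nn.Conv2d(C, C, 1, bias=False),                        # pointwise
        nn.BatchNorm2d(C), nn.ReLU(inplace=True),
    )

count = lambda m: sum(p.numel() for p in m.parameters())
C = 256
print(count(standard_block(C)) / count(depthwise_separable_block(C)))
# ~8.6x fewer weights, close to the 9C/(9 + C) ≈ 8.7 speedup from the slide
```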
MobileNets: Tiny Networks (For Mobile Devices)

Also related:
ShuffleNet: Zhang et al, CVPR 2018
MobileNetV2: Sandler et al, CVPR 2018
ShuffleNetV2: Ma et al, ECCV 2018
MobileOne: CVPR 2023

[Depthwise Separable Convolution block, as on the previous slide. Total cost: (9C + C²)HW.]

Howard et al, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
Inverted Residual Block

(a) A conventional residual block connects the layers with a high number of channels.
(b) An inverted residual block connects the bottlenecks instead of the layers with a high number of channels.

The inverted residual block is more memory efficient.

Inverted Residual Block

Residual (bottleneck) block:    Inverted residual block:
Conv(1x1, C->4C)                Conv(1x1, 4C->C)
Conv(3x3, C->C)                 Conv(3x3, 4C->4C)
Conv(1x1, 4C->C)                Conv(1x1, C->4C)

[Diagram shows two consecutive blocks of each type: the conventional shortcut connects the wide (4C-channel) tensors, while the inverted shortcut connects the narrow (C-channel) bottleneck tensors.]
Select your Project mentor
Assessment Items: Reminder-1

• Project Problem Statement Slide submission: September 09 EoD (11.59pm UAE time)

What to include in the Project Problem Statement Slides?
• The problem identified
• Baselines, reproduced baseline results (if any)
• Discussion on potential directions to solve
• Discussion on potential challenges and risks involved.
Assessment Items: Reminder-2

• Peer Review Report Submission for Project presentations: September 18, EoD (11.59pm UAE time)

• Attend the Project Problem Statement sessions in person
• Ask questions to peers
• Submit a peer-review report for the project presentations
  (Follow the CVPR review format: summary, strengths, weaknesses, suggestions to improve the proposed idea, and an overall rating of the proposed problem statement on a scale of 1-10. Do you have some questions to be answered by the team in the next presentation?)
ConvNeXts: A ConvNet for the 2020s, CVPR 2022
Citations: >5000
Why ConvNeXt?
• Identify several key components that contribute to the performance gain.
• ConvNeXt maintains the efficiency of standard ConvNets, and the fully-convolutional
nature for both training and testing.
• Re-examine the design spaces of ConvNet and test the limits of what a pure ConvNet
can achieve.
ConvNet vs ViT

• Were the ViT comparisons with CNNs fair, especially in the post-ViT era?
• ConvNeXt claims to bridge the gap between the pre-ViT and post-ViT eras for ConvNets.
• They test the limits of what a pure ConvNet can achieve.
• “Modernize” a standard ResNet toward the design of a vision Transformer.
ConvNeXt: Modernizing a ConvNet

• How do design decisions in Transformers impact ConvNets’ performance?


• Key idea: adopt designs at different levels from a Swin Transformer, while maintaining the network’s simplicity as a standard ConvNet.
• From ResNet to a ‘ConvNet that bears a resemblance to Transformers’.
Why ConvNeXt?

“Constructed entirely from standard ConvNet modules,


ConvNeXts compete favorably with Transformers in terms
of accuracy and scalability, achieving 87.8% ImageNet
top-1 accuracy and outperforming Swin Transformers on
COCO detection and ADE20K segmentation, while
maintaining the simplicity and efficiency of standard
ConvNets.”
ConvNeXt: Key Ideas
Key changes

Ø Advanced Training techniques


Ø Study a series of design decisions that are summarized as
• Macro design
• ResNeXt-ify
• Inverted bottleneck
• Large kernel size
• Various layer-wise micro designs
1. Advanced Training Techniques

Follow the advanced training techniques used in the Swin Transformer to obtain a baseline based on ResNet-50:
• The training is extended to 300 epochs from the original 90 epochs for ResNets.
• Use the AdamW optimizer.
• Use data augmentation techniques such as Mixup, CutMix, RandAugment, and Random Erasing.
• Use regularization schemes such as Label Smoothing.

• This enhanced training recipe increased the performance of the ResNet-50 model from 76.1% to 78.8% (+2.7%).
2. Macro Design

1. Changing stage compute ratio in ResNet

• The original design of the computation


distribution across stages in ResNet was
empirical.
• The heavy “res4” stage is generally used for
downstream tasks like object detection
• Following Swin-T design, the number of
blocks in each stage of ResNet is adjusted
from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3),
which also aligns the FLOPs with Swin-T.
• This improves the model accuracy
from 78.8% to 79.4%.
2. Macro Design: Changing stage compute ratio in ResNet

Residual Networks (recap)

[Same ResNet-18/34/50/101/152 stage table and architecture diagram as in the ResNet section above.]

He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Error rates are 224x224 single-crop testing, reported by torchvision
2. Macro Design

2. Modification to the stem in ResNet:
- Use a 4x4, stride 4 convolution with 96 filters.

Why?
• This simplified stem slightly increased the performance (by 0.1%) while slightly reducing the computation compared to the ResNet stem.
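A small PyTorch sketch (mine) of the two stems for comparison; the ConvNeXt paper additionally places a LayerNorm after its stem, which is omitted here:

```python
import torch
import torch.nn as nn

# ConvNeXt-style "patchify" stem: one 4x4, stride-4 conv with 96 filters,
# downsampling 224x224 -> 56x56 in a single step.
convnext_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# ResNet stem for comparison: 7x7 stride-2 conv + 3x3 stride-2 max pool.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(convnext_stem(x).shape)   # torch.Size([1, 96, 56, 56])
print(resnet_stem(x).shape)     # torch.Size([1, 64, 56, 56])
```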
3. ResNeXt-ify: Using depthwise convolutions
ResNeXt (recap)
• ResNeXt has a better FLOPs/accuracy trade-off than ResNet.
• The core component in ResNeXt is grouped convolution, where the
convolutional filters are separated into different groups.
• ResNeXt’s guiding principle is to “use more groups, expand width”.
• More precisely, ResNeXt employs grouped convolution for the 3x3 conv layer in
a bottleneck block. As this significantly reduces the FLOPs, the network width is
expanded to compensate for the capacity loss.
3. ResNeXt-ify: Using depthwise convolutions

• Use depthwise convolutions (grouped convolution with the number of groups equal to the number of channels).
  --- This reduces the number of FLOPs, but also the accuracy.

• Increase the network width, similar to ResNeXt,
  --- to the same number of channels as Swin-T’s (from 64 to 96).
4. Inverted Residual Block (recap)

(a) A conventional residual block connects the layers with a high number of channels.
(b) An inverted residual block connects the bottlenecks instead of the layers with a high number of channels.

The inverted residual block is more memory efficient.

Residual (bottleneck) block:    Inverted residual block:
Conv(1x1, C->4C)                Conv(1x1, 4C->C)
Conv(3x3, C->C)                 Conv(3x3, 4C->4C)
Conv(1x1, 4C->C)                Conv(1x1, C->4C)


4. Inverted Bottleneck

[Diagram: the ConvNeXt block with an inverted bottleneck, channel dimensions 384 and 96.]

How can this reduce the number of FLOPs?
Due to the significant FLOPs reduction in the downsampling residual blocks’ shortcut 1x1 conv layer.
→ It also increased the accuracy.
4. Inverted Bottleneck

Moving up depthwise convolution.


4. Inverted Residual Block

1. Residual block:    2. Inverted residual block:    3. Moving up depthwise convolution:
Conv(1x1, C->4C)      Conv(1x1, 4C->C)               Conv(1x1, 4C->C)
Conv(3x3, C->C)       Conv(3x3, 4C->4C)              Conv(1x1, C->4C)
Conv(1x1, 4C->C)      Conv(1x1, C->4C)               Conv(3x3, C->C)

(Blocks are drawn with the output at the top; in variant 3 the 3x3 depthwise conv runs first, on the narrow C channels, before the two 1x1 layers.)
4. Inverted Bottleneck

Moving up the depthwise convolution.

This reduced the accuracy. Then why did they do it? (The next step explains it: with the depthwise conv moved to the narrow end of the block, a larger kernel becomes affordable.)
5. Large Kernel Sizes

Increase the kernel size from 3x3, while maintaining the FLOPs:
• 7x7 provides the optimum accuracy.
• Beyond 7x7, performance won’t increase.
• 7x7 here has nearly the same FLOPs as 3x3.

Why a large kernel size?
- To have a global/larger receptive field, similar to ViT (different to what we learned from VGG: use only 3x3).
6. Micro Design Choices: Activation Function

1. Replace ReLU with GELU: same accuracy


--> similar to ViT

2. Fewer activation functions.


6. Micro Design Choices: Fewer Normalization Layers
1. Fewer Normalization layers: Improved the performance

2. Replacing BN with LN

• Directly substituting LN for BN in the original ResNet will result in suboptimal performance.
• With all the modifications in network architecture and training techniques, ConvNext model
does not have any difficulties training with LN. It provides slightly better accuracy of 81.5%.
Spatial downsampling in ResNet (recap)

Ø The spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv with stride 2.

Ø What about the shortcut connection of such a downsampling residual block?
• A 1x1 conv with stride 2 is used at the shortcut connection.
6. Micro Design Choices: Separate Downsampling Layers

• ConvNeXt first explored a strategy that uses 2x2 conv layers with stride 2 for spatial downsampling. This modification leads to diverged training.

• How to solve this?
Ø Adding normalization layers wherever the spatial resolution is changed can help stabilize training.
Ø These include several LN layers (also used in Swin Transformers): one before each downsampling layer, one after the stem, and one after the final global average pooling.
Ø This improves the accuracy to 82.0%.
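Putting the modernization steps together, here is a hedged sketch of a ConvNeXt-style block (mine): a 7x7 depthwise conv on the narrow dimension, one LayerNorm, an inverted 1x1 bottleneck (dim -> 4*dim -> dim) with a single GELU, and a residual connection. Details such as the channels-last LayerNorm, layer scale and stochastic depth are simplified or omitted assumptions.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """ConvNeXt-style block: depthwise 7x7 -> LayerNorm -> 1x1 expand (GELU) -> 1x1 project,
    with a residual connection around the whole thing."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv written as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # N,C,H,W -> N,H,W,C for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to N,C,H,W
        return shortcut + x

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlock(96)(x).shape)   # torch.Size([1, 96, 56, 56])
```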
Summary of Changes
Throughput Comparison
Experiments

• Scalability
• Used as a backbone for downstream applications such as detection,
segmentation.
Limitations.
CNN Architectures Summary
Early work (AlexNet -> VGG) shows that bigger networks work better

GoogLeNet one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck
convolutions, global avg pool instead of FC layers)

ResNet showed us how to train extremely deep networks – limited only by GPU
memory! Started to show diminishing returns as networks got bigger

After ResNet: Efficient networks became central: how can we improve the accuracy
without increasing the complexity?

Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet, etc

Neural Architecture Search promises to automate architecture design


Which Architecture should I use?

• If you care about accuracy, ResNet-50 or ResNet-101 are good choices among CNNs (try ConvNeXt also!)

• If you want an efficient network (real-time, run on mobile, etc.), try MobileNets and ShuffleNets.
Summary: what we learned

• How to compute FLOPs, parameters and memory requirements for a given network

• Computationally efficient convolutional operators
  - depthwise separable convolution and grouped convolution

• Key design principles of AlexNet, VGG, ResNet, ResNeXt, DenseNet, SENet, and MobileNets

• Detailed discussion of ConvNeXt.
