Week 3, Lecture 1 (Part 2)
CNN Architectures
Project Discussion
ImageNet Classification Challenge
[Figure: ImageNet top-5 error rate (%) by year; the chart data is summarized below]

  Year   Entry                        Depth        Error rate (%)
  2010   Lin et al                    Shallow      28.2
  2011   Sanchez & Perronnin          Shallow      25.8
  2012   Krizhevsky et al (AlexNet)   8 layers     16.4
  2013   Zeiler & Fergus              8 layers     11.7
  2014   Simonyan & Zisserman (VGG)   19 layers    7.3
  2014   Szegedy et al (GoogLeNet)    22 layers    6.7
  2015   He et al (ResNet)            152 layers   3.6
  2016   Shao et al                   152 layers   3.0
  2017   Hu et al (SENet)             152 layers   2.3
  Human  Russakovsky et al            -            5.1
AlexNet
[Figure: the ImageNet error-rate chart above, repeated; AlexNet (2012) brought the error down to 16.4% with an 8-layer network]
VGG: Deeper Networks, Regular Design
[Figure: AlexNet vs VGG-16 layer diagrams (conv stages, then FC 4096, FC 4096, FC 1000, Softmax)]

VGG design rules:
  All conv are 3x3, stride 1, pad 1
  All max pool are 2x2, stride 2
  After pool, double #channels

Two 3x3 convs have the same receptive field as a single 5x5 conv, but have fewer parameters and take less computation (a quick check in code follows):

Option 1: one Conv(5x5, C -> C)
  Params: 25C^2        FLOPs: 25C^2 * HW
Option 2: two Conv(3x3, C -> C)
  Params: 18C^2        FLOPs: 18C^2 * HW
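A minimal sketch (assuming PyTorch; the channel count C and feature-map size H x W are illustrative values) that checks the parameter and FLOP counts above:

import torch.nn as nn

C, H, W = 64, 56, 56  # illustrative channel count and feature-map size

# Option 1: one 5x5 conv, C -> C (bias omitted to match the 25C^2 count)
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

# Option 2: two stacked 3x3 convs, C -> C -> C (same receptive field as one 5x5)
conv3x2 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

params5 = sum(p.numel() for p in conv5.parameters())    # 25 * C^2 = 102400
params3 = sum(p.numel() for p in conv3x2.parameters())  # 18 * C^2 = 73728

# With stride 1 and "same" padding, multiply-add FLOPs = params * H * W
print(params5, params5 * H * W)   # 25C^2, 25C^2 * HW
print(params3, params3 * H * W)   # 18C^2, 18C^2 * HW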
VGG: Deeper Networks, Regular Design
Why double the channels after each pool? Halving the spatial size while doubling the channel count keeps the computation of each conv layer constant:

  Before pool:  Input: C x 2H x 2W    Layer: Conv(3x3, C -> C)      FLOPs: 36HWC^2
  After pool:   Input: 2C x H x W     Layer: Conv(3x3, 2C -> 2C)    FLOPs: 36HWC^2

Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015
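A quick arithmetic check of this rule (plain Python; C, H, W are illustrative):

def conv3x3_flops(c_in, c_out, h_out, w_out):
    # multiply-adds of a 3x3 conv: 9 * C_in * C_out * H_out * W_out
    return 9 * c_in * c_out * h_out * w_out

C, H, W = 64, 28, 28
before_pool = conv3x3_flops(C, C, 2 * H, 2 * W)   # input C x 2H x 2W
after_pool = conv3x3_flops(2 * C, 2 * C, H, W)    # input 2C x H x W
print(before_pool, after_pool, before_pool == after_pool)  # both equal 36 * H * W * C^2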
AlexNet vs VGG-16: VGG is a much bigger network!

            Memory           Params          FLOPs
  AlexNet   1.9 MB           61M             0.7 GFLOP
  VGG-16    48.6 MB (25x)    138M (2.3x)     13.6 GFLOP (19.4x)
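A small sketch (assuming PyTorch and torchvision are installed) that reproduces the parameter counts in this table from the standard torchvision model definitions:

import torchvision.models as models

def count_params(model):
    return sum(p.numel() for p in model.parameters())

alexnet = models.alexnet(weights=None)   # no pretrained weights needed for counting
vgg16 = models.vgg16(weights=None)

print(f"AlexNet params: {count_params(alexnet) / 1e6:.1f}M")  # ~61M
print(f"VGG-16  params: {count_params(vgg16) / 1e6:.1f}M")    # ~138M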
GoogLeNet: Focus on Efficiency
[Figure: per-layer memory, parameter, and FLOP breakdown for AlexNet vs VGG-16; totals match the table above (AlexNet: 1.9 MB, 61M params, 0.7 GFLOP; VGG-16: 48.6 MB, 138M params, 13.6 GFLOP)]
GoogLeNet: Global Average Pooling
No large FC layers at the end! Instead, global average pooling collapses the spatial dimensions, and a single linear layer produces the class scores. (Recall VGG-16: most of its parameters were in the FC layers!)
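A minimal PyTorch sketch of such a head; the channel count (1024) and 7x7 feature-map size are illustrative:

import torch
import torch.nn as nn

feat = torch.randn(1, 1024, 7, 7)      # final conv feature map: C x H x W

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),           # C x 7 x 7 -> C x 1 x 1 (global average pool)
    nn.Flatten(),                      # -> C
    nn.Linear(1024, 1000),             # single linear layer -> 1000 class scores
)
print(gap_head(feat).shape)            # torch.Size([1, 1000])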
Training using only a loss at the end of the network didn't work well: the network is too deep, and gradients don't propagate cleanly.
Residual Networks
Outline
Residual Networks
Once we have Batch Normalization, we can train networks with 10+ layers.
What happens as we go deeper?
[Figure: training and test error vs iterations for 20-layer and 56-layer plain networks]
The deeper (56-layer) model does worse than the shallower (20-layer) model!
In fact, the deep model seems to be underfitting: it also performs worse than the shallow model on the training set.
He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
Residual Networks
Solution: change the network so that learning identity functions with extra layers is easy!

  "Plain" block: computes H(x) directly via conv -> relu -> conv.
  Residual block: computes F(x) + x, where F(x) = conv -> relu -> conv and x is carried around the convs by an additive "shortcut" connection.

If you set the conv weights to 0, the whole residual block computes the identity function!

He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
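A minimal PyTorch sketch of a "basic" residual block (the conv-BN-ReLU ordering is an assumed simplification; downsampling variants are omitted):

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)      # additive shortcut: F(x) + x

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # output shape equals input shape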
Residual Networks
A residual network is a stack of many residual blocks.
Regular design, like VGG: each residual block has two 3x3 convs.
The network is organized into stages of residual blocks (3x3 convs with 64, 128, ..., 512 channels), with a stem of 7x7 conv, 64, stride 2 followed by a pool at the input.
Like GoogLeNet, there are no big fully-connected layers at the end: instead, a global pool feeds a single FC 1000 layer and a softmax.
[Figure: full ResNet architecture diagram, from Input through the conv stages to Pool, FC 1000, Softmax]
He et al, "Deep Residual Learning for Image Recognition", CVPR 2016
Residual Networks
ResNet-18 vs ResNet-34: ImageNet top-5 error 10.92 vs 8.58 (224x224 single-crop testing, as reported by torchvision).
Residual Networks: "Basic" vs Bottleneck blocks

  "Basic" residual block:
    Conv(3x3, C -> C)    FLOPs: 9HWC^2
    Conv(3x3, C -> C)    FLOPs: 9HWC^2
    Total: 18HWC^2

  Bottleneck residual block (used in the deeper ResNets):
    Conv(1x1, 4C -> C)   FLOPs: 4HWC^2
    Conv(3x3, C -> C)    FLOPs: 9HWC^2
    Conv(1x1, C -> 4C)   FLOPs: 4HWC^2
    Total: 17HWC^2
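A minimal PyTorch sketch of the bottleneck block (norm and activation placement is a common simplification, not taken verbatim from the paper):

import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        # 1x1 reduce (4C -> C), 3x3 (C -> C), 1x1 expand (C -> 4C)
        self.branch = nn.Sequential(
            nn.Conv2d(4 * c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, 4 * c, 1, bias=False), nn.BatchNorm2d(4 * c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + x)   # shortcut around the whole branch

x = torch.randn(1, 256, 28, 28)                # 4C = 256 channels
print(BottleneckBlock(64)(x).shape)            # [1, 256, 28, 28]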
Residual Networks
This is a great baseline architecture for many tasks even today!
Residual Networks: block design
Swapping the order inside the residual branch from Conv -> Batch Norm -> ReLU to the "pre-activation" order Batch Norm -> ReLU -> Conv gives a slight improvement in accuracy (ImageNet top-1 error):
  ResNet-152: 21.3 vs 21.1
  ResNet-200: 21.8 vs 20.7
Not actually used that much in practice.
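A minimal PyTorch sketch of the pre-activation ordering (a simplified block for illustration, not the exact paper configuration):

import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # BN -> ReLU -> Conv, twice; the shortcut path stays a clean identity
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.branch(x)   # no ReLU after the addition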
ResNeXt: parallel pathways inside the bottleneck block

  Standard bottleneck block:
    Conv(1x1, 4C -> C)   FLOPs: 4HWC^2
    Conv(3x3, C -> C)    FLOPs: 9HWC^2
    Conv(1x1, C -> 4C)   FLOPs: 4HWC^2
    Total FLOPs: 17HWC^2

  ResNeXt block: G parallel pathways, each with a small internal width c:
    Conv(1x1, 4C -> c)   FLOPs: 4HWCc   (per pathway)
    Conv(3x3, c -> c)    FLOPs: 9HWc^2
    Conv(1x1, c -> 4C)   FLOPs: 4HWCc
    Total FLOPs: (8Cc + 9c^2) * HW * G

  Equal cost to the standard bottleneck when 9Gc^2 + 8GCc - 17C^2 = 0.
  Example: C=64, G=4, c=24;  or  C=64, G=32, c=4.

Equivalent formulation with grouped convolution: the G parallel pathways can be implemented as a single block whose 3x3 conv uses grouped convolution (a sketch follows the citation).

Xie et al, "Aggregated residual transformations for deep neural networks", CVPR 2017
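A minimal PyTorch sketch of the grouped-convolution formulation (values follow the C=64, G=32, c=4 example; the shortcut and normalization layers are omitted):

import torch
import torch.nn as nn

C, G, c = 64, 32, 4

resnext_branch = nn.Sequential(
    nn.Conv2d(4 * C, G * c, 1, bias=False),                       # 1x1 reduce: 4C -> G*c
    nn.Conv2d(G * c, G * c, 3, padding=1, groups=G, bias=False),  # grouped 3x3 = G parallel c -> c convs
    nn.Conv2d(G * c, 4 * C, 1, bias=False),                       # 1x1 expand: G*c -> 4C
)
x = torch.randn(1, 4 * C, 14, 14)
print(resnext_branch(x).shape)   # [1, 256, 14, 14]; add the shortcut for the full block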
ResNeXt: Maintain computation by adding groups!
  Model        Groups  Group width  Top-1 Error      Model         Groups  Group width  Top-1 Error
  ResNet-50    1       64           23.9             ResNet-101    1       64           22.0
  ResNeXt-50   2       40           23.0             ResNeXt-101   2       40           21.7
  ResNeXt-50   4       24           22.6             ResNeXt-101   4       24           21.4
  ResNeXt-50   8       14           22.3             ResNeXt-101   8       14           21.3
  ResNeXt-50   32      4            22.2             ResNeXt-101   32      4            21.2
Xie et al, “Aggregated residual transformations for deep neural networks”, CVPR 2017
Squeeze-and-Excitation Networks
Completion of the challenge: the annual ImageNet competition is no longer held after 2017; it has now moved to Kaggle.
Densely Connected Neural Networks (DenseNet)
Dense blocks, where each layer is connected to every other layer in a feed-forward fashion: this strengthens feature propagation and encourages feature reuse.
[Figure: DenseNet architecture; dense blocks of Conv + Concat layers, separated by 1x1 conv and Pool transition layers, followed by FC and Softmax]
Huang et al, "Densely Connected Convolutional Networks", CVPR 2017
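A minimal PyTorch sketch of one dense layer and how the channel count grows through a block (the growth rate and sizes are illustrative):

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # concatenate the new features with the input, so every later layer
        # sees the feature maps of all earlier layers
        return torch.cat([x, self.layer(x)], dim=1)

block = nn.Sequential(DenseLayer(64), DenseLayer(96), DenseLayer(128))
print(block(torch.randn(1, 64, 28, 28)).shape)   # channels grow 64 -> 160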
MobileNets: Tiny Networks (For Mobile Devices)

  Standard convolution block:                     Total cost: 9C^2 * HW
    Conv(3x3, C -> C)                9C^2 * HW
    Batch Norm, ReLU

  Depthwise separable convolution:                Total cost: (9C + C^2) * HW
    Conv(3x3, C -> C, groups=C)      9C * HW    "Depthwise convolution"
    Batch Norm, ReLU
    Conv(1x1, C -> C)                C^2 * HW   "Pointwise convolution"
    Batch Norm, ReLU

  Speedup = 9C^2 / (9C + C^2) = 9C / (9 + C)  ->  9  (as C >> 9)

(A sketch of the two blocks follows the citation.)
Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", 2017
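A minimal PyTorch sketch comparing the weights of the two blocks (C is an illustrative channel count; BatchNorm parameters are excluded from the count):

import torch.nn as nn

C = 256

standard = nn.Conv2d(C, C, 3, padding=1, bias=False)          # 9 * C^2 weights

depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),      # depthwise: 9 * C weights
    nn.BatchNorm2d(C), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 1, bias=False),                           # pointwise: C^2 weights
    nn.BatchNorm2d(C), nn.ReLU(inplace=True),
)

count = lambda m: sum(p.numel() for p in m.parameters() if p.ndim == 4)  # conv weights only
print(count(standard), count(depthwise_separable))   # 9C^2 = 589824 vs 9C + C^2 = 67840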
Inverted Residual Block
(a) A conventional residual block connects the layers with a high number of channels.
(b) An inverted residual block instead connects the bottlenecks, rather than the layers with a high number of channels.
ConvNeXt starting point: follow the advanced training techniques used in the Swin Transformer to obtain a baseline based on ResNet-50.
• Training is extended to 300 epochs, from the original 90 epochs used for ResNets.
• Use the AdamW optimizer.
• Use data augmentation techniques such as Mixup, CutMix, RandAugment, and Random Erasing.
• Use regularization schemes such as label smoothing.
(A sketch of a few of these pieces follows the list.)
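A sketch of some of these ingredients using standard PyTorch/torchvision APIs (the hyperparameter values are illustrative, not the exact recipe):

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

model = torchvision.models.resnet50(weights=None)              # the ResNet-50 baseline

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)           # label smoothing

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(),                                  # RandAugment
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                          # Random Erasing (on tensors)
])
# Mixup / CutMix would be applied to whole batches inside the training loop.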
ResNet configurations (He et al, "Deep Residual Learning for Image Recognition", CVPR 2016):

  Model       Block type  Stem layers  Stage 1         Stage 2         Stage 3         Stage 4         FC layers  GFLOP  Top-5 error
                                       blocks/layers   blocks/layers   blocks/layers   blocks/layers
  ResNet-18   Basic       1            2 / 4           2 / 4           2 / 4           2 / 4           1          1.8    10.92
  ResNet-34   Basic       1            3 / 6           4 / 8           6 / 12          3 / 6           1          3.6    8.58
  ResNet-50   Bottleneck  1            3 / 9           4 / 12          6 / 18          3 / 9           1          3.8    7.13
  ResNet-101  Bottleneck  1            3 / 9           4 / 12          23 / 69         3 / 9           1          7.6    6.44
Why?
(a) A conventional residual block connects the layers with a high number of channels.
(b) An inverted residual block instead connects the bottlenecks (in ConvNeXt, the width expands from 96 to 384 and back to 96).
Goal: increase the kernel size from 3x3 while maintaining the FLOPs.
2. Replacing BN with LN
• Directly substituting LN for BN in the original ResNet results in suboptimal performance.
• With all the modifications to the network architecture and training techniques, the ConvNeXt model has no difficulty training with LN, which provides slightly better accuracy of 81.5%.
Spatial downsampling in ResNet (recap)
Ø Spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv with stride 2.
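A minimal PyTorch sketch of such a stage-transition block (the 1x1 stride-2 projection on the shortcut is an assumed standard choice):

import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        c_out = 2 * c_in
        self.branch = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),  # halve H, W; double channels
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2, bias=False)  # project shortcut to match

    def forward(self, x):
        return torch.relu(self.branch(x) + self.shortcut(x))

print(DownsampleBlock(64)(torch.randn(1, 64, 56, 56)).shape)   # [1, 128, 28, 28]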
• Scalability
• Used as a backbone for downstream applications such as detection and segmentation.
Limitations.
CNN Architectures Summary
Early work (AlexNet -> VGG) showed that bigger networks work better.
GoogLeNet was one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global average pooling instead of FC layers).
ResNet showed us how to train extremely deep networks, limited only by GPU memory! It started to show diminishing returns as networks got bigger.
After ResNet, efficient networks became central: how can we improve accuracy without increasing complexity?
• If you care about accuracy, ResNet-50 or ResNet-101 are good choices among CNNs (try ConvNeXt too!)
• How to compute FLOPs, parameters, and memory requirements for a given network
• Key design principles of AlexNet, VGG, ResNet, ResNeXt, DenseNet, SENet, and MobileNets