Lec 05 - CNN Deep Learning
CS-878, Week 05
A closer look at spatial dimensions:

32x32x3 image convolved with a 5x5x3 filter => 28x28 activation map ((32 - 5)/1 + 1 = 28).

e.g. 7x7 input (spatially), 3x3 filter:
- applied with stride 1 => 5x5 output
- applied with stride 3 => doesn't fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.

(recall: output size = (N - F) / stride + 1)
In practice, it is common to zero-pad the border.

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output!

(recall: output size = (N + 2P - F) / stride + 1)
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F - 1)/2, which preserves the spatial size:
- F = 3 => zero pad with 1
- F = 5 => zero pad with 2
- F = 7 => zero pad with 3
Stacking convolutions: 32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> ...
Example:
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10
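A quick sanity check of the output-size formula (a minimal sketch; the conv_out helper is just for illustration):

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (N + 2P - F) / stride + 1."""
    assert (n + 2 * pad - f) % stride == 0, "filter doesn't fit!"
    return (n + 2 * pad - f) // stride + 1

print(conv_out(7, 3, stride=1, pad=0))   # 5
print(conv_out(7, 3, stride=1, pad=1))   # 7  (padding preserves size)
print(conv_out(32, 5, stride=1, pad=2))  # 32 -> output volume 32x32x10
# conv_out(7, 3, stride=3) trips the assert: a 3x3 filter at stride 3 doesn't fit 7x7

# Parameter count, assuming one bias per filter: 10 filters of 5x5x3
print(10 * (5 * 5 * 3 + 1))              # 760
```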
Receptive fields: be careful about "receptive field in the input" vs. "receptive field in the previous layer".

Problem: for large images, we need many layers for each output to "see" the whole image.

Solution: downsample inside the network. (Slide inspiration: Justin Johnson)
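A rough sketch of why (my own illustration, not from the slides): with stride-1 3x3 convs, the receptive field grows by only 2 pixels per layer:

```python
# Receptive field of a stack of stride-1 FxF convs grows linearly: 1 + (F-1)*L.
def layers_to_cover(image_size, f=3):
    layers, rf = 0, 1
    while rf < image_size:
        rf += f - 1  # each stride-1 conv adds F-1 pixels of context
        layers += 1
    return layers

print(layers_to_cover(7))    # 3 layers for a 7x7 input
print(layers_to_cover(224))  # 112 layers for a 224x224 image!
```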
e.g. 7x7 input (spatially), 3x3 filter applied with stride 2 => 3x3 output!
1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product => 56x56x32 output.
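A shape check in PyTorch (a minimal sketch):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)               # 56x56x64 input (NCHW layout)
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)   # 32 filters of size 1x1x64
print(conv1x1(x).shape)                      # torch.Size([1, 32, 56, 56])
```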
32x32x3 image, 5x5x3 filter. Convolving produces 1 number per location: the result of taking a dot product between the filter and a 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product).
Fully-connected layer: stretch the 32x32x3 image into a 3072x1 vector and multiply by a 10x3072 weight matrix W. Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
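The same computation as a numpy sketch (random data for illustration):

```python
import numpy as np

x = np.random.randn(32, 32, 3).reshape(3072)  # stretch image to a 3072-vector
W = np.random.randn(10, 3072)                  # 10x3072 weights
scores = W @ x                                 # each score: one 3072-dim dot product
print(scores.shape)                            # (10,)
```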
MAX POOLING
- No learnable parameters (number of parameters: 0)
- Introduces spatial invariance
- Takes the maximum over each pooling window

(Figure: a small input map max-pooled window by window. Slide credit: Fei-Fei Li, Ehsan Adeli)
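A quick PyTorch check that pooling downsamples and has no parameters (sizes are illustrative):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 224, 224)
print(pool(x).shape)                              # torch.Size([1, 64, 112, 112])
print(sum(p.numel() for p in pool.parameters()))  # 0 -- no learnable parameters
```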
Rectified Linear Units (ReLUs): every negative value in the feature map is replaced with 0.

Input feature map (7x7):
 0.77 -0.11  0.11  0.33  0.55 -0.11  0.33
-0.11  1.00 -0.11  0.33 -0.11  0.11 -0.11
 0.11 -0.11  1.00 -0.33  0.11 -0.11  0.55
 0.33  0.33 -0.33  0.55 -0.33  0.33  0.33
 0.55 -0.11  0.11 -0.33  1.00 -0.11  0.11
-0.11  0.11 -0.11  0.33 -0.11  1.00 -0.11
 0.33 -0.11  0.55  0.33  0.11 -0.11  0.77

After ReLU (7x7):
0.77 0    0.11 0.33 0.55 0    0.33
0    1.00 0    0.33 0    0.11 0
0.11 0    1.00 0    0.11 0    0.55
0.33 0.33 0    0.55 0    0.33 0.33
0.55 0    0.11 0    1.00 0    0.11
0    0.11 0    0.33 0    1.00 0
0.33 0    0.55 0.33 0.11 0    0.77
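The same operation in one line of PyTorch (using the first row above):

```python
import torch

x = torch.tensor([0.77, -0.11, 0.11, 0.33, 0.55, -0.11, 0.33])
print(torch.relu(x))  # tensor([0.7700, 0.0000, 0.1100, 0.3300, 0.5500, 0.0000, 0.3300])
```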
Layers get stacked: the output of one becomes the input of the next.

(Figure: a small black-and-white "X" image passes through Convolution -> ReLU -> Pooling; each stage's feature maps become the next stage's input, and repeated Conv/ReLU/Pool stacks shrink the maps while keeping the strongest responses, until a final layer of values separates "X" from "O".)

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
CNNs apply to any data with a 2D grid structure:
- Images: rows of pixels x columns of pixels
- Audio: time steps x intensity in each frequency band
- Text: position x words in a dictionary

Parameter sharing: for a 6x6x1 output produced by a 3x3x1 filter,
- Total connections? 6x6x1 x 3x3x1 = 324
- Total unique parameters? 3x3x1 + bias = 10
e.g. 8x8 image, 3x3 filter with stride 1 (=> 6x6 output); 2x2 pooling with stride 2 (=> 3x3 output).
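Checking the arithmetic in Python (a sketch):

```python
# 8x8 image, 3x3 filter, stride 1 -> 6x6 output
out = (8 - 3) // 1 + 1                  # 6
total_connections = out * out * 3 * 3   # every output looks at 9 inputs: 324
unique_parameters = 3 * 3 * 1 + 1       # one shared filter + bias: 10
print(out, total_connections, unique_parameters)
```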
Example (VGG-16): per-layer memory (activations) and parameters:
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
Fei-Fei Li & Andrej Karpathy & Justin Johnson
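A sketch that re-derives the parameter total from the table (conv/FC shapes copied from above; biases ignored):

```python
# (out_spatial, out_depth, in_depth, kernel) for each conv layer in the table.
convs = [(224, 64, 3, 3), (224, 64, 64, 3),
         (112, 128, 64, 3), (112, 128, 128, 3),
         (56, 256, 128, 3), (56, 256, 256, 3), (56, 256, 256, 3),
         (28, 512, 256, 3), (28, 512, 512, 3), (28, 512, 512, 3),
         (14, 512, 512, 3), (14, 512, 512, 3), (14, 512, 512, 3)]
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

params = sum(k * k * cin * cout for _, cout, cin, k in convs) \
       + sum(cin * cout for cin, cout in fcs)
print(f"{params:,}")  # 138,344,128 -> ~138M parameters;
                      # note that 102,760,448 of these sit in the first FC layer
```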
Components of CNNs: Convolution layers, Pooling layers, Fully-connected layers, Activation functions.
CNN Architectures

Case Studies:
- AlexNet
- VGG
- ResNet

Also...
- GoogLeNet
- ZFNet
- SENet
- Wide ResNet
- ResNeXt
- DenseNet
- MobileNets
- NASNet
- EfficientNet
(Figure: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners by year, with error dropping as depth grows to 19 layers (VGG) and 22 layers (GoogLeNet): Lin et al; Sanchez & Perronnin; Krizhevsky et al (AlexNet); Zeiler & Fergus; Simonyan & Zisserman (VGG); Szegedy et al (GoogLeNet); He et al (ResNet); Shao et al; Hu et al (SENet); human baseline from Russakovsky et al.)
Case Study: AlexNet
[Krizhevsky et al. 2012]

Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
MAX POOL3
FC6
FC7
FC8

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Small filters, deeper networks.
-> 7.3% top-5 error in ILSVRC'14

(Figure: AlexNet vs. VGG16 vs. VGG19 as vertical stacks: Input, repeated [3x3 conv, Pool] stages up to 3x3 conv, 512, then FC 4096, FC 4096, FC 1000, Softmax.)
Why use small filters? A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as a single 7x7 conv layer, but is deeper, contains more non-linearities, and uses fewer parameters.
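Checking the parameter claim (C = 256 channels chosen arbitrarily, biases ignored):

```python
C = 256
three_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 conv layers
one_7x7 = 7 * 7 * C * C          # a single 7x7 conv layer
print(three_3x3, one_7x7)        # 1,769,472 vs 3,211,264 -- fewer params for the
                                 # same 7x7 receptive field, plus three ReLUs
```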
(Figure: the VGG-16 memory/parameter table from above, shown alongside the VGG16 network diagram with its stages named conv1-1 through conv5-3.)
Case Study: GoogLeNet
[Szegedy et al., 2014]
Inception module
Naive Inception module: apply parallel filter operations to the input from the previous layer (1x1 conv with 128 filters, 3x3 conv with 192 filters, 5x5 conv with 96 filters, and 3x3 pooling), then concatenate all the filter outputs depth-wise.

Example: module input 28x28x256.

Q1: What are the output sizes of all the different filter operations?
A: 28x28x128 (1x1 conv, 128); 28x28x192 (3x3 conv, 192); 28x28x96 (5x5 conv, 96); 28x28x256 (3x3 pool).

Q2: What is the output size after filter concatenation?
A: 28x28x(128 + 192 + 96 + 256) = 28x28x672.

Q: What is the problem with this? [Hint: computational complexity]
Conv ops (module input 28x28x256):
[1x1 conv, 128]: 28x28x128 x 1x1x256
[3x3 conv, 192]: 28x28x192 x 3x3x256
[5x5 conv, 96]: 28x28x96 x 5x5x256
Total: 854M ops -- very expensive compute!
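Re-deriving the operation count (one multiply per filter weight per output position; a sketch, not the paper's exact accounting):

```python
hw = 28 * 28
ops = (hw * 128 * 1 * 1 * 256    # 1x1 conv, 128  -> ~26M
     + hw * 192 * 3 * 3 * 256    # 3x3 conv, 192  -> ~347M
     + hw * 96 * 5 * 5 * 256)    # 5x5 conv, 96   -> ~482M
print(f"{ops/1e6:.0f}M ops")     # 854M ops
```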
Review: 1x1 convolutions

1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product => 56x56x32 output. It preserves the spatial dimensions but reduces the depth!

Alternatively, interpret it as applying the same FC layer (64 -> 32) to each input pixel.
Naive Inception module vs. Inception module with dimension reduction: add 1x1 conv "bottleneck" layers that reduce the feature depth before the expensive convs.

Using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks, the module output is 28x28x480.
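A minimal PyTorch sketch of an Inception module with dimension reduction, assuming "1x1 conv, 64" bottlenecks before the 3x3/5x5 branches and after the pool branch (consistent with the 28x28x480 output above):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Sketch: naive branch widths (128/192/96) with 1x1, 64 bottlenecks added.
    def __init__(self, in_ch=256):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 128, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU(),
                                nn.Conv2d(64, 192, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU(),
                                nn.Conv2d(64, 96, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 64, 1))

    def forward(self, x):
        # Filter concatenation along the depth dimension: 128+192+96+64 = 480
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 480, 28, 28])
```

Under these assumptions the multiply count drops to roughly 271M, about a 3x saving over the naive module's 854M.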
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet architecture:
- Stem network: Conv - Pool - 2x Conv - Pool
- Stacked Inception modules
- Classifier output: AvgPool collapses HxWxC to 1x1xC, instead of expensive FC layers

- 22 layers
- Efficient "Inception" module
- Avoids expensive FC layers
- 12x fewer params than AlexNet
- 27x fewer params than VGG-16
- ILSVRC'14 classification winner (6.7% top-5 error)
Case Study: ResNet
[He et al., 2015]

Very deep networks using residual connections.
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top-5 error)
- Swept the classification and detection competitions in ILSVRC'15 and COCO'15

(Figure: a residual block computes F(x) + x -> relu, with a conv-relu-conv path F(x) and an identity shortcut; the full network starts with 7x7 conv, 64, stride 2 and Pool, stacks 3x3 conv blocks of growing width (64, 128, ...), and ends with Pool, FC 1000, Softmax.)
What happens when we stack deeper plain layers?

(Figure: training error and test error vs. iterations: a 56-layer plain network has higher training error AND higher test error than a 20-layer network, so the deeper model's failure is not simple overfitting.)
Solution: use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.

"Plain" layers: x -> conv -> relu -> conv -> H(x); the stack must learn H(x) directly.

Residual block: x -> conv -> relu -> conv gives F(x), and an identity shortcut adds x back:
H(x) = F(x) + x (followed by relu)
- Identity mapping: H(x) = x if F(x) = 0
- Use layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly
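A minimal PyTorch sketch of a basic residual block (batch norm omitted for brevity; an illustration, not the paper's exact block):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 convs fit F(x); the identity shortcut adds x back.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```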
Full ResNet architecture:
- Stack residual blocks; every residual block has two 3x3 conv layers (widths 64, 128, ...)
- Beginning conv layer: 7x7 conv, 64, stride 2, followed by Pool
- No FC layers at the end besides the final FC 1000 to the output classes (Pool - FC 1000 - Softmax)
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet). For a 28x28x256 input:
- 1x1 conv, 64 filters projects down to 28x28x64 (BN, relu)
- 3x3 conv, 64 operates over only 64 feature maps (BN, relu)
- 1x1 conv, 256 filters projects back to 256 feature maps (28x28x256 output)
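A matching sketch of the bottleneck block (BN/ReLU placement simplified):

```python
import torch
import torch.nn as nn

# Project down with 1x1, do the 3x3 work at depth 64, project back up with 1x1,
# then add the shortcut.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 256, 1), nn.BatchNorm2d(256),
)
x = torch.randn(1, 256, 28, 28)
print((bottleneck(x) + x).shape)  # torch.Size([1, 256, 28, 28])
```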
Comparing complexity...

(Figure: accuracy vs. operations for many architectures, marker size indicating parameter count. Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.)

- Inception-v4: ResNet + Inception!
- VGG: most parameters, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, but still memory-heavy and lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy
Improving ResNets...
Wide Residual Networks [Zagoruyko et al. 2016]

Densely Connected Convolutional Networks (DenseNet):
- Dense blocks where each layer is connected to every other layer in a feedforward fashion
- Alleviates vanishing gradients, propagating information throughout the network
- Gives better performance

(Figure: dense blocks of Conv/Concat layers, joined by 1x1 conv + Pool transition layers.)
Efficient networks...
MobileNets: Efficient Convolutional Neural Networks for Mobile Applications [Howard et al. 2017]

- Depthwise separable convolutions replace standard convolutions by factorizing them into a depthwise convolution and a 1x1 (pointwise) convolution
- Much more efficient, with little loss in accuracy
- Follow-up MobileNetV2 work in 2018 (Sandler et al.)
- ShuffleNet: Zhang et al, CVPR 2018

Standard network: Conv (3x3, C->C) + BatchNorm + Pool. Total compute: 9C^2HW.
MobileNets: depthwise Conv (3x3, C->C, groups=C) at 9CHW, then pointwise Conv (1x1, C->C) at C^2HW. Total compute: 9CHW + C^2HW.
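A sketch of a depthwise separable convolution in PyTorch (C = 64 chosen arbitrarily):

```python
import torch
import torch.nn as nn

C = 64
# Depthwise: one 3x3 filter per channel (groups=C) -> 9*C*H*W multiplies
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)
# Pointwise: 1x1 conv mixing channels -> C^2*H*W multiplies
pointwise = nn.Conv2d(C, C, kernel_size=1)

x = torch.randn(1, C, 56, 56)
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 64, 56, 56])

# Multiplies per spatial position: 9C + C^2 vs 9C^2 for a standard 3x3 conv
print(9 * C + C * C, 9 * C * C)       # 4672 vs 36864 -- roughly 8x fewer
```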
Learning to search for network architectures...
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al. 2017]

- Applying neural architecture search (NAS) to a large dataset like ImageNet is expensive
- Design a search space of building blocks ("cells") that can be flexibly stacked
- NASNet: use NAS to find the best cell structure on the smaller CIFAR-10 dataset, then transfer the architecture to ImageNet
- Many follow-up works in this space, e.g. AmoebaNet (Real et al. 2019) and ENAS (Pham, Guan et al. 2018)
But sometimes a smart heuristic is better than NAS...
EfficientNet: Smart Compound Scaling [Tan and Le, 2019]

- Increase network capacity by scaling width, depth, and resolution, while balancing accuracy and efficiency
- Search for an optimal set of compound scaling factors given a compute budget (target memory & FLOPs)
- Scale up using smart heuristic rules

Efficient networks...
https://openai.com/blog/ai-and-efficiency/
Transfer learning

Do you need a lot of data if you want to train/use CNNs? Not necessarily, thanks to transfer learning.
Transfer Learning with CNNs

1. Train on ImageNet: the full network ends in FC-4096, FC-4096, FC-1000.

2. Small dataset (C classes): replace FC-1000 with FC-C; reinitialize this layer and train it, and freeze all the earlier layers.

(Figure: AlexNet first-layer filters (64 x 3 x 11 x 11); two copies of a VGG-style stack (Conv-64 ... Conv-512, MaxPool, FC-4096) with the final layer swapped from FC-1000 to FC-C, annotated "Reinitialize this and train" and "Freeze these".)

Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014.
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014.
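A minimal PyTorch sketch of steps 1-2, assuming a recent torchvision (the C = 10 target classes are arbitrary):

```python
import torch.nn as nn
from torchvision import models

C = 10
# Step 1 done for us: load an ImageNet-pretrained VGG16.
model = models.vgg16(weights="IMAGENET1K_V1")

# Freeze these: no gradients for the pretrained layers.
for p in model.parameters():
    p.requires_grad = False

# Reinitialize this and train: swap FC-1000 for FC-C (new layer is trainable).
model.classifier[6] = nn.Linear(4096, C)
```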
Transfer Learning with CNNs: what to do depends on your dataset.

Lower conv layers are more generic; higher layers are more specific to the original task.

- Very little data, very similar dataset: use a linear classifier on the top layer.
- Very little data, very different dataset: you're in trouble... try a linear classifier on activations from different stages.
- Quite a lot of data, very similar dataset: finetune a few layers.
- Quite a lot of data, very different dataset: finetune a larger number of layers or start from scratch!
Main takeaways

- AlexNet showed that you can use CNNs to train computer vision models.
- VGG showed that bigger networks work better.
- ResNet showed us how to train extremely deep networks:
  - limited only by GPU & memory!
  - showed diminishing returns as networks got bigger.
- After ResNet: CNNs surpassed the human baseline on ImageNet, and focus shifted to other topics:
  - Efficient networks: MobileNet, ShuffleNet