Lec - 05 - CNN Deep Learning


Deep Learning

CS-878

Week-05
A closer look at spatial dimensions:

32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
=> 28x28x1 activation map

Fei-Fei Li, Ehsan Adeli


A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter

=> 5x5 output

A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter, applied with stride 2

=> 3x3 output!

A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter, applied with stride 3?

Doesn't fit! Cannot apply a 3x3 filter to a 7x7 input with stride 3.


Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer, so stride 3 doesn't fit)
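The output-size rule can be sketched as a small helper (illustrative, not from the slides; the function name is ours):

```python
# Output-size rule for convolution: (N + 2P - F)/stride + 1.
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size for an NxN input and FxF filter."""
    size = (n + 2 * pad - f) / stride + 1
    if size != int(size):
        raise ValueError(f"{f}x{f} filter with stride {stride} does not fit a {n}x{n} input")
    return int(size)

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises ValueError, matching "doesn't fit" above
```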
In practice: common to zero-pad the border

e.g. input 7x7, 3x3 filter applied with stride 1,
pad with a 1-pixel border => what is the output?

7x7 output!

(recall: output size = (N + 2P - F) / stride + 1)

In general, it is common to see CONV layers with stride 1, filters
of size FxF, and zero-padding with (F-1)/2. (This preserves the
spatial size.)
e.g. F = 3 => zero-pad with 1
F = 5 => zero-pad with 2
F = 7 => zero-pad with 3


Remember back to...
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 -> ...). Shrinking too fast is not good; it doesn't work well.

32x32x3 -> CONV (e.g. 6 5x5x3 filters), ReLU -> 28x28x6
        -> CONV (e.g. 10 5x5x6 filters), ReLU -> 24x24x10 -> ...
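The shrinkage can be checked directly with the output-size rule (a sketch using the slide's numbers: 5x5 filters, stride 1, no padding):

```python
# Each 5x5 conv with stride 1 and no padding shrinks the map by 4: (N - F)/stride + 1.
size = 32
sizes = [size]
for _ in range(3):
    size = (size - 5) // 1 + 1
    sizes.append(size)
print(sizes)  # [32, 28, 24, 20]
```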


Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size:
(32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10


Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76 * 10 = 760
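The parameter count above follows one formula, sketched here as a helper (name is ours, not from the slides):

```python
# Conv-layer parameter count: each filter has F*F*C weights plus one bias.
def conv_params(num_filters, f, in_channels):
    return num_filters * (f * f * in_channels + 1)

print(conv_params(10, 5, 3))  # 760, as on the slide
```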


Receptive Fields
For convolution with kernel size K, each element in the
output depends on a K x K receptive field in the input.


Receptive Fields
Each successive convolution adds K - 1 to the receptive field size.
With L layers the receptive field size is 1 + L * (K - 1).

Be careful: "receptive field in the input" vs. "receptive field in the previous layer".

Problem: for large images we need many layers for each output to "see" the whole image.
Solution: downsample inside the network.

Slide inspiration: Justin Johnson
Fei-Fei Li, Ehsan Adeli (April 16, 2024)
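The growth formula above, as a one-line sketch (function name is ours):

```python
# Receptive field after L stacked KxK convolutions with stride 1: 1 + L*(K - 1).
def receptive_field(num_layers, k):
    return 1 + num_layers * (k - 1)

print(receptive_field(3, 3))  # 7: three 3x3 convs see a 7x7 region of the input
```

This also makes the "problem" concrete: with 3x3 filters, covering a 224-pixel-wide input needs on the order of a hundred layers unless the network downsamples.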


Solution: Strided Convolution

7x7 input (spatially)
assume 3x3 filter, applied with stride 2

=> 3x3 output!


Convolution layer: summary

Let's assume the input is W1 x H1 x C.

Conv layer needs 4 hyperparameters:
- Number of filters K
- The filter size F
- The stride S
- The zero padding P

This will produce an output of W2 x H2 x K, where:
- W2 = (W1 - F + 2P)/S + 1
- H2 = (H1 - F + 2P)/S + 1

Number of parameters: F^2 * C * K weights and K biases
Convolution layer: summary

Common settings:
- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
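The whole summary fits in one small function (a sketch; the name and return convention are ours):

```python
# Conv layer: output shape (W2, H2, K) and parameter count, per the summary formulas.
def conv_layer(w1, h1, c, k, f, s, p):
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = f * f * c * k + k  # F^2*C*K weights plus K biases
    return (w2, h2, k), params

print(conv_layer(32, 32, 3, 10, 5, 1, 2))  # ((32, 32, 10), 760)
```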
(btw, 1x1 convolution layers make perfect sense)

56x56x64 -> 1x1 CONV with 32 filters -> 56x56x32
(each filter has size 1x1x64, and performs a
64-dimensional dot product)
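A 1x1 convolution is just a per-pixel dot product across channels, i.e. a matrix multiply over the channel dimension. A minimal NumPy sketch (random data, only the shape matters):

```python
import numpy as np

x = np.random.randn(56, 56, 64)  # input volume
w = np.random.randn(64, 32)      # 32 filters, each of size 1x1x64
y = x @ w                        # 64-dim dot product at every spatial location
print(y.shape)                   # (56, 56, 32)
```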


The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter

1 number: the result of taking a dot product between the
filter and this part of the image
(i.e. a 5*5*3 = 75-dimensional dot product)


The brain/neuron view of CONV Layer

E.g. with 5 filters, the CONV layer consists of
neurons arranged in a 3D grid (28x28x5).

There will be 5 different neurons
all looking at the same region in the input volume.
Reminder: Fully Connected Layer

32x32x3 image -> stretch to 3072x1
Each neuron looks at the full input volume.

input (3072x1) x weights W (10x3072) -> activation (10x1)

1 number: the result of taking a dot product
between a row of W and the input
(a 3072-dimensional dot product)


Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently



MAX POOLING

Single depth slice (4x4):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4

- No learnable parameters
- Introduces spatial invariance
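The 2x2/stride-2 max pool on the slide's 4x4 slice can be reproduced with a reshape trick (a NumPy sketch, not from the slides):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# Split into 2x2 blocks, then take the max within each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]
```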
Pooling
maximum

0.77 -0.11  0.11  0.33  0.55 -0.11  0.33
-0.11  1.00 -0.11  0.33 -0.11  0.11 -0.11
0.11 -0.11  1.00 -0.33  0.11 -0.11  0.55
0.33  0.33 -0.33  0.55 -0.33  0.33  0.33
0.55 -0.11  0.11 -0.33  1.00 -0.11  0.11
-0.11  0.11 -0.11  0.33 -0.11  1.00 -0.11
0.33 -0.11  0.55  0.33  0.11 -0.11  0.77

max pooling =>

1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


Pooling layer: summary

Let's assume the input is W1 x H1 x C.

Pooling layer needs 2 hyperparameters:
- The spatial extent F
- The stride S

This will produce an output of W2 x H2 x C, where:
- W2 = (W1 - F)/S + 1
- H2 = (H1 - F)/S + 1

Number of parameters: 0
Rectified Linear Units (ReLUs)

Input:

0.77 -0.11  0.11  0.33  0.55 -0.11  0.33
-0.11  1.00 -0.11  0.33 -0.11  0.11 -0.11
0.11 -0.11  1.00 -0.33  0.11 -0.11  0.55
0.33  0.33 -0.33  0.55 -0.33  0.33  0.33
0.55 -0.11  0.11 -0.33  1.00 -0.11  0.11
-0.11  0.11 -0.11  0.33 -0.11  1.00 -0.11
0.33 -0.11  0.55  0.33  0.11 -0.11  0.77

After ReLU (every negative value becomes 0):

0.77 0    0.11 0.33 0.55 0    0.33
0    1.00 0    0.33 0    0.11 0
0.11 0    1.00 0    0.11 0    0.55
0.33 0.33 0    0.55 0    0.33 0.33
0.55 0    0.11 0    1.00 0    0.11
0    0.11 0    0.33 0    1.00 0
0.33 0    0.55 0.33 0.11 0    0.77

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html
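ReLU is an elementwise max with zero; a two-row slice of the grid above is enough to show it (NumPy sketch):

```python
import numpy as np

# ReLU: keep positives, clamp negatives to zero.
x = np.array([[0.77, -0.11, 0.11],
              [-0.11, 1.00, -0.11]])
y = np.maximum(0, x)
print(y)
# [[0.77 0.   0.11]
#  [0.   1.   0.  ]]
```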
Layers get stacked
The output of one layer becomes the input of the next:
Convolution -> ReLU -> Pooling

(figure: an input image of an X passes through convolution, ReLU,
and pooling, producing progressively smaller activation maps)

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


Deep stacking
Layers can be repeated several (or many) times:
Convolution -> ReLU -> Pooling -> Convolution -> ReLU -> Pooling -> ...

(figure: repeated conv/ReLU/pool stages reduce the image
to a small stack of 2x2 activation maps)

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


Fully connected layer

Every value gets a vote.

(figure: the final pooled values are flattened into a single
vector, and each value is connected to every output category)

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


Fully connected layer
A vote depends on how strongly a value predicts X or O.

(figure: each flattened value votes, with its own weight,
for the "X" and "O" categories)

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks


[ConvNetJS demo: training on CIFAR-10]

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html



Convolutions apply to any data laid out on a grid, not just photos:

Images: rows of pixels x columns of pixels
Sound: time steps x intensity in each frequency band
Text: position in sentence x words in dictionary

Brandon Rohrer, https://brohrer.github.io/how_convolutional_neural_networks_work.html


8x8 image, 3x3 filter, stride 1

Total connections?
6x6x1 output positions x 3x3x1 weights each = 324

Total unique parameters?
3x3x1 + bias = 10

Then: 2x2 pooling, stride 2
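The counts above follow from the output-size rule plus weight sharing (a quick arithmetic sketch):

```python
# 8x8 input, 3x3 filter, stride 1, no padding.
out = (8 - 3) // 1 + 1           # 6x6 output map
connections = out * out * 3 * 3  # every output unit connects to a 3x3 input patch
unique_params = 3 * 3 + 1        # one shared 3x3 filter plus a bias
print(connections, unique_params)  # 324 10
```

The gap between 324 connections and 10 unique parameters is exactly what weight sharing buys.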
Model memory and parameters, layer by layer (VGG16):
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
Fei-Fei Li & Andrej Karpathy & Justin Johnson
Components of CNNs
Convolution Layers, Pooling Layers, Fully-Connected Layers,
Activation Function
CNN Architectures

Case Studies
- AlexNet
- VGG
- ResNet

Also...
- GoogLeNet
- ZFNet
- SENet
- Wide ResNet
- ResNeXT
- DenseNet
- MobileNets
- NASNet
- EfficientNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners:

shallow (Lin et al; Sanchez & Perronnin)
-> 8 layers (Krizhevsky et al, AlexNet: first CNN-based winner)
-> 8 layers (Zeiler & Fergus, ZFNet)
-> 19 layers (Simonyan & Zisserman, VGG); 22 layers (Szegedy et al, GoogLeNet)
-> 152 layers (He et al, ResNet; Shao et al; Hu et al, SENet)
(challenge described by Russakovsky et al)
Case Study: AlexNet
[Krizhevsky et al. 2012]

Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
W' = (W - F + 2P) / S + 1

Q: what is the output volume size? Hint: (227-11)/4+1 = 55
=> Output volume: [55x55x96]

Q: What is the total number of parameters in this layer?
=> Parameters: (11*11*3 + 1)*96 = 35K
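Both CONV1 numbers check out with the standard formulas (a quick sketch):

```python
# AlexNet CONV1: 227x227 input, 11x11 filters, stride 4, no padding, 96 filters.
out = (227 - 11) // 4 + 1         # spatial output size
params = (11 * 11 * 3 + 1) * 96   # weights + one bias per filter
print(out, params)                # 55 34944  (~35K, as the slide says)
```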
Case Study: AlexNet
[Krizhevsky et al. 2012]

After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2
W' = (W - F + 2P) / S + 1

Q: what is the output volume size? Hint: (55-3)/2+1 = 27
=> Output volume: 27x27x96

Q: what is the number of parameters in this layer?
=> Parameters: 0!
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture: as above.

Details/Retrospectives:
- first use of ReLU
- used LRN layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners:
after the first CNN-based winner (8-layer AlexNet), the trend moved to
deeper networks: 19 layers (VGG), 22 layers (GoogLeNet), then 152 layers (ResNet).
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Small filters, deeper networks:
8 layers (AlexNet) -> 16-19 layers (VGG16/VGG19)

Only 3x3 CONV stride 1, pad 1,
and 2x2 MAX POOL stride 2

11.7% top-5 error in ILSVRC'13 (ZFNet)
-> 7.3% top-5 error in ILSVRC'14

(figure: AlexNet, VGG16, and VGG19 layer-by-layer diagrams, from input
through stacks of 3x3 conv layers and pools up to FC 4096, FC 1000, softmax)
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

A stack of three 3x3 conv (stride 1) layers has the same
effective receptive field as one 7x7 conv layer.

Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?
A: [7x7] (each successive conv adds 2: 3 -> 5 -> 7)

But the stack is deeper, with more non-linearities,
and has fewer parameters: 3 * (3^2 C^2) vs. 7^2 C^2 for C channels per layer.

(figure: AlexNet, VGG16, and VGG19 layer diagrams)
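The parameter comparison is easy to check for a concrete channel count (C = 64 is our illustrative choice, not from the slides; biases ignored):

```python
# Three stacked 3x3 convs vs one 7x7 conv, C input and C output channels per layer.
C = 64
stack_3x3 = 3 * (3 * 3 * C * C)  # 3 * 9C^2
single_7x7 = 7 * 7 * C * C       # 49C^2
print(stack_3x3, single_7x7)     # 110592 200704
```

Same 7x7 receptive field, roughly 45% fewer parameters, plus two extra non-linearities.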


VGG16, per-layer memory and parameters (not counting biases):

INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 96MB / image (for a forward pass)
TOTAL params: 138M parameters
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728 Note:
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728 Most memory is
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456 in early CONV
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0 Most params are
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296 in late FC
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for
bwd) TOTAL params: 138M parameters
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
Softmax
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728 FC 1000 fc8
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864 FC 4096 fc7
POOL2: [112x112x64] memory: 112*112*64=800K params: 0 FC 4096 fc6
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728 Pool

CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456 3x3 conv, 512 conv5-3
3x3 conv, 512 conv5-2
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
3x3 conv, 512 conv5-1
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912 Pool
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824 3x3 conv, 512 conv4-3
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824 3x3 conv, 512 conv4-2
POOL2: [28x28x256] memory: 28*28*256=200K params: 0 3x3 conv, 512
conv4-1

CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648 Pool

CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296 3x3 conv, 256 conv3-2
3x3 conv, 256 conv3-1
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
Pool
POOL2: [14x14x512] memory: 14*14*512=100K params: 0 3x3 conv, 128 conv2-2
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296 3x3 conv, 128 conv2-1
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296 Pool

CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296 3x3 conv, 64 conv1-2

POOL2: [7x7x512] memory: 7*7*512=25K params: 0 3x3 conv, 64 conv1-1

FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448 Input

FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216 VGG16


FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for Common names
bwd) TOTAL params: 138M parameters
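The per-layer counts above can be reproduced mechanically; a small sketch in plain Python (layer list transcribed from the table, biases ignored):

```python
def conv_params(k, c_in, c_out):
    # weights only, no biases, for one k x k conv layer
    return k * k * c_in * c_out

# (kernel, c_in, c_out) for every conv layer of VGG16, as in the table above
convs = [(3, 3, 64), (3, 64, 64), (3, 64, 128), (3, 128, 128),
         (3, 128, 256), (3, 256, 256), (3, 256, 256),
         (3, 256, 512), (3, 512, 512), (3, 512, 512),
         (3, 512, 512), (3, 512, 512), (3, 512, 512)]
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = (sum(conv_params(k, ci, co) for k, ci, co in convs)
         + sum(n_in * n_out for n_in, n_out in fcs))
assert conv_params(3, 3, 64) == 1_728        # first conv layer from the table
assert total // 10**6 == 138                  # the 138M total quoted above
```

Note how the three FC layers alone contribute over 120M of the 138M parameters.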
Details:
- ILSVRC’14: 2nd in classification, 1st in localization
- Similar training procedure to Krizhevsky 2012 (AlexNet)
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 is only slightly better, and uses more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks

[Figure: layer diagrams of AlexNet, VGG16, and VGG19 side by side, with layer names conv1 ... fc8]
AlexNet VGG16 VGG19
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

[Figure: winners by year — shallow models (Lin et al; Sanchez & Perronnin), 8 layers (Krizhevsky et al, AlexNet; Zeiler & Fergus), 19 layers (Simonyan & Zisserman, VGG), 22 layers (Szegedy et al, GoogLeNet), 152 layers (He et al, ResNet; Shao et al; Hu et al, SENet); Russakovsky et al]

Deeper Networks
Case Study: GoogLeNet
[Szegedy et al., 2014]

Deeper networks, with computational efficiency:
- ILSVRC’14 classification winner (6.7% top-5 error)
- 22 layers
- Only 5 million parameters!
  12x fewer than AlexNet
  27x fewer than VGG-16
- Efficient “Inception” module
- No FC layers

[Figure: GoogLeNet architecture built from Inception modules]
Case Study: GoogLeNet
[Szegedy et al., 2014]

“Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other.

[Figure: Inception module]
Case Study: GoogLeNet
[Szegedy et al., 2014]

Apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3 max pooling)
Concatenate all filter outputs together channel-wise.

[Figure: naive Inception module — 1x1 conv, 3x3 conv, 5x5 conv, and 3x3 max pooling branches feeding a filter concatenation]
Q: What is the problem with this? [Hint: computational complexity]
Case Study: GoogLeNet
[Szegedy et al., 2014]

Example: module input is 28x28x256, with branches [1x1 conv, 128], [3x3 conv, 192], [5x5 conv, 96], and 3x3 max pooling.

[Figure: naive Inception module with the example filter counts]
Q1: What are the output sizes of all the different filter operations?
A1: With padding chosen to preserve the 28x28 spatial size, the branch outputs are 28x28x128 (1x1 conv), 28x28x192 (3x3 conv), 28x28x96 (5x5 conv), and 28x28x256 (3x3 pooling).
Q2: What is the output size after filter concatenation?
A2: 28x28x(128+192+96+256) = 28x28x672
Case Study: GoogLeNet
[Szegedy et al., 2014]

Conv ops in the naive module:
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x256
[5x5 conv, 96]: 28x28x96x5x5x256
Total: 854M ops
This is very expensive to compute. The pooling branch also preserves feature depth, so the total depth after concatenation can only grow at every layer!
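The 854M figure follows directly from multiply counts (output positions × kernel size × input depth); a quick check in Python:

```python
def conv_ops(h, w, c_out, k, c_in):
    # multiplies for one conv layer: each of the h*w*c_out outputs
    # is a dot product over a k x k x c_in window
    return h * w * c_out * k * k * c_in

naive = (conv_ops(28, 28, 128, 1, 256)    # 1x1 conv, 128
         + conv_ops(28, 28, 192, 3, 256)  # 3x3 conv, 192
         + conv_ops(28, 28, 96, 5, 256))  # 5x5 conv, 96
assert naive == 854_196_224               # ~854M ops, as on the slide
```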
Case Study: GoogLeNet
[Szegedy et al., 2014]

Recall: the output size after filter concatenation is 28x28x(128+192+96+256) = 28x28x672.

Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature channel size.
Review: 1x1 convolutions

A 1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product, giving a 56x56x32 output.

Alternatively, interpret it as applying the same FC layer (64 -> 32) at each input pixel.

It preserves the spatial dimensions but reduces depth: it projects the depth to a lower dimension (a learned combination of the input feature maps).
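The “same FC layer at every pixel” view can be written out directly; a minimal pure-Python sketch (nested lists in place of real tensors):

```python
def conv1x1(x, w):
    # x: H x W x C_in feature map; w: C_out rows of C_in weights.
    # Every spatial position gets the same C_in -> C_out linear map.
    h, wd, c_in = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_row[c] * x[i][j][c] for c in range(c_in))
              for w_row in w]
             for j in range(wd)]
            for i in range(h)]

x = [[[1, 0, 0, 0], [0, 1, 0, 0]],
     [[0, 0, 1, 0], [0, 0, 0, 1]]]   # 2x2x4 input
w = [[1, 1, 1, 1], [2, 0, 0, 0]]     # 4 -> 2 channels
y = conv1x1(x, w)                    # 2x2x2: spatial size kept, depth reduced
assert y[0][0] == [1, 2]
```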
Case Study: GoogLeNet
[Szegedy et al., 2014]

[Figure: naive Inception module vs. Inception module with dimension reduction — 1x1 conv “bottleneck” layers are inserted before the 3x3 and 5x5 convolutions and after the 3x3 max pooling]
Case Study: GoogLeNet
[Szegedy et al., 2014]

Using the same parallel layers as the naive example, and adding [1x1 conv, 64] bottlenecks (module input: 28x28x256, output: 28x28x480):

Conv ops:
[1x1 conv, 64]: 28x28x64x1x1x256
[1x1 conv, 64]: 28x28x64x1x1x256
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x64
[5x5 conv, 96]: 28x28x96x5x5x64
[1x1 conv, 64]: 28x28x64x1x1x256
Total: 358M ops

Compared to 854M ops for the naive version. The bottleneck can also reduce depth after the pooling layer.
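Summing the multiply counts enumerated on this slide shows how much the 1x1 reductions save relative to the naive module:

```python
def conv_ops(h, w, c_out, k, c_in):
    # multiplies for one conv layer on an h x w x c_in input
    return h * w * c_out * k * k * c_in

naive = (conv_ops(28, 28, 128, 1, 256) + conv_ops(28, 28, 192, 3, 256)
         + conv_ops(28, 28, 96, 5, 256))

bottleneck = (conv_ops(28, 28, 64, 1, 256)     # 1x1, 64 before the 3x3
              + conv_ops(28, 28, 64, 1, 256)   # 1x1, 64 before the 5x5
              + conv_ops(28, 28, 128, 1, 256)  # 1x1, 128 branch
              + conv_ops(28, 28, 192, 3, 64)   # 3x3 over only 64 maps
              + conv_ops(28, 28, 96, 5, 64)    # 5x5 over only 64 maps
              + conv_ops(28, 28, 64, 1, 256))  # 1x1, 64 after the pool
assert bottleneck < naive / 2  # well under the naive module's 854M ops
```

The expensive 3x3 and 5x5 convolutions now operate over 64 input channels instead of 256, which is where almost all of the saving comes from.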
Case Study: GoogLeNet
[Szegedy et al., 2014]

Stack Inception modules with dimension reduction on top of each other.

[Figure: Inception module]
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet architecture:
- Stem network: Conv-Pool-2x Conv-Pool
- Stacked Inception modules
- Classifier output

Note: after the last convolutional layer, a global average pooling layer (AvgPool: HxWxC -> 1x1xC) spatially averages across each feature map before the final FC layer. No more multiple expensive FC layers!
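Global average pooling is just a per-channel spatial mean; a minimal sketch:

```python
def global_avg_pool(fmap):
    # fmap: H x W x C nested lists -> length-C vector (i.e. 1x1xC)
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(fmap[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)]

fmap = [[[1.0, 10.0], [2.0, 20.0]],
        [[3.0, 30.0], [4.0, 40.0]]]  # 2x2x2 feature map
pooled = global_avg_pool(fmap)        # one average per channel
assert pooled == [2.5, 25.0]
```

Because the pooling has no parameters, replacing FC layers with it is what lets the classifier head stay tiny compared to VGG's 100M+ FC parameters.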
Case Study: GoogLeNet
[Szegedy et al., 2014]

Auxiliary classification outputs (AvgPool-1x1Conv-FC-FC-Softmax) inject additional gradient at the lower layers.

22 total layers with weights (parallel layers count as 1 layer, so 2 layers per Inception module; auxiliary output layers are not counted).
Case Study: GoogLeNet
[Szegedy et al., 2014]

Deeper networks, with computational efficiency:
- 22 layers
- Efficient “Inception” module
- Avoids expensive FC layers
- 12x fewer params than AlexNet
- 27x fewer params than VGG-16
- ILSVRC’14 classification winner (6.7% top-5 error)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

[Figure: winners by year, as before — the 152-layer ResNet (He et al) wins in 2015]
Case Study: ResNet
[He et al., 2015]

Very deep networks using residual connections:
- 152-layer model for ImageNet
- ILSVRC’15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

[Figure: residual block — conv, relu, conv, with an identity shortcut summed as F(x) + x — and the full ResNet layer stack from input to softmax]
Case Study: ResNet
[He et al., 2015]

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

[Figure: training error and test error vs. iterations — the 56-layer plain network sits above the 20-layer one on both curves]

The 56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it’s not caused by overfitting!
Case Study: ResNet
[He et al., 2015]

Fact: deep models have more representational power (more parameters) than shallower models.
Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.

What should the deeper model learn in order to be at least as good as the shallower model?
A solution by construction: copy the learned layers from the shallower model and set the additional layers to the identity mapping.
Case Study: ResNet
[He et al., 2015]

Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

“Plain” layers try to learn H(x) directly. A residual block instead computes H(x) = F(x) + x: the stacked layers fit the residual F(x) = H(x) - x, which is added to the identity shortcut x. The identity mapping then becomes trivial to represent: H(x) = x if F(x) = 0.

[Figure: “plain” layers (conv, relu, conv) vs. residual block (conv, relu, conv plus identity shortcut, summed before the final relu)]
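The algebra of the residual block is simple enough to show with plain numbers; a toy sketch where `f` stands in for the stacked conv layers:

```python
def residual_block(x, f):
    # H(x) = F(x) + x: the layers learn the residual; the shortcut carries x.
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the residual F is zero, the block is exactly the identity mapping:
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
assert out == [1.0, 2.0, 3.0]
```

This is why depth stops hurting: extra residual blocks can default to the identity simply by driving F toward zero, instead of having to learn the identity from scratch.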
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension), reducing the activation volume by half
- Additional conv layer at the beginning (the stem: 7x7 conv, 64, stride 2)
- No FC layers at the end, besides the FC-1000 that outputs class scores; a global average pooling layer follows the last conv layer
- (In theory, you can train a ResNet with input images of variable size)

[Figure: full ResNet layer stack from input through 3x3 conv stages with filter counts doubling from 64 up to 512, then global average pooling, FC-1000, and softmax]
Case Study: ResNet
[He et al., 2015]

Total depths of 18, 34, 50, 101, or 152 layers for ImageNet.

[Figure: ResNet layer stack]
Case Study: ResNet
[He et al., 2015]

For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet). For a 28x28x256 block input:
- 1x1 conv, 64 filters projects down to 28x28x64
- BN, relu
- 3x3 conv, 64 operates over only 64 feature maps
- BN, relu
- 1x1 conv, 256 filters projects back to 256 feature maps (28x28x256 output)
Case Study: ResNet
[He et al., 2015]

Training ResNet in practice:
- Batch Normalization after every CONV layer
- Kaiming (Xavier/2) initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used
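The “divide by 10 when validation error plateaus” rule from the recipe above can be sketched as a tiny scheduler; the patience window here is an illustrative assumption, not from the slide:

```python
def next_lr(lr, val_errors, patience=3, factor=0.1):
    # Drop the learning rate when the last `patience` epochs show no
    # improvement over the best error seen before them.
    if len(val_errors) > patience and \
            min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return lr * factor
    return lr

assert next_lr(0.1, [0.5, 0.5, 0.5, 0.5]) == 0.1 * 0.1  # plateau -> decay
assert next_lr(0.1, [0.5, 0.4, 0.3, 0.2]) == 0.1        # still improving
```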
Comparing complexity...

[Figure: top-1 accuracy vs. operations, with marker size proportional to parameter count]
- Inception-v4: ResNet + Inception!
- VGG: most parameters, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, still memory heavy, lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy

An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
But research into CNN architectures is still flourishing.

Improving ResNets...
Identity Mappings in Deep Residual Networks [He et al. 2016]
- Improved ResNet block design from the creators of ResNet
- Reorders each unit as BN -> ReLU -> conv, creating a more direct path for propagating information throughout the network
- Gives better performance
Improving ResNets...
Wide Residual Networks [Zagoruyko et al. 2016]
- Argues that residuals are the important factor, not depth
- Uses wider residual blocks (F x k filters instead of F filters in each layer)
- A 50-layer wide ResNet outperforms the 152-layer original ResNet
- Increasing width instead of depth is more computationally efficient (parallelizable)

[Figure: basic residual block (3x3 conv, F) vs. wide residual block (3x3 conv, F x k)]
Improving ResNets...
Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) [Xie et al. 2016]
- Also from the creators of ResNet
- Increases the width of the residual block through multiple parallel pathways (“cardinality”, e.g. 32 paths)
- Parallel pathways are similar in spirit to the Inception module

[Figure: standard bottleneck block (1x1 conv, 64 -> 3x3 conv, 64 -> 1x1 conv, 256) vs. 32 parallel paths of (1x1 conv, 4 -> 3x3 conv, 4 -> 1x1 conv, 256) with summed 256-d output]
Other ideas...
Densely Connected Convolutional Networks (DenseNet) [Huang et al. 2017]
- Dense blocks where each layer is connected to every other layer in feedforward fashion
- Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
- Showed that a shallower 50-layer network can outperform a deeper 152-layer ResNet

[Figure: a dense block (each conv output concatenated into the inputs of all later convs) and the full network of dense blocks separated by conv/pool transition layers]
Efficient Networks...
MobileNets: Efficient Convolutional Neural Networks for Mobile Applications [Howard et al. 2017]
- Depthwise separable convolutions replace standard convolutions by factorizing them into a depthwise convolution and a 1x1 pointwise convolution
- Standard 3x3 conv compute: 9C²HW. Depthwise 3x3 conv (groups=C) plus pointwise 1x1 conv (C -> C): 9CHW + C²HW
- Much more efficient, with little loss in accuracy
- Follow-up MobileNetV2 work in 2018 (Sandler et al.)
- ShuffleNet: Zhang et al, CVPR 2018
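The compute counts above are easy to verify; for a 3x3 kernel on a C-channel HxW feature map:

```python
def standard_conv_ops(c, h, w, k=3):
    return k * k * c * c * h * w       # k^2 C^2 HW

def depthwise_separable_ops(c, h, w, k=3):
    depthwise = k * k * c * h * w      # k^2 C HW (one filter per channel)
    pointwise = c * c * h * w          # C^2 HW   (1x1 conv, C -> C)
    return depthwise + pointwise

C, H, W = 64, 28, 28
ratio = depthwise_separable_ops(C, H, W) / standard_conv_ops(C, H, W)
assert abs(ratio - (1 / 9 + 1 / C)) < 1e-9  # ~8x fewer multiplies here
```

The ratio (1/k² + 1/C) shows the saving grows with channel count: for large C it approaches the 1/9 floor set by the 3x3 kernel.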
Learning to search for network architectures...
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al. 2017]
- Applying neural architecture search (NAS) to a large dataset like ImageNet is expensive
- Design a search space of building blocks (“cells”) that can be flexibly stacked
- NASNet: use NAS to find the best cell structure on the smaller CIFAR-10 dataset, then transfer the architecture to ImageNet
- Many follow-up works in this space, e.g. AmoebaNet (Real et al. 2019) and ENAS (Pham, Guan et al. 2018)
But sometimes a smart heuristic is better than NAS...
EfficientNet: Smart Compound Scaling [Tan and Le. 2019]
- Increase network capacity by scaling width, depth, and resolution, while balancing accuracy and efficiency
- Search for an optimal set of compound scaling factors given a compute budget (target memory & flops)
- Scale up using smart heuristic rules
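Compound scaling ties the three factors together: Tan and Le constrain the depth, width, and resolution multipliers so that α·β²·γ² ≈ 2, meaning each step of the compound coefficient roughly doubles FLOPs. A quick check using the EfficientNet-B0 values reported in the paper (α=1.2, β=1.1, γ=1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution multipliers
# FLOPs scale roughly with depth * width^2 * resolution^2, so this
# product is constrained to ~2 per compound-coefficient step.
flops_factor = alpha * beta**2 * gamma**2
assert abs(flops_factor - 2.0) < 0.1
```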
Efficient networks...

https://openai.com/blog/ai-and-efficiency/
Transfer learning
Do you need a lot of data if you want to train/use CNNs?
Transfer Learning with CNNs

[Figure: first-layer AlexNet filters (64 x 3 x 11 x 11) and, for a test image, its L2 nearest neighbors in feature space]
(More on this in Lecture 13)
Transfer Learning with CNNs

1. Train on ImageNet: the full network, from the early Conv-64 layers up through FC-4096 and FC-1000.
2. For a small dataset with C classes: reinitialize the final FC-1000 layer as FC-C and train only it; freeze all the other layers.

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014
[Figure: benchmark results finetuned from AlexNet — Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014]
Transfer Learning with CNNs

3. For a bigger dataset: train more layers — finetune the last few conv layers along with the FC layers, and keep the earlier layers frozen. Use a lower learning rate when finetuning; 1/10 of the original LR is a good starting point.
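The recipe above (freeze early layers, retrain the head, lower the LR) can be sketched framework-agnostically; the layer names and the `finetune_plan` helper are illustrative, not from the slides:

```python
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]

def finetune_plan(layers, n_trainable, base_lr=0.01):
    # Freeze everything except the last n_trainable layers, and start
    # finetuning at 1/10 of the original learning rate.
    return {"frozen": layers[:-n_trainable],
            "trainable": layers[-n_trainable:],
            "lr": base_lr / 10}

small_data = finetune_plan(LAYERS, 1)  # retrain only the new classifier head
more_data = finetune_plan(LAYERS, 3)   # with more data, finetune more layers
assert small_data["trainable"] == ["fc8"]
assert more_data["trainable"] == ["fc6", "fc7", "fc8"]
```

In a real framework the "frozen" list corresponds to parameters excluded from the optimizer (or with gradients disabled), and only the reinitialized head plus any unfrozen layers receive updates.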
Feature hierarchy: early conv layers are more generic; later layers are more specific.

What to do, depending on your data:

                        very similar dataset          very different dataset
very little data        Use a linear classifier       You’re in trouble… try a linear
                        on the top layer              classifier from different stages
quite a lot of data     Finetune a few layers         Finetune a larger number of
                                                      layers, or start from scratch!
Main takeaways

AlexNet showed that you can use CNNs to train Computer Vision models.
VGG showed that bigger networks work better.
ResNet showed us how to train extremely deep networks
- Limited only by GPU & memory!
- Showed diminishing returns as networks got bigger
After ResNet: CNNs surpassed the human baseline on ImageNet, and focus shifted to other topics:
- Efficient networks: MobileNet, ShuffleNet
