Convolutional Neural Networks

Wolfgang Ecker, Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Yezi Yang
Agenda

• Intro to Convolutional Neural Networks


• Convolutional Neural Networks – Layers and Structures
• Advancements in Convolutional Neural Networks

Intro to Convolutional Neural Networks

Convolutional Neural Networks – Layers and Structures

Fully Connected Layer
• How to process a small image with FC layers?

32x32x3 image -> stretch to 3072 x 1

input (3072 x 1) -> weights W (10 x 3072) -> activation (10 x 1)

Each activation is 1 number:
the result of taking a dot product
between a row of W and the input
(a 3072-dimensional dot product)

Problems using FC layers on images
• How to process a normal image with FC layers?

1000x1000x3 image (3 million inputs) -> 1000 neuron layer

3 billion weights in one layer!

Solution: weight sharing!

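The weight counts on the two fully connected slides can be checked with a few lines of Python. This is a minimal sketch; the helper name `fc_weight_count` is made up for illustration and biases are left out.

```python
def fc_weight_count(input_shape, num_neurons):
    """Number of weights (biases excluded) in a fully connected layer."""
    num_inputs = 1
    for dim in input_shape:
        num_inputs *= dim  # flatten height x width x depth into one vector
    return num_inputs * num_neurons

# 32x32x3 image stretched to 3072 inputs, 10 output neurons
small = fc_weight_count((32, 32, 3), 10)
# 1000x1000x3 image, 1000 neuron layer
large = fc_weight_count((1000, 1000, 3), 1000)

print(small, large)
```

The second number is the "3 billion weights" that motivates weight sharing.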
Why Convolution Layer?

We want to restrict the degrees of freedom

• We want to preserve the spatial structure
• Weight sharing -> using the same weights for different parts of the image

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 5 - April 18, 2017

9
What are convolutions?

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

[Figure: convolution of two boxcar functions and convolution of two Gaussians; f: red, g: blue, f * g: green]

• Application of a filter to a function
• The ‘smaller’ one is typically called the filter kernel

https://mathworld.wolfram.com/Convolution.html

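For sampled signals the integral becomes a sum. The sketch below is one possible discrete version, assuming finite lists as signals; `convolve_1d` is an illustrative name, not a library function. Convolving a boxcar with itself gives a triangle, as in the boxcar figure.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj  # f[i] * g[j] contributes to index n = i + j
    return out

boxcar = [1.0, 1.0, 1.0]
triangle = convolve_1d(boxcar, boxcar)
print(triangle)
```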
Convolutions on images

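Image filters

[Figure: input image and the result of each filter]

Edge detection:
$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$$

Box blur:
$$\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Sharpen:
$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$

Gaussian blur:
$$\frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

• Each kernel gives a different image filter

https://en.wikipedia.org/wiki/Kernel_(image_processing)
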
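A small sketch of how such a kernel is applied to a grayscale image stored as a list of rows. It assumes no padding and stride 1, and slides the kernel without flipping it (which makes no difference for these symmetric kernels). The name `filter_image` and the test image are illustrative only.

```python
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]

def filter_image(image, kernel):
    """Slide a square kernel over the image and take a dot product at each spot."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k)
                for j in range(k)
            )
    return out

flat = [[7] * 5 for _ in range(5)]      # image with no structure at all
sharpened = filter_image(flat, SHARPEN)  # weights sum to 1: flat areas unchanged
edges = filter_image(flat, EDGE)         # weights sum to 0: flat areas give 0
print(sharpened, edges)
```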
Convolution Layer
32x32x3 image -> preserve spatial structure

32 height x 32 width x 3 depth

Convolution Layer
32x32x3 image
5x5x3 filter

Filters always extend the full depth of the input volume

Convolve the filter with the image
i.e. “slide over the image spatially, computing dot products”

http://www.songho.ca/dsp/convolution/convolution.html#convolution_2d

Convolution Layer
32x32x3 image
5x5x3 filter

1 number:
the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)

Convolution Layer
32x32x3 image
5x5x3 filter

convolve (slide) over all spatial locations

-> activation map: 28x28x1

Convolution Layer
Consider a second, green filter

32x32x3 image
5x5x3 filter

convolve (slide) over all spatial locations

-> activation maps: one 28x28x1 map per filter

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

32x32x3 input -> Convolution Layer -> activation maps 28x28x6

We stack these up to get a “new image” of size 28x28x6!

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….

Preview

[Zeiler and Fergus 2013] Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].

one filter => one activation map

example 5x5 filters (32 total)

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image)

Figure copyright Andrej Karpathy.

A closer look at spatial dimensions:

32x32x3 image
5x5x3 filter

convolve (slide) over all spatial locations

-> activation map: 28x28x1

A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter

=> 5x5 output

A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter applied with stride 2
=> 3x3 output!

A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter applied with stride 3?

doesn’t fit!
cannot apply 3x3 filter on 7x7 input with stride 3.

Output size (N = input size, F = filter size):
(N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\

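The output size rule can be written as a small helper that also rejects strides that do not fit. A minimal sketch; `conv_output_size` is an illustrative name, and raising an error for the non-integer case is one possible design choice (some frameworks round down instead).

```python
def conv_output_size(n, f, stride):
    """Spatial output size for an n x n input and f x f filter, no padding."""
    if (n - f) % stride != 0:
        raise ValueError(
            f"{f}x{f} filter with stride {stride} does not fit a {n}x{n} input"
        )
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))
print(conv_output_size(7, 3, 2))
```

Calling `conv_output_size(7, 3, 3)` raises `ValueError`, matching the "doesn't fit" case.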
In practice: Common to zero pad the border

e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?

(recall: (N - F) / stride + 1, where the padded input has N = 7 + 2 = 9)

7x7 output!

in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

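With zero padding P on each side, the input effectively grows to N + 2P before the filter is applied. The sketch below assumes square inputs and odd filter sizes; `padded_output_size` and `same_padding` are illustrative names.

```python
def padded_output_size(n, f, stride=1, pad=0):
    """Output size with `pad` zeros added on every border: (N + 2P - F)/S + 1."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter, stride and padding do not fit the input")
    return span // stride + 1

def same_padding(f):
    """Padding (F - 1)/2 that preserves the spatial size at stride 1 (odd F only)."""
    if f % 2 == 0:
        raise ValueError("size-preserving padding needs an odd filter size")
    return (f - 1) // 2

print(padded_output_size(7, 3, stride=1, pad=1))
print([same_padding(f) for f in (3, 5, 7)])
```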
Remember back to…

E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t work well.

32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….

Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size:
(32+2*2-5)/1+1 = 32 spatially, so 32x32x10

Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias)
=> 76*10 = 760

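The parameter count from this example generalizes directly. A minimal sketch, assuming square filters that span the full input depth and one bias per filter; `conv_param_count` is an illustrative name.

```python
def conv_param_count(num_filters, f, in_depth, bias=True):
    """Learnable parameters of a CONV layer with f x f x in_depth filters."""
    per_filter = f * f * in_depth + (1 if bias else 0)
    return num_filters * per_filter

# 10 filters of size 5x5 on a 32x32x3 input volume
params = conv_param_count(num_filters=10, f=5, in_depth=3)
print(params)
```

Note that the count does not depend on the 32x32 spatial size of the input, which is the effect of weight sharing.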
Common settings:

K (number of filters) = powers of 2, e.g. 32, 64, 128, 512
F = filter size, S = stride, P = zero padding:
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0

(btw, 1x1 convolution layers make perfect sense)

56x56x64 -> 1x1 CONV with 32 filters -> 56x56x32
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

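A 1x1 convolution is a dot product over the depth axis, applied independently at every pixel. The sketch below uses a tiny 2x2x3 volume and 2 filters instead of 56x56x64 and 32 so the numbers stay readable; `conv1x1` and the sample values are illustrative, and biases are omitted.

```python
def conv1x1(volume, filters):
    """volume[row][col] is a depth vector; each filter is a weight vector of that depth."""
    return [
        [
            [sum(w * x for w, x in zip(filt, pixel)) for filt in filters]
            for pixel in row
        ]
        for row in volume
    ]

volume = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 0]],
]
filters = [
    [1, 0, 0],  # picks out the first channel
    [1, 1, 1],  # sums all channels
]
out = conv1x1(volume, filters)
print(out)
```

The spatial size (2x2) is unchanged while the depth goes from 3 to the number of filters.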
Dilated Convolution

http://www.icst.pku.edu.cn/struct/Projects/joint_rain_removal.html

The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter

1 number:
the result of taking a dot product between the filter and this part of the image
(i.e. 5*5*3 = 75-dimensional dot product)

It’s just a neuron with local connectivity...

The brain/neuron view of CONV Layer

An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”

The brain/neuron view of CONV Layer

E.g. with 5 filters, CONV layer consists of neurons arranged in a 3D grid (28x28x5)

There will be 5 different neurons all looking at the same region in the input volume

Two more layers to go: POOL/FC

56
Pooling layer

- makes the representations smaller and more manageable
- operates over each activation map independently

Max Pooling

Single depth slice (4x4):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4

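The max pooling example can be reproduced with a short function. A minimal sketch for one depth slice stored as a list of rows; `max_pool` is an illustrative name, and the slice values are the ones from the slide.

```python
def max_pool(x, f=2, stride=2):
    """Max pooling over one depth slice with an f x f window."""
    rows = (len(x) - f) // stride + 1
    cols = (len(x[0]) - f) // stride + 1
    return [
        [
            max(
                x[r * stride + i][c * stride + j]
                for i in range(f)
                for j in range(f)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

depth_slice = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(depth_slice, f=2, stride=2)
print(pooled)
```

In a full layer the same function would be applied to every activation map separately, so the depth is unchanged.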
Common settings (pooling; F = filter size, S = stride):

F = 2, S = 2
F = 3, S = 2

Translation Equivariance and Invariance

Translation Equivariance: Convolutional Layer
Translation Invariance: Pooling Layer

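Both properties can be illustrated on a 1D signal. This is a sketch under simplifying assumptions: the filter is slid without padding, and invariance is shown with a global max over the whole output (local pooling windows only give approximate invariance to small shifts). The names and signal values are made up for illustration.

```python
def correlate_1d(signal, kernel):
    """Slide the kernel over the signal (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0] + signal[:-1]          # same pattern moved one step to the right
kernel = [1, 2, 1]

out = correlate_1d(signal, kernel)
out_shifted = correlate_1d(shifted, kernel)

# Equivariance: shifting the input shifts the filter response by the same amount.
print(out, out_shifted)
# Invariance: the global maximum of the response does not change.
print(max(out), max(out_shifted))
```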
Fully Connected Layer (FC layer)

- Contains neurons that connect to the entire input volume, as in ordinary Neural
Networks

62
Reminder: Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

input (3072 x 1) -> weights W (10 x 3072) -> activation (10 x 1)

Each neuron looks at the full input volume

1 number:
the result of taking a dot product
between a row of W and the input
(a 3072-dimensional dot product)

Summary

- ConvNets stack CONV,POOL,FC layers
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
where N is usually up to ~5, M is large, 0 <= K <= 2.
- but recent advances such as ResNet/GoogLeNet challenge this paradigm

Classic Architectures

Review: LeNet-5
[LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1
Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-FC-FC]

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4

Q: what is the output volume size? Hint: (227-11)/4+1 = 55

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - May 2, 2017

2017-04-13 Copyright © Infineon Technologies AG 2018. All rights reserved.

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Output volume [55x55x96]

Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 34,848, i.e. about 35K (weights only; the 96 biases are not counted)

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96

Q: what is the number of parameters in this layer?
Parameters: 0!

Case Study: AlexNet
[Krizhevsky et al. 2012]

227x227x3
-> CONV 11x11, S=4 -> 55x55x96
-> MAX POOL 3x3, S=2 -> 27x27x96
-> CONV 5x5, S=1 -> 27x27x256
-> MAX POOL 3x3, S=2 -> 13x13x256
-> CONV 3x3, S=1 -> 13x13x384
-> CONV 3x3, S=1 -> 13x13x384
-> CONV 3x3, S=1 -> 13x13x256
-> MAX POOL 3x3, S=2 -> 6x6x256
-> FC 4096 -> FC 4096 -> FC 1000

• Softmax for 1000 classes

Case Study: AlexNet
[Krizhevsky et al. 2012]

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by a factor of 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%

Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:

[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Historical note: Trained on GTX 580 GPU with only 3 GB of memory. Network spread across 2 GPUs, half the neurons (feature maps) on each GPU (e.g. CONV1 output is [55x55x48] x 2).

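The spatial sizes in the AlexNet listing follow from repeatedly applying (N + 2P - F)/S + 1. The sketch below traces them; the layer table is transcribed from the listing, the NORM layers are left out because they do not change the shape, and `out_size` and `trace_shapes` are illustrative names.

```python
def out_size(n, f, stride, pad=0):
    """Spatial output size of a conv or pool layer."""
    return (n + 2 * pad - f) // stride + 1

# (name, filter size, stride, pad, number of filters or None for pooling)
LAYERS = [
    ("CONV1", 11, 4, 0, 96),
    ("POOL1", 3, 2, 0, None),
    ("CONV2", 5, 1, 2, 256),
    ("POOL2", 3, 2, 0, None),
    ("CONV3", 3, 1, 1, 384),
    ("CONV4", 3, 1, 1, 384),
    ("CONV5", 3, 1, 1, 256),
    ("POOL3", 3, 2, 0, None),
]

def trace_shapes(size, depth):
    """Return (name, height, width, depth) after every layer."""
    shapes = []
    for name, f, stride, pad, num_filters in LAYERS:
        size = out_size(size, f, stride, pad)
        if num_filters is not None:  # pooling keeps the depth unchanged
            depth = num_filters
        shapes.append((name, size, size, depth))
    return shapes

shapes = trace_shapes(227, 3)
for shape in shapes:
    print(shape)
```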
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture, connectivity across the two GPUs:

- CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU
- CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

- AlexNet: First CNN-based winner
- ZFNet: Improved hyperparameters over AlexNet

ZFNet [Zeiler and Fergus, 2013]

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 16.4% -> 11.7%

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Deeper Networks

Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Small filters, Deeper networks

8 layers (AlexNet)
-> 16 - 19 layers (VGG16Net)

Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2

11.7% top 5 error in ILSVRC’13 (ZFNet)
-> 7.3% top 5 error in ILSVRC’14

[Figure: layer stacks of AlexNet, VGG16 and VGG19 side by side]

Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer

Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?
[7x7]

But deeper, more non-linearities

And fewer parameters: $3 \cdot (3^2 C^2) = 27C^2$ vs. $7^2 C^2 = 49C^2$ for C channels per layer

[Figure: layer stacks of AlexNet, VGG16 and VGG19 side by side]

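Both claims about stacked 3x3 layers can be checked numerically. A minimal sketch, assuming stride 1, C input and C output channels per layer, and no biases; the function names and the choice C = 64 are illustrative.

```python
def stacked_receptive_field(num_layers, f=3):
    """Effective receptive field of num_layers stacked f x f convs at stride 1."""
    return 1 + num_layers * (f - 1)  # each layer adds (f - 1) pixels

def conv_weights(f, channels):
    """Weights of one f x f conv layer with `channels` inputs and outputs."""
    return f * f * channels * channels

C = 64
field = stacked_receptive_field(3, f=3)
stack_weights = 3 * conv_weights(3, C)   # 3 * (3^2 C^2)
single_weights = conv_weights(7, C)      # 7^2 C^2

print(field, stack_weights, single_weights)
```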
VGG-16

224x224x3
-> [3x3 conv,64] x2 -> 224x224x64
-> POOL -> 112x112x64
-> [3x3 conv,128] x2 -> 112x112x128
-> POOL -> 56x56x128
-> [3x3 conv,256] x2 -> 56x56x256
-> POOL -> 28x28x256
-> [3x3 conv,512] x3 -> 28x28x512
-> POOL -> 14x14x512
-> [3x3 conv,512] x3 -> 14x14x512
-> 4096 -> 4096 -> 1000

Case Study: VGGNet
[Simonyan and Zisserman, 2014]

• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: Width, Height decrease, Number of filters increases
• VGG-16: 16 layers that have weights, 138M parameters
• Note: Most memory is in early CONV, most params are in late FC
• Large but simple

Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Details:
- ILSVRC’14 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks

[Figure: layer stacks of AlexNet (conv1 to fc8), VGG16 and VGG19 (conv1-1 to fc8) with layer names]

Inception Layer

Inception layer
[Szegedy et al., 2014]

› Not sure of the filter size?
-> Use them all!

Naive Inception module:
Previous Layer -> {1x1 convolution | 3x3 convolution | 5x5 convolution | 3x3 max pooling} -> Filter concatenation

Apply parallel filter operations on the input from previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)

Concatenate all filter outputs together depth-wise

Q: What is the problem with this?
[Hint: Computational complexity]

Inception layer
[Szegedy et al., 2014]

Q: What is the problem with this? [Hint: Computational complexity]

Example: Naive Inception module
Module input: 28x28x256
Parallel operations: 1x1 conv, 128 | 3x3 conv, 192 | 5x5 conv, 96 | 3x3 pool

Q1: What is the output size of the 1x1 conv, with 128 filters?
28x28x128

Q2: What are the output sizes of all different filter operations?
1x1 conv, 128 -> 28x28x128
3x3 conv, 192 -> 28x28x192
5x5 conv, 96 -> 28x28x96
3x3 pool -> 28x28x256

Q3: What is output size after filter concatenation?
28x28x(128+192+96+256) = 28x28x672

Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops

Very expensive compute

Pooling layer also preserves feature depth, which means total depth after concatenation can only grow at every layer!

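The operation count for the naive module can be reproduced by multiplying output positions, number of filters, and filter volume for each branch. A minimal sketch; `conv_ops` is an illustrative name and one "op" is taken to be one multiply-add.

```python
def conv_ops(height, width, num_filters, f, in_depth):
    """Multiply-adds of a conv layer: one f*f*in_depth dot product per output value."""
    return height * width * num_filters * f * f * in_depth

IN_DEPTH = 256
# (number of filters, filter size) for the three conv branches
naive_branches = [(128, 1), (192, 3), (96, 5)]

naive_ops = sum(conv_ops(28, 28, k, f, IN_DEPTH) for k, f in naive_branches)
# The 3x3 pool branch keeps the full input depth of 256.
naive_depth = sum(k for k, _ in naive_branches) + IN_DEPTH

print(naive_ops, naive_depth)
```

The total is about 854 million multiply-adds, and the concatenated output is 672 channels deep.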
Inception layer
[Szegedy et al., 2014]

Q: What is the problem with this? [Hint: Computational complexity]

Output size after filter concatenation in the naive example:
28x28x(128+192+96+256) = 28x28x672 (about 527k values)

Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth

Reminder: 1x1 convolutions

56x56x64 -> 1x1 CONV with 32 filters -> 56x56x32
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

preserves spatial dimensions, reduces depth!
Projects depth to lower dimension (combination of feature maps)

Inception layer
[Szegedy et al., 2014]

Naive Inception module:
Previous Layer -> {1x1 convolution | 3x3 convolution | 5x5 convolution | 3x3 max pooling} -> Filter concatenation

Inception module with dimension reduction (1x1 conv “bottleneck” layers):
Previous Layer -> {1x1 convolution | 1x1 convolution -> 3x3 convolution | 1x1 convolution -> 5x5 convolution | 3x3 max pooling -> 1x1 convolution} -> Filter concatenation

Inception layer
[Szegedy et al., 2014]

Inception module with dimension reduction

Using same parallel layers as naive example, and adding “1x1 conv, 64 filter” bottlenecks:

Module input: 28x28x256

Conv Ops:
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96] 28x28x96x5x5x64
[1x1 conv, 64] 28x28x64x1x1x256
Total: 271M ops (sum of the six terms above)

Compared to 854M ops for naive version

Branch outputs: 28x28x128, 28x28x192, 28x28x96, 28x28x64
Output after filter concatenation: 28x28x480

Bottleneck can also reduce depth after pooling layer

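The operation counts on this slide can be re-derived with a short script. This is a sketch that simply multiplies out the products listed under "Conv Ops"; the helper name `conv_ops` and the tuple layout are hypothetical. Summing the six listed terms for the bottleneck version gives roughly 271M multiplies, against roughly 854M for the naive module.

```python
def conv_ops(height, width, c_out, kernel, c_in):
    """Multiplies for one conv layer: H x W x C_out x K x K x C_in."""
    return height * width * c_out * kernel * kernel * c_in


H = W = 28
C_IN = 256  # depth of the module input (28x28x256)

# (number of filters, kernel size, input depth) for every conv layer.
naive_layers = [(128, 1, C_IN), (192, 3, C_IN), (96, 5, C_IN)]
bottleneck_layers = [
    (64, 1, C_IN),   # 1x1 bottleneck before the 3x3 conv
    (64, 1, C_IN),   # 1x1 bottleneck before the 5x5 conv
    (128, 1, C_IN),  # plain 1x1 branch
    (192, 3, 64),    # 3x3 conv now sees only 64 feature maps
    (96, 5, 64),     # 5x5 conv now sees only 64 feature maps
    (64, 1, C_IN),   # 1x1 conv after the 3x3 pool
]

naive_ops = sum(conv_ops(H, W, *layer) for layer in naive_layers)
bottleneck_ops = sum(conv_ops(H, W, *layer) for layer in bottleneck_layers)

# Output depth after filter concatenation.
naive_depth = 128 + 192 + 96 + C_IN   # the pool branch keeps all 256 maps
bottleneck_depth = 128 + 192 + 96 + 64

print(f"naive:      {naive_ops / 1e6:.0f}M ops, depth {naive_depth}")
print(f"bottleneck: {bottleneck_ops / 1e6:.0f}M ops, depth {bottleneck_depth}")
```

Most of the saving comes from the 3x3 and 5x5 branches, whose cost scales with the input depth they see (64 instead of 256).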
Case Study: GoogLeNet
[Szegedy et al., 2014]

Deeper networks, with computational efficiency

- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters! 12x less than AlexNet
- ILSVRC’14 classification winner (6.7% top 5 error)

Case Study: GoogLeNet
[Szegedy et al., 2014]

“Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other

Stack Inception modules with dimension reduction on top of each other

Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet architecture

- Stem Network: Conv-Pool-2x Conv-Pool
- Stacked Inception Modules
- Classifier output (removed expensive FC layers!)
- Auxiliary classification outputs to inject additional gradient at lower layers (AvgPool-1x1Conv-FC-FC-Softmax)
- 22 total layers with weights (including each parallel layer in an Inception module)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Skip Connections

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 5 - April 18, 2017

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
“Revolution of Depth”

The problem of depth

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

[Figure: training error (left) and test error (right) versus iterations for a 20-layer and a 56-layer plain network; in both plots the 56-layer curve lies above the 20-layer curve.]

Q: What’s strange about these training and test curves?
[Hint: look at the order of the curves]

56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it’s not caused by overfitting!

The problem of depth

Hypothesis: the problem is an optimization problem, deeper models are harder to optimize

Vanishing and exploding gradients!

The deeper model should be able to perform at least as well as the shallower model.

A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.

Residual block
[He et al., 2015]

Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping

H(x) = F(x) + x
Use layers to fit residual F(x) = H(x) - x instead of H(x) directly

[Figure: “Plain” layers: X -> conv -> relu -> conv -> relu gives H(x). Residual block: X -> conv -> relu -> conv gives F(x); an identity shortcut carries X around the two conv layers; the sum F(x) + x is followed by relu.]

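A minimal sketch of the residual block idea, assuming vectors as plain Python lists. The names `residual_block` and `toy_residual` are hypothetical; `toy_residual` stands in for the two conv layers that compute F(x) and is not taken from the paper.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def residual_block(x, residual_fn):
    """Compute relu(F(x) + x): the layers only have to model the residual F."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])


def toy_residual(x):
    """Hypothetical stand-in for conv -> relu -> conv: a small correction to x."""
    return [0.1 * a for a in x]


x = [1.0, 2.0, 3.0]

# With a non-trivial F, the block outputs x plus a learned correction.
out = residual_block(x, toy_residual)

# With F(x) = 0 the block reduces to the identity (for non-negative x).
identity_out = residual_block(x, lambda v: [0.0] * len(v))

print(out)
print(identity_out)
```

The design point is that "do nothing" corresponds to F(x) = 0, which is reachable by driving weights towards zero, rather than to a specific non-trivial weight setting as in a plain stack.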
Why does residual block work?

$x^{L+1} = f(W^{L+1} \cdot x^{L} + b^{L+1} + x^{L-1})$

What happens if $W^{L+1}$ and $b^{L+1}$ are (approximately) zero?

$x^{L+1} = f(x^{L-1})$

-> We kept the same values and added a non-linearity

[Figure: residual block: conv, relu, conv, identity shortcut, relu.]

• The identity is easy for the residual block to learn
• So adding the block should not hurt performance, only improve it

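The zero-weight case can be checked numerically. The sketch below evaluates x^{L+1} = f(W^{L+1} · x^{L} + b^{L+1} + x^{L-1}) with f = ReLU and with W and b set to zero; the function names and the toy vectors are hypothetical, and a dense matrix is used in place of a convolution to keep the example short.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_layer(x_prev, x_mid, W, b):
    """x^{L+1} = relu(W . x^{L} + b + x^{L-1}); x_prev is the skip input."""
    z = matvec(W, x_mid)
    return relu([zi + bi + xi for zi, bi, xi in zip(z, b, x_prev)])


x_prev = [0.5, -1.0, 2.0]  # x^{L-1}, carried by the identity shortcut
x_mid = [3.0, 4.0, 5.0]    # x^{L}, output of the intermediate layer

W_zero = [[0.0] * 3 for _ in range(3)]
b_zero = [0.0] * 3

out = residual_layer(x_prev, x_mid, W_zero, b_zero)
print(out)  # same as relu(x_prev): the weighted path contributes nothing
```

If x^{L-1} is itself the output of a ReLU, all its entries are non-negative and the block returns it unchanged, which is the identity mapping the slide refers to.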
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension), e.g. from “3x3 conv, 64 filters” to “3x3 conv, 128 filters, /2 spatially with stride 2”
- Additional conv layer at the beginning
- No FC layers at the end (only FC 1000 to output classes)
- Global average pooling layer after last conv layer

Total depths of 34, 50, 101, or 152 layers for ImageNet

[Figure: residual block (X -> 3x3 conv -> relu -> 3x3 conv gives F(x); identity shortcut; F(x) + x -> relu) next to the full layer stack, bottom to top: Input; 7x7 conv, 64, /2; Pool; 3x3 conv, 64 (x6); 3x3 conv, 128, /2; 3x3 conv, 128 (x5); ...; 3x3 conv, 512, /2; 3x3 conv, 512 (x5); Pool; FC 1000; Softmax.]

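To make the "double the filters, halve the resolution" pattern concrete, the sketch below traces feature-map sizes through the stack. Several values are assumptions that do not appear on the slide: a 224x224 input, padding 3 for the 7x7 conv, a 3x3 stride-2 pool with padding 1, and four stages (64, 128, 256, 512 filters) with the 256 stage hidden behind the "..." in the figure. The helper `conv_out` uses the standard output-size formula.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a conv/pool layer: (N + 2P - K) // S + 1."""
    return (size + 2 * pad - kernel) // stride + 1


size = 224                               # assumed input resolution
size = conv_out(size, kernel=7, stride=2, pad=3)  # 7x7 conv, 64, /2
size = conv_out(size, kernel=3, stride=2, pad=1)  # Pool (assumed 3x3, /2)

filters = 64
stages = []
for stage in range(4):                   # assumed number of stages
    if stage > 0:
        filters *= 2                     # double # of filters ...
        size = conv_out(size, kernel=3, stride=2, pad=1)  # ... and downsample /2
    stages.append((size, filters))

for size_i, filters_i in stages:
    print(f"{size_i}x{size_i}x{filters_i}")
```

Halving the resolution while doubling the depth keeps the per-layer cost of the 3x3 convolutions roughly constant from stage to stage.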
Case Study: ResNet
[He et al., 2015]

For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet)

28x28x256 input
-> 1x1 conv, 64: 64 filters project to 28x28x64
-> 3x3 conv, 64: operates over only 64 feature maps
-> 1x1 conv, 256: 256 filters project back to 256 feature maps (28x28x256)
28x28x256 output

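A rough weight count shows why the bottleneck helps. The comparison baseline (two 3x3 conv layers operating directly on 256 feature maps) is an assumption chosen for illustration and is not stated on the slide; biases are ignored and the helper name `conv_weights` is hypothetical.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a conv layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


# Bottleneck block from the slide: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256).
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Assumed baseline: a basic block with two 3x3 convs on all 256 feature maps.
basic = 2 * conv_weights(3, 256, 256)

# Multiplies on a 28x28 feature map scale with the same factor.
positions = 28 * 28
print(f"bottleneck: {bottleneck} weights, {bottleneck * positions / 1e6:.1f}M multiplies")
print(f"basic:      {basic} weights, {basic * positions / 1e6:.1f}M multiplies")
print(f"ratio:      {basic / bottleneck:.1f}x")
```

Under these assumptions the bottleneck block needs about 17x fewer weights while keeping a 28x28x256 output.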
Case Study: ResNet
[He et al., 2015]

Very deep networks using residual connections

- 152-layer model for ImageNet
- ILSVRC’15 classification winner (3.57% top 5 error)
- Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

Case Study: ResNet
[He et al., 2015]

Training ResNet in practice:

- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used

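The "SGD + Momentum" and "divide the learning rate by 10 on plateau" items can be sketched as two small functions. This is an illustrative sketch, not the authors' training code: the plateau test (no improvement of the best validation error over the last `patience` epochs) and the `patience` parameter are assumptions, and both function names are hypothetical.

```python
def step_on_plateau(lr, val_errors, patience=2, factor=0.1):
    """Return lr * factor if the validation error has plateaued, else lr.

    A plateau is assumed when the best error of the last `patience` epochs
    is no better than the best error seen before them.
    """
    if len(val_errors) <= patience:
        return lr
    best_before = min(val_errors[:-patience])
    recent_best = min(val_errors[-patience:])
    return lr * factor if recent_best >= best_before else lr


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9):
    """One SGD + momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


lr = 0.1
still_improving = step_on_plateau(lr, [0.50, 0.40, 0.30])     # error keeps falling
plateaued = step_on_plateau(lr, [0.50, 0.30, 0.30, 0.31])     # no recent gain

w, v = sgd_momentum_step(weight=1.0, grad=2.0, velocity=0.0, lr=lr)
print(still_improving, plateaued, w, v)
```

In a real training loop the same decision would be made once per epoch on the held-out validation error, and the update would be applied to every parameter tensor.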
Case Study: ResNet
[He et al., 2015]

Experimental Results
- Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions

ILSVRC 2015 classification winner (3.6% top 5 error) -- better than “human performance”! (Russakovsky 2014)

Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

- Inception-v4: ResNet + Inception!
- VGG: Highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: Smaller compute, still memory heavy, lower accuracy
- ResNet: Moderate efficiency depending on model, highest accuracy

Forward pass time and power consumption

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Other architectures to know...

Improving ResNets...
Identity Mappings in Deep Residual Networks
[He et al. 2016]

- Improved ResNet block design from creators of ResNet
- Creates a more direct path for propagating information throughout network (moves activation to residual mapping pathway)
- Gives better performance

[Figure: improved residual block; the residual pathway is BN -> ReLU -> conv -> BN -> ReLU -> conv.]

Beyond ResNets...
Densely Connected Convolutional Networks
[Huang et al. 2017]

- Dense blocks where each layer is connected to every other layer in feedforward fashion
- Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse

[Figure: a dense block (Input -> Conv -> Concat -> Conv -> Concat -> 1x1 conv, 64) and the full network (Input -> Conv -> Dense Block 1 -> Conv -> Pool -> Dense Block 2 -> Conv -> Pool -> Dense Block 3 -> Pool -> FC -> Softmax).]

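The dense connectivity pattern can be illustrated with channel lists. In this sketch every layer receives the concatenation of the block input and all earlier layer outputs; the function name `dense_block`, the parameter `growth_rate` (channels added per layer) and the averaging "layer" are hypothetical placeholders for real convolutions.

```python
def dense_block(x, num_layers, growth_rate):
    """Run a toy dense block over a list of channel values.

    Returns the final concatenated features and the number of input
    channels each layer saw.
    """
    features = list(x)
    input_widths = []
    for _ in range(num_layers):
        input_widths.append(len(features))
        # Placeholder for a conv layer: it reads ALL features so far
        # and emits `growth_rate` new channels.
        mean = sum(features) / len(features)
        new_channels = [mean] * growth_rate
        # Concat: earlier features are kept and passed on unchanged.
        features = features + new_channels
    return features, input_widths


features, input_widths = dense_block([1.0, 1.0, 1.0, 1.0], num_layers=3, growth_rate=2)
print(input_widths)   # input width grows by growth_rate per layer
print(len(features))  # block input channels + num_layers * growth_rate
```

Because earlier outputs are concatenated rather than summed, every layer has a direct path to the block input, which is the feature-reuse property listed above.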
Efficient networks...
SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5MB Model Size
[Iandola et al. 2017]

- Fire modules consisting of a ‘squeeze’ layer with 1x1 filters feeding an ‘expand’ layer with 1x1 and 3x3 filters
- AlexNet level accuracy on ImageNet with 50x fewer parameters
- Can compress to 510x smaller than AlexNet (0.5MB)

Figure copyright Iandola, Han, Moskewicz, Ashraf, Dally, Keutzer, 2017. Reproduced with permission.

SqueezeNet

[Figure: Fire module. Input (64 channels) -> 1x1 Conv Squeeze (16) -> 1x1 Conv Expand (64) and 3x3 Conv Expand (64) -> Concat/Eltwise -> Output (128 channels).]

Iandola et al, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, arXiv 2016

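Using the channel counts from the Fire module figure (64 in, squeeze to 16, expand to 64 + 64), a quick weight count shows where the savings come from. The baseline, a single 3x3 conv from 64 to 128 channels, is an assumption chosen for comparison; biases are ignored and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a conv layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


squeeze = conv_weights(1, 64, 16)    # 1x1 Conv Squeeze: 64 -> 16
expand_1x1 = conv_weights(1, 16, 64)  # 1x1 Conv Expand: 16 -> 64
expand_3x3 = conv_weights(3, 16, 64)  # 3x3 Conv Expand: 16 -> 64

fire_total = squeeze + expand_1x1 + expand_3x3
fire_out_channels = 64 + 64           # the two expand outputs are concatenated

# Assumed baseline: one plain 3x3 conv with the same input/output depth.
plain_3x3 = conv_weights(3, 64, 128)

print(fire_total, fire_out_channels, plain_3x3)
```

The squeeze layer limits how many channels the expensive 3x3 filters see, and half of the expand filters are cheap 1x1 filters.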
Xception Module

Depthwise separable 3x3 convolution from 16 to 32 channels:
656 (16x3x3 + 16x32x1x1) parameters, as opposed to 4608 (16x32x3x3) for a standard 3x3 convolution

https://arxiv.org/abs/1610.02357
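The two parameter counts on this slide follow from simple formulas, sketched below. The function names are hypothetical and biases are ignored, as on the slide.

```python
def standard_conv_params(c_in, c_out, kernel):
    """Every output channel has its own K x K filter over all input channels."""
    return c_in * c_out * kernel * kernel


def separable_conv_params(c_in, c_out, kernel):
    """One K x K filter per input channel, then a 1x1 conv to mix channels."""
    depthwise = c_in * kernel * kernel
    pointwise = c_in * c_out * 1 * 1
    return depthwise + pointwise


separable = separable_conv_params(16, 32, 3)   # 16x3x3 + 16x32x1x1
standard = standard_conv_params(16, 32, 3)     # 16x32x3x3

print(separable, standard, round(standard / separable, 2))
```

The saving grows with the number of output channels, since the K x K spatial filtering is paid once per input channel instead of once per input-output pair.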
Temporal Convolutional Networks (TCN)

- CNN architectures are useful not only for learning on images but also for sequential tasks.
- For a sequential task, the input-output mapping conserves the size, i.e., the output sequence has the same length as the input sequence.
- Sequential data needs to conserve causality, i.e., the future does not affect the past; there is no leakage from future to past.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Temporal Convolutional Networks (TCN)

How to keep the output size equal to the input size?
Use a 1-D fully convolutional network (FCN)!

How to achieve causality?
Use causal convolution: the output at time t is convolved only with elements from time t and earlier in the previous layer!

TCN = 1D FCN + Causal convolutions

Exploit dilated convolutions and residual connections to cover a longer history and stabilize gradients over a deep network.

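A causal, dilated 1-D convolution can be sketched in plain Python as follows. This is a minimal illustration, not the reference TCN implementation: the function name `causal_conv1d`, the zero treatment of missing past samples (equivalent to left zero-padding) and the toy kernel are assumptions.

```python
def causal_conv1d(x, kernel, dilation=1):
    """y[t] = sum_i kernel[i] * x[t - i * dilation].

    Only the current and earlier samples are used; samples before the
    start of the sequence count as zero, so len(y) == len(x).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        y.append(acc)
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.5]

# With dilation 2, y[t] averages x[t] and x[t - 2].
y = causal_conv1d(x, kernel, dilation=2)

# Causality check: changing the last input must not change earlier outputs.
x_future_changed = x[:-1] + [100.0]
y_changed = causal_conv1d(x_future_changed, kernel, dilation=2)

print(y)
print(y_changed[:-1] == y[:-1])
```

Stacking such layers with growing dilation lets the receptive field cover a long history with few layers, while the output keeps the input length.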
Temporal Convolutional Networks (TCN)

[Figure: two panels, “Dilated convolution” and “Residual connection”.]

Summary: CNN Architectures

Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet

Also....
- Temporal Convolutional Networks
- DenseNet
- SqueezeNet
- Xception Module

Sources
• http://cs231n.stanford.edu/2017/syllabus.html
• https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6509978

We highly recommend reading these papers!

• AlexNet [Krizhevsky et al. 2012]
• ZFNet [Zeiler and Fergus, 2013]
• VGGNet [Simonyan & Zisserman 2014]
• ResNet [He et al. 2015]
• GoogLeNet [Szegedy et al. 2014]
• Xception [Chollet 2016]