cs231n 2018 Lecture09
Administrative
A2 due Wed May 2
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - 2 May 1, 2018
Last time: Deep learning frameworks
- PaddlePaddle (Baidu)
- Chainer
- Caffe (UC Berkeley) -> Caffe2 (Facebook)
- Torch -> PyTorch
- CNTK (Microsoft)
- MXNet (Amazon): developed by U Washington, CMU, MIT, Hong Kong U, etc. but main framework of choice at AWS
Today: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- NiN (Network in Network)
- Wide ResNet
- ResNeXT
- Stochastic Depth
- Squeeze-and-Excitation Network
- DenseNet
- FractalNet
- SqueezeNet
- NASNet
Review: LeNet-5
[LeCun et al., 1998]
Case Study: AlexNet
[Krizhevsky et al. 2012]
Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
MAX POOL3
FC6
FC7
FC8
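The spatial-size arithmetic behind this stack can be sketched in a few lines of Python, using the standard conv output formula (W - F + 2P)/S + 1. The filter sizes, strides, and paddings below are the usual AlexNet values and the lecture's 227x227 input convention; they are assumptions, since the list above names only the layers:

```python
# Sketch: spatial output sizes through AlexNet's conv/pool stack.
# Layer hyperparameters are the standard AlexNet ones (an assumption;
# the slide lists only the layer names).

def out_size(w, f, s, p=0):
    """Output width of a conv/pool: filter f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

w = 227                      # input image: 227x227x3
w = out_size(w, 11, 4)       # CONV1: 11x11 filters, stride 4 -> 55
assert w == 55
w = out_size(w, 3, 2)        # MAX POOL1: 3x3, stride 2 -> 27
w = out_size(w, 5, 1, p=2)   # CONV2: 5x5, pad 2 -> 27
w = out_size(w, 3, 2)        # MAX POOL2 -> 13
w = out_size(w, 3, 1, p=1)   # CONV3 -> 13
w = out_size(w, 3, 1, p=1)   # CONV4 -> 13
w = out_size(w, 3, 1, p=1)   # CONV5 -> 13
w = out_size(w, 3, 2)        # MAX POOL3 -> 6
print(w)                     # 6: POOL3 output feeds FC6
```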
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Bar chart of winning top-5 error by year: 19 layers (VGG), 22 layers (GoogLeNet), 152 layers (ResNet)]
ZFNet (2013): improved hyperparameters over AlexNet
ZFNet [Zeiler and Fergus, 2013]
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
8 layers (AlexNet) -> 16-19 layers (VGG16Net)
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Q: Why use smaller filters? A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer, but is deeper, has more non-linearities, and uses fewer parameters
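A quick check of the parameter arithmetic behind VGG's all-3x3 design (C = 256 channels is chosen only for illustration):

```python
# Parameter count (ignoring biases): three stacked 3x3 conv layers vs one
# 7x7 conv layer, each with C input and C output channels. Both options
# cover a 7x7 effective receptive field.

def conv_params(f, c_in, c_out):
    return f * f * c_in * c_out

C = 256
three_3x3 = 3 * conv_params(3, C, C)   # 3 * (9 C^2) = 27 C^2
one_7x7 = conv_params(7, C, C)         # 49 C^2
print(three_3x3, one_7x7)              # 1769472 3211264
```

So for the same receptive field, the 3x3 stack needs 27C^2 parameters vs 49C^2 for the single 7x7 layer.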
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters (VGG16)
Note: Most memory is in the early CONV layers; most params are in the late FC layers.
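The ~138M total can be re-derived directly from the per-layer counts in the table (biases not counted, matching the table's convention):

```python
# Re-derive the VGG16 parameter total from the per-layer counts above.

def conv_params(f, c_in, c_out):
    return f * f * c_in * c_out          # no bias term, as in the table

convs = [(3, 64), (64, 64),                         # conv block 1
         (64, 128), (128, 128),                     # conv block 2
         (128, 256), (256, 256), (256, 256),        # conv block 3
         (256, 512), (512, 512), (512, 512),        # conv block 4
         (512, 512), (512, 512), (512, 512)]        # conv block 5
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = sum(conv_params(3, ci, co) for ci, co in convs)
total += sum(i * o for i, o in fcs)
print(total)  # 138344128, i.e. ~138M parameters
```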
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Details:
- ILSVRC'14 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks
[Architecture diagrams: AlexNet, VGG16, VGG19]
Case Study: GoogLeNet
[Szegedy et al., 2014]
- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters! 12x less than AlexNet
- ILSVRC'14 classification winner (6.7% top 5 error)
[Inception module diagram]
Case Study: GoogLeNet
[Szegedy et al., 2014]
Naive Inception module: parallel filter operations on the input, concatenated together depth-wise
Q: What is the problem with this? [Hint: computational complexity]
Example: module input 28x28x256
Naive output: 28x28x(128+192+96+256) = 28x28x672
Reminder: 1x1 convolutions
1x1 CONV with 32 filters: input 56x56x64 -> output 56x56x32
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
This preserves spatial dimensions, but reduces depth!
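The per-position dot product is easy to see in a tiny pure-Python sketch; the 2x2x4 input and the two filters below are made-up illustration values, not anything from the lecture:

```python
# A 1x1 convolution is a dot product over the depth axis at every spatial
# position: HxWxC_in -> HxWxC_out. Tiny sketch: 2x2x4 input, 2 filters.

H, W, C_in = 2, 2, 4
x = [[[1.0] * C_in for _ in range(W)] for _ in range(H)]  # all-ones input
filters = [[0.5] * C_in, [1.0] * C_in]                    # two 1x1xC_in filters

out = [[[sum(x[i][j][c] * f[c] for c in range(C_in)) for f in filters]
        for j in range(W)] for i in range(H)]

# Spatial dims preserved (2x2), depth reduced 4 -> 2:
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
print(out[0][0])                              # [2.0, 4.0]
```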
Case Study: GoogLeNet
[Szegedy et al., 2014]
Solution: 1x1 conv "bottleneck" layers to reduce feature depth before the expensive convs
Case Study: GoogLeNet
[Szegedy et al., 2014]
Using same parallel layers as naive example, and adding "1x1 conv, 64 filter" bottlenecks:
Module input: 28x28x256; module output: 28x28x480
Conv Ops:
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96] 28x28x96x5x5x64
[1x1 conv, 64] 28x28x64x1x1x256
Total: 358M ops
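Each line above multiplies (output positions) x (num filters) x (filter area) x (input depth). A small helper makes the pattern explicit, and also shows why the bottleneck matters: the same 5x5 branch applied directly to the 256-deep input (as in the naive module) would cost roughly four times more. The helper name and the comparison are mine, only the formula comes from the table:

```python
# Multiply count for one conv layer, following the pattern in the table:
# (output H x W) x (num filters) x (filter H x W) x (input depth).

def conv_ops(out_hw, n_filters, f, in_depth):
    return out_hw * out_hw * n_filters * f * f * in_depth

# First bottleneck in the list: [1x1 conv, 64] on the 28x28x256 input
print(conv_ops(28, 64, 1, 256))   # 12845056 (~12.8M multiplies)

# 5x5 conv, 96 filters on the bottlenecked 28x28x64 input:
print(conv_ops(28, 96, 5, 64))    # 120422400 (~120M)
# ...vs the same 5x5 branch on the raw 28x28x256 input (naive module):
print(conv_ops(28, 96, 5, 256))   # 481689600 (~482M)
```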
Case Study: GoogLeNet
[Szegedy et al., 2014]
Full GoogLeNet architecture:
- Stem Network: Conv-Pool-2x Conv-Pool
- Stacked Inception Modules
- Classifier output (removed expensive FC layers!)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
"Revolution of Depth": 152 layers (ResNet) vs. 19 layers (VGG) and 22 layers (GoogLeNet)
Case Study: ResNet
[He et al., 2015]
Very deep networks using residual connections
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top 5 error)
- Swept all classification and detection competitions in ILSVRC'15 and COCO'15!
[Residual block diagram: output relu(F(x) + x), identity skip connection around the weight layers]
Case Study: ResNet
[He et al., 2015]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping
H(x) = F(x) + x
Use layers to fit the residual F(x) = H(x) - x instead of H(x) directly
["Plain" layers fit H(x) directly; the residual block computes F(x) + x with an identity skip connection]
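The residual mapping can be sketched in a few lines of pure Python; `F` here is a stand-in function for the block's weight layers (an illustration of the idea, not the paper's implementation):

```python
# Minimal sketch of a residual block: output = relu(F(x) + x), where F is
# the learned residual. If the optimal mapping is the identity, the layers
# only need to drive F toward zero.

def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(x, F):
    fx = F(x)                                    # learned residual F(x)
    return relu([a + b for a, b in zip(fx, x)])  # F(x) + x, then relu

x = [1.0, 2.0, 3.0]
zero_residual = lambda v: [0.0] * len(v)         # F ~ 0 => block ~ identity
print(residual_block(x, zero_residual))          # [1.0, 2.0, 3.0]
```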
Case Study: ResNet
[He et al., 2015]
- Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning
- No FC layers at the end (only FC 1000 to output classes)
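The stride-2 downsampling arithmetic: halving each spatial dimension while doubling the filter count keeps per-layer compute roughly constant (ops scale with H*W*C^2, and (H/2)(W/2)(2C)^2 = H*W*C^2). The 3x3/pad-1 conv shape and the 56 -> 28 -> 14 -> 7 stage sizes below are the standard ResNet values for 224x224 inputs, stated here as an assumption:

```python
# Stride-2 downsampling in ResNet: a 3x3 conv with stride 2, pad 1
# halves each spatial dimension (/2 in each dimension).

def out_size(w, f, s, p):
    return (w - f + 2 * p) // s + 1

print(out_size(56, 3, 2, 1))  # 28
print(out_size(28, 3, 2, 1))  # 14
print(out_size(14, 3, 2, 1))  # 7
```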
Case Study: ResNet
[He et al., 2015]
Experimental Results
- Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions
ILSVRC 2015 classification winner (3.6% top 5 error) -- better than "human performance"! (Russakovsky 2014)
Comparing complexity...
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
- Inception-v4: ResNet + Inception!
- VGG: Highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: Smaller compute, still memory heavy, lower accuracy
- ResNet: Moderate efficiency depending on model, highest accuracy
Forward pass time and power consumption
Other architectures to know...
Network in Network (NiN)
[Lin et al. 2014]
Improving ResNets...
Aggregated Residual Transformations for Deep
Neural Networks (ResNeXt)
[Xie et al. 2016]
Improving ResNets...
Deep Networks with Stochastic Depth
[Huang et al. 2016]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners: network ensembling
Improving ResNets...
“Good Practices for Deep Feature Fusion”
[Shao et al. 2016]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners: adaptive feature map reweighting
Improving ResNets...
Squeeze-and-Excitation Networks (SENet)
[Hu et al. 2017]
Beyond ResNets...
FractalNet: Ultra-Deep Neural Networks without Residuals
[Larsson et al. 2017]
Beyond ResNets...
Densely Connected Convolutional Networks
[Huang et al. 2017]
Efficient networks...
SqueezeNet: AlexNet-level Accuracy With 50x Fewer
Parameters and <0.5MB Model Size
[Iandola et al. 2017]
Meta-learning: Learning to learn network architectures...
Neural Architecture Search with Reinforcement Learning (NAS)
[Zoph et al. 2016]
Meta-learning: Learning to learn network architectures...
Learning Transferable Architectures for Scalable Image
Recognition
[Zoph et al. 2017]
- Applying neural architecture search (NAS) to a large dataset like ImageNet is expensive
- Design a search space of building blocks ("cells") that can be flexibly stacked
- NASNet: Use NAS to find best cell structure on smaller CIFAR-10 dataset, then transfer architecture to ImageNet
Summary: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- NiN (Network in Network)
- Wide ResNet
- ResNeXT
- Stochastic Depth
- Squeeze-and-Excitation Network
- DenseNet
- FractalNet
- SqueezeNet
- NASNet
Summary: CNN Architectures
- VGG, GoogLeNet, ResNet all in wide use, available in model zoos
- ResNet current best default, also consider SENet when available
- Trend towards extremely deep networks
- Significant research centers around design of layer / skip connections and improving gradient flow
- Efforts to investigate necessity of depth vs. width and residual connections
- Even more recent trend towards meta-learning