CS60010: Deep Learning CNN - Part 3: Sudeshna Sarkar

This document summarizes several classic convolutional neural network architectures for image classification: 1) LeNet-5 was one of the earliest CNNs and had a [CONV-POOL-CONV-POOL-CONV-FC] structure. 2) AlexNet significantly outperformed other methods in the 2012 ImageNet challenge using GPUs and large datasets. Its architecture included 5 CONV layers, some max pooling layers, and 3 fully connected layers. 3) VGGNet placed second in the 2014 ImageNet challenge using a deeper network of 16-19 layers composed solely of stacked 3x3 convolution filters to achieve large receptive fields.

Uploaded by

DEEP ROY


CS60010: Deep Learning

CNN – Part 3

Sudeshna Sarkar
Spring 2019

7 Feb 2019
CNN on Text
CNN in text classification

Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Objectives
• We will examine classic CNN architectures with the goal of:
- Gaining intuition for building CNNs
- Reusing CNN architectures
Case Study: LeNet-5 [LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1

Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

21 Jan 2015
The ILSVRC-2012 competition on ImageNet

• The dataset has 1.2 million high-resolution training images.
• The classification task:
  – Get the “correct” class in your top 5 bets. There are 1000 classes.
• The localization task:
  – For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, …
• Computer vision systems use complicated multi-stage systems.
• The early stages are typically hand-tuned by optimizing a few parameters.
[Krizhevsky et al. 2012]
Case Study: AlexNet

AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and
significantly outperformed the runner-up (top 5 error of 16% compared to
the runner-up's 26%).
Facilitated by GPUs, a highly optimized convolution implementation and large
datasets (ImageNet).
Has 60 million parameters.

ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya
Sutskever, Geoffrey E. Hinton; 2012
AlexNet
Architecture – 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.

Layers: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8

• Input: 227x227x3 images (224x224 before padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
  (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
• Number of parameters in this layer?
  (11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
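The two calculations on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the padding argument is an addition (the slide's formula has P = 0), and biases are ignored as in the slide's 35K figure.

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1


def conv_weight_count(f, in_channels, num_filters):
    """Number of weights in a conv layer, ignoring biases."""
    return f * f * in_channels * num_filters


# AlexNet CONV1: 227x227x3 input, 96 filters of 11x11, stride 4, no padding.
out = conv_output_size(227, 11, 4)
weights = conv_weight_count(11, 3, 96)
print(out, weights)  # 55 34848
```

The exact weight count is 34,848, which the slide rounds to 35K.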
AlexNet

227×227×3
-> conv 11×11, s=4, P=0 -> 55×55×96
-> max pool 3×3, s=2 -> 27×27×96
-> conv 5×5, s=1, P=2 -> 27×27×256
-> max pool 3×3, s=2 -> 13×13×256
-> conv 3×3, s=1, P=1 -> 13×13×384
-> conv 3×3, s=1, P=1 -> 13×13×384
-> conv 3×3, s=1, P=1 -> 13×13×256
-> max pool 3×3, s=2 -> 6×6×256

This slide is taken from Andrew Ng [Krizhevsky et al., 2012]
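As a sanity check on the volume sizes in this diagram, the sketch below pushes a 227×227×3 input through the same sequence of filter sizes, strides and paddings. The layer table is transcribed from the slide; the names `LAYERS` and `trace` are illustrative only.

```python
def out_size(n, f, stride, pad=0):
    """Output width/height for a filter of size f on an n x n input."""
    return (n - f + 2 * pad) // stride + 1


# (name, filter size, stride, padding, output channels); None keeps the channel count.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]


def trace(size=227, channels=3):
    """Return (name, spatial size, channels) after every layer."""
    shapes = []
    for name, f, s, p, out_ch in LAYERS:
        size = out_size(size, f, s, p)
        channels = out_ch if out_ch is not None else channels
        shapes.append((name, size, channels))
    return shapes


for name, size, channels in trace():
    print(f"{name}: {size}x{size}x{channels}")
```

The last line printed is `pool3: 6x6x256`, the volume that feeds the fully connected layers.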

AlexNet

... -> FC 4096 -> FC 4096 -> Softmax 1000

This slide is taken from Andrew Ng [Krizhevsky et al., 2012]

AlexNet
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble

• Trained on GTX 580 GPU with only 3 GB of memory.
• Network spread across 2 GPUs, half the neurons (feature maps) on each GPU.
• CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs.

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
A neural network for ImageNet

• Architecture:
  – 7 hidden layers not counting some max pooling layers.
  – The early layers were convolutional.
  – The last two layers were globally connected.
• The activation functions were:
  – Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
  – Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Error rates on the ILSVRC-2012 competition

Team                                       classification   classification & localization
University of Toronto (Alex Krizhevsky)    16.4%            34.1%
University of Tokyo                        26.1%            53.6%
Oxford University Computer Vision Group    26.9%            50.0%
INRIA + XRCE                               27.0%            -
University of Amsterdam                    29.5%            -

INRIA: French national research institute in CS. XRCE: Xerox Research Center Europe.
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet: ILSVRC 2014 2nd place
• Sequence of deeper networks
trained progressively
• Large receptive fields replaced by
successive layers of 3x3
convolutions (with ReLU in
between)

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition, ICLR 2015
VGGNet
• Smaller filters
  Only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2
• Deeper network
  AlexNet: 8 layers
  VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14

Layer stack (input to output):
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
• What is the effective receptive field of three 3x3 conv (stride 1) layers?
  7x7: a stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.
  But deeper, more non-linearities
  And fewer parameters: $3 \cdot (3^2 C^2)$ vs. $7^2 C^2$ for C channels per layer

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
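Both claims (the 7x7 receptive field and the parameter saving) follow from simple arithmetic, sketched below. The channel count C = 256 is an arbitrary illustrative choice, not a value from the slides, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convolutions."""
    field = 1
    for _ in range(num_layers):
        field += kernel - 1  # each stride-1 layer widens the field by kernel - 1
    return field


def stacked_weights(num_layers, kernel, channels):
    """Weights in num_layers conv layers with `channels` inputs and outputs each."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # illustrative channel count (assumption)
print(stacked_receptive_field(3))  # 7
print(stacked_weights(3, 3, C))    # 27 * C^2 = 1769472
print(stacked_weights(1, 7, C))    # 49 * C^2 = 3211264
```

For any C, the three-layer stack needs 27/49 of the weights of the single 7x7 layer.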
Receptive Field
[Figure: three successive conv layers]

GoogLeNet: ILSVRC 2014 winner
• The Inception Module
• Inception Module dramatically reduced the number of
parameters in the network
(4M, compared to AlexNet with 60M).
• Uses Average Pooling instead of Fully Connected layers at
the top of the ConvNet
• Several followup versions to the GoogLeNet, most
recently Inception-v4.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

GoogLeNet

• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)

[Szegedy et al., 2014]

GoogLeNet
• The Inception Module
• design a good local network topology (network within a network) and
then stack these modules on top of each other
• Parallel paths with different receptive field sizes and operations are
meant to capture sparse patterns of correlations in the stack of
feature maps

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

GoogLeNet
• The Inception Module
• Parallel paths with different receptive field sizes and operations are meant to capture
sparse patterns of correlations in the stack of feature maps
• Use 1x1 convolutions for dimensionality reduction before expensive convolutions

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

Case Study: GoogLeNet [Szegedy et al., 2014]

Inception module

ILSVRC 2014 winner (6.7% top 5 error)

GoogLeNet

Inception module

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

Case Study: GoogLeNet [Szegedy et al., 2014]

1x1 dimension reduction layers

(reduce compute bottlenecks)

Inception module

Case Study: GoogLeNet [Szegedy et al., 2014]

Helper loss (during training only)

Inception module

GoogLeNet

Auxiliary classifier

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

Case Study: GoogLeNet

Fun features:
- Only 5 million params! (Removes FC layers completely)

Compared to AlexNet:
- 12x fewer params
- 2x more compute
- 6.67% top 5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing importance of
auxiliary classifiers
• More variants of inception modules with aggressive factorization of
filters
• Increase the number of feature maps while decreasing spatial resolution
(pooling)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and exploding
gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.

[He et al., 2015]

ResNet

• ILSVRC’15 classification winner (3.57% top 5 error, humans generally hover around a 5-10% error rate)
• Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet

• What happens when we continue stacking deeper layers on a convolutional neural network?
• The 56-layer model performs worse than the 20-layer model on both training and test error
-> The deeper model performs worse (not caused by overfitting)!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.

• We will use skip connections allowing us to take the activation from one layer and feed it into another layer, much deeper into the network.
• Use layers to fit residual F(x) = H(x) – x instead of H(x) directly

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.

[He et al., 2015]

ResNet
Short cut / skip connection

a[l] -> Linear -> ReLU -> a[l+1] -> Linear -> ReLU -> a[l+2]

$$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}, \qquad a^{[l+1]} = g(z^{[l+1]})$$
$$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$$

Without the skip connection:
$$a^{[l+2]} = g(z^{[l+2]})$$

With the skip connection:
$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

[He et al., 2015]
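A minimal sketch of the equations above, using small fully connected layers in plain Python rather than convolutions (an assumption made to keep the example short). It illustrates why the identity mapping is easy to represent: with all weights and biases at zero, the block returns its input unchanged, provided the input is non-negative as it would be after a ReLU.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, a, b):
    """Compute W a + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(W2 a[l+1] + b2 + a[l]) with a[l+1] = g(W1 a[l] + b1)."""
    a_l1 = relu(linear(W1, a_l, b1))
    z_l2 = linear(W2, a_l1, b2)
    return relu([z + skip for z, skip in zip(z_l2, a_l)])


n = 3
zero_W = [[0.0] * n for _ in range(n)]
zero_b = [0.0] * n
a = [0.5, 2.0, 1.0]  # non-negative, as after a previous ReLU
print(residual_block(a, zero_W, zero_b, zero_W, zero_b))  # [0.5, 2.0, 1.0]
```

A plain two-layer stack with the same zero weights would output all zeros, so it would have to learn the identity explicitly.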
ResNet
• The residual module
• Introduce skip or shortcut connections (existing before in various forms in
literature)
• Make it easy for network layers to represent the identity mapping

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image
Recognition, CVPR 2016 (Best Paper)
Case Study: ResNet [He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used
ResNet
Deeper residual module (bottleneck)
• Directly performing 3x3 convolutions with 256 feature maps at input and output:
  256 x 256 x 3 x 3 ~ 600K operations
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning
for Image Recognition, CVPR 2016 (Best Paper)
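The operation counts on this slide can be reproduced directly. The sketch below counts multiplications per output position as in_maps × out_maps × k × k, which is the quantity the slide tabulates; the helper name `conv_cost` is illustrative.

```python
def conv_cost(in_maps, out_maps, k):
    """Multiplications per output position for a k x k convolution."""
    return in_maps * out_maps * k * k


direct = conv_cost(256, 256, 3)
bottleneck = (
    conv_cost(256, 64, 1)    # 1x1 reduce: 256 -> 64 maps
    + conv_cost(64, 64, 3)   # 3x3 on the reduced maps
    + conv_cost(64, 256, 1)  # 1x1 expand: 64 -> 256 maps
)
print(direct)      # 589824
print(bottleneck)  # 69632
print(round(direct / bottleneck, 1))  # 8.5
```

The bottleneck design is roughly 8.5 times cheaper than the direct 3x3 convolution at the same input and output width.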
Case Study: ResNet [He et al., 2015]

Accuracy comparison

ResNet is the best CNN architecture that we currently have, and residual learning is a great innovation.

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Architectures for ImageNet:

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v4

C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
Summary: ILSVRC 2012-2015

Team                                        Year   Place   Error (top-5)   External data
SuperVision-Toronto (AlexNet, 7 layers)     2012   -       16.4%           no
SuperVision                                 2012   1st     15.3%           ImageNet 22k
Clarifai – NYU (7 layers)                   2013   -       11.7%           no
Clarifai                                    2013   1st     11.2%           ImageNet 22k
VGG – Oxford (16 layers)                    2014   2nd     7.32%           no
GoogLeNet (19 layers)                       2014   1st     6.67%           no

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Accuracy vs. efficiency

https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks

• What about ILSVRC 2016?

• No more ImageNet classification
• No breakthroughs comparable to ResNet
APPLICATIONS
Object classification [9]

Human Pose Estimation [10]

Super Resolution [11]
CNN on Text
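Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

The image is flattened row by row into inputs 1, 2, ..., 36 (input 1 = 1, inputs 2 to 4 = 0, ..., input 8 = 1, inputs 9 and 10 = 0, ..., inputs 13 and 14 = 0, inputs 15 and 16 = 1, ...).
The output unit for the top-left position of Filter 1 has value 3. It only connects to 9 inputs (1, 2, 3, 7, 8, 9, 13, 14, 15): fewer parameters!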
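The figure's point, that a convolution output is a neuron wired to only 9 of the 36 flattened inputs, can be reproduced in a few lines. The image and filter values are the ones on the slide; the helper names are illustrative.

```python
IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
FILTER = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]


def connected_inputs(row, col, width=6, k=3):
    """1-based indices of the flattened pixels seen by the output at (row, col)."""
    return [(row + i) * width + (col + j) + 1 for i in range(k) for j in range(k)]


def response(row, col):
    """Filter response (stride 1, no padding) at output position (row, col)."""
    return sum(FILTER[i][j] * IMAGE[row + i][col + j] for i in range(3) for j in range(3))


feature_map = [[response(r, c) for c in range(4)] for r in range(4)]
print(connected_inputs(0, 0))  # [1, 2, 3, 7, 8, 9, 13, 14, 15]
print(feature_map[0][0])       # 3
```

Every entry of the 4 x 4 feature map reuses the same 9 filter weights, which is the weight sharing that keeps the parameter count low.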
Transfer Learning

“You need a lot of data if you want to train/use CNNs”
Transfer Learning with CNNs

1. Train on ImageNet
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier,
   i.e. swap the Softmax layer at the end
3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers,
   i.e. retrain a bigger portion of the network, or even all of it.

tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on intermediate layers
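One way to read the recipe and the learning-rate tip is as a per-layer learning-rate table. The sketch below is framework-free and rests on several assumptions that are not in the slides: the layer names are hypothetical, a rate of 0.0 stands for a frozen layer, every layer below the top one counts as "intermediate", and the new classifier uses the 1/10 rate in both regimes.

```python
def finetune_learning_rates(layers, original_lr, small_dataset):
    """Map each layer name to a learning rate; 0.0 means the layer is frozen."""
    *lower, top = layers
    if small_dataset:
        # Fixed feature extractor: only the swapped-in classifier is trained.
        rates = {name: 0.0 for name in lower}
    else:
        # Finetuning: old weights are the initialization, updated gently.
        rates = {name: original_lr / 100 for name in lower}
    rates[top] = original_lr / 10
    return rates


layers = ["conv1", "conv5", "fc7", "fc8"]  # hypothetical layer names
for name, lr in finetune_learning_rates(layers, 0.01, small_dataset=False).items():
    print(f"{name}: {lr:g}")
```

In a real framework the same table would typically be expressed as optimizer parameter groups, with frozen layers excluded from the optimizer.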
DeepMind’s AlphaGo

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
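The padding values in this listing are exactly what keeps the 19x19 board size constant through every layer. A short check, together with a weight count that ignores biases (the count is derived here from the listed filter shapes and is not a figure from the slides):

```python
def conv_out(n, f, stride, pad):
    """Output size for an f x f filter on an n x n input."""
    return (n - f + 2 * pad) // stride + 1


size = conv_out(19, 5, 1, 2)        # CONV1: 5x5, pad 2
for _ in range(11):                 # CONV2..12: 3x3, pad 1
    size = conv_out(size, 3, 1, 1)
size = conv_out(size, 1, 1, 0)      # final 1x1 conv
print(size)  # 19

weights = (
    5 * 5 * 48 * 192                # CONV1
    + 11 * (3 * 3 * 192 * 192)      # CONV2..12
    + 1 * 1 * 192 * 1               # final 1x1 conv
)
print(weights)  # 3880128
```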

Summary
• ConvNets stack CONV,ReLU,POOL,FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like
  [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
• but recent advances such as ResNet/GoogLeNet use only Conv-ReLU, 1x1 convolutions and Softmax
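The pattern [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX can be unrolled mechanically. The sketch below does so for chosen N, M and K; the `pool` flag models the optional POOL? and applies to every block, which is a simplification.

```python
def classic_convnet(n, m, k, pool=True):
    """Expand [(CONV-RELU)*n-POOL?]*m-(FC-RELU)*k,SOFTMAX into a layer list."""
    layers = []
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("SOFTMAX")
    return layers


print("-".join(classic_convnet(n=2, m=2, k=1)))
# CONV-RELU-CONV-RELU-POOL-CONV-RELU-CONV-RELU-POOL-FC-RELU-SOFTMAX
```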

Weight Initialization

- Q: what happens when W=0 init is used?

Weight Initialization
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
Activation Statistics

E.g. 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described in last slide.
All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.
*1.0 instead of *0.01

Almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.
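The two regimes (activations collapsing to zero with a 0.01 scale, saturating with a 1.0 scale) can be reproduced with a small experiment. This is a rough sketch in plain Python: it uses 100 neurons per layer and a batch of 10 inputs instead of the slide's 500-neuron layers so that it runs quickly, and the seed and input distribution are arbitrary choices.

```python
import math
import random
import statistics


def activation_stds(weight_std, depth=10, width=100, batch=10, seed=0):
    """Std of the tanh activations at each layer of a fully connected net."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    stds = []
    for _ in range(depth):
        # Fresh Gaussian weight matrix for this layer, zero mean, given std.
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        acts = [[math.tanh(sum(w * x for w, x in zip(row, a))) for row in W] for a in acts]
        stds.append(statistics.pstdev([v for a in acts for v in a]))
    return stds


small = activation_stds(0.01)
large = activation_stds(1.0)
print(f"std 0.01: first layer {small[0]:.3g}, last layer {small[-1]:.3g}")
print(f"std 1.0:  first layer {large[0]:.3g}, last layer {large[-1]:.3g}")
```

With the small scale the spread of activations shrinks by roughly a constant factor per layer until it is numerically negligible; with the large scale it stays close to 1 because most units sit near -1 or 1.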
“Xavier initialization” [Glorot et al., 2010]

Easy Derivation (linear case):

Assume weights and inbound
activations have mean zero and are
independent.
Their variances multiply for each
term, and then scale by fan_in for
each output term.
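Written out, the argument is as follows. The notation is an assumption chosen for this note: a single linear output $s = \sum_{i=1}^{n} w_i x_i$ with $n$ = fan_in, and weights and inputs that are independent, identically distributed and zero mean.

```latex
\begin{aligned}
\operatorname{Var}(s)
  &= \operatorname{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big)
   = \sum_{i=1}^{n} \operatorname{Var}(w_i x_i) \\
  &= \sum_{i=1}^{n} \operatorname{Var}(w_i)\,\operatorname{Var}(x_i)
   = n \,\operatorname{Var}(w)\,\operatorname{Var}(x)
\end{aligned}
```

Choosing $\operatorname{Var}(w) = 1/n$, i.e. scaling unit-variance weights by $1/\sqrt{\text{fan\_in}}$, therefore keeps the output variance equal to the input variance.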

but when using the ReLU
nonlinearity it breaks.

He et al., 2015
(note additional /2)

A factor of 2 doesn’t seem like much, but remember it applies multiplicatively 150 times in a large ResNet.
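The effect of the extra factor of 2 can be estimated analytically. The sketch below assumes zero-mean weights and pre-activations that are symmetric around zero, so that a ReLU discards half of the second moment; under that assumption each layer scales the mean squared activation by fan_in × Var(w) / 2. The fan-in of 512 is an arbitrary illustrative value.

```python
import math


def xavier_std(fan_in):
    """Weight std for Xavier initialization: Var(w) = 1 / fan_in."""
    return math.sqrt(1.0 / fan_in)


def he_std(fan_in):
    """Weight std for He et al. initialization: Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def relu_signal_gain(weight_std, fan_in):
    """Expected per-layer scaling of the mean squared activation under ReLU."""
    return fan_in * weight_std ** 2 / 2.0


fan_in = 512  # illustrative value (assumption)
layers = 150
print(relu_signal_gain(xavier_std(fan_in), fan_in) ** layers)  # about 7e-46
print(relu_signal_gain(he_std(fan_in), fan_in) ** layers)      # about 1.0
```

Halving the signal 150 times leaves essentially nothing, which is why the factor matters in very deep networks.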
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init, Mishkin and Matas, 2015

…

Localization and Detection

Results from Faster R-CNN, Ren et al 2015

Computer Vision Tasks

• Classification: CAT (single object)
• Classification + Localization: CAT (single object)
• Object Detection: CAT, DOG, DUCK (multiple objects)
• Instance Segmentation: CAT, DOG, DUCK (multiple objects)
Classification + Localization: Task

Classification: C classes
Input: Image
Output: Class label CAT
Evaluation metric: Accuracy

Localization:
Input: Image
Output: Box in the image (x, y, w, h)
Evaluation metric: Intersection over Union

Classification + Localization: Do both

(x, y, w, h)

Classification + Localization: ImageNet

1000 classes (same as classification)

Each image has 1 class, at least one bounding box

~800 training images per class

Algorithm produces 5 (class, box) guesses

Example is correct if at least one guess has correct class AND bounding box at least 0.5 intersection over union (IoU)

Krizhevsky et al. 2012
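The evaluation rule can be sketched as follows. The box convention is an assumption: (x, y, w, h) with (x, y) as the top-left corner, which the slides do not specify. The function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_correct(guesses, true_class, true_boxes, threshold=0.5):
    """True if some (class, box) guess has the right class and IoU >= threshold
    with at least one ground-truth box."""
    return any(
        cls == true_class and any(iou(box, gt) >= threshold for gt in true_boxes)
        for cls, box in guesses
    )


# Two 10x10 boxes offset by 5 pixels: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.333...
```

Note that a half-overlapping box of the same size scores only 1/3, so the 0.5 threshold is stricter than it may sound.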
Idea #1: Localization as Regression

Input: image

Neural Net Output: Box coordinates (4 numbers)

Correct output: box coordinates (4 numbers)

Loss: L2 distance

Only one object, simpler than detection
Simple Recipe for Classification + Localization

Step 1: Train (or download) a classification model (AlexNet, VGG, GoogLeNet)

Image -> Convolution and Pooling -> Final conv feature map -> Fully-connected layers -> Class scores -> Softmax loss

Step 2: Attach new fully-connected “regression head” to the network

Image -> Convolution and Pooling -> Final conv feature map
  -> Fully-connected layers (“Classification head”) -> Class scores
  -> Fully-connected layers (“Regression head”) -> Box coordinates

Step 3: Train the regression head only with SGD and L2 loss

Image -> Convolution and Pooling -> Final conv feature map
  -> Fully-connected layers -> Class scores
  -> Fully-connected layers -> Box coordinates -> L2 loss

Step 4: At test time use both heads

Image -> Convolution and Pooling -> Final conv feature map
  -> Fully-connected layers -> Class scores
  -> Fully-connected layers -> Box coordinates
Per-class vs class agnostic regression

Assume classification over C classes:

Classification head: C numbers (one per class)

Regression head, class agnostic: 4 numbers (one box)

Regression head, class specific: C x 4 numbers (one box per class)

Image -> Convolution and Pooling -> Final conv feature map
  -> Fully-connected layers -> Class scores
  -> Fully-connected layers -> Box coordinates
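With a class-specific head, the final box is read out by first picking the class and then taking that class's 4 numbers. A minimal sketch, assuming the C x 4 outputs arrive as one flat list ordered class by class (the layout and all values are made up for illustration):

```python
def select_box(class_scores, box_outputs):
    """Pick the box of the top-scoring class from a class-specific regression head.

    class_scores: C numbers; box_outputs: flat list of C * 4 numbers.
    """
    num_classes = len(class_scores)
    if len(box_outputs) != 4 * num_classes:
        raise ValueError("expected 4 numbers per class")
    best = max(range(num_classes), key=lambda c: class_scores[c])
    return best, tuple(box_outputs[4 * best: 4 * best + 4])


scores = [0.1, 2.3, -0.5]                         # C = 3, made-up values
boxes = [0, 0, 5, 5, 10, 20, 30, 40, 1, 1, 2, 2]  # 3 x 4 numbers
print(select_box(scores, boxes))  # (1, (10, 20, 30, 40))
```

A class-agnostic head skips the selection step and outputs its single box directly.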
Where to attach the regression head?

After conv layers: Overfeat, VGG

After last FC layer: DeepPose, R-CNN

Image -> Convolution and Pooling -> Final conv feature map -> Fully-connected layers -> Class scores -> Softmax loss
Aside: Localizing multiple objects

Want to localize exactly K objects in each image
(e.g. whole cat, cat head, cat left ear, cat right ear for K=4)

Image -> Convolution and Pooling -> Final conv feature map
  -> Fully-connected layers -> Class scores
  -> Fully-connected layers -> Box coordinates: K x 4 numbers (one box per object)
Aside: Human Pose Estimation

Represent a person by K joints

Regress (x, y) for each joint from last fully-connected layer of AlexNet

(Details: Normalized coordinates, iterative refinement)

Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
Idea #2: Sliding Window

● Run classification + regression network at multiple locations on a high-resolution image

● Convert fully-connected layers into convolutional layers for efficient computation

● Combine classifier and regressor predictions across all scales for final prediction
Sliding Window: Overfeat
Winner of ILSVRC 2013 localization challenge

Image: 3 x 221 x 221 -> Convolution + pooling -> Feature map: 1024 x 5 x 5
  -> FC 4096 -> FC 4096 -> Class scores: 1000 -> Softmax loss
  -> FC 4096 -> FC 1024 -> Boxes: 1000 x 4 -> Euclidean loss

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
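The sliding-window trick of turning fully connected layers into convolutions can be checked with a little arithmetic. An FC layer over a 1024 x 5 x 5 feature map has exactly as many weights as a 5x5 convolution with the same number of outputs, and the convolutional form also accepts larger feature maps. The 7x7 feature map below is a hypothetical example of a larger input, not a figure from the slides; biases are ignored.

```python
def fc_weights(in_channels, in_size, out_units):
    """Weights of an FC layer on a flattened in_channels x in_size x in_size map."""
    return in_channels * in_size * in_size * out_units


def conv_weights(in_channels, kernel, out_channels):
    """Weights of a kernel x kernel convolution."""
    return in_channels * kernel * kernel * out_channels


def conv_output_size(in_size, kernel, stride=1):
    return (in_size - kernel) // stride + 1


# Overfeat head: 1024 x 5 x 5 feature map feeding a 4096-unit FC layer.
print(fc_weights(1024, 5, 4096) == conv_weights(1024, 5, 4096))  # True
print(conv_output_size(5, 5))  # 1: one prediction at the training resolution
print(conv_output_size(7, 5))  # 3: a 3x3 grid of predictions on a larger map
```

The later FC layers become 1x1 convolutions in the same way, so a larger image yields a grid of class scores and boxes in a single forward pass.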