CS60010: Deep Learning CNN - Part 3: Sudeshna Sarkar

This document summarizes several classic convolutional neural network architectures for image classification: 1) LeNet-5 was one of the earliest CNNs and had a [CONV-POOL-CONV-POOL-CONV-FC] structure. 2) AlexNet significantly outperformed other methods in the 2012 ImageNet challenge using GPUs and large datasets. Its architecture included 5 CONV layers, some max pooling layers, and 3 fully connected layers. 3) VGGNet placed second in the 2014 ImageNet challenge using a deeper network of 16-19 layers composed solely of stacked 3x3 convolution filters to achieve large receptive fields.


CS60010: Deep Learning

CNN – Part 3

Sudeshna Sarkar
Spring 2019

7 Feb 2019
CNN on Text
CNN in text classification

Source of image:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Objectives
• We will examine classic CNN architectures with the goal of:
- Gaining intuition for building CNNs
- Reusing CNN architectures
Case Study: LeNet-5 [LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1


Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

21 Jan 2015
The ILSVRC-2012 competition on ImageNet

• The dataset has 1.2 million high-resolution training images.
• The classification task:
  • Get the “correct” class in your top 5 bets. There are 1000 classes.
• The localization task:
  • For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, …
• Computer vision systems use complicated multi-stage systems.
  • The early stages are typically hand-tuned by optimizing a few parameters.
[Krizhevsky et al. 2012]
Case Study: AlexNet

AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and
significantly outperformed the runner-up (top 5 error of 16% compared to
the runner-up's 26% error).
Facilitated by GPUs, highly optimized convolution implementation and large
datasets (ImageNet)
Has 60 million parameters

ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya
Sutskever, Geoffrey E. Hinton; 2012
AlexNet
Architecture:
– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.

Layers: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8

• Input: 227x227x3 images (224x224 before padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
  (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
• Number of parameters in this layer?
  (11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
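
The output-size and parameter arithmetic for the first AlexNet layer can be checked with a few lines of Python. This is a minimal sketch: the helper names `conv_output_size` and `conv_weight_count` are illustrative and not from the slides, and biases are left out so that the count matches the slide's 35K figure.

```python
def conv_output_size(n: int, f: int, stride: int, pad: int = 0) -> int:
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1


def conv_weight_count(f: int, in_channels: int, num_filters: int) -> int:
    """Number of weights (biases excluded) in a conv layer."""
    return f * f * in_channels * num_filters


# AlexNet CONV1: 227x227x3 input, 96 filters of size 11x11, stride 4.
alexnet_out = conv_output_size(227, 11, stride=4)
alexnet_weights = conv_weight_count(11, 3, 96)
print(alexnet_out, alexnet_weights)  # 55 34848
```

The same helper reproduces the later layers, for example a 5x5 convolution with stride 1 and padding 2 keeps a 27x27 map at 27x27.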
AlexNet

227×227×3
→ conv 11×11, s=4, P=0 → 55×55×96
→ max pool 3×3, s=2 → 27×27×96
→ conv 5×5, s=1, P=2 → 27×27×256
→ max pool 3×3, s=2 → 13×13×256
→ conv 3×3, s=1, P=1 → 13×13×384
→ conv 3×3, s=1, P=1 → 13×13×384
→ conv 3×3, s=1, P=1 → 13×13×256
→ max pool 3×3, s=2 → 6×6×256

This slide is taken from Andrew Ng [Krizhevsky et al., 2012]


AlexNet

... → FC 4096 → FC 4096 → Softmax 1000

This slide is taken from Andrew Ng [Krizhevsky et al., 2012]


AlexNet
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble

• Trained on GTX 580 GPU with only 3 GB of memory.
• Network spread across 2 GPUs, half the neurons (feature maps) on each GPU.
• CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs.

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
A neural network for ImageNet

– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.
• The activation functions were:
– Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Error rates on the ILSVRC-2012 competition

Team                                              classification   classification & localization
University of Toronto (Alex Krizhevsky)           16.4%            34.1%
University of Tokyo                               26.1%            53.6%
Oxford University Computer Vision Group           26.9%            50.0%
INRIA (French national research institute in CS)
  + XRCE (Xerox Research Center Europe)           27.0%
University of Amsterdam                           29.5%
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet: ILSVRC 2014 2nd place
• Sequence of deeper networks
trained progressively
• Large receptive fields replaced by
successive layers of 3x3
convolutions (with ReLU in
between)

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition, ICLR 2015
VGGNet
• Smaller filters
  Only 3x3 CONV filters, stride 1, pad 1
  and 2x2 MAX POOL, stride 2
• Deeper network
  AlexNet: 8 layers
  VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14

Layers:
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.

• What is the effective receptive field of three 3x3 conv (stride 1) layers?
7x7
But deeper, more non-linearities
And fewer parameters: $3 \cdot (3^2 C^2)$ vs. $7^2 C^2$ for C channels per layer

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
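
The receptive-field and parameter comparison can be verified numerically. The sketch below is illustrative only: the function names are made up here, biases are ignored, and the channel count `C = 256` is an arbitrary choice, not a value from the slides.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Effective receptive field of stride-1 conv layers stacked in sequence."""
    field = 1
    for _ in range(num_layers):
        field += kernel - 1  # each stride-1 layer widens the field by (kernel - 1)
    return field


def conv_stack_weights(num_layers: int, kernel: int, channels: int) -> int:
    """Weights (no biases) in a stack where every layer maps C -> C channels."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel count, chosen only for illustration
three_3x3 = conv_stack_weights(3, 3, C)
one_7x7 = conv_stack_weights(1, 7, C)
print(stacked_receptive_field(3))  # 7
print(three_3x3, one_7x7)          # 1769472 3211264
```

For any C the ratio is 27 to 49, so the stacked version uses a little over half the weights while adding two extra non-linearities.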
Receptive Field

conv conv conv


GoogLeNet: ILSVRC 2014 winner
• The Inception Module
• Inception Module dramatically reduced the number of
parameters in the network
(4M, compared to AlexNet with 60M).
• Uses Average Pooling instead of Fully Connected layers at
the top of the ConvNet
• Several followup versions to the GoogLeNet, most
recently Inception-v4.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet

• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)

[Szegedy et al., 2014]


GoogLeNet
• The Inception Module
• design a good local network topology (network within a network) and
then stack these modules on top of each other
• Parallel paths with different receptive field sizes and operations are
meant to capture sparse patterns of correlations in the stack of
feature maps

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet
• The Inception Module
• Parallel paths with different receptive field sizes and operations are meant to capture
sparse patterns of correlations in the stack of feature maps
• Use 1x1 convolutions for dimensionality reduction before expensive convolutions

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


Case Study: GoogLeNet [Szegedy et al., 2014]

Inception module

ILSVRC 2014 winner (6.7% top 5 error)


GoogLeNet

Inception module

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


Case Study: GoogLeNet [Szegedy et al., 2014]

1x1 dimension reduction layers


(reduce compute bottlenecks)

Inception module

Case Study: GoogLeNet [Szegedy et al., 2014]

Helper loss (during training only)

Inception module

GoogLeNet

Auxiliary classifier

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


Case Study: GoogLeNet

Fun features:

- Only 5 million params!


(Removes FC layers
completely)

Compared to AlexNet:
- 12x fewer params
- 2x more compute
- 6.67% top 5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing importance of
auxiliary classifiers
• More variants of inception modules with aggressive factorization of
filters
• Increase the number of feature maps while decreasing spatial resolution
(pooling)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and exploding
gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.

[He et al., 2015]


ResNet

• ILSVRC’15 classification winner (3.57% top 5 error, humans generally hover around a 5-10% error rate)
• Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet

• What happens when we continue stacking deeper layers on a convolutional neural network?
• 56-layer model performs worse on both training and test error
-> The deeper model performs worse (not caused by overfitting)!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.

• We will use skip connections allowing us to take the activation from one layer and feed it into another layer, much deeper into the network.
• Use layers to fit residual F(x) = H(x) – x instead of H(x) directly

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.

[He et al., 2015]
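
A residual block of this form can be sketched in a few lines of NumPy. As a simplifying assumption, the two convolutions are replaced by dense matrices `w1` and `w2` (hypothetical names); the point is only the structure H(x) = F(x) + x, not a faithful ResNet layer.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F(x) = W2 @ relu(W1 @ x) standing in for conv-relu-conv."""
    f = w2 @ relu(w1 @ x)
    return f + x


rng = np.random.default_rng(0)
x = rng.normal(size=8)

# With zero weights the residual branch F(x) vanishes and the block is the identity.
zeros = np.zeros((8, 8))
identity_out = residual_block(x, zeros, zeros)

# With small random weights the output stays close to the input x.
w1 = 0.01 * rng.normal(size=(8, 8))
w2 = 0.01 * rng.normal(size=(8, 8))
perturbed_out = residual_block(x, w1, w2)
print(np.allclose(identity_out, x))
```

The zero-weight case illustrates why the identity mapping is easy for such a block to represent: the layers only have to learn the correction F(x).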


ResNet
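Short cut / skip connection

a[l] → Linear → ReLU → a[l+1] → Linear → ReLU → a[l+2]

$$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}, \qquad a^{[l+1]} = g(z^{[l+1]})$$

$$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}, \qquad a^{[l+2]} = g(z^{[l+2]}) \quad \text{(plain network)}$$

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \quad \text{(with the skip connection)}$$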


[He et al., 2015]
ResNet
• The residual module
• Introduce skip or shortcut connections (existing before in various forms in
literature)
• Make it easy for network layers to represent the identity mapping

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image
Recognition, CVPR 2016 (Best Paper)
Case Study: ResNet [He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error
plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used
ResNet
Deeper residual module (bottleneck)
• Directly performing 3x3 convolutions with 256 feature maps at input and output:
  256 x 256 x 3 x 3 ~ 600K operations
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning
for Image Recognition, CVPR 2016 (Best Paper)
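
The operation counts for the bottleneck design follow directly from the channel and kernel sizes. A minimal sketch of that arithmetic (the helper `conv_cost` is an illustrative name, and the count is per output position, ignoring biases):

```python
def conv_cost(c_in: int, c_out: int, k: int) -> int:
    """Multiplications per output position for a k x k convolution."""
    return c_in * c_out * k * k


# Plain 3x3 convolution, 256 -> 256 feature maps.
direct = conv_cost(256, 256, 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256.
bottleneck = conv_cost(256, 64, 1) + conv_cost(64, 64, 3) + conv_cost(64, 256, 1)

print(direct, bottleneck)  # 589824 69632
```

The bottleneck needs roughly 8.5 times fewer operations than the direct 3x3 convolution.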
Case Study: ResNet [He et al., 2015]

Accuracy comparison

ResNet is the best CNN architecture that we currently have, and the idea of residual learning is a great innovation.

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Architectures for ImageNet:

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v4

C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
Summary: ILSVRC 2012-2015

Team                                        Year  Place  Error (top-5)  External data
SuperVision-Toronto (AlexNet, 7 layers)     2012  -      16.4%          no
SuperVision                                 2012  1st    15.3%          ImageNet 22k
Clarifai – NYU (7 layers)                   2013  -      11.7%          no
Clarifai                                    2013  1st    11.2%          ImageNet 22k
VGG – Oxford (16 layers)                    2014  2nd    7.32%          no
GoogLeNet (19 layers)                       2014  1st    6.67%          no

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Accuracy vs. efficiency

https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks

• What about ILSVRC 2016?


• No more ImageNet classification
• No breakthroughs comparable to ResNet
APPLICATIONS
Object classification [9]

Human Pose Estimation [10]

Super Resolution [11]
CNN on Text
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

With the image flattened into numbered inputs (1, 2, 3, ...), each output (e.g. the value 3 for the top-left window) connects to only 9 inputs.
Only connect to 9 inputs: fewer parameters!
Transfer Learning

“You need a lot of data if you want to train/use CNNs”
Transfer Learning with CNNs
1. Train on ImageNet.
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier, i.e. swap the Softmax layer at the end.
3. If you have a medium sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers (retrain a bigger portion of the network, or even all of it).

tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on intermediate layers
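
The small-dataset case (frozen feature extractor, new classifier) can be sketched with NumPy. Everything here is a stand-in: the "pretrained" backbone is a fixed random projection with ReLU rather than real ImageNet weights, the data are synthetic, and names such as `backbone_w` and `head_w` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained CNN: a fixed random projection followed by ReLU.
backbone_w = rng.normal(size=(20, 32)) / np.sqrt(20)
backbone_before = backbone_w.copy()


def extract_features(x):
    return np.maximum(x @ backbone_w, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))


# Small synthetic dataset with 3 classes, shifted by class-specific means.
labels = rng.integers(0, 3, size=90)
class_means = rng.normal(size=(3, 20))
x = rng.normal(size=(90, 20)) + 2.0 * class_means[labels]

feats = extract_features(x)   # backbone is frozen, so features are computed once
head_w = np.zeros((32, 3))    # new classifier that replaces the old softmax layer
one_hot = np.eye(3)[labels]
loss_start = cross_entropy(softmax(feats @ head_w), labels)

for _ in range(500):
    probs = softmax(feats @ head_w)
    grad = feats.T @ (probs - one_hot) / len(labels)
    head_w -= 0.01 * grad     # only the new head is updated

loss_end = cross_entropy(softmax(feats @ head_w), labels)
print(loss_start, loss_end)
```

For the medium-dataset case one would also update the backbone weights, typically with a smaller learning rate than the new head, in line with the tip above.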
DeepMind’s AlphaGo

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
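
The layer list above can be checked for shape consistency: with these strides and paddings every layer keeps the 19x19 board size. A short sketch, with an illustrative helper name and a weight count that ignores biases (the count is derived here, not stated on the slide):

```python
def conv_output_size(n: int, f: int, stride: int, pad: int) -> int:
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (n - f + 2 * pad) // stride + 1


# (in_channels, out_channels, kernel, pad) for CONV1, CONV2..12, final 1x1 CONV.
layers = [(48, 192, 5, 2)] + [(192, 192, 3, 1)] * 11 + [(192, 1, 1, 0)]

size = 19
weights = 0
for c_in, c_out, k, pad in layers:
    size = conv_output_size(size, k, stride=1, pad=pad)
    weights += c_in * c_out * k * k

print(len(layers), size, weights)  # 13 19 3880128
```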

Summary
• ConvNets stack CONV,ReLU,POOL,FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
• but recent advances such as ResNet/GoogLeNet use only Conv-ReLU, 1x1 convolutions and Softmax
Weight Initialization

- Q: what happens when W=0 init is used?

- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
Activation Statistics

E.g. 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described in last slide.
All activations become zero!
Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.
*1.0 instead of *0.01: Almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.
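
The two failure modes (vanishing activations with a 0.01 scale, saturation with a 1.0 scale) can be reproduced with a small NumPy experiment. This is a sketch of the setup described in the slides (10 layers, 500 units, tanh); the batch size, seed and the 0.99 saturation threshold are arbitrary choices made here.

```python
import numpy as np


def final_layer_activations(weight_std, num_layers=10, width=500, batch=200, seed=0):
    """Push unit-Gaussian inputs through a tanh stack with Gaussian weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(num_layers):
        w = weight_std * rng.normal(size=(width, width))
        h = np.tanh(h @ w)
    return h


small = final_layer_activations(0.01)  # weights ~ 0.01 * N(0, 1)
large = final_layer_activations(1.0)   # weights ~ 1.0 * N(0, 1)

small_std = float(small.std())
saturated_fraction = float(np.mean(np.abs(large) > 0.99))
print("last-layer std with 0.01 init:", small_std)
print("fraction of saturated units with 1.0 init:", saturated_fraction)
```

With the small scale the activations collapse towards zero layer by layer; with the large scale most units sit near -1 or 1, where the tanh gradient is close to zero.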
“Xavier initialization” [Glorot et al., 2010]

Easy Derivation (linear case):
Assume weights and inbound activations have mean zero and are independent.
Their variances multiply for each term, and then scale by fan_in for each output term.
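
The derivation can be written out as a short worked equation. This is a sketch under the stated assumptions (zero-mean, independent, identically distributed weights $w_i$ and inputs $x_i$); the symbols $s$ and $n$ are introduced here for illustration, with $n$ standing for fan_in.

```latex
\begin{align}
s &= \sum_{i=1}^{n} w_i x_i \\
\operatorname{Var}(s) &= \sum_{i=1}^{n} \operatorname{Var}(w_i x_i)
  = \sum_{i=1}^{n} \operatorname{Var}(w_i)\,\operatorname{Var}(x_i)
  = n \,\operatorname{Var}(w)\,\operatorname{Var}(x) \\
\operatorname{Var}(s) &= \operatorname{Var}(x)
  \quad\Longleftrightarrow\quad \operatorname{Var}(w) = \frac{1}{n}
\end{align}
```

Choosing the weight variance as 1/fan_in therefore keeps the variance of a linear unit's output equal to the variance of its inputs.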
but when using the ReLU nonlinearity it breaks.
He et al., 2015
(note additional /2)

factor of 2 doesn’t seem like much, but remember it applies multiplicatively 150 times in a large ResNet.
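
The effect of the extra factor of 2 with ReLU can be seen in a small NumPy experiment. This is an illustrative sketch, not the setup of any of the cited papers: 10 ReLU layers of width 500, weights drawn with variance `scale / fan_in`, and the root-mean-square activation of the last layer as the summary statistic.

```python
import numpy as np


def final_rms(weight_var_scale, num_layers=10, width=500, batch=200, seed=0):
    """RMS activation after a ReLU stack whose weights have variance scale / fan_in."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(num_layers):
        w = rng.normal(size=(width, width)) * np.sqrt(weight_var_scale / width)
        h = np.maximum(h @ w, 0.0)
    return float(np.sqrt(np.mean(h ** 2)))


xavier_rms = final_rms(1.0)  # Var(w) = 1 / fan_in
he_rms = final_rms(2.0)      # Var(w) = 2 / fan_in (the additional /2 in the std formula)
print("Xavier-style init, last-layer RMS:", xavier_rms)
print("He-style init, last-layer RMS:", he_rms)
```

ReLU zeroes roughly half of each layer's pre-activations, so with variance 1/fan_in the signal shrinks by about a factor of 2 in second moment per layer, while variance 2/fan_in keeps it roughly constant.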
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init, Mishkin and Matas, 2015


Localization and Detection

Results from Faster R-CNN, Ren et al 2015

Computer Vision Tasks

Classification: CAT (single object)
Classification + Localization: CAT (single object)
Object Detection: CAT, DOG, DUCK (multiple objects)
Instance Segmentation: CAT, DOG, DUCK (multiple objects)
Classification + Localization: Task

Classification: C classes
Input: Image
Output: Class label CAT
Evaluation metric: Accuracy

Localization:
Input: Image
Output: Box in the image (x, y, w, h)
Evaluation metric: Intersection over Union

Classification + Localization: Do both (class label and box (x, y, w, h))
Classification + Localization: ImageNet

1000 classes (same as classification)

Each image has 1 class, at least one bounding box

~800 training images per class

Algorithm produces 5 (class, box) guesses

Example is correct if at least one guess has correct class AND bounding box at least 0.5 intersection over union (IoU)

Krizhevsky et. al. 2012
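
The IoU criterion can be made concrete with a short Python sketch. It assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner, which the slides do not specify; the function names `iou` and `is_correct` are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes (x, y, w, h), (x, y) = top-left corner (assumed)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_correct(guesses, true_label, true_box, threshold=0.5):
    """True if any (label, box) guess has the right class and IoU >= threshold."""
    return any(label == true_label and iou(box, true_box) >= threshold
               for label, box in guesses)


truth = (0, 0, 10, 10)
guesses = [("dog", (0, 0, 10, 10)), ("cat", (5, 0, 10, 10)), ("cat", (1, 1, 10, 10))]
print(iou(truth, (5, 0, 10, 10)))         # overlap 50, union 150
print(is_correct(guesses, "cat", truth))  # True: third guess has IoU 81/119
```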

Idea #1: Localization as Regression

Input: image
Neural Net Output: Box coordinates (4 numbers)
Correct output: box coordinates (4 numbers)
Loss: L2 distance

Only one object, simpler than detection
Simple Recipe for Classification + Localization

Step 1: Train (or download) a classification model (AlexNet, VGG, GoogLeNet)
Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss

Step 2: Attach new fully-connected “regression head” to the network
Final conv feature map → Fully-connected layers (“Classification head”) → Class scores
Final conv feature map → Fully-connected layers (“Regression head”) → Box coordinates

Step 3: Train the regression head only with SGD and L2 loss
Final conv feature map → Fully-connected layers → Box coordinates → L2 loss

Step 4: At test time use both heads
Image → Convolution and Pooling → Final conv feature map → Class scores and Box coordinates
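
The recipe above can be sketched with NumPy using a single linear layer per head. All data are synthetic and all names (`features`, `cls_head`, `reg_head`) are hypothetical; full-batch gradient descent is used in place of SGD to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, feat_dim = 5, 16

# Stand-in for the flattened final conv feature map of 64 images (random data).
features = rng.normal(size=(64, feat_dim))
true_boxes = 0.1 * features @ rng.normal(size=(feat_dim, 4))  # synthetic (x, y, w, h)

cls_head = 0.1 * rng.normal(size=(feat_dim, num_classes))  # "pretrained", kept fixed
cls_before = cls_head.copy()
reg_head = np.zeros((feat_dim, 4))                          # new regression head


def l2_loss(pred, target):
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


loss_before = l2_loss(features @ reg_head, true_boxes)
for _ in range(300):
    pred = features @ reg_head
    grad = 2.0 * features.T @ (pred - true_boxes) / len(features)
    reg_head -= 0.01 * grad  # Step 3: update the regression head only
loss_after = l2_loss(features @ reg_head, true_boxes)

# Step 4: at test time both heads read the same features.
class_scores = features @ cls_head
boxes = features @ reg_head
print(loss_before, loss_after)
```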
Per-class vs class agnostic regression

Assume classification over C classes:

Classification head: C numbers (one per class)

Regression head (box coordinates), two options:
Class agnostic: 4 numbers (one box)
Class specific: C x 4 numbers (one box per class)
Where to attach the regression head?

After conv layers: Overfeat, VGG
After last FC layer: DeepPose, R-CNN

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores → Softmax loss
Aside: Localizing multiple objects

Want to localize exactly K objects in each image
(e.g. whole cat, cat head, cat left ear, cat right ear for K=4)

Regression head output: K x 4 numbers (one box per object)
Aside: Human Pose Estimation

Represent a person by K joints

Regress (x, y) for each joint from last fully-connected layer of AlexNet

(Details: Normalized coordinates, iterative refinement)

Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
Idea #2: Sliding Window

● Run classification + regression network at multiple locations on a high-resolution image

● Convert fully-connected layers into convolutional layers for efficient computation

● Combine classifier and regressor predictions across all scales for final prediction
Sliding Window: Overfeat
Winner of ILSVRC 2013 localization challenge

Image: 3 x 221 x 221 → Convolution + pooling → Feature map: 1024 x 5 x 5
Classification head: FC 4096 → FC 4096 → Class scores: 1000 (Softmax loss)
Regression head: FC 4096 → FC 1024 → Boxes: 1000 x 4 (Euclidean loss)

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Sliding Window: Overfeat

Network input: 3 x 221 x 221
Larger image: 3 x 257 x 257

Classification scores P(cat) at the four window positions:
top-left 0.5, top-right 0.75, bottom-left 0.6, bottom-right 0.8

Greedily merge boxes and scores (details in paper)

Classification score after merging: P(cat) = 0.8
Sliding Window: Overfeat

In practice use many sliding window locations and multiple scales

Window positions + score maps → Box regression outputs → Final Predictions

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Efficient Sliding Window: Overfeat

Efficient sliding window by converting fully-connected layers into convolutions

Image: 3 x 221 x 221 → Convolution + pooling → Feature map: 1024 x 5 x 5
Classification head: 5 x 5 conv → 4096 x 1 x 1 → 1 x 1 conv → 1024 x 1 x 1 → 1 x 1 conv → Class scores: 1000 x 1 x 1
Regression head: 5 x 5 conv → 4096 x 1 x 1 → 1 x 1 conv → 1024 x 1 x 1 → 1 x 1 conv → Box coordinates: (4 x 1000) x 1 x 1
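
The equivalence between a fully-connected layer and a convolution whose kernel covers the whole feature map can be checked numerically. The sketch below uses small, hypothetical sizes (4 channels, 6 hidden units) instead of Overfeat's 1024 and 4096, and a naive loop-based `conv_valid` helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, k, hidden = 4, 5, 6  # small illustrative sizes, not Overfeat's

fc_weights = rng.normal(size=(hidden, channels * k * k))
conv_kernels = fc_weights.reshape(hidden, channels, k, k)  # same numbers, viewed as filters


def conv_valid(feature_map, kernels):
    """Stride-1, no-padding convolution (cross-correlation) of a C x H x W map."""
    out_c, _, kh, kw = kernels.shape
    _, h, w = feature_map.shape
    out = np.zeros((out_c, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = feature_map[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out


# Training-size input: 5 x 5 feature map gives a 1 x 1 output equal to the FC output.
small_map = rng.normal(size=(channels, 5, 5))
fc_out = fc_weights @ small_map.reshape(-1)
conv_out = conv_valid(small_map, conv_kernels)

# Larger input: a 6 x 6 feature map gives a 2 x 2 grid, one output per window.
large_map = rng.normal(size=(channels, 6, 6))
large_out = conv_valid(large_map, conv_kernels)
print(conv_out.shape, large_out.shape)  # (6, 1, 1) (6, 2, 2)
```

Each entry of the 2 x 2 grid equals what the fully-connected layer would produce on the corresponding 5 x 5 crop, but the convolutional form shares the computation between overlapping windows.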
Efficient Sliding Window: Overfeat

Training time: Small image, 1 x 1 classifier output

Test time: Larger image, 2 x 2 classifier output, only extra compute at yellow regions

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
ImageNet Classification + Localization

AlexNet: Localization method not published

Overfeat: Multiscale convolutional regression with box merging

VGG: Same as Overfeat, but fewer scales and locations; simpler method, gains all due to deeper features

ResNet: Different localization method (RPN) and much deeper features
Detection as Regression?

DOG, (x, y, w, h)
CAT, (x, y, w, h)
CAT, (x, y, w, h)
DUCK, (x, y, w, h)
= 16 numbers

DOG, (x, y, w, h)
CAT, (x, y, w, h)
= 8 numbers

CAT, (x, y, w, h)
CAT, (x, y, w, h)
….
CAT, (x, y, w, h)
= many numbers

Need variable sized outputs
Detection as Classification

Window 1: CAT? NO. DOG? NO
Window 2: CAT? YES! DOG? NO
Window 3: CAT? NO. DOG? NO

Problem: Need to test many positions and scales

Solution: If your classifier is fast enough, just do it
Histogram of Oriented Gradients

Dalal and Triggs, “Histograms of Oriented Gradients for Human Detection”, CVPR 2005
Slide credit: Ross Girshick

Deformable Parts Model (DPM)

Felzenszwalb et al, “Object Detection with Discriminatively Trained Part Based Models”, PAMI 2010

Aside: Deformable Parts Models are CNNs?

Girshick et al, “Deformable Part Models are Convolutional Neural Networks”, CVPR 2015
Detection as Classification

Problem: Need to test many positions and scales, and use a computationally demanding classifier (CNN)

Solution: Only look at a tiny subset of possible positions
Region Proposals

● Find “blobby” image regions that are likely to contain objects
● “Class-agnostic” object detector
● Look for “blob-like” regions
Region Proposals: Selective Search

Bottom-up segmentation, merging regions at multiple scales

Convert regions to boxes

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Region Proposals: Many other choices

Hosang et al, “What makes for effective detection proposals?”, PAMI 2015
Putting it together: R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

Slide credit: Ross Girshick
R-CNN Training

Step 1: Train (or download) a classification model for ImageNet (AlexNet)

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (1000 classes) → Softmax loss
R-CNN Training

Step 2: Fine-tune model for detection
- Instead of 1000 ImageNet classes, want 20 object classes + background
- Throw away final fully-connected layer, reinitialize from scratch
- Keep training model using positive / negative regions from detection images

Re-initialize this layer: was 4096 x 1000, now will be 4096 x 21

Image → Convolution and Pooling → Final conv feature map → Fully-connected layers → Class scores (21 classes) → Softmax loss
R-CNN Training

Step 3: Extract features
- Extract region proposals for all images
- For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
- Have a big hard drive: features are ~200GB for PASCAL dataset!

Image → Region Proposals → Crop + Warp → Forward pass (Convolution and Pooling) → pool5 features → Save to disk
R-CNN Training

Step 4: Train one binary SVM per class to classify region features

Training image regions → Cached region features → Positive and negative samples for cat SVM
Training image regions → Cached region features → Positive and negative samples for dog SVM
R-CNN Training

Step 5 (bbox regression): For each class, train a linear regression model to map from cached features to offsets to GT boxes to make up for “slightly wrong” proposals

[Figure: training image regions and their cached region features]

Regression targets (dx, dy, dw, dh), in normalized coordinates:
- (0, 0, 0, 0): proposal is good
- (.25, 0, 0, 0): proposal too far to left
- (0, 0, -0.125, 0): proposal too wide
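
The regression targets above can be reproduced with a short sketch. It assumes boxes are given as (center x, center y, width, height) and uses the parametrization from the R-CNN paper (offsets divided by the proposal size, log ratios for width and height); the slide itself only says "normalized coordinates", so treat the exact formula as an assumption.

```python
import math

def bbox_regression_targets(proposal, gt):
    """Targets (dx, dy, dw, dh) that move `proposal` onto the ground-truth box.

    Boxes are (center_x, center_y, width, height). Center offsets are
    normalized by the proposal size; sizes use a log ratio (assumed convention).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    dx = (gx - px) / pw
    dy = (gy - py) / ph
    dw = math.log(gw / pw)
    dh = math.log(gh / ph)
    return dx, dy, dw, dh

gt = (50.0, 50.0, 40.0, 20.0)
good = bbox_regression_targets((50.0, 50.0, 40.0, 20.0), gt)      # proposal is good
too_left = bbox_regression_targets((40.0, 50.0, 40.0, 20.0), gt)  # 10 px left, 40 px wide
too_wide = bbox_regression_targets((50.0, 50.0, 40.0 * math.exp(0.125), 20.0), gt)
print(good, too_left, too_wide)
```

Because the offsets are relative to the proposal's own size, the same target values mean the same kind of correction for small and large boxes.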

Object Detection: Datasets

                      PASCAL VOC    ImageNet Detection    MS-COCO
                      (2010)        (ILSVRC 2014)         (2014)

Number of classes     20            200                   80

Number of images      ~20k          ~470k                 ~120k
(train + val)

Mean objects          2.4           1.1                   7.2
per image

Object Detection: Evaluation

We use a metric called “mean average precision” (mAP)

Compute average precision (AP) separately for each class, then average over classes

A detection is a true positive if it has IoU (Intersection over Union) with a ground-truth box greater than some threshold (usually 0.5) (mAP@0.5)

Combine all detections from all test images to draw a precision / recall curve for each class; AP is area under the curve

TL;DR mAP is a number from 0 to 100; high is good
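
The evaluation steps above can be sketched as follows, assuming boxes are (x0, y0, x1, y1) corners and that the detections for a class have already been matched to ground truth and sorted by descending score. The area under the curve is computed as a plain step-wise sum; benchmark toolkits use interpolated variants, so this is an illustration rather than an official scorer.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(is_true_positive, num_gt):
    """Area under the precision / recall curve for one class.

    is_true_positive: one flag per detection, sorted by descending score.
    num_gt: number of ground-truth boxes of this class.
    """
    tp, ap = 0, 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            tp += 1
            ap += (tp / rank) / num_gt   # precision at this rank * recall step
    return ap

def mean_average_precision(ap_per_class):
    """Average AP over classes, reported on a 0 to 100 scale."""
    return 100.0 * sum(ap_per_class.values()) / len(ap_per_class)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap_cat = average_precision([True, False, True], num_gt=2)
print(overlap, ap_cat, mean_average_precision({"cat": ap_cat, "dog": 1.0}))
```

With the usual threshold, the pair of boxes in this example (IoU of 1/7) would not count as a true positive.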

R-CNN Results

Wang et al, “Regionlets for Generic Object Detection”, ICCV 2013

- Big improvement compared to pre-CNN methods
- Bounding box regression helps a bit
- Features from a deeper network help a lot

R-CNN Problems

1. Slow at test-time: need to run full forward pass of CNN for each region proposal

2. SVMs and regressors are post-hoc: CNN features not updated in response to SVMs and regressors

3. Complex multistage training pipeline

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Slide credit: Ross Girshick

R-CNN Problem #1:
Slow at test-time due to independent forward passes of the CNN

Solution:
Share computation of convolutional layers between proposals for an image

R-CNN Problem #2:
Post-hoc training: CNN not updated in response to final classifiers and regressors

R-CNN Problem #3:
Complex training pipeline

Solution:
Just train the whole system end-to-end all at once!

Slide credit: Ross Girshick

Fast R-CNN: Region of Interest Pooling

[Figure: Hi-res input image (3 x 800 x 600, with region proposal) -> Convolution and Pooling -> Hi-res conv features (C x H x W, with region proposal) -> RoI conv features (C x h x w, for region proposal) -> Fully-connected layers]

Problem: Fully-connected layers expect low-res conv features: C x h x w

1. Project region proposal onto conv feature map
2. Divide projected region into h x w grid
3. Max-pool within each grid cell, giving RoI conv features of size C x h x w for the region proposal
4. Can back propagate similar to max pooling
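
A minimal sketch of Region of Interest pooling, assuming the conv feature map is a nested list of shape C x H x W and the region is given in image pixels. The helper names, the `stride` argument (the total downsampling factor of the conv layers), and the floor / ceil rounding of grid cell boundaries are assumptions for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)

def project_roi(box, stride):
    """Project an image-space box (x0, y0, x1, y1) onto the conv feature map."""
    x0, y0, x1, y1 = box
    return (x0 // stride, y0 // stride, ceil_div(x1, stride), ceil_div(y1, stride))

def roi_max_pool(features, roi, out_h, out_w):
    """Max-pool one projected RoI of a C x H x W map down to C x out_h x out_w.

    roi = (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for channel in features:
        out = []
        for i in range(out_h):
            # Rows covered by grid row i (never empty when roi_h >= 1).
            r0 = y0 + (i * roi_h) // out_h
            r1 = y0 + ceil_div((i + 1) * roi_h, out_h)
            row = []
            for j in range(out_w):
                c0 = x0 + (j * roi_w) // out_w
                c1 = x0 + ceil_div((j + 1) * roi_w, out_w)
                row.append(max(channel[r][c]
                               for r in range(r0, r1) for c in range(c0, c1)))
            out.append(row)
        pooled.append(out)
    return pooled

# One channel, 4 x 4 feature map with values 0..15.
features = [[[4 * r + c for c in range(4)] for r in range(4)]]
roi = project_roi((0, 0, 64, 64), stride=16)
pooled = roi_max_pool(features, roi, out_h=2, out_w=2)
print(roi, pooled)
```

Whatever the size of the projected region, the output is always C x h x w, so the fully-connected layers see a fixed-size input while the conv features are computed once per image.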

Fast R-CNN Results

                       R-CNN         Fast R-CNN

Training time          84 hours      9.5 hours       Faster!
(Speedup)              1x            8.8x

Test time per image    47 seconds    0.32 seconds    FASTER!
(Speedup)              1x            146x

mAP (VOC 2007)         66.0          66.9            Better!

Using VGG-16 CNN on Pascal VOC 2007 dataset

Fast R-CNN Problem:

Test-time speeds don’t include region proposals

                         R-CNN         Fast R-CNN

Test time per image      47 seconds    0.32 seconds
(Speedup)                1x            146x

Test time per image      50 seconds    2 seconds
with Selective Search
(Speedup)                1x            25x

Fast R-CNN Solution:

Just make the CNN do region proposals too!
Faster R-CNN:

Insert a Region Proposal Network (RPN) after the last convolutional layer

RPN trained to produce region proposals directly; no need for external region proposals!

After RPN, use RoI Pooling and a downstream classifier and bbox regressor just like Fast R-CNN

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

Slide credit: Ross Girshick

Faster R-CNN: Region Proposal Network

Slide a small window on the feature map

Build a small network for:
• classifying object or not-object, and
• regressing bbox locations

Position of the sliding window provides localization information with reference to the image

Box regression provides finer localization information with reference to this sliding window

[Figure: the small network applied at each sliding-window position, built from three 1 x 1 conv layers]

Slide credit: Kaiming He

Faster R-CNN: Region Proposal Network

Use N anchor boxes at each location

Anchors are translation invariant: use the same ones at every location

Regression gives offsets from anchor boxes

Classification gives the probability that each (regressed) anchor shows an object
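
A short sketch of anchor generation, showing what "the same ones at every location" means in practice. The function name `generate_anchors`, the feature-map stride of 16, and the particular scales and aspect ratios (3 x 3 = 9 anchors per location) are assumptions chosen for illustration, not values given on the slide.

```python
import math

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (center_x, center_y, width, height) in image pixels.

    The same len(scales) * len(ratios) shapes are reused at every location.
    ratio = height / width, and each anchor has area scale ** 2.
    """
    shapes = [(s / math.sqrt(r), s * math.sqrt(r)) for s in scales for r in ratios]
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of the feature-map cell, mapped back to image pixels.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            anchors.extend((cx, cy, w, h) for w, h in shapes)
    return anchors

anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(len(anchors))   # N = 9 anchors at each of the 2 * 3 locations
print(anchors[1])
```

The RPN then predicts, for each of these anchors, an objectness probability and four offsets relative to the anchor.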

Faster R-CNN: Training

In the paper: Ugly pipeline
- Use alternating optimization to train RPN, then Fast R-CNN with RPN proposals, etc.
- More complex than it has to be

Since publication: Joint training!

One network, four losses
- RPN classification (anchor good / bad)
- RPN regression (anchor -> proposal)
- Fast R-CNN classification (over classes)
- Fast R-CNN regression (proposal -> box)

Slide credit: Ross Girshick

Faster R-CNN: Results

                       R-CNN         Fast R-CNN    Faster R-CNN

Test time per image    50 seconds    2 seconds     0.2 seconds
(with proposals)

(Speedup)              1x            25x           250x

mAP (VOC 2007)         66.0          66.9          66.9

Object Detection State-of-the-art:
ResNet 101 + Faster R-CNN + some extras

He et al, “Deep Residual Learning for Image Recognition”, arXiv 2015

ImageNet Detection 2013 - 2015

YOLO: You Only Look Once
Detection as Regression

Divide image into S x S grid

Within each grid cell predict:
- B Boxes: 4 coordinates + confidence
- Class scores: C numbers

Regression from image to 7 x 7 x (5 * B + C) tensor

Direct prediction using a CNN

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
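
A small sketch of the output tensor described above. The values B = 2 and C = 20 (giving a 7 x 7 x 30 tensor) are the PASCAL VOC settings reported in the YOLO paper and are assumed here; the ordering of values inside each grid cell's vector and the helper names are also assumptions for illustration.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the regression target: S x S grid cells, 5 * B + C numbers each."""
    return (S, S, 5 * B + C)

def split_cell_prediction(cell, B, C):
    """Split one grid cell's vector into B boxes and C class scores.

    Assumed layout: B groups of (x, y, w, h, confidence), then C class scores.
    """
    assert len(cell) == 5 * B + C
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(B)]
    class_scores = list(cell[5 * B:])
    return boxes, class_scores

shape = yolo_output_shape()
cell = list(range(30))            # placeholder values for one grid cell
boxes, class_scores = split_cell_prediction(cell, B=2, C=20)
print(shape, boxes, len(class_scores))
```

Because the whole tensor comes out of a single forward pass, there is no separate region proposal stage.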

YOLO: You Only Look Once
Detection as Regression

Faster than Faster R-CNN, but not as good

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015

Object Detection code links:

R-CNN
(Caffe + MATLAB): https://github.com/rbgirshick/rcnn
Probably don’t use this; too slow

Fast R-CNN
(Caffe + MATLAB): https://github.com/rbgirshick/fast-rcnn

Faster R-CNN
(Caffe + MATLAB): https://github.com/ShaoqingRen/faster_rcnn
(Caffe + Python): https://github.com/rbgirshick/py-faster-rcnn

YOLO
http://pjreddie.com/darknet/yolo/

Recap
Localization:
- Find a fixed number of objects (one or many)
- L2 regression from CNN features to box coordinates
- Much simpler than detection; consider it for your projects!
- OverFeat: Regression + efficient sliding window with FC -> conv conversion
- Deeper networks do better
Object Detection:
- Find a variable number of objects by classifying image regions
- Before CNNs: dense multiscale sliding window (HoG, DPM)
- Avoid dense sliding window with region proposals
- R-CNN: Selective Search + CNN classification / regression
- Fast R-CNN: Swap order of convolutions and region extraction
- Faster R-CNN: Compute region proposals within the network
- Deeper networks do better
