CS60010: Deep Learning CNN - Part 3: Sudeshna Sarkar
CNN – Part 3
Sudeshna Sarkar
Spring 2019
7 Feb 2019
CNN on Text
CNN in text classification
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Objectives
• We will examine classic CNN architectures with the goal of:
- Gaining intuition for building CNNs
- Reusing CNN architectures
Case Study: LeNet-5 [LeCun et al., 1998]
21 Jan 2015
The ILSVRC-2012 competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task:
- Get the “correct” class in your top 5 bets. There are 1000 classes.
• The localization task:
- For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, …
- Computer vision systems use complicated multi-stage systems.
- The early stages are typically hand-tuned by optimizing a few parameters.
[Krizhevsky et al. 2012]
Case Study: AlexNet
AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and
significantly outperformed the runner-up (top-5 error of 16% compared to
26% for the runner-up).
Facilitated by GPUs, a highly optimized convolution implementation, and large
datasets (ImageNet).
Has 60 million parameters.
ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya
Sutskever, Geoffrey E. Hinton; 2012
AlexNet
Architecture: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8
– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.
• Input: 227x227x3 images (224x224 before padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
(N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
• Number of parameters in this layer?
(11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
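The two calculations for the first AlexNet layer can be checked with a few lines of Python. This is a minimal sketch: the helper names `conv_output_size` and `conv_weight_count` are made up for illustration, and biases are left out of the count, as on the slide.

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1


def conv_weight_count(f, in_channels, num_filters):
    """Number of weights in a conv layer (biases excluded)."""
    return f * f * in_channels * num_filters


out = conv_output_size(227, 11, 4)
print([out, out, 96])                  # output volume of the first layer
print(conv_weight_count(11, 3, 96))    # weights in the first layer
```

With a 224x224 input and no padding the same filter does not tile the image evenly (213 is not divisible by 4), so `conv_output_size(224, 11, 4)` raises an error in this sketch.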
AlexNet
[Figure: AlexNet architecture diagram, ending in two FC layers of 4096 units each and a 1000-way Softmax]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
A neural network for ImageNet
– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.
• The activation functions were:
– Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Error rates on the ILSVRC-2012 competition
[Table: error rates for classification and for classification & localization]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet: ILSVRC 2014 2nd place
• Sequence of deeper networks
trained progressively
• Large receptive fields replaced by
successive layers of 3x3
convolutions (with ReLU in
between)
K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition, ICLR 2015
VGGNet
Layer stack:
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax
• Smaller filters
Only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2
• Deeper network
AlexNet: 8 layers
VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers has the
same effective receptive field as one 7x7
conv layer.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
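The receptive-field claim can be verified with a short calculation. The sketch below is illustrative only: the function names are made up, and the channel count C = 64 is an arbitrary assumption used to compare the weight counts of the two designs (biases ignored, C input and C output channels per layer).

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a stack of conv layers on the input."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    field, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        field += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s                 # stride multiplies the step between neighbours
    return field


def conv_weights(k, channels):
    """Weights of one k x k conv layer with `channels` inputs and outputs."""
    return k * k * channels * channels


C = 64  # assumed channel count, for illustration only
print(stacked_receptive_field([3, 3, 3]))   # three stacked 3x3 layers
print(stacked_receptive_field([7]))         # one 7x7 layer
print(3 * conv_weights(3, C), conv_weights(7, C))
```

Under these assumptions the stack of three 3x3 layers covers the same 7x7 window with 27*C^2 weights instead of 49*C^2, and it has three nonlinearities instead of one.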
Receptive Field
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
GoogLeNet: ILSVRC 2014 winner
• The Inception Module
• Inception Module dramatically reduced the number of
parameters in the network
(4M, compared to AlexNet with 60M).
• Uses Average Pooling instead of Fully Connected layers at
the top of the ConvNet
• Several followup versions to the GoogLeNet, most
recently Inception-v4.
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)
Inception module
Case Study: GoogLeNet [Szegedy et al., 2014]
Inception module
GoogLeNet
Auxiliary classifier
Fun features:
Compared to AlexNet:
- 12x fewer parameters
- 2x more compute
- 6.67% top-5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing importance of
auxiliary classifiers
• More variants of inception modules with aggressive factorization of
filters
• Increase the number of feature maps while decreasing spatial resolution
(pooling)
C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and exploding
gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image
Recognition, CVPR 2016 (Best Paper)
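The residual computation H(x) = F(x) + x can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: small dense matrices `w1` and `w2` stand in for the two conv layers, and the function names are made up.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F(x) = w2 * relu(w1 * x) standing in for conv-relu-conv."""
    fx = matvec(w2, relu(matvec(w1, x)))
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(residual_block(x, zeros, zeros))        # F(x) = 0, so the block passes x through
print(residual_block(x, identity, identity))  # F(x) = relu(x) is added to x
```

When the weights of F are zero the block reduces to the identity mapping, which is one way to see why extra residual layers need not make a deeper network harder to fit.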
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
Case Study: ResNet [He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error
plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used
ResNet
Deeper residual module (bottleneck)
• Directly performing 3x3 convolutions with 256 feature maps at input and output:
256 x 256 x 3 x 3 ~ 600K operations
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
256 x 64 x 1 x 1 ~ 16K
64 x 64 x 3 x 3 ~ 36K
64 x 256 x 1 x 1 ~ 16K
Total: ~70K
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning
for Image Recognition, CVPR 2016 (Best Paper)
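The bottleneck arithmetic above can be reproduced directly. A minimal sketch: `conv_cost` is a made-up helper that counts in_channels x out_channels x k x k, the same quantity the slide calls operations (per output position, biases ignored).

```python
def conv_cost(c_in, c_out, k):
    """Multiply-accumulates per output position of a k x k convolution."""
    return c_in * c_out * k * k


direct = conv_cost(256, 256, 3)
bottleneck = conv_cost(256, 64, 1) + conv_cost(64, 64, 3) + conv_cost(64, 256, 1)

print(direct)                 # plain 3x3 convolution on 256 maps
print(bottleneck)             # 1x1 reduce, 3x3, 1x1 expand
print(direct / bottleneck)    # saving factor of the bottleneck design
```

The exact counts round to the slide's figures of about 600K and about 70K, a saving of roughly 8.5x.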
Case Study: ResNet [He et al., 2015]
Accuracy comparison
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Architectures for ImageNet:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v2, v3
• Regularize training with batch normalization, reducing
importance of auxiliary classifiers
• More variants of inception modules with aggressive
factorization of filters
• Increase the number of feature maps while decreasing
spatial resolution (pooling)
C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
Inception v4
https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks
Human Pose Estimation [10]
Super Resolution [11]
CNN on Text
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1:
1 -1 -1
-1 1 -1
-1 -1 1
With the image flattened into inputs 1, 2, 3, …, the first output (value 3) is connected only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15.
Fewer parameters! Only connect to 9 inputs.
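The 6 x 6 image and Filter 1 above can be run through a small convolution to see where the output value 3 comes from. This is a minimal sketch: the function names are made up, stride 1 with no padding is assumed, and the "convolution" is the unflipped sliding dot product used in CNNs.

```python
IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
FILTER_1 = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]


def convolve_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]


def connected_inputs(row, col, k, image_width):
    """1-based indices of the flattened image that feed output (row, col)."""
    return [(row + i) * image_width + (col + j) + 1
            for i in range(k) for j in range(k)]


feature_map = convolve_valid(IMAGE, FILTER_1)
print(feature_map[0][0])                # first output value
print(connected_inputs(0, 0, 3, 6))     # the 9 inputs it depends on
```

Each output depends on only 9 of the 36 inputs, and all outputs share the same 9 filter weights, which is where the parameter saving comes from.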
“You need a lot of data if you want to train/use CNNs”
Transfer Learning
Transfer Learning with CNNs
1. Train on ImageNet.
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier, i.e. swap the Softmax layer at the end.
3. If you have medium sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers. Retrain a bigger portion of the network, or even all of it.
DeepMind’s AlphaGo
policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
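The shapes listed for the policy network can be traced layer by layer. A minimal sketch, assuming CONV2..12 means eleven identical layers; the helper name `conv_out` and the tuple layout are made up for illustration.

```python
def conv_out(n, f, stride, pad):
    """Spatial output size: (N - F + 2P) / S + 1."""
    return (n - f + 2 * pad) // stride + 1


# (number of filters, filter size, stride, pad) for each conv layer
layers = [(192, 5, 1, 2)] + [(192, 3, 1, 1)] * 11 + [(1, 1, 1, 0)]

size, depth = 19, 48          # [19x19x48] input
shapes = []
for filters, f, stride, pad in layers:
    size = conv_out(size, f, stride, pad)
    depth = filters
    shapes.append((size, size, depth))

print(shapes[0])    # after CONV1
print(shapes[-1])   # final map, one value per board position
```

The padding is chosen so that every layer keeps the 19x19 board resolution, which is why no pooling appears in this network.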
Summary
• ConvNets stack CONV,ReLU,POOL,FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
• but recent advances such as ResNet/GoogLeNet use only Conv-ReLU, 1x1 convolutions and Softmax
Weight Initialization
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
Activation Statistics
With the small random initialization (*0.01), all activations become zero!
Q: Think about the backward pass. What do the gradients look like?
With *1.0 instead of *0.01, almost all neurons are completely saturated at either -1 or 1. Gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]
but when using the ReLU nonlinearity it breaks.
He et al., 2015
(note additional /2)
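The activation statistics discussed above can be reproduced with a small experiment. This is a sketch under stated assumptions, not the original experiment: a 6-layer dense network of width 50 on 20 random inputs (all arbitrary choices), pure Python instead of a numerical library, Xavier taken as standard deviation 1/sqrt(fan_in), and He as sqrt(2/fan_in). The function and variable names are made up.

```python
import math
import random


def final_layer_std(weight_std, activation, layers=6, width=50, samples=20, seed=0):
    """Push random inputs through a dense stack; return the std of the last activations."""
    rng = random.Random(seed)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(samples)]
    std = weight_std(width)  # fan_in equals width for every layer here
    for _ in range(layers):
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        batch = [[activation(sum(w * v for w, v in zip(row, x))) for row in weights]
                 for x in batch]
    values = [v for x in batch for v in x]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def relu(z):
    return max(0.0, z)


def small(fan_in):
    return 0.01                      # "*0.01" initialization


def large(fan_in):
    return 1.0                       # "*1.0" initialization


def xavier(fan_in):
    return 1.0 / math.sqrt(fan_in)   # Xavier initialization


def he(fan_in):
    return math.sqrt(2.0 / fan_in)   # He initialization (additional /2)


results = {
    "tanh, *0.01": final_layer_std(small, math.tanh),
    "tanh, *1.0": final_layer_std(large, math.tanh),
    "tanh, Xavier": final_layer_std(xavier, math.tanh),
    "ReLU, Xavier": final_layer_std(xavier, relu),
    "ReLU, He": final_layer_std(he, relu),
}
for name, value in results.items():
    print(f"{name:14s} std of last layer = {value:.3g}")
```

In this setup the *0.01 run collapses towards zero, the *1.0 run saturates near -1 and 1, Xavier keeps tanh activations in a usable range, and with ReLU the He scaling keeps activations larger than Xavier does.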
Proper initialization is an active area of research…
- Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
Localization and Detection
Computer Vision Tasks
Classification | Classification + Localization | Object Detection | Instance Segmentation
Classification + Localization: Task
Classification: C classes
Input: Image
Output: Class label CAT
Evaluation metric: Accuracy
Localization:
Input: Image
Output: Box in the image (x, y, w, h)
Evaluation metric: Intersection over Union
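Intersection over Union for two boxes in the (x, y, w, h) format can be computed as below. A minimal sketch: it assumes (x, y) is the top-left corner of the box, which the slide does not specify, and the function name `iou` is made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes, (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes
print(iou((0, 0, 2, 2), (1, 1, 2, 2)))   # partial overlap
print(iou((0, 0, 1, 1), (5, 5, 1, 1)))   # disjoint boxes
```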
Classification + Localization: ImageNet
Idea #1: Localization as Regression
Input: image
Correct output: box coordinates (4 numbers)
Loss: L2 distance
Only one object, simpler than detection
Simple Recipe for Classification + Localization
[Figure: Image -> Convolution and Pooling -> Final conv feature map -> Fully-connected layers -> Class scores -> Softmax loss]
Simple Recipe for Classification + Localization
[Figure: the final conv feature map feeds two sets of fully-connected layers: the “Classification head” and the “Regression head”, which outputs box coordinates]
Simple Recipe for Classification + Localization
Step 3: Train the regression head only with SGD and L2 loss
[Figure: Image -> Final conv feature map -> Fully-connected layers (regression head) -> Box coordinates -> L2 loss]
Per-class vs class agnostic regression
Class agnostic: 4 numbers (one box)
Class specific: C x 4 numbers (one box per class)
Where to attach the regression head?
[Figure: Image -> Convolution and Pooling -> Final conv feature map -> Fully-connected layers -> Class scores -> Softmax loss]
Aside: Localizing multiple objects
[Figure: Image -> Final conv feature map -> Fully-connected layers -> Box coordinates]
K x 4 numbers (one box per object)
Aside: Human Pose Estimation
Idea #2: Sliding Window
Sliding Window: Overfeat
Winner of ILSVRC 2013 localization challenge
[Figure: Image (3 x 221 x 221) -> Convolution + pooling -> Feature map (1024 x 5 x 5). Classification head: FC 4096 -> FC 4096 -> Class scores (1000) -> Softmax loss. Regression head: FC 4096 -> FC 1024 -> Boxes (1000 x 4) -> Euclidean loss]
Sermanet et al, “Integrated Recognition, Localization and
Detection using Convolutional Networks”, ICLR 2014
Sliding Window: Overfeat
Network input: 3 x 221 x 221
Larger image: 3 x 257 x 257
Classification scores P(cat), one per window position on the larger image:
upper left: 0.5, upper right: 0.75
lower left: 0.6, lower right: 0.8
Classification score: P(cat) = 0.8
Sliding Window: Overfeat
Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
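The four window positions in the 257 x 257 example can be enumerated as follows. A minimal sketch under stated assumptions: the stride of 36 pixels is chosen here simply because 257 - 221 = 36 gives exactly two positions per axis, the assignment of scores to corners follows the slide layout as read from the figure, and picking the best-scoring window is only one possible way to arrive at the final 0.8 (the paper describes its own merging procedure).

```python
def window_offsets(image_size, window_size, stride):
    """Top-left offsets at which a window fits inside the image."""
    return list(range(0, image_size - window_size + 1, stride))


offsets = window_offsets(257, 221, 36)
windows = [(top, left) for top in offsets for left in offsets]

# P(cat) per window, keyed by (top, left); values taken from the slide
slide_scores = {(0, 0): 0.5, (0, 36): 0.75, (36, 0): 0.6, (36, 36): 0.8}

best_window = max(windows, key=lambda w: slide_scores[w])
print(windows)
print(best_window, slide_scores[best_window])
```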
Efficient Sliding Window: Overfeat
[Figure: Image (3 x 221 x 221) -> Convolution + pooling -> Feature map (1024 x 5 x 5) -> FC 4096 -> FC 4096 for class scores, and FC 4096 -> FC 1024 for boxes (1000 x 4)]
Efficient Sliding Window: Overfeat
[Figure: the FC layers of both heads are replaced by convolutions: Convolution + pooling -> 5x5 conv -> 1 x 1 conv -> 1 x 1 conv]
Efficient Sliding Window: Overfeat
Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
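The FC-to-conv conversion rests on the fact that a fully-connected layer applied to every 5x5 crop of a feature map gives the same numbers as one 5x5 convolution over the whole map. The sketch below checks this on a single-channel toy example; the 6x6 map, the weight values, and the function names are all made up for illustration.

```python
def fc_score(crop, fc_weights):
    """Fully-connected unit: flatten the crop and take a dot product."""
    flat_inputs = [v for row in crop for v in row]
    return sum(w * v for w, v in zip(fc_weights, flat_inputs))


def conv_scores(feature_map, fc_weights, k):
    """The same weights reshaped to a k x k kernel and slid over a square map."""
    kernel = [fc_weights[i * k:(i + 1) * k] for i in range(k)]
    size = len(feature_map) - k + 1
    return [[sum(kernel[i][j] * feature_map[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(size)] for r in range(size)]


K = 5
feature_map = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(6)]
fc_weights = [(i % 4) - 1 for i in range(K * K)]

# Naive sliding window: run the FC unit separately on each 5x5 crop.
per_window = [[fc_score([row[c:c + K] for row in feature_map[r:r + K]], fc_weights)
               for c in range(2)] for r in range(2)]
# Efficient version: one convolution over the larger map.
as_conv = conv_scores(feature_map, fc_weights, K)

print(per_window == as_conv)
```

The convolutional form computes all window scores in one pass and shares the work on overlapping regions, which is the source of the efficiency.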
ImageNet Classification + Localization
Computer Vision Tasks
Classification | Classification + Localization | Object Detection | Instance Segmentation
Detection as Regression?
DOG, (x, y, w, h); CAT, (x, y, w, h); CAT, (x, y, w, h); DUCK, (x, y, w, h) = 16 numbers
DOG, (x, y, w, h); CAT, (x, y, w, h) = 8 numbers
CAT, (x, y, w, h); CAT, (x, y, w, h); …; CAT, (x, y, w, h) = many numbers
The number of outputs varies from image to image.
Detection as Classification
Classify each image region in turn:
CAT? NO, DOG? NO
CAT? YES!, DOG? NO
CAT? NO, DOG? NO
Histogram of Oriented Gradients
Dalal and Triggs, “Histograms of Oriented Gradients for Human Detection”, CVPR 2005
Slide credit: Ross Girshick
Deformable Parts Model (DPM)
Aside: Deformable Parts Models are CNNs?
Girshick et al, “Deformable Part Models are Convolutional Neural Networks”, CVPR 2015
Detection as Classification
Region Proposals
Region Proposals: Selective Search
Convert regions to boxes
Region Proposals: Many other choices
Hosang et al, “What makes for effective detection proposals?”, PAMI 2015
Putting it together: R-CNN
R-CNN Training
[Figure: Image -> Convolution and Pooling -> Final conv feature map -> Fully-connected layers -> Class scores (1000 classes) -> Softmax loss]
R-CNN Training
[Figure: Image -> Final conv feature map -> Class scores (21 classes) -> Softmax loss]
R-CNN Training
[Figure: Convolution and Pooling -> pool5 features]
R-CNN Training
Step 4: Train one binary SVM per class to classify region features
[Figure: positive and negative samples for the cat SVM, and positive and negative samples for the dog SVM]
R-CNN Training
Step 5 (bbox regression): For each class, train a linear regression model to map from
cached features to offsets to GT boxes to make up for “slightly wrong” proposals
Object Detection: Datasets
                                 PASCAL VOC (2010)   ImageNet Detection (ILSVRC 2014)   MS-COCO (2014)
Number of classes                20                  200                                80
Number of images (train + val)   ~20k                ~470k                              ~120k
Mean objects per image           2.4                 1.1                                7.2
Object Detection: Evaluation
Compute average precision (AP) separately for each class, then average over classes (mean average precision, mAP).
A detection is a true positive if it has IoU (Intersection over Union) with a ground-truth box
greater than some threshold (usually 0.5) (mAP@0.5).
Combine all detections from all test images to draw a precision / recall curve for each class; AP is
the area under the curve.
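The evaluation procedure can be sketched for detections that have already been marked as true or false positives by the IoU test. This is an illustration under stated assumptions: AP is computed as the plain area under the precision / recall curve (benchmarks such as PASCAL VOC use interpolated variants), and the function name and toy detections are made up.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence score, is_true_positive) for one class."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    previous_recall = 0.0
    area = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        area += precision * (recall - previous_recall)  # one strip under the curve
        previous_recall = recall
    return area


# Toy detections pooled over all test images, one list per class.
cat_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
dog_ap = average_precision([(0.6, True)], num_ground_truth=1)
mean_ap = (cat_ap + dog_ap) / 2

print(cat_ap, dog_ap, mean_ap)
```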
R-CNN Results
R-CNN Problems
Fast R-CNN
R-CNN Problem #1:
Slow at test-time due to
independent forward passes of the
CNN
Solution:
Share computation of
convolutional layers between
proposals for an image
R-CNN Problem #2:
Post-hoc training: CNN not
updated in response to final
classifiers and regressors
Solution:
Just train the whole system end-to-end all at once!
Fast R-CNN: Region of Interest Pooling
[Figure: Image -> Convolution and Pooling -> region of interest on the conv feature map, divided into a grid -> Fully-connected layers]
Max-pool within each grid cell
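Max-pooling within each grid cell of a region of interest can be sketched as below. This is an illustration only: it works on a single-channel feature map, takes the region as (top, left, height, width) in feature-map cells, and splits it into cells with integer division, which is one simple choice among several. The function name and the toy feature map are made up.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Pool a region of interest to a fixed out_h x out_w grid."""
    top, left, height, width = roi
    if height < out_h or width < out_w:
        raise ValueError("region is smaller than the output grid")
    pooled = []
    for i in range(out_h):
        r0 = top + (i * height) // out_h
        r1 = top + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ((j + 1) * width) // out_w
            # max over all feature-map cells that fall in this grid cell
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]  # 4 x 6 toy map

print(roi_max_pool(feature_map, (0, 1, 4, 4), 2, 2))   # region that divides evenly
print(roi_max_pool(feature_map, (0, 0, 3, 5), 2, 2))   # region that does not
```

Regions of different sizes all come out as the same fixed-size grid, so they can be fed to the same fully-connected layers.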
Fast R-CNN Results
                        R-CNN         Fast R-CNN
Training time:          84 hours      9.5 hours      Faster!
(Speedup)               1x            8.8x
Test time per image:    47 seconds    0.32 seconds   FASTER!
(Speedup)               1x            146x
mAP (VOC 2007):         66.0          66.9           Better!
Using VGG-16 CNN on Pascal VOC 2007 dataset
Fast R-CNN Problem:
Fast R-CNN Problem Solution:
Faster R-CNN: Region Proposal Network
Faster R-CNN: Training
Faster R-CNN: Results
Object Detection State-of-the-art:
ResNet 101 + Faster R-CNN + some extras
He et al., “Deep Residual Learning for Image Recognition”, arXiv 2015
ImageNet Detection 2013 - 2015
YOLO: You Only Look Once
Detection as Regression
Object Detection code links:
R-CNN
(Caffe + MATLAB): https://github.com/rbgirshick/rcnn
Probably don’t use this; too slow
Fast R-CNN
(Caffe + MATLAB): https://github.com/rbgirshick/fast-rcnn
Faster R-CNN
(Caffe + MATLAB): https://github.com/ShaoqingRen/faster_rcnn
(Caffe + Python): https://github.com/rbgirshick/py-faster-rcnn
YOLO
http://pjreddie.com/darknet/yolo/
Recap
Localization:
- Find a fixed number of objects (one or many)
- L2 regression from CNN features to box coordinates
- Much simpler than detection; consider it for your projects!
- Overfeat: Regression + efficient sliding window with FC -> conv conversion
- Deeper networks do better
Object Detection:
- Find a variable number of objects by classifying image regions
- Before CNNs: dense multiscale sliding window (HoG, DPM)
- Avoid dense sliding window with region proposals
- R-CNN: Selective Search + CNN classification / regression
- Fast R-CNN: Swap order of convolutions and region extraction
- Faster R-CNN: Compute region proposals within the network
- Deeper networks do better