Lecture 6 CNN - Detection
boris. [email protected]
1
Agenda
ILSVRC 2014
Overfeat: integrated classification, localization, and detection
– Classification with Localization
– Detection.
2
ILSVRC-2014
http://www.image-net.org/challenges/LSVRC/2014/
3
ILSVRC-2014
4
Detection: Examples
5
Detection: PASCAL VOC
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
20 classes:
6
Detection: ILSVRC 2014
http://image-net.org/challenges/LSVRC/2014/
7
Detection paradigms
1. Overfeat
2. Regions with CNN
3. SPP + CNN
4. CNN + Regression
8
OVERFEAT
9
Overfeat: Integrated
classification, localization & detection
http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
Training a convolutional network to simultaneously classify, locate
and detect objects. 3 ideas:
1. apply a ConvNet at multiple locations in the image, in a sliding
window fashion, and over multiple scales.
2. train the system to produce
   1. a distribution over categories for each window,
   2. a prediction of the location and size of the bounding box containing the
      object, relative to the viewing window.
3. accumulate the evidence for each category at each location
and size.
10
Overfeat: “accurate” net topology
input 3x221x221
1. convo: 7×7 stride 2×2; ReLU; maxpool: 3×3 stride 3×3; output: 96x36x36
2. convo: 7×7 stride 1×1; ReLU; maxpool: 2×2 stride 2×2; output: 256x15x15
3. convo: 3×3 stride 1×1 0-padded; ReLU; output: 512x15x15
4. convo: 3×3 stride 1×1 0-padded; ReLU; output: 512x15x15
5. convo: 3×3 stride 1×1 0-padded; ReLU; output: 1024x15x15
6. convo: 3×3 stride 1×1 0-padded; ReLU; maxpool: 3×3 stride 3×3;
output: 1024x5x5
7. convo: 5×5 stride 1×1; ReLU; output: 4096x1x1
8. full; ReLU; output: 4096x1x1
9. full; output: 1000x1x1
10. softmax; output: 1000x1x1
Feature extraction: 3 x [221x221] → 1024 x [5x5], with a total
down-sampling of (2x3x2x3):1 = 36:1
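The spatial sizes in the layer list above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the reference implementation: the helper names (`out_size`, `trace`, `total_stride`) are made up for illustration, and "0-padded" is assumed to mean a padding of 1 pixel for the 3×3 convolutions, since that keeps the listed 15x15 size unchanged.

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv or pooling op on an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# (kernel, stride, pad) for every conv / maxpool in layers 1-7 of the
# "accurate" net. pad=1 for the 3x3 convs is an assumption (see lead-in).
ACCURATE_OPS = [
    (7, 2, 0), (3, 3, 0),   # layer 1: conv 7x7 stride 2, maxpool 3x3 stride 3
    (7, 1, 0), (2, 2, 0),   # layer 2: conv 7x7 stride 1, maxpool 2x2 stride 2
    (3, 1, 1),              # layer 3
    (3, 1, 1),              # layer 4
    (3, 1, 1),              # layer 5
    (3, 1, 1), (3, 3, 0),   # layer 6: conv 3x3, maxpool 3x3 stride 3
    (5, 1, 0),              # layer 7: conv 5x5 stride 1
]

def trace(n, ops=ACCURATE_OPS):
    """Return the spatial size after each op, starting from an n x n input."""
    sizes = [n]
    for kernel, stride, pad in ops:
        n = out_size(n, kernel, stride, pad)
        sizes.append(n)
    return sizes

def total_stride(ops=ACCURATE_OPS):
    """Product of all strides = overall down-sampling factor."""
    total = 1
    for _, stride, _ in ops:
        total *= stride
    return total

print(trace(221))      # sizes after each op for a single 221x221 window
print(total_stride())  # overall down-sampling factor
print(trace(257)[-1])  # an input 36 pixels larger yields one more output position
```

Under these assumptions a 221x221 input reaches 5x5 before layer 7 and 1x1 after it, and enlarging the input by 36 pixels adds exactly one output position per dimension.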
11
Overfeat: topology summary
Layers 1-5 are similar to Alexnet: conv. layer with ReLU, and max
pooling, but with the following differences:
1. no contrast normalization
2. pooling regions are non-overlapping
3. Smaller stride to improve accuracy
12
Overfeat: classification
Take an image and apply a sliding window [231x231]. For each window we
take the best score. The feature extractor has sub-sampling 36:1, so if we slide
the window with step 36, the output features slide with step 1.
[Figure labels: 231x231 input window; 5x5 feature maps]
14
Overfeat: classification
15
Overfeat: classification
Feature Extraction:
We compute the first 5 layers for the whole image. The first 5 layers, before
pooling, correspond to 12:1 “subsampling”.
Classifier:
The classifier has a fixed-size 5x5 input and is exhaustively applied to the
layer 5 maps. We shift the classifier’s viewing window by 1 pixel
through the pooling layers, without subsampling.
In the end we have [MxN] x C scores, where M, N are the sliding
window indices and C is the number of classes.
Quiz: How to choose 5 best options?
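One possible answer to the quiz, sketched under assumptions: for each class take the best score over all M x N window positions, then keep the 5 classes with the highest best score. The function name `top_k_classes` and the nested-list layout `scores[m][n][c]` are illustrative choices, not part of the Overfeat code.

```python
def top_k_classes(scores, k=5):
    """Pick the k best classes from an [M x N] x C score grid.

    scores[m][n][c] is the score of class c at sliding-window position (m, n).
    Returns a list of (class_index, best_score) pairs, best first.
    """
    num_classes = len(scores[0][0])
    # Best score of every class over all spatial positions.
    best = [max(cell[c] for row in scores for cell in row)
            for c in range(num_classes)]
    # Rank classes by their best score, highest first.
    ranked = sorted(range(num_classes), key=lambda c: best[c], reverse=True)
    return [(c, best[c]) for c in ranked[:k]]

# Toy grid: M=1, N=2 window positions, C=3 classes.
toy = [[[0.1, 0.7, 0.2],
        [0.3, 0.1, 0.6]]]
print(top_k_classes(toy, k=2))
```

Other aggregation rules (for example averaging over positions or scales) would also be reasonable; the maximum is used here because the slides speak of taking the best score per window.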
17
Overfeat: boosting
18
Overfeat: “fast” net topology
Input 3x231x231
1. convo: 11×11 stride 4×4; ReLU; maxpool: 2×2 stride 2×2; output: 96x24x24
2. convo: 5×5 stride 1×1; ReLU; maxpool: 2×2 stride 2×2; output: 256x12x12
3. convo: 3×3 stride 1×1 0-padded; ReLU; output: 512x12x12
4. convo: 3×3 stride 1×1 0-padded; ReLU; output: 1024x12x12
5. convo: 3×3 stride 1×1 0-padded; ReLU; maxpool: 2×2 stride 2×2; output: 1024x6x6
6. convo: 6×6 stride 1×1; ReLU; output: 3072x1x1
7. full; ReLU; output : 4096x1x1
8. full; output: 1000x1x1
9. softmax; output: 1000x1x1
19
Overfeat : training details
1. Data augmentation:
– Each image is down-sampled so that the smallest dimension is 256 pixels.
We then extract 5 random crops (and their horizontal flips) of size
221x221 pixels
2. Weight initialization
– randomly with (µ, σ) = (0, 1 × 10^-2).
3. Training:
– SGD with learning rate = 5 × 10^-2, decreased by a factor of 0.5 after (30, 50, 60,
70, 80) epochs,
– momentum = 0.6,
– ℓ2 weight decay = 1 × 10^-5;
– Dropout in FC layers.
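The step schedule from the training details can be written as a small function. This is a sketch only: the name `learning_rate` is hypothetical, and it assumes that "after epoch 30" means the halved rate applies from epoch index 30 onward (0-based).

```python
def learning_rate(epoch, base_lr=5e-2,
                  milestones=(30, 50, 60, 70, 80), factor=0.5):
    """Step schedule: multiply base_lr by `factor` once per milestone passed.

    Assumption: a milestone m takes effect for every epoch index >= m.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

for e in (0, 29, 30, 50, 85):
    print(e, learning_rate(e))
```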
20
Overfeat: localization
21
Overfeat: localization
3. Bounding boxes are merged & accumulated
a) Assign to Cs the set of classes in the top-5 for each scale s ∈ 1 . . . 6, by
taking the maximum detection class outputs across spatial locations for
that scale.
b) Assign to Bs the set of bounding boxes predicted by the regressor
network for each class in Cs, across all spatial locations at scale s.
c) Assign B ← ∪s Bs
d) Repeat merging until done:
a. (b1*, b2*) = argmin over b1 ≠ b2 ∈ B of match_score(b1, b2)
b. If match_score(b1*, b2*) > t, then stop;
c. Otherwise, set B ← B \ {b1*, b2*} ∪ box_merge(b1*, b2*)
Here match_score is the sum of the distance between the centers of the two
bounding boxes and the intersection area of the boxes.
box_merge computes the average of the bounding boxes’ coordinates.
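The greedy merging loop in step d) can be sketched as below. This is an illustration under stated assumptions, not the Overfeat implementation: boxes are taken to be `(x1, y1, x2, y2)` tuples, and the default score `center_distance` is a simplified stand-in that uses only the distance between box centers (the score described above also involves the intersection area, whose exact form is not reproduced here). Any other score can be passed in through the `match_score` argument.

```python
import math
from itertools import combinations

def center_distance(b1, b2):
    """Stand-in match score (assumption): distance between box centers."""
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    cx2, cy2 = (b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2
    return math.hypot(cx1 - cx2, cy1 - cy2)

def box_merge(b1, b2):
    """Average the coordinates of two boxes."""
    return tuple((a + b) / 2 for a, b in zip(b1, b2))

def merge_boxes(boxes, t, match_score=center_distance):
    """Repeatedly merge the best-matching pair until its score exceeds t."""
    boxes = list(boxes)
    while len(boxes) > 1:
        # Pair of distinct boxes with the smallest match score.
        i, j = min(combinations(range(len(boxes)), 2),
                   key=lambda p: match_score(boxes[p[0]], boxes[p[1]]))
        if match_score(boxes[i], boxes[j]) > t:
            break
        merged = box_merge(boxes[i], boxes[j])
        # B <- B \ {b1, b2} U {merged}
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)]
        boxes.append(merged)
    return boxes

candidates = [(0, 0, 10, 10), (2, 2, 12, 12), (100, 100, 110, 110)]
print(merge_boxes(candidates, t=5))
```

In the toy call the two nearby boxes collapse into their average, while the distant box is left alone because its match score with the merged box exceeds `t`.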
22
Overfeat: localization pipeline
23
Overfeat: localization pipeline
24
Overfeat: localization pipeline
25
Single-class Regression vs
Per-Class Regression
Using a different top layer for each class in the regressor network
(Per-Class Regressor, PCR) surprisingly did not outperform using only a single
network shared among all classes (44.1% vs. 31.3% error).
26
Overfeat: Detection
The detection task differs from localization in that there can be any
number of objects in each image (including zero), and that false
positives are penalized by the mean average precision (mAP)
measure.
The main difference from the localization task is the need to
predict a background class when no object is present. Traditionally,
negative examples are initially taken at random for training. Then
the most offending negative errors are added to the training set in
bootstrapping passes.
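The traditional bootstrapping scheme described above can be sketched as a selection step: score every negative window with the current model and add the highest-scoring ones (the worst false positives) to the training set. The names `mine_hard_negatives` and `score_fn` are hypothetical, and the toy scores stand in for a real detector's outputs.

```python
def mine_hard_negatives(negatives, score_fn, k):
    """Return the k negative examples the current model scores highest.

    negatives: candidate background windows (any objects).
    score_fn:  hypothetical callable giving the model's object score
               for a window; a higher score means a worse false positive.
    """
    return sorted(negatives, key=score_fn, reverse=True)[:k]

# Toy example: window ids with made-up object scores from a current model.
toy_scores = {"w1": 0.05, "w2": 0.90, "w3": 0.40, "w4": 0.75}
training_negatives = ["w1"]  # initial negatives, taken at random
hard = mine_hard_negatives(toy_scores.keys(), toy_scores.get, k=2)
training_negatives.extend(h for h in hard if h not in training_negatives)
print(training_negatives)
```

In a full bootstrapping pass the model would be retrained on the enlarged set and the mining step repeated.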
27
REGIONS WITH CNN
28
R-CNN: Regions with CNN features
29
R-CNN: architecture
30
R-CNN Training
31
R-CNN: PASCAL VOC performance
32
R-CNN: PASCAL VOC performance
33
R-CNN: ILSVRC 2013 performance
34
R-CNN speed and
35
R-CNN CODE
https://github.com/rbgirshick/rcnn
Requires Matlab!
36
CNN WITH
SPATIAL PYRAMID POOLING
37
SPP-net = CNN + SPP
38
http://research.microsoft.com/en-us/um/people/kahe/
CNN topology
[Figure: layer stack from Data Layer through Inner Product, ReLU and Inner Product to Soft Max, with FORWARD and BACKWARD passes]
39
Spatial Pyramid Pooling
40
SPP-net training
Size augmentation:
– Imagenet: 224x224 → 180x180
– Horizontal flipping
– Color altering
Dropout on the last 2 FC layers
Learning rate:
– Init lr = 0.01; divide by 10 when the error plateaus
41
SPP-net: Imagenet classification
42
SPP: Imagenet - Detection
43
Exercises & Projects
Exercise:
– Implement Overfeat network; train classifier.
Projects:
– Install R-CNN
– Re-implement R-CNN in pure Python/C++ to eliminate Matlab
dependency
44
BACKUP
CNN - REGRESSION
45
CNN regression
46
CNN regression
Multi-scale
47
CNN regression
Issues:
1. Overlapping masks for multiple touching objects
2. Localization accuracy
3. Recognition of small objects
Issue 1:
– To deal with multiple touching objects, we generate not one but several
masks, each representing either the full object or part of it.
– We use one network to predict the object box mask and four additional
networks to predict four halves of the box: bottom, top, left and right
halves.
48