CS7015 (Deep Learning) : Lecture 12: Object Detection: R-CNN, Fast R-CNN, Faster R-CNN, You Only Look Once (YOLO)
Mitesh M. Khapra
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 12
Acknowledgements
Some images borrowed from Ross Girshick’s original slides on RCNN, Fast RCNN, etc.
Some ideas borrowed from the presentation of Kaustav Kundu∗
∗ Deep Object Detection
Module 12.1 : Introduction to object detection
So far we have looked at Image Classification
We will now move on to another Image Processing Task - Object Detection
[Figure: the two tasks compared: image classification vs. object detection]
[Figure: a typical detection pipeline: region proposals, feature extraction (features x1, x2, ..., xd), classifier (person, flag, ball, none); proposed box of width w and height h vs. true box of width w∗ and height h∗]
Module 12.2 : RCNN model for object detection
[Figure: RCNN pipeline: Input, Region Proposals, Feature Extraction, then Classifier and Bounding Box Regression]
Feature extraction: the CNN is fine tuned using ground truth (cropped) object images
Classifier: linear models (SVMs) are used for classification (1 model per class)
Bounding box regression
The proposed regions may not be perfect
We want to learn four regression models which will learn to predict x∗, y∗, w∗, h∗
We will see their respective objective functions
Proposed box: (x, y), width w, height h
True box: (x∗, y∗), width w∗, height h∗
z : features from pool5 layer of the network
Objective for the model predicting x∗ (w1 is the weight vector; the w in the denominator is the width of the proposed box):
$$\min_{w_1} \sum_{i=1}^{N} \left( \frac{x^* - x}{w} - w_1^T z \right)^2$$
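A minimal sketch of one of the four bounding box regressors, assuming the regression target for x∗ is the offset (x∗ − x) normalised by the proposed box width w, and assuming an ordinary least-squares fit with no regularisation. The function names, the feature dimension and the synthetic data are hypothetical and only stand in for real pool5 features.

```python
import numpy as np

def fit_x_regressor(z, x, w, x_star):
    """Least-squares fit of w1 so that w1^T z approximates (x* - x) / w."""
    target = (x_star - x) / w                      # normalised offset to the true box
    w1, *_ = np.linalg.lstsq(z, target, rcond=None)
    return w1

def predict_x(z, x, w, w1):
    """Move the proposed x by the predicted normalised offset."""
    return x + w * (z @ w1)

# Synthetic, noise-free data (hypothetical sizes: 200 proposals, 8-d features).
rng = np.random.default_rng(0)
N, d = 200, 8
z = rng.normal(size=(N, d))                        # stand-in for pool5 features
x = rng.uniform(50, 150, size=N)                   # proposed x coordinates
w = rng.uniform(20, 60, size=N)                    # proposed widths
true_w1 = 0.1 * rng.normal(size=d)
x_star = x + w * (z @ true_w1)                     # synthetic "true" x coordinates

w1 = fit_x_regressor(z, x, w, x_star)
x_hat = predict_x(z, x, w, w1)
```

Because the synthetic targets are generated without noise, the fitted weights should match `true_w1` up to numerical precision; with real features the fit would only be approximate.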
[Figure: RCNN pipeline: Input, Region Proposals, Feature Extraction, Classifier and Bounding Box Regression (weights W_regression)]
What is the computational cost for processing one image at test time?
Inference Time = Proposal Time + # Proposals × Convolution Time + # Proposals × Classification Time + # Proposals × Regression Time
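The cost formula can be written as a small function to see how the per-proposal terms dominate. This is only an illustrative sketch: the timings below are hypothetical placeholders, not measurements, and `rcnn_inference_time` is a name introduced here.

```python
def rcnn_inference_time(proposal_time, num_proposals, conv_time, cls_time, reg_time):
    """Per-image test-time cost: proposals are generated once,
    every other stage runs once per proposal."""
    per_proposal = conv_time + cls_time + reg_time
    return proposal_time + num_proposals * per_proposal

# Hypothetical timings in seconds, chosen only to illustrate the scaling.
total = rcnn_inference_time(proposal_time=2.0, num_proposals=2000,
                            conv_time=0.01, cls_time=0.001, reg_time=0.001)
print(f"{total:.1f} s per image")
```

With a few thousand proposals, even a small per-proposal convolution time accounts for most of the total.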
On average, selective search gives 2K region proposals
Each of these passes through the CNN for feature extraction
Followed by classification and regression
No joint learning
Use ad hoc training objectives
Fine tune network with softmax classifier (log loss)
Train post-hoc linear SVMs (hinge loss)
Train post-hoc bounding-box regressors (squared loss)
Training (≈ 3 days) and testing (47s per image) are slow¹
Takes a lot of disk space
Source: Ross Girshick
¹ Using VGG-Net
[Figure: summary of the pipeline stages: region proposals, feature extraction, classifier]
Module 12.3 : Fast RCNN model for object detection
Suppose we apply a 3 × 3 kernel on an image
What is the region of influence of each pixel in the resulting output?
Each pixel contributes to a 5 × 5 region
Suppose we again apply a 3 × 3 kernel on this output
What is the region of influence of the original pixel from the input? (a 7 × 7 region)
[Figure: VGGNet-16: Input (224 × 224), Conv and maxpool blocks (64, 128, 256, 512, 512 channels; spatial size 224, 112, 56, 28, 14, 7), fc 4096, fc 4096, softmax over 1000 classes]
Using this idea we could get a bounding box’s region of influence on any layer in the CNN
The projected Region of Interest (RoI) may be of different sizes
Divide each RoI into an H × W grid of k = H × W equally sized regions and do max pooling in each of those regions to construct a k dimensional vector
Connect the k dimensional vector to a fully connected layer
This max pooling operation is called RoI pooling
Source: Ross Girshick
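A minimal NumPy sketch of RoI pooling on a single feature map, assuming the RoI is given in feature-map coordinates as (top, left, bottom, right) with exclusive bottom/right edges, and that it spans at least H rows and W columns. The function name `roi_max_pool` and the toy feature map are hypothetical; in a real network the same pooling would be applied to every feature map.

```python
import numpy as np

def roi_max_pool(feature_map, roi, H, W):
    """Max-pool one RoI of a 2D feature map into an H x W grid,
    returned as a k = H * W dimensional vector."""
    top, left, bottom, right = roi
    region = feature_map[top:bottom, left:right]
    # Split the rows into H chunks and the columns into W chunks of nearly equal size.
    row_edges = np.linspace(0, region.shape[0], H + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], W + 1).astype(int)
    out = np.empty((H, W), dtype=feature_map.dtype)
    for i in range(H):
        for j in range(W):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()
    return out.reshape(-1)

fmap = np.arange(64).reshape(8, 8)                  # toy 8 x 8 feature map
small = roi_max_pool(fmap, (0, 0, 4, 4), H=2, W=2)  # 4 x 4 RoI
large = roi_max_pool(fmap, (2, 1, 8, 7), H=2, W=2)  # 6 x 6 RoI
```

Both RoIs, although of different sizes, are reduced to vectors of the same length k = 4, which is what allows them to feed the same fully connected layer.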
Once we have the FC layer it gives us the representation of this region proposal
We can then add a softmax layer on top of it to compute a probability distribution over the possible object classes
Similarly we can add a regression layer on top of it to predict the new bounding box (w∗, h∗, x∗, y∗)
Source: Ross Girshick
Recall that the last pooling layer of VGGNet-16 results in an output of size 512 × 7 × 7
We replace the last max pooling layer by a RoI pooling layer
We set H = W = 7 and divide each of these RoIs into (k = 49) regions
We do this for every feature map resulting in an output of size 512 × 49
This output is of the same size as the output of the original max pooling layer
[Figure: Fast RCNN: Input, Conv, Max-pool, RoI]
Module 12.4 : Faster RCNN model for object detection
So far the region proposals were being made using the Selective Search algorithm
Idea: Can we use a CNN for making region proposals also?
How? Well it’s slightly tricky
We will illustrate this using VGGNet
[Figure: Faster RCNN: image, conv layers, feature maps, Region Proposal Network, proposals, RoI pooling, classifier]
Consider the output of the last convolutional layer of VGGNet
Now consider one cell in one of the 512 feature maps
If we apply a 3 × 3 kernel around this cell then we will get a 1D representation for this cell
If we repeat this for all the 512 feature maps then we will get a 512 dimensional representation (x1, x2, ..., x512) for this position
We use this process to get a 512 dimensional representation for each of the w × h positions
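A small NumPy sketch of the per-position representation as described here, with one 3 × 3 kernel applied to each feature map so that every position gets one value per map. This follows the simplified description above; the region proposal network in the Faster R-CNN paper uses a 3 × 3 convolution that mixes all input channels. The name `position_features`, the zero padding at the borders and the small demo sizes (4 maps instead of 512) are assumptions made for illustration.

```python
import numpy as np

def position_features(feature_maps, kernels):
    """feature_maps: (C, h, w); kernels: (C, 3, 3), one 3 x 3 kernel per map.
    Returns an (h, w, C) array: a C dimensional vector for every position."""
    C, h, w = feature_maps.shape
    # Zero padding so that border cells also have a 3 x 3 neighbourhood.
    padded = np.pad(feature_maps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w, C))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]           # shifted copy, (C, h, w)
            weight = kernels[:, dy, dx][:, None, None]         # (C, 1, 1)
            out += (window * weight).transpose(1, 2, 0)
    return out

rng = np.random.default_rng(0)
C, h, w = 4, 5, 6                          # hypothetical sizes; VGGNet would have C = 512
fmaps = rng.normal(size=(C, h, w))
kernels = rng.normal(size=(C, 3, 3))
feats = position_features(fmaps, kernels)  # feats[y, x] is the vector for position (y, x)
```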
Consider a ground truth object and its corresponding bounding box
Consider the projection of this image onto the conv5 layer
Consider one such cell in the output
This cell corresponds to a patch in the original image
Consider the center of this patch
We consider anchor boxes of different sizes
For each of these anchor boxes, we would want the classifier to predict 1 if this anchor box has a reasonable overlap (IoU > 0.7) with the ground truth bounding box
[Figure: Input, Conv, Max-pool, per-position features x1, x2, ..., followed by Classification and Regression outputs]
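The anchor labelling rule can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) corners and using square anchors for simplicity. The helper names, the anchor sizes and the example ground truth box are hypothetical, and anchors that do not pass the threshold are simply labelled 0 here, which is a simplification.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchors_at(cx, cy, sizes):
    """Square anchor boxes of the given side lengths centred at (cx, cy)."""
    return [(cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2) for s in sizes]

def label_anchors(anchors, true_box, threshold=0.7):
    """Label 1 for anchors whose IoU with the true box exceeds the threshold."""
    return [1 if iou(a, true_box) > threshold else 0 for a in anchors]

true_box = (40, 40, 140, 140)                       # hypothetical ground truth box
anchors = anchors_at(cx=90, cy=90, sizes=(32, 96, 192))
labels = label_anchors(anchors, true_box)
```

In this example only the middle anchor overlaps the true box closely enough; the smallest anchor covers too little of it and the largest covers too much background.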
The full network is trained using the following objective.
$$L(p_i, t_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{\lambda}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
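A sketch of how this objective could be evaluated, under assumptions taken from the Faster R-CNN paper rather than from these slides: p_i is the predicted probability that anchor i contains an object, p_i∗ is its 0/1 label, t_i and t_i∗ are the predicted and target box parameters (4 values each), L_cls is the log loss and L_reg is the smooth L1 loss. The function names and the toy inputs are hypothetical, and N_cls and N_reg default to the number of anchors.

```python
import numpy as np

def smooth_l1(diff):
    """Smooth L1 loss applied elementwise."""
    a = np.abs(diff)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """p, p_star: shape (A,); t, t_star: shape (A, 4) for A anchors."""
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg
    eps = 1e-12                                    # avoids log(0)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The p_star factor switches the regression term off for negative anchors.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.9, 0.2])                           # predicted objectness
p_star = np.array([1.0, 0.0])                      # one positive, one negative anchor
t_star = np.zeros((2, 4))
base = rpn_loss(p, p_star, t_star.copy(), t_star)  # perfect box predictions

t_bad_negative = t_star.copy()
t_bad_negative[1] += 5.0                           # box error on the negative anchor
same = rpn_loss(p, p_star, t_bad_negative, t_star)
```

The multiplication by p_i∗ means that a box error on an anchor labelled 0 leaves the loss unchanged, so `same` equals `base`.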
So far we have seen a CNN based approach for region proposals instead of using selective search
We can now take these region proposals and then add fast RCNN on top of it to predict the class of the object
And regress the proposed bounding box
[Figure: Input, Conv, Max-pool, features x1, x2, ..., x512, Classification and Regression, Region Proposals, Fast RCNN]
But the fast RCNN would again use a VGG Net
Can’t we use a single VGG Net and share the parameters of RPN and RCNN?
Yes, we can
In practice, we use a 4 step alternating training process
Faster RCNN: Training
1. Fine-tune RPN using a pre-trained ImageNet network
2. Fine-tune fast RCNN from a pre-trained ImageNet network using bounding boxes from step 1
3. Keeping common convolutional layer parameters fixed from step 2, fine-tune RPN (post conv5 layers)
4. Keeping common convolutional layer parameters fixed from step 3, fine-tune fc layers of fast RCNN
Faster RCNN and RPN are the basis of several 1st place entries in the ILSVRC and COCO tracks on:
Imagenet detection
COCO Segmentation
Imagenet localization
COCO detection
[Table: region proposals, feature extraction and classifier stages for Fast RCNN and Faster RCNN]
Object Detection Performance
The approaches that we have seen so far are two stage approaches
They involve a region proposal stage and then a classification stage
Can we have an end-to-end architecture which does both proposal and classification simultaneously?
This is the idea behind YOLO (You Only Look Once).
[Figure: two stage pipeline: image, conv layers, feature maps, Region Proposal Network, proposals, RoI pooling, classifier]
Divide an image into an S × S grid of cells (S = 7)
For each such cell we are interested in predicting 5 + k quantities
Probability (confidence) that this cell is indeed contained in a true bounding box
Width of the bounding box
Height of the bounding box
Center (x, y) of the bounding box
Probability of the object in the bounding box belonging to each of the k classes (k values)
The output layer thus contains S × S × (5 + k) elements
[Figure: S × S grid on input; per-cell vector c, w, h, x, y followed by class probabilities such as P(cow), P(truck), P(dog); bounding boxes + confidence; class probability map; final detections]
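A small sketch of the output layout, assuming each cell stores its values in the order c, w, h, x, y followed by the k class probabilities. The choice k = 20 (as in Pascal VOC) is a hypothetical example, and a random array stands in for the network output.

```python
import numpy as np

S, k = 7, 20                                    # k = 20 is an assumed number of classes
rng = np.random.default_rng(0)
output = rng.random((S, S, 5 + k))              # stand-in for the network's output layer

def split_cell(cell):
    """Split one cell's 5 + k values into confidence, box and class probabilities."""
    c, w, h, x, y = cell[:5]
    class_probs = cell[5:]
    return c, (x, y, w, h), class_probs

confidence, box, class_probs = split_cell(output[3, 4])
num_elements = output.size                      # S * S * (5 + k)
```

For S = 7 and k = 20 this gives 7 × 7 × 25 = 1225 output values.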
How do we interpret this S × S × (5 + k) dimensional output?
For each cell, we are computing a bounding box, its confidence and the object in it
We then retain the most confident bounding boxes and the corresponding object label
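One way to read this step is as a confidence threshold followed by picking the most probable class per retained cell. The sketch below assumes the per-cell layout [c, w, h, x, y, class scores...] and a hypothetical threshold of 0.5; the function name is introduced here, and the non-maximum suppression that YOLO applies to overlapping boxes is left out.

```python
import numpy as np

def most_confident_detections(output, threshold=0.5):
    """output: (S, S, 5 + k). Returns (confidence, (x, y, w, h), label) tuples
    for cells whose confidence reaches the threshold, most confident first."""
    detections = []
    S = output.shape[0]
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            c = float(cell[0])
            if c >= threshold:
                w, h, x, y = cell[1:5]
                label = int(np.argmax(cell[5:]))    # index of the most probable class
                detections.append((c, (x, y, w, h), label))
    return sorted(detections, key=lambda d: d[0], reverse=True)

# Toy 2 x 2 grid with k = 3 classes (hypothetical values).
toy = np.zeros((2, 2, 5 + 3))
toy[0, 1] = [0.9, 0.3, 0.4, 0.5, 0.5, 0.1, 0.7, 0.2]
toy[1, 0] = [0.6, 0.2, 0.2, 0.5, 0.5, 0.8, 0.1, 0.1]
toy[1, 1] = [0.2, 0.1, 0.1, 0.5, 0.5, 0.3, 0.3, 0.4]
detections = most_confident_detections(toy)
```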
[Figure: input image; S × S grid on input; bounding boxes & confidence; final detections; squared-error term (0 − ĉ)^2]
Method        Pascal 2007 mAP   Speed (FPS)   Time per image
DPM v5        33.7              0.07          14 sec
RCNN          66.0              0.05          20 sec
Fast RCNN     70.0              0.5           2 sec
Faster RCNN   73.2              7             140 msec
YOLO          69.0              45            22 msec