CS7015 (Deep Learning) : Lecture 12

Object Detection: R-CNN, Fast R-CNN, Faster R-CNN, You Only Look Once
(YOLO)

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 12
Acknowledgements
Some images borrowed from Ross Girshick’s original slides on RCNN, Fast RCNN, etc.
Some ideas borrowed from Kaustav Kundu’s presentation, Deep Object Detection

Module 12.1 : Introduction to object detection

So far we have looked at Image Classification
We will now move on to another Image Processing Task - Object Detection

Task      Image classification    Object detection
Output    Car                     Car, exact bounding box containing car

[Figure: a typical object detection pipeline. Region proposals -> feature extraction (x1, x2, ..., xd) -> classifier (person, flag, ball, none)]

Let us see a typical pipeline for object detection
It starts with a region proposal stage where we identify potential regions which may contain objects
We could think of these regions as mini-images

[Figure: the pipeline with bounding box regression. Region proposals -> feature extraction (x1, x2, ..., xd) -> bounding box regression from each proposed (w, h) to a corrected (w∗, h∗)]

In addition we would also like to correct the proposed bounding boxes
This is posed as a regression problem (for example, we would like to predict w∗, h∗ from the proposed w and h)

Let us see how these three components (region proposals, feature extraction, classifier) have evolved over time
Pre 2012:
Propose all possible regions in the image of varying sizes (almost brute force)
Use handcrafted features (SIFT, HOG)
Train a linear classifier using these features
We will now see three algorithms (RCNN, Fast RCNN, Faster RCNN) that progressively improve these components

Module 12.2 : RCNN model for object detection

[Figure: RCNN pipeline. Input -> region proposals -> feature extraction -> classifier and bounding box regression]

Selective Search for region proposals
Does hierarchical clustering at different scales
For example, the figures from left to right show clusters of increasing sizes
Such a hierarchical clustering is important as we may find different objects at different scales


Proposed regions are cropped to form mini images
Each mini image is scaled to match the CNN’s (feature extractor) input size


For feature extraction any CNN trained for image classification can be used (AlexNet, VGGNet, etc.)
Outputs from the fc7 layer are taken as features
The CNN is fine-tuned using ground truth (cropped) object images


Linear models (SVMs) are used for classification (1 model per class)

[Figure: proposed box with center (x, y), width w and height h; true box with center (x∗, y∗), width w∗ and height h∗]
The proposed regions may not be perfect
We want to learn four regression models which will learn to predict x∗, y∗, w∗, h∗
We will see their respective objective functions
z : features from pool5 layer of the network
$$\min \sum_{i=1}^{N} \left( \frac{x^* - x}{w} - w_1^T z \right)^2$$
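A minimal sketch of fitting one of the four regressors, here the one for x∗, by minimising the squared error between the normalised offset (x∗ − x)/w and w1ᵀz. Plain gradient descent, the function name `fit_regressor`, and the toy boxes with two-dimensional features are assumptions made for illustration; the lecture does not specify an optimiser, and in RCNN z would be the pool5 features.

```python
def fit_regressor(features, targets, lr=0.1, epochs=2000):
    """Minimise the mean of (target_i - w1 . z_i)^2 by gradient descent."""
    dim = len(features[0])
    n = len(features)
    w1 = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for z, t in zip(features, targets):
            err = sum(wj * zj for wj, zj in zip(w1, z)) - t
            for j in range(dim):
                grad[j] += 2.0 * err * z[j] / n
        w1 = [wj - lr * gj for wj, gj in zip(w1, grad)]
    return w1


# Hypothetical toy data: (proposed x, proposed width w, true x*, features z).
boxes = [
    (10.0, 20.0, 14.0, [1.0, 0.0]),
    (50.0, 40.0, 46.0, [0.0, 1.0]),
    (30.0, 10.0, 31.0, [1.0, 1.0]),
]
features = [z for _, _, _, z in boxes]
targets = [(x_star - x) / w for x, w, x_star, _ in boxes]  # (x* - x) / w

w1 = fit_regressor(features, targets)

# Correct a proposal: undo the normalisation to recover the predicted x*.
x, w, x_star, z = boxes[0]
x_pred = x + w * sum(wj * zj for wj, zj in zip(w1, z))
print(f"proposed x = {x}, predicted x* = {x_pred:.3f}, true x* = {x_star}")
```

The regressors for y∗, w∗ and h∗ would be fitted the same way with their own targets and weight vectors.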

[Figure: RCNN pipeline annotated with its parameters: W_CONV (feature extraction), W_classifier (classifier), W_regression (bounding box regression)]

What are the parameters of this model?
W_CONV is taken as it is from a CNN trained for image classification (say on ImageNet)
W_CONV is then fine-tuned using ground truth (cropped) object images
W_classifier is learned using ground truth (cropped) object images
W_regression is learned using ground truth bounding boxes


What is the computational cost for processing one image at test time?
Inference Time = Proposal Time + # Proposals × Convolution Time + # Proposals × Classification Time + # Proposals × Regression Time
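As a rough illustration of this formula, the sketch below plugs in made-up timings. Every number and the function name `rcnn_inference_time` are hypothetical and chosen only to show how the three per-proposal terms scale with the number of proposals; they are not measurements from the lecture.

```python
def rcnn_inference_time(proposal_time, num_proposals,
                        conv_time, cls_time, reg_time):
    """Total seconds per image; the last three costs are paid once per proposal."""
    per_proposal = conv_time + cls_time + reg_time
    return proposal_time + num_proposals * per_proposal


# Hypothetical timings in seconds, chosen only to show the shape of the cost.
total = rcnn_inference_time(proposal_time=2.0, num_proposals=2000,
                            conv_time=0.02, cls_time=0.001, reg_time=0.001)
share = (total - 2.0) / total
print(f"total: {total:.1f} s, per-proposal share: {share:.0%}")
```

With thousands of proposals, the terms multiplied by the number of proposals dominate the total, which is the cost the later models attack.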

On average, selective search gives 2K region proposals
Each of these passes through the CNN for feature extraction
Followed by classification and regression

Source: Ross Girshick

No joint learning
Use ad hoc training objectives:
Fine tune network with softmax classifier (log loss)
Train post-hoc linear SVMs (hinge loss)
Train post-hoc bounding-box regressors (squared loss)
Training (≈ 3 days) and testing (47s per image) is slow (using VGG-Net)
Takes a lot of disk space

Source: Ross Girshick

RCNN:
Region Proposals: Selective Search
Feature Extraction: CNNs
Classifier: Linear

Module 12.3 : Fast RCNN model for object detection

Suppose we apply a 3 × 3 kernel on an image
What is the region of influence of each pixel in the resulting output?
Each pixel contributes to a 3 × 3 region
Suppose we again apply a 3 × 3 kernel on this output?
What is the region of influence of the original pixel from the input? (a 5 × 5 region)
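A small sketch of how the region of influence grows, assuming stride 1 and no dilation (the function name `region_of_influence` is illustrative, not from the lecture). Under that assumption each additional 3 × 3 kernel widens the region by 2 pixels in each dimension.

```python
def region_of_influence(num_layers: int, kernel_size: int = 3) -> int:
    """Side length of the square region that one pixel influences after
    `num_layers` stride-1 convolutions with a square kernel."""
    side = 1
    for _ in range(num_layers):
        side += kernel_size - 1  # each layer widens the region by k - 1
    return side


for n in (1, 2, 3):
    side = region_of_influence(n)
    print(f"{n} layer(s) of 3x3 kernels: {side} x {side}")
```

Pooling layers and strides larger than 1 change the arithmetic, so this is only the simplest case of the idea used to project a bounding box onto a deeper layer.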

[Figure: VGGNet-16. Input 224 × 224 -> five Conv blocks with 64, 128, 256, 512 and 512 channels, each followed by maxpool (spatial size 224 -> 112 -> 56 -> 28 -> 14 -> 7) -> fc 4096 -> fc 4096 -> softmax over 1000 classes]
Using this idea we could get a bounding box’s region of influence on any layer in the CNN
The projected Region of Interest (RoI) may be of different sizes
Divide each RoI into an H × W grid of k = H × W equally sized regions and do max pooling in each of those regions to construct a k dimensional vector
Connect the k dimensional vector to a fully connected layer
This max pooling operation is called RoI pooling
Source: Ross Girshick

Once we have the FC layer it gives us the representation of this region proposal
We can then add a softmax layer on top of it to compute a probability distribution over the possible object classes
Similarly we can add a regression layer on top of it to predict the new bounding box (w∗, h∗, x∗, y∗)
Source: Ross Girshick

Recall that the last pooling layer of VGGNet-16 results in an output of size 512 × 7 × 7
We replace the last max pooling layer by a RoI pooling layer
We set H = W = 7 and divide each of these RoIs into (k = 49) regions
We do this for every feature map resulting in an output of size 512 × 49
This output is of the same size as the output of the original max pooling layer
It is thus compatible with the dimensions of the weight matrix connecting the original pooling layer to the first fully connected layer
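The RoI pooling step can be sketched for a single feature map as below. This is a simplified illustration in plain Python: the function name `roi_max_pool`, the (top, left, bottom, right) RoI convention and the integer bin boundaries are assumptions, not the exact scheme of any particular implementation. The point is that RoIs of different sizes yield outputs of the same fixed size.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (top, left, bottom, right; bottom/right exclusive)
    of a 2D feature map into an out_h x out_w grid."""
    top, left, bottom, right = roi
    roi_h, roi_w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin; every bin covers at least one row.
        r0 = top + (i * roi_h) // out_h
        r1 = max(r0 + 1, top + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * roi_w) // out_w
            c1 = max(c0 + 1, left + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 6 x 8 feature map whose entries encode their own position.
fmap = [[r * 8 + c for c in range(8)] for r in range(6)]

small = roi_max_pool(fmap, (0, 0, 4, 6), out_h=2, out_w=2)   # 4 x 6 RoI
large = roi_max_pool(fmap, (1, 2, 6, 8), out_h=2, out_w=2)   # 5 x 6 RoI

# Flatten to the k = out_h * out_w dimensional vector fed to the FC layer.
flat = [v for row in small for v in row]
print(small, large, len(flat))
```

With H = W = 7 the same routine would give k = 49 values per feature map, and repeating it over the 512 feature maps gives the 512 × 49 output described above.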

Fast RCNN:
Region Proposals: Selective Search
Feature Extraction: CNN
Classifier: CNN

Module 12.4 : Faster RCNN model for object detection

[Figure: Faster RCNN. image -> conv layers -> feature maps -> Region Proposal Network -> proposals -> RoI pooling -> classifier]
So far the region proposals were being made using the Selective Search algorithm
Idea: Can we use a CNN for making region proposals also?
How? Well it’s slightly tricky
We will illustrate this using VGGNet

Consider the output of the last convolutional layer of VGGNet
Now consider one cell in one of the 512 feature maps
If we apply a 3 × 3 kernel around this cell then we will get a 1D representation for this cell
If we repeat this for all the 512 feature maps then we will get a 512 dimensional representation (x1, x2, ..., x512) for this position
We use this process to get a 512 dimensional representation for each of the w × h positions

We now consider k bounding boxes (called anchor boxes) of different sizes & aspect ratios
We are interested in the following two questions:
Given the 512d representation of a position, what is the probability that a given anchor box centered at this position contains an object? (Classification)
How do you predict the true bounding box from this anchor box? (Regression)

We train a classification model and a regression model to address these two questions
How do we get the ground truth data?
What is the objective function used for training?

Consider a ground truth object and its corresponding bounding box
Consider the projection of this image onto the conv5 layer
Consider one such cell in the output
This cell corresponds to a patch in the original image
Consider the center of this patch
We consider anchor boxes of different sizes
For each of these anchor boxes, we would want the classifier to predict 1 if this anchor box has a reasonable overlap (IoU > 0.7) with the true bounding box
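The labelling rule can be sketched as follows. The box format (x1, y1, x2, y2), the helper names `iou` and `anchors_at`, and the particular sizes and aspect ratios are assumptions for illustration. Labelling every anchor below the threshold as 0 is also a simplification made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def anchors_at(cx, cy, sizes, ratios):
    """Anchor boxes centred at (cx, cy), one per (size, aspect ratio) pair."""
    boxes = []
    for s in sizes:
        for r in ratios:  # r = width / height; the area stays s * s
            w, h = s * r ** 0.5, s / r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


true_box = (40.0, 40.0, 140.0, 140.0)  # hypothetical ground truth box
anchors = anchors_at(90.0, 90.0, sizes=[64, 100, 200], ratios=[0.5, 1.0, 2.0])

# Target for the classifier: 1 if the anchor overlaps the true box enough.
labels = [1 if iou(a, true_box) > 0.7 else 0 for a in anchors]
for a, lab in zip(anchors, labels):
    print(f"IoU = {iou(a, true_box):.2f} -> label {lab}")
```

Here only the square anchor of size 100 matches the true box closely enough to be labelled 1; the other eight anchors at this position become negatives.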

The full network is trained using the following objective.
$$L(p_i, t_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{\lambda}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
$p_i^* = 1$ if anchor box contains ground truth object, $= 0$ otherwise
$p_i$ = predicted probability of anchor box containing an object
$N_{cls}$ = batch-size
$N_{reg}$ = batch-size × k
k = number of anchor boxes
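A minimal sketch of evaluating this objective for a handful of anchors. The slide does not spell out L_cls and L_reg; the sketch assumes binary log loss for L_cls and smooth L1 for L_reg, as in the Faster RCNN paper, and treats t_i and t_i∗ as four box offsets per anchor. The function names and the toy numbers are hypothetical.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (assumed form of L_reg)."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """p: predicted object probabilities, p_star: 0/1 labels,
    t / t_star: predicted and target box offsets (4 numbers per anchor)."""
    cls = sum(-(ps * math.log(pi) + (1 - ps) * math.log(1 - pi))
              for pi, ps in zip(p, p_star))
    # The factor p_star switches the regression term off for negative anchors.
    reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
              for ps, ti, tsi in zip(p_star, t, t_star))
    return cls / n_cls + lam * reg / n_reg


p = [0.9, 0.2]                      # one positive anchor, one negative anchor
p_star = [1, 0]
t = [[0.1, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]
t_star = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

loss = rpn_loss(p, p_star, t, t_star, n_cls=2, n_reg=2)
print(f"loss = {loss:.5f}")
```

The large offsets of the second anchor do not affect the loss because its label p∗ is 0, which is the role of the p∗ factor in the second term.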

[Figure: RPN (Input -> Conv -> Max-pool -> 512-d representation x1, x2, ..., x512 -> classification and regression -> region proposals) with Fast RCNN on top]
So far we have seen a CNN based approach for region proposals instead of using selective search
We can now take these region proposals and then add Fast RCNN on top of it to predict the class of the object
And regress the proposed bounding box

But the Fast RCNN would again use a VGG Net
Can’t we use a single VGG Net and share the parameters of RPN and RCNN?
Yes, we can
In practice, we use a 4 step alternating training process
Faster RCNN: Training
1. Fine-tune RPN using a pre-trained ImageNet network
2. Fine-tune Fast RCNN from a pre-trained ImageNet network using bounding boxes from step 1
3. Keeping common convolutional layer parameters fixed from step 2, fine-tune RPN (post conv5 layers)
4. Keeping common convolutional layer parameters fixed from step 3, fine-tune fc layers of Fast RCNN

Faster RCNN and RPN are the basis of several 1st place entries in the ILSVRC and COCO tracks on:
ImageNet detection
COCO segmentation
ImageNet localization
COCO detection


Faster RCNN:
Region Proposals: CNN
Feature Extraction: CNN
Classifier: CNN

Object Detection Performance

Source: Ross Girshick


Module 12.5 : YOLO model for object detection

The approaches that we have seen so far are two stage approaches
They involve a region proposal stage and then a classification stage
Can we have an end-to-end architecture which does both proposal and classification simultaneously?
This is the idea behind YOLO (You Only Look Once)

[Figure: YOLO. S × S grid on input -> bounding boxes + confidence and class probability map -> final detections. Each cell predicts (c, w, h, x, y, P(cow), ..., P(dog), ..., P(truck))]
Divide an image into an S × S grid (S=7)
For each such cell we are interested in predicting 5 + k quantities
Probability (confidence) that this cell is indeed contained in a true bounding box
Width of the bounding box
Height of the bounding box
Center (x,y) of the bounding box
Probability of the object in the bounding box belonging to each of the k classes (k values)
The output layer thus contains S × S × (5 + k) elements
How do we interpret this S × S × (5 + k) dimensional output?
For each cell, we are computing a bounding box, its confidence and the object in it
We then retain the most confident bounding boxes and the corresponding object label
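One way to read the S × S × (5 + k) output is sketched below. The per-cell layout [c, w, h, x, y, p_1, ..., p_k] follows the order in which the quantities are listed above; the function name `decode`, the confidence threshold and the toy values are assumptions for illustration, and a simple threshold stands in for "retain the most confident bounding boxes".

```python
def decode(output, class_names, conf_threshold=0.5):
    """output[row][col] = [c, w, h, x, y, p_1, ..., p_k] for each grid cell."""
    detections = []
    for row in output:
        for cell in row:
            c, w, h, x, y = cell[:5]
            class_probs = cell[5:]
            if c >= conf_threshold:
                # Label the box with the most probable class.
                best = max(range(len(class_probs)),
                           key=class_probs.__getitem__)
                detections.append((c, (x, y, w, h), class_names[best]))
    # Most confident boxes first.
    return sorted(detections, key=lambda d: d[0], reverse=True)


S, k = 2, 3
class_names = ["cow", "dog", "truck"]
# Start with an all-zero S x S x (5 + k) output, then fill two cells by hand.
output = [[[0.0] * (5 + k) for _ in range(S)] for _ in range(S)]
output[0][1] = [0.9, 0.4, 0.3, 0.7, 0.2, 0.1, 0.8, 0.1]
output[1][0] = [0.6, 0.5, 0.5, 0.3, 0.8, 0.7, 0.2, 0.1]

detections = decode(output, class_names)
for conf, box, label in detections:
    print(f"{label}: confidence {conf}, box (x, y, w, h) = {box}")
```

The two all-zero cells have confidence 0 and are dropped, so only two labelled boxes survive out of the S × S = 4 cells.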
How do we train this network?
Consider a cell such that the center of the true bounding box lies in it
The network is initialized randomly and it will predict some values for c, w, h, x, y & ℓ
We can then compute the following losses
$$(x - \hat{x})^2$$
$$(y - \hat{y})^2$$
$$(\sqrt{w} - \sqrt{\hat{w}})^2$$
$$(\sqrt{h} - \sqrt{\hat{h}})^2$$
$$(1 - \hat{c})^2$$
$$\sum_{i=1}^{k} (\ell_i - \hat{\ell}_i)^2$$
And train the network to minimize the sum of these losses
Now consider a grid cell which does not contain any object
For this grid cell we do not care about the predictions w, h, x, y & ℓ
But we want the confidence to be low
So we minimize only the following loss
$$(0 - \hat{c})^2$$
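The two cases can be combined into one per-cell loss, sketched below as an unweighted sum of the terms listed on the slides. The function name `cell_loss`, the tuple layout and the toy numbers are assumptions for illustration, and the predicted width and height are assumed non-negative so that the square roots are defined.

```python
import math


def cell_loss(pred, truth=None):
    """pred  = (c_hat, w_hat, h_hat, x_hat, y_hat, class_probs_hat).
    truth = (w, h, x, y, class_one_hot) if the centre of a true box lies
    in this cell, or None if the cell contains no object."""
    c_hat, w_hat, h_hat, x_hat, y_hat, l_hat = pred
    if truth is None:
        # Only push the confidence towards 0; the other outputs are ignored.
        return (0.0 - c_hat) ** 2
    w, h, x, y, l = truth
    return ((x - x_hat) ** 2
            + (y - y_hat) ** 2
            + (math.sqrt(w) - math.sqrt(w_hat)) ** 2
            + (math.sqrt(h) - math.sqrt(h_hat)) ** 2
            + (1.0 - c_hat) ** 2
            + sum((li - lhi) ** 2 for li, lhi in zip(l, l_hat)))


# Hypothetical prediction whose box is exactly right but whose confidence
# and class probabilities are not.
pred = (0.8, 0.25, 0.36, 0.4, 0.6, [0.1, 0.7, 0.2])
truth = (0.25, 0.36, 0.4, 0.6, [0.0, 1.0, 0.0])

loss_object = cell_loss(pred, truth)   # cell containing a box centre
loss_empty = cell_loss(pred)           # same prediction on an empty cell
print(f"object cell: {loss_object:.2f}, empty cell: {loss_empty:.2f}")
```

Summing `cell_loss` over all S × S cells would give the quantity the network is trained to minimize under these assumptions.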
Method         Pascal 2007 mAP    Speed       Time per image
DPM v5         33.7               0.07 FPS    14 sec
RCNN           66.0               0.05 FPS    20 sec
Fast RCNN      70.0               0.5 FPS     2 sec
Faster RCNN    73.2               7 FPS       140 msec
YOLO           69.0               45 FPS      22 msec
