2. Object Detection: Two-Stage

Two-stage object detectors first generate object proposals from an image and then classify each proposal and refine its bounding box. A convolutional neural network classifies proposals with a classification head and refines bounding boxes with a regression head. Historically, the classification head was trained before the regression head; at test time, both heads are used to classify objects and refine bounding box coordinates.


Two-stage object detectors


Types of object detectors
• One-stage detectors: Image → Feature extraction → two parallel heads: Classification (class scores, e.g. cat, dog, person) and Localization (bounding box (x, y, w, h)).
• Two-stage detectors: Image → Feature extraction → Extraction of object proposals → per proposal: Classification (class scores) and Localization (refinement of the bounding box, (Δx, Δy, Δw, Δh)).
Localization
• Bounding box regression: Image → Convolutional Neural Network (feature extraction) → Output: box coordinates (x, y, w, h), trained with an L2 loss against the ground-truth box coordinates.
Localization and classification
• Bounding box regression: Image → Convolutional Neural Network → fully connected layers split into two heads (see the sketch below):
– Regression head: outputs the box coordinates (x, y, w, h), trained with an L2 loss.
– Classification head: outputs the class scores, trained with a softmax loss.
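A minimal PyTorch sketch of such a two-head network; the backbone, feature dimension and class count below are illustrative assumptions, not the exact architecture from the lecture.

```python
import torch
import torch.nn as nn

class LocalizeAndClassify(nn.Module):
    """Shared CNN backbone with a box-regression head and a classification head."""
    def __init__(self, num_classes=20, feat_dim=4096):
        super().__init__()
        # Illustrative backbone; any CNN that ends in a flat feature vector works.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7), nn.Flatten(),
            nn.Linear(128 * 7 * 7, feat_dim), nn.ReLU(),
        )
        self.box_head = nn.Linear(feat_dim, 4)            # (x, y, w, h)
        self.cls_head = nn.Linear(feat_dim, num_classes)  # class scores

    def forward(self, images):
        feats = self.backbone(images)
        return self.box_head(feats), self.cls_head(feats)

# Joint training objective: L2 on the boxes + softmax (cross-entropy) on the classes.
model = LocalizeAndClassify()
images = torch.randn(2, 3, 224, 224)
gt_boxes, gt_labels = torch.rand(2, 4), torch.randint(0, 20, (2,))
pred_boxes, pred_scores = model(images)
loss = nn.functional.mse_loss(pred_boxes, gt_boxes) \
     + nn.functional.cross_entropy(pred_scores, gt_labels)
```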
Localization and classification
• It was typical to train the classification head first and then freeze those layers.
• The regression head was trained afterwards.
• At test time, we use both!

Sermanet et al., "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014


Overfeat
• Sliding window + box regression + classification.
• Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024) → Boxes (1000 x 4) and Class scores (1000).

Sermanet et al., "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
Overfeat
• The fixed-size network is slid over a larger image (468 x 356 x 3), producing box and class predictions at every window position.
Overfeat
• We end up with many predictions on the 468 x 356 x 3 image and have to combine them into a final detection; Overfeat uses a greedy merging method (a sketch of the related non-maximum suppression strategy follows below).
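Overfeat uses its own greedy box-merging scheme; as an illustration of what such a combination step does, here is a minimal sketch of the more common greedy strategy, non-maximum suppression (NMS), which likewise collapses many overlapping predictions into a few final detections (not Overfeat's exact algorithm).

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) tensor as (x1, y1, x2, y2); scores: (N,) tensor.
    Returns the indices of the boxes that are kept."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the highest-scoring box with all remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop every remaining box that overlaps the chosen box too much.
        order = rest[iou <= iou_thresh]
    return keep
```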
Overfeat
• In practice: use many sliding window locations and multiple scales.
• (Figure: window positions + score maps, box regression outputs, final predictions.)
Overfeat
• Sliding window + box regression + classification: Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024) → Boxes (1000 x 4) and Class scores (1000).
• What prevents us from dealing with any image size?

Sermanet et al., "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
What about multiple objects?
• Localization: regression.
• How about detection? 3 objects mean an output of 12 numbers (3 x 4); 14 objects mean an output of 56 numbers (14 x 4).


• Having a variable-sized output is not optimal for neural networks.
• There are a couple of workarounds:
– RNN: Romera-Paredes and Torr. Recurrent Instance Segmentation. ECCV 2016.
– Set prediction: Rezatofighi, Kaskman, Motlagh, Shi, Cremers, Leal-Taixé, Reid. Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv:1805.00613.


Detection as classification?
• Localization: regression.
• Detection as classification: slide a window over the image and classify each crop ("Is this a flamingo?" NO … NO … YES!).
• Problem:
– It is expensive to try all possible positions, scales and aspect ratios.
– How about trying only a subset of boxes with the most potential?


Region Proposals
• We have already seen a method that gives us "interesting" regions in an image that potentially contain an object.
• Step 1: Obtain region proposals.
• Step 2: Classify them.
The R-CNN family



R-CNN

Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014


R-CNN
• Each region proposal is warped to a fixed size (227 x 227) and passed through the CNN to extract features.
• A classification head classifies each proposal, and a regression head refines the bounding box location (see the sketch below).

Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014
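A minimal sketch of the R-CNN test-time loop described above (warp each proposal to 227 x 227, extract features, classify, refine the box); the proposal source, backbone and heads are placeholders, not the original implementation. The explicit Python loop over roughly 2000 proposals is exactly what makes R-CNN slow at test time.

```python
import torch
import torch.nn.functional as F

def rcnn_inference(image, proposals, cnn, cls_head, reg_head, crop_size=227):
    """image: (3, H, W) tensor; proposals: list of integer (x1, y1, x2, y2) boxes.
    cnn maps a (1, 3, 227, 227) crop to a feature vector; the heads are linear layers."""
    detections = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        # Warp every proposal to the fixed input size the CNN expects.
        crop = F.interpolate(crop, size=(crop_size, crop_size),
                             mode='bilinear', align_corners=False)
        feat = cnn(crop)          # e.g. a 4096-d feature vector
        scores = cls_head(feat)   # per-class scores (per-class SVMs in the paper)
        deltas = reg_head(feat)   # box refinement (Δx, Δy, Δw, Δh)
        detections.append((scores, deltas, (x1, y1, x2, y2)))
    return detections
```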


R-CNN
• Training scheme:
– 1. Pre-train the CNN on ImageNet.
– 2. Fine-tune the CNN on the number of classes the detector is aiming to classify (softmax loss).
– 3. Train a linear Support Vector Machine classifier to classify image regions: one SVM per class (hinge loss).
– 4. Train the bounding box regressor (L2 loss).


R-CNN
• PROS:
– The pipeline of proposals, feature extraction and SVM classification is well known and tested; only the features change (CNN instead of HOG).
– The CNN summarizes each proposal into a 4096-dimensional vector, a much more compact representation than HOG.
– It leverages transfer learning: the CNN can be pre-trained for image classification with C classes, and one only needs to change the FC layers to deal with Z classes.


R-CNN
• CONS (let us try to solve the first one first):
– Slow! 47 s/image with a VGG16 backbone: around 2000 proposals per image need to be warped and forwarded through the CNN.
– Training is also slow and complex.
– The object proposal algorithm is fixed, and feature extraction and the SVM classifier are trained separately → we are not exploiting learning to its full potential.


SPP-Net
• The convolutional backbone is kept frozen; the question is how to "pool" the features of each region into a common size.

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.


SPP-Net
• It solved R-CNN's problem of being slow at test time.
• It still has some problems inherited from R-CNN:
– Training is still slow (a bit faster than R-CNN).
– The training scheme is still complex.
– Still no end-to-end training.

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.


Fast R-CNN



Fast R-CNN
• Shared computation at test time (like SPP).
• Region of Interest (RoI) Pooling.

Girshick, "Fast R-CNN", ICCV 2015. Slide credit: Ross Girshick.
Fast R-CNN: RoI Pooling
• Region of Interest Pooling: the image (N x M x 3) is passed through the Convolutional Neural Network once, producing a feature map (L x K x C). The FC layers that predict the boxes (1000 x 4) and class scores (1000) expect a fixed-size input (H x W x C), so we have to transform each proposal's region of the feature map into size (H x W x C).
Fast R-CNN: RoI Pooling
• Zooming into a proposal's region of the feature map (L x K x C): we put an H x W grid on top and pool the features inside each grid cell, which yields the fixed-size (H x W x C) feature map the FC layers expect.
Fast R-CNN: RoI Pooling
• RoI Pooling: how do you do backpropagation? Like max-pooling: gradients flow only to the feature-map locations that were selected as the maximum within each grid cell (see the sketch below).
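A minimal sketch of RoI pooling for a single region: put an H x W grid over the RoI on the feature map and max-pool within each cell. In practice torchvision.ops.roi_pool (or roi_align) does this for batches of RoIs with proper coordinate handling; this loop just shows the idea.

```python
import torch

def roi_pool_single(feature_map, roi, out_h, out_w):
    """feature_map: (C, L, K) tensor; roi: integer (x1, y1, x2, y2) in feature-map coordinates.
    Returns a (C, out_h, out_w) tensor by max-pooling an out_h x out_w grid over the RoI."""
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    C, rh, rw = region.shape
    out = torch.empty(C, out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Grid-cell boundaries inside the RoI (each cell covers at least one pixel).
            ys = (i * rh) // out_h
            ye = max(((i + 1) * rh) // out_h, ys + 1)
            xs = (j * rw) // out_w
            xe = max(((j + 1) * rw) // out_w, xs + 1)
            out[:, i, j] = region[:, ys:ye, xs:xe].amax(dim=(1, 2))
    return out

# Gradients flow only through the max locations, exactly as in max-pooling.
```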
Fast R-CNN Results
• VGG-16 CNN on the Pascal VOC 2007 dataset. Note: the test times below do not include proposal generation.

                      R-CNN        Fast R-CNN
Training time         84 hours     9.5 hours     (8.8x speedup)
Test time per image   47 seconds   0.32 seconds  (146x speedup)
mAP (VOC 2007)        66.0         66.9

• With proposal generation included:

                      R-CNN        Fast R-CNN
Test time per image   50 seconds   2 seconds     (25x speedup)
mAP (VOC 2007)        66.0         66.9

Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson (CS231n, Lecture 8).
Faster R-CNN



Faster R-CNN
• Solution: integrate the proposal generation with the rest of the pipeline.
• A Region Proposal Network (RPN) is trained to produce region proposals directly.
• After the RPN, everything is like Fast R-CNN.

Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015. Slide credit: Ross Girshick.


Region proposal network
• How do we extract proposals from the feature map (H x W x 4096)?
– How many proposals? We need to decide on a fixed number.
– Where are they placed? Densely, at every feature-map location.
Region proposal network
• We fix the number of proposals by using a set of n = 9 anchors per location: 9 anchors = 3 scales x 3 aspect ratios (a sketch follows below).
• We extract a descriptor per location of the feature map.

Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015
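A minimal sketch of generating the 9 anchors (3 scales x 3 aspect ratios) centred at one feature-map location; the particular scale and ratio values below are common defaults, assumed here for illustration.

```python
import itertools
import torch

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (9, 4) tensor of anchors (x1, y1, x2, y2) centred at (cx, cy).
    Each anchor keeps the area scale**2 while its width/height follow the aspect ratio."""
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w = s * r ** 0.5
        h = s / r ** 0.5
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

# The same set of 9 anchors is replicated at every feature-map location.
print(anchors_at(100.0, 100.0).shape)  # torch.Size([9, 4])
```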
Region proposal network
• How to extract proposals: a 3x3 conv maps the backbone feature map (H x W x 4096) to an intermediate feature map (H x W x 256).
• How many anchors per image? H x W x n.
Region proposal network
• On top of the 3x3 conv output (H x W x 256), a 1x1 conv predicts, for each of the n anchors, a classification score (object / non-object) and a regression from the anchor to the proposal box, giving an output of size (H x W x (2n + 4n)).
• Per feature-map location, the RPN therefore gives a set of anchor corrections and a classification into object / non-object (see the sketch below).
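A minimal PyTorch sketch of such an RPN head: a 3x3 conv followed by two 1x1 convs producing the 2n objectness scores and 4n box regressions per location. The channel sizes follow the slide; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Predicts objectness and box deltas for n anchors at every feature-map location."""
    def __init__(self, in_channels=4096, mid_channels=256, n_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # 2 scores per anchor (object / non-object) and 4 regression values per anchor.
        self.cls = nn.Conv2d(mid_channels, 2 * n_anchors, kernel_size=1)
        self.reg = nn.Conv2d(mid_channels, 4 * n_anchors, kernel_size=1)

    def forward(self, feats):            # feats: (B, in_channels, H, W)
        x = torch.relu(self.conv(feats))
        return self.cls(x), self.reg(x)  # (B, 2n, H, W), (B, 4n, H, W)

head = RPNHead()
scores, deltas = head(torch.randn(1, 4096, 14, 14))
```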
RPN: training and losses
• Classification ground truth: we compute p*, which indicates how much an anchor overlaps with the ground-truth bounding boxes:
– p* = 1 if IoU > 0.7
– p* = 0 if IoU < 0.3
• 1 indicates the anchor represents an object (foreground) and 0 indicates background; the remaining anchors do not contribute to the training.
RPN: training and losses
• For an image, we randomly sample 256 anchors to form a mini-batch (balanced between objects and non-objects).
• We compute the classification loss (binary cross-entropy).
• The anchors that do contain an object are also used to compute the regression loss (a sketch of the labelling and sampling follows below).
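A minimal sketch of the anchor labelling and mini-batch sampling described above (IoU via torchvision.ops.box_iou; the 0.7 / 0.3 thresholds and 256-anchor batch follow the slides, the rest is an illustrative simplification).

```python
import torch
from torchvision.ops import box_iou

def label_and_sample_anchors(anchors, gt_boxes, batch_size=256):
    """anchors: (A, 4), gt_boxes: (G, 4), both as (x1, y1, x2, y2).
    Returns per-anchor labels in {1: object, 0: background, -1: ignored}
    and the indices of a roughly balanced mini-batch."""
    best_iou = box_iou(anchors, gt_boxes).max(dim=1).values   # best IoU per anchor
    labels = torch.full((anchors.shape[0],), -1, dtype=torch.long)
    labels[best_iou > 0.7] = 1
    labels[best_iou < 0.3] = 0
    # Sample a balanced mini-batch of foreground / background anchors.
    pos = torch.nonzero(labels == 1).flatten()
    neg = torch.nonzero(labels == 0).flatten()
    n_pos = min(pos.numel(), batch_size // 2)
    n_neg = min(neg.numel(), batch_size - n_pos)
    pos = pos[torch.randperm(pos.numel())[:n_pos]]
    neg = neg[torch.randperm(neg.numel())[:n_neg]]
    return labels, torch.cat([pos, neg])
```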
RPN: training and losses
• Each anchor is described by its center position, width and height: x_a, y_a, w_a, h_a.
• What the network actually predicts are the offsets t_x, t_y, t_w, t_h:
– t_x = (x − x_a) / w_a (normalized x), t_y = (y − y_a) / h_a (normalized y)
– t_w = log(w / w_a) (normalized width), t_h = log(h / h_a) (normalized height)
• A smooth L1 loss is applied to these regression targets (see the sketch below).
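A minimal sketch of these regression targets and the smooth L1 loss, with boxes given as centre/size tuples; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_targets(gt, anchors):
    """gt, anchors: (N, 4) tensors as (cx, cy, w, h).
    Returns the (N, 4) regression targets (tx, ty, tw, th)."""
    tx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(gt[:, 2] / anchors[:, 2])
    th = torch.log(gt[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)

# Regression loss, computed on the positive anchors only.
anchors = torch.tensor([[100., 100., 128., 128.]])
gt      = torch.tensor([[110., 104., 140., 120.]])
pred    = torch.zeros(1, 4)   # what the RPN regression head would output
loss = F.smooth_l1_loss(pred, encode_targets(gt, anchors))
```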
Faster R-CNN: Training
• In the first implementation, the RPN was trained separately from the rest.
• Now we can train jointly, with four losses (a sketch of the combined objective follows below):
1. RPN classification (object / non-object)
2. RPN regression (anchor → proposal)
3. Fast R-CNN classification (type of object)
4. Fast R-CNN regression (proposal → box)

Slide credit: Ross Girshick
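A sketch of how the four losses are typically combined into one joint objective; the loss weighting and tensor names are assumptions for illustration, not the exact implementation.

```python
import torch.nn.functional as F

def faster_rcnn_loss(rpn_cls_logits, rpn_cls_labels, rpn_box_pred, rpn_box_targets,
                     det_cls_logits, det_cls_labels, det_box_pred, det_box_targets,
                     w_box=1.0):
    """Sum of the four Faster R-CNN losses; the regression terms are computed
    on positive samples only."""
    loss_rpn_cls = F.cross_entropy(rpn_cls_logits, rpn_cls_labels)   # 1. object / non-object
    loss_rpn_box = F.smooth_l1_loss(rpn_box_pred, rpn_box_targets)   # 2. anchor -> proposal
    loss_det_cls = F.cross_entropy(det_cls_logits, det_cls_labels)   # 3. object class
    loss_det_box = F.smooth_l1_loss(det_box_pred, det_box_targets)   # 4. proposal -> box
    return loss_rpn_cls + w_box * loss_rpn_box + loss_det_cls + w_box * loss_det_box
```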


Faster R-CNN
• 10x faster at test time than Fast R-CNN.
• Trained end-to-end, including feature extraction, region proposals, classifier and regressor.
• More accurate, since the proposals are learned; the RPN is fully convolutional.


Faster R-CNN: Results

                                       R-CNN        Fast R-CNN   Faster R-CNN
Test time per image (with proposals)   50 seconds   2 seconds    0.2 seconds
(Speedup)                              1x           25x          250x
mAP (VOC 2007)                         66.0         66.9         66.9

Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson (CS231n, Lecture 8).
Two-stage object detectors


Related works
• Shrivastava, Gupta, Girshick. "Training region-based object detectors with online hard example mining". CVPR 2016.
• Dai, Li, He and Sun. "R-FCN: Object detection via region-based fully convolutional networks". 2016.
• Dai, Qi, Xiong, Li, Zhang, Hu and Wei. "Deformable convolutional networks". ICCV 2017.
• Lin, Dollar, Girshick, He, Hariharan and Belongie. "Feature Pyramid Networks for object detection". CVPR 2017.
