L7 Detection

Object detection involves predicting bounding boxes and class labels for objects in images. R-CNN was an early two-stage detector that used selective search to generate region proposals which were then classified using CNN features. Fast R-CNN improved on R-CNN by making the whole system trainable end-to-end using a multi-task loss over classification and bounding box regression. It introduced ROI pooling to extract fixed-length feature vectors from convolutional feature maps for each region proposal. Faster R-CNN built on this by incorporating the region proposal network to generate proposals within the detection network.

Uploaded by

Agha Kazim

Object detection

Outline
• Task definition and evaluation
• Two-stage detectors:
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
Object detection evaluation
• At test time, predict bounding boxes, class labels, and confidence
scores
• For each detection, determine whether it is a true or false positive
• PASCAL criterion: Area(GT ∩ Det) / Area(GT ∪ Det) > 0.5
• For multiple detections of the same ground truth box, only one is
considered a true positive
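
The PASCAL criterion and the one-match-per-ground-truth rule can be sketched as follows. This is a minimal illustration, not the official evaluation code: the `(x1, y1, x2, y2)` box format and the function names `iou` and `label_detections` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, thresh=0.5):
    """Mark detections of one class in one image as true or false positives.

    detections: list of (score, box); gt_boxes: list of boxes.
    Detections are visited from highest to lowest score, and each
    ground-truth box can be claimed by at most one detection.
    """
    claimed = [False] * len(gt_boxes)
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in gt_boxes]
        # Index of the best-overlapping ground-truth box (None if there is none).
        best = max(range(len(gt_boxes)), key=lambda k: overlaps[k], default=None)
        is_tp = best is not None and overlaps[best] > thresh and not claimed[best]
        if is_tp:
            claimed[best] = True
        labels.append((score, is_tp))
    return labels
```

With two overlapping "dog" detections on the same ground-truth box, only the higher-scoring one is labeled a true positive; the second counts as a false positive.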

[Figure: example detections (dog: 0.6, dog: 0.55, cat: 0.8) shown against the ground truth (GT) boxes for dog and cat]


Object detection evaluation
• For each class, sort detections from highest to lowest confidence,
plot Recall-Precision curve and compute Average Precision
(area under the curve)
• Take mean of AP over classes to get mAP
• Precision: true positive detections / total detections
• Recall: true positive detections / total positive test instances
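
As a rough sketch of the AP computation described above: the code below integrates precision over recall with plain rectangles. That integration rule is an assumption; the official PASCAL and COCO tools use interpolated variants of the curve. The function names are made up for the example.

```python
def average_precision(scored_labels, num_positives):
    """Area under the recall-precision curve for one class.

    scored_labels: list of (confidence, is_true_positive) over the test set.
    num_positives: number of ground-truth instances of the class (> 0).
    """
    ranked = sorted(scored_labels, key=lambda d: d[0], reverse=True)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
        precision = tp / rank            # true positives / detections so far
        recall = tp / num_positives      # true positives / all positives
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values (dict of class name -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, three detections ranked (true, false, true) against two ground-truth instances give precision 1 at recall 0.5 and precision 2/3 at recall 1.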
PASCAL VOC Challenge (2005-2012)

• 20 challenge classes:
• Person
• Animals: bird, cat, cow, dog, horse, sheep
• Vehicles: airplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

• Dataset size (by 2012): 11.5K training/validation images, 27K bounding boxes, 7K segmentations

http://host.robots.ox.ac.uk/pascal/VOC/
Progress on PASCAL detection
[Figure: progress on PASCAL VOC detection, with the periods "Before CNNs" and "After CNNs" marked]
More recent benchmark: COCO

http://cocodataset.org/#home
COCO dataset: Tasks

[Figure: four example panels: image classification, object detection, semantic segmentation, instance segmentation]

• Also: keypoint prediction, captioning, question answering…


COCO detection metrics

• Leaderboard: http://cocodataset.org/#detection-leaderboard
• Not updated since 2020
Object detection: Outline
• Task definition and evaluation
• Two-stage detectors

[Figure: proposal generation turns the input image into a set of region proposals]
R-CNN: Region proposals + CNN features
[Figure: input image → region proposals → warped image regions → forward each region through ConvNet → classify regions with SVMs]
Source: R. Girshick

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
R-CNN details

• Regions: ~2000 Selective Search proposals


• Network: AlexNet pre-trained on ImageNet (1000 classes), fine-tuned
on PASCAL (21 classes)
• Final detector: warp proposal regions, extract fc7 network activations
(4096 dimensions), classify with linear SVM
• Bounding box regression to refine box locations
• Performance: mAP of 53.7% on PASCAL 2010
(vs. 35.1% for Selective Search and 33.4% for Deformable Part Models)
R-CNN pros and cons
• Pros
• Much more accurate than previous approaches!
• Any deep architecture can immediately be “plugged in”
• Cons
• Not a single end-to-end system
• Fine-tune network with softmax classifier (log loss)
• Train post-hoc linear SVMs (hinge loss)
• Train post-hoc bounding-box regressions (least squares)
• Training was slow (84h), took up a lot of storage
• 2000 CNN passes per image
• Inference (detection) was slow (47s / image with VGG16)
Fast R-CNN

[Figure: forward whole image through ConvNet → conv5 feature map of image; region proposals → RoI pooling layer → fully-connected layers (FCs) → two outputs: linear + softmax (softmax classifier) and linear (bounding-box regressors)]
Source: R. Girshick

R. Girshick, Fast R-CNN, ICCV 2015


RoI pooling
• “Crop and resample” a fixed-size feature representing a
region of interest out of the outputs of the last conv layer
• Use nearest-neighbor interpolation of coordinates, max pooling

[Figure: region of interest (RoI) on the conv feature map → RoI pooling layer → RoI feature → FC layers]
Source: R. Girshick, K. He
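
The "crop and resample" step can be sketched in NumPy as below. This is an illustrative version under stated assumptions, not the reference implementation: the feature-map stride of 16, the 2x2 output size, the bin-edge rule, and the function name `roi_max_pool` are all choices made for the example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_size=(2, 2), stride=16):
    """Pool one RoI into a fixed-size feature.

    feature_map: array of shape (C, H, W).
    roi: (x1, y1, x2, y2) in image pixels.
    stride: assumed downsampling factor between image and feature map.
    """
    channels, height, width = feature_map.shape
    # Snap RoI corners to feature-map cells by rounding, then clip to the map.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]
    x1, y1 = min(max(x1, 0), width - 1), min(max(y1, 0), height - 1)
    x2, y2 = min(max(x2, x1), width - 1), min(max(y2, y1), height - 1)
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    out_h, out_w = out_size
    pooled = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # Bin edges use floor for the start and ceil for the end,
        # so every bin covers at least one cell.
        ys = y1 + (i * roi_h) // out_h
        ye = max(y1 + -(-((i + 1) * roi_h) // out_h), ys + 1)
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = max(x1 + -(-((j + 1) * roi_w) // out_w), xs + 1)
            # Max pooling inside the bin, independently per channel.
            pooled[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled
```

Whatever the RoI size, the output has shape `(C, out_h, out_w)`, which is what lets the fully-connected layers accept it.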
RoI pooling illustration

Prediction
• For each RoI, network predicts probabilities for 𝐶 + 1 classes
(class 0 is background) and four bounding box offsets for 𝐶
classes

R. Girshick, Fast R-CNN, ICCV 2015


Fast R-CNN training
• Multi-task loss: log loss + smooth L1 loss
[Figure: the ConvNet, the FCs, and both output layers (linear + softmax; linear) are trainable]
Source: R. Girshick

R. Girshick, Fast R-CNN, ICCV 2015


Multi-task loss
• Loss for ground truth class $y$, predicted class probabilities $P(y)$, ground truth box $b$, and predicted box $\hat{b}$:

$$L(y, P, b, \hat{b}) = -\log P(y) + \lambda \, \mathbb{I}[y \geq 1] \, L_{\text{reg}}(b, \hat{b})$$

(first term: softmax loss; second term: regression loss)

• Regression loss: smooth $L_1$ loss on top of log space offsets relative to proposal

$$L_{\text{reg}}(b, \hat{b}) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(b_i - \hat{b}_i)$$
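
A minimal NumPy sketch of this loss for a single RoI follows. The slide does not spell out the smooth L1 function; the piecewise form used here (quadratic for |x| < 1, linear otherwise) is the one given in the cited Fast R-CNN paper. The default `lam=1.0` and the function names are assumptions for the example.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 where |x| < 1, and |x| - 0.5 elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_probs, y, target_offsets, pred_offsets, lam=1.0):
    """Loss for one RoI.

    class_probs: length C+1 softmax output; index 0 is background.
    y: ground-truth class index.
    target_offsets, pred_offsets: length-4 (x, y, w, h) offsets
    for the ground-truth class.
    """
    cls_loss = -np.log(class_probs[y])
    diff = np.asarray(target_offsets, dtype=float) - np.asarray(pred_offsets, dtype=float)
    reg_loss = smooth_l1(diff).sum()
    # The indicator [y >= 1] switches the box term off for background RoIs.
    return cls_loss + lam * (y >= 1) * reg_loss
```

For a background RoI (`y = 0`) the predicted offsets have no effect on the loss, which matches the indicator in the formula.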
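Bounding box regression
[Figure: a region proposal (a.k.a. default box, prior, reference, anchor) and a ground truth box define the target offset to predict*; the predicted offset applied to the proposal gives the predicted box; the loss compares the predicted offset with the target offset]
*Typically in transformed, normalized coordinates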
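
One common choice of "transformed, normalized coordinates" is the parameterization from the R-CNN paper: center shifts scaled by the proposal size, and log ratios for width and height. The sketch below assumes that parameterization and a `(cx, cy, w, h)` box format; `encode` and `decode` are names chosen for the example.

```python
import math


def encode(proposal, gt):
    """Target offsets that map a proposal onto a ground-truth box.

    Boxes are (cx, cy, w, h): center coordinates, width, height.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode(proposal, offsets):
    """Apply (predicted) offsets to a proposal to obtain a box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

`decode(proposal, encode(proposal, gt))` recovers `gt`, so a network that predicts the target offsets exactly reproduces the ground-truth box.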
ROI pooling: Backpropagation
• Similar to max pooling, but has to take into account overlap of pooling regions
[Figure: two overlapping RoIs $r_1$ and $r_2$ on the feature map; input $x_{33}$ is the max pooling “switch” (argmax back-pointer) for outputs $z_{1,4}$ and $z_{2,1}$, i.e. $i^*(1,4) = 33$ and $i^*(2,1) = 33$]
• Backward pass: have $\frac{\partial e}{\partial z}$, want $\frac{\partial e}{\partial x}$

$$\frac{\partial e}{\partial x_i} = \sum_r \sum_j \frac{\partial e}{\partial z_{rj}} \frac{\partial z_{rj}}{\partial x_i} = \sum_r \sum_j \mathbb{I}[i = i^*(r, j)] \, \frac{\partial e}{\partial z_{rj}}$$

• The sums run over regions $r$ and RoI indices $j$; the indicator is 1 if $(r, j)$ “pooled” input $i$, and 0 otherwise
Source: Ross Girshick
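
The backward rule amounts to a scatter-add over the stored argmax indices. The sketch below assumes the forward pass saved, for every region `r` and output index `j`, the flat index of the winning input; the array layout and the name `roi_pool_backward` are assumptions for the example.

```python
import numpy as np


def roi_pool_backward(grad_z, argmax, num_inputs):
    """Route output gradients back to the inputs that won each max.

    grad_z: array (R, J), de/dz for region r and output index j.
    argmax: int array (R, J), i*(r, j): flat index of the winning input.
    num_inputs: number of feature-map entries.
    """
    grad_x = np.zeros(num_inputs, dtype=grad_z.dtype)
    # np.add.at accumulates when the same input index appears more than once,
    # which is exactly the case when pooling regions overlap.
    np.add.at(grad_x, argmax.ravel(), grad_z.ravel())
    return grad_x
```

If input 33 is the argmax for both $z_{1,4}$ and $z_{2,1}$, its gradient is the sum of the two output gradients.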
Mini-batch sampling
• Sample a few images (e.g., 2)
• Sample many regions from each image (64)

[Figure: RoIs drawn from a few sampled images together form one SGD mini-batch]

Source: R. Girshick, K. He
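
The two-level sampling can be sketched as below. It only illustrates the image-then-region order from the slide; any balancing of foreground and background regions is left out, and the data layout and the name `sample_minibatch` are assumptions for the example.

```python
import random


def sample_minibatch(rois_by_image, images_per_batch=2, rois_per_image=64, rng=random):
    """Sample a few images, then many RoIs from each of them.

    rois_by_image: dict mapping image id -> list of RoIs.
    Returns a list of (image_id, roi) pairs.
    """
    image_ids = rng.sample(sorted(rois_by_image), images_per_batch)
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        chosen = rng.sample(rois, min(rois_per_image, len(rois)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch
```

Because all RoIs from one image share the same conv feature map, sampling many regions from few images keeps the forward and backward passes cheap.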
Fast R-CNN results

                      Fast R-CNN    R-CNN
Train time (h)        9.5           84
- Speedup             8.8x
Test time / image     0.32s         47.0s
- Test speedup        146x
mAP                   66.9%         66.0% (vs. 53.7% for AlexNet)

Timings exclude object proposal time, which is equal for all methods.
All methods use VGG16.

Source: R. Girshick, K. He
Faster R-CNN

[Figure: a CNN computes a feature map; a Region Proposal Network applied to that feature map produces region proposals; proposal generation and detection share features]

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015
Region proposal network (RPN)
• Idea: put an “anchor box” of fixed size over each position in the feature map and try to predict whether this box is likely to contain an object
• Introduce anchor boxes at multiple scales and aspect ratios to handle a wider range of object sizes and shapes
[Figure: a conv layer over the feature map answers “Anchor is an object?” for each anchor at each position]
Figure source: J. Johnson



Faster R-CNN RPN design
• Slide a small window (3x3) over the conv5 layer
• Predict object/no object
• Regress bounding box coordinates with reference to anchors
(3 scales x 3 aspect ratios)
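
A sketch of how the 3 scales x 3 aspect ratios of anchors could be laid out over the feature map. The concrete scales (128, 256, 512 pixels), ratios (0.5, 1, 2) and stride of 16 are taken from the Faster R-CNN paper's default setup and are assumptions here, as is the function name `make_anchors`.

```python
import numpy as np


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors of shape (feat_h * feat_w * A, 4) as (x1, y1, x2, y2),
    where A = len(scales) * len(ratios)."""
    shapes = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width, area stays scale^2
            shapes.append((scale / np.sqrt(ratio), scale * np.sqrt(ratio)))
    shapes = np.array(shapes)                              # (A, 2) as (w, h)
    # Anchor centers: the middle of each feature-map cell, in image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                           # (feat_h, feat_w)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (H*W, 2)
    half = shapes / 2.0
    top_left = centers[:, None, :] - half[None, :, :]      # (H*W, A, 2)
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)
```

The RPN then predicts, for each of these anchors, an object/no-object score and four offsets relative to the anchor.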
One network, four losses
[Figure: image → CNN → feature map → Region Proposal Network (classification loss + bounding-box regression loss) → proposals → RoI pooling → classification loss + bounding-box regression loss]
Source: R. Girshick, K. He
Faster R-CNN results
Object detection progress

[Figure: detection progress plot with the periods "Before CNNs" and "After CNNs" marked and the points R-CNNv1, Fast R-CNN, and Faster R-CNN labeled]
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
Streamlined detection architectures
• The Faster R-CNN pipeline separates proposal generation and region classification
[Figure: conv feature map of the entire image → RPN → region proposals → RoI pooling → RoI features → classification + regression → detections]
• Is it possible to do detection in one shot?
[Figure: conv feature map of the entire image → classification + regression → detections]
YOLO
• Divide the image into a coarse grid and directly predict class
label and a few candidate boxes for each grid cell

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time
Object Detection, CVPR 2016
YOLO
1. Take conv feature maps at 7x7 resolution
2. Add two FC layers to predict, at each location,
a score for each class and 2 bboxes w/ confidences
• For PASCAL, output is 7 × 7 × 30 (30 = 20 + 2 ∗ (4 + 1))

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time
Object Detection, CVPR 2016
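
The output size arithmetic, and one way to split the tensor into its parts, can be sketched as follows. The channel order (class scores first, then the two boxes) is an assumption made for the example and may differ from the original implementation.

```python
import numpy as np

S, B, C = 7, 2, 20           # grid size, boxes per cell, PASCAL classes
DEPTH = C + B * (4 + 1)      # 20 class scores + 2 * (4 coordinates + 1 confidence)


def split_yolo_output(output):
    """Split an (S, S, DEPTH) prediction into class scores, boxes, confidences."""
    class_scores = output[..., :C]                   # (S, S, C)
    box_part = output[..., C:].reshape(S, S, B, 5)   # (S, S, B, 5)
    boxes = box_part[..., :4]                        # (S, S, B, 4)
    confidences = box_part[..., 4]                   # (S, S, B)
    return class_scores, boxes, confidences
```

Each grid cell therefore carries one set of class scores shared by its two candidate boxes.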
YOLO
• Objective function, with three groups of terms: regression, object/no object confidence, and class prediction
• Indicator: cell i contains object, predictor j is responsible for it
• Regression: small deviations matter less for larger boxes than for smaller boxes
• Confidence: one term for object, one term for no object; down-weight loss from boxes that don’t contain objects (𝜆noobj = 0.5)
• Class prediction: class probability
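
For reference, the sum-of-squares objective in the cited YOLO paper (Redmon et al., CVPR 2016) has the form below. The notation is an assumption carried over from that paper rather than from the slide: $S$ is the grid size, $B$ the number of predictors per cell, $C_i$ the box confidence, $p_i(c)$ the class probability, and hats mark predictions. The square roots on width and height are what make small deviations matter less for larger boxes.

```latex
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
     &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}}
       \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
     &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \sum_{i=0}^{S^2} \mathbb{I}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the regression terms, the next two are the object and no-object confidence terms, and the last line is the class prediction term.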
YOLO: Results
• Each grid cell predicts only two boxes and can only have one class –
this limits the number of nearby objects that can be predicted
• Localization accuracy suffers compared to Fast(er) R-CNN due to
coarser features, errors on small boxes
• 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)

Performance on PASCAL 2007


YOLO v2
• Remove FC layer, do convolutional prediction with anchor boxes instead
• Increase resolution of input images and conv feature maps
• Improve accuracy using batch normalization and other tricks
[Figure: VOC 2007 results; link: YouTube demo]

J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, CVPR 2017


Multi-resolution prediction: SSD
• Predict boxes of different size from different conv maps
• Each level of resolution has its own predictor

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. Berg, SSD: Single Shot MultiBox Detector, ECCV 2016
Feature pyramid networks
• Improve predictive power of
lower-level feature maps by
adding contextual information
from higher-level feature maps
• Predict different sizes of
bounding boxes from different
levels of the pyramid (but
share parameters of
predictors)

T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, Feature pyramid networks for object detection, CVPR 2017
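
The top-down pathway, which passes context from higher-level maps down to lower-level ones, can be sketched as below. This is a simplified illustration: it assumes the per-level maps have already been projected to a common channel count and that each level has half the resolution of the previous one, and it omits the smoothing convolution the paper applies after each merge. The function names are made up for the example.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def top_down_merge(laterals):
    """Merge feature maps from coarse to fine.

    laterals: list of (C, H, W) maps ordered fine to coarse, each half the
    resolution of the previous one.
    Returns the merged maps in the same fine-to-coarse order.
    """
    merged = [laterals[-1]]                      # start from the coarsest level
    for lateral in reversed(laterals[:-1]):
        # Add upsampled higher-level context to the lower-level map.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]
```

Sharing predictor parameters across levels is possible because every merged map ends up with the same number of channels.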
RetinaNet
• Combine feature pyramid network with focal loss to reduce the standard
cross-entropy loss for well-classified examples

T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, ICCV 2017
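
A sketch of the binary focal loss from the cited paper, $\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log p_t$. The defaults `gamma=2.0` and `alpha=0.25` are the values the paper reports as working well and are assumptions here; the function name is chosen for the example.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss per example.

    p: predicted probability of the positive class.
    y: label in {0, 1}.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the cross-entropy term for well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With `gamma=0` this reduces to alpha-weighted cross-entropy; with `gamma=2`, an example classified with `p_t = 0.9` contributes 100 times less than it would under cross-entropy.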
RetinaNet: Results

T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, ICCV 2017
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
CornerNet

H. Law and J. Deng, CornerNet: Detecting Objects as Paired Keypoints, ECCV 2018
CenterNet
• Use an additional center point to verify predictions:

K. Duan et al. CenterNet: Keypoint Triplets for Object Detection, ICCV 2019
Detection Transformer (DETR)

N. Carion et al., End-to-end object detection with transformers, ECCV 2020
