L7 Detection
Outline
• Task definition and evaluation
• Two-stage detectors:
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
Object detection evaluation
• At test time, predict bounding boxes, class labels, and confidence
scores
• For each detection, determine whether it is a true or false positive
• PASCAL criterion: Area(GT ∩ Det) / Area(GT ∪ Det) > 0.5
• For multiple detections of the same ground truth box, only one is
considered a true positive
[Figure: a ground-truth "dog" box with two detections, scored 0.6 and 0.55]
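The overlap test and the one-true-positive-per-box rule above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: the function names `iou` and `label_detections`, the `(x1, y1, x2, y2)` box format, and the example coordinates are all assumptions made for the sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, thresh=0.5):
    """Label each (score, box) detection as a true (True) or false (False) positive.

    Detections are visited in order of decreasing score. Each one is compared
    with its best-overlapping ground-truth box, and a ground-truth box can be
    claimed by at most one detection.
    """
    claimed = [False] * len(gt_boxes)
    labels = []
    for score, box in sorted(detections, key=lambda d: -d[0]):
        overlaps = [iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda k: overlaps[k], default=None)
        is_tp = best is not None and overlaps[best] > thresh and not claimed[best]
        if is_tp:
            claimed[best] = True
        labels.append((score, is_tp))
    return labels


# Hypothetical coordinates: one ground-truth dog, two overlapping detections.
gt = [(10, 10, 110, 110)]
dets = [(0.55, (20, 20, 120, 120)), (0.6, (12, 8, 108, 112))]
labels = label_detections(dets, gt)
```

Both detections overlap the ground truth by more than 0.5, but only the higher-scoring one is labeled a true positive; the second becomes a false positive because the box is already claimed.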
• 20 challenge classes:
• Person
• Animals: bird, cat, cow, dog, horse, sheep
• Vehicles: airplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
http://host.robots.ox.ac.uk/pascal/VOC/
Progress on PASCAL detection
[Figure: PASCAL VOC detection results, before CNNs vs. after CNNs]
More recent benchmark: COCO
http://cocodataset.org/#home
COCO dataset: Tasks
• Leaderboard: http://cocodataset.org/#detection-leaderboard
• Not updated since 2020
Object detection: Outline
• Task definition and evaluation
• Two-stage detectors
Region Proposals
R-CNN: Region proposals + CNN features
[Figure: R-CNN pipeline: input image → region proposals → warped image regions → ConvNet → classify regions with SVMs. Source: R. Girshick]
R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
R-CNN details
Prediction
• For each RoI, network predicts probabilities for 𝐶 + 1 classes
(class 0 is background) and four bounding box offsets for 𝐶
classes
[Figure: trainable ConvNet → FCs → two output heads: linear + softmax (class probabilities) and linear (predicted box*)]
*Typically in transformed, normalized coordinates
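To make the two outputs concrete, here is a small sketch of turning one RoI's predictions into a box. It assumes one common parameterization of the "transformed, normalized coordinates" (the one described in the R-CNN paper: center shifts scaled by the RoI size, width and height changed in log space). The names `decode_box` and `predict_box` and the data layout are hypothetical.

```python
import math


def decode_box(roi, offsets):
    """Apply (tx, ty, tw, th) offsets to an RoI given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    tx, ty, tw, th = offsets
    # Center shifts are scaled by the RoI size; sizes change in log space.
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


def predict_box(roi, class_probs, box_offsets):
    """Pick the most probable class and decode its box.

    class_probs has C + 1 entries (index 0 is background);
    box_offsets has C entries, one 4-tuple per foreground class.
    """
    cls = max(range(len(class_probs)), key=lambda c: class_probs[c])
    if cls == 0:
        return cls, None  # background: no box is produced
    return cls, decode_box(roi, box_offsets[cls - 1])


cls, box = predict_box((0, 0, 10, 20), [0.1, 0.7, 0.2],
                       [(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)])
```

All-zero offsets leave the RoI unchanged, which is one reason this kind of parameterization is convenient: the network only has to predict a small correction.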
ROI pooling: Backpropagation
• Similar to max pooling, but has to take into account overlap of
pooling regions
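[Figure: RoI pooling applied to two overlapping regions $r_1$ and $r_2$ of a feature map; input $x_{33}$ feeds outputs $z_{1,4}$ (in $r_1$) and $z_{2,1}$ (in $r_2$). Source: Ross Girshick]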
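Backward pass: as in max pooling, each output keeps a "switch" (argmax back-pointer) $i^*(r, j)$ to the input it pooled. In the example, $i^*(1,4) = 33$ for $z_{1,4}$ in region $r_1$ and $i^*(2,1) = 33$ for $z_{2,1}$ in region $r_2$, so input $x_{33}$ receives gradient from both regions.
Have $\partial e / \partial z_{rj}$, want $\partial e / \partial x_i$:
$$\frac{\partial e}{\partial x_i} = \sum_r \sum_j \frac{\partial e}{\partial z_{rj}} \frac{\partial z_{rj}}{\partial x_i} = \sum_r \sum_j \mathbb{I}\left[i = i^*(r, j)\right] \frac{\partial e}{\partial z_{rj}}$$
The sums run over regions $r$ and RoI indices $j$; the indicator is 1 if $(r, j)$ "pooled" input $i$, and 0 otherwise.
Source: Ross Girshick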
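The accumulation over overlapping regions can be sketched on a flattened feature map. This is a toy illustration only: the representation of each output bin as an explicit list of input indices, and the names `roi_pool_forward` and `roi_pool_backward`, are assumptions made to keep the example short.

```python
def roi_pool_forward(x, bins):
    """Max-pool a flat feature list x over explicit bins.

    bins[r][j] is the list of input indices covered by output j of region r.
    Returns the outputs z[r][j] and the argmax back-pointers i_star[r][j].
    """
    z, i_star = [], []
    for region in bins:
        z_r, s_r = [], []
        for members in region:
            best = max(members, key=lambda i: x[i])
            s_r.append(best)
            z_r.append(x[best])
        z.append(z_r)
        i_star.append(s_r)
    return z, i_star


def roi_pool_backward(grad_z, i_star, n_inputs):
    """de/dx_i = sum of de/dz_rj over all (r, j) whose back-pointer is i."""
    grad_x = [0.0] * n_inputs
    for r, region in enumerate(i_star):
        for j, i in enumerate(region):
            grad_x[i] += grad_z[r][j]
    return grad_x


x = [0.1, 0.9, 0.3, 0.2]
# Two overlapping regions with two output bins each; input 1 lies in both.
bins = [[[0, 1], [2]], [[1, 2], [3]]]
z, i_star = roi_pool_forward(x, bins)
grad_x = roi_pool_backward([[1.0, 1.0], [2.0, 1.0]], i_star, len(x))
```

Input 1 is the maximum of one bin in each region, so its gradient is the sum of the two incoming gradients (1.0 + 2.0), while input 0 was never selected and receives zero.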
Mini-batch sampling
• Sample a few images (e.g., 2)
• Sample many regions from each image (64)
[Figure: regions from the sampled images together form one SGD mini-batch. Source: R. Girshick, K. He]
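A minimal sketch of this hierarchical sampling, assuming proposals are stored in a dictionary keyed by image id. The function name `sample_minibatch` and its parameters are hypothetical, and the sketch samples regions uniformly, ignoring any balancing between foreground and background regions.

```python
import random


def sample_minibatch(proposals_by_image, images_per_batch=2,
                     rois_per_image=64, rng=None):
    """Pick a few images, then many regions from each picked image."""
    rng = rng or random.Random()
    image_ids = rng.sample(sorted(proposals_by_image), images_per_batch)
    batch = []
    for image_id in image_ids:
        rois = proposals_by_image[image_id]
        k = min(rois_per_image, len(rois))  # an image may have few proposals
        batch.extend((image_id, roi) for roi in rng.sample(rois, k))
    return batch


# Hypothetical data: four images with 200 proposal indices each.
proposals = {name: list(range(200)) for name in ["a", "b", "c", "d"]}
batch = sample_minibatch(proposals, rng=random.Random(0))
```

Because all regions of one image share the same convolutional feature map, drawing 128 regions from 2 images requires only 2 forward passes through the ConvNet rather than up to 128.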
Fast R-CNN results
Timings exclude object proposal time, which is equal for all methods.
All methods use VGG16.
Source: R. Girshick, K. He
Faster R-CNN
[Figure: a Region Proposal Network produces region proposals from a CNN feature map; the proposal and detection networks share features]
S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks, NIPS 2015
Region proposal network (RPN)
• Idea: put an “anchor box” of fixed size over each position in
the feature map and try to predict whether this box is likely to
contain an object
[Figure: a conv layer applied over the feature map predicts, at each position, whether the anchor is an object]
[Figure: Faster R-CNN: image → CNN → feature map → Region Proposal Network → proposals → RoI pooling, trained with a classification loss and a bounding-box regression loss. Source: R. Girshick, K. He]
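The anchor idea can be sketched by placing one fixed-size box at every feature-map position. The stride of 16 and the anchor size of 128 pixels are illustrative assumptions, and `make_anchors` is a hypothetical helper; the full method uses several sizes and aspect ratios per position rather than the single square box shown here.

```python
def make_anchors(feat_h, feat_w, stride=16, size=128):
    """One square anchor (x1, y1, x2, y2), in image pixels, per feature-map cell."""
    half = size / 2
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors


anchors = make_anchors(2, 3)
```

The network then scores each of these boxes as object or not, so the number of objectness predictions equals the number of anchors.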
Faster R-CNN results
Object detection progress
[Figure: detection progress after CNNs, with Fast R-CNN and Faster R-CNN marked]
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
Streamlined detection architectures
• The Faster R-CNN pipeline separates proposal generation
and region classification
[Figure: RPN → region proposals → classification + regression]
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time
Object Detection, CVPR 2016
YOLO
1. Take conv feature maps at 7x7 resolution
2. Add two FC layers to predict, at each location,
a score for each class and 2 bboxes w/ confidences
• For PASCAL, output is 7 × 7 × 30 (30 = 20 + 2 ∗ (4 + 1))
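The output size follows directly from the counts above, as this small sketch shows. The ordering of values inside each cell's vector is an assumption made for illustration, and both function names are hypothetical.

```python
def yolo_output_shape(grid=7, num_classes=20, boxes_per_cell=2):
    """Shape of the prediction tensor: one vector per grid cell."""
    per_cell = num_classes + boxes_per_cell * (4 + 1)  # 4 box coords + 1 confidence
    return (grid, grid, per_cell)


def split_cell(vector, num_classes=20, boxes_per_cell=2):
    """Split one cell's vector into class scores and (box, confidence) pairs.

    ASSUMPTION: class scores come first, followed by 5 numbers per box
    ordered (x, y, w, h, confidence). Real implementations may differ.
    """
    class_scores = vector[:num_classes]
    boxes = []
    for b in range(boxes_per_cell):
        start = num_classes + 5 * b
        boxes.append((tuple(vector[start:start + 4]), vector[start + 4]))
    return class_scores, boxes


shape = yolo_output_shape()
scores, boxes = split_cell(list(range(30)))
```

Note that the class scores are shared by both boxes in a cell, which is why the count is 20 + 2 ∗ 5 rather than 2 ∗ 25.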
YOLO
• Objective function has three groups of terms: regression, object/no object confidence, and class prediction
• Indicator used in the objective: cell i contains an object and predictor j is responsible for it
convolutional prediction with anchor boxes instead
• Increase resolution of input images and conv feature maps
• Improve accuracy using batch normalization and other tricks
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. Berg, SSD: Single Shot MultiBox Detector, ECCV 2016
Multi-resolution prediction: SSD
• Predict boxes of different size from different conv maps
• Each level of resolution has its own predictor
Feature pyramid networks
• Improve predictive power of
lower-level feature maps by
adding contextual information
from higher-level feature maps
• Predict different sizes of
bounding boxes from different
levels of the pyramid (but
share parameters of
predictors)
T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, Feature pyramid networks for object detection, CVPR 2017
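The top-down flow of context can be sketched on tiny single-channel maps. This is a simplified illustration under stated assumptions: the learned convolutions that a real network applies around the merge are omitted, each level is assumed to be exactly half the size of the previous one, and `upsample2x` and `top_down` are hypothetical names.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down(levels):
    """Merge maps ordered fine to coarse by adding upsampled coarser maps.

    Returns the merged maps in the same fine-to-coarse order.
    """
    merged = [levels[-1]]  # the coarsest level is used as-is
    for fine in reversed(levels[:-1]):
        up = upsample2x(merged[0])
        merged.insert(0, [[a + b for a, b in zip(f_row, u_row)]
                          for f_row, u_row in zip(fine, up)])
    return merged


c_fine = [[1, 1, 1, 1] for _ in range(4)]
c_coarse = [[10, 20], [30, 40]]
p_fine, p_coarse = top_down([c_fine, c_coarse])
```

Every cell of the fine map now carries the value of the coarse cell above it, which is the sense in which lower levels gain context. Sharing predictor parameters would then mean applying one and the same prediction function to `p_fine` and `p_coarse`.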
RetinaNet
• Combine a feature pyramid network with focal loss, which down-weights the standard cross-entropy loss for well-classified examples
T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, ICCV 2017
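A short sketch of the effect, using the focal loss form $-(1 - p_t)^\gamma \log p_t$, where $p_t$ is the probability the model assigns to the true class. The default $\gamma = 2$ is a commonly used value; the optional class-balancing weight is left out to keep the example minimal, and the function names are hypothetical.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma; gamma = 0 recovers cross-entropy."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Ratio of focal loss to cross-entropy for an easy and a hard example.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)
```

With $\gamma = 2$, an example classified with probability 0.9 keeps about 1% of its cross-entropy loss, while one at 0.1 keeps about 81%, so training is dominated by the hard examples rather than the many easy background ones.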
RetinaNet: Results
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
CornerNet
H. Law and J. Deng, CornerNet: Detecting Objects as Paired Keypoints, ECCV 2018
CenterNet
• Use an additional center point to verify predictions:
K. Duan et al. CenterNet: Keypoint Triplets for Object Detection, ICCV 2019
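One way to picture the verification step: keep a box formed from a corner pair only if some detected center keypoint falls inside the box's central region. The sketch below assumes the central region is the middle 1/n of the box in each dimension (the middle third for n = 3) and ignores class labels and scores; `central_region` and `verify_boxes` are hypothetical names, not the authors' code.

```python
def central_region(box, n=3):
    """Middle 1/n of a box (x1, y1, x2, y2) in each dimension."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    margin = (n - 1) / (2 * n)  # fraction trimmed from each side
    return (x1 + margin * w, y1 + margin * h, x2 - margin * w, y2 - margin * h)


def verify_boxes(boxes, centers, n=3):
    """Keep boxes whose central region contains at least one center keypoint."""
    kept = []
    for box in boxes:
        cx1, cy1, cx2, cy2 = central_region(box, n)
        if any(cx1 <= x <= cx2 and cy1 <= y <= cy2 for x, y in centers):
            kept.append(box)
    return kept


boxes = [(0, 0, 90, 90), (100, 100, 160, 160)]
centers = [(45, 45)]  # hypothetical detected center keypoints
kept = verify_boxes(boxes, centers)
```

Only the first box survives here: the second has no center keypoint in its central region, which is the kind of spurious corner pairing the extra keypoint is meant to reject.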
Detection Transformer (DETR)