Object detection
Presenter
Contents
1. Object Detection
2. Faster R-CNN
3. YOLO
4. SSD
Computer Vision Tasks
Object Detection
deer
cat
Object Detection as Classification
CNN -> deer? / cat? / background?
Object Detection as Classification
with Sliding Window
CNN -> deer? / cat? / background?
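The sliding-window idea can be sketched in a few lines: move a fixed-size window over the image with some stride and pass every crop to the classifier. The function name `sliding_windows`, the window size and the stride below are illustrative assumptions, and the CNN classifier itself is left out.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield every win x win window that fits fully inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)


# Hypothetical 8 x 6 image, 4 x 4 window, stride 2.
boxes = list(sliding_windows(8, 6, win=4, stride=2))
print(len(boxes))  # number of crops the classifier would have to score
```

Each crop would be classified as deer, cat or background. The number of crops grows quickly once several window sizes and aspect ratios are used, which is the motivation for box proposals.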
Object Detection as Classification
with Box Proposals
Box Proposal Method – SS: Selective Search
Segmentation as Selective Search for Object Recognition. van de Sande et al. ICCV 2011.
R-CNN
Fast R-CNN
Faster R-CNN
YOLO- You Only Look Once
Idea: No bounding box proposals. Predict a class and a box for every location in a grid.
Redmon et al. CVPR 2016. https://arxiv.org/abs/1506.02640
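A minimal sketch of "a box for every location in a grid", assuming the image is split into an S x S grid and the cell that contains the center of an object is the one that predicts it. The helper name `responsible_cell` is hypothetical; S = 7 matches the 7 x 7 output used later in the slides.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, S: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing a box center.

    cx, cy are the center coordinates normalized to [0, 1].
    """
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays in the grid
    row = min(int(cy * S), S - 1)
    return row, col


print(responsible_cell(0.5, 0.5))   # center of the image
print(responsible_cell(0.95, 0.1))  # near the top-right corner
```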
YOLO- You Only Look Once
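• Non-maximal suppression: when several predicted boxes overlap heavily, keep only the one with the highest score and discard the rest.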
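Non-maximal suppression can be sketched as a greedy loop over boxes sorted by score. This is a plain-Python illustration, not the code used by any particular YOLO release; the names `iou` and `nms` and the 0.5 overlap threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

In a multi-class detector this loop would typically be run once per class.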
YOLO v2
Each grid cell has 5 anchor boxes. Each anchor box predicts:
• Bounding box: 4 real numbers, the offsets relative to the anchor box (the center offsets are squashed into the range [0, 1]).
• Objectness score: 1 number.
• Class scores: 20 numbers, one per class.
5 anchor boxes
-> Each grid cell outputs 5 * (4 + 1 + 20) = 125 real numbers
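To make the count concrete, the sketch below splits the 125 numbers of one grid cell into 5 per-anchor predictions. The ordering inside each chunk (box, then objectness, then class scores) and the name `split_cell_output` are assumptions made for illustration; a real implementation may lay the tensor out differently.

```python
from typing import Dict, List

NUM_ANCHORS = 5
NUM_CLASSES = 20
PER_ANCHOR = 4 + 1 + NUM_CLASSES  # box offsets + objectness + class scores


def split_cell_output(values: List[float]) -> List[Dict[str, object]]:
    """Split the 125 numbers of one grid cell into per-anchor predictions."""
    assert len(values) == NUM_ANCHORS * PER_ANCHOR
    preds = []
    for a in range(NUM_ANCHORS):
        chunk = values[a * PER_ANCHOR:(a + 1) * PER_ANCHOR]
        preds.append({
            "box": chunk[0:4],          # assumed layout: offsets first
            "objectness": chunk[4],
            "class_scores": chunk[5:],
        })
    return preds


cell = [float(i) for i in range(125)]  # dummy values, not real network output
preds = split_cell_output(cell)
print(len(preds), len(preds[0]["class_scores"]))
```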
YOLO v2
YOLO v1: Image (448 x 448 x 3) -> CNN -> 7 x 7 x 1024 -> 2 FC (4096) -> Linear reg -> 7 x 7 x 30
YOLO v2
YOLO v2: Image (448 x 448 x 3) -> CNN -> 7 x 7 x 1024 -> 2 x Conv3, 1024 -> 7 x 7 x 1024 -> 1 x Conv1, 125 -> 7 x 7 x 125
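One way to see the effect of replacing the fully connected head with convolutions is to count weights. The sketch below does this for the two heads as read from the diagrams; it assumes the v1 head is 7x7x1024 -> FC 4096 -> 7x7x30, ignores biases, and uses the hypothetical helpers `fc_params` and `conv_params`.

```python
def fc_params(n_in: int, n_out: int) -> int:
    """Weights of a fully connected layer (biases omitted)."""
    return n_in * n_out


def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k convolution (biases omitted)."""
    return k * k * c_in * c_out


# YOLO v1 head as read from the slide: 7x7x1024 -> FC 4096 -> 7x7x30
v1_head = fc_params(7 * 7 * 1024, 4096) + fc_params(4096, 7 * 7 * 30)

# YOLO v2 head as read from the slide: 2 x Conv3 (1024) -> Conv1 (125)
v2_head = 2 * conv_params(3, 1024, 1024) + conv_params(1, 1024, 125)

print(v1_head, v2_head)
```

Under these assumptions the convolutional head has roughly an order of magnitude fewer weights than the fully connected one.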
YOLO v2
• Disadvantages of YOLO v1, v2
Predictions use only the last feature map, so small objects are hard to detect.
YOLO v3 - Feature Pyramid
Figure: the input image passes through Layer 1 to Layer 6, giving feature maps C1 to C6.
Top-down path: Conv 1x1, Upsample, then the sum (P5 + U6, P4 + U4).
Head at each scale: 2 x (Conv 3x3, 1024) -> 1 x (Conv 1x1, 75) -> output (T5, T4).
3 anchor boxes for each scale
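The top-down merge in the figure can be sketched on a single channel. The sketch assumes nearest-neighbour upsampling by 2 and reads the "+" in the figure labels as elementwise addition; both are assumptions about the diagram, and the names `upsample2x`, `merge` and `head_channels` are hypothetical. The last helper shows where the 75 output channels come from: 3 anchors times (4 + 1 + 20).

```python
from typing import List

Map = List[List[float]]  # one channel of a feature map, as rows of values


def upsample2x(m: Map) -> Map:
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    out: Map = []
    for row in m:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(lateral: Map, coarse: Map) -> Map:
    """Add the upsampled coarse map to the lateral (finer) map."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]


def head_channels(num_anchors: int = 3, num_classes: int = 20) -> int:
    """Output channels of a head: anchors x (box + objectness + classes)."""
    return num_anchors * (4 + 1 + num_classes)


coarse = [[1.0, 2.0]]                     # 1 x 2 map, stands in for the coarse level
lateral = [[10.0, 10.0, 10.0, 10.0],
           [10.0, 10.0, 10.0, 10.0]]     # 2 x 4 map, stands in for the finer level
print(merge(lateral, coarse))
print(head_channels())
```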
SSD: Single Shot Detector
Idea: Similar to YOLO, but with a denser grid and multi-scale grid maps, plus data augmentation, hard negative mining, and other design choices in the network. Liu et al. ECCV 2016.
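Hard negative mining can be sketched as follows: most candidate boxes are background, so only the negatives with the highest confidence loss are kept for training. The 3:1 negative-to-positive ratio follows the SSD paper; the function name `hard_negative_indices` and the loss values are made up for illustration.

```python
from typing import List


def hard_negative_indices(losses: List[float], is_positive: List[bool],
                          neg_pos_ratio: int = 3) -> List[int]:
    """Pick the negatives with the highest confidence loss.

    At most neg_pos_ratio negatives are kept per positive box.
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: losses[i], reverse=True)
    return negatives[:neg_pos_ratio * num_pos]


# Dummy per-box confidence losses; only box 0 is matched to an object.
losses = [0.2, 2.5, 0.1, 1.7, 0.9, 0.05, 3.0]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_indices(losses, is_positive))
```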
SSD: Single Shot Detector
• Base network: VGG-16
• Add extra convolutional feature layers on top of the base network
• Multi-scale feature maps for detection
SSD: Single Shot Detector
Figure: a 5x5x256 feature map feeds two 3x3 conv predictors.
• Class predictor: 3x3 conv -> 5x5x21 classes -> softmax -> p(class)
• Box predictor: 3x3 conv -> 5x5x4 box offsets (x, y, w, h)
Loss:
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$
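The loss on this slide, L = (L_conf + alpha * L_loc) / N, can be sketched in plain Python. Following the SSD paper, the confidence term is taken to be a softmax cross-entropy, the localization term a Smooth L1 over the 4 box offsets, and N the number of matched boxes. The function names and the toy numbers are assumptions for illustration, and matching of predictions to ground truth is left out.

```python
import math
from typing import List, Sequence


def softmax_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of a softmax over class scores (the confidence term)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Smooth L1 distance summed over the box offsets (the localization term)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def ssd_loss(conf_losses: List[float], loc_losses: List[float],
             num_matched: int, alpha: float = 1.0) -> float:
    """L = (L_conf + alpha * L_loc) / N, defined as 0 when N == 0."""
    if num_matched == 0:
        return 0.0
    return (sum(conf_losses) + alpha * sum(loc_losses)) / num_matched


# One matched box with toy class scores and toy box offsets.
conf = [softmax_cross_entropy([2.0, 0.5, 0.1], target=0)]
loc = [smooth_l1([0.1, 0.2, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0])]
print(ssd_loss(conf, loc, num_matched=1))
```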