Object Detection

The document discusses object detection and summarizes how datasets and methods have evolved. It describes early datasets that focused on a single category, such as faces or pedestrians, and then broader datasets such as PASCAL VOC and COCO, which cover more categories and harder conditions. It reviews sliding windows, object proposals generated by segmentation, and the R-CNN approach, which combines region proposals with CNN features for classification and bounding-box regression.

Uploaded by

Atul Verma

Object detection

Biplab Banerjee
The object detection problem
Datasets

• Face detection
• One category: face
• Frontal faces
• Fairly rigid, unoccluded

1990’s
Human Face Detection in Visual Scenes. H. Rowley, S. Baluja, T. Kanade. 1995.
Pedestrians

• One category: pedestrians
• Slight pose variations and small distortions
• Partial occlusions

Timeline: 1990’s (faces), 2000’s (pedestrians)
Histograms of Oriented Gradients for Human Detection. N. Dalal and B. Triggs. CVPR 2005
PASCAL VOC

• 20 categories
• 10K images
• Large pose variations, heavy occlusions
• Generic scenes
• Cleaned up performance metric

Timeline: 1990’s (faces), 2000’s (pedestrians), 2007 - 2012 (PASCAL VOC)
COCO

• 80 diverse categories
• 100K images
• Heavy occlusions, many objects per image, large scale variations

Timeline: 1990’s (faces), 2000’s (pedestrians), 2007 - 2012 (PASCAL VOC), 2014 - (COCO)
Evaluation metric
Matching detections to ground truth
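
The matching step is usually based on intersection over union (IoU) between a detection and a ground-truth box. The following is a minimal sketch, not the official evaluation code of any benchmark. The function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions made for illustration, and the greedy matching rule is a simplification.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, ground_truth, threshold=0.5):
    """Greedily match detections to ground-truth boxes.

    detections: list of (score, box). Returns one boolean per detection,
    in descending score order: True means true positive.
    Each ground-truth box can be matched at most once (assumed rule).
    """
    matched = set()
    results = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            results.append(True)
        else:
            results.append(False)
    return results
```

Under this rule, a second detection of an already matched object counts as a false positive.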
Why is detection hard(er)?

• Precise localization
• Much larger impact of pose
• Occlusion makes localization difficult
• Counting
• Small objects
Object Detection
(Figure: an image with two labeled objects, "deer" and "cat".)
Object Detection as Classification
(Figure: an image window is passed to a CNN, which decides: deer? cat? background?)
Object Detection as Classification with Sliding Window
(Figure: the same CNN decision, deer? cat? background?, for a window slid across the image.)
Problems with sliding window approach

1. Fine-tune the CNN with this new training data


2. Pass the sliding windows through the CNN for binary classification
3. Huge computational cost! Can we do better?
Dealing with scale
• Use same window size, but run on image pyramid
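
To give a feel for the cost, the sketch below enumerates fixed-size windows over an image pyramid and counts them. It is only an illustration: the window size (64), stride (16), and per-level scale factor (0.5) are assumed values, and the function names are hypothetical.

```python
def pyramid_sizes(width, height, scale=0.5, min_size=64):
    """Image sizes of a pyramid, shrinking by `scale` until below min_size."""
    sizes = []
    w, h = width, height
    while w >= min_size and h >= min_size:
        sizes.append((w, h))
        w, h = int(w * scale), int(h * scale)
    return sizes


def sliding_windows(width, height, window=64, stride=16):
    """Yield (x1, y1, x2, y2) for every window position in one pyramid level."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


def count_windows(width, height, window=64, stride=16, scale=0.5):
    """Total number of windows the classifier would have to score."""
    return sum(
        sum(1 for _ in sliding_windows(w, h, window, stride))
        for w, h in pyramid_sizes(width, height, scale, window)
    )
```

Every one of these windows needs its own CNN forward pass, and the count grows quickly with image size and with a smaller stride.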
Scanning window results on PASCAL

Method                                   VOC 2007   VOC 2010
DPM v5 (Girshick et al. 2011)            33.7%      29.6%
UVA sel. search (Uijlings et al. 2013)   -          35.1%
Regionlets (Wang et al. 2013)            41.7%      39.7%
SegDPM (Fidler et al. 2013)              -          40.4%
R-CNN                                    54.2%      50.2%
R-CNN + bbox regression                  58.5%      53.7%

Reference systems: the first four rows.
metric: mean average precision (higher is better)

Slide credit: Ross Girshick
Idea 2: Object proposals
• Use segmentation to produce ~5K candidates

Selective Search for Object Recognition. J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders. International Journal of Computer Vision, 2013.
Idea 2: Object proposals
• Many different segmentation algorithms (k-means on color, k-means on color+position, N-cuts, ...)
• Many hyperparameters (number of clusters, weights on edges)
• Try everything!
• Every cluster is a candidate object
• Thousands of segmentations -> thousands of candidate objects
Idea 2: Object proposals
• Tens of ways of generating candidates ("proposals")
• What fraction of ground truth objects have proposals near them?

What makes for effective detection proposals? J. Hosang, R. Benenson, P. Dollar, B. Schiele. In TPAMI
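
One way to turn that question into a number is proposal recall: the fraction of ground-truth boxes that have at least one proposal overlapping them by some IoU threshold. The sketch below assumes boxes in (x1, y1, x2, y2) format and a threshold of 0.5; both are illustrative choices, not values taken from the slides.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(ground_truth, proposals, threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not ground_truth:
        return 0.0
    covered = sum(
        1 for gt in ground_truth
        if any(iou(gt, p) >= threshold for p in proposals)
    )
    return covered / len(ground_truth)
```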
What do we do with proposals?
• Each proposal is a group of pixels
• Take tight fitting box and classify it
• Can leverage any image classification approach

Horse
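
Taking the tight-fitting box of a proposal can be sketched as follows. The proposal is assumed to be a binary mask stored as a list of rows of 0/1 values, and the returned box uses an exclusive bottom-right corner; both conventions are assumptions for this example.

```python
def tight_box(mask):
    """Smallest (x1, y1, x2, y2) box containing all nonzero mask pixels.

    mask: list of rows, each a list of 0/1 values.
    Returns None for an empty mask. x2 and y2 are exclusive.
    """
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1
```

The resulting box can then be cropped and handed to any image classifier.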
Proposal methods results

Method                                   VOC 2007   VOC 2010
DPM v5 (Girshick et al. 2011)            33.7%      29.6%
UVA sel. search (Uijlings et al. 2013)   -          35.1%
Regionlets (Wang et al. 2013)            41.7%      39.7%
SegDPM (Fidler et al. 2013)              -          40.4%
R-CNN                                    54.2%      50.2%
R-CNN + bbox regression                  58.5%      53.7%

Reference systems: the first four rows.
metric: mean average precision (higher is better)

Slide credit: Ross Girshick
So, we do this
A better approach

Classification + Regression
R-CNN: Regions with CNN features

Input image -> Extract region proposals (~2k / image) -> Compute CNN features -> Classify regions (linear SVM)

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. R. Girshick, J. Donahue, T. Darrell, J. Malik. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
Slide credit: Ross Girshick
R-CNN at test time: Step 2

Input image -> Extract region proposals (~2k / image) -> Compute CNN features

a. Crop
b. Scale (anisotropic) to 227 x 227
c. Forward propagate
Output: "fc7" features

Slide credit: Ross Girshick
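
The crop and anisotropic scale steps can be sketched in plain Python. This is an illustration only: the image is assumed to be a list of pixel rows, the box is (x1, y1, x2, y2) with an exclusive bottom-right corner, and nearest-neighbour sampling is an assumed choice of interpolation.

```python
def crop_and_warp(image, box, out_size=227):
    """Crop `box` from `image` and stretch it to out_size x out_size.

    The two axes are scaled independently (anisotropic), so the aspect
    ratio of the proposal is not preserved.
    """
    x1, y1, x2, y2 = box
    crop_w, crop_h = x2 - x1, y2 - y1
    warped = []
    for oy in range(out_size):
        src_row = image[y1 + oy * crop_h // out_size]
        warped.append(
            [src_row[x1 + ox * crop_w // out_size] for ox in range(out_size)]
        )
    return warped
```

The warped patch has the fixed input size the CNN expects, whatever the shape of the original proposal.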
R-CNN at test time: Step 3

Input image -> Extract region proposals (~2k / image) -> Compute CNN features -> Classify regions

Warped proposal -> 4096-dimensional fc7 feature vector -> linear classifiers (SVM or softmax)

person? 1.6
...
horse? -0.3
...

Slide credit: Ross Girshick
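
Scoring a region with per-class linear classifiers amounts to one dot product per class. The sketch below uses hypothetical names and a toy 2-dimensional feature in the usage note; in R-CNN the feature would be the 4096-dimensional fc7 vector.

```python
def class_scores(feature, weights, biases):
    """Score one feature vector against a linear classifier per class.

    weights: dict mapping class name -> weight vector (same length as feature)
    biases:  dict mapping class name -> scalar bias
    """
    return {
        name: sum(w * f for w, f in zip(weights[name], feature)) + biases[name]
        for name in weights
    }
```

For example, with `feature = [1.0, 2.0]`, weights `{"person": [1.0, 0.5], "horse": [-1.0, 0.0]}` and biases `{"person": -0.4, "horse": 0.7}`, the scores come out at about 1.6 for person and -0.3 for horse.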
Step 4: Object proposal refinement

Linear regression on CNN features

Original proposal -> Predicted object bounding box

Bounding-box regression

Slide credit: Ross Girshick
R-CNN results on PASCAL

Method                                   VOC 2007   VOC 2010
DPM v5 (Girshick et al. 2011)            33.7%      29.6%
UVA sel. search (Uijlings et al. 2013)   -          35.1%
Regionlets (Wang et al. 2013)            41.7%      39.7%
SegDPM (Fidler et al. 2013)              -          40.4%
R-CNN                                    54.2%      50.2%
R-CNN + bbox regression                  58.5%      53.7%

Reference systems: the first four rows.
metric: mean average precision (higher is better)

Slide credit: Ross Girshick
Training R-CNN
• Train convolutional network on ImageNet classification
• Fine-tune on detection
• Classification problem!
• Proposals with IoU > 50% are positives
• Sample fixed proportion of positives in each batch because of imbalance
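
The labeling and sampling rules above can be sketched as follows. The IoU > 0.5 rule comes from the slide; the batch size of 128 and the positive fraction of 0.25 are assumed values for illustration, and all function names are hypothetical.

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, ground_truth, threshold=0.5):
    """Label a proposal 1 if its best IoU with any ground truth exceeds threshold."""
    labels = []
    for p in proposals:
        best = max((iou(p, gt) for gt in ground_truth), default=0.0)
        labels.append(1 if best > threshold else 0)
    return labels


def sample_batch(labels, batch_size=128, positive_fraction=0.25, rng=None):
    """Pick proposal indices so positives fill a fixed share of the batch."""
    rng = rng or random.Random(0)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), int(batch_size * positive_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
```

Without this sampling, the far more numerous background proposals would dominate every batch.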
Other details - Non-max suppression

(Figure: two overlapping detections of the same object, with scores 0.9 and 0.8.)
How do we deal with multiple detections on the same object?
Other details - Non-max suppression
• Go down the list of detections starting from highest scoring
• Eliminate any detection that overlaps highly with a higher scoring detection
• Separate, heuristic step
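
A minimal sketch of this greedy procedure is shown below. The IoU threshold of 0.5 used to decide what "overlaps highly" means is an assumption, as are the function names and the (score, box) input format.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-max suppression over a list of (score, box) pairs.

    Walks the detections from highest to lowest score and keeps one only
    if it does not overlap a kept detection by more than iou_threshold.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

In practice this is run separately for each class, so a detection of one class does not suppress a detection of another.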
Selective search
Fine-tune the CNN
Bounding box regressor
Normalized difference between predicted and true box

Learnable parameter
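
The slide does not spell out the normalization. The sketch below assumes the parameterization used in the R-CNN paper: center offsets divided by the proposal's width and height, and log ratios for the sizes. The function names are hypothetical.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


def encode(proposal, target):
    """Regression targets: normalized difference between proposal and true box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(target)
    return (
        (gx - px) / pw,
        (gy - py) / ph,
        math.log(gw / pw),
        math.log(gh / ph),
    )


def decode(proposal, deltas):
    """Apply predicted deltas to a proposal to get the refined box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    cx, cy = px + dx * pw, py + dy * ph
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

The regressor is trained to predict the output of `encode` from CNN features; at test time `decode` turns its prediction back into a box.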
Fast R-CNN
Fast R-CNN: a closer look
Time comparison
Two issues
• How to find the location in the feature maps for a given RoI

• How to reshape the RoIs in the feature maps so they can be fed to the fc layers
Transform the original RoI into feature maps
Problems
• The conversion may introduce quantization error.
• Remember each box is represented by (x, y, w, h)
• Since VGG reduces the feature map to 1/16 of the original image size, x/16 and y/16 may be fractional.
Green – displacement
Blue – loss of information
RoIPool and RoIAlign
RoIPool
Quantization twice
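
The two quantization steps can be sketched as follows. This is a simplified illustration rather than the Fast R-CNN implementation: the feature map is a list of rows, the RoI is (x, y, w, h), the stride of 16 matches the VGG reduction mentioned above, and the quantized RoI is assumed to be at least out_size cells wide and high.

```python
def roi_to_feature_map(roi, stride=16):
    """First quantization: image-space (x, y, w, h) to integer feature cells.

    Returns the exact fractional coordinates and their truncated version.
    """
    exact = tuple(v / stride for v in roi)
    quantized = tuple(int(v) for v in exact)  # truncation drops the fraction
    return exact, quantized


def roi_pool(feature_map, roi_cells, out_size=2):
    """Second quantization: split the RoI into out_size x out_size bins
    with integer edges and max-pool each bin.

    Assumes w >= out_size and h >= out_size so that no bin is empty.
    """
    x, y, w, h = roi_cells
    pooled = []
    for by in range(out_size):
        y0 = y + (by * h) // out_size
        y1 = y + ((by + 1) * h) // out_size
        row = []
        for bx in range(out_size):
            x0 = x + (bx * w) // out_size
            x1 = x + ((bx + 1) * w) // out_size
            row.append(
                max(feature_map[j][i] for j in range(y0, y1) for i in range(x0, x1))
            )
        pooled.append(row)
    return pooled
```

For an RoI of (40, 24, 100, 70), the exact feature-map coordinates are (2.5, 1.5, 6.25, 4.375) but the pooled region starts at cell (2, 1) with size 6 x 4, which is the displacement and loss of information that RoIAlign avoids by not rounding.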
Faster R-CNN
Can we get rid of proposal generation by an ad-hoc technique?
Region proposal network
RPN
Faster R-CNN training
Mask R-CNN

• Faster R-CNN detector + FCN segmentation
• Binary segmentation inside each bounding box
YOLO
• Extremely fast (45 frames per second)

• Reason globally on the entire image

• Can learn generalizable representations


Stages
Each cell predicts B=2 boxes (x, y, w, h) and a confidence for each box: P(Object)
Hence, this cell is responsible for predicting the detection for DOG
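
Which cell is responsible can be sketched as below, assuming the rule from the YOLO paper that an object belongs to the grid cell containing its box center. The grid size of 7 and the function name are assumptions; the slides do not state the grid size.

```python
def responsible_cell(box, image_w, image_h, grid=7):
    """Return (row, col) of the grid cell that contains the box center.

    box: (x1, y1, x2, y2) in pixels. The image is divided into a
    grid x grid array of equally sized cells.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Clamp so a center on the right or bottom edge stays inside the grid.
    col = min(grid - 1, int(cx / image_w * grid))
    row = min(grid - 1, int(cy / image_h * grid))
    return row, col
```

During training, only the predictions of that cell are matched against the object's ground-truth box.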
Comparisons
Evaluations
Precision and recall
• True positive
• True negative
• False positive
• False negative
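
From these counts, precision is TP / (TP + FP) and recall is TP / (TP + FN). A small sketch, where returning 0.0 for an empty denominator is an assumed convention:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

True negatives do not appear in either formula, which suits detection, where the number of possible background boxes is effectively unbounded.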
Some interpretations
PR curve, for different confidence thresholds
Average Precision and mean AP (for N classes)

If all points on the PR curve are used, AP equals the area under the curve (AUC)
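
A sketch of the all-points computation follows. It assumes each detection has already been marked as a true or false positive, and it uses the common convention of replacing each precision value by the maximum precision at any higher recall before integrating. The function names are hypothetical.

```python
def pr_curve(scored_hits, num_ground_truth):
    """Build (recall, precision) points by sweeping the confidence threshold.

    scored_hits: list of (score, is_true_positive) for one class.
    """
    tp = fp = 0
    curve = []
    for _, hit in sorted(scored_hits, key=lambda s: s[0], reverse=True):
        if hit:
            tp += 1
        else:
            fp += 1
        curve.append((tp / num_ground_truth, tp / (tp + fp)))
    return curve


def average_precision(curve):
    """Area under the PR curve using every point."""
    recalls = [0.0] + [r for r, _ in curve]
    precisions = [0.0] + [p for _, p in curve]
    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values (dict: class name -> AP) over N classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```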
