IT5409 Ch7 Part1 Object Detection v2

This document discusses object detection techniques in computer vision. It begins with an overview of window-based generic object detection pipelines that involve building object models, generating candidate windows, and scoring candidates with classifiers. Boosting classifiers and their use in face detection are then covered. Features for representation and discriminative classifiers like support vector machines and boosting are also discussed. The document focuses in depth on the Viola-Jones face detection method, covering its integral image-based features and use of AdaBoost for feature selection and classifier training.


Computer Vision

Chapter 7 (part 1): Object detection


Course Content
• Chapter 1. Introduction
• Chapter 2. Image formation, acquisition and digitization
• Chapter 3. Image Processing
• Chapter 4. Feature detection and matching
• Chapter 5. Segmentation
• Chapter 6. Moving object detection and tracking
• Chapter 7. Object recognition and deep learning
‒ Object Detection
‒ Object Recognition
‒ Deep Learning
Contents
• Window-based generic object detection: basic
pipeline
• Boosting classifiers
• Face detection as case study
• SVM + HOG for human detection as case study
• Object proposals
• [DPM]
• Evaluation

3
Object Detection
• Problem: Detecting and localizing generic objects
from various categories, such as cars, people, etc.

• Challenges:
‒ Illumination
‒ Viewpoint
‒ Deformations
‒ Intra-class variability

4
Window-based generic
object detection
Basic pipeline

5
Generic category recognition:
basic framework
• Build/train object model
‒ Choose a representation
‒ Learn or fit parameters of model / classifier

• Generate candidates in new image


• Score the candidates

6
Window-based models
Building an object model
Given the representation, train a binary classifier

Car/non-car
Classifier

“Yes, a car.” / “No, not a car.”

Slide: Kristen Grauman


7
Window-based models
Generating and scoring candidates

Car/non-car
Classifier

Slide: Kristen Grauman


8
Window-based models
Generating and scoring candidates
• Slide a window through the image and check if there is
an object at every location

YES!! Person match found

9
Window-based models
Generating and scoring candidates
• But what if we were looking for buses?
‒ Same window size: no bus found!
• We will never find the object if we don’t
choose our window size wisely!
‒ Larger window: bus found

10
Multi-scale sliding window
• Work with multiple size windows

• Create a feature pyramid

11
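The multi-scale sliding-window idea above can be sketched in a few lines of Python. This is an illustrative toy, not a real detector: `score_fn` stands in for any classifier, and the pyramid level here is a crude pixel-skipping downsample rather than proper smoothing and resampling.

```python
def sliding_windows(image, win_h, win_w, step=1):
    """Yield (row, col, window) for every window position in a 2D image."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - win_h + 1, step):
        for c in range(0, cols - win_w + 1, step):
            window = [row[c:c + win_w] for row in image[r:r + win_h]]
            yield r, c, window

def downsample(image, factor=2):
    """Crude pyramid level: keep every `factor`-th pixel (no smoothing)."""
    return [row[::factor] for row in image[::factor]]

def multi_scale_detect(image, win_h, win_w, score_fn, threshold, n_levels=3):
    """Run a fixed-size window over several pyramid levels.

    Returns detections as (level, row, col, score); coordinates are in
    the subsampled image at that level.
    """
    detections = []
    level_img = image
    for level in range(n_levels):
        if len(level_img) < win_h or len(level_img[0]) < win_w:
            break
        for r, c, win in sliding_windows(level_img, win_h, win_w):
            s = score_fn(win)
            if s >= threshold:
                detections.append((level, r, c, s))
        level_img = downsample(level_img)
    return detections
```

A toy usage: with `score_fn` as mean window brightness, a bright 2×2 patch in a dark image is found at level 0, and disappears at coarser levels where the window no longer fits the object.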
Window-based object detection: recap
Training:
1. Obtain training data
2. Define features
3. Define classifier

Given new image:
1. Slide window
2. Extract features
3. Score by classifier (car/non-car)

Slide: Kristen Grauman


12
13

Features
• HOG

• Bags of visual words

• Haar features, …
Discriminative classifier construction
• Nearest neighbor (with ~10^6 examples)
• Neural networks
• Support Vector Machines
• Boosting
• Conditional Random Fields

14
Boosting classifiers

15
Boosting intuition

Weak
Classifier 1

Slide credit: Paul Viola


16
Boosting illustration

Weights
Increased

17
Boosting illustration

Weak
Classifier 2

18
Boosting illustration

Weights
Increased

19
Boosting illustration

Weak
Classifier 3

20
Boosting illustration

Final classifier is
a combination of weak
classifiers

21
Boosting: training
• Initially, weight each training example equally
• In each boosting round:
‒ Find the weak learner that achieves the lowest weighted training error
‒ Raise weights of training examples misclassified by current weak
learner

• Compute final classifier as linear combination of all weak


learners
‒ (weight of each learner is directly proportional to its accuracy)
• Exact formulas for re-weighting and combining weak
learners depend on the particular boosting scheme
(e.g., AdaBoost)
Slide credit: Lana Lazebnik
22
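The training loop described above can be sketched with decision stumps as weak learners (a minimal discrete-AdaBoost illustration in plain Python; the stump search over every feature/threshold/polarity is brute force, and the names `train_stump`, `adaboost`, `predict` are this sketch's own):

```python
import math

def train_stump(X, y, w):
    """Pick the (feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thresh in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[j] >= thresh else -polarity for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thresh, polarity)
    return best

def adaboost(X, y, T=5):
    """Return a list of (alpha, feature, threshold, polarity) weak learners."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    learners = []
    for _ in range(T):
        err, j, thresh, pol = train_stump(X, y, w)
        err = max(err, 1e-10)              # guard against zero error
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate -> larger weight
        learners.append((alpha, j, thresh, pol))
        # Re-weight: misclassified examples gain weight, correct ones lose weight
        for i, x in enumerate(X):
            pred = pol if x[j] >= thresh else -pol
            w[i] *= math.exp(-alpha * y[i] * pred)
        total = sum(w)
        w = [wi / total for wi in w]
    return learners

def predict(learners, x):
    """Final classifier: sign of the alpha-weighted vote of the weak learners."""
    score = sum(a * (pol if x[j] >= t else -pol) for a, j, t, pol in learners)
    return 1 if score >= 0 else -1
```

The exact re-weighting rule above is the AdaBoost one; other boosting schemes differ in these formulas but follow the same round structure.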
Face detection
as case study

23
Viola-Jones face detector

24
Viola-Jones face detector
Main idea:
‒ Represent local texture with efficiently
computable “rectangular” features within window
of interest
‒ Select discriminative features to be weak classifiers
‒ Use boosted combination of them as final classifier
‒ Form a cascade of such classifiers, rejecting clear
negatives quickly

25
Viola-Jones detector: features
• “Rectangular” filters
‒ Feature output is the difference between sums of
adjacent rectangular regions
• Efficiently computable with the integral image:
any rectangular sum can be computed in constant time
‒ Integral image: value at (x, y) is the sum of all
pixels above and to the left of (x, y)

Slide: Kristen Grauman 26


Computing the integral image

Lana Lazebnik
27
Computing the integral image

• Cumulative row sum: s(x, y) = s(x−1, y) + i(x, y)
• Integral image: ii(x, y) = ii(x, y−1) + s(x, y)
Lana Lazebnik
28
Computing sum within a rectangle
• Let A, B, C, D be the values of the integral image at
the corners of a rectangle (D: top-left, B: top-right,
C: bottom-left, A: bottom-right)
• Then the sum of original image values within the
rectangle can be computed as:
sum = A − B − C + D
• Only 3 additions are required
for any size of rectangle!

Lana Lazebnik
29
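The two recurrences and the corner formula above can be sketched directly in Python (a minimal illustration with inclusive pixel coordinates, not the original implementation; the names `integral_image` and `rect_sum` are this sketch's own):

```python
def integral_image(img):
    """ii[y][x] = sum of img pixels above and to the left of (x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0                          # cumulative row sum s(x, y)
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img over rows top..bottom and cols left..right (inclusive).

    A, B, C, D are the integral-image values at the rectangle's corners,
    as on the slide: sum = A - B - C + D.
    """
    A = ii[bottom][right]
    B = ii[top - 1][right] if top > 0 else 0
    C = ii[bottom][left - 1] if left > 0 else 0
    D = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return A - B - C + D
```

Once `ii` is built in a single pass, every rectangle sum costs the same constant time regardless of the rectangle's size, which is what makes the rectangular features cheap to evaluate.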
Viola-Jones detector: features
• “Rectangular” filters
‒ Feature output is the difference between sums of
adjacent rectangular regions
• Efficiently computable with the integral image:
any rectangular sum can be computed in constant time
• Avoid scaling images → scale the features directly,
for the same cost
30
Viola-Jones detector: features
• Considering all possible filter parameters
(position, scale, and type): 180,000+ possible
features associated with each 24 x 24 window
Which subset of these features should we use
to determine if a window has a face?
Use AdaBoost both to select the informative features
and to form the classifier

31
Viola-Jones detector: AdaBoost
• Want to select the single rectangle feature and threshold
that best separates positive (faces) and negative (non-
faces) training examples, in terms of weighted error.

• Resulting weak classifier: threshold the outputs of a
rectangle feature on faces and non-faces
• For the next round, reweight the examples according to
their errors, then choose another filter/threshold combo

Slide: Kristen Grauman


32
AdaBoost Algorithm
• Start with uniform weights on training examples {x1,…,xn}
• For T rounds:
‒ Evaluate the weighted error for each feature, pick the best
‒ Re-weight the examples:
• Incorrectly classified → more weight
• Correctly classified → less weight
• Final classifier is a combination of the weak ones,
weighted according to the error they had
33
Viola-Jones Face Detector: Results

First two features selected

34
• Even if the filters are fast to compute, each
new image has a lot of possible windows to
search.

• How to make the detection more efficient?

35
Cascading classifiers for detection

• Form a cascade with low false negative rates early on


• Apply less accurate but faster classifiers first to immediately
discard windows that clearly appear to be negative
Slide: Kristen Grauman
36
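The cascade's control flow is simple enough to sketch (an illustrative fragment, not the OpenCV implementation; `stages` is assumed to be a list of scoring functions paired with their AdaBoost thresholds, ordered cheapest first):

```python
def cascade_classify(window, stages):
    """Apply classifier stages in order; reject as soon as any stage says no.

    `stages` is a list of (score_fn, threshold) pairs, ordered from the
    fastest/least accurate stage to the slowest/most accurate one.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False     # clear negative: rejected early, later stages never run
    return True              # survived every stage: report a detection
```

The speedup comes from the fact that the overwhelming majority of windows in an image are clear negatives, so most windows only ever pay the cost of the first, cheapest stage.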
Training the cascade
• Set target detection and false positive rates for each
stage
• Keep adding features to the current stage until its target
rates have been met
‒ Need to lower AdaBoost threshold to maximize detection (as
opposed to minimizing total classification error)
‒ Test on a validation set
• If the overall false positive rate is not low enough, then
add another stage
• Use false positives from current stage as the
negative training examples for the next stage

37
Viola-Jones detector: summary
• Train a cascade of classifiers with AdaBoost on faces
vs. non-faces, learning the selected features,
thresholds, and weights; apply it to each window of a
new image
• Train with 5K positives, 350M negatives
• Real-time detector using a 38-layer cascade
• 6061 features in all layers
[Implementation available in OpenCV] Slide: Kristen Grauman
38
Viola-Jones detector: summary
• A seminal approach to real-time object detection
‒ 26,949 citations
• Training is slow, but detection is very fast
• Key ideas
‒ Integral images for fast feature evaluation
‒ Boosting for feature selection
‒ Attentional cascade of classifiers for fast rejection of non-face
windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

39
Viola-Jones Face Detector: Results

40
Viola-Jones Face Detector: Results

41
Viola-Jones Face Detector: Results

42
Detecting profile faces?
Can we use the same detector?

43
Viola-Jones Face Detector: Results

Paul Viola, ICCV tutorial 44


Example using Viola-Jones detector

Frontal faces detected and then tracked; character names
inferred by aligning the script and subtitles.
Everingham, M., Sivic, J. and Zisserman, A.
"Hello! My name is... Buffy" - Automatic naming of characters in TV video,
BMVC 2006. http://www.robots.ox.ac.uk/~vgg/research/nface/index.html

45
46
Slide: Kristen Grauman
47
Consumer application: iPhoto

http://www.apple.com/ilife/iphoto/
Slide credit: Lana Lazebnik
48
Consumer application: iPhoto

Things iPhoto thinks are faces

Slide credit: Lana Lazebnik


49
Consumer application: iPhoto
• Can be trained to recognize pets!

http://www.maclife.com/article/news/iphotos_faces_recognizes_cats

Slide credit: Lana Lazebnik

50
Privacy Gift Shop – CV Dazzle
• http://www.wired.com/2015/06/facebook-can-recognize-even-dont-show-face/
• Wired, June 15, 2015

Slide: Kristen Grauman

51
Boosting: pros and cons
• Advantages of boosting
‒ Integrates classification with feature selection
‒ Complexity of training is linear in the number of training examples
‒ Flexibility in the choice of weak learners, boosting scheme
‒ Testing is fast
‒ Easy to implement

• Disadvantages
‒ Needs many training examples
‒ Other discriminative models may outperform in practice (SVMs,
CNNs,…)
• especially for many-class problems

Slide credit: Lana Lazebnik


52
Window-based models:
Two case studies

• Boosting + face detection (Viola & Jones)
• SVM + person detection (e.g., Dalal & Triggs)

53
SVM + HOG for human detection
as case study

54
Linear classifiers

55
Linear classifiers
• Find linear function to separate positive and negative
examples

x_i positive: x_i · w + b ≥ 0
x_i negative: x_i · w + b < 0

Which line
is best?

56
Support Vector Machines (SVMs)
• Discriminative
classifier based on
optimal separating
line (for 2d case)

• Maximize the margin


between the positive
and negative training
examples

57
Support vector machines
• Want line that maximizes the margin

x_i positive (y_i = 1): x_i · w + b ≥ 1
x_i negative (y_i = −1): x_i · w + b ≤ −1

For support vectors, |x_i · w + b| = 1

Support vectors Margin


C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998

58
Support vector machines
• Want line that maximizes the margin

x_i positive (y_i = 1): x_i · w + b ≥ 1
x_i negative (y_i = −1): x_i · w + b ≤ −1

For support vectors, |x_i · w + b| = 1

Distance between a point and the line:
|x_i · w + b| / ||w||

For support vectors, the signed distance is ±1 / ||w||,
so the margin is
M = 1/||w|| − (−1)/||w|| = 2/||w||

59
Support vector machines
• Want line that maximizes the margin

x_i positive (y_i = 1): x_i · w + b ≥ 1
x_i negative (y_i = −1): x_i · w + b ≤ −1

For support vectors, |x_i · w + b| = 1

Distance between a point and the line:
|x_i · w + b| / ||w||

Therefore, the margin is 2 / ||w||


Support vectors Margin M

60
Finding the maximum margin line
1. Maximize margin 2/||w||
2. Correctly classify all training data points:
x_i positive (y_i = 1): x_i · w + b ≥ 1
x_i negative (y_i = −1): x_i · w + b ≤ −1

Quadratic optimization problem:

Minimize (1/2) wᵀw
subject to y_i (w · x_i + b) ≥ 1 for all i

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
61
Finding the maximum margin line
• Solution: w = Σ_i α_i y_i x_i
(the learned weights α_i are nonzero only for the
support vectors x_i)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
62
Finding the maximum margin line
• Solution: w = Σ_i α_i y_i x_i
b = y_i − w · x_i (for any support vector)
w · x + b = Σ_i α_i y_i (x_i · x) + b
• Classification function:
f(x) = sign(w · x + b)
     = sign(Σ_i α_i y_i (x_i · x) + b)

If f(x) < 0, classify as negative;
if f(x) > 0, classify as positive
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998

63
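The classification function above depends only on dot products with the support vectors, which is easy to sketch in plain Python (an illustrative fragment that assumes the α_i, labels, and b were already found by the quadratic optimization; the hand-picked values in the usage below are a toy, not a trained model):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_classify(x, support_vectors, labels, alphas, b):
    """f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b), over support vectors only."""
    score = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score > 0 else -1
```

For example, with support vectors (1, 0) labelled +1 and (−1, 0) labelled −1, α = 0.5 each and b = 0, the implied w is (1, 0), so the sign of the first coordinate decides the class.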
Person detection
with HoGs & linear SVMs
• Histogram of oriented gradients (HoG):
‒ Map each grid cell in the input window to a histogram
counting the gradients per orientation.
• Train a linear SVM
‒ using training set of pedestrian vs. non-pedestrian
windows.

Dalal & Triggs, CVPR 2005


64
Person detection
with HoGs & linear SVMs

• For more detail about HoG:


‒ Histograms of Oriented Gradients for Human Detection, Navneet Dalal,
Bill Triggs, International Conference on Computer Vision & Pattern
Recognition - June 2005
‒ http://lear.inrialpes.fr/pubs/2005/DT05/

65
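The core HoG step, one orientation histogram per grid cell, can be sketched in a few lines (a simplified illustration of a single cell, not the Dalal-Triggs implementation: it uses central differences, unsigned 0-180° orientations, and hard binning, and omits block normalization and interpolation):

```python
import math

def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram of gradients for one grid cell (2D list of floats).

    Each interior pixel votes for the bin of its gradient orientation,
    weighted by its gradient magnitude.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # central difference in x
            gy = cell[y + 1][x] - cell[y - 1][x]   # central difference in y
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[min(int(angle // bin_width), n_bins - 1)] += mag
    return hist
```

Concatenating these per-cell histograms over the detection window (after block normalization) gives the feature vector that the linear SVM scores.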
Window-based detection: strengths
• Sliding window detection and global appearance
descriptors:
‒ Simple detection protocol to implement
‒ Good feature choices critical
‒ Past successes for certain classes

Slide: Kristen Grauman


66
Window-based detection: Limitations
• High computational complexity
‒ For example: 250,000 locations x 30 orientations x 4
scales = 30,000,000 evaluations!
‒ If training binary detectors independently, means cost
increases linearly with number of classes
• With so many windows, the false positive rate must be
very low

Slide: Kristen Grauman


67
Limitations (continued)
• Not all objects are “box” shaped

Slide: Kristen Grauman


68
Limitations (continued)
• Non-rigid, deformable objects not captured well with
representations assuming a fixed 2d structure; or must assume
fixed viewpoint
• Objects with less-regular textures not captured well with holistic
appearance-based descriptions

Slide: Kristen Grauman


69
Limitations (continued)

Sliding window Detector’s view

• If windows are considered in isolation, context is lost

Figure credit: Derek Hoiem


Slide: Kristen Grauman
70
Limitations (continued)
• In practice, often entails large, cropped training set
(expensive)
• Requiring good match to a global appearance
description can lead to sensitivity to partial occlusions

Slide: Kristen Grauman


71
Object proposals

72
Object proposals
Main idea:
• Learn to generate category-independent regions/boxes
that have object-like properties.
• Let object detector search over “proposals”, not
exhaustive sliding windows

Alexe et al. Measuring the objectness of image windows, PAMI 2012


73
Object proposals

Multi-scale
saliency

Color
contrast

Alexe et al. Measuring the objectness of image windows, PAMI 2012


74
Object proposals
Edge density Superpixel straddling

Alexe et al. Measuring the objectness of image windows, PAMI 2012


75
Object proposals
(Yellow box: object detected; cyan box: ground truth)
More proposals

Alexe et al. Measuring the objectness of image windows, PAMI 2012


76
Deformable Part Model (DPM)
• Represents an object as a
collection of parts arranged in a
deformable configuration
• Each part represents local
appearances
• Spring-like connections between
certain pairs of parts

Fischler and Elschlager, Pictorial Structures, 1973
Felzenszwalb et al., PAMI 2010
78
Deformable Part Model (DPM)

79
Deformable Part Model (DPM)
• References
‒ Pedro F. Felzenszwalb & Daniel P. Huttenlocher, Pictorial Structures for Object
Recognition, IJCV 2005
• https://www.cs.cornell.edu/~dph/papers/pict-struct-ijcv.pdf
‒ P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object
detection with discriminatively trained part based models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010

80
Object detection: Evaluation

81
Object Detection Benchmarks
• PASCAL VOC Challenge
• ImageNet Large Scale Visual Recognition Challenge
(ILSVRC)
‒ 200 categories for detection

• Common Objects in Context (COCO)


‒ 80 Object categories

82
How do we evaluate object detection?

predictions
ground truth
True positive:
- The overlap of the prediction
with the ground truth is MORE
than a threshold value (0.5)

83
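The "overlap" used in the true-positive definition above is the intersection-over-union (IoU) of the two boxes, which can be sketched directly (a minimal illustration assuming axis-aligned boxes given as (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # zero when boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a true positive when its IoU with some ground-truth box is at least the threshold (0.5 here).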
How do we evaluate object detection?

predictions
ground truth
True positive:
False positive:
- The overlap of the prediction
with the ground truth is LESS
than a threshold value (0.5)

84
How do we evaluate object detection?

predictions
ground truth
True positive:
False positive:
False negative:
- The objects that our model
doesn’t find

85
How do we evaluate object detection?

predictions
ground truth

True positive:
False positive:
False negative:
- The objects that our model
doesn’t find

What is a True Negative?

86
precision = TP / (TP + FP)

recall = TP / (TP + FN)

87
How do we evaluate object detection?

predictions
ground truth
True positive: 1
False positive: 2
False negative: 1

So what is the
- precision?
- recall?

88
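The counting in the example above can be sketched as code (an illustrative greedy matcher, one of several matching conventions used in practice; each ground-truth box may be claimed by at most one prediction):

```python
def _iou(a, b):
    """IoU of axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def evaluate_detections(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily match predictions to ground truth; return counts and metrics."""
    matched = set()
    tp = 0
    for p in pred_boxes:
        best_iou, best_g = 0.0, None
        for g, gt in enumerate(gt_boxes):
            if g not in matched:
                o = _iou(p, gt)
                if o > best_iou:
                    best_iou, best_g = o, g
        if best_iou >= iou_thresh:
            matched.add(best_g)       # this ground-truth box is now claimed
            tp += 1
    fp = len(pred_boxes) - tp         # predictions that matched nothing
    fn = len(gt_boxes) - tp           # objects the model never found
    precision = tp / (tp + fp) if pred_boxes else 0.0
    recall = tp / (tp + fn) if gt_boxes else 0.0
    return tp, fp, fn, precision, recall
```

With one matched prediction, two spurious boxes, and one missed object (as in the slide), this yields precision 1/3 and recall 1/2.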
Precision versus recall
• Precision:
‒ how many of the object detections are correct?
‒ precision = TP / (TP + FP)
• Recall:
‒ how many of the ground truth objects can the model detect?
‒ also called True Positive Rate (TPR)
‒ recall = TP / (TP + FN)

89
• In reality, our model makes a lot of predictions with varying scores
between 0 and 1

predictions
ground truth

Here are all the boxes that


are predicted with score > 0.

This means that our


- Recall is perfect!
- But our precision is BAD!

90
How do we evaluate object detection?

predictions
ground truth

Here are all the boxes that


are predicted with score > 0.5

We are setting a threshold of


0.5

91
Precision – recall curve (PR curve)

92
Which model is the best?

93
Which model is the best?

• Area under curve (AUC), average precision (AP)


• F1-score (highest value at the optimal confidence score threshold)

94
Which model is the best?

AP: the average precision for each class individually,
computed across the IoU thresholds

mAP: the mean of AP over all classes

95
Summary
• Object recognition as classification task
‒ Boosting (face detection ex)
‒ Support vector machines and HOG (human detection
ex)
‒ Sliding window search paradigm
• Pros and cons
• Speed up with attentional cascade
• Object proposals, proposal regions as alternative

96
References
Most of these slides were adapted from:

1. Kristen Grauman (CS 376: Computer Vision, Spring 2018, The


University of Texas at Austin)

97
Thank
you!

98
