Histogram of Oriented Gradients
(HOG) for Object Detection
Navneet DALAL
Joint work with
Bill TRIGGS and Cordelia SCHMID
Goal & Challenges
Goal: Detect and localise people in images
and videos
n Wide variety of articulated poses
n Variable appearance and clothing
n Complex backgrounds
n Unconstrained illumination
n Occlusions, different scales
n Video sequences involve motion of the subject, the camera and objects in the background
Main assumption: upright fully visible
people
2
Chronology
n Haar Wavelets as features + AdaBoost for learning
u Viola & Jones, ICCV 2001
u De-facto standard for detecting faces in images
n Another approach: Haar wavelets + SVM:
u Papageorgiou & Poggio, 2000; Mohan et al 2000
[Haar wavelet masks: +1 -1 and +1 -2 +1]
3
Chronology
n Edge templates from Gavrila et al
n Based on Information Bottleneck (IB) principle of Tishby et al. [1999]
n Maximize mutual information (MI) between the edge fragments and the training images
J Supports irregular shapes & partial occlusions
J Window free framework
L Sensitive to edge detection & edge threshold
L Not resistant to local illumination changes
L Needs segmented positive images
At par with the then state-of-the-art
4
Chronology
n Key point detectors repeat on backgrounds
n Key point detectors do not repeat on people, even for two close by frames in a video sequence
u We evaluated Harris [Harris & Stephens 1988], LoG [Lindeberg 1998] and Harris-Laplace [Mikolajczyk & Schmid 2004]: none fired consistently on the same scene elements in consecutive frames; they could not cope with the variation in human clothing, appearance and articulation
u A super-pixel [Mori et al. 2004] based approach gave similar conclusions
n Leibe et al, 2005; Mikolajczyk et al, 2004
Needed a different approach
5
Overview of Methodology
Detection Phase
Scan image(s) at all scales and locations of a scale-space pyramid
Extract features over each detection window
Run linear SVM classifier on all locations
Fuse multiple detections in 3-D position & scale space
Output object detections with bounding boxes
Focus: building robust feature sets (static & motion)
6
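The detection phase can be sketched in a few lines. This is an illustrative sketch, not the thesis code: `extract_features`, the nearest-neighbour `downscale` helper and the parameter defaults stand in for the real HOG encoder and image pyramid.

```python
import numpy as np

def downscale(image, scale):
    """Nearest-neighbour downscaling (a crude stand-in for a real pyramid)."""
    h, w = image.shape
    nh, nw = int(h / scale), int(w / scale)
    ys = (np.arange(nh) * scale).astype(int)
    xs = (np.arange(nw) * scale).astype(int)
    return image[np.ix_(ys, xs)]

def detect(image, extract_features, svm_w, svm_b,
           win=(64, 128), stride=8, scale_ratio=1.05):
    """Scan a scale-space pyramid with a fixed-size window and score
    each window with a linear SVM; keep windows with positive score."""
    detections = []
    scale = 1.0
    while True:
        img = downscale(image, scale)
        if img.shape[0] < win[1] or img.shape[1] < win[0]:
            break
        for y in range(0, img.shape[0] - win[1] + 1, stride):
            for x in range(0, img.shape[1] - win[0] + 1, stride):
                f = extract_features(img[y:y + win[1], x:x + win[0]])
                score = float(np.dot(svm_w, f) + svm_b)
                if score > 0:
                    # report in original-image coordinates, with log-scale
                    detections.append((x * scale, y * scale, np.log(scale), score))
        scale *= scale_ratio
    return detections
```

The positive-scoring windows from all scales are then passed to the 3-D fusion step described later in the deck.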
HOG for Finding People in
Images
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005 7
Static Feature Extraction
Input image → detection window
Normalise gamma
Compute gradients
Weighted vote into spatial (cell) & orientation bins
Contrast normalise over overlapping blocks of cells
Collect HOGs over detection window
Feature vector f = [ ..., ..., ...] → linear SVM
8
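As a concrete illustration of the voting step, here is a minimal cell-histogram sketch using the [1 0 -1] masks and hard orientation binning. It is a hypothetical simplification: the real descriptor also interpolates votes bilinearly between neighbouring bins and cells, and block-normalises afterwards.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Gradients with centred [1 0 -1] masks (no smoothing), then each
    pixel votes its gradient magnitude into the orientation bin of its
    spatial cell. Returns an (n_cells_y, n_cells_x, bins) histogram grid."""
    img = img.astype(float)
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    # unsigned orientation in [0, pi)
    ang = np.arctan2(gy, gx) % np.pi
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    ny, nx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ny, nx, bins))
    for y in range(ny * cell):
        for x in range(nx * cell):
            hist[y // cell, x // cell, bin_idx[y, x]] += mag[y, x]
    return hist
```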
Overview of Learning Phase
Learning phase
First round: input annotations on training images → create fixed-resolution normalised training image data set → encode images into feature spaces → learn binary classifier
Retraining: resample negative training images to create hard examples → encode into feature spaces → relearn binary classifier → object/non-object decision
Retraining reduces false positives by an order of magnitude!
9
HOG Descriptors
Parameters
n Gradient scale
n Orientation bins
n Percentage of block overlap
Schemes
n RGB or Lab, colour/gray-space
n Block normalisation:
u L2-norm, v ← v / √(‖v‖₂² + ε²)
u L1-norm, v ← v / (‖v‖₁ + ε)
[Figure: R-HOG/SIFT block of square cells; C-HOG with centre bin and angular cells]
10
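The two block-normalisation schemes written out as code (a straightforward sketch; the function names and the value of ε are illustrative):

```python
import numpy as np

def l2_normalise(v, eps=1e-3):
    """L2 scheme: v <- v / sqrt(||v||_2^2 + eps^2)."""
    return v / np.sqrt(np.sum(v * v) + eps * eps)

def l1_normalise(v, eps=1e-3):
    """L1 scheme: v <- v / (||v||_1 + eps)."""
    return v / (np.sum(np.abs(v)) + eps)
```

Each block's concatenated cell histograms go through one of these before the blocks are collected into the window-level feature vector.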
Evaluation Data Sets
          MIT pedestrian database         INRIA person database
Train     507 positive windows;           1208 positive windows,
          negative data unavailable       1218 negative images
Test      200 positive windows;           566 positive windows,
          negative data unavailable       453 negative images
Overall   709 annotations + reflections   1774 annotations + reflections
11
Overall Performance
MIT pedestrian database INRIA person database
n R/C-HOG give near perfect separation on MIT database
n 1-2 orders of magnitude lower false positives than other descriptors
12
Performance on INRIA Database
13
Effect of Parameters
Gradient smoothing, σ Orientation bins, β
n Reducing gradient scale n Increasing orientation bins
from 3 to 0 decreases false from 4 to 9 decreases false
positives by 10 times positives by 10 times
14
Normalisation Method & Block Overlap
Normalisation method Block overlap
n Strong local normalisation n Overlapping blocks improve
is essential performance, but descriptor
size increases
15
Effect of Block and Cell Size
[Figure: 64×128 detection window with varying cell and block sizes]
n Trade off between need for local spatial invariance and
need for finer spatial resolution
16
Descriptor Cues
[Figure columns: input example, average gradients, weighted positive weights, weighted negative weights, outside-in weights]
n Most important cues are head, shoulder, leg silhouettes
n Vertical gradients inside a person are counted as negative
n Overlapping blocks just outside the contour are most
important
17
Multi-Scale Object Localisation
Multi-scale dense scan of detection windows gives scores in (x, y, s) space (s in log scale)
Clip detection score, apply bias and threshold
Η_i = diag[ exp(s_i)σ_x, exp(s_i)σ_y, σ_s ]
f(x) = Σ_i w_i exp( −(x − x_i)ᵀ Η_i⁻¹ (x − x_i) / 2 )
Apply robust mode detection, like mean shift, to obtain final detections
18
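A minimal mean-shift sketch of this fusion step. For simplicity it uses one shared diagonal bandwidth `sigma` instead of the per-detection Η_i above, and the merge radius is an illustrative assumption.

```python
import numpy as np

def mean_shift_modes(points, weights, sigma, iters=50, tol=1e-5, merge_r=0.1):
    """Detections live in (x, y, log-scale) space. Start one hill climb
    from every detection, move it to the weighted Gaussian mean of its
    neighbourhood until it stops, then merge coincident modes."""
    modes = []
    for p in points:
        m = p.astype(float)
        for _ in range(iters):
            d = (points - m) / sigma
            k = weights * np.exp(-0.5 * np.sum(d * d, axis=1))
            new = (k[:, None] * points).sum(axis=0) / k.sum()
            if np.linalg.norm(new - m) < tol:
                break
            m = new
        modes.append(m)
    # merge hill climbs that converged to the same mode
    uniq = []
    for m in modes:
        if not any(np.linalg.norm(m - u) < merge_r for u in uniq):
            uniq.append(m)
    return uniq
```

Each surviving mode becomes one final detection; its (x, y, s) coordinates give the fused bounding box position and scale.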
Effect of Spatial Smoothing
n Spatial smoothing aspect ratio follows the window shape; smallest sigma approximately equal to the stride/cell size
n Relatively independent of scale smoothing; sigma of 0.4 to 0.7 octaves gives good results
19
Effect of Other Parameters
Different mappings Effect of scale-ratio
n Hard clipping of SVM scores gives better results than a simple probabilistic mapping of these scores
n Fine scale sampling helps improve recall
20
HOGs vs Approaches to Date…
HOG is still among the best detectors in terms of false positives per image (FPPI)
n See Dollár et al, CVPR 2009, Pedestrian Detection: A Benchmark
[Figure: miss rate vs. false positives per image; legend miss rates: VJ 0.85, Shapelet 0.80, HikSvm 0.77, LatSvm 0.63, FtrMine 0.55, MultiFtr 0.54, HOG 0.44]
21
Results Using Static HOG
No temporal smoothing of detections 22
Conclusions for Static Case
n Fine grained features improve performance
u Rectify fine gradients then pool spatially
• No gradient smoothing, [1 0 -1] derivative mask
• Orientation voting into fine bins
• Spatial voting into coarser bins
u Use gradient magnitude (no thresholding)
u Strong local normalization
u Use overlapping blocks
u Robust non-maximum suppression
• Fine scale sampling, hard clipping & anisotropic kernel
J Human detection rate of 90% at 10⁻⁴ false positives per window
L Slower than integral images of Viola & Jones, 2001
23
Applications to Other Classes
M. Everingham et al. The 2005 PASCAL Visual Object Classes Challenge. Proceedings of the PASCAL Challenge Workshop, 2006. 24
Motion HOG for Finding People
in Videos
N. Dalal, B. Triggs and C. Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. ECCV, 2006. 25
Finding People in Videos
n Motivation
u Human motion is very
characteristic
n Requirements
u Must work for moving
camera and background
u Robust coding of relative
motion of human parts
n Previous works
u Viola et al, 2003
u Gavrila et al, 2004
u Efros et al, 2003
[Figure courtesy: R. Blake, Vanderbilt Univ]
26
Handling Camera Motion
n Camera motion characterisation
u Pan and tilt is locally translational
u Rest is depth induced motion parallax
n Use local differential of flow
u Cancels out effects of camera rotation
u Highlights 3D depth boundaries
u Highlights motion boundaries
n Robust encoding into oriented histograms
u Some schemes focus on capturing motion boundaries
u Others focus on capturing internal motion, i.e. the relative dynamics of different limbs
27
Motion HOG Processing Chain
Input image + consecutive image(s) → detection windows
Normalise gamma & colour
Compute optical flow (flow field, magnitude of flow)
Compute differential flow (x and y differentials)
Accumulate votes for differential flow orientation over spatial cells
Normalise contrast within overlapping blocks of cells
Collect HOGs for all blocks over detection window
28
Overview of Feature Extraction
Input image + consecutive image(s)
Appearance channel: static HOG encoding
Motion channel: motion HOG encoding
Collect HOGs over detection window → linear SVM → object/non-object decision

Data set
Train: 5 DVDs, 182 shots, 5562 positive windows
Test 1: same 5 DVDs, 50 shots, 1704 positive windows
Test 2: 6 new DVDs, 128 shots, 2700 positive windows
29
Coding Motion Boundaries
n Treat x, y-flow components as independent images
n Take their local gradients separately, and compute HOGs as in static images
[Figure: first frame, second frame, estimated flow, flow magnitude; x-/y-flow differentials and their averages]
Motion Boundary Histograms (MBH) encode depth and motion boundaries
30
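MBH in miniature: each flow component goes through exactly the static-HOG gradient-and-vote machinery. This is a hedged sketch (hard binning, no interpolation or block normalisation) with an illustrative API.

```python
import numpy as np

def mbh(flow_x, flow_y, cell=8, bins=9):
    """Treat each flow component as a grey-level image, take its
    [1 0 -1] spatial derivatives, and histogram the resulting gradient
    orientations per cell. Returns one histogram grid per component."""
    def grad(img):
        gx = np.zeros_like(img); gy = np.zeros_like(img)
        gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
        gy[1:-1, :] = img[2:, :] - img[:-2, :]
        return gx, gy
    def cell_hist(gx, gy):
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % np.pi
        b = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
        ny, nx = gx.shape[0] // cell, gx.shape[1] // cell
        h = np.zeros((ny, nx, bins))
        for y in range(ny * cell):
            for x in range(nx * cell):
                h[y // cell, x // cell, b[y, x]] += mag[y, x]
        return h
    # one histogram grid per flow component; concatenated downstream
    return (cell_hist(*grad(flow_x.astype(float))),
            cell_hist(*grad(flow_y.astype(float))))
```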
Coding Internal Dynamics
n Ideally compute relative displacements
of different limbs
u Requires reliable part detectors
n Parts are relatively localised in our
detection windows
n Allows different coding schemes based
on fixed spatial differences
Internal Motion Histograms (IMH) encode
relative dynamics of different regions
31
…IMH Continued
n Simple difference
u Take x, y differentials of flow vector images [Ix, Iy]
u Variants may use larger spatial displacements while differencing, e.g. [1 0 0 0 -1]
n Center cell difference
n Wavelet-style cell differences
[Figure: centre-difference and wavelet-style mask layouts over cells]
32
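The larger-displacement variant can be sketched as a 1-D correlation of each flow row with the mask; the function name and API here are illustrative.

```python
import numpy as np

def imh_differences(flow, mask=(1, 0, 0, 0, -1)):
    """Apply a horizontal difference mask such as [1 0 0 0 -1] to a flow
    component: each output pixel is the difference between flow values
    4 pixels apart, encoding relative motion of nearby regions rather
    than absolute flow. Border columns are left at zero."""
    m = np.asarray(mask, dtype=float)
    r = len(m) // 2
    out = np.zeros_like(flow, dtype=float)
    h, w = flow.shape
    for x in range(r, w - r):
        # correlate each row with the mask (no flipping; it is a vote mask)
        out[:, x] = sum(m[k] * flow[:, x + k - r] for k in range(len(m)))
    return out
```

The vertical variant is the same operation on columns; the resulting differentials then feed the usual orientation-voting stage.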
Flow Methods
n Proesmans' flow [Proesmans et al. ECCV 1994]
u 15 seconds per frame
n Our flow method
u Multi-scale pyramid based method, no regularization
u Brightness constancy based damped least squares solution
[x, y]ᵀ = (AᵀA + βI)⁻¹ Aᵀb over a 5×5 window
u 1 second per frame
n MPEG-4 based block matching
u Runs in real-time
[Figure: input image, Proesmans' flow, our multi-scale flow]
33
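The damped least squares update above, written out for a single window. This is a sketch of one solve under the brightness-constancy assumption, with centred derivatives; the actual method runs such solves within a multi-scale pyramid.

```python
import numpy as np

def damped_ls_flow(window1, window2, beta=0.01):
    """One damped least squares solve for a small image window:
    stack spatial gradients into A, brightness changes into b, and
    solve [u, v]^T = (A^T A + beta*I)^{-1} A^T b."""
    im1 = window1.astype(float); im2 = window2.astype(float)
    # centred spatial derivatives and temporal difference
    Ix = np.zeros_like(im1); Iy = np.zeros_like(im1)
    Ix[:, 1:-1] = (im1[:, 2:] - im1[:, :-2]) / 2
    Iy[1:-1, :] = (im1[2:, :] - im1[:-2, :]) / 2
    It = im2 - im1
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (n_pixels, 2)
    b = -It.ravel()                                   # Ix*u + Iy*v = -It
    uv = np.linalg.solve(A.T @ A + beta * np.eye(2), A.T @ b)
    return uv   # estimated (u, v) displacement for this window
```

The damping term βI keeps the solve well conditioned in textureless windows, which is why no extra regularization across the image is needed.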
Performance Comparison
Only motion information Appearance + motion
n With motion only, MBH n Combined with appearance,
scheme on Proesmans centre difference IMH
flow works best performs best
34
Trained on Static & Flow
Tested on flow only Tested on appearance + flow
n Adding static images during test reduces performance
margin
n No deterioration in performance on static images
35
Motion HOG Video
No temporal smoothing, each pair of frames treated
independently
37
Recall-Precision for Motion HOG
[Figure: recall vs. false positives per image on the ETH-01/02/03 sequences, comparing HOG / IMHwd / Haar feature combinations with linear SVM, HIKSVM, AdaBoost and MPLBoost classifiers against the full system of Ess et al. (ICCV'07)]
n Wojek et al, CVPR 09
n Robust regularized flow + max in non-max suppression
38
Conclusions for Motion HOG
n Summary
u When combined with appearance, IMH outperforms MBH
u Regularization in flow estimates reduces performance
u MPEG-4 block matching looks good visually, but its motion estimates are not good enough for detection
u Larger spatial difference masks help
u Strong local normalization is very important
u Relatively insensitive to number of orientation bins
J Window classifier reduces false positives by 10 times
L Slow compared to static HOG (probably not any more — FlowLib from
GPU4Vision)
39
Summary
n Bottom-up approach to object detection
n Robust feature encoding for person detection
n Gives state-of-the-art results for person detection
n Also works well for other object classes
n Proposed differential motion feature vectors for feature extraction from videos
40
Extensions
n Real time feature computation (Wojek et al, DAGM 08;
Wang et al, ICCV 09)
n AdaBoost rejection cascade algorithms (Zhu et al, CVPR
06; Laptev, BMVC 06)
n Part based detector for partial occlusions (Felzenszwalb
et al, PAMI 09; Wang et al, ICCV 09)
n Motion HOG extended (Wojek et al, CVPR 09; Laptev et
al, CVPR 08)
n Histogram intersection kernel (Maji et al, CVPR 2008,
CVPR 2009, ICCV 2009)
n Higher level image analysis (Hoiem IJCV 08)
41
Features for Object Detection
n Local Binary Pattern
u Wang et al, ICCV 2009
n Co-occurrence Matrices + HOG + PLS
u Schwartz et al ICCV 2009
n Color HOG (Discriminative segmentation of fg/bg
regions)
u Ott & Everingham, ICCV 2009
42
Founder & CEO
Gesture Detection Using Webcams
43
Complete Lean Back Experience
44
Beta Launch in July 2011
n State of art work in research & engineering
n Candidates for usability studies
n Summer internships
Contact: [email protected]
http://botsquare.com
45
Thank You
Contact: [email protected]
46