
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010 1627

Object Detection with Discriminatively Trained Part-Based Models
Pedro F. Felzenszwalb, Member, IEEE Computer Society, Ross B. Girshick, Student Member, IEEE,
David McAllester, and Deva Ramanan, Member, IEEE

Abstract—We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to
represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While
deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the
PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-
sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of
MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is
specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive
examples and optimizing the latent SVM objective function.

Index Terms—Object recognition, deformable models, pictorial structures, discriminative training, latent SVM.

P.F. Felzenszwalb and R.B. Girshick are with the Department of Computer Science, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637. E-mail: {pff, rbg}@cs.uchicago.edu.
D. McAllester is with the Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL 60637. E-mail: [email protected].
D. Ramanan is with the Department of Computer Science, University of California, Irvine, 3019 Donald Bren Hall, Irvine, CA 92697. E-mail: [email protected].
Manuscript received 29 May 2009; accepted 25 July 2009; published online 10 Sept. 2009.
Recommended for acceptance by B. Triggs.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2009-05-0336.
Digital Object Identifier no. 10.1109/TPAMI.2009.167.
0162-8828/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION
Object recognition is one of the fundamental challenges in computer vision. In this paper, we consider the problem of detecting and localizing generic objects from categories such as people or cars in static images. This is a difficult problem because objects in such categories can vary greatly in appearance. Variations arise not only from changes in illumination and viewpoint, but also due to nonrigid deformations and intraclass variability in shape and other visual properties. For example, people wear different clothes and take a variety of poses, while cars come in various shapes and colors.

We describe an object detection system that represents highly variable objects using mixtures of multiscale deformable part models. These models are trained using a discriminative procedure that only requires bounding boxes for the objects in a set of images. The resulting system is both efficient and accurate, achieving state-of-the-art results on the PASCAL VOC benchmarks [11], [12], [13] and the INRIA Person data set [10].

Our approach builds on the pictorial structures framework [15], [20]. Pictorial structures represent objects by a collection of parts arranged in a deformable configuration. Each part captures local appearance properties of an object while the deformable configuration is characterized by spring-like connections between certain pairs of parts.

Deformable part models such as pictorial structures provide an elegant framework for object detection. Yet it has been difficult to establish their value in practice. On difficult data sets, deformable part models are often outperformed by simpler models such as rigid templates [10] or bag-of-features [44]. One of the goals of our work is to address this performance gap.

While deformable models can capture significant variations in appearance, a single deformable model is often not expressive enough to represent a rich object category. Consider the problem of modeling the appearance of bicycles in photographs. People build bicycles of different types (e.g., mountain bikes, tandems, and 19th-century cycles with one big wheel and a small one) and view them in various poses (e.g., frontal versus side views). The system described here uses mixture models to deal with these more significant variations.

We are ultimately interested in modeling objects using “visual grammars.” Grammar-based models (e.g., [16], [24], [45]) generalize deformable part models by representing objects using variable hierarchical structures. Each part in a grammar-based model can be defined directly or in terms of other parts. Moreover, grammar-based models allow for, and explicitly model, structural variations. These models also provide a natural framework for sharing information and computation between different object classes. For example, different models might share reusable parts.

Although grammar-based models are our ultimate goal, we have adopted a research methodology under which we gradually move toward richer models while maintaining a high level of performance. Improving performance by enriched models is surprisingly difficult. Simple models have historically outperformed sophisticated models in computer vision, speech recognition, machine translation, and information retrieval. For example, until recently speech recognition and machine translation systems based on n-gram language models outperformed systems based on grammars and phrase structure. In our experience, maintaining performance seems to require gradual enrichment of the model.
One reason why simple models can perform better in practice is that rich models often suffer from difficulties in training. For object detection, rigid templates and bag-of-features models can be easily trained using discriminative methods such as support vector machines (SVM). Richer models are more difficult to train, in particular because they often make use of latent information.

Consider the problem of training a part-based model from images labeled only with bounding boxes around the objects of interest. Since the part locations are not labeled, they must be treated as latent (hidden) variables during training. More complete labeling might support better training, but it can also result in inferior training if the labeling used suboptimal parts. Automatic part labeling has the potential to achieve better performance by automatically finding effective parts. More elaborate labeling is also time consuming and expensive.

The Dalal-Triggs detector [10], which won the 2006 PASCAL object detection challenge, used a single filter on histogram of oriented gradients (HOG) features to represent an object category. This detector uses a sliding window approach, where a filter is applied at all positions and scales of an image. We can think of the detector as a classifier which takes as input an image, a position within that image, and a scale. The classifier determines whether or not there is an instance of the target category at the given position and scale. Since the model is a simple filter, we can compute a score as $\beta \cdot \Phi(x)$, where $\beta$ is the filter, $x$ is an image with a specified position and scale, and $\Phi(x)$ is a feature vector. A major innovation of the Dalal-Triggs detector was the construction of particularly effective features.

Our first innovation involves enriching the Dalal-Triggs model using a star-structured part-based model defined by a “root” filter (analogous to the Dalal-Triggs filter) plus a set of part filters and deformation models. The score of one of our star models at a particular position and scale within an image is the score of the root filter at the given location plus the sum over parts of the maximum, over placements of that part, of the part filter score at its location minus a deformation cost measuring the deviation of the part from its ideal location relative to the root. Both root and part filter scores are defined by the dot product between a filter (a set of weights) and a subwindow of a feature pyramid computed from the input image. Fig. 1 shows a star model for the person category.

Fig. 1. Detections obtained with a single component person model. The model is defined by (a) a coarse root filter, (b) several higher resolution part filters, and (c) a spatial model for the location of each part relative to the root. The filters specify weights for histogram of oriented gradients features. Their visualization shows the positive weights at different orientations. The visualization of the spatial models reflects the “cost” of placing the center of a part at different locations relative to the root.

In our models, the part filters capture features at twice the spatial resolution relative to the features captured by the root filter. In this way, we model visual appearance at multiple scales.

To train models using partially labeled data, we use a latent variable formulation of MI-SVM [3] that we call latent SVM (LSVM). In a latent SVM, each example $x$ is scored by a function of the following form:

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \tag{1}$$

Here, $\beta$ is a vector of model parameters, $z$ are latent values, and $\Phi(x, z)$ is a feature vector. In the case of one of our star models, $\beta$ is the concatenation of the root filter, the part filters, and deformation cost weights, $z$ is a specification of the object configuration, and $\Phi(x, z)$ is a concatenation of subwindows from a feature pyramid and part deformation features.

We note that (1) can handle very general forms of latent information. For example, $z$ could specify a derivation under a rich visual grammar.

Our second class of models represents an object category by a mixture of star models. The score of a mixture model at a particular position and scale is the maximum over components of the score of that component model at the given location. In this case the latent information, $z$, specifies a component label and a configuration for that component. Fig. 2 shows a mixture model for the bicycle category.

To obtain high performance using discriminative training it is often important to use large training sets. In the case of object detection, the training problem is highly unbalanced because there is vastly more background than objects. This motivates a process of searching through the background data to find a relatively small number of potential false positives, or hard negative examples.

A methodology of data mining for hard negative examples was adopted by Dalal and Triggs [10], but goes back at least to the bootstrapping methods used by Sung and Poggio [38] and Rowley et al. [35]. Here, we analyze data-mining algorithms for SVM and LSVM training. We prove that data-mining methods can be made to converge to the optimal model defined in terms of the entire training set.

Our object models are defined by filters that score subwindows of a feature pyramid. We have investigated feature sets similar to the HOG features from [10] and found lower dimensional features which perform as well as the original ones. By doing principal component analysis (PCA) on HOG features, the dimensionality of the feature vector can be significantly reduced with no noticeable loss
of information. Moreover, by examining the principal eigenvectors, we discover structure that leads to “analytic” versions of low-dimensional features which are easily interpretable and can be computed efficiently.

Fig. 2. Detections obtained with a two-component bicycle model. These examples illustrate the importance of deformations and mixture models. In this model, the first component captures sideways views of bicycles while the second component captures frontal and near frontal views. The sideways component can deform to match a “wheelie.”

We have also considered some specific problems that arise in the PASCAL object detection challenge and similar data sets. We show how the locations of parts in an object hypothesis can be used to predict a bounding box for the object. This is done by training a model-specific predictor using least-squares regression. We also demonstrate a simple method for aggregating the output of several object detectors. The basic idea is that objects of some categories provide evidence for, or against, objects of other categories in the same image. We exploit this idea by training a category-specific classifier that rescores every detection of that category using its original score and the highest scoring detection from each of the other categories.

2 RELATED WORK
There is a significant body of work on deformable models of various types for object detection, including several kinds of deformable template models (e.g., [7], [8], [21], [43]) and a variety of part-based models (e.g., [2], [6], [9], [15], [18], [20], [28], [42]).

In the constellation models from [18], [42], parts are constrained to be in a sparse set of locations determined by an interest point operator, and their geometric arrangement is captured by a Gaussian distribution. In contrast, pictorial structure models [15], [20] define a matching problem where parts have an individual match cost in a dense set of locations, and their geometric arrangement is captured by a set of “springs” connecting pairs of parts. The patchwork of parts model from [2] is similar, but it explicitly considers how the appearance models of overlapping parts interact.

Our models are largely based on the pictorial structures framework from [15], [20]. We use a dense set of possible positions and scales in an image, and define a score for placing a filter at each of these locations. The geometric configuration of the filters is captured by a set of deformation costs (“springs”) connecting each part filter to the root filter, leading to a star-structured pictorial structure model. Note that we do not model interactions between overlapping parts. While we might benefit from modeling such interactions, this does not appear to be a problem when using models trained with a discriminative procedure, and it significantly simplifies the problem of matching a model to an image.

The introduction of new local and semilocal features has played an important role in advancing the performance of object recognition methods. These features are typically invariant to illumination changes and small deformations. Many recent approaches use wavelet-like features [30], [41] or locally normalized histograms of gradients [10], [29]. Other methods, such as [5], learn dictionaries of local structures from training images. In our work, we use histogram of gradient (HOG) features from [10] as a starting point, and introduce a variation that reduces the feature size with no loss in performance. As in [26], we used PCA to discover low-dimensional features, but we note that the eigenvectors we obtain have a clear structure that leads to a new set of
“analytic” features. This removes the need to perform a costly projection step when computing dense feature maps.

Significant variations in shape and appearance, such as those caused by extreme viewpoint changes, are not well captured by a 2D deformable model. Aspect graphs [31] are a classical formalism for capturing significant changes that are due to viewpoint variation. Mixture models provide a simpler alternative approach. For example, it is common to use multiple templates to encode frontal and side views of faces and cars [36]. Mixture models have been used to capture other aspects of appearance variation as well, such as when there are multiple natural subclasses in an object category [5].

Matching a deformable model to an image is a difficult optimization problem. Local search methods require initialization near the correct solution [2], [7], [43]. To guarantee a globally optimal match, more aggressive search is needed. One popular approach for part-based models is to restrict part locations to a small set of possible locations returned by an interest point detector [1], [18], [42]. Tree (and star) structured pictorial structure models [9], [15], [19] allow for the use of dynamic programming and generalized distance transforms to efficiently search over all possible object configurations in an image, without restricting the possible locations for each part. We use these techniques for matching our models to images.

Part-based deformable models are parameterized by the appearance of each part and a geometric model capturing spatial relationships among parts. For generative models, one can learn model parameters using maximum likelihood estimation. In a fully supervised setting, training images are labeled with part locations and models can often be learned using simple methods [9], [15]. In a weakly supervised setting, training images may not specify locations of parts. In this case, one can simultaneously estimate part locations and learn model parameters with EM [2], [18], [42].

Discriminative training methods select model parameters so as to minimize the mistakes of a detection algorithm on a set of training images. Such approaches directly optimize the decision boundary between positive and negative examples. We believe that this is one reason for the success of simple models trained with discriminative methods, such as the Viola-Jones [41] and Dalal-Triggs [10] detectors. It has been more difficult to train part-based models discriminatively, though strategies exist [4], [23], [32], [34].

Latent SVMs are related to hidden CRFs [32]. However, in a latent SVM, we maximize over latent part locations as opposed to marginalizing over them, and we use a hinge loss rather than log loss in training. This leads to an efficient coordinate-descent style algorithm for training, as well as a data-mining algorithm that allows for learning with very large data sets. A latent SVM can be viewed as a type of energy-based model [27].

A latent SVM is equivalent to the MI-SVM formulation of multiple instance learning (MIL) in [3], but we find the latent variable formulation more natural for the problems we are interested in.¹ A different MIL framework was previously used for training object detectors with weakly labeled data in [40].

¹ We defined a latent SVM in [17] before realizing the relationship to MI-SVM.

Our method for data-mining hard examples during training is related to working set methods for SVMs (e.g., [25]). The approach described here requires relatively few passes through the complete set of training examples and is particularly well suited for training with very large data sets, where only a fraction of the examples can fit in RAM.

The use of context for object detection and recognition has received increasing attention in recent years. Some methods (e.g., [39]) use low-level holistic image features for defining likely object hypotheses. The method in [22] uses a coarse but semantically rich representation of a scene, including its 3D geometry, estimated using a variety of techniques. Here, we define the context of an image using the results of running a variety of object detectors in the image. The idea is related to [33] where a CRF was used to capture co-occurrences of objects, although we use a very different approach to capture this information.

A preliminary version of our system was described in [17]. The system described here differs from the one in [17] in several ways, including the introduction of mixture models; here, we optimize the true latent SVM objective function using stochastic gradient descent, while in [17] we used an SVM package to optimize a heuristic approximation of the objective; here, we use new features that are both lower dimensional and more informative; we now postprocess detections via bounding box prediction and context rescoring.

3 MODELS
All of our models involve linear filters that are applied to dense feature maps. A feature map is an array whose entries are $d$-dimensional feature vectors computed from a dense grid of locations in an image. Intuitively, each feature vector describes a local image patch. In practice, we use a variation of the HOG features from [10], but the framework described here is independent of the specific choice of features.

A filter is a rectangular template defined by an array of $d$-dimensional weight vectors. The response, or score, of a filter $F$ at a position $(x, y)$ in a feature map $G$ is the “dot product” of the filter and a subwindow of the feature map with top-left corner at $(x, y)$:

$$\sum_{x', y'} F[x', y'] \cdot G[x + x', y + y'].$$

We would like to define a score at different positions and scales in an image. This is done using a feature pyramid which specifies a feature map for a finite number of scales in a fixed range. In practice, we compute feature pyramids by computing a standard image pyramid via repeated smoothing and subsampling, and then computing a feature map from each level of the image pyramid. Fig. 3 illustrates the construction.

The scale sampling in a feature pyramid is determined by a parameter $\lambda$ defining the number of levels in an octave. That is, $\lambda$ is the number of levels we need to go down in the pyramid to get to a feature map computed at twice the resolution of another one. In practice, we have used $\lambda = 5$ in training and $\lambda = 10$ at test time. Fine sampling of scale space is important for obtaining high performance with our models.

The system in [10] uses a single filter to define an object model. That system detects objects by computing the score of the filter at each position and scale of a HOG feature pyramid and thresholding the scores.
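As a minimal sketch of the filter response defined above, the following Python code scores a filter at every valid position of a single feature map. The nested-list layout (`G[y][x]` holds a d-dimensional vector) and the names `dot`, `filter_response`, and `response_map` are assumptions made for illustration and are not part of the system described here; a practical implementation would likely use an optimized cross-correlation routine.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def filter_response(F, G, x, y):
    """Response of filter F with its top-left corner at (x, y) in feature map G.

    F[y'][x'] and G[y][x] are d-dimensional vectors (lists of floats).
    Implements sum over (x', y') of F[x', y'] . G[x + x', y + y'].
    """
    return sum(
        dot(F[yp][xp], G[y + yp][x + xp])
        for yp in range(len(F))
        for xp in range(len(F[0]))
    )


def response_map(F, G):
    """Responses at every position where the filter fits inside the map."""
    h, w = len(F), len(F[0])
    rows, cols = len(G) - h + 1, len(G[0]) - w + 1
    return [[filter_response(F, G, x, y) for x in range(cols)]
            for y in range(rows)]


# Toy example (hypothetical data): a 3x4 feature map with d = 2
# and a 2x2 filter.
G = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
     [[0.0, 2.0], [1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
     [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
F = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = response_map(F, G)
```

For this toy input, `R` is `[[5.0, 1.0, 4.0], [1.0, 3.0, 0.0]]`; thresholding such a map at each pyramid level corresponds to the single-filter detection scheme described for [10].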
Fig. 3. A feature pyramid and an instantiation of a person model within that pyramid. The part filters are placed at twice the spatial resolution of the placement of the root.

Let $F$ be a $w \times h$ filter. Let $H$ be a feature pyramid and $p = (x, y, l)$ specify a position $(x, y)$ in the $l$th level of the pyramid. Let $\phi(H, p, w, h)$ denote the vector obtained by concatenating the feature vectors in the $w \times h$ subwindow of $H$ with top-left corner at $p$ in row-major order. The score of $F$ at $p$ is $F' \cdot \phi(H, p, w, h)$, where $F'$ is the vector obtained by concatenating the weight vectors in $F$ in row-major order. Below we write $F' \cdot \phi(H, p)$ since the subwindow dimensions are implicitly defined by the dimensions of the filter $F$.

3.1 Deformable Part Models
Our star models are defined by a coarse root filter that approximately covers an entire object and higher resolution part filters that cover smaller parts of the object. Fig. 3 illustrates an instantiation of such a model in a feature pyramid. The root filter location defines a detection window (the pixels contributing to the region of the feature map covered by the filter). The part filters are placed $\lambda$ levels down in the pyramid, so the features at that level are computed at twice the resolution of the features in the root filter level.

We have found that using higher resolution features for defining part filters is essential for obtaining high recognition performance. With this approach, the part filters capture finer resolution features that are localized to greater accuracy when compared to the features captured by the root filter. Consider building a model for a face. The root filter could capture coarse resolution edges such as the face boundary while the part filters could capture details such as eyes, nose, and mouth.

A model for an object with $n$ parts is formally defined by an $(n + 2)$-tuple $(F_0, P_1, \ldots, P_n, b)$, where $F_0$ is a root filter, $P_i$ is a model for the $i$th part, and $b$ is a real-valued bias term. Each part model is defined by a 3-tuple $(F_i, v_i, d_i)$ where $F_i$ is a filter for the $i$th part, $v_i$ is a two-dimensional vector specifying an “anchor” position for part $i$ relative to the root position, and $d_i$ is a four-dimensional vector specifying coefficients of a quadratic function defining a deformation cost for each possible placement of the part relative to the anchor position.

An object hypothesis specifies the location of each filter in the model in a feature pyramid, $z = (p_0, \ldots, p_n)$, where $p_i = (x_i, y_i, l_i)$ specifies the level and position of the $i$th filter. We require that the level of each part is such that the feature map at that level was computed at twice the resolution of the root level, $l_i = l_0 - \lambda$ for $i > 0$.

The score of a hypothesis is given by the scores of each filter at their respective locations (the data term) minus a deformation cost that depends on the relative position of each part with respect to the root (the spatial prior), plus the bias,

$$\mathrm{score}(p_0, \ldots, p_n) = \sum_{i=0}^{n} F'_i \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b, \tag{2}$$

where

$$(dx_i, dy_i) = (x_i, y_i) - (2(x_0, y_0) + v_i) \tag{3}$$

gives the displacement of the $i$th part relative to its anchor position and

$$\phi_d(dx, dy) = (dx, dy, dx^2, dy^2) \tag{4}$$

are deformation features.

Note that if $d_i = (0, 0, 1, 1)$, the deformation cost for the $i$th part is the squared distance between its actual position and its anchor position relative to the root. In general, the deformation cost is an arbitrary separable quadratic function of the displacements.

The bias term is introduced in the score to make the scores of multiple models comparable when we combine them into a mixture model.

The score of a hypothesis $z$ can be expressed in terms of a dot product, $\beta \cdot \psi(H, z)$, between a vector of model parameters $\beta$ and a vector $\psi(H, z)$,

$$\beta = (F'_0, \ldots, F'_n, d_1, \ldots, d_n, b), \tag{5}$$

$$\psi(H, z) = (\phi(H, p_0), \ldots, \phi(H, p_n), -\phi_d(dx_1, dy_1), \ldots, -\phi_d(dx_n, dy_n), 1). \tag{6}$$

This illustrates a connection between our models and linear classifiers. We use this relationship for learning the model parameters with the latent SVM framework.

3.2 Matching
To detect objects in an image, we compute an overall score for each root location according to the best possible placement of the parts:

$$\mathrm{score}(p_0) = \max_{p_1, \ldots, p_n} \mathrm{score}(p_0, \ldots, p_n). \tag{7}$$

High-scoring root locations define detections, while the locations of the parts that yield a high-scoring root location define a full object hypothesis.

By defining an overall score for each root location, we can detect multiple instances of an object (we assume there is at most one instance per root location). This approach is related to sliding-window detectors because we can think of $\mathrm{score}(p_0)$ as a score for the detection window specified by the root filter.

We use dynamic programming and generalized distance transforms (min-convolutions) [14], [15] to compute the best locations for the parts as a function of the root location. The resulting method is very efficient, taking $O(nk)$ time once filter responses are computed, where $n$ is the number of
1632 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010

parts in the model and k is the total number of locations in As in the case of a single component model, the score of a
the feature pyramid. We briefly describe the method here hypothesis for a mixture model can be expressed by a dot
and refer the reader to [14], [15] for more details. product between a vector of model parameters and a vector
Let Ri;l ðx; yÞ ¼ Fi0 ðH; ðx; y; lÞÞ be an array storing the ðH; zÞ. For a mixture model, the vector is the concatena-
response of the ith model filter in the lth level of the feature tion of the model parameter vectors for each component. The
pyramid. The matching algorithm starts by computing vector ðH; zÞ is sparse, with nonzero entries defined by
these responses. Note that Ri;l is a cross-correlation between ðH; z0 Þ in a single interval matching the interval of c in :
Fi and level l of the feature pyramid.
After computing filter responses, we transform the ¼ ð1 ; . . . ; m Þ: ð11Þ
responses of the part filters to allow for spatial uncertainty:
ðH; zÞ ¼ ð0; . . . ; 0; ðH; z0 Þ; 0; . . . ; 0Þ: ð12Þ
Di;l ðx; yÞ ¼ max Ri;l ðx þ dx; y þ dyÞ di d ðdx; dyÞ : ð8Þ
dx;dy 0
With this construction, ðH; zÞ ¼ c ðH; z Þ.
This transformation spreads high filter scores to nearby To detect objects using a mixture model, we use the
locations, taking into account the deformation costs. The matching algorithm described above to find root locations
value Di;l ðx; yÞ is the maximum contribution of the ith part that yield high-scoring hypotheses independently for each
to the score of a root location that places the anchor of this component.
part at position ðx; yÞ in level l.
The transformed array, Di;l , can be computed in linear
time from the response array, Ri;l , using the generalized
4 LATENT SVM
distance transform algorithm from [14]. Consider a classifier that scores an example x with a
The overall root scores at each level can be expressed by the sum of the root filter response at that level, plus shifted versions of transformed and subsampled part responses:

$$\mathrm{score}(x_0, y_0, l_0) = R_{0,l_0}(x_0, y_0) + \sum_{i=1}^{n} D_{i,l_0-\lambda}\bigl(2(x_0, y_0) + v_i\bigr) + b. \qquad (9)$$

Recall that $\lambda$ is the number of levels we need to go down in the feature pyramid to get to a feature map that was computed at exactly twice the resolution of another one. Fig. 4 illustrates the matching process.

To understand (9), note that for a fixed root location we can independently pick the best location for each part because there are no interactions among parts in the score of a hypothesis. The transformed arrays $D_{i,l}$ give the contribution of the $i$th part to the overall root score as a function of the anchor position for the part. So, we obtain the total score of a root position at level $l$ by adding up the root filter response and the contributions from each part, which are precomputed in $D_{i,l-\lambda}$.

In addition to computing $D_{i,l}$, the algorithm from [14] can also compute optimal displacements for a part as a function of its anchor position:

$$P_{i,l}(x, y) = \arg\max_{dx,dy} \bigl( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \bigr). \qquad (10)$$

After finding a root location $(x_0, y_0, l_0)$ with high score, we can find the corresponding part locations by looking up the optimal displacements in $P_{i,l_0-\lambda}(2(x_0, y_0) + v_i)$.

3.3 Mixture Models

A mixture model with $m$ components is defined by an $m$-tuple, $M = (M_1, \ldots, M_m)$, where $M_c$ is the model for the $c$th component.

An object hypothesis for a mixture model specifies a mixture component, $1 \le c \le m$, and a location for each filter of $M_c$, $z = (c, p_0, \ldots, p_{n_c})$. Here, $n_c$ is the number of parts in $M_c$. The score of this hypothesis is the score of the hypothesis $z' = (p_0, \ldots, p_{n_c})$ for the $c$th model component.

function of the form

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \qquad (13)$$

Here, $\beta$ is a vector of model parameters and $z$ are latent values. The set $Z(x)$ defines the possible latent values for an example $x$. A binary label for $x$ can be obtained by thresholding its score.

In analogy to classical SVMs, we train $\beta$ from labeled examples $D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$, where $y_i \in \{-1, 1\}$, by minimizing the objective function

$$L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max\bigl(0, 1 - y_i f_\beta(x_i)\bigr), \qquad (14)$$

where $\max(0, 1 - y_i f_\beta(x_i))$ is the standard hinge loss and the constant $C$ controls the relative weight of the regularization term.

Note that if there is a single possible latent value for each example ($|Z(x_i)| = 1$), then $f_\beta$ is linear in $\beta$ and we obtain linear SVMs as a special case of latent SVMs.

4.1 Semiconvexity

A latent SVM leads to a nonconvex optimization problem. However, a latent SVM is semiconvex in the sense described below, and the training problem becomes convex once latent information is specified for the positive training examples.

Recall that the maximum of a set of convex functions is convex. In a linear SVM, we have that $f_\beta(x) = \beta \cdot \Phi(x)$ is linear in $\beta$. In this case, the hinge loss is convex for each example because it is always the maximum of two convex functions.

Note that $f_\beta(x)$ as defined in (13) is a maximum of functions each of which is linear in $\beta$. Hence, $f_\beta(x)$ is convex in $\beta$ and thus the hinge loss, $\max(0, 1 - y_i f_\beta(x_i))$, is convex in $\beta$ when $y_i = -1$. That is, the loss function is convex in $\beta$ for negative examples. We call this property of the loss function semiconvexity.

In a general latent SVM, the hinge loss is not convex for a positive example because it is the maximum of a convex function (zero) and a concave function $(1 - y_i f_\beta(x_i))$.
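The score in (13) and the objective in (14) can be made concrete with a small sketch. This is only an illustration under stated assumptions: each example is represented by an explicit list of candidate feature vectors $\Phi(x, z)$, one per latent value, and the names `latent_score` and `lsvm_objective` are hypothetical rather than taken from the system described here.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def latent_score(beta, feats):
    """f_beta(x) = max over z of beta . Phi(x, z), as in (13).

    feats holds one feature vector Phi(x, z) per latent value z (assumed enumerable).
    """
    return max(dot(beta, phi) for phi in feats)

def lsvm_objective(beta, examples, C):
    """L_D(beta) = 0.5 * ||beta||^2 + C * sum of hinge losses, as in (14)."""
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, feats))
                for feats, y in examples)
    return 0.5 * dot(beta, beta) + C * hinge

# Toy data: two examples with two latent values each (illustrative numbers only).
beta = [1.0, -1.0]
examples = [
    ([[2.0, 0.0], [0.0, 1.0]], +1),  # candidate scores 2 and -1, so f = 2 and the loss is 0
    ([[0.5, 0.0], [0.0, 3.0]], -1),  # candidate scores 0.5 and -3, so f = 0.5 and the loss is 1.5
]
objective = lsvm_objective(beta, examples, C=1.0)
```

With a single candidate per example, `latent_score` reduces to a dot product and `lsvm_objective` becomes the usual linear SVM objective, matching the special case noted in the text.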

Fig. 4. The matching process at one scale. Responses from the root and part filters are computed at different resolutions in the feature pyramid. The transformed responses are combined to yield a final score for each root location. We show the responses and transformed responses for the “head” and “right shoulder” parts. Note how the “head” filter is more discriminative. The combined scores clearly show two good hypotheses for the object at this scale.
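The combination of responses illustrated in Fig. 4 corresponds to (9). The following is a minimal brute-force sketch of that sum, not the efficient distance-transform algorithm from [14]. It assumes responses are stored as nested lists indexed `R[y][x]`, that the part responses are already taken from the level at twice the root resolution, and that the deformation features are $\phi_d(dx, dy) = (dx, dy, dx^2, dy^2)$; all function names are hypothetical.

```python
def transformed_response(R, d, x, y):
    """Brute-force D(x, y): max over placements of R minus the deformation cost.

    R is a part filter response array indexed R[y][x]; d = (d1, d2, d3, d4)
    weights the assumed deformation features (dx, dy, dx^2, dy^2).
    """
    best = float("-inf")
    for yy, row in enumerate(R):
        for xx, r in enumerate(row):
            dx, dy = xx - x, yy - y
            cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
            best = max(best, r - cost)
    return best

def root_score(R0, x0, y0, parts, b):
    """Eq. (9): root response plus each part's contribution at anchor 2*(x0, y0) + v_i."""
    total = R0[y0][x0] + b
    for R_i, v_i, d_i in parts:
        ax, ay = 2 * x0 + v_i[0], 2 * y0 + v_i[1]
        total += transformed_response(R_i, d_i, ax, ay)
    return total

# Toy example: a 2x2 root response and one 4x4 part response at twice the resolution.
R0 = [[1.0, 2.0],
      [0.0, 3.0]]
R1 = [[0.0, 0.0, 0.0, 5.0],
      [0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]
parts = [(R1, (0, 0), (0.0, 0.0, 1.0, 1.0))]
score = root_score(R0, x0=1, y0=0, parts=parts, b=0.5)
```

In the toy example the part anchor is (2, 0); moving the part one cell to the right reaches the strong response of 5 at a deformation cost of 1, so the part contributes 4 to the root score.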

Now consider a latent SVM where there is a single possible latent value for each positive example. In this case, $f_\beta(x_i)$ is linear for a positive example and the loss due to each positive is convex. Combined with the semiconvexity property, (14) becomes convex.

4.2 Optimization

Let $Z_p$ specify a latent value for each positive example in a training set $D$. We can define an auxiliary objective function $L_D(\beta, Z_p) = L_{D(Z_p)}(\beta)$, where $D(Z_p)$ is derived from $D$ by restricting the latent values for the positive examples

according to $Z_p$. That is, for a positive example, we set $Z(x_i) = \{z_i\}$, where $z_i$ is the latent value specified for $x_i$ by $Z_p$. Note that

$$L_D(\beta) = \min_{Z_p} L_D(\beta, Z_p). \qquad (15)$$

In particular, $L_D(\beta) \le L_D(\beta, Z_p)$. The auxiliary objective function bounds the LSVM objective. This justifies training a latent SVM by minimizing $L_D(\beta, Z_p)$.

In practice, we minimize $L_D(\beta, Z_p)$ using a “coordinate-descent” approach:

1. Relabel positive examples: Optimize $L_D(\beta, Z_p)$ over $Z_p$ by selecting the highest scoring latent value for each positive example, $z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
2. Optimize beta: Optimize $L_D(\beta, Z_p)$ over $\beta$ by solving the convex optimization problem defined by $L_{D(Z_p)}(\beta)$.

Both steps always improve or maintain the value of $L_D(\beta, Z_p)$. After convergence, we have a relatively strong local optimum in the sense that step 1 searches over an exponentially large space of latent values for positive examples while step 2 searches over all possible models, implicitly considering the exponentially large space of latent values for all negative examples.

We note, however, that careful initialization of $\beta$ may be necessary because otherwise we may select unreasonable latent values for the positive examples in step 1 and this could lead to a bad model.

The semiconvexity property is important because it leads to a convex optimization problem in step 2, even though the latent values for the negative examples are not fixed. A similar procedure that fixes latent values for all examples in each round would likely fail to yield good results. Suppose we let $Z$ specify latent values for all examples in $D$. Since $L_D(\beta)$ effectively maximizes over negative latent values, $L_D(\beta)$ could be much larger than $L_D(\beta, Z)$, and we should not expect that minimizing $L_D(\beta, Z)$ would lead to a good model.

4.3 Stochastic Gradient Descent

Step 2 (Optimize beta) of the coordinate-descent method can be solved via quadratic programming [3]. It can also be solved via stochastic gradient descent. Here, we describe a gradient-descent approach for optimizing $\beta$ over an arbitrary training set $D$. In practice, we use a modified version of this procedure that works with a cache of feature vectors for $D(Z_p)$ (see Section 4.5).

Let $z_i(\beta) = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$. Then, $f_\beta(x_i) = \beta \cdot \Phi(x_i, z_i(\beta))$.

We can compute a subgradient of the LSVM objective function as follows:

$$\nabla L_D(\beta) = \beta + C \sum_{i=1}^{n} h(\beta, x_i, y_i), \qquad (16)$$

$$h(\beta, x_i, y_i) = \begin{cases} 0, & \text{if } y_i f_\beta(x_i) \ge 1, \\ -y_i \Phi(x_i, z_i(\beta)), & \text{otherwise.} \end{cases} \qquad (17)$$

In stochastic gradient descent, we approximate $\nabla L_D$ using a subset of the examples and take a step in its negative direction. Using a single example, $\langle x_i, y_i \rangle$, we approximate $\sum_{i=1}^{n} h(\beta, x_i, y_i)$ with $n\,h(\beta, x_i, y_i)$. The resulting algorithm repeatedly updates $\beta$ as follows:

1. Let $\alpha_t$ be the learning rate for iteration $t$.
2. Let $i$ be a random example.
3. Let $z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
4. If $y_i f_\beta(x_i) = y_i(\beta \cdot \Phi(x_i, z_i)) \ge 1$, set $\beta := \beta - \alpha_t \beta$.
5. Else set $\beta := \beta - \alpha_t(\beta - C n y_i \Phi(x_i, z_i))$.

As in gradient-descent methods for linear SVMs, we obtain a procedure that is quite similar to the perceptron algorithm. If $f_\beta$ correctly classifies the random example $x_i$ (beyond the margin), we simply shrink $\beta$. Otherwise, we shrink $\beta$ and add a scalar multiple of $\Phi(x_i, z_i)$ to it.

For linear SVMs, a learning rate $\alpha_t = 1/t$ has been shown to work well [37]. However, the time for convergence depends on the number of training examples, which for us can be very large. In particular, if there are many “easy” examples, step 2 will often pick one of these and we do not make much progress.

4.4 Data-Mining Hard Examples, SVM Version

When training a model for object detection, we often have a very large number of negative examples (a single image can yield $10^5$ examples for a scanning window classifier). This can make it infeasible to consider all negative examples simultaneously. Instead, it is common to construct training data consisting of the positive instances and “hard negative” instances.

Bootstrapping methods train a model with an initial subset of negative examples, and then collect negative examples that are incorrectly classified by this initial model to form a set of hard negatives. A new model is trained with the hard negative examples, and the process may be repeated a few times.

Here, we describe a data-mining algorithm motivated by the bootstrapping idea for training a classical (nonlatent) SVM. The method solves a sequence of training problems using a relatively small number of hard examples and converges to the exact solution of the training problem defined by a large training set. This requires a margin-sensitive definition of hard examples.

We define hard and easy instances of a training set $D$ relative to $\beta$ as follows:

$$H(\beta, D) = \{\langle x, y \rangle \in D \mid y f_\beta(x) < 1\}. \qquad (18)$$

$$E(\beta, D) = \{\langle x, y \rangle \in D \mid y f_\beta(x) > 1\}. \qquad (19)$$

That is, $H(\beta, D)$ are the examples in $D$ that are incorrectly classified or inside the margin of the classifier defined by $\beta$. Similarly, $E(\beta, D)$ are the examples in $D$ that are correctly classified and outside the margin. Examples on the margin are neither hard nor easy.

Let $\beta^*(D) = \arg\min_\beta L_D(\beta)$. Since $L_D$ is strictly convex, $\beta^*(D)$ is unique.

Given a large training set $D$, we would like to find a small set of examples $C \subseteq D$ such that $\beta^*(C) = \beta^*(D)$.

Our method starts with an initial “cache” of examples and alternates between training a model and updating the cache. In each iteration, we remove easy examples from the cache and add new hard examples. A special case involves

keeping all positive examples in the cache and data mining over negatives.

Let $C_1 \subseteq D$ be an initial cache of examples. The algorithm repeatedly trains a model and updates the cache as follows:

1. Let $\beta_t := \beta^*(C_t)$ (train a model using $C_t$).
2. If $H(\beta_t, D) \subseteq C_t$, stop and return $\beta_t$.
3. Let $C'_t := C_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, C_t)$ (shrink the cache).
4. Let $C_{t+1} := C'_t \cup X$ for any $X$ such that $X \subseteq D$ and $X \cap H(\beta_t, D) \setminus C_t \ne \emptyset$ (grow the cache).

In step 3, we shrink the cache by removing examples from $C_t$ that are outside the margin defined by $\beta_t$. In step 4, we grow the cache by adding examples from $D$, including at least one new example that is inside the margin defined by $\beta_t$. Such an example must exist; otherwise, we would have returned in step 2.

The following theorem shows that, when we stop, we have found $\beta^*(D)$:

Theorem 1. Let $C \subseteq D$ and $\beta = \beta^*(C)$. If $H(\beta, D) \subseteq C$, then $\beta = \beta^*(D)$.

Proof. $C \subseteq D$ implies $L_D(\beta^*(D)) \ge L_C(\beta^*(C)) = L_C(\beta)$. Since $H(\beta, D) \subseteq C$, all examples in $D \setminus C$ have zero loss on $\beta$. This implies $L_C(\beta) = L_D(\beta)$. We conclude $L_D(\beta^*(D)) \ge L_D(\beta)$ and, because $L_D$ has a unique minimum, $\beta = \beta^*(D)$. $\square$

The next result shows the algorithm will stop after a finite number of iterations. Intuitively, this follows from the fact that $L_{C_t}(\beta^*(C_t))$ grows in each iteration, but it is bounded by $L_D(\beta^*(D))$.

Theorem 2. The data-mining algorithm terminates.

Proof. When we shrink, the cache $C'_t$ contains all examples from $C_t$ with nonzero loss in a ball around $\beta_t$. This implies $L_{C'_t}$ is identical to $L_{C_t}$ in a ball around $\beta_t$ and, since $\beta_t$ is a minimum of $L_{C_t}$, it also must be a minimum of $L_{C'_t}$. Thus, $L_{C'_t}(\beta^*(C'_t)) = L_{C_t}(\beta^*(C_t))$.

When we grow, the cache $C_{t+1} \setminus C'_t$ contains at least one example $\langle x, y \rangle$ with nonzero loss at $\beta_t$. Since $C'_t \subseteq C_{t+1}$, we have $L_{C_{t+1}}(\beta) \ge L_{C'_t}(\beta)$ for all $\beta$. If $\beta^*(C_{t+1}) \ne \beta^*(C'_t)$, then $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ because $L_{C'_t}$ has a unique minimum. If $\beta^*(C_{t+1}) = \beta^*(C'_t)$, then $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ due to $\langle x, y \rangle$.

We conclude $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C_t}(\beta^*(C_t))$. Since there are finitely many caches, the loss in the cache can only grow a finite number of times. $\square$

4.5 Data-Mining Hard Examples, LSVM Version

Now we describe a data-mining algorithm for training a latent SVM when the latent values for the positive examples are fixed. That is, we are optimizing $L_{D(Z_p)}(\beta)$ and not $L_D(\beta)$. As discussed above, this restriction ensures that the optimization problem is convex.

For a latent SVM, instead of keeping a cache of examples $x$, we keep a cache of $(x, z)$ pairs where $z \in Z(x)$. This makes it possible to avoid doing inference over all of $Z(x)$ in the inner loop of an optimization algorithm such as gradient descent. Moreover, in practice, we can keep a cache of feature vectors, $\Phi(x, z)$, instead of $(x, z)$ pairs. This representation is simpler (it is application independent) and can be much more compact.

A feature vector cache $F$ is a set of pairs $(i, v)$ where $1 \le i \le n$ is the index of an example and $v = \Phi(x_i, z)$ for some $z \in Z(x_i)$. Note that we may have several pairs $(i, v) \in F$ for each example $x_i$. If the training set has fixed labels for positive examples, this may still be true for the negative examples.

Let $I(F)$ be the examples indexed by $F$. The feature vectors in $F$ define an objective function for $\beta$ where we only consider examples indexed by $I(F)$ and, for each example, we only consider feature vectors in the cache:

$$L_F(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i \in I(F)} \max\Bigl(0, 1 - y_i \max_{(i,v) \in F} \beta \cdot v\Bigr). \qquad (20)$$

We can optimize $L_F$ via gradient descent by modifying the method in Section 4.3. Let $V(i)$ be the set of feature vectors $v$ such that $(i, v) \in F$. Then, each gradient-descent iteration simplifies to:

1. Let $\alpha_t$ be the learning rate for iteration $t$.
2. Let $i \in I(F)$ be a random example indexed by $F$.
3. Let $v_i = \arg\max_{v \in V(i)} \beta \cdot v$.
4. If $y_i(\beta \cdot v_i) \ge 1$, set $\beta = \beta - \alpha_t \beta$.
5. Else set $\beta = \beta - \alpha_t(\beta - C n y_i v_i)$.

Now the size of $I(F)$ controls the number of iterations necessary for convergence, while the size of $V(i)$ controls the time it takes to execute step 3. In step 5, $n = |I(F)|$.

Let $\beta^*(F) = \arg\min_\beta L_F(\beta)$. We would like to find a small cache for $D(Z_p)$ with $\beta^*(F) = \beta^*(D(Z_p))$.

We define the hard feature vectors of a training set $D$ relative to $\beta$ as

$$H(\beta, D) = \Bigl\{ (i, \Phi(x_i, z_i)) \;\Bigm|\; z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z) \text{ and } y_i(\beta \cdot \Phi(x_i, z_i)) < 1 \Bigr\}. \qquad (21)$$

That is, $H(\beta, D)$ are pairs $(i, v)$ where $v$ is the highest scoring feature vector from an example $x_i$ that is inside the margin of the classifier defined by $\beta$.

We define the easy feature vectors in a cache $F$ as

$$E(\beta, F) = \{(i, v) \in F \mid y_i(\beta \cdot v) > 1\}. \qquad (22)$$

These are the feature vectors from $F$ that are outside the margin defined by $\beta$.

Note that if $y_i(\beta \cdot v) \le 1$, then $(i, v)$ is not considered easy even if there is another feature vector for the $i$th example in the cache with higher score than $v$ under $\beta$.

Now we describe the data-mining algorithm for computing $\beta^*(D(Z_p))$.

The algorithm works with a cache of feature vectors for $D(Z_p)$. It alternates between training a model and updating the cache.

Let $F_1$ be an initial cache of feature vectors. Now consider the following iterative algorithm:

1. Let $\beta_t := \beta^*(F_t)$ (train a model).
2. If $H(\beta_t, D(Z_p)) \subseteq F_t$, stop and return $\beta_t$.

3. Let $F'_t := F_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, F_t)$ (shrink the cache).
4. Let $F_{t+1} := F'_t \cup X$ for any $X$ such that $X \cap H(\beta_t, D(Z_p)) \setminus F_t \ne \emptyset$ (grow the cache).

Step 3 shrinks the cache by removing easy feature vectors. Step 4 grows the cache by adding “new” feature vectors, including at least one from $H(\beta_t, D(Z_p))$. Note that over time we will accumulate multiple feature vectors from the same negative example in the cache.

We can show that this algorithm will eventually stop and return $\beta^*(D(Z_p))$. This follows from arguments analogous to the ones used in Section 4.4.

5 TRAINING MODELS

Now we consider the problem of training models from images labeled with bounding boxes around the objects of interest. This is the type of data available in the PASCAL data sets. Each data set contains thousands of images and each image has annotations specifying a bounding box and a class label for each target object present in the image. Note that this is a weakly labeled setting since the bounding boxes do not specify component labels or part locations.

We describe a procedure for initializing the structure of a mixture model and learning all parameters. Parameter learning is done by constructing a latent SVM training problem. We train the latent SVM using the coordinate-descent approach described in Section 4.2 together with the data-mining and gradient-descent algorithms that work with a cache of feature vectors from Section 4.5. Since the coordinate-descent method is susceptible to local minima, we must take care to ensure a good initialization of the model.

5.1 Learning Parameters

Let $c$ be an object class. We assume that the training examples for $c$ are given by positive bounding boxes $P$ and a set of background images $N$. $P$ is a set of pairs $(I, B)$ where $I$ is an image and $B$ is a bounding box for an object of class $c$ in $I$.

Let $M$ be a (mixture) model with fixed structure. Recall that the parameters for a model are defined by a vector $\beta$. To learn $\beta$, we define a latent SVM training problem with an implicitly defined training set $D$, with positive examples from $P$ and negative examples from $N$.

Each example $\langle x, y \rangle \in D$ has an associated image and feature pyramid $H(x)$. Latent values $z \in Z(x)$ specify an instantiation of $M$ in the feature pyramid $H(x)$.

Now define $\Phi(x, z) = \psi(H(x), z)$. Then, $\beta \cdot \Phi(x, z)$ is exactly the score of the hypothesis $z$ for $M$ on $H(x)$.

A positive bounding box $(I, B) \in P$ specifies that the object detector should “fire” in a location defined by $B$. This means the overall score (7) of a root location defined by $B$ should be high.

For each $(I, B) \in P$, we define a positive example $x$ for the LSVM training problem. We define $Z(x)$ so that the detection window of a root filter specified by a hypothesis $z \in Z(x)$ overlaps with $B$ by at least 70 percent. There are usually many root locations, including at different scales, that define detection windows with 70 percent overlap. We have found that treating the root location as a latent variable is helpful to compensate for noisy bounding box labels in $P$. A similar idea was used in [40].

Now consider a background image $I \in N$. We do not want the object detector to “fire” in any location of the feature pyramid for $I$. This means the overall score (7) of every root location should be low. Let $G$ be a dense set of locations in the feature pyramid. We define a different negative example $x$ for each location $(i, j, l) \in G$. We define $Z(x)$ so that the level of the root filter specified by $z \in Z(x)$ is $l$ and the center of its detection window is $(i, j)$. Note that there are a very large number of negative examples obtained from each image. This is consistent with the requirement that a scanning window classifier should have a low false positive rate.

The procedure Train is outlined below. The outermost loop implements a fixed number of iterations of coordinate descent on $L_D(\beta, Z_p)$. Lines 3-6 implement the Relabel positives step. The resulting feature vectors, one per positive example, are stored in $F_p$. Lines 7-14 implement the Optimize beta step. Since the number of negative examples implicitly defined by $N$ is very large, we use the LSVM data-mining algorithm. We iterate data mining a fixed number of times rather than until convergence for practical reasons. At each iteration, we collect hard negative examples in $F_n$, train a new model using gradient descent, and then shrink $F_n$ by removing easy feature vectors. During data mining, we grow the cache by iterating over the images in $N$ sequentially, until we reach a memory limit. In practice we use $\delta = 0.002$.

The function detect-best$(\beta, I, B)$ finds the highest scoring object hypothesis with a root filter that significantly overlaps $B$ in $I$. The function detect-all$(\beta, I, t)$ computes the best object hypothesis for each root location and selects the ones that score above $t$. Both of these functions can be implemented using the matching procedure in Section 3.2.

The function gradient-descent$(F)$ trains $\beta$ using feature vectors in the cache as described in Section 4.5. In practice, we modified the algorithm to constrain the coefficients of the quadratic terms in the deformation models to be above 0.01. This ensures that the deformation costs are convex, and not “too flat.” We also constrain the model to be

Fig. 5. (a) and (b) The initial root filters for a car model (the result of Phase 1 of the initialization process). (c) The initial part-based model for a car
(the result of Phase 3 of the initialization process).

symmetric along the vertical axis. Filters that are positioned along the center vertical axis of the model are constrained to be self-symmetric. Part filters that are off-center have a symmetric part on the other side of the model. This effectively reduces the number of parameters to be learned in half.

5.2 Initialization

The LSVM coordinate-descent algorithm is susceptible to local minima and thus sensitive to initialization. This is a common limitation of other methods that use latent information as well. We initialize and train mixture models in three phases as follows:

Phase 1. Initializing Root Filters: For training a mixture model with $m$ components we sort the bounding boxes in $P$ by their aspect ratio and split them into $m$ groups of equal size $P_1, \ldots, P_m$. Aspect ratio is used as a simple indicator of extreme intraclass variation. We train $m$ different root filters $F_1, \ldots, F_m$, one for each group of positive bounding boxes.

To define the dimensions of $F_i$, we select the mean aspect ratio of the boxes in $P_i$ and the largest area not larger than 80 percent of the boxes. This ensures that, for most pairs $(I, B) \in P_i$, we can place $F_i$ in the feature pyramid of $I$ so that it significantly overlaps with $B$.

We train $F_i$ using a standard SVM, with no latent information, as in [10]. For $(I, B) \in P_i$, we warp the image region under $B$ so that its feature map has the same dimensions as $F_i$. This leads to a positive example. We select random subwindows of appropriate dimension from images in $N$ to define negative examples. Figs. 5a and 5b show the result of this phase when training a two-component car model.

Phase 2. Merging Components: We combine the initial root filters into a mixture model with no parts and retrain the parameters of the combined model using Train on the full (unsplit and without warping) data sets $P$ and $N$. In this case, the component label and root location are the only latent variables for each example. The coordinate-descent training algorithm can be thought of as a discriminative clustering method that alternates between assigning cluster (mixture) labels for each positive example and estimating cluster “means” (root filters).

Phase 3. Initializing Part Filters: We initialize the parts of each component using a simple heuristic. We fix the number of parts at six per component and, using a small pool of rectangular part shapes, we greedily place parts to cover high-energy regions of the root filter.² A part is either anchored along the central vertical axis of the root filter or it is off-center and has a symmetric part on the other side of the root filter. Once a part is placed, the energy of the covered portion of the root filter is set to zero, and we look for the next highest energy region, until six parts are chosen.

The part filters are initialized by interpolating the root filter to twice the spatial resolution. The deformation parameters for each part are initialized to $d_i = (0, 0, .1, .1)$. This pushes part locations to be fairly close to their anchor position. Fig. 5c shows the results of this phase when training a two-component car model. The resulting model serves as the initial model for the last round of parameter learning. The final car model is shown in Fig. 9.

2. The “energy” of a region is defined by the norm of the positive weights in a subwindow.

6 FEATURES

Here, we describe the 36-dimensional HOG features from [10] and introduce an alternative 13-dimensional feature set that captures essentially the same information.³ We have found that augmenting this low-dimensional feature set to include both contrast sensitive and contrast insensitive features, leading to a 31-dimensional feature vector, improves performance for most classes of the PASCAL data sets.

3. There are some small differences between the 36-dimensional features defined here and the ones in [10], but we have found that these differences did not have any significant effect on the performance of our system.

6.1 HOG Features

6.1.1 Pixel-Level Feature Maps

Let $\theta(x, y)$ and $r(x, y)$ be the orientation and magnitude of the intensity gradient at a pixel $(x, y)$ in an image. As in [10], we compute gradients using finite difference filters, $[-1, 0, +1]$ and its transpose. For color images, we use the color channel with the largest gradient magnitude to define $\theta$ and $r$ at each pixel.

The gradient orientation at each pixel is discretized into one of $p$ values using either a contrast sensitive ($B_1$), or insensitive ($B_2$), definition,

Fig. 6. PCA of HOG features. Each eigenvector is displayed as a 4 × 9 matrix so that each row corresponds to one normalization factor and each column to one orientation bin. The eigenvalues are displayed on top of the eigenvectors. The linear subspace spanned by the top 11 eigenvectors captures essentially all of the information in a feature vector. Note how all of the top eigenvectors are either constant along each column or row of the matrix representation.

$$B_1(x, y) = \mathrm{round}\left(\frac{p\,\theta(x, y)}{2\pi}\right) \bmod p, \qquad (23)$$

$$B_2(x, y) = \mathrm{round}\left(\frac{p\,\theta(x, y)}{\pi}\right) \bmod p. \qquad (24)$$

Below, we use $B$ to denote either $B_1$ or $B_2$.

We define a pixel-level feature map that specifies a sparse histogram of gradient magnitudes at each pixel. Let $b \in \{0, \ldots, p - 1\}$ range over orientation bins. The feature vector at $(x, y)$ is

$$F(x, y)_b = \begin{cases} r(x, y), & \text{if } b = B(x, y), \\ 0, & \text{otherwise.} \end{cases} \qquad (25)$$

We can think of $F$ as an oriented edge map with $p$ orientation channels. For each pixel, we select a channel by discretizing the gradient orientation. The gradient magnitude can be seen as a measure of edge strength.

6.1.2 Spatial Aggregation

Let $F$ be a pixel-level feature map for a $w \times h$ image. Let $k > 0$ be a parameter specifying the side length of a square image region. We define a dense grid of rectangular “cells” and aggregate pixel-level features to obtain a cell-based feature map $C$, with feature vectors $C(i, j)$ for $0 \le i \le \lfloor (w - 1)/k \rfloor$ and $0 \le j \le \lfloor (h - 1)/k \rfloor$. This aggregation provides some invariance to small deformations and reduces the size of a feature map.

The simplest approach for aggregating features is to map each pixel $(x, y)$ into a cell $(\lfloor x/k \rfloor, \lfloor y/k \rfloor)$ and define the feature vector at a cell to be the sum (or average) of the pixel-level features in that cell.

Rather than mapping each pixel to a unique cell, we follow [10] and use a “soft binning” approach where each pixel contributes to the feature vectors in the four cells around it using bilinear interpolation.

6.1.3 Normalization and Truncation

Gradients are invariant to changes in bias. Invariance to gain can be achieved via normalization. Dalal and Triggs [10] used four different normalization factors for the feature vector $C(i, j)$. We can write these factors as $N_{\delta,\gamma}(i, j)$ with $\delta, \gamma \in \{-1, 1\}$,

$$N_{\delta,\gamma}(i, j) = \bigl( \|C(i, j)\|^2 + \|C(i + \delta, j)\|^2 + \|C(i, j + \gamma)\|^2 + \|C(i + \delta, j + \gamma)\|^2 \bigr)^{1/2}. \qquad (26)$$

Each factor measures the “gradient energy” in a square block of four cells containing $(i, j)$.

Let $T_\alpha(v)$ denote the component-wise truncation of a vector $v$ by $\alpha$ (the $i$th entry in $T_\alpha(v)$ is the minimum of the $i$th entry of $v$ and $\alpha$). The HOG feature map is obtained by concatenating the result of normalizing the cell-based feature map $C$ with respect to each normalization factor followed by truncation,

$$H(i, j) = \begin{pmatrix} T_\alpha(C(i, j)/N_{-1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,+1}(i, j)) \\ T_\alpha(C(i, j)/N_{-1,+1}(i, j)) \end{pmatrix}. \qquad (27)$$

Commonly used HOG features are defined using $p = 9$ contrast insensitive gradient orientations (discretized with $B_2$), a cell size of $k = 8$ and truncation $\alpha = 0.2$. This leads to a 36-dimensional feature vector. We used these parameters in the analysis described below.

6.2 PCA and Analytic Dimensionality Reduction

We collected a large number of 36-dimensional HOG features from different resolutions of a large number of images and performed PCA on these vectors. The principal components are shown in Fig. 6. The results lead to a number of interesting discoveries.

The eigenvalues indicate that the linear subspace spanned by the top 11 eigenvectors captures essentially all of the information in a HOG feature. In fact, we obtain the same detection performance in all categories of the PASCAL 2007 data set using the original 36-dimensional features or 11-dimensional features defined by projection to the top eigenvectors. Using lower dimensional features leads to models with fewer parameters and speeds up the detection and learning algorithms. We note, however, that some of the gain is lost because we need to perform a relatively costly projection step when computing feature pyramids.
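As a concrete illustration of how the 36-dimensional feature analyzed here is assembled, the normalization in (26) and the truncation in (27) can be sketched for a single interior cell. This is a simplified sketch under stated assumptions: the cell-based map is a nested list `C[i][j]` of 9-bin histograms, boundary cells are not handled, and the small `eps` guarding against division by zero is an addition of this sketch, not part of the definitions above. The name `hog_cell` is hypothetical.

```python
import math

def norm_sq(v):
    """Squared Euclidean norm of a histogram."""
    return sum(a * a for a in v)

def hog_cell(C, i, j, alpha=0.2, eps=1e-12):
    """Eq. (27) for one interior cell: four normalized, truncated copies of C[i][j]."""
    out = []
    # Same block order as in (27): (-1,-1), (+1,-1), (+1,+1), (-1,+1).
    for delta, gamma in [(-1, -1), (1, -1), (1, 1), (-1, 1)]:
        # Eq. (26): gradient energy of the 2x2 block of cells containing (i, j).
        n = math.sqrt(norm_sq(C[i][j]) + norm_sq(C[i + delta][j])
                      + norm_sq(C[i][j + gamma]) + norm_sq(C[i + delta][j + gamma]))
        # T_alpha: component-wise truncation at alpha (eps is an assumption of this sketch).
        out.extend(min(a / (n + eps), alpha) for a in C[i][j])
    return out

# Toy 3x3 grid of 9-bin cells; only the center cell has gradient energy.
C = [[[0.0] * 9 for _ in range(3)] for _ in range(3)]
C[1][1] = [3.0, 0.3] + [0.0] * 7
h = hog_cell(C, 1, 1)
```

In the toy grid, the dominant bin is clipped to $\alpha = 0.2$ in all four blocks while the weaker bin keeps its normalized value, which is the effect truncation is meant to have.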

Recall that a 36-dimensional HOG feature is defined Finally, we note that the top eigenvectors in Fig. 6 can be
using four different normalizations of a 9-dimensional roughly interpreted as a two-dimensional separable Fourier
histogram over orientations. Thus, a 36-dimensional HOG basis. Each eigenvector can be roughly seen as a sine or
feature is naturally viewed as a 4 9 matrix. The top cosine function of one variable. This observation could be
eigenvectors in Fig. 6 have a very special structure: They are used to define features using a finite number of Fourier basis
functions instead of a finite number of discrete orientations.
each (approximately) constant along each row or column of
The appearance of Fourier basis in Fig. 6 is an interesting
their matrix representation. Thus, the top eigenvectors lie
empirical result. The eigenvectors of a d d covariance
(approximately) in a linear subspace defined by sparse matrix form a Fourier basis when is circulant, i.e., i;j ¼
vectors that have ones along a single row or column of their kði j mod dÞ for some function k. Circulant covariance
matrix representation. matrices arise from probability distributions on vectors that
Let V ¼ fu1 ; . . . ; u9 g [ fv1 ; . . . ; v4 g with are invariant to rotation of the vector coordinates. The
appearance of a two-dimensional Fourier basis in Fig. 6 is
1; if j ¼ k; evidence that the distribution of HOG feature vectors on
uk ði; jÞ ¼ ð28Þ
0; otherwise; natural images have (approximately) a two-dimensional
rotational invariance. We can rotate the orientation bins and

1; if i ¼ k; independently rotate the four normalizations blocks.
vk ði; jÞ ¼ ð29Þ
0; otherwise:
We can define a 13-dimensional feature by taking the dot 7 POSTPROCESSING
product of a 36-dimensional HOG feature with each uk and 7.1 Bounding Box Prediction
vk . Projection into each uk is computed by summing over The desired output of an object detection system is not
the four normalizations for a fixed orientation. Projection entirely clear. The goal in the PASCAL challenge is to
into each vk is computed by summing over nine orientations predict the bounding boxes of objects. In our previous work
for a fixed normalization.4 [17], we reported bounding boxes derived from root filter
As in the case of 11-dimensional PCA features, we obtain locations. Yet detection with one of our models localizes
the same performance using the 36-dimensional HOG features or the 13-dimensional features defined by $V$ (footnote 4). However, the computation of the 13-dimensional features is much less costly than performing projections to the top eigenvectors obtained via PCA since the $u_k$ and $v_k$ are sparse. Moreover, the 13-dimensional features have a simple interpretation as nine orientation features and four features that reflect the overall gradient energy in different areas around a cell.
We can also define low-dimensional features that are contrast sensitive. We have found that performance on some object categories improves using contrast sensitive features, while some categories benefit from contrast insensitive features. Thus, in practice, we use feature vectors that include both contrast sensitive and insensitive information.
Let $C$ be a cell-based feature map computed by aggregating a pixel-level feature map with nine contrast insensitive orientations. Let $D$ be a similar cell-based feature map computed using 18 contrast sensitive orientations. We define four normalization factors for the $(i, j)$ cell of $C$ and $D$ using $C$ as in (26). We can normalize and truncate $C(i, j)$ and $D(i, j)$ using these factors to obtain $4 \times (9 + 18) = 108$ dimensional feature vectors, $F(i, j)$. In practice, we use an analytic projection of these 108-dimensional vectors, defined by 27 sums over different normalizations, one for each orientation channel of $F$, and four sums over the nine contrast insensitive orientations, one for each normalization factor. We use a cell size of $k = 8$ and truncation value of $\alpha = 0.2$. The final feature map has 31-dimensional vectors $G(i, j)$, with 27 dimensions corresponding to different orientation channels (9 contrast insensitive and 18 contrast sensitive) and 4 dimensions capturing the overall gradient energy in square blocks of four cells around $(i, j)$.

4. The 13-dimensional feature is not a linear projection of the 36-dimensional feature into $V$ because the $u_k$ and $v_k$ are not orthogonal. In fact, the linear subspace spanned by $V$ has dimension 12.

each part filter in addition to the root filter. Furthermore, part filters are localized with greater spatial precision than root filters. It is clear that our original approach discards potentially valuable information gained from using a multiscale deformable part model.
In the current system, we use the complete configuration of an object hypothesis, $z$, to predict a bounding box for the object. This is implemented using functions that map a feature vector $g(z)$ to the upper left, $(x_1, y_1)$, and lower right, $(x_2, y_2)$, corners of the bounding box. For a model with $n$ parts, $g(z)$ is a $2n + 3$ dimensional vector containing the width of the root filter in image pixels (this provides scale information) and the location of the upper left corner of each filter in the image.
Each object in the PASCAL training data is labeled by a bounding box. After training a model, we use the output of our detector on each instance to learn four linear functions for predicting $x_1$, $y_1$, $x_2$, and $y_2$ from $g(z)$. This is done via linear least-squares regression, independently for each component of a mixture model.
Fig. 7 illustrates an example of bounding box prediction for a car detection. This simple method yields small but noticeable improvements in performance for some categories in the PASCAL data sets (see Section 8).

7.2 Nonmaximum Suppression

Using the matching procedure from Section 3.2, we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via nonmaximum suppression.
After applying the bounding box prediction method described above, we have a set of detections $D$ for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in $D$ by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50 percent covered by a bounding box of a previously selected detection.
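The greedy nonmaximum suppression procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `(x1, y1, x2, y2, score)` tuple layout, the function names, and the choice to measure coverage as the intersection area divided by the candidate box's own area are assumptions made for this example.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2); degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def covered_fraction(box, other):
    """Fraction of `box` that lies inside `other` (assumed coverage measure)."""
    ix1, iy1 = max(box[0], other[0]), max(box[1], other[1])
    ix2, iy2 = min(box[2], other[2]), min(box[3], other[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    area = box_area(box)
    return inter / area if area > 0 else 0.0


def greedy_nms(detections, max_cover=0.5):
    """Greedy nonmaximum suppression.

    `detections` is a list of (x1, y1, x2, y2, score) tuples. Detections are
    visited from highest to lowest score; one is skipped when at least
    `max_cover` of its box is covered by an already selected box.
    """
    selected = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        box = det[:4]
        if all(covered_fraction(box, kept[:4]) < max_cover for kept in selected):
            selected.append(det)
    return selected


if __name__ == "__main__":
    dets = [
        (10, 10, 110, 110, 0.9),    # strongest detection, kept
        (20, 20, 120, 120, 0.8),    # largely covered by the first, skipped
        (200, 200, 300, 300, 0.7),  # disjoint from the others, kept
    ]
    for det in greedy_nms(dets):
        print(det)
```

Because coverage is measured relative to the candidate box, a small box nested inside a selected large box is suppressed, while a large box that merely contains a selected small box may survive. A symmetric overlap measure would behave differently.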

Fig. 7. A car detection and the bounding box predicted from the object
configuration.

7.3 Contextual Information

We have implemented a simple procedure to rescore
detections using contextual information.
Let $(D_1, \ldots, D_k)$ be a set of detections obtained using $k$ different models (for different object categories) in an image $I$. Each detection $(B, s) \in D_i$ is defined by a bounding box $B = (x_1, y_1, x_2, y_2)$ and a score $s$. We define the context of $I$ in terms of a $k$-dimensional vector $c(I) = (\sigma(s_1), \ldots, \sigma(s_k))$, where $s_i$ is the score of the highest scoring detection in $D_i$ and $\sigma(x) = 1/(1 + \exp(-2x))$ is a logistic function for renormalizing the scores.
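As a small illustration of the context vector just defined, the following Python sketch computes $c(I)$ from per-category detection lists. The data layout (one list of `(box, score)` pairs per category), the function names, and the fallback score used for a category with no detections are assumptions of this sketch; the text does not say how empty categories are handled.

```python
import math


def sigma(x):
    """Logistic renormalization sigma(x) = 1 / (1 + exp(-2x))."""
    z = -2.0 * x
    if z > 700.0:  # avoid overflow in math.exp for very low scores
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def image_context(detections_per_class):
    """Context vector c(I): one renormalized top score per category.

    `detections_per_class` holds one entry per model; each entry is a list
    of (box, score) pairs for that category in the image.
    """
    no_detection_score = -10.0  # hypothetical fallback, not from the paper
    context = []
    for detections in detections_per_class:
        best = max((s for _, s in detections), default=no_detection_score)
        context.append(sigma(best))
    return context


if __name__ == "__main__":
    example = [
        [((0, 0, 50, 50), 0.0), ((5, 5, 40, 40), -1.0)],  # category 1
        [((10, 10, 90, 90), 2.0)],                        # category 2
        [],                                               # category 3, empty
    ]
    print(image_context(example))
```

Only the highest score in each category contributes, so the vector has a fixed length $k$ regardless of how many detections each model returns.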
To rescore a detection $(B, s)$ in an image $I$, we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context,

$$g = (\sigma(s), x_1, y_1, x_2, y_2, c(I)). \tag{30}$$

The coordinates $x_1, y_1, x_2, y_2 \in [0, 1]$ are normalized by the width and height of the image. We use a category-specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual information defined by $g$.
To get training data for the rescoring classifier, we run our object detectors on images that are annotated with bounding boxes around the objects of interest (such as provided in the PASCAL data sets). Each detection returned by one of our models leads to an example $g$ that is labeled as a true positive or false positive detection, depending on whether or not it significantly overlaps an object of the correct category.
This rescoring procedure leads to a noticeable improvement in the average precision (AP) on several categories in the PASCAL data sets (see Section 8). In our experiments, we used the same data set for training models and for training the rescoring classifiers. We used SVMs with quadratic kernels for rescoring.

Fig. 8. Precision/Recall curves for models trained on the person and car categories of the PASCAL 2006 data set. We show results for one- and two-component models with and without parts, and a two-component model with parts and bounding box prediction. In parentheses, we show the average precision score for each model.

8 EMPIRICAL RESULTS

We evaluated our system using the PASCAL VOC 2006, 2007, and 2008 comp3 challenge data sets and protocol. We refer to [11], [12], [13] for details, but emphasize that these benchmarks are widely acknowledged as difficult testbeds for object detection.
Each data set contains thousands of images of real-world scenes. The data sets specify ground-truth bounding boxes for several object classes. At test time, the goal is to predict the bounding boxes of all objects of a given class in an image (if any). In practice, a system will output a set of bounding boxes with corresponding scores, and we can threshold these scores at different points to obtain a precision-recall curve across all images in the test set. For a particular threshold, the precision is the fraction of the reported bounding boxes that are correct detections, while recall is the fraction of the objects found.
A predicted bounding box is considered correct if it overlaps more than 50 percent with a ground-truth bounding box; otherwise, the bounding box is considered a false positive detection. Multiple detections are penalized. If a system predicts several bounding boxes that overlap with a single ground-truth bounding box, only one prediction is considered correct; the others are considered false positives. One scores a system by the AP of its precision-recall curve across a test set.
We trained a two-component model for each class in each data set. Fig. 9 shows some of the models learned on the 2007 data set. Fig. 10 shows some detections we obtain using those models. We show both high-scoring correct detections and high-scoring false positives.

Fig. 9. Some of the models learned on the PASCAL 2007 data set.

Fig. 10. Examples of high-scoring detections on the PASCAL 2007 data set, selected from the top 20 highest scoring detections in each class. The framed images (last two in each row) illustrate false positives for each category. Many false positives (such as for person and cat) are due to the bounding box scoring criteria.

In some categories, our false detections are often due to confusion among classes, such as between horse and cow or between car and bus. In other categories, false detections are often due to the relatively strict bounding box criteria. The two false positives shown for the person category are due to insufficient overlap with the ground-truth bounding box. In the cat category, we often detect the face of a cat and report a bounding box that is too small because it does not include the rest of the cat. In fact, the top 20 highest scoring false positive bounding boxes for the cat category correspond to the face of a cat. This is an extreme case, but it gives an explanation for our low AP score in this category. In many of the positive training examples for cats only the face is visible, and we learn a model where one of the components corresponds to a cat face model (see Fig. 9).
Tables 1 and 2 summarize the results of our system on the 2006 and 2007 challenge data sets. Table 3 summarizes the results on the more recent 2008 data set, together with the systems that entered the official competition in 2008.
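The evaluation protocol described in this section (score-ordered detections, a 50 percent overlap test, a penalty for duplicate detections, and AP as the summary score) can be illustrated with a short Python sketch. It is not the PASCAL development kit and should not be used to reproduce reported numbers. The data layout, the function names, the use of intersection over union as the overlap measure, and the simple area-under-curve AP with a monotone precision envelope are all assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def precision_recall(detections, ground_truth, min_overlap=0.5):
    """Label detections in score order and return (precisions, recalls).

    `detections` is a list of (image_id, box, score); `ground_truth` maps
    image_id to a list of boxes. Each ground-truth box can be matched once,
    so repeated detections of the same object count as false positives.
    """
    total = sum(len(boxes) for boxes in ground_truth.values())
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    tp = fp = 0
    precisions, recalls = [], []
    for img, box, _ in sorted(detections, key=lambda d: d[2], reverse=True):
        gt_boxes = ground_truth.get(img, [])
        overlaps = [iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__, default=None)
        if best is not None and overlaps[best] > min_overlap and not matched[img][best]:
            matched[img][best] = True
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / total if total else 0.0)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the precision-recall curve with a monotone precision envelope."""
    ap, prev_recall = 0.0, 0.0
    for i, r in enumerate(recalls):
        ap += (r - prev_recall) * max(precisions[i:])
        prev_recall = r
    return ap


if __name__ == "__main__":
    gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
    dets = [
        ("a", (0, 0, 10, 10), 0.9),  # correct
        ("a", (1, 1, 10, 10), 0.8),  # duplicate of the same object
        ("b", (0, 0, 10, 9), 0.7),   # correct
        ("c", (0, 0, 5, 5), 0.6),    # image without objects of this class
    ]
    p, r = precision_recall(dets, gt)
    print(p, r, average_precision(p, r))
```

In the example, the second detection overlaps an object that has already been matched, so it is counted as a false positive even though its overlap exceeds the threshold.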
Empty boxes indicate that a method was not tested in the corresponding object class. The entry labeled “UofCTTIUCI” is a preliminary version of the system described here. Our system obtains the best AP score in 9 out of the 20 categories and the second best in 8. Moreover, in some categories such as person, we obtain a score significantly above the second best score.
For all of the experiments shown here, we used the objects not marked as difficult from the trainval data sets to train models (we include the objects marked as truncated). Our system is fairly efficient. Using a desktop computer it takes about 4 hours to train a model on the PASCAL 2007 trainval data set and 3 hours to evaluate it on the test data set. There are 4,952 images in the test data set, so the average running time per image is around 2 seconds. All of the experiments were done on a 2.8 GHz 8-core Intel Xeon Mac Pro computer running Mac OS X 10.5. The system makes use of the multiple-core architecture for computing filter responses in parallel, although the rest of the computation runs in a single thread.
We evaluated different aspects of our system on the longer established 2006 data set. Fig. 8 summarizes results of different models on the person and car categories. We trained models with one and two components with and without parts. We also show the result of a two-component model with parts and bounding box prediction. We see that the use of parts (and bounding box prediction) can significantly improve the detection accuracy. Mixture models are important in the car category but not in the person category of the 2006 data set.
We also trained and tested a one-component model on the INRIA Person data set [10]. We scored the model with the PASCAL evaluation methodology (using the PASCAL development kit) over the complete test set, including images without people. We obtained an AP score of 0.869 in this data set using the base system with bounding box prediction.

TABLE 1
PASCAL VOC 2006 Results

(a) Average precision scores of the base system, (b) scores using bounding box prediction, and (c) scores using bounding box prediction and context rescoring.

TABLE 2
PASCAL VOC 2007 Results

(a) Average precision scores of the base system, (b) scores using bounding box prediction, and (c) scores using bounding box prediction and context rescoring.

TABLE 3
PASCAL VOC 2008 Results

Top: (a) Average precision scores of the base system, (b) scores using bounding box prediction, (c) scores using bounding box prediction and context rescoring, and (d) ranking of final scores relative to systems in the 2008 competition. Bottom: The systems that participated in the competition (UofCTTIUCI is a preliminary version of our system and we don’t include it in the ranking).

9 DISCUSSION

We described an object detection system based on mixtures of multiscale deformable part models. Our system relies heavily on new methods for discriminative training of classifiers that make use of latent information. It also relies heavily on efficient methods for matching deformable models to images. The resulting system is both efficient and accurate, leading to state-of-the-art results on difficult data sets.
Our models are already capable of representing highly variable object classes, but we would like to move toward richer models. The framework described here allows for exploration of additional latent structure. For example, one can consider deeper part hierarchies (parts with parts) or mixture models with many components. In the future, we would like to build grammar-based models that represent objects with variable hierarchical structures. These models should allow for mixture models at the part level, and allow for reusability of parts, both in different components of an object and among different object models.

ACKNOWLEDGMENTS

This material is based upon work supported by the US National Science Foundation under Grant No. IIS 0746569 (Pedro F. Felzenszwalb and Ross B. Girshick), IIS 0811340 (David McAllester), and IIS 0812428 (Deva Ramanan).

REFERENCES

[1] Y. Amit and A. Kong, “Graphical Templates for Model Registration,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 3, pp. 225-236, Mar. 1996.
[2] Y. Amit and A. Trouve, “POP: Patchwork of Parts Models for Object Recognition,” Int’l J. Computer Vision, vol. 75, no. 2, pp. 267-282, 2007.
[3] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support Vector Machines for Multiple-Instance Learning,” Proc. Advances in Neural Information Processing Systems, 2003.
[4] A. Bar-Hillel and D. Weinshall, “Efficient Learning of Relational Object Class Models,” Int’l J. Computer Vision, vol. 77, no. 1, pp. 175-198, 2008.
[5] E. Bernstein and Y. Amit, “Part-Based Statistical Models for Object Classification and Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[6] M. Burl, M. Weber, and P. Perona, “A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry,” Proc. European Conf. Computer Vision, 1998.
[7] T. Cootes, G. Edwards, and C. Taylor, “Active Appearance Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001.
[8] J. Coughlan, A. Yuille, C. English, and D. Snow, “Efficient Deformable Template Detection and Localization without User Initialization,” Computer Vision and Image Understanding, vol. 78, no. 3, pp. 303-319, June 2000.
[9] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, “Spatial Priors for Part-Based Recognition Using Statistical Models,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[10] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[11] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/, 2007.
[12] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2008 (VOC 2008) Results,” http://www.pascal-network.org/challenges/VOC/voc2008/, 2008.
[13] M. Everingham, A. Zisserman, C.K.I. Williams, and L. Van Gool, “The PASCAL Visual Object Classes Challenge 2006 (VOC 2006) Results,” http://www.pascal-network.org/challenges/VOC/voc2006/, 2006.
[14] P. Felzenszwalb and D. Huttenlocher, “Distance Transforms of Sampled Functions,” Technical Report 2004-1963, Cornell Univ. CIS, 2004.
[15] P. Felzenszwalb and D. Huttenlocher, “Pictorial Structures for Object Recognition,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
[16] P. Felzenszwalb and D. McAllester, “The Generalized A* Architecture,” J. Artificial Intelligence Research, vol. 29, pp. 153-190, 2007.
[17] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A Discriminatively Trained, Multiscale, Deformable Part Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[18] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition by Unsupervised Scale-Invariant Learning,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[19] R. Fergus, P. Perona, and A. Zisserman, “A Sparse Object Category Model for Efficient Learning and Exhaustive Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[20] M. Fischler and R. Elschlager, “The Representation and Matching of Pictorial Structures,” IEEE Trans. Computers, vol. 22, no. 1, pp. 67-92, Jan. 1973.
[21] U. Grenander, Y. Chow, and D. Keenan, HANDS: A Pattern-Theoretic Study of Biological Shapes. Springer-Verlag, 1991.
[22] D. Hoiem, A. Efros, and M. Hebert, “Putting Objects in Perspective,” Int’l J. Computer Vision, vol. 80, no. 1, pp. 3-15, Oct. 2008.
[23] A. Holub and P. Perona, “A Discriminative Framework for Modelling Object Classes,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[24] Y. Jin and S. Geman, “Context and Hierarchy in a Probabilistic Image Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[25] T. Joachims, “Making Large-Scale SVM Learning Practical,” Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, 1999.
[26] Y. Ke and R. Sukthankar, “PCA-SIFT: A More Distinctive Representation for Local Image Descriptors,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
[27] Y. LeCun, S. Chopra, R. Hadsell, R. Marc’Aurelio, and F. Huang, “A Tutorial on Energy-Based Learning,” Predicting Structured Data, G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, eds. MIT Press, 2006.
[28] B. Leibe, A. Leonardis, and B. Schiele, “Robust Object Detection with Interleaved Categorization and Segmentation,” Int’l J. Computer Vision, vol. 77, no. 1, pp. 259-289, 2008.
[29] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.
[30] C. Papageorgiou, M. Oren, and T. Poggio, “A General Framework for Object Detection,” Proc. IEEE Int’l Conf. Computer Vision, 1998.
[31] W. Plantinga and C. Dyer, “An Algorithm for Constructing the Aspect Graph,” Proc. 27th Ann. Symp. Foundations of Computer Science, 1985, pp. 123-131, 1986.
[32] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell, “Hidden Conditional Random Fields,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1848-1852, Oct. 2007.
[33] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in Context,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[34] D. Ramanan and C. Sminchisescu, “Training Deformable Models for Localization,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[35] H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes,” Technical Report CMU-CS-95-158R, Carnegie Mellon Univ., 1995.
[36] H. Schneiderman and T. Kanade, “A Statistical Method for 3d Object Detection Applied to Faces and Cars,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[37] S. Shalev-Shwartz, Y. Singer, and N. Srebro, “Pegasos: Primal Estimated Sub-Gradient Solver for SVM,” Proc. Int’l Conf. Machine Learning, 2007.
[38] K. Sung and T. Poggio, “Example-Based Learning for View-Based Human Face Detection,” Technical Report A.I. Memo No. 1521, Massachusetts Inst. of Technology, 1994.
[39] A. Torralba, “Contextual Priming for Object Detection,” Int’l J. Computer Vision, vol. 53, no. 2, pp. 169-191, July 2003.
[40] P. Viola, J. Platt, and C. Zhang, “Multiple Instance Boosting for Object Detection,” Proc. Advances in Neural Information Processing Systems, 2005.
[41] P. Viola and M. Jones, “Robust Real-Time Face Detection,” Int’l J. Computer Vision, vol. 57, no. 2, pp. 137-154, May 2004.
[42] M. Weber, M. Welling, and P. Perona, “Towards Automatic Discovery of Object Categories,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[43] A. Yuille, P. Hallinan, and D. Cohen, “Feature Extraction from Faces Using Deformable Templates,” Int’l J. Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.
[44] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study,” Int’l J. Computer Vision, vol. 73, no. 2, pp. 213-238, June 2007.
[45] S. Zhu and D. Mumford, “A Stochastic Grammar of Images,” Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, pp. 259-362, 2007.

Pedro F. Felzenszwalb received the BS degree in computer science from Cornell University in 1999. He received the PhD degree in computer science from the Massachusetts Institute of Technology (MIT) in 2003. After leaving MIT, he spent one year as a postdoctoral researcher at Cornell University. He joined the Department of Computer Science at the University of Chicago in 2004, where he is currently an associate professor. His work has been supported by the US National Science Foundation, including the CAREER award received in 2008. His main research interests are in computer vision, geometric algorithms, and artificial intelligence. He is currently an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He is a member of the IEEE Computer Society.

Ross B. Girshick received the BS degree in computer science from Brandeis University in 2004. He recently completed his second year of PhD studies at the University of Chicago under the supervision of Professor Pedro F. Felzenszwalb. His current research focuses primarily on the detection and localization of rich object categories in natural images. Before starting his PhD studies, he worked as a software engineer for three years. He is a student member of the IEEE.

David McAllester received the BS, MS, and PhD degrees from the Massachusetts Institute of Technology (MIT) in 1978, 1979, and 1987, respectively. He served on the faculty of Cornell University for the academic year of 1987-1988 and served on the faculty of MIT from 1988 to 1995. He was a member of the technical staff at AT&T Labs-Research from 1995 to 2002. Since 2002, he has been a professor and a chief academic officer at the Toyota Technological Institute at Chicago. He has been a fellow of the American Association of Artificial Intelligence (AAAI) since 1997. He has more than 80 refereed publications. His research is currently focused on applications of machine learning to computer vision. His past research has included machine learning theory, the theory of programming languages, automated reasoning, AI planning, computer game playing (computer chess), and computational linguistics. A 1991 paper on AI planning proved to be one of the most influential papers of the decade in that area. A 1993 paper on computer game algorithms influenced the design of the algorithms used in the Deep Blue system that defeated Garry Kasparov. A 1998 paper on machine learning theory introduced PAC-Bayesian theorems which combine Bayesian and non-Bayesian methods.

Deva Ramanan received the PhD degree in electrical engineering and computer science from the University of California, Berkeley, in 2005. He is an assistant professor of computer science at the University of California, Irvine. From 2005 to 2007, he was a research assistant professor at the Toyota Technological Institute at Chicago. His work has been supported by the US National Science Foundation (NSF), both through large-scale grants and a graduate fellowship, and the University of California Micro program. He regularly reviews for major computer vision journals, has served on panels for NSF, and has served on program committees for major conferences in computer vision, machine learning, and artificial intelligence. His interests include object recognition, tracking, and activity recognition. He is a member of the IEEE.

100% (1)
Chapter 1 - The Importance of MIS
47 pages


maintaining performance seems to require gradual enrich-

ðdxi ; dyi Þ ¼ ðxi ; yi Þ  ð2ðx0 ; y0 Þ þ vi Þ ð3Þ

Now consider a latent SVM where there is a single

4.2 Optimization

5.1 Learning Parameters

7.3 Contextual Information

$$g = (\sigma(s), x_1, y_1, x_2, y_2, c(I)) \tag{30}$$

8 EMPIRICAL RESULTS

TABLE 1

significantly improve the detection accuracy. Mixture

