Ref 2
Abstract—We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to
represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While
deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the
PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-
sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of
MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is
specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive
examples and optimizing the latent SVM objective function.
Index Terms—Object recognition, deformable models, pictorial structures, discriminative training, latent SVM.
1 INTRODUCTION
Object recognition is one of the fundamental challenges in computer vision. In this paper, we consider the problem of detecting and localizing generic objects from categories such as people or cars in static images. This is a difficult problem because objects in such categories can vary greatly in appearance. Variations arise not only from changes in illumination and viewpoint, but also due to nonrigid deformations and intraclass variability in shape and other visual properties. For example, people wear different clothes and take a variety of poses, while cars come in various shapes and colors.
We describe an object detection system that represents highly variable objects using mixtures of multiscale deformable part models. These models are trained using a discriminative procedure that only requires bounding boxes for the objects in a set of images. The resulting system is both efficient and accurate, achieving state-of-the-art results on the PASCAL VOC benchmarks [11], [12], [13] and the INRIA Person data set [10].
Our approach builds on the pictorial structures framework [15], [20]. Pictorial structures represent objects by a collection of parts arranged in a deformable configuration. Each part captures local appearance properties of an object while the deformable configuration is characterized by spring-like connections between certain pairs of parts.
Deformable part models such as pictorial structures provide an elegant framework for object detection. Yet it has been difficult to establish their value in practice. On difficult data sets, deformable part models are often outperformed by simpler models such as rigid templates [10] or bag-of-features [44]. One of the goals of our work is to address this performance gap.
While deformable models can capture significant variations in appearance, a single deformable model is often not expressive enough to represent a rich object category. Consider the problem of modeling the appearance of bicycles in photographs. People build bicycles of different types (e.g., mountain bikes, tandems, and 19th-century cycles with one big wheel and a small one) and view them in various poses (e.g., frontal versus side views). The system described here uses mixture models to deal with these more significant variations.
We are ultimately interested in modeling objects using “visual grammars.” Grammar-based models (e.g., [16], [24], [45]) generalize deformable part models by representing objects using variable hierarchical structures. Each part in a grammar-based model can be defined directly or in terms of other parts. Moreover, grammar-based models allow for, and explicitly model, structural variations. These models also provide a natural framework for sharing information and computation between different object classes. For example, different models might share reusable parts.
Although grammar-based models are our ultimate goal, we have adopted a research methodology under which we gradually move toward richer models while maintaining a high level of performance. Improving performance by enriched models is surprisingly difficult. Simple models have historically outperformed sophisticated models in computer vision, speech recognition, machine translation, and information retrieval. For example, until recently speech recognition and machine translation systems based on n-gram language models outperformed systems based on grammars and phrase structure. In our experience,
. P.F. Felzenszwalb and R.B. Girshick are with the Department of Computer Science, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637. E-mail: {pff, rbg}@cs.uchicago.edu.
. D. McAllester is with the Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL 60637. E-mail: [email protected].
. D. Ramanan is with the Department of Computer Science, University of California, Irvine, 3019 Donald Bren Hall, Irvine, CA 92697. E-mail: [email protected].
Manuscript received 29 May 2009; accepted 25 July 2009; published online 10 Sept. 2009.
Recommended for acceptance by B. Triggs.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2009-05-0336.
Digital Object Identifier no. 10.1109/TPAMI.2009.167.
0162-8828/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society
1628 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010
Fig. 2. Detections obtained with a two-component bicycle model. These examples illustrate the importance of deformations and mixture models. In
this model, the first component captures sideways views of bicycles while the second component captures frontal and near frontal views. The
sideways component can deform to match a “wheelie.”
of information. Moreover, by examining the principal eigenvectors, we discover structure that leads to “analytic” versions of low-dimensional features which are easily interpretable and can be computed efficiently.
We have also considered some specific problems that arise in the PASCAL object detection challenge and similar data sets. We show how the locations of parts in an object hypothesis can be used to predict a bounding box for the object. This is done by training a model-specific predictor using least-squares regression. We also demonstrate a simple method for aggregating the output of several object detectors. The basic idea is that objects of some categories provide evidence for, or against, objects of other categories in the same image. We exploit this idea by training a category-specific classifier that rescores every detection of that category using its original score and the highest scoring detection from each of the other categories.
2 RELATED WORK
There is a significant body of work on deformable models of various types for object detection, including several kinds of deformable template models (e.g., [7], [8], [21], [43]) and a variety of part-based models (e.g., [2], [6], [9], [15], [18], [20], [28], [42]).
In the constellation models from [18], [42], parts are constrained to be in a sparse set of locations determined by an interest point operator, and their geometric arrangement is captured by a Gaussian distribution. In contrast, pictorial structure models [15], [20] define a matching problem where parts have an individual match cost in a dense set of locations, and their geometric arrangement is captured by a set of “springs” connecting pairs of parts. The patchwork of parts model from [2] is similar, but it explicitly considers how the appearance models of overlapping parts interact.
Our models are largely based on the pictorial structures framework from [15], [20]. We use a dense set of possible positions and scales in an image, and define a score for placing a filter at each of these locations. The geometric configuration of the filters is captured by a set of deformation costs (“springs”) connecting each part filter to the root filter, leading to a star-structured pictorial structure model. Note that we do not model interactions between overlapping parts. While we might benefit from modeling such interactions, this does not appear to be a problem when using models trained with a discriminative procedure, and it significantly simplifies the problem of matching a model to an image.
The introduction of new local and semilocal features has played an important role in advancing the performance of object recognition methods. These features are typically invariant to illumination changes and small deformations. Many recent approaches use wavelet-like features [30], [41] or locally normalized histograms of gradients [10], [29]. Other methods, such as [5], learn dictionaries of local structures from training images. In our work, we use histogram of gradient (HOG) features from [10] as a starting point, and introduce a variation that reduces the feature size with no loss in performance. As in [26], we used PCA to discover low-dimensional features, but we note that the eigenvectors we obtain have a clear structure that leads to a new set of
“analytic” features. This removes the need to perform a costly projection step when computing dense feature maps.
Significant variations in shape and appearance, such as those caused by extreme viewpoint changes, are not well captured by a 2D deformable model. Aspect graphs [31] are a classical formalism for capturing significant changes that are due to viewpoint variation. Mixture models provide a simpler alternative approach. For example, it is common to use multiple templates to encode frontal and side views of faces and cars [36]. Mixture models have been used to capture other aspects of appearance variation as well, such as when there are multiple natural subclasses in an object category [5].
Matching a deformable model to an image is a difficult optimization problem. Local search methods require initialization near the correct solution [2], [7], [43]. To guarantee a globally optimal match, more aggressive search is needed. One popular approach for part-based models is to restrict part locations to a small set of possible locations returned by an interest point detector [1], [18], [42]. Tree (and star) structured pictorial structure models [9], [15], [19] allow for the use of dynamic programming and generalized distance transforms to efficiently search over all possible object configurations in an image, without restricting the possible locations for each part. We use these techniques for matching our models to images.
Part-based deformable models are parameterized by the appearance of each part and a geometric model capturing spatial relationships among parts. For generative models, one can learn model parameters using maximum likelihood estimation. In a fully supervised setting, training images are labeled with part locations and models can often be learned using simple methods [9], [15]. In a weakly supervised setting, training images may not specify locations of parts. In this case, one can simultaneously estimate part locations and learn model parameters with EM [2], [18], [42].
Discriminative training methods select model parameters so as to minimize the mistakes of a detection algorithm on a set of training images. Such approaches directly optimize the decision boundary between positive and negative examples. We believe that this is one reason for the success of simple models trained with discriminative methods, such as the Viola-Jones [41] and Dalal-Triggs [10] detectors. It has been more difficult to train part-based models discriminatively, though strategies exist [4], [23], [32], [34].
Latent SVMs are related to hidden CRFs [32]. However, in a latent SVM, we maximize over latent part locations as opposed to marginalizing over them, and we use a hinge loss rather than log loss in training. This leads to an efficient coordinate-descent style algorithm for training, as well as a data-mining algorithm that allows for learning with very large data sets. A latent SVM can be viewed as a type of energy-based model [27].
A latent SVM is equivalent to the MI-SVM formulation of multiple instance learning (MIL) in [3], but we find the latent variable formulation more natural for the problems we are interested in.¹ A different MIL framework was previously used for training object detectors with weakly labeled data in [40].
¹ We defined a latent SVM in [17] before realizing the relationship to MI-SVM.
Our method for data-mining hard examples during training is related to working set methods for SVMs (e.g., [25]). The approach described here requires relatively few passes through the complete set of training examples and is particularly well suited for training with very large data sets, where only a fraction of the examples can fit in RAM.
The use of context for object detection and recognition has received increasing attention in recent years. Some methods (e.g., [39]) use low-level holistic image features for defining likely object hypotheses. The method in [22] uses a coarse but semantically rich representation of a scene, including its 3D geometry, estimated using a variety of techniques. Here, we define the context of an image using the results of running a variety of object detectors in the image. The idea is related to [33] where a CRF was used to capture co-occurrences of objects, although we use a very different approach to capture this information.
A preliminary version of our system was described in [17]. The system described here differs from the one in [17] in several ways, including the introduction of mixture models; here, we optimize the true latent SVM objective function using stochastic gradient descent, while in [17] we used an SVM package to optimize a heuristic approximation of the objective; here, we use new features that are both lower dimensional and more informative; we now postprocess detections via bounding box prediction and context rescoring.
3 MODELS
All of our models involve linear filters that are applied to dense feature maps. A feature map is an array whose entries are d-dimensional feature vectors computed from a dense grid of locations in an image. Intuitively, each feature vector describes a local image patch. In practice, we use a variation of the HOG features from [10], but the framework described here is independent of the specific choice of features.
A filter is a rectangular template defined by an array of d-dimensional weight vectors. The response, or score, of a filter $F$ at a position $(x, y)$ in a feature map $G$ is the “dot product” of the filter and a subwindow of the feature map with top-left corner at $(x, y)$:
$$\sum_{x', y'} F[x', y'] \cdot G[x + x', y + y'].$$
We would like to define a score at different positions and scales in an image. This is done using a feature pyramid which specifies a feature map for a finite number of scales in a fixed range. In practice, we compute feature pyramids by computing a standard image pyramid via repeated smoothing and subsampling, and then computing a feature map from each level of the image pyramid. Fig. 3 illustrates the construction.
The scale sampling in a feature pyramid is determined by a parameter $\lambda$ defining the number of levels in an octave. That is, $\lambda$ is the number of levels we need to go down in the pyramid to get to a feature map computed at twice the resolution of another one. In practice, we have used $\lambda = 5$ in training and $\lambda = 10$ at test time. Fine sampling of scale space is important for obtaining high performance with our models.
The system in [10] uses a single filter to define an object model. That system detects objects by computing the score of the filter at each position and scale of a HOG feature pyramid and thresholding the scores.
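The filter response defined in this section, and the single-filter detection scheme of [10], can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the system: the names `filter_response`, `response_map`, and `detect` are hypothetical, feature maps are plain nested lists indexed as `G[y][x]`, and only a single pyramid level is handled.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
FeatureMap = List[List[Vector]]  # assumed layout: G[y][x] is a d-dimensional vector


def filter_response(F: FeatureMap, G: FeatureMap, x: int, y: int) -> float:
    """Sum over (x', y') of F[x', y'] . G[x + x', y + y'] (top-left corner at (x, y))."""
    h, w = len(F), len(F[0])
    if y + h > len(G) or x + w > len(G[0]):
        raise ValueError("filter does not fit inside the feature map at (x, y)")
    total = 0.0
    for dy in range(h):
        for dx in range(w):
            weights, feats = F[dy][dx], G[y + dy][x + dx]
            total += sum(wk * fk for wk, fk in zip(weights, feats))
    return total


def response_map(F: FeatureMap, G: FeatureMap) -> List[List[float]]:
    """Responses at every position where the filter fits entirely inside G."""
    h, w = len(F), len(F[0])
    return [[filter_response(F, G, x, y) for x in range(len(G[0]) - w + 1)]
            for y in range(len(G) - h + 1)]


def detect(F: FeatureMap, G: FeatureMap, threshold: float) -> List[Tuple[int, int]]:
    """Single-filter detection at one scale: keep positions scoring at least threshold."""
    R = response_map(F, G)
    return [(x, y) for y, row in enumerate(R) for x, s in enumerate(row) if s >= threshold]


if __name__ == "__main__":
    # Toy 3 x 3 feature map with d = 2; the feature at (x, y) is simply (x, y).
    G = [[(float(x), float(y)) for x in range(3)] for y in range(3)]
    F = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]  # 2 x 2 filter
    print(response_map(F, G))
    print(detect(F, G, 8.0))
```

A full detector would repeat this over every level of the feature pyramid; the nested loops are written for clarity, not speed.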
FELZENSZWALB ET AL.: OBJECT DETECTION WITH DISCRIMINATIVELY TRAINED PART-BASED MODELS 1631
Fig. 3. A feature pyramid and an instantiation of a person model within that pyramid. The part filters are placed at twice the spatial resolution of the placement of the root.
Let $F$ be a $w \times h$ filter. Let $H$ be a feature pyramid and $p = (x, y, l)$ specify a position $(x, y)$ in the $l$th level of the pyramid. Let $\phi(H, p, w, h)$ denote the vector obtained by concatenating the feature vectors in the $w \times h$ subwindow of $H$ with top-left corner at $p$ in row-major order. The score of $F$ at $p$ is $F' \cdot \phi(H, p, w, h)$, where $F'$ is the vector obtained by concatenating the weight vectors in $F$ in row-major order. Below we write $F' \cdot \phi(H, p)$ since the subwindow dimensions are implicitly defined by the dimensions of the filter $F$.
3.1 Deformable Part Models
Our star models are defined by a coarse root filter that approximately covers an entire object and higher resolution part filters that cover smaller parts of the object. Fig. 3 illustrates an instantiation of such a model in a feature pyramid. The root filter location defines a detection window (the pixels contributing to the region of the feature map covered by the filter). The part filters are placed $\lambda$ levels down in the pyramid, so the features at that level are computed at twice the resolution of the features in the root filter level.
We have found that using higher resolution features for defining part filters is essential for obtaining high recognition performance. With this approach, the part filters capture finer resolution features that are localized to greater accuracy when compared to the features captured by the root filter. Consider building a model for a face. The root filter could capture coarse resolution edges such as the face boundary while the part filters could capture details such as eyes, nose, and mouth.
A model for an object with $n$ parts is formally defined by an $(n + 2)$-tuple $(F_0, P_1, \ldots, P_n, b)$, where $F_0$ is a root filter, $P_i$ is a model for the $i$th part, and $b$ is a real-valued bias term. Each part model is defined by a 3-tuple $(F_i, v_i, d_i)$ where $F_i$ is a filter for the $i$th part, $v_i$ is a two-dimensional vector specifying an “anchor” position for part $i$ relative to the root position, and $d_i$ is a four-dimensional vector specifying coefficients of a quadratic function defining a deformation cost for each possible placement of the part relative to the anchor position.
An object hypothesis specifies the location of each filter in the model in a feature pyramid, $z = (p_0, \ldots, p_n)$, where $p_i = (x_i, y_i, l_i)$ specifies the level and position of the $i$th filter. We require that the level of each part is such that the feature map at that level was computed at twice the resolution of the root level, $l_i = l_0 - \lambda$ for $i > 0$.
The score of a hypothesis is given by the scores of each filter at their respective locations (the data term) minus a deformation cost that depends on the relative position of each part with respect to the root (the spatial prior), plus the bias,
$$\mathrm{score}(p_0, \ldots, p_n) = \sum_{i=0}^{n} F_i' \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b, \tag{2}$$
where
$$(dx_i, dy_i) = (x_i, y_i) - (2(x_0, y_0) + v_i) \tag{3}$$
gives the displacement of the $i$th part relative to its anchor position and
$$\phi_d(dx, dy) = (dx, dy, dx^2, dy^2) \tag{4}$$
are deformation features.
Note that if $d_i = (0, 0, 1, 1)$, the deformation cost for the $i$th part is the squared distance between its actual position and its anchor position relative to the root. In general, the deformation cost is an arbitrary separable quadratic function of the displacements.
The bias term is introduced in the score to make the scores of multiple models comparable when we combine them into a mixture model.
The score of a hypothesis $z$ can be expressed in terms of a dot product, $\beta \cdot \psi(H, z)$, between a vector of model parameters $\beta$ and a vector $\psi(H, z)$,
$$\beta = (F_0', \ldots, F_n', d_1, \ldots, d_n, b), \tag{5}$$
$$\psi(H, z) = (\phi(H, p_0), \ldots, \phi(H, p_n), -\phi_d(dx_1, dy_1), \ldots, -\phi_d(dx_n, dy_n), 1). \tag{6}$$
This illustrates a connection between our models and linear classifiers. We use this relationship for learning the model parameters with the latent SVM framework.
3.2 Matching
To detect objects in an image, we compute an overall score for each root location according to the best possible placement of the parts:
$$\mathrm{score}(p_0) = \max_{p_1, \ldots, p_n} \mathrm{score}(p_0, \ldots, p_n). \tag{7}$$
High-scoring root locations define detections, while the locations of the parts that yield a high-scoring root location define a full object hypothesis.
By defining an overall score for each root location, we can detect multiple instances of an object (we assume there is at most one instance per root location). This approach is related to sliding-window detectors because we can think of $\mathrm{score}(p_0)$ as a score for the detection window specified by the root filter.
We use dynamic programming and generalized distance transforms (min-convolutions) [14], [15] to compute the best locations for the parts as a function of the root location. The resulting method is very efficient, taking $O(nk)$ time once filter responses are computed, where $n$ is the number of
parts in the model and $k$ is the total number of locations in the feature pyramid. We briefly describe the method here and refer the reader to [14], [15] for more details.
Let $R_{i,l}(x, y) = F_i' \cdot \phi(H, (x, y, l))$ be an array storing the response of the $i$th model filter in the $l$th level of the feature pyramid. The matching algorithm starts by computing these responses. Note that $R_{i,l}$ is a cross-correlation between $F_i$ and level $l$ of the feature pyramid.
After computing filter responses, we transform the responses of the part filters to allow for spatial uncertainty:
$$D_{i,l}(x, y) = \max_{dx, dy} \left( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \right). \tag{8}$$
This transformation spreads high filter scores to nearby locations, taking into account the deformation costs. The value $D_{i,l}(x, y)$ is the maximum contribution of the $i$th part to the score of a root location that places the anchor of this part at position $(x, y)$ in level $l$.
The transformed array, $D_{i,l}$, can be computed in linear time from the response array, $R_{i,l}$, using the generalized distance transform algorithm from [14].
The overall root scores at each level can be expressed by the sum of the root filter response at that level, plus shifted versions of transformed and subsampled part responses:
$$\mathrm{score}(x_0, y_0, l_0) = R_{0,l_0}(x_0, y_0) + \sum_{i=1}^{n} D_{i,l_0 - \lambda}(2(x_0, y_0) + v_i) + b. \tag{9}$$
Recall that $\lambda$ is the number of levels we need to go down in the feature pyramid to get to a feature map that was computed at exactly twice the resolution of another one.
Fig. 4 illustrates the matching process.
To understand (9) note that for a fixed root location we can independently pick the best location for each part because there are no interactions among parts in the score of a hypothesis. The transformed arrays $D_{i,l}$ give the contribution of the $i$th part to the overall root score as a function of the anchor position for the part. So, we obtain the total score of a root position at level $l$ by adding up the root filter response and the contributions from each part, which are precomputed in $D_{i,l-\lambda}$.
In addition to computing $D_{i,l}$, the algorithm from [14] can also compute optimal displacements for a part as a function of its anchor position:
$$P_{i,l}(x, y) = \arg\max_{dx, dy} \left( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \right). \tag{10}$$
After finding a root location $(x_0, y_0, l_0)$ with high score, we can find the corresponding part locations by looking up the optimal displacements in $P_{i,l_0 - \lambda}(2(x_0, y_0) + v_i)$.
3.3 Mixture Models
A mixture model with $m$ components is defined by an $m$-tuple, $M = (M_1, \ldots, M_m)$, where $M_c$ is the model for the $c$th component.
An object hypothesis for a mixture model specifies a mixture component, $1 \le c \le m$, and a location for each filter of $M_c$, $z = (c, p_0, \ldots, p_{n_c})$. Here, $n_c$ is the number of parts in $M_c$. The score of this hypothesis is the score of the hypothesis $z' = (p_0, \ldots, p_{n_c})$ for the $c$th model component.
As in the case of a single component model, the score of a hypothesis for a mixture model can be expressed by a dot product between a vector of model parameters $\beta$ and a vector $\psi(H, z)$. For a mixture model, the vector $\beta$ is the concatenation of the model parameter vectors for each component. The vector $\psi(H, z)$ is sparse, with nonzero entries defined by $\psi(H, z')$ in a single interval matching the interval of $\beta_c$ in $\beta$:
$$\beta = (\beta_1, \ldots, \beta_m), \tag{11}$$
$$\psi(H, z) = (0, \ldots, 0, \psi(H, z'), 0, \ldots, 0). \tag{12}$$
With this construction, $\beta \cdot \psi(H, z) = \beta_c \cdot \psi(H, z')$.
To detect objects using a mixture model, we use the matching algorithm described above to find root locations that yield high-scoring hypotheses independently for each component.
4 LATENT SVM
Consider a classifier that scores an example $x$ with a function of the form
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \tag{13}$$
Here, $\beta$ is a vector of model parameters and $z$ are latent values. The set $Z(x)$ defines the possible latent values for an example $x$. A binary label for $x$ can be obtained by thresholding its score.
In analogy to classical SVMs, we train $\beta$ from labeled examples $D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$ where $y_i \in \{-1, 1\}$ by minimizing the objective function
$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)), \tag{14}$$
where $\max(0, 1 - y_i f_\beta(x_i))$ is the standard hinge loss and the constant $C$ controls the relative weight of the regularization term.
Note that if there is a single possible latent value for each example ($|Z(x_i)| = 1$), then $f_\beta$ is linear in $\beta$ and we obtain linear SVMs as a special case of latent SVMs.
4.1 Semiconvexity
A latent SVM leads to a nonconvex optimization problem. However, a latent SVM is semiconvex in the sense described below, and the training problem becomes convex once latent information is specified for the positive training examples.
Recall that the maximum of a set of convex functions is convex. In a linear SVM, we have that $f_\beta(x) = \beta \cdot \Phi(x)$ is linear in $\beta$. In this case, the hinge loss is convex for each example because it is always the maximum of two convex functions.
Note that $f_\beta(x)$ as defined in (13) is a maximum of functions each of which is linear in $\beta$. Hence, $f_\beta(x)$ is convex in $\beta$ and thus the hinge loss, $\max(0, 1 - y_i f_\beta(x_i))$, is convex in $\beta$ when $y_i = -1$. That is, the loss function is convex in $\beta$ for negative examples. We call this property of the loss function semiconvexity.
In a general latent SVM, the hinge loss is not convex for a positive example because it is the maximum of a convex function (zero) and a concave function $(1 - y_i f_\beta(x_i))$.
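The latent SVM score (13) and objective (14) can be made concrete with a small sketch. It assumes, purely for illustration, that each example is given as an explicit list of candidate feature vectors $\Phi(x, z)$, one per latent value $z \in Z(x)$; the names `latent_score` and `lsvm_objective` are hypothetical and do not come from the system described here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Assumed representation: an example is (candidates, label), where candidates
# holds Phi(x, z) for every latent value z in Z(x) and label is +1 or -1.
Example = Tuple[List[Vector], int]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def latent_score(beta: Vector, candidates: List[Vector]) -> float:
    """f_beta(x) = max over z in Z(x) of beta . Phi(x, z), as in (13)."""
    return max(dot(beta, phi) for phi in candidates)


def lsvm_objective(beta: Vector, data: List[Example], C: float) -> float:
    """L_D(beta) = 0.5 * ||beta||^2 + C * sum of hinge losses, as in (14)."""
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, cands)) for cands, y in data)
    return 0.5 * dot(beta, beta) + C * hinge


if __name__ == "__main__":
    beta = [1.0, 0.0]
    data = [
        ([[2.0, 0.0], [0.0, 3.0]], +1),   # positive: best latent value scores 2
        ([[0.5, 0.0], [-1.0, 0.0]], -1),  # negative: best latent value scores 0.5
    ]
    print(lsvm_objective(beta, data, C=2.0))
```

When every example has a single candidate, `latent_score` reduces to a plain dot product and the objective is that of a linear SVM. For a negative example the loss is `max(0, 1 + f)`, a maximum of functions convex in `beta`, which is the semiconvexity property described in Section 4.1.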
Fig. 4. The matching process at one scale. Responses from the root and part filters are computed at different resolutions in the feature pyramid. The transformed responses are combined to yield a final score for each root location. We show the responses and transformed responses for the “head” and “right shoulder” parts. Note how the “head” filter is more discriminative. The combined scores clearly show two good hypotheses for the object at this scale.
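The matching step at one scale can be sketched as follows. This is an illustrative brute-force version, not the linear-time generalized distance transform of [14]: it evaluates the maximization in (8) and the argmax in (10) directly over a bounded displacement window, then combines the results as in (9). The function names, the `radius` parameter, and the `[y][x]` list layout are assumptions made for the example, and each part array is assumed to be already taken from the level at twice the root resolution.

```python
from typing import List, Sequence, Tuple

Array = List[List[float]]  # assumed layout: R[y][x]


def deformation_cost(d: Sequence[float], dx: int, dy: int) -> float:
    """d . phi_d(dx, dy) with phi_d(dx, dy) = (dx, dy, dx^2, dy^2)."""
    return d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy


def transform_responses(R: Array, d: Sequence[float], radius: int):
    """Brute-force version of (8) and (10), limited to |dx|, |dy| <= radius.

    Returns D (best deformed score per anchor) and P (the maximizing displacement).
    """
    h, w = len(R), len(R[0])
    D = [[0.0] * w for _ in range(h)]
    P = [[(0, 0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best, arg = float("-inf"), (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    xx, yy = x + dx, y + dy
                    if 0 <= xx < w and 0 <= yy < h:
                        s = R[yy][xx] - deformation_cost(d, dx, dy)
                        if s > best:
                            best, arg = s, (dx, dy)
            D[y][x], P[y][x] = best, arg
    return D, P


def root_score(R0: Array, parts: List[Tuple[Array, Tuple[int, int]]],
               b: float, x0: int, y0: int) -> float:
    """Root response plus each part's D at 2 * (x0, y0) + v_i, plus the bias, as in (9)."""
    total = R0[y0][x0] + b
    for D, (vx, vy) in parts:
        total += D[2 * y0 + vy][2 * x0 + vx]
    return total


if __name__ == "__main__":
    R_part = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
    D, P = transform_responses(R_part, d=(0.0, 0.0, 1.0, 1.0), radius=2)
    print(D[1][1], P[1][1])  # the strong response at (2, 2) spreads to anchor (1, 1)
    print(root_score([[1.0]], [(D, (1, 1))], b=-0.5, x0=0, y0=0))
```

The brute-force loop costs time proportional to the number of locations times the window area, so it only illustrates what the transform computes; the algorithm from [14] obtains the same arrays in linear time.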
according to $Z_p$. That is, for a positive example, we set $Z(x_i) = \{z_i\}$, where $z_i$ is the latent value specified for $x_i$ by $Z_p$. Note that
$$L_D(\beta) = \min_{Z_p} L_D(\beta, Z_p). \tag{15}$$
In particular, $L_D(\beta) \le L_D(\beta, Z_p)$. The auxiliary objective function bounds the LSVM objective. This justifies training a latent SVM by minimizing $L_D(\beta, Z_p)$.
In practice, we minimize $L_D(\beta, Z_p)$ using a “coordinate-descent” approach:
1. Relabel positive examples: Optimize $L_D(\beta, Z_p)$ over $Z_p$ by selecting the highest scoring latent value for each positive example, $z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
2. Optimize beta: Optimize $L_D(\beta, Z_p)$ over $\beta$ by solving the convex optimization problem defined by $L_{D(Z_p)}(\beta)$.
Both steps always improve or maintain the value of $L_D(\beta, Z_p)$. After convergence, we have a relatively strong local optimum in the sense that step 1 searches over an exponentially large space of latent values for positive examples while step 2 searches over all possible models, implicitly considering the exponentially large space of latent values for all negative examples.
We note, however, that careful initialization of $\beta$ may be necessary because otherwise we may select unreasonable latent values for the positive examples in step 1 and this could lead to a bad model.
The semiconvexity property is important because it leads to a convex optimization problem in step 2, even though the latent values for the negative examples are not fixed. A similar procedure that fixes latent values for all examples in each round would likely fail to yield good results. Suppose we let $Z$ specify latent values for all examples in $D$. Since $L_D(\beta)$ effectively maximizes over negative latent values, $L_D(\beta)$ could be much larger than $L_D(\beta, Z)$, and we should not expect that minimizing $L_D(\beta, Z)$ would lead to a good model.
4.3 Stochastic Gradient Descent
Step 2 (Optimize Beta) of the coordinate-descent method can be solved via quadratic programming [3]. It can also be solved via stochastic gradient descent. Here, we describe a gradient-descent approach for optimizing $\beta$ over an arbitrary training set $D$. In practice, we use a modified version of this procedure that works with a cache of feature vectors for $D(Z_p)$ (see Section 4.5).
Let $z_i(\beta) = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
Then, $f_\beta(x_i) = \beta \cdot \Phi(x_i, z_i(\beta))$.
We can compute a subgradient of the LSVM objective function as follows:
$$\nabla L_D(\beta) = \beta + C \sum_{i=1}^{n} h(\beta, x_i, y_i), \tag{16}$$
$$h(\beta, x_i, y_i) = \begin{cases} 0, & \text{if } y_i f_\beta(x_i) \ge 1, \\ -y_i \Phi(x_i, z_i(\beta)), & \text{otherwise.} \end{cases} \tag{17}$$
In stochastic gradient descent, we approximate $\nabla L_D$ using a subset of the examples and take a step in its negative direction. Using a single example, $\langle x_i, y_i \rangle$, we approximate $\sum_{i=1}^{n} h(\beta, x_i, y_i)$ with $n\,h(\beta, x_i, y_i)$. The resulting algorithm repeatedly updates $\beta$ as follows:
1. Let $\alpha_t$ be the learning rate for iteration $t$.
2. Let $i$ be a random example.
3. Let $z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
4. If $y_i f_\beta(x_i) = y_i (\beta \cdot \Phi(x_i, z_i)) \ge 1$, set $\beta := \beta - \alpha_t \beta$.
5. Else set $\beta := \beta - \alpha_t (\beta - C n y_i \Phi(x_i, z_i))$.
As in gradient-descent methods for linear SVMs, we obtain a procedure that is quite similar to the perceptron algorithm. If $f_\beta$ correctly classifies the random example $x_i$ (beyond the margin), we simply shrink $\beta$. Otherwise, we shrink $\beta$ and add a scalar multiple of $\Phi(x_i, z_i)$ to it.
For linear SVMs, a learning rate $\alpha_t = 1/t$ has been shown to work well [37]. However, the time for convergence depends on the number of training examples, which for us can be very large. In particular, if there are many “easy” examples, step 2 will often pick one of these and we do not make much progress.
4.4 Data-Mining Hard Examples, SVM Version
When training a model for object detection, we often have a very large number of negative examples (a single image can yield $10^5$ examples for a scanning window classifier). This can make it infeasible to consider all negative examples simultaneously. Instead, it is common to construct training data consisting of the positive instances and “hard negative” instances.
Bootstrapping methods train a model with an initial subset of negative examples, and then collect negative examples that are incorrectly classified by this initial model to form a set of hard negatives. A new model is trained with the hard negative examples, and the process may be repeated a few times.
Here, we describe a data-mining algorithm motivated by the bootstrapping idea for training a classical (nonlatent) SVM. The method solves a sequence of training problems using a relatively small number of hard examples and converges to the exact solution of the training problem defined by a large training set. This requires a margin-sensitive definition of hard examples.
We define hard and easy instances of a training set $D$ relative to $\beta$ as follows:
$$H(\beta, D) = \{\langle x, y \rangle \in D \mid y f_\beta(x) < 1\}, \tag{18}$$
$$E(\beta, D) = \{\langle x, y \rangle \in D \mid y f_\beta(x) > 1\}. \tag{19}$$
That is, $H(\beta, D)$ are the examples in $D$ that are incorrectly classified or inside the margin of the classifier defined by $\beta$. Similarly, $E(\beta, D)$ are the examples in $D$ that are correctly classified and outside the margin. Examples on the margin are neither hard nor easy.
Let $\beta^*(D) = \arg\min_\beta L_D(\beta)$.
Since $L_D$ is strictly convex, $\beta^*(D)$ is unique.
Given a large training set $D$, we would like to find a small set of examples $C \subseteq D$ such that $\beta^*(C) = \beta^*(D)$.
Our method starts with an initial “cache” of examples and alternates between training a model and updating the cache. In each iteration, we remove easy examples from the cache and add new hard examples. A special case involves
keeping all positive examples in the cache and data mining over negatives.

Let $C_1 \subseteq D$ be an initial cache of examples. The algorithm repeatedly trains a model and updates the cache as follows:

1. Let $\beta_t := \beta^*(C_t)$ (train a model using $C_t$).
2. If $H(\beta_t, D) \subseteq C_t$, stop and return $\beta_t$.
3. Let $C'_t := C_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, C_t)$ (shrink the cache).
4. Let $C_{t+1} := C'_t \cup X$ for any $X$ such that $X \subseteq D$ and $X \cap H(\beta_t, D) \setminus C_t \ne \emptyset$ (grow the cache).

In step 3, we shrink the cache by removing examples from $C_t$ that are outside the margin defined by $\beta_t$. In step 4, we grow the cache by adding examples from $D$, including at least one new example that is inside the margin defined by $\beta_t$. Such an example must exist; otherwise, we would have returned in step 2.

The following theorem shows that, when we stop, we have found $\beta^*(D)$:

Theorem 1. Let $C \subseteq D$ and $\beta = \beta^*(C)$. If $H(\beta, D) \subseteq C$, then $\beta = \beta^*(D)$.

Proof. $C \subseteq D$ implies $L_D(\beta^*(D)) \ge L_C(\beta^*(C)) = L_C(\beta)$. Since $H(\beta, D) \subseteq C$, all examples in $D \setminus C$ have zero loss on $\beta$. This implies $L_C(\beta) = L_D(\beta)$. We conclude $L_D(\beta^*(D)) \ge L_D(\beta)$ and, because $L_D$ has a unique minimum, $\beta = \beta^*(D)$. $\square$

The next result shows the algorithm will stop after a finite number of iterations. Intuitively, this follows from the fact that $L_{C_t}(\beta^*(C_t))$ grows in each iteration, but it is bounded by $L_D(\beta^*(D))$.

Theorem 2. The data-mining algorithm terminates.

Proof. When we shrink, the cache $C'_t$ contains all examples from $C_t$ with nonzero loss in a ball around $\beta_t$. This implies $L_{C'_t}$ is identical to $L_{C_t}$ in a ball around $\beta_t$ and, since $\beta_t$ is a minimum of $L_{C_t}$, it also must be a minimum of $L_{C'_t}$. Thus, $L_{C'_t}(\beta^*(C'_t)) = L_{C_t}(\beta^*(C_t))$.

When we grow, the cache $C_{t+1} \setminus C'_t$ contains at least one example $\langle x, y \rangle$ with nonzero loss at $\beta_t$. Since $C'_t \subseteq C_{t+1}$, we have $L_{C_{t+1}}(\beta) \ge L_{C'_t}(\beta)$ for all $\beta$. If $\beta^*(C_{t+1}) \ne \beta^*(C'_t)$, then $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ because $L_{C'_t}$ has a unique minimum. If $\beta^*(C_{t+1}) = \beta^*(C'_t)$, then $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ due to $\langle x, y \rangle$.

We conclude $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C_t}(\beta^*(C_t))$. Since there are finitely many caches, the loss in the cache can only grow a finite number of times. $\square$

4.5 Data-Mining Hard Examples, LSVM Version

Now we describe a data-mining algorithm for training a latent SVM when the latent values for the positive examples are fixed. That is, we are optimizing $L_{D(Z_p)}(\beta)$ and not $L_D(\beta)$. As discussed above, this restriction ensures that the optimization problem is convex.

For a latent SVM, instead of keeping a cache of examples $x$, we keep a cache of $(x, z)$ pairs where $z \in Z(x)$. This makes it possible to avoid doing inference over all of $Z(x)$ in the inner loop of an optimization algorithm such as gradient descent. Moreover, in practice, we can keep a cache of feature vectors, $\Phi(x, z)$, instead of $(x, z)$ pairs. This representation is simpler (it is application independent) and can be much more compact.

A feature vector cache $F$ is a set of pairs $(i, v)$ where $1 \le i \le n$ is the index of an example and $v = \Phi(x_i, z)$ for some $z \in Z(x_i)$. Note that we may have several pairs $(i, v) \in F$ for each example $x_i$. If the training set has fixed labels for positive examples, this may still be true for the negative examples.

Let $I(F)$ be the examples indexed by $F$. The feature vectors in $F$ define an objective function for $\beta$ where we only consider examples indexed by $I(F)$ and, for each example, we only consider feature vectors in the cache:

$$L_F(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i \in I(F)} \max\Bigl(0,\ 1 - y_i \Bigl(\max_{(i,v) \in F} \beta \cdot v\Bigr)\Bigr). \tag{20}$$

We can optimize $L_F$ via gradient descent by modifying the method in Section 4.3. Let $V(i)$ be the set of feature vectors $v$ such that $(i, v) \in F$. Then, each gradient-descent iteration simplifies to:

1. Let $\alpha_t$ be the learning rate for iteration $t$.
2. Let $i \in I(F)$ be a random example indexed by $F$.
3. Let $v_i = \operatorname{argmax}_{v \in V(i)} \beta \cdot v$.
4. If $y_i(\beta \cdot v_i) \ge 1$, set $\beta := \beta - \alpha_t \beta$.
5. Else set $\beta := \beta - \alpha_t(\beta - C n y_i v_i)$.

Now the size of $I(F)$ controls the number of iterations necessary for convergence, while the size of $V(i)$ controls the time it takes to execute step 3. In step 5, $n = |I(F)|$.

Let $\beta^*(F) = \operatorname{argmin}_\beta L_F(\beta)$. We would like to find a small cache for $D(Z_p)$ with $\beta^*(F) = \beta^*(D(Z_p))$.

We define the hard feature vectors of a training set $D$ relative to $\beta$ as

$$H(\beta, D) = \Bigl\{ (i, \Phi(x_i, z_i)) \ \Big|\ z_i = \operatorname*{argmax}_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z) \text{ and } y_i(\beta \cdot \Phi(x_i, z_i)) < 1 \Bigr\}. \tag{21}$$

That is, $H(\beta, D)$ are pairs $(i, v)$ where $v$ is the highest scoring feature vector from an example $x_i$ that is inside the margin of the classifier defined by $\beta$.

We define the easy feature vectors in a cache $F$ as

$$E(\beta, F) = \{ (i, v) \in F \mid y_i(\beta \cdot v) > 1 \}. \tag{22}$$

These are the feature vectors from $F$ that are outside the margin defined by $\beta$.

Note that if $y_i(\beta \cdot v) \le 1$, then $(i, v)$ is not considered easy even if there is another feature vector for the $i$th example in the cache with higher score than $v$ under $\beta$.

Now we describe the data-mining algorithm for computing $\beta^*(D(Z_p))$.

The algorithm works with a cache of feature vectors for $D(Z_p)$. It alternates between training a model and updating the cache.

Let $F_1$ be an initial cache of feature vectors. Now consider the following iterative algorithm:

1. Let $\beta_t := \beta^*(F_t)$ (train a model).
2. If $H(\beta_t, D(Z_p)) \subseteq F_t$, stop and return $\beta_t$.
3. Let $F'_t := F_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, F_t)$ (shrink the cache).
4. Let $F_{t+1} := F'_t \cup X$ for any $X$ such that $X \cap H(\beta_t, D(Z_p)) \setminus F_t \ne \emptyset$ (grow the cache).

Step 3 shrinks the cache by removing easy feature vectors. Step 4 grows the cache by adding “new” feature vectors, including at least one from $H(\beta_t, D(Z_p))$. Note that over time we will accumulate multiple feature vectors from the same negative example in the cache.

We can show that this algorithm will eventually stop and return $\beta^*(D(Z_p))$. This follows from arguments analogous to the ones used in Section 4.4.

5 TRAINING MODELS

Now we consider the problem of training models from images labeled with bounding boxes around the objects of interest. This is the type of data available in the PASCAL data sets. Each data set contains thousands of images and each image has annotations specifying a bounding box and a class label for each target object present in the image. Note that this is a weakly labeled setting since the bounding boxes do not specify component labels or part locations.

We describe a procedure for initializing the structure of a mixture model and learning all parameters. Parameter learning is done by constructing a latent SVM training problem. We train the latent SVM using the coordinate-descent approach described in Section 4.2 together with the data-mining and gradient-descent algorithms that work with a cache of feature vectors from Section 4.5. Since the coordinate-descent method is susceptible to local minima, we must take care to ensure a good initialization of the model.

Now consider a background image $I \in N$. We do not want the object detector to “fire” in any location of the feature pyramid for $I$. This means the overall score (7) of every root location should be low. Let $G$ be a dense set of locations in the feature pyramid. We define a different negative example $x$ for each location $(i, j, l) \in G$. We define $Z(x)$ so that the level of the root filter specified by $z \in Z(x)$ is $l$ and the center of its detection window is $(i, j)$. Note that there are a very large number of negative examples obtained from each image. This is consistent with the requirement that a scanning window classifier should have a low false positive rate.

The procedure Train is outlined below. The outermost loop implements a fixed number of iterations of coordinate descent on $L_D(\beta, Z_p)$. Lines 3-6 implement the Relabel positives step. The resulting feature vectors, one per positive example, are stored in $F_p$. Lines 7-14 implement the Optimize beta step. Since the number of negative examples implicitly defined by $N$ is very large, we use the LSVM data-mining algorithm. We iterate data mining a fixed number of times rather than until convergence for practical reasons. At each iteration, we collect hard negative examples in $F_n$, train a new model using gradient descent, and then shrink $F_n$ by removing easy feature vectors. During data mining, we grow the cache by iterating over the images in $N$ sequentially, until we reach a memory limit. In practice we use = 0.002.
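The cache-based gradient-descent iteration of Section 4.5 (steps 1-5) and the easy set of (22) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training code used in the paper: the cache layout (a dictionary from example index to a list of feature vectors), the function names, and the learning-rate schedule $\alpha_t = 1/t$ are all hypothetical choices made for this sketch.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def cache_objective(beta, cache, labels, C):
    """L_F(beta) as in (20). cache maps example index i to the list V(i)."""
    loss = 0.0
    for i, vectors in cache.items():
        best = max(dot(beta, v) for v in vectors)
        loss += max(0.0, 1.0 - labels[i] * best)
    return 0.5 * dot(beta, beta) + C * loss


def sgd_on_cache(cache, labels, dim, C, iters=20000, seed=0):
    """Stochastic gradient descent on L_F following steps 1-5 of Section 4.5."""
    rng = random.Random(seed)
    beta = [0.0] * dim
    indices = sorted(cache)
    n = len(indices)  # n = |I(F)|
    for t in range(1, iters + 1):
        alpha = 1.0 / t  # step 1 (assumed schedule, not taken from the paper)
        i = rng.choice(indices)  # step 2: random example indexed by F
        v_i = max(cache[i], key=lambda v: dot(beta, v))  # step 3
        if labels[i] * dot(beta, v_i) >= 1.0:  # step 4: outside the margin
            beta = [b - alpha * b for b in beta]
        else:  # step 5: hinge term is active
            beta = [b - alpha * (b - C * n * labels[i] * vk)
                    for b, vk in zip(beta, v_i)]
    return beta


def easy_vectors(beta, cache, labels):
    """E(beta, F) as in (22): cached pairs strictly outside the margin."""
    return [(i, v) for i, vs in cache.items() for v in vs
            if labels[i] * dot(beta, v) > 1.0]
```

A shrink step of the data-mining loop would drop the pairs returned by `easy_vectors` from the cache; a grow step would add newly found hard feature vectors, which requires running the detector and is outside the scope of this sketch.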
Fig. 5. (a) and (b) The initial root filters for a car model (the result of Phase 1 of the initialization process). (c) The initial part-based model for a car
(the result of Phase 3 of the initialization process).
symmetric along the vertical axis. Filters that are positioned along the center vertical axis of the model are constrained to be self-symmetric. Part filters that are off-center have a symmetric part on the other side of the model. This effectively reduces the number of parameters to be learned in half.

5.2 Initialization

The LSVM coordinate-descent algorithm is susceptible to local minima and thus sensitive to initialization. This is a common limitation of other methods that use latent information as well. We initialize and train mixture models in three phases as follows:

Phase 1. Initializing Root Filters: For training a mixture model with $m$ components we sort the bounding boxes in $P$ by their aspect ratio and split them into $m$ groups of equal size $P_1, \ldots, P_m$. Aspect ratio is used as a simple indicator of extreme intraclass variation. We train $m$ different root filters $F_1, \ldots, F_m$, one for each group of positive bounding boxes.

To define the dimensions of $F_i$, we select the mean aspect ratio of the boxes in $P_i$ and the largest area not larger than 80 percent of the boxes. This ensures that, for most pairs $(I, B) \in P_i$, we can place $F_i$ in the feature pyramid of $I$ so that it significantly overlaps with $B$.

We train $F_i$ using a standard SVM, with no latent information, as in [10]. For $(I, B) \in P_i$, we warp the image region under $B$ so that its feature map has the same dimensions as $F_i$. This leads to a positive example. We select random subwindows of appropriate dimension from images in $N$ to define negative examples. Figs. 5a and 5b show the result of this phase when training a two-component car model.

Phase 2. Merging Components: We combine the initial root filters into a mixture model with no parts and retrain the parameters of the combined model using Train on the full (unsplit and without warping) data sets $P$ and $N$. In this case, the component label and root location are the only latent variables for each example. The coordinate-descent training algorithm can be thought of as a discriminative clustering method that alternates between assigning cluster (mixture) labels for each positive example and estimating cluster “means” (root filters).

Phase 3. Initializing Part Filters: We initialize the parts of each component using a simple heuristic. We fix the number of parts at six per component and, using a small pool of rectangular part shapes, we greedily place parts to cover high-energy regions of the root filter (see footnote 2). A part is either anchored along the central vertical axis of the root filter or it is off-center and has a symmetric part on the other side of the root filter. Once a part is placed, the energy of the covered portion of the root filter is set to zero, and we look for the next highest energy region, until six parts are chosen.

2. The “energy” of a region is defined by the norm of the positive weights in a subwindow.

The part filters are initialized by interpolating the root filter to twice the spatial resolution. The deformation parameters for each part are initialized to $d_i = (0, 0, .1, .1)$. This pushes part locations to be fairly close to their anchor position. Fig. 5c shows the results of this phase when training a two-component car model. The resulting model serves as the initial model for the last round of parameter learning. The final car model is shown in Fig. 9.

6 FEATURES

Here, we describe the 36-dimensional HOG features from [10] and introduce an alternative 13-dimensional feature set that captures essentially the same information (see footnote 3). We have found that augmenting this low-dimensional feature set to include both contrast sensitive and contrast insensitive features, leading to a 31-dimensional feature vector, improves performance for most classes of the PASCAL data sets.

3. There are some small differences between the 36-dimensional features defined here and the ones in [10], but we have found that these differences did not have any significant effect on the performance of our system.

6.1 HOG Features

6.1.1 Pixel-Level Feature Maps

Let $\theta(x, y)$ and $r(x, y)$ be the orientation and magnitude of the intensity gradient at a pixel $(x, y)$ in an image. As in [10], we compute gradients using finite difference filters, $[-1, 0, +1]$ and its transpose. For color images, we use the color channel with the largest gradient magnitude to define $\theta$ and $r$ at each pixel.

The gradient orientation at each pixel is discretized into one of the $p$ values using either a contrast sensitive ($B_1$), or insensitive ($B_2$), definition,
Fig. 6. PCA of HOG features. Each eigenvector is displayed as a 4 × 9 matrix so that each row corresponds to one normalization factor and each
column to one orientation bin. The eigenvalues are displayed on top of the eigenvectors. The linear subspace spanned by the top 11 eigenvectors
captures essentially all of the information in a feature vector. Note how all of the top eigenvectors are either constant along each column or row of the
matrix representation.
$$B_1(x, y) = \operatorname{round}\!\left(\frac{p\,\theta(x, y)}{2\pi}\right) \bmod p, \tag{23}$$

$$B_2(x, y) = \operatorname{round}\!\left(\frac{p\,\theta(x, y)}{\pi}\right) \bmod p. \tag{24}$$

Below, we use $B$ to denote either $B_1$ or $B_2$.

We define a pixel-level feature map that specifies a sparse histogram of gradient magnitudes at each pixel. Let $b \in \{0, \ldots, p - 1\}$ range over orientation bins. The feature vector at $(x, y)$ is

$$F(x, y)_b = \begin{cases} r(x, y), & \text{if } b = B(x, y), \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$

We can think of $F$ as an oriented edge map with $p$ orientation channels. For each pixel, we select a channel by discretizing the gradient orientation. The gradient magnitude can be seen as a measure of edge strength.

6.1.2 Spatial Aggregation

Let $F$ be a pixel-level feature map for a $w \times h$ image. Let $k > 0$ be a parameter specifying the side length of a square image region. We define a dense grid of rectangular “cells” and aggregate pixel-level features to obtain a cell-based feature map $C$, with feature vectors $C(i, j)$ for $0 \le i \le \lfloor (w - 1)/k \rfloor$ and $0 \le j \le \lfloor (h - 1)/k \rfloor$. This aggregation provides some invariance to small deformations and reduces the size of a feature map.

The simplest approach for aggregating features is to map each pixel $(x, y)$ into a cell $(\lfloor x/k \rfloor, \lfloor y/k \rfloor)$ and define the feature vector at a cell to be the sum (or average) of the pixel-level features in that cell.

Rather than mapping each pixel to a unique cell, we follow [10] and use a “soft binning” approach where each pixel contributes to the feature vectors in the four cells around it using bilinear interpolation.

6.1.3 Normalization and Truncation

Gradients are invariant to changes in bias. Invariance to gain can be achieved via normalization. Dalal and Triggs [10] used four different normalization factors for the feature vector $C(i, j)$. We can write these factors as $N_{\delta,\gamma}(i, j)$ with $\delta, \gamma \in \{-1, 1\}$,

$$N_{\delta,\gamma}(i, j) = \bigl( \|C(i, j)\|^2 + \|C(i + \delta, j)\|^2 + \|C(i, j + \gamma)\|^2 + \|C(i + \delta, j + \gamma)\|^2 \bigr)^{1/2}. \tag{26}$$

Each factor measures the “gradient energy” in a square block of four cells containing $(i, j)$.

Let $T_\alpha(v)$ denote the component-wise truncation of a vector $v$ by $\alpha$ (the $i$th entry in $T_\alpha(v)$ is the minimum of the $i$th entry of $v$ and $\alpha$). The HOG feature map is obtained by concatenating the result of normalizing the cell-based feature map $C$ with respect to each normalization factor followed by truncation,

$$H(i, j) = \begin{pmatrix} T_\alpha(C(i, j)/N_{-1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,+1}(i, j)) \\ T_\alpha(C(i, j)/N_{-1,+1}(i, j)) \end{pmatrix}. \tag{27}$$

Commonly used HOG features are defined using $p = 9$ contrast insensitive gradient orientations (discretized with $B_2$), a cell size of $k = 8$ and truncation $\alpha = 0.2$. This leads to a 36-dimensional feature vector. We used these parameters in the analysis described below.

6.2 PCA and Analytic Dimensionality Reduction

We collected a large number of 36-dimensional HOG features from different resolutions of a large number of images and performed PCA on these vectors. The principal components are shown in Fig. 6. The results lead to a number of interesting discoveries.

The eigenvalues indicate that the linear subspace spanned by the top 11 eigenvectors captures essentially all of the information in a HOG feature. In fact, we obtain the same detection performance in all categories of the PASCAL 2007 data set using the original 36-dimensional features or 11-dimensional features defined by projection to the top eigenvectors. Using lower dimensional features leads to models with fewer parameters and speeds up the detection and learning algorithms. We note, however, that some of the gain is lost because we need to perform a relatively costly projection step when computing feature pyramids.
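The orientation binning of (23) and (24), the pixel-level feature map of (25), and the simplest aggregation scheme of Section 6.1.2 can be illustrated with a short sketch. It is not the paper's implementation: it uses a single-channel image stored as a list of rows, clamps the finite differences at the image border, rounds halves upward, and uses hard assignment of pixels to cells rather than the soft binning the paper follows. All function names are hypothetical.

```python
import math


def gradient(img, x, y):
    """Finite differences with the [-1, 0, +1] filter, clamped at the border
    (the border handling is an assumption of this sketch)."""
    h, w = len(img), len(img[0])
    dx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
    dy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
    return dx, dy


def orientation_bin(dx, dy, p, contrast_sensitive):
    """B1 (contrast sensitive, period 2*pi) or B2 (insensitive, period pi)."""
    theta = math.atan2(dy, dx) % (2.0 * math.pi)
    period = 2.0 * math.pi if contrast_sensitive else math.pi
    return int(math.floor(p * theta / period + 0.5)) % p


def pixel_feature_map(img, p, contrast_sensitive):
    """F(x, y)_b as in (25): magnitude in the selected bin, zero elsewhere."""
    h, w = len(img), len(img[0])
    F = [[[0.0] * p for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = gradient(img, x, y)
            b = orientation_bin(dx, dy, p, contrast_sensitive)
            F[y][x][b] = math.hypot(dx, dy)
    return F


def aggregate_cells(F, k):
    """Sum pixel features into k-by-k cells: pixel (x, y) goes to (x//k, y//k)."""
    h, w, p = len(F), len(F[0]), len(F[0][0])
    cells = [[[0.0] * p for _ in range((w - 1) // k + 1)]
             for _ in range((h - 1) // k + 1)]
    for y in range(h):
        for x in range(w):
            for b in range(p):
                cells[y // k][x // k][b] += F[y][x][b]
    return cells
```

With `contrast_sensitive=False`, a gradient and its negation fall into the same bin, which is the property that distinguishes $B_2$ from $B_1$.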
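The normalization and truncation of Section 6.1.3, equations (26) and (27), can be sketched for a single cell as follows. The sketch assumes a cell map stored as `C[j][i]` (row `j`, column `i`), treats cells outside the grid as having zero energy, and adds a small `eps` to avoid division by zero; none of these choices come from the paper, and the function names are hypothetical.

```python
import math


def sq_norm(v):
    """Squared Euclidean norm of a list of numbers."""
    return sum(a * a for a in v)


def hog_cell(C, i, j, alpha=0.2, eps=1e-12):
    """H(i, j) as in (27): four normalized, truncated copies of C(i, j)."""
    rows, cols = len(C), len(C[0])

    def energy(a, b):
        # Squared norm of cell (a, b); zero outside the grid (assumption).
        if 0 <= a < cols and 0 <= b < rows:
            return sq_norm(C[b][a])
        return 0.0

    out = []
    # Same order of (delta, gamma) as the four rows of (27).
    for delta, gamma in [(-1, -1), (1, -1), (1, 1), (-1, 1)]:
        n = math.sqrt(energy(i, j) + energy(i + delta, j)
                      + energy(i, j + gamma) + energy(i + delta, j + gamma))
        out.extend(min(c / (n + eps), alpha) for c in C[j][i])
    return out
```

For $p = 9$ orientations the result has $4 \times 9 = 36$ entries, and scaling every cell by the same gain leaves the output essentially unchanged, which is the invariance the normalization is meant to provide.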
Recall that a 36-dimensional HOG feature is defined using four different normalizations of a 9-dimensional histogram over orientations. Thus, a 36-dimensional HOG feature is naturally viewed as a 4 × 9 matrix. The top eigenvectors in Fig. 6 have a very special structure: They are each (approximately) constant along each row or column of their matrix representation. Thus, the top eigenvectors lie (approximately) in a linear subspace defined by sparse vectors that have ones along a single row or column of their matrix representation.

Let $V = \{u_1, \ldots, u_9\} \cup \{v_1, \ldots, v_4\}$ with

$$u_k(i, j) = \begin{cases} 1, & \text{if } j = k, \\ 0, & \text{otherwise,} \end{cases} \tag{28}$$

$$v_k(i, j) = \begin{cases} 1, & \text{if } i = k, \\ 0, & \text{otherwise.} \end{cases} \tag{29}$$

We can define a 13-dimensional feature by taking the dot product of a 36-dimensional HOG feature with each $u_k$ and $v_k$. Projection into each $u_k$ is computed by summing over the four normalizations for a fixed orientation. Projection into each $v_k$ is computed by summing over nine orientations for a fixed normalization (see footnote 4).

4. The 13-dimensional feature is not a linear projection of the 36-dimensional feature into $V$ because the $u_k$ and $v_k$ are not orthogonal. In fact, the linear subspace spanned by $V$ has dimension 12.

As in the case of 11-dimensional PCA features, we obtain the same performance using the 36-dimensional HOG features or the 13-dimensional features defined by $V$. However, the computation of the 13-dimensional features is much less costly than performing projections to the top eigenvectors obtained via PCA since the $u_k$ and $v_k$ are sparse. Moreover, the 13-dimensional features have a simple interpretation as nine orientation features and four features that reflect the overall gradient energy in different areas around a cell.

We can also define low-dimensional features that are contrast sensitive. We have found that performance on some object categories improves using contrast sensitive features, while some categories benefit from contrast insensitive features. Thus, in practice, we use feature vectors that include both contrast sensitive and insensitive information.

Let $C$ be a cell-based feature map computed by aggregating a pixel-level feature map with nine contrast insensitive orientations. Let $D$ be a similar cell-based feature map computed using 18 contrast sensitive orientations. We define four normalization factors for the $(i, j)$ cell of $C$ and $D$ using $C$ as in (26). We can normalize and truncate $C(i, j)$ and $D(i, j)$ using these factors to obtain $4 \times (9 + 18) = 108$ dimensional feature vectors, $F(i, j)$. In practice, we use an analytic projection of these 108-dimensional vectors, defined by 27 sums over different normalizations, one for each orientation channel of $F$, and four sums over the nine contrast insensitive orientations, one for each normalization factor. We use a cell size of $k = 8$ and truncation value of $\alpha = 0.2$. The final feature map has 31-dimensional vectors $G(i, j)$, with 27 dimensions corresponding to different orientation channels (9 contrast insensitive and 18 contrast sensitive) and 4 dimensions capturing the overall gradient energy in square blocks of four cells around $(i, j)$.

Finally, we note that the top eigenvectors in Fig. 6 can be roughly interpreted as a two-dimensional separable Fourier basis. Each eigenvector can be roughly seen as a sine or cosine function of one variable. This observation could be used to define features using a finite number of Fourier basis functions instead of a finite number of discrete orientations.

The appearance of a Fourier basis in Fig. 6 is an interesting empirical result. The eigenvectors of a $d \times d$ covariance matrix $\Sigma$ form a Fourier basis when $\Sigma$ is circulant, i.e., $\Sigma_{i,j} = k(i - j \bmod d)$ for some function $k$. Circulant covariance matrices arise from probability distributions on vectors that are invariant to rotation of the vector coordinates. The appearance of a two-dimensional Fourier basis in Fig. 6 is evidence that the distribution of HOG feature vectors on natural images has (approximately) a two-dimensional rotational invariance. We can rotate the orientation bins and independently rotate the four normalization blocks.

7 POSTPROCESSING

7.1 Bounding Box Prediction

The desired output of an object detection system is not entirely clear. The goal in the PASCAL challenge is to predict the bounding boxes of objects. In our previous work [17], we reported bounding boxes derived from root filter locations. Yet detection with one of our models localizes each part filter in addition to the root filter. Furthermore, part filters are localized with greater spatial precision than root filters. It is clear that our original approach discards potentially valuable information gained from using a multiscale deformable part model.

In the current system, we use the complete configuration of an object hypothesis, $z$, to predict a bounding box for the object. This is implemented using functions that map a feature vector $g(z)$ to the upper left, $(x_1, y_1)$, and lower right, $(x_2, y_2)$, corners of the bounding box. For a model with $n$ parts, $g(z)$ is a $2n + 3$ dimensional vector containing the width of the root filter in image pixels (this provides scale information) and the location of the upper left corner of each filter in the image.

Each object in the PASCAL training data is labeled by a bounding box. After training a model, we use the output of our detector on each instance to learn four linear functions for predicting $x_1$, $y_1$, $x_2$, and $y_2$ from $g(z)$. This is done via linear least-squares regression, independently for each component of a mixture model.

Fig. 7 illustrates an example of bounding box prediction for a car detection. This simple method yields small but noticeable improvements in performance for some categories in the PASCAL data sets (see Section 8).

7.2 Nonmaximum Suppression

Using the matching procedure from Section 3.2, we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via nonmaximum suppression.

After applying the bounding box prediction method described above, we have a set of detections $D$ for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in $D$ by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50 percent covered by a bounding box of a previously selected detection.
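The analytic 13-dimensional projection defined by (28) and (29), and the claim in footnote 4 that $V$ spans a 12-dimensional subspace, can be checked with a small sketch. It assumes the 36-dimensional feature is stored normalization-major (four consecutive blocks of nine orientations); the function names are hypothetical, and the rank is computed with exact rational arithmetic to avoid tolerance questions.

```python
from fractions import Fraction


def project_13(h36):
    """Nine sums over normalizations (u_k) followed by four sums over
    orientations (v_k), for a feature stored as 4 blocks of 9 values."""
    rows = [h36[9 * n: 9 * (n + 1)] for n in range(4)]
    orient = [sum(rows[n][k] for n in range(4)) for k in range(9)]
    energy = [sum(rows[n]) for n in range(4)]
    return orient + energy


def basis_vectors():
    """u_1..u_9 and v_1..v_4 of (28) and (29), flattened to length 36."""
    V = []
    for k in range(9):
        V.append([1 if j == k else 0 for i in range(4) for j in range(9)])
    for k in range(4):
        V.append([1 if i == k else 0 for i in range(4) for j in range(9)])
    return V


def rank(matrix):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r
```

The rank comes out as 12 rather than 13 because the nine $u_k$ and the four $v_k$ both sum to the all-ones vector, which gives exactly one linear dependency among the 13 vectors.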
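The greedy nonmaximum suppression of Section 7.2 can be sketched as follows. The sketch assumes boxes given as `(x1, y1, x2, y2)` tuples with continuous coordinates and measures coverage as the fraction of the candidate box's own area that lies inside a previously selected box; the function names and the box convention are assumptions of this illustration.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def covered_fraction(box, other):
    """Fraction of `box` that is covered by `other`."""
    inter = (max(box[0], other[0]), max(box[1], other[1]),
             min(box[2], other[2]), min(box[3], other[3]))
    a = area(box)
    return area(inter) / a if a > 0 else 0.0


def nms(detections, max_cover=0.5):
    """Greedy selection by score; skip boxes at least 50 percent covered
    by a box that was already selected."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(covered_fraction(box, k[0]) < max_cover for k in kept):
            kept.append((box, score))
    return kept
```

Note that this coverage measure is asymmetric: a small box lying inside a large selected box is suppressed, while a large box that contains a small selected box may survive.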
Fig. 7. A car detection and the bounding box predicted from the object
configuration.
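The bounding box prediction of Section 7.1 fits four linear functions of $g(z)$ by least squares. A minimal sketch under stated assumptions follows: it solves the normal equations with plain Gaussian elimination, uses no intercept term, and would be run once per mixture component. The function names are hypothetical, and the solver fails with a division by zero if the training vectors do not span the feature space.

```python
def solve(A, b):
    """Solve A x = b for square A by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * bb for a, bb in zip(M[r], M[c])]
    x = [0.0] * n
    for c in reversed(range(n)):
        tail = sum(M[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (M[c][n] - tail) / M[c][c]
    return x


def fit_box_predictor(G, boxes):
    """Least-squares weights for x1, y1, x2, y2 given rows g(z) in G."""
    d = len(G[0])
    AtA = [[sum(g[a] * g[b] for g in G) for b in range(d)] for a in range(d)]
    W = []
    for c in range(4):
        Atb = [sum(g[a] * box[c] for g, box in zip(G, boxes))
               for a in range(d)]
        W.append(solve(AtA, Atb))
    return W


def predict_box(W, g):
    """Apply the four learned linear functions to one vector g(z)."""
    return tuple(sum(w * x for w, x in zip(wc, g)) for wc in W)
```

For example, with a root-only model ($n = 0$) the vector $g(z)$ has three entries: the root width and the coordinates of the root's upper left corner.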
Fig. 9. Some of the models learned on the PASCAL 2007 data set.
positive detection. Multiple detections are penalized. If a system predicts several bounding boxes that overlap with a single ground-truth bounding box, only one prediction is considered correct; the others are considered false positives. One scores a system by the AP of its precision-recall curve across a test set.

We trained a two-component model for each class in each data set. Fig. 9 shows some of the models learned on the 2007 data set. Fig. 10 shows some detections we obtain using those models. We show both high-scoring correct detections and high-scoring false positives.

In some categories, our false detections are often due to confusion among classes, such as between horse and cow or between car and bus. In other categories, false detections are often due to the relatively strict bounding box criteria. The two false positives shown for the person category are due to insufficient overlap with the ground-truth bounding box. In the cat category, we often detect the face of a cat and report a
Fig. 10. Examples of high-scoring detections on the PASCAL 2007 data set, selected from the top 20 highest scoring detections in each class. The
framed images (last two in each row) illustrate false positives for each category. Many false positives (such as for person and cat) are due to the
bounding box scoring criteria.
bounding box that is too small because it does not include the rest of the cat. In fact, the top 20 highest scoring false positive bounding boxes for the cat category correspond to the face of a cat. This is an extreme case, but it gives an explanation for our low AP score in this category. In many of the positive training examples for cats only the face is visible, and we learn a model where one of the components corresponds to a cat face model (see Fig. 9).

Tables 1 and 2 summarize the results of our system on the 2006 and 2007 challenge data sets. Table 3 summarizes the results on the more recent 2008 data set, together with the systems that entered the official competition in 2008.
TABLE 2
PASCAL VOC 2007 Results
(a) Average precision scores of the base system, (b) scores using bounding box prediction, and (c) scores using bounding box prediction and context
rescoring.
TABLE 3
PASCAL VOC 2008 Results
Top: (a) Average precision scores of the base system, (b) scores using bounding box prediction, (c) scores using bounding box prediction and
context rescoring, and (d) ranking of final scores relative to systems in the 2008 competition. Bottom: The systems that participated in the
competition (UofCTTIUCI is a preliminary version of our system and we don’t include it in the ranking).
(Pedro J. Felzenszwalb and Ross B. Girshick), IIS 0811340 (David McAllester), and IIS 0812428 (Deva Ramanan).

REFERENCES
[1] Y. Amit and A. Kong, “Graphical Templates for Model Registration,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 3, pp. 225-236, Mar. 1996.
[2] Y. Amit and A. Trouve, “POP: Patchwork of Parts Models for Object Recognition,” Int’l J. Computer Vision, vol. 75, no. 2, pp. 267-282, 2007.
[3] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support Vector Machines for Multiple-Instance Learning,” Proc. Advances in Neural Information Processing Systems, 2003.
[4] A. Bar-Hillel and D. Weinshall, “Efficient Learning of Relational Object Class Models,” Int’l J. Computer Vision, vol. 77, no. 1, pp. 175-198, 2008.
[5] E. Bernstein and Y. Amit, “Part-Based Statistical Models for Object Classification and Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[6] M. Burl, M. Weber, and P. Perona, “A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry,” Proc. European Conf. Computer Vision, 1998.
[7] T. Cootes, G. Edwards, and C. Taylor, “Active Appearance Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001.
[8] J. Coughlan, A. Yuille, C. English, and D. Snow, “Efficient Deformable Template Detection and Localization without User Initialization,” Computer Vision and Image Understanding, vol. 78, no. 3, pp. 303-319, June 2000.
[9] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, “Spatial Priors for Part-Based Recognition Using Statistical Models,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[10] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[11] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/, 2007.
[12] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2008 (VOC 2008) Results,” http://www.pascal-network.org/challenges/VOC/voc2008/, 2008.
[13] M. Everingham, A. Zisserman, C.K.I. Williams, and L. Van Gool, “The PASCAL Visual Object Classes Challenge 2006 (VOC 2006) Results,” http://www.pascal-network.org/challenges/VOC/voc2006/, 2006.
[14] P. Felzenszwalb and D. Huttenlocher, “Distance Transforms of Sampled Functions,” Technical Report 2004-1963, Cornell Univ. CIS, 2004.
[15] P. Felzenszwalb and D. Huttenlocher, “Pictorial Structures for Object Recognition,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
[16] P. Felzenszwalb and D. McAllester, “The Generalized A* Architecture,” J. Artificial Intelligence Research, vol. 29, pp. 153-190, 2007.
[17] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A Discriminatively Trained, Multiscale, Deformable Part Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[23] A. Holub and P. Perona, “A Discriminative Framework for Modelling Object Classes,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[24] Y. Jin and S. Geman, “Context and Hierarchy in a Probabilistic Image Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[25] T. Joachims, “Making Large-Scale SVM Learning Practical,” Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, 1999.
[26] Y. Ke and R. Sukthankar, “PCA-SIFT: A More Distinctive Representation for Local Image Descriptors,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
[27] Y. LeCun, S. Chopra, R. Hadsell, R. Marc’Aurelio, and F. Huang, “A Tutorial on Energy-Based Learning,” Predicting Structured Data, G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, eds., MIT Press, 2006.
[28] B. Leibe, A. Leonardis, and B. Schiele, “Robust Object Detection with Interleaved Categorization and Segmentation,” Int’l J. Computer Vision, vol. 77, no. 1, pp. 259-289, 2008.
[29] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.
[30] C. Papageorgiou, M. Oren, and T. Poggio, “A General Framework for Object Detection,” Proc. IEEE Int’l Conf. Computer Vision, 1998.
[31] W. Plantinga and C. Dyer, “An Algorithm for Constructing the Aspect Graph,” Proc. 27th Ann. Symp. Foundations of Computer Science, 1985, pp. 123-131, 1986.
[32] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell, “Hidden Conditional Random Fields,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1848-1852, Oct. 2007.
[33] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in Context,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[34] D. Ramanan and C. Sminchisescu, “Training Deformable Models for Localization,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[35] H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes,” Technical Report CMU-CS-95-158R, Carnegie Mellon Univ., 1995.
[36] H. Schneiderman and T. Kanade, “A Statistical Method for 3d Object Detection Applied to Faces and Cars,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[37] S. Shalev-Shwartz, Y. Singer, and N. Srebro, “Pegasos: Primal Estimated Sub-Gradient Solver for SVM,” Proc. Int’l Conf. Machine Learning, 2007.
[38] K. Sung and T. Poggio, “Example-Based Learning for View-Based Human Face Detection,” Technical Report A.I. Memo No. 1521, Massachusetts Inst. of Technology, 1994.
[39] A. Torralba, “Contextual Priming for Object Detection,” Int’l J. Computer Vision, vol. 53, no. 2, pp. 169-191, July 2003.
[40] P. Viola, J. Platt, and C. Zhang, “Multiple Instance Boosting for Object Detection,” Proc. Advances in Neural Information Processing Systems, 2005.
[41] P. Viola and M. Jones, “Robust Real-Time Face Detection,” Int’l J. Computer Vision, vol. 57, no. 2, pp. 137-154, May 2004.
[42] M. Weber, M. Welling, and P. Perona, “Towards Automatic Discovery of Object Categories,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[43] A. Yuille, P. Hallinan, and D. Cohen, “Feature Extraction from Faces Using Deformable Templates,” Int’l J. Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.
[18] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition
by Unsupervised Scale-Invariant Learning,” Proc. IEEE Conf. [44] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local
Computer Vision and Pattern Recognition, 2003. Features and Kernels for Classification of Texture and Object
[19] R. Fergus, P. Perona, and A. Zisserman, “A Sparse Object Categories: A Comprehensive Study,” Int’l J. Computer Vision,
Category Model for Efficient Learning and Exhaustive Recogni- vol. 73, no. 2, pp. 213-238, June 2007.
tion,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, [45] S. Zhu and D. Mumford, “A Stochastic Grammar of Images,”
2005. Foundations and Trends in Computer Graphics and Vision, vol. 2,
[20] M. Fischler and R. Elschlager, “The Representation and Matching no. 4, pp. 259-362, 2007.
of Pictorial Structures,” IEEE Trans. Computers, vol. 22, no. 1,
pp. 67-92, Jan. 1973.
[21] U. Grenander, Y. Chow, and D. Keenan, HANDS: A Pattern-
Theoretic Study of Biological Shapes. Springer-Verlag, 1991.
[22] D. Hoiem, A. Efros, and M. Hebert, “Putting Objects in
Perspective,” Int’l J. Computer Vision, vol. 80, no. 1, pp. 3-15,
Oct. 2008.
Pedro F. Felzenszwalb received the BS degree in computer science from Cornell University in 1999. He received the PhD degree in computer science from the Massachusetts Institute of Technology (MIT) in 2003. After leaving MIT, he spent one year as a postdoctoral researcher at Cornell University. He joined the Department of Computer Science at the University of Chicago in 2004, where he is currently an associate professor. His work has been supported by the US National Science Foundation, including the CAREER award received in 2008. His main research interests are in computer vision, geometric algorithms, and artificial intelligence. He is currently an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He is a member of the IEEE Computer Society.

Ross B. Girshick received the BS degree in computer science from Brandeis University in 2004. He recently completed his second year of PhD studies at the University of Chicago under the supervision of Professor Pedro F. Felzenszwalb. His current research focuses primarily on the detection and localization of rich object categories in natural images. Before starting his PhD studies, he worked as a software engineer for three years. He is a student member of the IEEE.

David McAllester received the BS, MS, and PhD degrees from the Massachusetts Institute of Technology (MIT) in 1978, 1979, and 1987, respectively. He served on the faculty of Cornell University for the academic year of 1987-1988 and served on the faculty of MIT from 1988 to 1995. He was a member of the technical staff at AT&T Labs-Research from 1995 to 2002. Since 2002, he has been a professor and a chief academic officer at the Toyota Technological Institute at Chicago. He has been a fellow of the American Association of Artificial Intelligence (AAAI) since 1997. He has more than 80 refereed publications. His research is currently focused on applications of machine learning to computer vision. His past research has included machine learning theory, the theory of programming languages, automated reasoning, AI planning, computer game playing (computer chess), and computational linguistics. A 1991 paper on AI planning proved to be one of the most influential papers of the decade in that area. A 1993 paper on computer game algorithms influenced the design of the algorithms used in the Deep Blue system that defeated Garry Kasparov. A 1998 paper on machine learning theory introduced PAC-Bayesian theorems which combine Bayesian and non-Bayesian methods.

Deva Ramanan received the PhD degree in electrical engineering and computer science from the University of California, Berkeley, in 2005. He is an assistant professor of computer science at the University of California, Irvine. From 2005 to 2007, he was a research assistant professor at the Toyota Technological Institute at Chicago. His work has been supported by the US National Science Foundation (NSF), both through large-scale grants and a graduate fellowship, and the University of California Micro program. He regularly reviews for major computer vision journals, has served on panels for NSF, and has served on program committees for major conferences in computer vision, machine learning, and artificial intelligence. His interests include object recognition, tracking, and activity recognition. He is a member of the IEEE.