Object Detection With Discriminatively Trained Part Based Models
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \tag{1}$$
Here $\beta$ is a vector of model parameters, $z$ are latent
values, and $\Phi(x, z)$ is a feature vector. In the case of one
of our star models $\beta$ is the concatenation of the root
filter, the part filters, and deformation cost weights, $z$ is
a specification of the object configuration, and $\Phi(x, z)$ is
a concatenation of subwindows from a feature pyramid
and part deformation features.
We note that (1) can handle very general forms of
latent information. For example, z could specify a deriva-
tion under a rich visual grammar.
Our second class of models represents an object cate-
gory by a mixture of star models. The score of a mixture
model at a particular position and scale is the maximum
over components, of the score of that component model
at the given location. In this case the latent information,
$z$, specifies a component label and a configuration for
that component. Figure 2 shows a mixture model for the
bicycle category.
To obtain high performance using discriminative training
it is often important to use large training sets. In the
case of object detection the training problem is highly
unbalanced because there is vastly more background than
objects. This motivates a process of searching through
the background data to find a relatively small number
of potential false positives, or hard negative examples.
A methodology of data-mining for hard negative examples
was adopted by Dalal and Triggs [10] but goes
back at least to the bootstrapping methods used by [38]
and [35]. Here we analyze data-mining algorithms for
SVM and LSVM training. We prove that data-mining
methods can be made to converge to the optimal model
defined in terms of the entire training set.

Fig. 2. Detections obtained with a 2 component bicycle model. These examples illustrate the importance of
deformations and mixture models. In this model the first component captures sideways views of bicycles while the second
component captures frontal and near frontal views. The sideways component can deform to match a wheelie.

Our object models are defined by filters that score
subwindows of a feature pyramid. We have investigated
feature sets similar to the HOG features from [10] and
found lower dimensional features which perform as well
as the original ones. By doing principal component analysis
on HOG features the dimensionality of the feature
vector can be significantly reduced with no noticeable
loss of information. Moreover, by examining the principal
eigenvectors we discover structure that leads to
analytic versions of low-dimensional features which
are easily interpretable and can be computed efficiently.
We have also considered some specific problems that
arise in the PASCAL object detection challenge and similar
datasets. We show how the locations of parts in an
object hypothesis can be used to predict a bounding box
for the object. This is done by training a model specific
predictor using least-squares regression. We also demonstrate
a simple method for aggregating the output of
several object detectors. The basic idea is that objects of
some categories provide evidence for, or against, objects
of other categories in the same image. We exploit this
idea by training a category specific classifier that rescores
every detection of that category using its original score
and the highest scoring detection from each of the other
categories.
2 RELATED WORK
There is a significant body of work on deformable models
of various types for object detection, including several
kinds of deformable template models (e.g. [7], [8], [21],
[43]), and a variety of part-based models (e.g. [2], [6], [9],
[15], [18], [20], [28], [42]).
In the constellation models from [18], [42] parts are
constrained to be in a sparse set of locations determined
by an interest point operator, and their geometric arrangement
is captured by a Gaussian distribution. In
contrast, pictorial structure models [15], [20] define a
matching problem where parts have an individual match
cost in a dense set of locations, and their geometric
arrangement is captured by a set of springs connecting
pairs of parts. The patchwork of parts model from [2] is
similar, but it explicitly considers how the appearance
models of overlapping parts interact.
Our models are largely based on the pictorial structures
framework from [15], [20]. We use a dense set of
possible positions and scales in an image, and define
a score for placing a filter at each of these locations.
The geometric configuration of the filters is captured by
a set of deformation costs (springs) connecting each
part filter to the root filter, leading to a star-structured
pictorial structure model. Note that we do not model
interactions between overlapping parts. While we might
benefit from modeling such interactions, this does not
appear to be a problem when using models trained with
a discriminative procedure, and it significantly simplifies
the problem of matching a model to an image.
The introduction of new local and semi-local features
has played an important role in advancing the perfor-
mance of object recognition methods. These features are
typically invariant to illumination changes and small
deformations. Many recent approaches use wavelet-like
features [30], [41] or locally-normalized histograms of
gradients [10], [29]. Other methods, such as [5], learn
dictionaries of local structures from training images. In
our work, we use histogram of gradient (HOG) features
from [10] as a starting point, and introduce a variation
that reduces the feature size with no loss in performance.
As in [26] we used principal component analysis (PCA)
to discover low dimensional features, but we note that
the eigenvectors we obtain have a clear structure that
leads to a new set of analytic features. This removes
the need to perform a costly projection step when com-
puting dense feature maps.
Significant variations in shape and appearance, such as
caused by extreme viewpoint changes, are not well captured
by a 2D deformable model. Aspect graphs [31] are
a classical formalism for capturing significant changes
that are due to viewpoint variation. Mixture models
provide a simpler alternative approach. For example, it
is common to use multiple templates to encode frontal
and side views of faces and cars [36]. Mixture models
have been used to capture other aspects of appearance
variation as well, such as when there are multiple natural
subclasses in an object category [5].
Matching a deformable model to an image is a difficult
optimization problem. Local search methods require
initialization near the correct solution [2], [7], [43]. To
guarantee a globally optimal match, more aggressive
search is needed. One popular approach for part-based
models is to restrict part locations to a small set of
possible locations returned by an interest point detector
[1], [18], [42]. Tree (and star) structured pictorial structure
models [9], [15], [19] allow for the use of dynamic
programming and generalized distance transforms to
efficiently search over all possible object configurations
in an image, without restricting the possible locations
for each part. We use these techniques for matching our
models to images.
Part-based deformable models are parameterized by
the appearance of each part and a geometric model
capturing spatial relationships among parts. For gen-
erative models one can learn model parameters using
maximum likelihood estimation. In a fully-supervised
setting training images are labeled with part locations
and models can often be learned using simple methods
[9], [15]. In a weakly-supervised setting training images
may not specify locations of parts. In this case one can
simultaneously estimate part locations and learn model
parameters with EM [2], [18], [42].
Discriminative training methods select model param-
eters so as to minimize the mistakes of a detection algo-
rithm on a set of training images. Such approaches di-
rectly optimize the decision boundary between positive
and negative examples. We believe this is one reason for
the success of simple models trained with discriminative
methods, such as the Viola-Jones [41] and Dalal-Triggs
[10] detectors. It has been more difficult to train part-based
models discriminatively, though strategies exist
[4], [23], [32], [34].
Latent SVMs are related to hidden CRFs [32]. However,
in a latent SVM we maximize over latent part locations
as opposed to marginalizing over them, and we use
a hinge-loss rather than log-loss in training. This leads
to an efficient coordinate-descent style algorithm for
training, as well as a data-mining algorithm that allows
for learning with very large datasets. A latent SVM can
be viewed as a type of energy-based model [27].
A latent SVM is equivalent to the MI-SVM formulation
of multiple instance learning (MIL) in [3], but we find
the latent variable formulation more natural for the problems
we are interested in.$^1$ A different MIL framework
was previously used for training object detectors with
weakly labeled data in [40].
Our method for data-mining hard examples during
training is related to working set methods for SVMs (e.g.
[25]). The approach described here requires relatively
few passes through the complete set of training examples
and is particularly well suited for training with very
large data sets, where only a fraction of the examples
can fit in RAM.
The use of context for object detection and recognition
has received increasing attention in recent years.
Some methods (e.g. [39]) use low-level holistic image features
for defining likely object hypotheses. The method
in [22] uses a coarse but semantically rich representation
of a scene, including its 3D geometry, estimated using a
variety of techniques. Here we define the context of an
image using the results of running a variety of object
detectors in the image. The idea is related to [33] where
a CRF was used to capture co-occurrences of objects,
although we use a very different approach to capture
this information.
A preliminary version of our system was described in
[17]. The system described here differs from the one in
[17] in several ways, including: the introduction of mixture
models; here we optimize the true latent SVM objective
function using stochastic gradient descent, while
in [17] we used an SVM package to optimize a heuristic
approximation of the objective; here we use new features
that are both lower-dimensional and more informative;
we now post-process detections via bounding box prediction
and context rescoring.

1. We defined a latent SVM in [17] before realizing the relationship
to MI-SVM.

Fig. 3. A feature pyramid and an instantiation of a person
model within that pyramid. The part filters are placed at
twice the spatial resolution of the placement of the root.
(Panel labels: image pyramid, feature pyramid.)

3 MODELS
All of our models involve linear filters that are applied
to dense feature maps. A feature map is an array whose
entries are $d$-dimensional feature vectors computed from
a dense grid of locations in an image. Intuitively each
feature vector describes a local image patch. In practice
we use a variation of the HOG features from [10], but the
framework described here is independent of the specific
choice of features.
A filter is a rectangular template defined by an array
of $d$-dimensional weight vectors. The response, or score,
of a filter $F$ at a position $(x, y)$ in a feature map $G$ is
the dot product of the filter and a subwindow of the
feature map with top-left corner at $(x, y)$,
$$\sum_{x', y'} F[x', y'] \cdot G[x + x', y + y'].$$
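As an illustration of the filter response defined above, the following Python sketch evaluates the score of a filter at a single position. The nested-list layout `[row][column][feature]`, the function name `filter_response`, and the toy values are assumptions made for this example only.

```python
def filter_response(F, G, x, y):
    """Score of filter F placed with its top-left corner at (x, y) in feature map G.

    F and G are nested lists indexed as [row][col][feature]; this layout is an
    assumption of the sketch.
    """
    total = 0.0
    for yp, filter_row in enumerate(F):            # y' ranges over filter rows
        for xp, weights in enumerate(filter_row):  # x' ranges over filter columns
            features = G[y + yp][x + xp]
            total += sum(w * g for w, g in zip(weights, features))
    return total


# Toy data: a 2x2 filter with d = 2 applied to a 3x3 feature map.
F = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [1.0, -1.0]]]
G = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[0.0, 1.0], [2.0, 2.0], [1.0, 0.0]],
     [[4.0, 4.0], [1.0, 3.0], [2.0, 5.0]]]
score_00 = filter_response(F, G, 0, 0)
score_11 = filter_response(F, G, 1, 1)
```

Evaluating this function at every valid position gives the dense response array that a sliding-window detector thresholds.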
We would like to define a score at different positions
and scales in an image. This is done using a feature
pyramid, which specifies a feature map for a finite
number of scales in a fixed range. In practice we compute
feature pyramids by computing a standard image
pyramid via repeated smoothing and subsampling, and
then computing a feature map from each level of the
image pyramid. Figure 3 illustrates the construction.
The scale sampling in a feature pyramid is determined
by a parameter $\lambda$ defining the number of levels in an
octave. That is, $\lambda$ is the number of levels we need to go
down in the pyramid to get to a feature map computed
at twice the resolution of another one. In practice we
have used $\lambda = 5$ in training and $\lambda = 10$ at test time. Fine
sampling of scale space is important for obtaining high
performance with our models.
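The role of the levels-per-octave parameter can be made concrete with a short Python sketch. It assumes, for illustration only, that level 0 is the full-resolution image and that each successive level is smaller by a constant factor of 2^(1/lam); the function name `pyramid_scales` is hypothetical.

```python
def pyramid_scales(lam, num_octaves):
    """Relative resolution of each pyramid level, finest level first.

    Level 0 is taken to be the full-resolution image (scale 1.0); each
    subsequent level shrinks the image by a factor of 2 ** (1 / lam).
    Both conventions are assumptions of this sketch.
    """
    return [2.0 ** (-level / lam) for level in range(lam * num_octaves + 1)]


scales = pyramid_scales(5, 2)   # 5 levels per octave, two octaves
```

Under this indexing, moving `lam` levels toward index 0 doubles the resolution, which is the relationship between a root level and its part level.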
The system in [10] uses a single filter to define an
object model. That system detects objects from a particular
category by computing the score of the filter at
each position and scale of a HOG feature pyramid and
thresholding the scores.
Let $F$ be a $w \times h$ filter. Let $H$ be a feature pyramid
and $p = (x, y, l)$ specify a position $(x, y)$ in the $l$-th
level of the pyramid. Let $\phi(H, p, w, h)$ denote the vector
obtained by concatenating the feature vectors in the $w \times h$
subwindow of $H$ with top-left corner at $p$ in row-major
order. The score of $F$ at $p$ is $F' \cdot \phi(H, p, w, h)$, where $F'$ is
the vector obtained by concatenating the weight vectors
in $F$ in row-major order. Below we write $F' \cdot \phi(H, p)$ since
the subwindow dimensions are implicitly defined by the
dimensions of the filter $F$.
3.1 Deformable Part Models
Our star models are defined by a coarse root filter that
approximately covers an entire object and higher resolution
part filters that cover smaller parts of the object.
Figure 3 illustrates an instantiation of such a model
in a feature pyramid. The root filter location defines a
detection window (the pixels contributing to the part of
the feature map covered by the filter). The part filters
are placed $\lambda$ levels down in the pyramid, so the features
at that level are computed at twice the resolution of the
features in the root filter level.
We have found that using higher resolution features
for defining part filters is essential for obtaining high
recognition performance. With this approach the part
filters capture finer resolution features that are localized
to greater accuracy when compared to the features captured
by the root filter. Consider building a model for a
face. The root filter could capture coarse resolution edges
such as the face boundary while the part filters could
capture details such as eyes, nose and mouth.
A model for an object with $n$ parts is formally defined
by a $(n + 2)$-tuple $(F_0, P_1, \ldots, P_n, b)$ where $F_0$ is a root
filter, $P_i$ is a model for the $i$-th part and $b$ is a real-valued
bias term. Each part model is defined by a 3-tuple
$(F_i, v_i, d_i)$ where $F_i$ is a filter for the $i$-th part, $v_i$ is a
two-dimensional vector specifying an anchor position
for part $i$ relative to the root position, and $d_i$ is a four-dimensional
vector specifying coefficients of a quadratic
function defining a deformation cost for each possible
placement of the part relative to the anchor position.
An object hypothesis specifies the location of each
filter in the model in a feature pyramid, $z = (p_0, \ldots, p_n)$,
where $p_i = (x_i, y_i, l_i)$ specifies the level and position of
the $i$-th filter. We require that the level of each part is
such that the feature map at that level was computed at
twice the resolution of the root level, $l_i = l_0 - \lambda$ for $i > 0$.
The score of a hypothesis is given by the scores of each
filter at their respective locations (the data term) minus
a deformation cost that depends on the relative position
of each part with respect to the root (the spatial prior),
plus the bias,
$$\text{score}(p_0, \ldots, p_n) = \sum_{i=0}^{n} F'_i \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b, \tag{2}$$
where
$$(dx_i, dy_i) = (x_i, y_i) - (2(x_0, y_0) + v_i) \tag{3}$$
gives the displacement of the $i$-th part relative to its
anchor position and
$$\phi_d(dx, dy) = (dx, dy, dx^2, dy^2) \tag{4}$$
are deformation features.
Note that if $d_i = (0, 0, 1, 1)$ the deformation cost for
the $i$-th part is the squared distance between its actual
position and its anchor position relative to the root. In
general the deformation cost is an arbitrary separable
quadratic function of the displacements.
The bias term is introduced in the score to make the
scores of multiple models comparable when we combine
them into a mixture model.
The score of a hypothesis $z$ can be expressed in terms
of a dot product, $\beta \cdot \psi(H, z)$, between a vector of model
parameters $\beta$ and a vector $\psi(H, z)$,
$$\beta = (F'_0, \ldots, F'_n, d_1, \ldots, d_n, b). \tag{5}$$
$$\psi(H, z) = (\phi(H, p_0), \ldots, \phi(H, p_n), -\phi_d(dx_1, dy_1), \ldots, -\phi_d(dx_n, dy_n), 1). \tag{6}$$
This illustrates a connection between our models and
linear classifiers. We use this relationship for learning
the model parameters with the latent SVM framework.
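The equivalence between the hypothesis score and a single dot product can be checked numerically. The Python sketch below is a minimal illustration under stated assumptions: filters and subwindows are given as already-flattened lists, pyramid levels are left implicit, and all names and toy values are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def deformation_features(dx, dy):
    """Deformation features (dx, dy, dx^2, dy^2) of equation (4)."""
    return [dx, dy, dx * dx, dy * dy]


def displacement(positions, anchors, i):
    """Equation (3): displacement of part i (i >= 1) from its anchor."""
    x0, y0 = positions[0]
    vx, vy = anchors[i - 1]
    return positions[i][0] - (2 * x0 + vx), positions[i][1] - (2 * y0 + vy)


def hypothesis_score(filters, phis, anchors, deform, bias, positions):
    """Equation (2): filter scores minus deformation costs plus the bias."""
    score = sum(dot(f, phi) for f, phi in zip(filters, phis)) + bias
    for i in range(1, len(filters)):
        dx, dy = displacement(positions, anchors, i)
        score -= dot(deform[i - 1], deformation_features(dx, dy))
    return score


def beta_and_psi(filters, phis, anchors, deform, bias, positions):
    """Flat parameter vector (equation 5) and feature vector (equation 6)."""
    beta = [w for f in filters for w in f]
    psi = [v for phi in phis for v in phi]
    for i in range(1, len(filters)):
        dx, dy = displacement(positions, anchors, i)
        beta = beta + list(deform[i - 1])
        psi = psi + [-v for v in deformation_features(dx, dy)]
    return beta + [bias], psi + [1.0]


# Toy model with a root filter and one part (n = 1).
filters = [[1.0, 2.0], [0.5, -1.0]]      # flattened root and part filters
phis = [[3.0, 1.0], [2.0, 4.0]]          # flattened subwindows at p_0 and p_1
anchors = [(1, 0)]                       # v_1
deform = [[0.0, 0.0, 1.0, 1.0]]          # d_1
bias = -0.5
positions = [(2, 3), (6, 5)]             # (x_0, y_0) and (x_1, y_1)

direct = hypothesis_score(filters, phis, anchors, deform, bias, positions)
beta, psi = beta_and_psi(filters, phis, anchors, deform, bias, positions)
as_dot_product = dot(beta, psi)
```

Both routes give the same value, which is what allows the model to be trained as a linear classifier over the feature vector.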
3.2 Matching
To detect objects in an image we compute an overall
score for each root location according to the best possible
placement of the parts,
$$\text{score}(p_0) = \max_{p_1, \ldots, p_n} \text{score}(p_0, \ldots, p_n). \tag{7}$$
High-scoring root locations define detections while the
locations of the parts that yield a high-scoring root
location define a full object hypothesis.
By defining an overall score for each root location we
can detect multiple instances of an object (we assume
there is at most one instance per root location). This
approach is related to sliding-window detectors because
we can think of $\text{score}(p_0)$ as a score for the detection
window specified by the root filter.
We use dynamic programming and generalized distance
transforms (min-convolutions) [14], [15] to compute
the best locations for the parts as a function of
the root location. The resulting method is very efficient,
taking $O(nk)$ time once filter responses are computed,
where $n$ is the number of parts in the model and $k$ is
the total number of locations in the feature pyramid. We
briefly describe the method here and refer the reader to
[14], [15] for more details.
Let $R_{i,l}(x, y) = F'_i \cdot \phi(H, (x, y, l))$ be an array storing
the response of the $i$-th model filter in the $l$-th level
of the feature pyramid. The matching algorithm starts
by computing these responses. Note that $R_{i,l}$ is a cross-correlation
between $F_i$ and level $l$ of the feature pyramid.
After computing filter responses we transform the responses
of the part filters to allow for spatial uncertainty,
$$D_{i,l}(x, y) = \max_{dx, dy} \left( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \right). \tag{8}$$
This transformation spreads high filter scores to nearby
locations, taking into account the deformation costs. The
value $D_{i,l}(x, y)$ is the maximum contribution of the $i$-th
part to the score of a root location that places the anchor
of this part at position $(x, y)$ in level $l$.
The transformed array, $D_{i,l}$, can be computed in linear
time from the response array, $R_{i,l}$, using the generalized
distance transform algorithm from [14].
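To make the transformation concrete, here is a brute-force Python sketch of the maximization over displacements. It is not the linear-time generalized distance transform of [14]; it simply enumerates displacements within a window, and the `max_disp` limit, the `[y][x]` array layout, and the function name are assumptions of this example.

```python
def transform_responses(R, d, max_disp):
    """Brute-force transformed responses for a 2-D response array R[y][x].

    d = (d1, d2, d3, d4) are the deformation coefficients. Displacements are
    limited to |dx|, |dy| <= max_disp and to positions inside R; both limits
    are simplifications of this sketch.
    """
    height, width = len(R), len(R[0])
    D = [[float("-inf")] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    xx, yy = x + dx, y + dy
                    if 0 <= xx < width and 0 <= yy < height:
                        cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
                        D[y][x] = max(D[y][x], R[yy][xx] - cost)
    return D


# A single strong response in the centre, with squared-distance deformation cost.
R = [[0.0, 0.0, 0.0],
     [0.0, 5.0, 0.0],
     [0.0, 0.0, 0.0]]
D = transform_responses(R, (0.0, 0.0, 1.0, 1.0), 1)
```

The strong central response is spread to its neighbours, discounted by the deformation cost of reaching it, which is the behaviour the text describes.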
The overall root scores at each level can be expressed
by the sum of the root filter response at that level, plus
shifted versions of transformed and subsampled part
responses,
$$\text{score}(x_0, y_0, l_0) = R_{0,l_0}(x_0, y_0) + \sum_{i=1}^{n} D_{i,l_0-\lambda}(2(x_0, y_0) + v_i) + b. \tag{9}$$
Recall that $\lambda$ is the number of levels we need to go down
in the feature pyramid to get to a feature map that was
computed at exactly twice the resolution of another one.
Figure 4 illustrates the matching process.
To understand equation (9) note that for a fixed root
location we can independently pick the best location for
each part because there are no interactions among parts
in the score of a hypothesis. The transformed arrays $D_{i,l}$
give the contribution of the $i$-th part to the overall root
score, as a function of the anchor position for the part. So
we obtain the total score of a root position at level $l$ by
adding up the root filter response and the contributions
from each part, which are precomputed in $D_{i,l-\lambda}$.
In addition to computing $D_{i,l}$ the algorithm from [14]
can also compute optimal displacements for a part as a
function of its anchor position,
$$P_{i,l}(x, y) = \arg\max_{dx, dy} \left( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \right). \tag{10}$$
After finding a root location $(x_0, y_0, l_0)$ with high score
we can find the corresponding part locations by looking
up the optimal displacements in $P_{i,l_0-\lambda}(2(x_0, y_0) + v_i)$.
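The way a root score is assembled from precomputed arrays, and the way part locations are read back, can be sketched in Python as follows. The arrays are assumed to be given for the relevant levels (root response at the root level, transformed part responses and optimal displacements at the part level), indexed as `[y][x]`; the function names and toy values are hypothetical.

```python
def root_score(R0, D_parts, anchors, bias, x0, y0):
    """Overall score of one root location (x0, y0) at a fixed root level.

    R0 is the root response array at the root level; D_parts[k] is the
    transformed response array of part k+1 at the part level.
    """
    score = R0[y0][x0] + bias
    for D, (vx, vy) in zip(D_parts, anchors):
        score += D[2 * y0 + vy][2 * x0 + vx]
    return score


def part_locations(P_parts, anchors, x0, y0):
    """Recover part positions from stored optimal displacements."""
    locations = []
    for P, (vx, vy) in zip(P_parts, anchors):
        ax, ay = 2 * x0 + vx, 2 * y0 + vy      # anchor position of the part
        dx, dy = P[ay][ax]
        locations.append((ax + dx, ay + dy))
    return locations


# Toy arrays: a 2x2 root response and one part on a 4x4 grid.
R0 = [[1.0, 2.0],
      [3.0, 4.0]]
D1 = [[0.0] * 4 for _ in range(4)]
D1[1][2] = 7.0
P1 = [[(0, 0)] * 4 for _ in range(4)]
P1[1][2] = (1, -1)
anchors = [(0, 1)]

total = root_score(R0, [D1], anchors, -1.0, 1, 0)
parts = part_locations([P1], anchors, 1, 0)
```

Because each part contributes through its own array, the cost per root location is proportional to the number of parts.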
3.3 Mixture Models
A mixture model with $m$ components is defined by an
$m$-tuple, $M = (M_1, \ldots, M_m)$, where $M_c$ is the model for
the $c$-th component.
An object hypothesis for a mixture model specifies a
mixture component, $1 \le c \le m$, and a location for each
filter of $M_c$, $z = (c, p_0, \ldots, p_{n_c})$. Here $n_c$ is the number
of parts in $M_c$. The score of this hypothesis is the score
of the hypothesis $z' = (p_0, \ldots, p_{n_c})$ for the $c$-th model
component.

Fig. 4. The matching process at one scale. Responses from the root and part filters are computed at different
resolutions in the feature pyramid. The transformed responses are combined to yield a final score for each root
location. We show the responses and transformed responses for the head and right shoulder parts. Note how the
head filter is more discriminative. The combined scores clearly show two good hypotheses for the object at this scale.
(Panel labels: model; feature map; feature map at twice the resolution; response of root filter; response of part filters;
transformed responses; combined score of root locations; color encoding of filter response values, from low value to high value.)

As in the case of a single component model the score
of a hypothesis for a mixture model can be expressed
by a dot product between a vector of model parameters
$\beta$ and a vector $\psi(H, z)$. For a mixture model the vector
$\beta$ is the concatenation of the model parameter vectors
for each component. The vector $\psi(H, z)$ is sparse, with
non-zero entries defined by $\psi(H, z')$ in a single interval
matching the interval of $\beta_c$ in $\beta$,
$$\beta = (\beta_1, \ldots, \beta_m). \tag{11}$$
$$\psi(H, z) = (0, \ldots, 0, \psi(H, z'), 0, \ldots, 0). \tag{12}$$
With this construction $\beta \cdot \psi(H, z) = \beta_c \cdot \psi(H, z')$.
To detect objects using a mixture model we use the
matching algorithm described above to find root locations
that yield high scoring hypotheses independently
for each component.
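The sparse construction for mixture models can be illustrated with a few lines of Python. This is a sketch with made-up parameter values; components are indexed from 0 here, whereas the text indexes them from 1, and the helper name `mixture_psi` is hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def mixture_psi(component_sizes, c, psi_c):
    """Sparse feature vector: zero outside the block of component c (0-based)."""
    psi = []
    for index, size in enumerate(component_sizes):
        psi += list(psi_c) if index == c else [0.0] * size
    return psi


# Two components with parameter vectors of different lengths.
betas = [[1.0, 2.0], [3.0, -1.0, 0.5]]
beta = [w for b in betas for w in b]          # concatenation of component parameters
psi_c = [2.0, 4.0, 6.0]                       # feature vector for component c = 1
psi = mixture_psi([len(b) for b in betas], 1, psi_c)

full_score = dot(beta, psi)
component_score = dot(betas[1], psi_c)
```

Only the block belonging to the selected component contributes, so the full dot product equals that component's own score.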
4 LATENT SVM
Consider a classifier that scores an example $x$ with a
function of the form,
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z). \tag{13}$$
Here $\beta$ is a vector of model parameters and $z$ are latent
values. The set $Z(x)$ defines the possible latent values
for an example $x$. A binary label for $x$ can be obtained
by thresholding its score.
In analogy to classical SVMs we train $\beta$ from labeled
examples $D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$, where $y_i \in \{-1, 1\}$,
by minimizing the objective function,
$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)), \tag{14}$$
where $\max(0, 1 - y_i f_\beta(x_i))$ is the standard hinge loss
and the constant $C$ controls the relative weight of the
regularization term.
Note that if there is a single possible latent value for
each example ($|Z(x_i)| = 1$) then $f_\beta$ is linear in $\beta$, and we
obtain linear SVMs as a special case of latent SVMs.
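The objective can be evaluated directly when each example comes with an explicit, finite list of candidate feature vectors, one per latent value. The Python sketch below makes that assumption; the data layout and function names are illustrative only.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def lsvm_score(beta, candidates):
    """Classifier score: best dot product over the candidate feature vectors."""
    return max(dot(beta, phi) for phi in candidates)


def lsvm_objective(beta, examples, C):
    """Regularizer plus C times the summed hinge loss.

    Each example is (candidates, y) with y in {-1, +1}.
    """
    regularizer = 0.5 * dot(beta, beta)
    loss = sum(max(0.0, 1.0 - y * lsvm_score(beta, cands)) for cands, y in examples)
    return regularizer + C * loss


beta = [1.0, -1.0]
examples = [
    ([[2.0, 0.0], [0.0, 1.0]], 1),     # positive: best score 2, zero loss
    ([[0.5, 0.0], [0.0, 2.0]], -1),    # negative: best score 0.5, loss 1.5
]
objective = lsvm_objective(beta, examples, 2.0)
```

With a single candidate per example the inner maximum disappears and the function reduces to the usual linear SVM objective.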
4.1 Semi-convexity
A latent SVM leads to a non-convex optimization problem.
However, a latent SVM is semi-convex in the sense
described below, and the training problem becomes convex
once latent information is specified for the positive
training examples.
Recall that the maximum of a set of convex functions
is convex. In a linear SVM we have that $f_\beta(x) = \beta \cdot \Phi(x)$
is linear in $\beta$. In this case the hinge loss is convex for
each example because it is always the maximum of two
convex functions.
Note that $f_\beta(x)$ as defined in (13) is a maximum of functions
that are each linear in $\beta$. Hence $f_\beta(x)$ is
convex in $\beta$ and thus the hinge loss, $\max(0, 1 - y_i f_\beta(x_i))$,
is convex in $\beta$ when $y_i = -1$. That is, the loss function is
convex in $\beta$ for negative examples. We call this property
of the loss function semi-convexity.
In a general latent SVM the hinge loss is not convex for
a positive example because it is the maximum of a convex
function (zero) and a concave function ($1 - y_i f_\beta(x_i)$).
Now consider a latent SVM where there is a single
possible latent value for each positive example. In this
case $f_\beta(x_i)$ is linear for a positive example and the loss
due to each positive is convex. Combined with the semi-convexity
property, (14) becomes convex.
4.2 Optimization
Let $Z_p$ specify a latent value for each positive example
in a training set $D$. We can define an auxiliary objective
function $L_D(\beta, Z_p) = L_{D(Z_p)}(\beta)$, where $D(Z_p)$ is derived
from $D$ by restricting the latent values for the positive
examples according to $Z_p$. That is, for a positive example
we set $Z(x_i) = \{z_i\}$ where $z_i$ is the latent value specified
for $x_i$ by $Z_p$. Note that
$$L_D(\beta) = \min_{Z_p} L_D(\beta, Z_p). \tag{15}$$
In particular $L_D(\beta) \le L_D(\beta, Z_p)$. The auxiliary objective
function bounds the LSVM objective. This justifies training
a latent SVM by minimizing $L_D(\beta, Z_p)$.
In practice we minimize $L_D(\beta, Z_p)$ using a coordinate
descent approach:
1) Relabel positive examples: Optimize $L_D(\beta, Z_p)$ over
$Z_p$ by selecting the highest scoring latent value for
each positive example,
$z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
2) Optimize beta: Optimize $L_D(\beta, Z_p)$ over $\beta$ by solving
the convex optimization problem defined by
$L_{D(Z_p)}(\beta)$.
Both steps always improve or maintain the value of
$L_D(\beta, Z_p)$. After convergence we have a relatively strong
local optimum in the sense that step 1 searches over
an exponentially-large space of latent values for positive
examples while step 2 searches over all possible models,
implicitly considering the exponentially-large space of
latent values for all negative examples.
We note, however, that careful initialization of $\beta$ may
be necessary because otherwise we may select unreasonable
latent values for the positive examples in step 1, and
this could lead to a bad model.
The semi-convexity property is important because it
leads to a convex optimization problem in step 2, even
though the latent values for the negative examples are
not fixed. A similar procedure that fixes latent values
for all examples in each round would likely fail to yield
good results. Suppose we let $Z$ specify latent values for
all examples in $D$. Since $L_D(\beta)$ effectively maximizes
over negative latent values, $L_D(\beta)$ could be much larger
than $L_D(\beta, Z)$, and we should not expect that minimizing
$L_D(\beta, Z)$ would lead to a good model.
4.3 Stochastic gradient descent
Step 2 (Optimize Beta) of the coordinate descent method
can be solved via quadratic programming [3]. It can
also be solved via stochastic gradient descent. Here we
describe a gradient descent approach for optimizing $\beta$
over an arbitrary training set $D$. In practice we use a
modified version of this procedure that works with a
cache of feature vectors for $D(Z_p)$ (see Section 4.5).
Let $z_i(\beta) = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
Then $f_\beta(x_i) = \beta \cdot \Phi(x_i, z_i(\beta))$.
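We can compute a sub-gradient of the LSVM objective
function as follows,
$$\nabla L_D(\beta) = \beta + C \sum_{i=1}^{n} h(\beta, x_i, y_i) \tag{16}$$
$$h(\beta, x_i, y_i) = \begin{cases} 0 & \text{if } y_i f_\beta(x_i) \ge 1 \\ -y_i \Phi(x_i, z_i(\beta)) & \text{otherwise} \end{cases} \tag{17}$$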
In stochastic gradient descent we approximate $\nabla L_D$
using a subset of the examples and take a step in
its negative direction. Using a single example, $\langle x_i, y_i \rangle$,
we approximate $\sum_{i=1}^{n} h(\beta, x_i, y_i)$ with $n h(\beta, x_i, y_i)$. The
resulting algorithm repeatedly updates $\beta$ as follows:
1) Let $\alpha_t$ be the learning rate for iteration $t$.
2) Let $i$ be a random example.
3) Let $z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
4) If $y_i f_\beta(x_i) = y_i(\beta \cdot \Phi(x_i, z_i)) \ge 1$ set $\beta := \beta - \alpha_t \beta$.
5) Else set $\beta := \beta - \alpha_t(\beta - C n y_i \Phi(x_i, z_i))$.
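Steps 3 to 5 of the update can be written as a single function. The Python sketch below assumes each example carries an explicit list of candidate feature vectors, one per latent value; the choice of example (step 2) and the learning rate schedule (step 1) are left to the caller, and all names and values are illustrative.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgd_step(beta, candidates, y, C, n, alpha):
    """One stochastic sub-gradient update for a single example (candidates, y)."""
    best = max(candidates, key=lambda phi: dot(beta, phi))   # step 3
    if y * dot(beta, best) >= 1:                             # step 4: only shrink beta
        return [b - alpha * b for b in beta]
    # Step 5: shrink beta and move it along y times the best feature vector.
    return [b - alpha * (b - C * n * y * p) for b, p in zip(beta, best)]


beta = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 3.0]]

# Example outside the margin (y = +1, score 2): beta is only shrunk.
shrunk = sgd_step(beta, candidates, 1, 0.5, 2, 0.1)

# Example inside the margin (y = -1, score 2): beta also moves away from
# the highest scoring feature vector.
moved = sgd_step(beta, candidates, -1, 0.5, 2, 0.1)
```

The two branches correspond to the two cases of the sub-gradient: zero hinge-loss contribution, or a contribution proportional to the selected feature vector.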
As in gradient descent methods for linear SVMs we
obtain a procedure that is quite similar to the perceptron
algorithm. If $f_\beta$ [...]
[...] $\beta^*(D) = \arg\min_\beta L_D(\beta)$.
Since $L_D$ is strictly convex $\beta^*(D)$ is unique.
Given a large training set $D$ we would like to find a
small set of examples $C \subseteq D$ such that $\beta^*(C) = \beta^*(D)$.
Our method starts with an initial cache of examples
and alternates between training a model and updating
the cache. In each iteration we remove easy examples
from the cache and add new hard examples. A special
case involves keeping all positive examples in the cache
and data-mining over negatives.
Let $C_1 \subseteq D$ be an initial cache of examples. The
algorithm repeatedly trains a model and updates the
cache as follows:
1) Let $\beta_t := \beta^*(C_t)$ (train a model using $C_t$).
2) If $H(\beta_t, D) \subseteq C_t$ stop and return $\beta_t$.
3) Let $C'_t := C_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, C_t)$
(shrink the cache).
4) Let $C_{t+1} := C'_t \cup X$ for any $X$ such that $X \subseteq D$ and
$X \cap H(\beta_t, D) \setminus C_t \ne \emptyset$ (grow the cache).
In step 3 we shrink the cache by removing examples
from $C_t$ that are outside the margin defined by $\beta_t$. In
step 4 we grow the cache by adding examples from
$D$, including at least one new example that is inside
the margin defined by $\beta_t$. Such an example must exist,
otherwise we would have returned in step 2.
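The following theorem shows that when we stop we
have found $\beta^*(D)$.
Theorem 1: Let $C \subseteq D$ and $\beta = \beta^*(C)$. If $H(\beta, D) \subseteq C$
then $\beta = \beta^*(D)$.
Proof: $C \subseteq D$ implies $L_D(\beta^*(D)) \ge L_C(\beta^*(C)) = L_C(\beta)$.
Since $H(\beta, D) \subseteq C$ all examples in $D \setminus C$ have
zero loss on $\beta$. This implies $L_C(\beta) = L_D(\beta)$. We conclude
$L_D(\beta^*(D)) \ge L_D(\beta)$, and because $L_D$ has a unique
minimum $\beta = \beta^*(D)$.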
The next result shows the algorithm will stop after a
finite number of iterations. Intuitively this follows from
the fact that $L_{C_t}(\beta^*(C_t))$ grows in each iteration, but it
is bounded by $L_D(\beta^*(D))$.
Theorem 2: The data-mining algorithm terminates.
Proof: When we shrink the cache $C'_t$ contains all
examples from $C_t$ with non-zero loss in a ball around
$\beta_t$. This implies $L_{C'_t}$ is identical to $L_{C_t}$ in a ball around
$\beta_t$, and since $\beta_t$ is a minimum of $L_{C_t}$ it also must be a
minimum of $L_{C'_t}$. Thus $L_{C'_t}(\beta^*(C'_t)) = L_{C_t}(\beta^*(C_t))$.
When we grow the cache $C_{t+1} \setminus C'_t$ contains at least one
example $\langle x, y \rangle$ with non-zero loss at $\beta_t$. Since $C'_t \subseteq C_{t+1}$
we have $L_{C_{t+1}}(\beta) \ge L_{C'_t}(\beta)$ for all $\beta$. If $\beta^*(C_{t+1}) \ne \beta^*(C'_t)$
then $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ because $L_{C'_t}$
has a unique minimum. If $\beta^*(C_{t+1}) = \beta^*(C'_t)$ then
$L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C'_t}(\beta^*(C'_t))$ due to $\langle x, y \rangle$.
We conclude $L_{C_{t+1}}(\beta^*(C_{t+1})) > L_{C_t}(\beta^*(C_t))$. Since
there are finitely many caches the loss in the cache can
only grow a finite number of times.
4.5 Data-mining hard examples, LSVM version
Now we describe a data-mining algorithm for training a
latent SVM when the latent values for the positive examples
are fixed. That is, we are optimizing $L_{D(Z_p)}(\beta)$, and not
$L_D(\beta)$. As discussed above this restriction ensures the
optimization problem is convex.
For a latent SVM instead of keeping a cache of examples
$x$, we keep a cache of $(x, z)$ pairs where $z \in Z(x)$.
This makes it possible to avoid doing inference over all
of $Z(x)$ in the inner loop of an optimization algorithm
such as gradient descent. Moreover, in practice we can
keep a cache of feature vectors, $\Phi(x, z)$, instead of $(x, z)$
pairs. This representation is simpler (it is application
independent) and can be much more compact.
A feature vector cache $F$ is a set of pairs $(i, v)$ where
$1 \le i \le n$ is the index of an example and $v = \Phi(x_i, z)$ for
some $z \in Z(x_i)$. Note that we may have several pairs
$(i, v) \in F$ for each example $x_i$. If the training set has
fixed labels for positive examples this may still be true
for the negative examples.
Let $I(F)$ be the examples indexed by $F$. The feature
vectors in $F$ define an objective function for $\beta$, where we
only consider examples indexed by $I(F)$, and for each
example we only consider feature vectors in the cache,
$$L_F(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i \in I(F)} \max(0, 1 - y_i (\max_{(i,v) \in F} \beta \cdot v)). \tag{20}$$
We can optimize $L_F$ via gradient descent by modifying
the method in Section 4.3. Let $V(i)$ be the set of
feature vectors $v$ such that $(i, v) \in F$. Then each gradient
descent iteration simplifies to:
1) Let $\alpha_t$ be the learning rate for iteration $t$.
2) Let $i \in I(F)$ be a random example indexed by $F$.
3) Let $v_i = \arg\max_{v \in V(i)} \beta \cdot v$.
4) If $y_i(\beta \cdot v_i) \ge 1$ set $\beta = \beta - \alpha_t \beta$.
5) Else set $\beta = \beta - \alpha_t(\beta - C n y_i v_i)$.
Now the size of $I(F)$ controls the number of iterations
necessary for convergence, while the size of $V(i)$ controls
the time it takes to execute step 3. In step 5 $n = |I(F)|$.
Let $\beta^*(F) = \arg\min_\beta L_F(\beta)$.
We would like to find a small cache for $D(Z_p)$ with
$\beta^*(F) = \beta^*(D(Z_p))$.
We define the hard feature vectors of a training set $D$
relative to $\beta$ as,
$$H(\beta, D) = \{(i, \Phi(x_i, z_i)) \mid z_i = \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z) \text{ and } y_i(\beta \cdot \Phi(x_i, z_i)) < 1\}. \tag{21}$$
That is, $H(\beta, D)$ are pairs $(i, v)$ where $v$ is the highest
scoring feature vector from an example $x_i$ that is inside
the margin of the classifier defined by $\beta$.
We define the easy feature vectors in a cache $F$ as
$$E(\beta, F) = \{(i, v) \in F \mid y_i(\beta \cdot v) > 1\}. \tag{22}$$
These are the feature vectors from $F$ that are outside the
margin defined by $\beta$.
Note that if $y_i(\beta \cdot v) \le 1$ then $(i, v)$ is not considered
easy even if there is another feature vector for the $i$-th
example in the cache with higher score than $v$ under $\beta$.
Now we describe the data-mining algorithm for computing
$\beta^*(D(Z_p))$.
The algorithm works with a cache of feature vectors
for $D(Z_p)$. It alternates between training a model and
updating the cache.
Let $F_1$ be an initial cache of feature vectors. Now
consider the following iterative algorithm:
1) Let $\beta_t := \beta^*(F_t)$ (train a model).
2) If $H(\beta_t, D(Z_p)) \subseteq F_t$ stop and return $\beta_t$.
3) Let $F'_t := F_t \setminus X$ for any $X$ such that $X \subseteq E(\beta_t, F_t)$
(shrink the cache).
4) Let $F_{t+1} := F'_t \cup X$ for any $X$ such that
$X \cap H(\beta_t, D(Z_p)) \setminus F_t \ne \emptyset$ (grow the cache).
Step 3 shrinks the cache by removing easy feature
vectors. Step 4 grows the cache by adding new feature
vectors, including at least one from $H(\beta_t, D(Z_p))$. Note
that over time we will accumulate multiple feature vectors
from the same negative example in the cache.
We can show this algorithm will eventually stop and
return $\beta^*(D(Z_p))$. This follows from arguments analogous
to the ones used in Section 4.4.
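To make definitions (21) and (22) and one round of the cache update concrete, here is a minimal Python sketch. It assumes a toy representation that is not part of the paper: the training set is a list of `(y_i, candidates_i)` pairs, where `candidates_i` enumerates the feature vectors $\Phi(x_i, z)$ for all $z \in Z(x_i)$ as tuples, and the cache is a set of `(i, v)` pairs. The function names are hypothetical. Steps 3 and 4 allow any admissible $X$; the sketch removes all easy vectors and adds all hard ones.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def hard_feature_vectors(beta, D):
    """H(beta, D) from eq. (21): best-scoring vector of each example, if inside the margin."""
    H = set()
    for i, (y, candidates) in enumerate(D):
        v = max(candidates, key=lambda c: dot(beta, c))
        if y * dot(beta, v) < 1:
            H.add((i, tuple(v)))
    return H

def easy_feature_vectors(beta, F, D):
    """E(beta, F) from eq. (22): cached vectors outside the margin."""
    return {(i, v) for (i, v) in F if D[i][0] * dot(beta, v) > 1}

def update_cache(beta, F, D):
    """Steps 2-4 for a model beta trained on cache F. Returns (new_cache, converged)."""
    H = hard_feature_vectors(beta, D)
    if H <= F:                                      # step 2: nothing hard is missing
        return F, True
    shrunk = F - easy_feature_vectors(beta, F, D)   # step 3: drop easy vectors
    return shrunk | H, False                        # step 4: add hard vectors
```

In a full loop, `update_cache` would alternate with a trainer that computes $\beta^*(F_t)$; the trainer is left out here because any exact solver for $L_F$ can play that role.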
5 TRAINING MODELS
Now we consider the problem of training models from
images labeled with bounding boxes around objects of
interest. This is the type of data available in the PASCAL
datasets. Each dataset contains thousands of images and
each image has annotations specifying a bounding box
and a class label for each target object present in the
image. Note that this is a weakly labeled setting since
the bounding boxes do not specify component labels or
part locations.
We describe a procedure for initializing the structure
of a mixture model and learning all parameters. Pa-
rameter learning is done by constructing a latent SVM
training problem. We train the latent SVM using the
coordinate descent approach described in Section 4.2
together with the data-mining and gradient descent
algorithms that work with a cache of feature vectors
from Section 4.5. Since the coordinate descent method is
susceptible to local minima we must take care to ensure
a good initialization of the model.
5.1 Learning parameters
Let c be an object class. We assume the training examples
for c are given by positive bounding boxes P and a set
of background images N. P is a set of pairs (I, B) where
I is an image and B is a bounding box for an object of
class c in I.
Let $M$ be a (mixture) model with fixed structure. Recall
that the parameters for a model are defined by a vector
$\beta$. To learn $\beta$ we define a latent SVM training problem
with an implicitly defined training set $D$, with positive
examples from $P$, and negative examples from $N$.
Each example $\langle x, y \rangle \in D$ has an associated image and
feature pyramid $H(x)$. Latent values $z \in Z(x)$ specify an
instantiation of $M$ in the feature pyramid $H(x)$.
Now define $\Phi(x, z) = \psi(H(x), z)$. Then $\beta \cdot \Phi(x, z)$ is
exactly the score of the hypothesis $z$ for $M$ on $H(x)$.
A positive bounding box $(I, B) \in P$ specifies that the
object detector should fire in a location defined by $B$.
This means the overall score (7) of a root location defined
by $B$ should be high.
For each $(I, B) \in P$ we define a positive example $x$
for the LSVM training problem. We define $Z(x)$ so the
detection window of a root filter specified by a hypothesis
$z \in Z(x)$ overlaps with $B$ by at least 50%. There
are usually many root locations, including at different
scales, that define detection windows with 50% overlap.
We have found that treating the root location as a latent
variable is helpful to compensate for noisy bounding box
labels in $P$. A similar idea was used in [40].
Now consider a background image $I \in N$. We do not
want the object detector to fire in any location of the
feature pyramid for $I$. This means the overall score (7) of
every root location should be low. Let $\mathcal{G}$ be a dense set of
locations in the feature pyramid. We define a different
negative example $x$ for each location $(i, j, l) \in \mathcal{G}$. We
define $Z(x)$ so the level of the root filter specified by
$z \in Z(x)$ is $l$, and the center of its detection window is
$(i, j)$. Note that there is a very large number of negative
examples obtained from each image. This is consistent
with the requirement that a scanning window classifier
should have a low false positive rate.
The procedure Train is outlined below. The outermost
loop implements a fixed number of iterations of
coordinate descent on $L_D(\beta, Z_p)$. Lines 3-6 implement
the Relabel positives step. The resulting feature vectors,
one per positive example, are stored in $F_p$. Lines 7-14
implement the Optimize beta step. Since the number of
negative examples implicitly defined by $N$ is very large
we use the LSVM data-mining algorithm. We iterate
data-mining a fixed number of times rather than until
convergence for practical reasons. At each iteration we
collect hard negative examples in $F_n$, train a new model
using gradient descent, and then shrink $F_n$ by removing
easy feature vectors. During data-mining we grow the
cache by iterating over the images in $N$ sequentially,
until we reach a memory limit.
Data:
   Positive examples P = {(I_1, B_1), ..., (I_n, B_n)}
   Negative images N = {J_1, ..., J_m}
   Initial model β
Result: New model β
 1  F_n := ∅
 2  for relabel := 1 to num-relabel do
 3      F_p := ∅
 4      for i := 1 to n do
 5          Add detect-best(β, I_i, B_i) to F_p
 6      end
 7      for datamine := 1 to num-datamine do
 8          for j := 1 to m do
 9              if |F_n| ≥ memory-limit then break
10              Add detect-all(β, J_j, −(1 + δ)) to F_n
11          end
12          β := gradient-descent(F_p ∪ F_n)
13          Remove (i, v) with β · v < −(1 + δ) from F_n
14      end
15  end
Procedure Train
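The control flow of Procedure Train can also be written as a short Python skeleton. This is only a sketch of the loop structure: `detect_best`, `detect_all`, `gradient_descent` and `score` are caller-supplied stand-ins for the functions described in the text, the cache stores bare feature vectors rather than $(i, v)$ pairs, and the default values for `delta`, `memory_limit` and the iteration counts are placeholders, not values taken from this section.

```python
def train(beta, positives, negatives, detect_best, detect_all,
          gradient_descent, score,
          num_relabel=1, num_datamine=1, memory_limit=1000, delta=0.002):
    """Skeleton of Procedure Train.

    positives : list of (image, box) pairs, the set P
    negatives : list of background images, the set N
    Returns the final model and the final negative cache.
    """
    F_n = []                                          # line 1
    for _ in range(num_relabel):                      # line 2
        # Relabel positives (lines 3-6): one feature vector per positive example.
        F_p = [detect_best(beta, I, B) for (I, B) in positives]
        # Optimize beta (lines 7-14) with data-mining of hard negatives.
        for _ in range(num_datamine):
            for J in negatives:                       # lines 8-11
                if len(F_n) >= memory_limit:
                    break
                F_n.extend(detect_all(beta, J, -(1 + delta)))
            beta = gradient_descent(F_p + F_n)        # line 12
            # Line 13: drop easy negatives, keep those still near or inside the margin.
            F_n = [v for v in F_n if score(beta, v) >= -(1 + delta)]
    return beta, F_n
```

Passing the detector and optimizer in as arguments keeps the skeleton independent of any particular feature pyramid or model representation.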
The function detect-best($\beta$, $I$, $B$) finds the highest
scoring object hypothesis with a root filter that significantly
overlaps $B$ in $I$. The function detect-all($\beta$, $I$, $t$)
computes the best object hypothesis for each root location
and selects the ones that score above $t$. Both of
these functions can be implemented using the matching
procedure in Section 3.2.
The function gradient-descent($F$) trains $\beta$ using
feature vectors in the cache as described in Section 4.5.
In practice we modified the algorithm to constrain the
coefficients of the quadratic terms in the deformation
models to be above 0.01. This ensures the deformation
costs are convex, and not too flat. We also constrain
the model to be symmetric along the vertical axis. Filters
that are positioned along the center vertical axis of the
model are constrained to be self-symmetric. Part filters
that are off-center have a symmetric part on the other
side of the model. This effectively halves the number
of parameters to be learned.
5.2 Initialization
The LSVM coordinate descent algorithm is susceptible to
local minima and thus sensitive to initialization. This is
a common limitation of other methods that use latent
information as well. We initialize and train mixture
models in three phases as follows.
Phase 1. Initializing Root Filters: For training a
mixture model with $m$ components we sort the bounding
boxes in $P$ by their aspect ratio and split them into $m$
groups of equal size $P_1, \ldots, P_m$. Aspect ratio is used as a
simple indicator of extreme intraclass variation. We train
$m$ different root filters $F_1, \ldots, F_m$, one for each group of
positive bounding boxes.
To define the dimensions of $F_i$ we select the mean
aspect ratio of the boxes in $P_i$ and the largest area not
larger than 80% of the boxes. This ensures that for most
pairs $(I, B) \in P_i$ we can place $F_i$ in the feature pyramid
of $I$ so it significantly overlaps with $B$.
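The grouping and the choice of filter dimensions can be sketched as follows. The sketch makes several assumptions that the text does not fix: boxes are `(x1, y1, x2, y2)` tuples, aspect ratio is taken as height over width, any remainder after an even split goes to the last group, and "the largest area not larger than 80% of the boxes" is read as the 20th percentile of the box areas. The function names are hypothetical.

```python
def split_by_aspect_ratio(boxes, m):
    """Sort boxes by aspect ratio (height / width) and split them into m groups."""
    ordered = sorted(boxes, key=lambda b: (b[3] - b[1]) / (b[2] - b[0]))
    size = len(ordered) // m
    groups = [ordered[k * size:(k + 1) * size] for k in range(m - 1)]
    groups.append(ordered[(m - 1) * size:])   # last group takes any remainder
    return groups

def root_filter_shape(group):
    """Return (mean aspect ratio, area) used to size the root filter of one group."""
    aspects = [(y2 - y1) / (x2 - x1) for x1, y1, x2, y2 in group]
    areas = sorted((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in group)
    mean_aspect = sum(aspects) / len(aspects)
    # 20th percentile: at least 80% of the boxes have an area this large or larger.
    area = areas[len(areas) // 5]
    return mean_aspect, area
```

Choosing a fairly small area means the filter fits inside most of the group's boxes at some pyramid level, which is what lets it overlap them significantly.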
We train $F_i$ using a standard SVM, with no latent
information, as in [10]. For $(I, B) \in P_i$ we warp the
image region under $B$ so its feature map has the same
dimensions as $F_i$. This leads to a positive example. We
select random subwindows of appropriate dimension
from images in $N$ to define negative examples. Figures
5(a) and 5(b) show the result of this phase when
training a two component car model.
Phase 2. Merging Components: We combine the
initial root filters into a mixture model with no parts
and retrain the parameters of the combined model using
Train on the full (unsplit and without warping)
data sets $P$ and $N$. In this case the component label
and root location are the only latent variables for each
example. The coordinate descent training algorithm can
be thought of as a discriminative clustering method that
alternates between assigning cluster (mixture) labels for
each positive example and estimating cluster "means"
(root filters).
Phase 3. Initializing Part Filters: We initialize the
parts of each component using a simple heuristic. We
fix the number of parts at six per component, and using
a small pool of rectangular part shapes we greedily place
parts to cover high-energy regions of the root filter
(footnote 2). A part is either anchored along the central
vertical axis of the root filter, or it is off-center and has a
symmetric part on the other side of the root filter. Once a
part is placed, the energy of the covered portion of the root
filter is set to zero, and we look for the next highest-energy
region, until six parts are chosen.
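The greedy placement loop can be illustrated with a simplified Python sketch. It deliberately leaves out two things the text describes: the pool of several part shapes (a single `part_h` by `part_w` shape is used) and the symmetry constraint. The input `energy` is assumed to be a 2D grid of non-negative per-cell energies already computed from the root filter; the function name is hypothetical.

```python
def place_parts(energy, part_h, part_w, num_parts):
    """Greedily place rectangular parts on the highest-energy regions.

    energy : 2D list of non-negative per-cell energies of the root filter
    Returns a list of (row, col) top-left corners, in placement order.
    """
    energy = [row[:] for row in energy]          # work on a copy
    rows, cols = len(energy), len(energy[0])
    placements = []
    for _ in range(num_parts):
        best, best_pos = -1.0, None
        for r in range(rows - part_h + 1):
            for c in range(cols - part_w + 1):
                total = sum(energy[r + dr][c + dc]
                            for dr in range(part_h) for dc in range(part_w))
                if total > best:
                    best, best_pos = total, (r, c)
        placements.append(best_pos)
        r, c = best_pos
        for dr in range(part_h):                 # zero out the covered region
            for dc in range(part_w):
                energy[r + dr][c + dc] = 0.0
    return placements
```

Zeroing the covered cells is what pushes later parts toward regions of the root filter that are not yet explained by an earlier part.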
The part filters are initialized by interpolating the root
filter to twice the spatial resolution. The deformation
parameters for each part are initialized to $d_i = (0, 0, .1, .1)$.
This pushes part locations to be fairly close to their
anchor position. Figure 5(c) shows the results of this
phase when training a two component car model. The
resulting model serves as the initial model for the last
round of parameter learning. The final car model is
shown in Figure 9.
6 FEATURES
Here we describe the 36-dimensional histogram of oriented
gradients (HOG) features from [10] and introduce
an alternative 13-dimensional feature set that captures
essentially the same information (footnote 3). We have found that
augmenting this low-dimensional feature set to include
both contrast sensitive and contrast insensitive features,
leading to a 31-dimensional feature vector, improves
performance for most classes of the PASCAL datasets.
2. The energy of a region is defined by the norm of the positive
weights in a subwindow.
3. There are some small differences between the 36-dimensional
features defined here and the ones in [10], but we have found that
these differences did not have any significant effect on the performance
of our system.
6.1 HOG Features
6.1.1 Pixel-Level Feature Maps
Let $\theta(x, y)$ and $r(x, y)$ be the orientation and magnitude
of the intensity gradient at a pixel $(x, y)$ in an image.
As in [10], we compute gradients using finite difference
filters, $[-1, 0, +1]$ and its transpose. For color images we
use the color channel with the largest gradient magnitude
to define $\theta$ and $r$ at each pixel.
The gradient orientation at each pixel is discretized
into one of $p$ values using either a contrast sensitive ($B_1$),
or insensitive ($B_2$), definition,
$$B_1(x, y) = \mathrm{round}\left(\frac{p\,\theta(x, y)}{2\pi}\right) \bmod p \tag{23}$$
$$B_2(x, y) = \mathrm{round}\left(\frac{p\,\theta(x, y)}{\pi}\right) \bmod p \tag{24}$$
Below we use $B$ to denote either $B_1$ or $B_2$.
We define a pixel-level feature map that specifies a
sparse histogram of gradient magnitudes at each pixel.
Let $b \in \{0, \ldots, p - 1\}$ range over orientation bins. The
feature vector at $(x, y)$ is
$$F(x, y)_b = \begin{cases} r(x, y) & \text{if } b = B(x, y) \\ 0 & \text{otherwise} \end{cases} \tag{25}$$
We can think of $F$ as an oriented edge map with $p$
orientation channels. For each pixel we select a channel
by discretizing the gradient orientation. The gradient
magnitude can be seen as a measure of edge strength.
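Equations (23) to (25) translate almost directly into code. The following Python sketch handles a single-channel image stored as a 2D list; it is illustrative only. Leaving the one-pixel border at zero is an assumption made here for brevity, and the function name `pixel_feature_map` is hypothetical.

```python
import math

def pixel_feature_map(image, p=9, contrast_sensitive=False):
    """Return F[y][x], a length-p list with r(x, y) in bin B(x, y) and 0 elsewhere."""
    h, w = len(image), len(image[0])
    F = [[[0.0] * p for _ in range(w)] for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = image[y][x + 1] - image[y][x - 1]     # filter [-1, 0, +1]
            dy = image[y + 1][x] - image[y - 1][x]     # its transpose
            r = math.hypot(dx, dy)                     # gradient magnitude
            theta = math.atan2(dy, dx) % (2 * math.pi) # orientation in [0, 2*pi)
            if contrast_sensitive:
                b = round(p * theta / (2 * math.pi)) % p   # B_1, eq. (23)
            else:
                b = round(p * theta / math.pi) % p         # B_2, eq. (24)
            F[y][x][b] = r                                 # eq. (25)
    return F
```

With the contrast insensitive definition, a dark-to-light edge and a light-to-dark edge of the same direction fall in the same bin; with the contrast sensitive definition they fall in bins that are $p/2$ apart.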
6.1.2 Spatial Aggregation
Let $F$ be a pixel-level feature map for a $w \times h$ image.
Let $k > 0$ be a parameter specifying the side length
of a square image region. We define a dense grid of
rectangular "cells" and aggregate pixel-level features to
obtain a cell-based feature map $C$, with feature vectors
$C(i, j)$ for $0 \le i \le \lfloor (w - 1)/k \rfloor$ and $0 \le j \le \lfloor (h - 1)/k \rfloor$.
This aggregation provides some invariance to small
deformations and reduces the size of a feature map.
The simplest approach for aggregating features is to
map each pixel $(x, y)$ into a cell $(\lfloor x/k \rfloor, \lfloor y/k \rfloor)$ and define
the feature vector at a cell to be the sum (or average) of
the pixel-level features in that cell.
Rather than mapping each pixel to a unique cell we
follow [10] and use a "soft binning" approach where
each pixel contributes to the feature vectors in the four
cells around it using bilinear interpolation.
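For reference, the simplest aggregation scheme (each pixel mapped to exactly one cell, features summed) looks as follows in Python. Note that this is the hard-binning baseline described above, not the soft binning with bilinear interpolation that is actually used. The array layout (`F[y][x]` and `C[row][col]`, so the cell called $C(i, j)$ in the text is `C[j][i]`) and the function name are assumptions of this sketch.

```python
def aggregate_cells(F, k=8):
    """Sum pixel-level feature vectors F[y][x] over k-by-k cells (hard binning)."""
    h, w, p = len(F), len(F[0]), len(F[0][0])
    cells_y = (h - 1) // k + 1          # floor((h - 1) / k) + 1 rows of cells
    cells_x = (w - 1) // k + 1          # floor((w - 1) / k) + 1 columns of cells
    C = [[[0.0] * p for _ in range(cells_x)] for _ in range(cells_y)]
    for y in range(h):
        for x in range(w):
            cell = C[y // k][x // k]    # pixel (x, y) maps to cell (x // k, y // k)
            for b in range(p):
                cell[b] += F[y][x][b]
    return C
```

Soft binning replaces the single `cell` lookup with a weighted update of the four nearest cells, which avoids abrupt changes when an edge crosses a cell boundary.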
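Fig. 5. (a) and (b) are the initial root filters for a car model (the result of Phase 1 of the initialization process). (c) is the
initial part-based model for a car (the result of Phase 3 of the initialization process).
6.1.3 Normalization and Truncation
Gradients are invariant to changes in bias. Invariance
to gain can be achieved via normalization. Dalal and
Triggs [10] used four different normalization factors for
the feature vector $C(i, j)$. We can write these factors as
$N_{\delta,\gamma}(i, j)$ with $\delta, \gamma \in \{-1, 1\}$,
$$N_{\delta,\gamma}(i, j) = \bigl(\|C(i, j)\|^2 + \|C(i + \delta, j)\|^2 + \|C(i, j + \gamma)\|^2 + \|C(i + \delta, j + \gamma)\|^2\bigr)^{1/2}. \tag{26}$$
Each factor measures the "gradient energy" in a square
block of four cells containing $(i, j)$.
Let $T_\alpha(v)$ denote the component-wise truncation of a
vector $v$ by $\alpha$ (each entry of $T_\alpha(v)$ is the minimum of the
corresponding entry of $v$ and $\alpha$). The HOG feature map is
obtained by concatenating the results of normalizing $C(i, j)$
by each factor and then truncating,
$$H(i, j) = \begin{pmatrix} T_\alpha(C(i, j)/N_{-1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,-1}(i, j)) \\ T_\alpha(C(i, j)/N_{+1,+1}(i, j)) \\ T_\alpha(C(i, j)/N_{-1,+1}(i, j)) \end{pmatrix} \tag{27}$$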
Commonly used HOG features are defined using $p = 9$
contrast insensitive gradient orientations (discretized
with $B_2$), a cell size of $k = 8$ and truncation $\alpha = 0.2$.
This leads to a 36-dimensional feature vector. We used
these parameters in the analysis described below.
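The normalization and truncation of a single cell can be sketched in Python as below. The sketch assumes an interior cell (all four neighboring blocks exist), indexes the cell map as `C[i][j]`, and adds a small `eps` to the denominator to avoid division by zero; the `eps` term and the function name are assumptions of this sketch rather than part of the definition.

```python
import math

def hog_feature(C, i, j, alpha=0.2, eps=1e-12):
    """Return the 4p-dimensional normalized, truncated feature of interior cell (i, j)."""
    def sq_norm(a, b):
        return sum(v * v for v in C[a][b])

    feature = []
    # Same order of (delta, gamma) as the four blocks listed in the text.
    for delta, gamma in [(-1, -1), (1, -1), (1, 1), (-1, 1)]:
        N = math.sqrt(sq_norm(i, j) + sq_norm(i + delta, j)
                      + sq_norm(i, j + gamma) + sq_norm(i + delta, j + gamma))
        # Normalize by the block energy, then truncate each entry at alpha.
        feature.extend(min(v / (N + eps), alpha) for v in C[i][j])
    return feature
```

With $p = 9$ orientations this yields the 36-dimensional vector described above: four copies of the 9-bin histogram, each normalized by a different block.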
6.2 PCA and Analytic Dimensionality Reduction
We collected a large number of 36-dimensional HOG
features from different resolutions of a large number
of images and performed PCA on these vectors. The
principal components are shown in Figure 6. The results
lead to a number of interesting discoveries.
The eigenvalues indicate that the linear subspace
spanned by the top 11 eigenvectors captures essentially
all the information in a HOG feature. In fact we obtain
the same detection performance in all categories of the
PASCAL 2007 dataset using the original 36-dimensional
features or 11-dimensional features defined by projection
to the top eigenvectors. Using lower dimensional
features leads to models with fewer parameters and
speeds up the detection and learning algorithms. We
note however that some of the gain is lost because we
need to perform a relatively costly projection step when
computing feature pyramids.
Recall that a 36-dimensional HOG feature is defined
using 4 different normalizations of a 9-dimensional
histogram over orientations. Thus a 36-dimensional HOG
feature is naturally viewed as a $4 \times 9$ matrix. The top
eigenvectors in Figure 6 have a very special structure:
they are each (approximately) constant along each row
or column of their matrix representation. Thus the top
eigenvectors lie (approximately) in a linear subspace
defined by sparse vectors that have ones along a single
row or column of their matrix representation.
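Let $V = \{u_1, \ldots, u_9\} \cup \{v_1, \ldots, v_4\}$ with
$$u_k(i, j) = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \tag{28}$$
$$v_k(i, j) = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases} \tag{29}$$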
We can define a 13-dimensional feature by taking the
dot product of a 36-dimensional HOG feature with each
$u_k$ and $v_k$. Projection into each $u_k$ is computed by summing
over the 4 normalizations for a fixed orientation.
Projection into each $v_k$ is computed by summing over 9
orientations for a fixed normalization (footnote 4).
As in the case of 11-dimensional PCA features we
obtain the same performance using the 36-dimensional
HOG features or the 13-dimensional features defined
by $V$. However, the computation of the 13-dimensional
features is much less costly than performing projections
to the top eigenvectors obtained via PCA since the $u_k$
and $v_k$ are sparse. Moreover, the 13-dimensional features
have a simple interpretation as 9 orientation features
and 4 features that reflect the overall gradient energy
in different areas around a cell.
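Because each $u_k$ and $v_k$ has ones along a single column or row, the analytic projection reduces to column sums and row sums of the $4 \times 9$ matrix. A minimal Python sketch, assuming the 36 values are stored row-major with one row per normalization (this storage order and the function name are assumptions):

```python
def project_13(hog36):
    """Project a 36-dim HOG feature (4 normalizations x 9 orientations) to 13 dims."""
    M = [hog36[r * 9:(r + 1) * 9] for r in range(4)]
    # Dot products with u_1..u_9: sum over the 4 normalizations per orientation.
    orientation_sums = [sum(M[r][c] for r in range(4)) for c in range(9)]
    # Dot products with v_1..v_4: sum over the 9 orientations per normalization.
    normalization_sums = [sum(M[r]) for r in range(4)]
    return orientation_sums + normalization_sums
```

The first nine outputs and the last four outputs both add up to the sum of all 36 entries. This linear dependence is the reason the subspace spanned by $V$ has dimension 12 rather than 13.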
We can also define low-dimensional features that are
contrast sensitive. We have found that performance on
some object categories improves using contrast sensitive
features, while some categories benefit from contrast
insensitive features. Thus in practice we use feature
vectors that include both contrast sensitive and insensitive
information.
4. The 13-dimensional feature is not a linear projection of the
36-dimensional feature into $V$ because the $u_k$ and $v_k$ are not orthogonal.
In fact the linear subspace spanned by $V$ has dimension 12.
0.45617 0.04390 0.02462 0.01339 0.00629 0.00556 0.00456 0.00391 0.00367
0.00353 0.00310 0.00063 0.00030 0.00020 0.00018 0.00018 0.00017 0.00014
0.00013 0.00011 0.00010 0.00010 0.00009 0.00009 0.00008 0.00008 0.00007
0.00006 0.00005 0.00004 0.00004 0.00003 0.00003 0.00003 0.00002 0.00002
Fig. 6. PCA of HOG features. Each eigenvector is displayed as a 4 by 9 matrix so that each row corresponds to one
normalization factor and each column to one orientation bin. The eigenvalues are displayed on top of the eigenvectors.
The linear subspace spanned by the top 11 eigenvectors captures essentially all of the information in a feature vector.
Note how all of the top eigenvectors are either constant along each column or row of the matrix representation.
Let $C$ be a cell-based feature map computed by aggregating
a pixel-level feature map with 9 contrast insensitive
orientations. Let $D$ be a similar cell-based feature
map computed using 18 contrast sensitive orientations.
We define 4 normalization factors for the $(i, j)$ cell of $C$
and $D$ using $C$ as in equation (26). We can normalize and
truncate $C(i, j)$ and $D(i, j)$ using these factors to obtain
$4 \times (9 + 18) = 108$ dimensional feature vectors, $F(i, j)$.
In practice we use an analytic projection of these
108-dimensional vectors, defined by 27 sums over different
normalizations, one for each orientation channel of $F$,
and 4 sums over the 9 contrast insensitive orientations,
one for each normalization factor. We use a cell size
of $k = 8$ and truncation value of $\alpha = 0.2$. The final
feature map has 31-dimensional vectors $G(i, j)$, with 27
dimensions corresponding to different orientation channels
(9 contrast insensitive and 18 contrast sensitive), and
4 dimensions capturing the overall gradient energy in
square blocks of four cells around $(i, j)$.
Finally, we note that the top eigenvectors in Figure 6
can be roughly interpreted as a two-dimensional separable
Fourier basis. Each eigenvector can be roughly
seen as a sine or cosine function of one variable. This
observation could be used to define features using a
finite number of Fourier basis functions instead of a finite
number of discrete orientations.
The appearance of a Fourier basis in Figure 6 is an
interesting empirical result. The eigenvectors of a $d \times d$
covariance matrix $\Sigma$ form a Fourier basis when $\Sigma$ is
circulant, i.e., $\Sigma_{i,j} = k(i - j \bmod d)$ for some function $k$.
Circulant covariance matrices arise from probability
distributions on vectors that are invariant to rotation of the
vector coordinates. The appearance of a two-dimensional
Fourier basis in Figure 6 is evidence that the distribution
of HOG feature vectors on natural images has (approximately)
a two-dimensional rotational invariance. We can
rotate the orientation bins and independently rotate the
four normalization blocks.
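The claim about circulant matrices is easy to check numerically. The short Python sketch below builds a circulant matrix from a function $k$ given as a list of values and verifies that each discrete Fourier vector is an eigenvector. The helper names are hypothetical and the example values of $k$ are arbitrary.

```python
import cmath

def circulant(k_values):
    """Build the d x d matrix with entries S[i][j] = k((i - j) mod d)."""
    d = len(k_values)
    return [[k_values[(i - j) % d] for j in range(d)] for i in range(d)]

def fourier_vector(d, m):
    """The m-th discrete Fourier basis vector of length d."""
    return [cmath.exp(2j * cmath.pi * m * n / d) for n in range(d)]

def is_eigenvector(S, f, tol=1e-9):
    """Check whether S f is a scalar multiple of f (f[0] must be nonzero)."""
    d = len(S)
    Sf = [sum(S[i][j] * f[j] for j in range(d)) for i in range(d)]
    lam = Sf[0] / f[0]
    return all(abs(Sf[i] - lam * f[i]) < tol for i in range(d))
```

For example, `all(is_eigenvector(circulant([4.0, 1.0, 0.5, 1.0]), fourier_vector(4, m)) for m in range(4))` holds, whereas a generic non-circulant matrix fails the same check. For a symmetric circulant matrix, the real and imaginary parts of these vectors give the sine and cosine eigenvectors mentioned in the text.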
7 POST PROCESSING
7.1 Bounding Box Prediction
The desired output of an object detection system is not
entirely clear. The goal in the PASCAL challenge is to
predict the bounding boxes of objects. In our previous
work [17] we reported bounding boxes derived from
root filter locations. Yet detection with one of our models
localizes each part filter in addition to the root filter.
Furthermore, part filters are localized with greater spatial
precision than root filters. It is clear that our original
approach discards potentially valuable information gained
from using a multiscale deformable part model.
In the current system, we use the complete configuration
of an object hypothesis, $z$, to predict a bounding
box for the object. This is implemented using functions
that map a feature vector $g(z)$, to the upper-left, $(x_1, y_1)$,
and lower-right, $(x_2, y_2)$, corners of the bounding box.
For a model with $n$ parts, $g(z)$ is a $2n + 3$ dimensional
vector containing the width of the root filter in image
pixels (this provides scale information) and the location
of the upper-left corner of each filter in the image.
Each object in the PASCAL training data is labeled by a
bounding box. After training a model we use the output
of our detector on each instance to learn four linear
functions for predicting $x_1$, $y_1$, $x_2$ and $y_2$ from $g(z)$. This
is done via linear least-squares regression, independently
for each component of a mixture model.
Figure 7 illustrates an example of bounding box prediction
for a car detection. This simple method yields small
but noticeable improvements in performance for some
categories in the PASCAL datasets (see Section 8).
7.2 Non-Maximum Suppression
Using the matching procedure from Section 3.2 we usu-
ally get multiple overlapping detections for each instance
of an object. We use a greedy procedure for eliminating
repeated detections via non-maximum suppression.
Fig. 7. A car detection and the bounding box predicted
from the object configuration.
After applying the bounding box prediction method
described above we have a set of detections $D$ for a
particular object category in an image. Each detection
is defined by a bounding box and a score. We sort the
detections in $D$ by score, and greedily select the highest
scoring ones while skipping detections with bounding
boxes that are at least 50% covered by a bounding box
of a previously selected detection.
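The greedy procedure can be sketched in Python as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`, and "covered" is taken to mean the fraction of the candidate box's own area that lies inside a previously selected box; the function names are hypothetical.

```python
def coverage(box, other):
    """Fraction of the area of `box` that is covered by `other`."""
    w = min(box[2], other[2]) - max(box[0], other[0])
    h = min(box[3], other[3]) - max(box[1], other[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / ((box[2] - box[0]) * (box[3] - box[1]))

def non_max_suppression(detections, threshold=0.5):
    """Greedy non-maximum suppression.

    detections : list of (box, score) pairs
    Returns the selected detections in order of decreasing score.
    """
    selected = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Skip a detection that is at least `threshold` covered by a kept box.
        if all(coverage(box, kept) < threshold for kept, _ in selected):
            selected.append((box, score))
    return selected
```

Note that this coverage measure is asymmetric: a small box lying inside a larger, higher-scoring box is suppressed, while a large box that merely contains a small higher-scoring one may survive.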
7.3 Contextual Information
We have implemented a simple procedure to rescore
detections using contextual information.
Let $(D_1, \ldots, D_k)$ be a set of detections obtained using
$k$ different models (for different object categories) in an
image $I$. Each detection $(B, s) \in D_i$ is defined by a
bounding box $B = (x_1, y_1, x_2, y_2)$ and a score $s$. We
define the context of $I$ in terms of a $k$-dimensional vector
$c(I) = (\sigma(s_1), \ldots, \sigma(s_k))$ where $s_i$ is the score of the
highest scoring detection in $D_i$, and $\sigma(x) = 1/(1 + \exp(-2x))$
is a logistic function for renormalizing the scores.
To rescore a detection $(B, s)$ in an image $I$ we build
a 25-dimensional feature vector with the original score
of the detection, the top-left and bottom-right bounding
box coordinates, and the image context,
$$g = (\sigma(s), x_1, y_1, x_2, y_2, c(I)). \tag{30}$$
The coordinates $x_1, y_1, x_2, y_2 \in [0, 1]$ are normalized by
the width and height of the image. We use a
category-specific classifier to score this vector to obtain a new
score for the detection. The classifier is trained to distinguish
correct detections from false positives by integrating
contextual information defined by $g$.
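Building the feature vector of equation (30) is straightforward; the following Python sketch shows one way to do it. The function names and argument layout are hypothetical, and the sketch assumes every category has at least one detection in the image so that a best score $s_i$ exists.

```python
import math

def sigma(x):
    """Logistic function used to renormalize detection scores."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def context_feature(box, score, best_scores, width, height):
    """Build g = (sigma(s), x1, y1, x2, y2, c(I)) for one detection.

    box         : (x1, y1, x2, y2) in image pixels
    best_scores : best detection score in the image for each of the k categories
    """
    x1, y1, x2, y2 = box
    context = [sigma(s) for s in best_scores]            # c(I)
    return [sigma(score),
            x1 / width, y1 / height, x2 / width, y2 / height] + context
```

With the $k = 20$ PASCAL categories this gives $1 + 4 + 20 = 25$ dimensions, matching the size stated above.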
To get training data for the rescoring classifier we
run our object detectors on images that are annotated
with bounding boxes around the objects of interest (such
as provided in the PASCAL datasets). Each detection
returned by one of our models leads to an example $g$ that
is labeled as a true positive or false positive detection,
depending on whether or not it significantly overlaps an
object of the correct category.
This rescoring procedure leads to a noticeable improvement
in the average precision on several categories in
the PASCAL datasets (see Section 8). In our experiments
we used the same dataset for training models and for
training the rescoring classifiers. We used SVMs with
quadratic kernels for rescoring.
8 EMPIRICAL RESULTS
We evaluated our system using the PASCAL VOC 2006,
2007 and 2008 comp3 challenge datasets and protocol.
We refer to [11]-[13] for details, but emphasize that
these benchmarks are widely acknowledged as difficult
testbeds for object detection.
Each dataset contains thousands of images of real-
world scenes. The datasets specify ground-truth bound-
ing boxes for several object classes. At test time, the goal
is to predict the bounding boxes of all objects of a given
class in an image (if any). In practice a system will output
a set of bounding boxes with corresponding scores, and
we can threshold these scores at different points to obtain
a precision-recall curve across all images in the test set.
For a particular threshold the precision is the fraction of
the reported bounding boxes that are correct detections,
while recall is the fraction of the objects found.
A predicted bounding box is considered correct if it
overlaps more than 50% with a ground-truth bounding
box, otherwise the bounding box is considered a false
positive detection. Multiple detections are penalized. If
a system predicts several bounding boxes that overlap
with a single ground-truth bounding box, only one pre-
diction is considered correct, the others are considered
false positives. One scores a system by the average precision
(AP) of its precision-recall curve across a test set.
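As a small illustration of the definitions above, the following Python sketch computes precision and recall at one score threshold. It assumes each reported box has already been marked correct or incorrect according to the overlap and multiple-detection rules; the function name and the convention of returning precision 1.0 when nothing is reported are assumptions of this sketch. Computing AP itself follows the PASCAL protocol and is not reproduced here.

```python
def precision_recall(scored, num_objects, threshold):
    """Precision and recall at a given score threshold.

    scored      : list of (score, is_correct) pairs over all test images
    num_objects : total number of ground-truth objects of the class
    """
    reported = [ok for s, ok in scored if s >= threshold]
    if not reported:
        return 1.0, 0.0                 # nothing reported (assumed convention)
    true_positives = sum(reported)      # True counts as 1
    return true_positives / len(reported), true_positives / num_objects
```

Sweeping `threshold` from high to low traces out the precision-recall curve: recall can only grow, while precision typically falls as more false positives are admitted.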
We trained a two component model for each class in
each dataset. Figure 9 shows some of the models learned
on the 2007 dataset. Figure 10 shows some detections we
obtain using those models. We show both high-scoring
correct detections and high-scoring false positives.
In some categories our false detections are often due
to confusion among classes, such as between horse and
cow or between car and bus. In other categories false
detections are often due to the relatively strict bounding
box criteria. The two false positives shown for the person
category are due to insufficient overlap with the ground-truth
bounding box. In the cat category we often detect
the face of a cat and report a bounding box that is too
small because it does not include the rest of the cat. In
fact, the top 20 highest scoring false positive bounding
boxes for the cat category correspond to the face of a cat.
This is an extreme case but it gives an explanation for
our low AP score in this category. In many of the positive
training examples for cats only the face is visible, and we
learn a model where one of the components corresponds
to a cat face model, see Figure 9.
Tables 1 and 2 summarize the results of our system on
the 2006 and 2007 challenge datasets. Table 3 summarizes
the results on the more recent 2008 dataset, together
with the systems that entered the official competition in
2008. Empty boxes indicate that a method was not tested
in the corresponding object class. The entry labeled
"UofCTTIUCI" is a preliminary version of the system
described here. Our system obtains the best AP score
in 9 out of the 20 categories and the second best in 8.
Moreover, in some categories such as person we obtain
a score significantly above the second best score.
For all of the experiments shown here we used the
objects not marked as difficult from the trainval
datasets to train models (we include the objects marked
as truncated). Our system is fairly efficient. Using a
desktop computer it takes about 4 hours to train a model
on the PASCAL 2007 trainval dataset and 3 hours to
evaluate it on the test dataset. There are 4952 images
in the test dataset, so the average running time per
image is around 2 seconds. All of the experiments were
done on a 2.8 GHz 8-core Intel Xeon Mac Pro computer
running Mac OS X 10.5. The system makes use of the
multiple-core architecture for computing filter responses
in parallel, although the rest of the computation runs in
a single thread.
We evaluated different aspects of our system on the
longer-established 2006 dataset. Figure 8 summarizes
results of different models on the person and car cate-
gories. We trained models with 1 and 2 components with
and without parts. We also show the result of a 2 compo-
nent model with parts and bounding box prediction. We
see that the use of parts (and bounding box prediction)
can significantly improve the detection accuracy. Mixture
models are important in the car category but not in the
person category of the 2006 dataset.
We also trained and tested a 1 component model on
the INRIA Person dataset [10]. We scored the model
with the PASCAL evaluation methodology (using the
PASCAL development kit) over the complete test set,
including images without people. We obtained an AP
score of .869 in this dataset using the base system with
bounding box prediction.
9 DISCUSSION
We described an object detection system based on mixtures
of multiscale deformable part models. Our system
relies heavily on new methods for discriminative training
of classifiers that make use of latent information.
It also relies heavily on efficient methods for matching
deformable models to images. The resulting system is
both efficient and accurate, leading to state-of-the-art
results on difficult datasets.
Our models are already capable of representing highly
variable object classes, but we would like to move
towards richer models. The framework described here
allows for exploration of additional latent structure. For
example, one can consider deeper part hierarchies (parts
with parts) or mixture models with many components.
In the future we would like to build grammar based
models that represent objects with variable hierarchical
structures. These models should allow for mixture models
at the part level, and allow for reusability of parts,
both in different components of an object and among
different object models.
Fig. 8. Precision/Recall curves for models trained on the
person and car categories of the PASCAL 2006 dataset.
We show results for 1 and 2 component models with
and without parts, and a 2 component model with parts
and bounding box prediction. In parentheses we show the
average precision score for each model.
[Two plots of precision (vertical axis, 0 to 1) against recall (horizontal axis, 0 to 0.7).
class: person, year 2006: 1 Root (0.24), 2 Root (0.24), 1 Root+Parts (0.38), 2 Root+Parts (0.37), 2 Root+Parts+BB (0.39).
class: car, year 2006: 1 Root (0.48), 2 Root (0.58), 1 Root+Parts (0.55), 2 Root+Parts (0.62), 2 Root+Parts+BB (0.64).]
ACKNOWLEDGMENTS
This material is based upon work supported by the Na-
tional Science Foundation under Grant No. IIS 0746569
(Pedro Felzenszwalb and Ross Girshick), IIS 0811340
(David McAllester) and IIS 0812428 (Deva Ramanan).
Fig. 9. Some of the models learned on the PASCAL 2007 dataset.
[Panels: person, bottle, cat, car.]
Fig. 10. Examples of high-scoring detections on the PASCAL 2007 dataset, selected from the top 20 highest scoring
detections in each class. The framed images (last two in each row) illustrate false positives for each category. Many
false positives (such as for person and cat) are due to the bounding box scoring criteria.
[Rows: person, car, horse, sofa, bottle, cat.]
            bike  bus   car   cat   cow   dog   horse mbik  pers  sheep
a) base     .619  .490  .615  .188  .407  .151  .392  .576  .363  .404
b) BB       .620  .493  .635  .190  .417  .153  .386  .579  .380  .402
c) context  .623  .502  .631  .236  .437  .185  .429  .625  .401  .431
TABLE 1
PASCAL VOC 2006 results. (a) average precision scores of the base system, (b) scores using bounding box
prediction, (c) scores using bounding box prediction and context rescoring.
            aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbik   pers   plant  sheep  sofa   train  tv
a) base     .290   .546   .006   .134   .262   .394   .464   .161   .163   .165   .245   .050   .436   .378   .350   .088   .173   .216   .340   .390
b) BB       .287   .551   .006   .145   .265   .397   .502   .163   .165   .166   .245   .050   .452   .383   .362   .090   .174   .228   .341   .384
c) context  .328   .568   .025   .168   .285   .397   .516   .213   .179   .185   .259   .088   .492   .412   .368   .146   .162   .244   .392   .391
TABLE 2
PASCAL VOC 2007 results. (a) average precision scores of the base system, (b) scores using bounding box
prediction, (c) scores using bounding box prediction and context rescoring.
aero bike bird boat bottle bus car cat chair cow table dog horse mbik pers plant sheep sofa train tv
a) base .336 .371 .066 .099 .267 .229 .319 .143 .149 .124 .119 .064 .321 .353 .407 .107 .157 .136 .228 .324
b) BB .339 .381 .067 .099 .278 .229 .331 .146 .153 .119 .124 .066 .322 .366 .423 .108 .157 .139 .234 .328
c) context .351 .402 .117 .114 .284 .251 .334 .188 .166 .114 .087 .078 .347 .395 .431 .117 .181 .166 .256 .347
d) rank 2 1 1 1 1 1 2 2 1 2 4 5 2 2 1 1 2 2 3 1
(UofCTTIUCI) .326 .420 .113 .110 .282 .232 .320 .179 .146 .111 .066 .102 .327 .386 .420 .126 .161 .136 .244 .371
CASIA Det .252 .146 .098 .105 .063 .232 .176 .090 .096 .100 .130 .055 .140 .241 .112 .030 .028 .030 .282 .146
Jena .048 .014 .003 .002 .001 .010 .013 .001 .047 .004 .019 .003 .031 .020 .003 .004 .022 .064 .137
LEAR PC .365 .343 .107 .114 .221 .238 .366 .166 .111 .177 .151 .090 .361 .403 .197 .115 .194 .173 .296 .340
MPI struct .259 .080 .101 .056 .001 .113 .106 .213 .003 .045 .101 .149 .166 .200 .025 .002 .093 .123 .236 .015
Oxford .333 .246 .291 .125 .325 .349
XRCE Det .264 .105 .014 .045 .000 .108 .040 .076 .020 .018 .045 .105 .118 .136 .090 .015 .061 .018 .073 .068
TABLE 3
PASCAL VOC 2008 results. Top: (a) average precision scores of the base system, (b) scores using bounding box
prediction, (c) scores using bounding box prediction and context rescoring, (d) ranking of final scores relative to
systems in the 2008 competition. Bottom: the systems that participated in the competition (UofCTTIUCI is a
preliminary version of our system and we do not include it in the ranking).
REFERENCES
[1] Y. Amit and A. Kong, "Graphical templates for model registration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 3, pp. 225–236, 1996.
[2] Y. Amit and A. Trouve, "POP: Patchwork of parts models for object recognition," International Journal of Computer Vision, vol. 75, no. 2, pp. 267–282, 2007.
[3] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Advances in Neural Information Processing Systems, 2003.
[4] A. Bar-Hillel and D. Weinshall, "Efficient learning of relational object class models," International Journal of Computer Vision, vol. 77, no. 1, pp. 175–198, 2008.
[5] E. Bernstein and Y. Amit, "Part-based statistical models for object classification and detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[6] M. Burl, M. Weber, and P. Perona, "A probabilistic approach to object recognition using local photometry and global geometry," in European Conference on Computer Vision, 1998.
[7] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, 2001.
[8] J. Coughlan, A. Yuille, C. English, and D. Snow, "Efficient deformable template detection and localization without user initialization," Computer Vision and Image Understanding, vol. 78, no. 3, pp. 303–319, June 2000.
[9] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, "Spatial priors for part-based recognition using statistical models," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results." [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/
[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results." [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2008/
[13] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, "The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results." [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/
[14] P. Felzenszwalb and D. Huttenlocher, "Distance transforms of sampled functions," Cornell University CIS, Tech. Rep. 2004-1963, 2004.
[15] P. Felzenszwalb and D. Huttenlocher, "Pictorial structures for object recognition," International Journal of Computer Vision, vol. 61, no. 1, 2005.
[16] P. Felzenszwalb and D. McAllester, "The generalized A* architecture," Journal of Artificial Intelligence Research, vol. 29, pp. 153–190, 2007.
[17] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[18] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[19] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[20] M. Fischler and R. Elschlager, "The representation and matching of pictorial structures," IEEE Transactions on Computer, vol. 22, no. 1, 1973.
[21] U. Grenander, Y. Chow, and D. Keenan, HANDS: A Pattern-Theoretic Study of Biological Shapes. Springer-Verlag, 1991.
[22] D. Hoiem, A. Efros, and M. Hebert, "Putting objects in perspective," International Journal of Computer Vision, vol. 80, no. 1, October 2008.
[23] A. Holub and P. Perona, "A discriminative framework for modelling object classes," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[24] Y. Jin and S. Geman, "Context and hierarchy in a probabilistic image model," in IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[25] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. MIT Press, 1999.
[26] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[27] Y. LeCun, S. Chopra, R. Hadsell, R. Marc'Aurelio, and F. Huang, "A tutorial on energy-based learning," in Predicting Structured Data, G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, Eds. MIT Press, 2006.
[28] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, no. 1, pp. 259–289, 2008.
[29] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[30] C. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in IEEE International Conference on Computer Vision, 1998.
[31] W. Plantinga and C. Dyer, "An algorithm for constructing the aspect graph," in Foundations of Computer Science, 1985., 27th Annual Symposium on, 1986, pp. 123–131.
[32] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell, "Hidden conditional random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1848–1852, October 2007.
[33] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, "Objects in context," in IEEE International Conference on Computer Vision, 2007.
[34] D. Ramanan and C. Sminchisescu, "Training deformable models for localization," in IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[35] H. Rowley, S. Baluja, and T. Kanade, "Human face detection in visual scenes," Carnegie Mellon University, Tech. Rep. CMU-CS-95-158R, 1995.
[36] H. Schneiderman and T. Kanade, "A statistical method for 3D object detection applied to faces and cars," in IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[37] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-gradient solver for SVM," in International Conference on Machine Learning, 2007.
[38] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," Massachusetts Institute of Technology, Tech. Rep. A.I. Memo No. 1521, 1994.
[39] A. Torralba, "Contextual priming for object detection," International Journal of Computer Vision, vol. 53, no. 2, pp. 169–191, July 2003.
[40] P. Viola, J. Platt, and C. Zhang, "Multiple instance boosting for object detection," in Advances in Neural Information Processing Systems, 2005.
[41] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, May 2004.
[42] M. Weber, M. Welling, and P. Perona, "Towards automatic discovery of object categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[43] A. Yuille, P. Hallinan, and D. Cohen, "Feature extraction from faces using deformable templates," International Journal of Computer Vision, vol. 8, no. 2, pp. 99–111, 1992.
[44] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," International Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, June 2007.
[45] S. Zhu and D. Mumford, "A stochastic grammar of images," Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, pp. 259–362, 2007.