
The Fastest Deformable Part Model for Object Detection

Junjie Yan Zhen Lei Longyin Wen Stan Z. Li ∗


Center for Biometrics and Security Research & National Laboratory of Pattern Recognition
Institute of Automation, Chinese Academy of Sciences, China
{jjyan,zlei,lywen,szli}@nlpr.ia.ac.cn

Abstract

This paper solves the speed bottleneck of the deformable part model (DPM), while maintaining the accuracy in detection on challenging datasets. Three prohibitive steps in the cascade version of DPM are accelerated, including 2D correlation between root filter and feature map, cascade part pruning, and HOG feature extraction. For 2D correlation, the root filter is constrained to be low rank, so that the 2D correlation can be calculated by a more efficient linear combination of 1D correlations. A proximal gradient algorithm is adopted to progressively learn the low rank filter in a discriminative manner. For cascade part pruning, a neighborhood aware cascade is proposed to capture the dependence in neighborhood regions for aggressive pruning. Instead of explicit computation of part scores, hypotheses can be pruned by the scores of their neighborhoods under a first order approximation. For HOG feature extraction, look-up tables are constructed to replace expensive calculations of orientation partition and magnitude with simpler matrix index operations. Extensive experiments show that (a) the proposed method is 4 times faster than the current fastest DPM method with similar accuracy on Pascal VOC, and (b) the proposed method achieves state-of-the-art accuracy on pedestrian and face detection tasks at frame-rate speed.

1. Introduction

The deformable part model (DPM) [11] is one of the most popular object detection methods. It was originally proposed for the Pascal VOC challenge [9] and is the foundation of champion systems in Pascal VOC 2007-2011. Recent works have extended DPM to related tasks and achieved leading performance, such as articulated human pose estimation [35], face detection [36, 34] and pedestrian detection [33, 32]. DPM has an advantage in handling large appearance variations on challenging datasets; however, it takes more than 10 seconds (without parallelization) per image in Pascal VOC. The speed is a bottleneck of DPM in real applications, where speed is often as important as accuracy.

Recent works have accelerated DPM by one order of magnitude, such as cascade [10], coarse-to-fine [22], branch-and-bound [16] and FFT [8]. In DPM, the detection score of each hypothesis is determined by the score of appearance minus the deformation cost. The appearance score is calculated by the correlation between the HOG feature and a sequence of filters including root and parts, which takes most of the time due to the high dimension. [10] and [22] reduced the computation by pruning unpromising hypotheses early. [8] used FFT to accelerate correlation. These methods, however, still take about 1 second per image for Pascal VOC detection. We take cascade DPM [10] as the baseline, and find that each of 2D correlation by the root filter, cascade part pruning and HOG feature extraction makes DPM prohibitive even if the other two steps are free. To finally remedy the speed bottleneck of DPM, we thus need to accelerate all three steps.

Discriminative Low Rank Root Filter. In DPM (and its accelerated versions [10, 22, 16]), root scores are densely computed by 2D correlation between the root filter and the HOG feature map. This paper reduces the cost by constraining the rank of the root filter. As used in other computer vision tasks [13, 21, 31, 25], the 2D correlation can be divided into a linear combination of more efficient 1D correlations, where the number of combinations is the rank of the filter. To learn the low rank filter while preserving the discriminative ability, an additional nuclear norm is added to the traditional SVM objective function. This paper adopts a proximal gradient algorithm to progressively learn it by minimizing an upper bound function with a closed form solution. The discriminatively learned low rank root filter can reduce the correlation cost and helps to prune a large number of negative hypotheses.

Neighborhood Aware Cascade. DPM can be made more efficient through cascade based pruning of low score hypotheses after evaluating a subset of parts, as explored in [10]. However, there are still two kinds of redundancy in this cascade.

∗ Corresponding author.
The first is that one object can activate multiple overlapping hypotheses that pass through the whole cascade, while only the one hypothesis with the highest score is useful for detection. The second is that if one hypothesis has a very low score, its neighborhoods tend to have low scores and can probably avoid evaluation. Motivated by the crosstalk cascade [5] in boosting classifiers, this paper proposes a neighborhood aware cascade for DPM to reduce the two kinds of redundancy. Many hypotheses in this cascade can be aggressively pruned according to a first order approximation of the stage scores by their neighborhoods, instead of explicit computation.

Look-up Table HOG. HOG is used in DPM as a low-level representation due to its advantage in tolerating local transformation. However, the original HOG calculation has a high computational cost, mainly due to the operations in calculating the orientation partition and magnitude. This paper shows that look-up tables (LUT) can be used to replace them with much simpler matrix index operations, based on the fact that there are only finitely many possibilities of gradient and orientation.

The rest of the paper is organized as follows. Section 2 reviews the related work. An overall introduction of DPM is presented in section 3. The discriminative low rank root filter, neighborhood aware cascade and LUT HOG are described in sections 4, 5, 6 respectively. We show experiments in section 7 and conclude the paper in section 8.

2. Related Work

Acceleration of DPM. This work is most related to approaches that accelerate single category DPM in detection. [10] proposed to convert the star-structure to a cascade, which can efficiently prune unpromising hypotheses. [22] proposed a coarse-to-fine approach based on the observation that the model at low resolution can prune a lot of hypotheses with low computational cost. FFT was used to accelerate the correlation in [8]. Motivated by the branch-and-bound approach [20] for object detection, [16] introduced it to DPM with a carefully designed bound. For a category with 6 components, these methods run at about 1 FPS per Pascal VOC image on a single thread, which is faster than DPM by one order, but still relatively slow for real applications.

Acceleration of Multi-category DPM. Quite a large number of recent works [23, 27, 15, 3, 17] were proposed to accelerate DPM for multi-category detection, e.g. simultaneous detection of 20 categories on Pascal VOC. Steerable part model [23] used a part bank with linear combination to approximate the correlation scores of different categories. Sparselet [27, 15] used a large part bank with sparse linear combination. [15] and [23] both achieve three times acceleration over the original DPM for 20-category object detection; however, they are slower than the cascade DPM [10], which detects each category independently. Very recently, [3] proposed to use locality-sensitive hashing to approximate the correlation in DPM, with a decline of performance, to detect 100,000 categories on a single workstation.

Acceleration of Pedestrian Detection. Recently, large improvements in efficiency were achieved in the pedestrian detection task [6, 5]. [6] proposed to approximate features at nearby scales for fast computation of multiple channel features. Based on the feature and boosting classifier in [6], [5] further proposed the crosstalk cascade by considering the dependence in neighborhoods. [5] is considered to be one of the best detectors in the Viola-Jones framework [30] in terms of speed and accuracy. We extend this idea to the neighborhood aware cascade in DPM.

HOG computation. The widely used HOG implementation in [12] takes about 0.5 second per VGA image on a single thread, which itself slows down DPM. Unfortunately, this step is often directly ignored by recent works on acceleration of the deformable part model. Some recent works [27, 28, 24] accelerated HOG with the computation capacity of GPUs; however, the algorithm itself is not improved. With the help of LUT, the HOG implementation in this paper running on a single CPU thread is as fast as the GPU implementation reported in [24]. The LUT based method can also be applied on GPU for further acceleration.

3. DPM and Cascade DPM

This part gives a brief review of DPM and cascade DPM, and then analyzes the bottleneck in computation.

The DPM is composed of a root filter w_0 and n parts, where the t-th part is parameterized by a filter w_t and a deformation term d_t. An object hypothesis γ is specified by {p_0, p_1, ..., p_n}, where p_0 is the location of the root, and p_t is the location of the t-th part. Root and parts are connected by a pictorial structure. The detection score s(γ) is defined as:

    s(γ) = w_0^T φ_a(p_0, I) + Σ_{t=1}^{n} ( w_t^T φ_a(p_t, I) − d_t^T φ_d(p_t, p_0) ),   (1)

where φ_a is the HOG feature for appearance, and φ_d is a separable quadratic function for deformation. Mixture components can be naturally added to represent objects in different poses, but we leave them out to simplify the notation.

For a hypothesis γ in detection, only the root location p_0 is known, while the part location p_t is inferred by maximizing the part appearance score minus the deformation cost associated with the displacement:

    p_t = argmax_p ( w_t^T φ_a(p, I) − d_t^T φ_d(p, p_0) ),   (2)

where p traverses the possible locations of the part. Since parts are directly attached to the root, their locations are inferred independently for a fixed root. It has been found in previous works [10, 22, 8] that in DPM most of the time is spent on calculating the appearance term due to the high dimension.
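The inference in Eqs. 1-2 can be sketched in a few lines of NumPy. This is an illustrative brute-force version, not the paper's code: the array names, the search radius and the quadratic deformation features (dx, dy, dx², dy²) are assumptions for the sketch.

```python
import numpy as np

def part_score(app_t, d_t, p0, search=4):
    """Eq. 2: maximize appearance(p) - d_t^T phi_d(p, p0) over
    candidate part locations p near the root anchor p0."""
    best = -np.inf
    H, W = app_t.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = p0[0] + dy, p0[1] + dx
            if 0 <= y < H and 0 <= x < W:
                phi_d = np.array([dx, dy, dx * dx, dy * dy])  # separable quadratic
                best = max(best, app_t[y, x] - d_t @ phi_d)
    return best

def hypothesis_score(root_app, part_apps, deforms, p0):
    """Eq. 1: root appearance plus the n independently maximized part terms."""
    s = root_app[p0]
    for app_t, d_t in zip(part_apps, deforms):
        s += part_score(app_t, d_t, p0)
    return s
```

Because the parts are attached directly to the root, each `part_score` call is an independent maximization, which is what makes stage-by-stage evaluation (and pruning) in the cascade possible.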
Figure 1. Average time cost (seconds) of the cascade DPM [10] and the proposed method on Pascal VOC with a single CPU thread. The times for HOG extraction, root and part are listed, while the other steps are nearly free. For each category, the DPM has 6 mixture components and each component has 8 parts. (Cascade DPM: HOG 0.46, root 0.15, part 0.84; proposed method: HOG 0.07, root 0.08, part 0.14.)

In cascade DPM [10], acceleration is achieved by reducing the number of parts evaluated. The cascade DPM places the root filter in the first stage and the part filters sequentially in the following stages. In each stage, a hypothesis can be pruned if its score is below a pre-learned threshold. The time cost of each step in cascade DPM is shown in Fig. 1. It has accelerated DPM by one order; however, all three steps are still prohibitive.

In the first stage of the cascade, correlation is calculated densely between the root filter and the feature map, while in the following part stages, correlation is only calculated sparsely for unpruned hypotheses. In this paper, different methods are used to accelerate the two kinds of computation. We learn a discriminative low rank root filter in the first stage for both efficient dense correlation and safe pruning of unpromising hypotheses after the correlation. For the parts, we design more aggressive pruning by exploring first order neighborhood information. Besides, the HOG feature extraction is also accelerated by look-up tables. With these three acceleration techniques, the proposed method is 4 times faster than the cascade DPM (shown in Fig. 1). We give details of the three techniques in the following three sections.

4. Discriminative Low Rank Root Filter

In this part, we aim to reduce the cost of computing the root score, which is the result of dense correlation between the HOG feature map and the root filter. The acceleration comes from the separability and linearity of correlation. Suppose we have a 2D feature map K ∈ R^{m1×n1} and a 2D filter F ∈ R^{m2×n2} with rank r (r ≤ min(m2, n2)). With the SVD decomposition, F can be expressed as:

    F = Σ_{i=1}^{r} σ_i u_i v_i^T,   (3)

where σ_i ∈ R is the i-th singular value. Herein u_i ∈ R^{m2×1}, v_i ∈ R^{n2×1}. With this expression, it can be easily proved that:

    K ◦ F = K ◦ Σ_{i=1}^{r} σ_i u_i v_i^T = Σ_{i=1}^{r} σ_i ((K ◦ u_i) ◦ v_i^T),   (4)

where ◦ denotes the correlation operator. In the last term, the correlation is first conducted on each column by the 1D filter u_i, and then on each row by the 1D filter v_i. This procedure requires r(m2 + n2)m1n1 multiplications¹, while the original 2D correlation needs m2n2m1n1 multiplications. It is easy to see that the combination of two 1D correlations in Eq. 4 needs fewer multiplications if the rank r is small enough, so we expect a low rank root filter for computational efficiency.

¹ In implementation, u_i is replaced with the "dot product" result σ_i · u_i to avoid additional multiplications by σ_i at runtime.

Besides the low rank property, the ability to distinguish objects from backgrounds is also required of the root filter, in order to efficiently prune negative hypotheses after the correlation. In the following part, we describe how to learn this kind of discriminative low rank root filter.

Matrix based representations are used to simplify the notation. Let the dimension of a HOG cell be l. The i-th 2D HOG feature plane of the root specified by γ of image I is denoted as {Φ_i(γ, I)}_{1≤i≤l}, and the i-th 2D plane of the root filter is denoted as {W_i}_{1≤i≤l}. We denote diag{W_i}_{1≤i≤l} as W and diag{Φ_i(γ, I)}_{1≤i≤l} as Φ(γ, I). The correlation score can then be written as Tr(W^T Φ(γ, I)), where Tr(·) is the trace operator. One traditional way to get a discriminative root filter is SVM based learning, where the filter is expected to distinguish true object hypotheses from backgrounds. Given M training samples, W can be learned by SVM:

    min_W (1/2)‖W‖_F^2 + C Σ_M max(0, 1 − y_m Tr(W^T Φ(γ_m, I_m))),   (5)

where ‖·‖_F is the Frobenius norm. y_m ∈ {−1, 1} is the label of the hypothesis specified by γ_m and I_m, where 1 indicates object and −1 indicates background. The first term is used for regularization, and the last term measures the loss in detection.

As aforementioned, the root filter is desired to be low rank for efficient correlation computation. Motivated by recent works on matrix completion [2], we use an additional nuclear norm to constrain the rank of W in learning:

    min_W µ‖W‖_* + (1/2)‖W‖_F^2 + C Σ_M max(0, 1 − y_m Tr(W^T Φ(γ_m, I_m))),   (6)

where ‖·‖_* is the nuclear norm. µ controls the trade-off between the efficiency and the loss in detection. Despite Eq. 6 being convex, there are two difficulties in optimization. The first is the large number of negative samples, for which we use a hard sample mining procedure similar to [11].
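The separable computation of Eq. 4 can be verified in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation: `corr2d` is a hypothetical brute-force "valid" correlation used only to check the identity.

```python
import numpy as np

def corr2d(K, F):
    """Brute-force 'valid' 2D correlation of feature map K with filter F."""
    m1, n1 = K.shape
    m2, n2 = F.shape
    out = np.zeros((m1 - m2 + 1, n1 - n2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K[i:i + m2, j:j + n2] * F)
    return out

def corr2d_lowrank(K, F, r=None):
    """Eq. 3-4: SVD the filter, then accumulate r pairs of 1D
    correlations (columns by u_i, then rows by v_i)."""
    U, s, Vt = np.linalg.svd(F)
    if r is None:
        r = np.linalg.matrix_rank(F)
    out = 0.0
    for i in range(r):
        u = s[i] * U[:, i:i + 1]   # fold sigma_i into u_i (footnote 1)
        v = Vt[i:i + 1, :]
        out = out + corr2d(corr2d(K, u), v)
    return out
```

For a rank-r filter the low rank path does r(m2 + n2) multiplications per output location instead of m2·n2, which is where the root-stage saving in Fig. 1 comes from.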
The second is the convergence when the training set is fixed, for which we adopt the proximal gradient method [29].

Denoting the sum of the last two terms in Eq. 6 as f(W) (which is also convex), one sub-gradient of f(W) can be formulated as:

    ∇f(W) = W + C Σ_M h(W, γ_m, y_m, I_m),   (7)

where h(W, γ_m, y_m, I_m) is set to be:

    h(W, γ_m, y_m, I_m) = 0 if y_m Tr(W^T Φ(γ_m, I_m)) ≥ 1, and −y_m Φ(γ_m, I_m) otherwise.   (8)

Defining Y as a local region of W, the sub-gradient satisfies the Lipschitz condition ‖∇f(W) − ∇f(Y)‖_F ≤ L_f ‖W − Y‖_F, where a conservative L_f is set to cM (which is a constant in this problem), since ‖Φ(γ_m, I_m)‖_F can be naturally bounded by a constant √c. A quadratic approximation of the objective function in Eq. 6 by Taylor expansion can be formulated as:

    Q(W, Y) = µ‖W‖_* + f(Y) + Tr(∇f(Y)^T (W − Y)) + (L_f/2)‖W − Y‖_F^2.   (9)

It can be easily proved that Q(W, Y) is a tight upper bound of Eq. 6 due to the Lipschitz condition of the sub-gradient. Defining a matrix G = Y − (1/L_f)∇f(Y), Eq. 9 can be minimized by the following problem instead:

    argmin_W Q(W, G) = argmin_W µ‖W‖_* + (L_f/2)‖W − G‖_F^2.   (10)

Suppose the SVD decomposition of G is UΣV^T; the closed form solution to Eq. 10 can be obtained as (see [1]):

    W = U D_τ(Σ) V^T,   (11)

where D_τ(Σ) = diag({max(σ_i − τ, 0)}), and τ = µ/L_f. Eq. 9 can be iteratively optimized over a sequence {Y_k} once the training set is fixed. In each iteration, following the advice in [29], we set Y_k = W_k + ((t_{k−1} − 1)/t_k)(W_k − W_{k−1}), and t_k = (1 + √(4t_{k−1}^2 + 1))/2.

The details of the optimization procedure for root filter learning are shown in Alg. 1. We use a standard hard negative sample mining procedure according to the learned W in the outer loop, and then refine W with the mined samples in the inner loop. The initialization of W is set to the root filter used in the first stage of cascade DPM [10], which is the original DPM root filter with PCA dimension reduction. In experiments, the rank of the filter in each plane is 2 or 3, which makes correlation about twice as fast as with the original full rank filter. Moreover, since the root filter is discriminatively learned, it is able to prune many negative hypotheses. Note that the root filter at the first stage of the cascade is not necessarily optimal for the whole DPM, and we can re-compute the scores of unpruned hypotheses with the original root filter at a later stage.

Algorithm 1 Proximal Gradient Algorithm for Discriminative Low Rank Root Filter Learning
Input: W_0 and W_1 set to the root filter after PCA in the original DPM; t_0 = t_1 = 1 and k = 1; initial training set {Γ_M, I_M} initialized with all positive samples and sampled negative samples.
Output: Discriminative low rank root filter W
 1: while not converged do
 2:   Mine hard negative samples with W to update the training set {Γ_M, I_M}.
 3:   while not converged do
 4:     Y_k ← W_k + ((t_{k−1} − 1)/t_k)(W_k − W_{k−1})
 5:     ∇f(Y_k) ← Y_k + C Σ_m h(W_k, γ_m, y_m, I_m)
 6:     G_k ← Y_k − (1/L_f)∇f(Y_k)
 7:     (U_k, Σ_k, V_k) ← svd(G_k)
 8:     W_{k+1} ← U_k D_{µ/L_f}(Σ_k) V_k^T
 9:     t_{k+1} ← (1 + √(4t_k^2 + 1))/2, k ← k + 1
10:   end while
11:   W ← W_k
12: end while

5. Neighborhood Aware Cascade

In this part, we focus on reducing the parts computation cost via the neighborhood aware cascade.

Cascade DPM [10] improves efficiency by pruning unpromising hypotheses early. Starting from the calculation of the root score s_0(γ) in the first stage for each hypothesis γ, parts are evaluated sequentially in the following stages. The score of the t-th (t ≥ 1) stage is defined as:

    s_t(γ) = s_{t−1}(γ) + w_t^T φ_a(p_t, I) − d_t^T φ_d(p_t, p_0),   (12)

where each stage evaluates a part. There are two pruning criteria in [10]. The hypothesis γ can be pruned directly if the t-th stage satisfies s_t(γ) < ρ_t, where ρ_t is a pre-defined threshold. By traversing the optimal part location in a local region, deformation pruning is adopted if the score s_t(γ) minus the deformation cost is below ζ_t. However, there are still two kinds of redundancy that can be reduced for further acceleration, as discussed in the following part.

The first redundancy, which is often ignored by previous works, exists in evaluating positive hypotheses. It is well known that an object instance always activates multiple overlapping detections. A merge step such as non-maximum suppression (NMS) is usually adopted to eliminate these overlapping hypotheses and finally preserve the detection with the highest score. The redundancy is that we only need one detection per overlapping group, but all of them pass through the whole cascade. In experiments we test the cascade DPM and find that one final detection corresponds to 21.85 detections on average before the NMS step (with the default threshold). We name these eliminated overlapping positive hypotheses semi-positive hypotheses.
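The proximal step of Eqs. 10-11 (lines 6-8 of Alg. 1) is singular value soft-thresholding, which has a one-line closed form. A minimal NumPy sketch, with a random matrix standing in for G:

```python
import numpy as np

def svt(G, tau):
    """Closed-form minimizer of mu*||W||_* + (Lf/2)*||W - G||_F^2
    with tau = mu/Lf: SVD G, then shrink each singular value
    toward zero by tau and clip at zero (Eq. 11)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

With a large enough `tau`, the small singular values are zeroed out, so the returned filter has reduced rank; this is exactly the property that makes the separable correlation of Eq. 4 cheap.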
They can take about half of the time (most hypotheses in later stages belong to this case), and we want to prune them early to save computation.

The second redundancy exists in evaluating negative hypotheses. In traditional cascade based pruning, each hypothesis is evaluated independently. Nevertheless, this procedure ignores the fact that there is great dependency among detection scores in neighborhood regions. For example, a hypothesis with a very low score indicates that its neighborhoods probably have very low scores and do not need to be evaluated any more. We name these negative hypotheses with low score neighborhoods semi-negative hypotheses and want to prune them before explicitly evaluating their scores at certain stages.

Motivated by [5], we use "first order" information in DPM cascade pruning to avoid the two kinds of redundancy. That is, besides explicitly calculating a stage score, we can also estimate it from the neighborhoods by first order approximation. We name this cascade the neighborhood aware cascade. Let the neighborhood of γ be N(γ). We add the following two first order pruning criteria to decide whether γ is pruned or passed to the next stage (the formal proofs can be found in the supplementary material).

Semi-Positive Pruning: If ∃γ′ ∈ N(γ) which satisfies s_t(γ′) > s_t(γ) + µ_t, γ is pruned without evaluating the remaining stages. Herein µ_t is a pre-learned threshold. This is reasonable since if the score of a hypothesis is much lower than its neighborhood, it will be pruned in the NMS step even if it passes the whole cascade.

Semi-Negative Pruning: If the score of a hypothesis at the t-th stage is below a threshold, s_t(γ) < ν_t, all the hypotheses in its neighborhood region N(γ) are pruned without evaluation. This is because the score of a hypothesis can be bounded by its neighborhoods under the first order approximation.

The details of the neighborhood aware cascade for DPM are listed in Alg. 2. Z(γ) in Alg. 2 indicates whether γ is pruned or not. The neighborhood N(γ) is set to a 5×5 region centered at γ, chosen empirically by cross-validation. The algorithm starts from root score computation with the learned low rank root filter. Lines 9-15 of Alg. 2 find the best part location by searching a local region ∆(p_0, t) and add its score. In the final step, we also use NMS to merge overlapping hypotheses, but their number is much smaller than in the cascade DPM. In implementation, similar to [10], PCA is used to simplify the appearance in early stages, and the original full filters are used at late stages. Another useful detail is that part scores can be cached to avoid repeated calculation by neighborhoods.

Algorithm 2 Neighborhood Aware Cascade in DPM
Input: Pre-learned thresholds {µ_t, ν_t, ρ_t, ζ_t}, hypothesis set Γ of an input image I, index set Z with all values initialized to 1.
Output: Detection set D
 1: Calculate the root score of all hypotheses in the first stage by dense correlation between the feature map and the low rank root filter.
 2: for t = 1 to n do
 3:   for γ ∈ Γ & Z(γ) = 1 do
 4:     if s(γ) ≤ ν_t then
 5:       Z(N(γ)) ← 0
 6:     else if s(γ) ≤ ρ_t or s(N(γ)) − s(γ) > µ_t then
 7:       Z(γ) ← 0
 8:     else
 9:       f ← −∞
10:       for p ∈ ∆(p_0, t) do
11:         if s(γ) − d_t^T φ_d(p, p_0) > ζ_t then
12:           f ← max(f, w_t^T φ_a(p, I) − d_t^T φ_d(p, p_0))
13:         end if
14:       end for
15:       s(γ) ← s(γ) + f
16:     end if
17:   end for
18: end for
19: D ← NMS(Γ(Z = 1))

To learn the thresholds {µ_t, ν_t, ρ_t, ζ_t}, we run the original DPM detector on labeled object hypotheses and their neighborhoods, and cache their scores of root and parts. The optimal threshold should be as large as possible for aggressive pruning, but must ensure that true object hypotheses are not pruned. Let the object hypothesis training set be X; we set ρ_t = min_{γ∈X} s_t(γ), and ζ_t = min_{γ∈X}(s_t(γ) − d_t^T φ_d(p_t, p_0)), where d_t^T φ_d(p_t, p_0) is the deformation cost. µ_t and ν_t are defined based on the neighborhoods of the labeled positive hypotheses. We set µ_t = min_{γ∈X}(s_t(γ) − max(s_t(N(γ)))) and ν_t = min_{γ∈X} s_t(N(γ)). Although it would be better to learn the thresholds from a separate validation set, we find that learning them from the training set is good enough in experiments.

We note that in experiments, semi-negative pruning mainly appears in early stages, semi-positive pruning mainly appears in later stages, and the traditional "zero order" pruning appears in all stages. A comparison between the cascade [10] and the proposed neighborhood aware cascade in terms of the number of pruned parts at each stage is shown in Fig. 2. We also tried neighborhood aware pruning in the first stage for the root (instead of the dense correlation in line 1 of Alg. 2), but found that it is not as efficient as dense low rank correlation.

6. LUT HOG

In this part we show how to dramatically reduce the computation cost while generating exactly the same HOG feature.

The HOG feature map is constructed on each scale independently by resizing the input image. For each scale, the pixel-wise feature map, spatial aggregation and normalization are computed in sequence.
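The pixel-wise step can be made table-driven because the inputs are 8-bit. A minimal sketch of precomputed orientation and magnitude tables follows; the binning into 18 contrast-sensitive orientations is illustrative, not the paper's exact partition, and the table names only loosely mirror the ones in this section.

```python
import numpy as np

# Precompute once at model initialization: for every possible
# (dx, dy) gradient pair of a "uint8" image, dx, dy in [-255, 255],
# store the orientation bin and the gradient magnitude.
grid = np.arange(-255, 256)
DX, DY = np.meshgrid(grid, grid, indexing="ij")

MAG = np.sqrt(DX.astype(np.float64) ** 2 + DY ** 2)                    # 511 x 511
ORI = ((np.arctan2(DY, DX) + np.pi) / (2 * np.pi) * 18).astype(np.int32) % 18

def pixel_features(img):
    """Index the tables instead of computing sqrt/atan2 per pixel."""
    gx = np.zeros(img.shape, dtype=np.int32)
    gy = np.zeros(img.shape, dtype=np.int32)
    gx[:, 1:-1] = img[:, 2:].astype(np.int32) - img[:, :-2]   # central differences
    gy[1:-1, :] = img[2:, :].astype(np.int32) - img[:-2, :]
    return MAG[gx + 255, gy + 255], ORI[gx + 255, gy + 255]
```

Since there are only 511 × 511 possible (dx, dy) pairs, the tables are small, and the per-pixel work reduces to two integer subtractions and two array lookups.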
Figure 2. Average pruned part number at each stage on VOC 2007, comparing the cascade DPM and our neighborhood aware cascade.

In the pixel-wise feature map step, the gradient of each pixel is discretized into different partitions according to its orientation. In the spatial aggregation step, the gradient magnitude of each pixel is added to its corresponding bins in the four cells around it with bilinear interpolation weights. Finally, a normalization step is applied to gain invariance by normalizing the feature vector with the energy of the four surrounding cells. By analyzing a popular and well optimized implementation [12], we find that the first two steps take most of the time. The analysis here is also valid for the implementations in [8, 22].

We use look-up tables (LUT) to accelerate the first two steps of HOG. With the LUT, the runtime computation is replaced with simpler and more efficient array indexing operations. This is based on the fact that the pixels of an image are represented by "uint8" integral numbers. They can only generate a limited number of cases of gradient orientation and magnitude, which can therefore be computed in advance and stored as part of model initialization. LUT is also valid for the computation of the bilinear interpolation weights in the spatial aggregation step, since the number of possible bilinear weights is the HOG bin size.

Take the pixel-level feature map computation as an example. Since pixels are in the range [0, 255], the gradients in the x and y directions take one of 511 integer values in [−255, 255]. We pre-calculate three 511 × 511 look-up tables T1, T2 and T3, where T1, T2 and T3 store the index of the contrast sensitive orientation partition, the index of the contrast insensitive orientation partition, and the magnitude for the possible gradient combinations in the x and y directions, respectively. At runtime, these three values for each pixel are indexed in T1, T2 and T3 instead of being explicitly computed.

The LUT based HOG computation is very simple and easy to implement. Our implementation based on LUT is 6 times faster than the implementation in [11] on the same hardware, which clears up the time bottleneck in computing the HOG feature.

7. Experiments

To evaluate the speed and accuracy of the proposed method, experiments are conducted on the Pascal VOC 2007 object detection task [9]. Due to the special interest in pedestrians and faces in real applications, we also conduct experiments on the challenging Caltech pedestrian detection task [7] and the AFW face detection task [36].

7.1. Pascal VOC 2007

On Pascal VOC 2007, the proposed method is implemented based on DPM release4² [12]. Besides the implementation of DPM release4, we compare accelerated DPM versions, including cascade [10], branch-and-bound [16], coarse-to-fine [22] and FFT [8]. All these methods except coarse-to-fine [22] use the default setting and model in DPM release4, where the number of levels in an octave is 10, the HOG bin size is 8, the part number for each component is 8 and the component number for each category is 6. For coarse-to-fine DPM, the setting advised by the paper [22] is used, where the component number is 4. The average feature extraction time, detection time and full time over the 20 categories are reported in Tab. 1, where the detection time sums the root and parts computation time. For fair comparison, all the codes run on the same PC with a 2.66GHz Intel X5650 CPU, and only one thread is used in reporting Tab. 1. The accuracy on the Pascal VOC 2007 testset (shown in Tab. 2) is measured by average precision (AP) [9].

² We use release4 instead of release5, mainly because most of the compared algorithms are based on release4. Generally speaking, release5 would give slightly higher accuracy with exactly the same speed.

Table 1. Average time (in seconds) on Pascal VOC 2007. Note that 6 components are used for each category and the times are measured with a single thread implementation.

                            Feature Extraction   Detection          Full Time
DPM [12]                    0.46                 11.77              12.23
Branch-Bound (DPM) [16]     0.46                 2.75               3.21
Cascade (DPM) [10]          0.46                 0.99 (0.15+0.84)   1.45
FFT (DPM) [8]               0.48                 0.98               1.46
Coarse-to-fine (DPM) [22]   0.67                 0.99               1.66
Proposed Method             0.07                 0.22 (0.08+0.14)   0.29

Different DPM methods achieve similar accuracy on Pascal VOC. Cascade [10], FFT [8] and coarse-to-fine [22] achieve a similar 10 times acceleration over DPM release4. With the three acceleration techniques proposed in this paper, the proposed method runs 4 times faster than these accelerated DPM methods. Compared with the cascade DPM, the proposed method takes 1/2 of the time in calculating the root, 1/6 in calculating the parts and 1/7 in calculating the HOG feature. The proposed method runs at 3-4 FPS per image for a category with 6 components on Pascal VOC. When parallelization is allowed, e.g. one thread per component, the speed of the proposed method is up to 15 FPS.

One may also be interested in a comparison between a Viola-Jones based detector and the proposed method for object detection. Detectors are trained on Pascal VOC based on the state-of-the-art Viola-Jones style detector ACF [4], with DPM style mixture components.
Table 2. Average-Precision (AP) of different methods on 20 categories of Pascal VOC 2007 testset.
plane bicycle bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mean
DPM [12] 29.2 56.1 9.9 16.5 24.6 45.7 54.9 17.2 21.6 23.1 14.4 10.3 57.6 47.6 41.9 12.3 18.0 28.2 44.2 40.1 30.7
Branch-Bound (DPM) [16] 24.1 56.1 0.0 9.1 22.2 42.1 53.6 9.1 19.2 16.2 9.1 9.1 56.7 46.0 40.0 9.1 9.1 24.5 42.3 37.2 26.7
Cascade (DPM) [10] 27.6 56.2 9.9 16.6 24.7 45.5 55.0 17.3 21.6 22.8 14.4 10.4 57.7 48.0 41.8 12.3 18.1 28.6 44.3 40.1 30.6
FFT (DPM) [8] 30.1 56.2 9.8 15.0 23.7 48.3 54.8 16.4 22.0 22.4 18.1 10.5 56.3 46.4 40.9 12.4 17.7 29.7 42.6 37.2 30.5
Coarse-to-fine (DPM) [22] 27.9 54.8 10.2 16.1 16.2 49.7 48.3 17.5 17.2 26.4 21.4 11.4 55.7 42.2 30.7 11.4 20.9 29.1 41.5 30.0 28.9
Proposed Method 27.1 57.9 9.9 16.1 24.2 45.2 54.1 17.1 20.9 22.7 14.4 10.3 57.1 47.8 41.5 12.2 18.1 27.8 44.2 38.5 30.4
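The per-component parallelization mentioned above ("one thread per component") has a simple structure: score every mixture component independently, then take the per-location maximum. The sketch below is only illustrative — a toy dense correlation stands in for the real root/part scoring, it assumes all components share a filter size, and in CPython a real speedup requires the heavy work to run in native code.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def correlate2d_valid(feat, filt):
    """Toy 'valid' cross-correlation standing in for filter scoring."""
    fh, fw = filt.shape
    H, W = feat.shape[0] - fh + 1, feat.shape[1] - fw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(feat[i:i + fh, j:j + fw] * filt)
    return out

def detect(feature_map, component_filters, threads=6):
    """Score each mixture component in its own thread; the detection
    score at a location is the best component score there."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        maps = list(pool.map(lambda f: correlate2d_valid(feature_map, f),
                             component_filters))
    return np.maximum.reduce(maps)
```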

Although ACF is faster (i.e., 0.12s per image) than the proposed
method, it can only get half the accuracy (i.e., 15.4 mean AP).

7.2. Caltech Pedestrians

The Caltech pedestrian benchmark [7] is one of the most challenging
pedestrian detection tasks, due to large appearance variations
caused by occlusion, pose, deformation and resolution. It is taken
as a testbed to compare the proposed method with other
state-of-the-art pedestrian detection methods. Following the
protocol in [7], set00-set05 are used for training and set06-set10
for testing. The "reasonable" setting of [7] is used to report
performance, where pedestrians above 50 pixels in height in every
30th frame are taken into consideration.

[Figure 3 plot: "Pedestrian Detection ROC on Caltech Reasonable
Set", miss rate vs. false positives per image; legend (mean miss
rate): 95% VJ, 68% HOG, 61% MLS, 61% MultiFtr+CSS, 60% FeatSynth,
57% FPDW, 56% ChnFtrs, 54% CrossTalk, 51% ACF, 48% MultiResC,
48% Roerei, 48% DBN-Mut, 44% ACF-Caltech, 42% ProposedMethod,
41% MT-DPM.]
Figure 3. ROC curve and mean miss rate of leading methods on the
Caltech "reasonable" testset. We only report pure detection methods
(without context) for fair comparison. (Best viewed in color)
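As a side note on the metric in Fig. 3: the "mean miss rate" on Caltech is conventionally the log-average miss rate, i.e. the geometric mean of the miss rate sampled at nine FPPI points evenly spaced in log space over [10⁻², 10⁰]. A minimal sketch of this convention (the function name is illustrative):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi and miss_rate are parallel arrays tracing a detector's
    curve (fppi strictly increasing). Returns the Caltech-style
    log-average miss rate over [1e-2, 1e0]."""
    ref = np.logspace(-2.0, 0.0, num=9)                   # 9 reference FPPI points
    mr = np.interp(np.log(ref), np.log(fppi), miss_rate)  # sample the curve
    # Geometric mean; the floor avoids log(0) for perfect detectors.
    return float(np.exp(np.mean(np.log(np.maximum(mr, 1e-10)))))
```

A curve that is flat at a 42% miss rate yields 0.42, matching how the single legend numbers in Fig. 3 summarize whole curves.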
We report the ROC and mean miss rate of the top methods³ plus
Viola-Jones and HOG in Fig. 3. Since this paper only considers
frame-wise detection, only methods that use neither in-frame nor
between-frame context are compared. The proposed method is on par
with the best performing method, MT-DPM [33], and outperforms the
Viola-Jones style detector ACF [4] by 2%. These three methods
largely outperform the others. We compare the speed of these top
three methods. The number of scales evaluated per octave is 5 and
the number of mixture components is 1, which is good enough for the
pedestrian detection task. In this setting, the proposed method
runs at 10 FPS, while MT-DPM runs at 1.2 FPS with FFT based
acceleration. The well optimized ACF runs at 21 FPS with lower
accuracy. When 6 cores are used for parallelization (mainly for the
HOG feature in this experiment), the speed of the proposed method
is about 40 FPS, which is fast enough for most applications.

³ Details can be found on Dollár's website
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/.

7.3. AFW Faces

The proposed method is also validated on the AFW face detection
task [36]. The dataset contains 205 images with 468 faces in the
wild. The model for the proposed method is trained on the AFLW
dataset [18]. Training faces are split into 6 components based on
the pose annotations provided in [18], with yaw angles in
[0°, 30°), [30°, 60°), [60°, 90°] and their mirrors. Similar to the
configuration for Pascal VOC, 8 parts are used for each component.

The recall-precision curve and average precision (AP) are used to
report performance. The results from [36] and a very recent work
[26] are used for comparison. Note that the TSM (tree structure
model [36]) and the DPM reported in [36] are trained on Multi-PIE,
while the proposed method is trained on more faces in the wild from
AFLW. As shown in Fig. 4, the proposed method obtains a 93.7% AP on
AFW, which is better than Face.com and very close to Google Picasa.
The proposed method is about 100 times faster than TSM [36].
Although accuracy is not the main concern of this paper, the
proposed method is better than TSM [36] by 5% AP. For full yaw pose
face detection in a VGA image, the proposed method runs at 5 FPS on
a single thread and 25 FPS if 6 threads are used. If only frontal
faces are concerned, the proposed method runs at about 11 FPS
(single thread) or 42 FPS (after parallelization), which
approximates the speed of the Viola-Jones detector in OpenCV⁴.
Considering the large performance gain and similar speed, the
proposed method has the potential to replace the Viola-Jones
detector for face detection in the wild.

⁴ We note that Google Picasa has similar time cost when running the
software.

8. Conclusion

In this paper, three novel techniques are proposed to solve the
speed bottleneck of the deformable part model while maintaining its
advantage in accuracy for various detection tasks. The proposed
method runs 4 times faster than the previous fastest DPM method on
Pascal VOC. For pedestrian and face detection, it runs at frame
rate with state-of-
the-art accuracy, i.e. 10 FPS on a single CPU thread and 40 FPS
after parallelization. We expect this work to bring the DPM to real
applications, such as video surveillance and HCI. The techniques
discussed in this paper can also be used to accelerate related
models, such as deep convolutional networks [19, 14], which we
leave as future work.

[Figure 4 plot: "Face Detection Precision Recall Curve on AFW",
precision rate vs. recall rate; legend (average precision): 93.7%
Proposed method, 90.1% Shen-Retrieval, 88.7% TSM-independent, 87.2%
TSM-shared, 85.5% DPM, 75.5% multiHOG, 69.8% Kalal, 2-view
Viola-Jones, Google Picasa, Face.com.]
Figure 4. Face detection precision-recall curves and average
precision on the AFW dataset. The proposed method dramatically
outperforms the methods reported in [36] and Face.com, and is very
close to Google Picasa. (Best viewed in color)

Acknowledgement

We thank the anonymous reviewers for their valuable feedback. This
work was supported by the Chinese National Natural Science
Foundation Projects 61105023, 61103156, 61105037, 61203267 and
61375037, National Science and Technology Support Program Project
2013BAK02B01, Chinese Academy of Sciences Project KGZD-EW-102-2,
and AuthenMetric Research and Development Funds.

References

[1] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 2010.
[2] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 2011.
[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[4] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. PAMI, 2014.
[5] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV. Springer, 2012.
[6] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[7] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
[8] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors. In ECCV. Springer, 2012.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
[10] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR. IEEE, 2010.
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[13] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. TPAMI, 1991.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] R. B. Girshick, H. O. Song, and T. Darrell. Discriminatively activated sparselets. In ICML, 2013.
[16] I. Kokkinos. Rapid deformable object detection using dual-tree branch-and-bound. In NIPS, 2011.
[17] I. Kokkinos. Shufflets: Shared mid-level parts for fast object detection. In ICCV. IEEE, 2013.
[18] M. Köstinger, P. Wohlhart, P. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshops. IEEE, 2011.
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. PAMI, 2009.
[21] R. Manduchi, P. Perona, and D. Shy. Efficient deformable filter banks. TSP, 1998.
[22] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approach for fast deformable object detection. In CVPR. IEEE, 2011.
[23] H. Pirsiavash and D. Ramanan. Steerable part models. In CVPR. IEEE, 2012.
[24] V. Prisacariu and I. Reid. fastHOG - a real-time GPU implementation of HOG. Technical report, Oxford University, 2009.
[25] R. Rigamonti, V. Lepetit, and P. Fua. Learning separable filters. In CVPR. IEEE, 2012.
[26] X. Shen, Z. Lin, J. Brandt, and Y. Wu. Detecting and aligning faces by image retrieval. In CVPR. IEEE, 2013.
[27] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV. Springer, 2012.
[28] P. Sudowe and B. Leibe. Efficient use of geometric constraints for sliding-window object detection in video. In ICVS, 2011.
[29] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010.
[30] P. Viola and M. Jones. Robust real-time face detection. IJCV, 2004.
[31] L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with low-rank SVM. In CVPR. IEEE, 2007.
[32] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multi-pedestrian detection in crowded scenes: A global view. In CVPR. IEEE, 2012.
[33] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic scenes. In CVPR. IEEE, 2013.
[34] J. Yan, X. Zhang, Z. Lei, D. Yi, and S. Z. Li. Structural models for face detection. In FG. IEEE, 2013.
[35] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR. IEEE, 2011.
[36] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR. IEEE, 2012.
