The Fastest Deformable Part Model For Object Detection
overlapping hypotheses to pass through the whole cascade, while only one hypothesis with the highest score is useful for detection. The second is that if one hypothesis has a very low score, its neighbors tend to have low scores as well and can probably skip evaluation. Motivated by the crosstalk cascade [5] for boosting classifiers, this paper proposes a neighborhood aware cascade for DPM to reduce these two kinds of redundancy. Many hypotheses in this cascade can be aggressively pruned according to a first order approximation of their stage scores from their neighborhoods, instead of explicit computation.

Look-up Table HOG. HOG is used in DPM as a low-level representation because it tolerates local transformations. However, the original HOG calculation has a high computational cost, mainly due to the operations that calculate the orientation partition and magnitude. This paper shows that a look-up table (LUT) can replace them with much simpler matrix index operations, based on the fact that there are only finitely many possible values of gradient and orientation.

The rest of the paper is organized as follows. Section 2 reviews the related work. An overall introduction of DPM is presented in Section 3. The discriminative low rank root filter, neighborhood aware cascade and LUT HOG are described in Sections 4, 5 and 6, respectively. We show experiments in Section 7 and conclude the paper in Section 8.

2. Related Work

Acceleration of DPM. This work is most related to approaches that accelerate single category DPM in detection. [10] proposed to convert the star structure to a cascade, which can efficiently prune unpromising hypotheses. [22] proposed a coarse-to-fine approach based on the observation that a model at low resolution can prune a lot of hypotheses with low computational cost. FFT was used to accelerate the correlation in [8]. Motivated by the branch-and-bound approach [20] for object detection, [16] introduced it to DPM with a carefully designed bound. For a category with 6 components, these methods run at about 1 FPS per Pascal VOC image on a single thread, which is an order of magnitude faster than DPM, but still relatively slow for real applications.

Acceleration of Multi-category DPM. Quite a large number of recent works [23, 27, 15, 3, 17] were proposed to accelerate DPM for multi-category detection, e.g. simultaneous detection of 20 categories on Pascal VOC. Steerable part model [23] used a part bank with linear combination to approximate the correlation scores of different categories. Sparselet [27, 15] used a large part bank with sparse linear combination. [15] and [23] both achieve a three times acceleration over the original DPM for 20 category object detection; however, they are slower than the cascade DPM [10], which detects each category independently. Very recently, [3] proposed to use locality-sensitive hashing to approximate the correlation in DPM, with a decline of performance, to detect 100,000 categories on a single workstation.

Acceleration of Pedestrian Detection. Recently, large improvements in efficiency were achieved in the pedestrian detection task [6, 5]. [6] proposed to approximate features at nearby scales for fast computation of multiple channel features. Based on the features and boosting classifier in [6], [5] further proposed the crosstalk cascade by considering the dependence in a neighborhood. [5] is considered to be one of the best detectors in the Viola-Jones framework [30] in terms of speed and accuracy. We extend this idea to a neighborhood aware cascade in DPM.

HOG computation. The widely used HOG implementation in [12] takes about 0.5 second per VGA image on a single thread, which itself slows down DPM. Unfortunately, this step is often directly ignored by recent works on acceleration of the deformable part model. Some recent works [27, 28, 24] accelerated HOG with the computation capacity of the GPU; however, the algorithm itself is not improved. With the help of LUT, the HOG implementation in this paper running on a single CPU thread is as fast as the GPU implementation reported in [24]. The LUT based method can also be applied on a GPU for more acceleration.

3. DPM and Cascade DPM

This part gives a brief review of DPM and cascade DPM, and then analyzes the bottleneck in computation.

The DPM is composed of a root filter $w_0$ and $n$ parts, where the $t$-th part is parameterized by filter $w_t$ and deformation term $d_t$. An object hypothesis $\gamma$ is specified by $\{p_0, p_1, \cdots, p_n\}$, where $p_0$ is the location of the root, and $p_t$ is the location of the $t$-th part. Root and parts are connected by a pictorial structure. The detection score $s(\gamma)$ is defined as:

$$ s(\gamma) = w_0^T \phi_a(p_0, I) + \sum_{t=1}^{n} \bigl( w_t^T \phi_a(p_t, I) - d_t^T \phi_d(p_t, p_0) \bigr), \tag{1} $$

where $\phi_a$ is the HOG feature for appearance, and $\phi_d$ is a separable quadratic function for deformation. Mixture components can be naturally added to represent objects in different poses, but we leave them out to simplify the notation.

For a hypothesis $\gamma$ in detection, only the root location $p_0$ is known, while the part location $p_t$ is inferred by maximizing the part appearance score minus the deformation cost associated with displacement:

$$ p_t = \arg\max_{p} \; w_t^T \phi_a(p, I) - d_t^T \phi_d(p, p_0), \tag{2} $$

where $p$ traverses possible locations of the part. Since parts are directly attached to the root, their locations are inferred independently for a fixed root. It has been found in previous works [10, 22, 8] that in DPM most of the time is spent on calculating the appearance term due to the high dimension.
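The scoring rule in Eq. (1) and the part inference in Eq. (2) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes the appearance responses $w_t^T \phi_a(p, I)$ have already been computed as 2D maps, that $\phi_d$ takes the common form $(dx, dy, dx^2, dy^2)$ of the displacement from an anchored root position, and it searches part locations by brute force. The function names and the `anchor` parameter are hypothetical.

```python
import numpy as np

def deformation_features(p, p0, anchor):
    """Assumed form of phi_d(p, p0): separable quadratic in the displacement
    of part location p from the root location p0 shifted by the part anchor."""
    dx = p[0] - (p0[0] + anchor[0])
    dy = p[1] - (p0[1] + anchor[1])
    return np.array([dx, dy, dx * dx, dy * dy], dtype=float)

def best_part_location(part_response, d_t, p0, anchor):
    """Eq. (2): brute-force argmax of appearance score minus deformation cost."""
    best_score, best_p = -np.inf, None
    for x in range(part_response.shape[0]):
        for y in range(part_response.shape[1]):
            cost = d_t @ deformation_features((x, y), p0, anchor)
            score = part_response[x, y] - cost
            if score > best_score:
                best_score, best_p = score, (x, y)
    return best_p, best_score

def detection_score(root_response, part_responses, deform_weights, anchors, p0):
    """Eq. (1): root appearance score plus the best score of each part.
    Parts are inferred independently because each attaches only to the root."""
    score = root_response[p0]
    locations = []
    for resp, d_t, anchor in zip(part_responses, deform_weights, anchors):
        p_t, s_t = best_part_location(resp, d_t, p0, anchor)
        score += s_t
        locations.append(p_t)
    return score, locations
```

The double loop makes the cost of evaluating every part at every location explicit, which is the redundancy that cascade pruning aims to avoid.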
[Figure: Time Cost of Cascade DPM and Proposed Method. Recovered row: Cascade DPM 0.46, 0.15, 0.84]

proved that:

$$ K \circ F = K \circ \sum_{i=1}^{r} \sigma_i u_i v_i^T = \sum_{i=1}^{r} \sigma_i \bigl( (F \circ u_i) \circ v_i^T \bigr), \tag{4} $$
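The idea behind Eq. (4), that correlation with a rank-$r$ filter splits into $r$ pairs of one-dimensional correlations, can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: it treats the decomposed matrix as a single-channel 2D filter, reads $\circ$ as "valid" cross-correlation, and uses hypothetical function names. Real HOG features are multi-channel, which this sketch ignores.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(feature, kernel):
    """Plain 2D cross-correlation over all positions where the kernel fits."""
    windows = sliding_window_view(feature, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def separable_correlate(feature, kernel, rank):
    """Correlate using the top `rank` SVD terms of the kernel.
    Each term sigma_i * u_i v_i^T is applied as a column pass with u_i
    followed by a row pass with v_i^T, as in the right-hand side of Eq. (4)."""
    U, S, Vt = np.linalg.svd(kernel, full_matrices=False)
    out = 0.0
    for i in range(rank):
        column_pass = correlate_valid(feature, U[:, i][:, None])
        out = out + S[i] * correlate_valid(column_pass, Vt[i][None, :])
    return out
```

When `rank` equals the true rank of the kernel the two functions agree up to floating-point error; a smaller `rank` gives an approximation whose cost grows linearly with the number of retained terms, which is why a low rank filter is cheaper to evaluate.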
can only get half the accuracy (i.e., 15.4 mean AP).

7.2. Caltech Pedestrians

challenging pedestrian detection task due to large appearance variations in occlusion, pose, deformation and resolution. It is taken as a testbed to compare proposed method with other state-of-the-art methods for pedestrian detection. Following the protocol in [7], set00-set05 are used to train model and set06-10 are used for test. The “reasonable” set-

[Figure: miss rate versus false positives per image on Caltech Pedestrians (log-scale axes). Legend: 95% VJ, 61% MLS, 61% MultiFtr+CSS, 60% FeatSynth, 57% FPDW, 56% ChnFtrs, 54% CrossTalk, 51% ACF, 48% MultiResC, 48% Roerei, 48% DBN-Mut, 44% ACF-Caltech, 42% ProposedMethod, 41% MT-DPM]
caltech.edu/Image_Datasets/CaltechPedestrians/. software.
[Figure: Face Detection Precision Recall Curve on AFW. Axes: Precision Rate versus Recall Rate. Legend: 85.5% DPM, Google Picasa, Face.com, 88.7% TSM-independent, 87.2% TSM-shared, 75.5% multiHOG, 69.8% Kalal, 2-view Viola-Jones, 90.1% Shen-Retrieval, 93.7% Proposed method]

Figure 4. Face detection precision-recall and average precision on the AFW dataset. The proposed method dramatically outperforms the methods reported in [36] and Face.com, and is very close to Google Picasa. (Best viewed in color)

the-art accuracy, i.e. 10 FPS on a single CPU thread and 40 FPS after parallelization. We expect this work can extend the DPM to real applications, such as video surveillance and HCI. Techniques discussed in this paper can also be used to accelerate related models, such as deep convolutional networks [19, 14], which is left as future work.

Acknowledgement

We thank the anonymous reviewers for their valuable feedback. This work was supported by the Chinese National Natural Science Foundation Projects 61105023, 61103156, 61105037, 61203267, 61375037, National Science and Technology Support Program Project 2013BAK02B01, Chinese Academy of Sciences Project KGZD-EW-102-2 and AuthenMetric Research and Development Funds.

References

[1] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 2010. 4
[2] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 2011. 3
[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013. 2
[4] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. PAMI, 2014. 6, 7
[5] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV. Springer, 2012. 2, 5
[6] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. BMVC 2010, 2010. 2
[7] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012. 6, 7
[8] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors. In ECCV. Springer, 2012. 1, 2, 6, 7
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010. 1, 6
[10] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR. IEEE, 2010. 1, 2, 3, 4, 5, 6, 7
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010. 1, 3, 6
[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/. 2, 6, 7
[13] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. TPAMI, 1991. 1
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 8
[15] R. B. Girshick, H. O. Song, and T. Darrell. Discriminatively activated sparselets. In ICML, 2013. 2
[16] I. Kokkinos. Rapid deformable object detection using dual-tree branch-and-bound. In NIPS, 2011. 1, 2, 6, 7
[17] I. Kokkinos. Shufflets: Shared mid-level parts for fast object detection. In ICCV. IEEE, 2013. 2
[18] M. Kostinger, P. Wohlhart, P. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshops. IEEE, 2011. 7
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 8
[20] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. PAMI, 2009. 2
[21] R. Manduchi, P. Perona, and D. Shy. Efficient deformable filter banks. TSP, 1998. 1
[22] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approach for fast deformable object detection. In CVPR. IEEE, 2011. 1, 2, 6, 7
[23] H. Pirsiavash and D. Ramanan. Steerable part models. In CVPR. IEEE, 2012. 2
[24] V. Prisacariu and I. Reid. fasthog - a real-time gpu implementation of hog. Technical report, Oxford University, 2009. 2
[25] R. Rigamonti, V. Lepetit, and P. Fua. Learning separable filters. In CVPR. IEEE, 2012. 1
[26] X. Shen, Z. Lin, J. Brandt, and W. Ying. Detecting and aligning faces by image retrieval. In CVPR. IEEE, 2013. 7
[27] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV. Springer, 2012. 2
[28] P. Sudowe and B. Leibe. Efficient Use of Geometric Constraints for Sliding-Window Object Detection in Video. In ICVS’11, 2011. 2
[29] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010. 4
[30] P. Viola and M. Jones. Robust real-time face detection. IJCV, 2004. 2
[31] L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with low-rank svm. In CVPR. IEEE, 2007. 1
[32] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multi-pedestrian detection in crowded scenes: A global view. In CVPR. IEEE, 2012. 1
[33] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic scenes. In CVPR. IEEE, 2013. 1, 7
[34] J. Yan, X. Zhang, Z. Lei, D. Yi, and S. Z. Li. Structural models for face detection. In FG. IEEE, 2013. 1
[35] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR. IEEE, 2011. 1
[36] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR. IEEE, 2012. 1, 6, 7, 8