
Pedestrian detection at 100 frames per second

Rodrigo Benenson, Markus Mathias, Radu Timofte and Luc Van Gool
ESAT-PSI-VISICS/IBBT, Katholieke Universiteit Leuven, Belgium
[email protected]

Abstract

We present a new pedestrian detector that improves both in speed and quality over the state-of-the-art. By efficiently handling different scales and transferring computation from test time to training time, detection speed is improved. When processing monocular images, our system provides high quality detections at 50 fps.

We also propose a new method for exploiting geometric context extracted from stereo images. On a single CPU+GPU desktop machine, we reach 135 fps, when processing street scenes, from rectified input to detections output.

Figure 1: Example result on the Bahnhof sequence. The green line indicates the stixels bottom, the blue line the stixels top, and the red boxes are the obtained detections.

1. Introduction

Visual object detection is under constant pressure to increase both its quality and speed. Such progress allows for new applications. A higher speed enables its inclusion into larger systems with extensive subsequent processing (e.g. as an initialization for segmentation or tracking), and its deployment in computationally constrained scenarios (e.g. embedded systems, large-scale data processing).

In this paper we focus on improving the speed of pedestrian (walking persons) detection, while providing state-of-the-art detection quality. We present two new algorithmic speed-ups, one based on better handling of scales (on monocular images), and one based on better exploiting the depth information (on stereo images). Altogether we obtain speed-ups by a factor ∼ 20, without suffering a loss in detection quality. To the best of our knowledge, this is the first time that pedestrian detection at 100 fps (frames per second) has been reached with such high detection quality.

1.1. Related work

Providing an exhaustive overview of previous fast object detection work is beyond the scope of this paper. Yet, most of the work on improving detection speed (without trading off quality) exploits one or more of the following ideas:

Better features Having cheap-to-compute features that best capture the input image information is crucial for fast and good detections. Viola and Jones popularized the use of integral images to quickly compute rectangular averages [18] (see the sketch after this list). Later, Dalal and Triggs popularized the idea that gradient orientation bins capture relevant information for detections [4]. In the same vein, bag-of-words over dense SIFT features has been used [12]. It has also been shown multiple times that exploiting depth and motion cues further improves the detection quality [5, 2, 11], but so far usually at the cost and not the benefit of speed.

Better classifier For a given set of features, the choice of classifier has a substantial impact on the resulting speed and quality, often requiring a trade-off between these two. Non-linear classifiers (e.g. RBF SVMs) provide the best quality, but suffer from low speed. As a result, linear classifiers such as Adaboost, linear SVMs, or Random/Hough Forests are more commonly used. Recent work on the linear approximation of non-linear kernels seems a promising direction [16].
Better prior knowledge In general, image processing greatly benefits from prior knowledge. For pedestrian detection the presence of a single dominant ground plane has often been used as prior knowledge to improve both speed and quality [9, 14, 17, 3].

Cascades A frequently used method for speeding up classifiers is to split them up into a sequence of simpler classifiers. By having the first stages prune most of the false positives, the average computation time is significantly reduced [18, 20, 10].

Branch and bound Instead of accelerating the evaluation of individual hypotheses, branch-and-bound strategies attempt to reduce their number [12, 13]. By prioritizing the most promising options, and discarding non-promising ones, the speed is increased without a significant quality loss.

Coarse-to-fine search is another popular method to arrive at fast object detection [15]. It can be considered a specific case of cascades, where the first stages take decisions on the basis of a coarser resolution. Coarse-to-fine approaches trade off quality for speed. If the coarse resolution were good enough, then all computations could be done at this level; otherwise we are guaranteed to incur a quality loss.
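To make the integral-image idea concrete, the following NumPy sketch (our illustration, not code from the paper; the function names are ours) shows how one cumulative-sum pass turns any rectangular sum into four array lookups:

import numpy as np

def integral_image(channel):
    # A zero row/column is prepended so rectangle sums need no
    # boundary checks.
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of channel[y:y+h, x:x+w] in O(1), independent of the area.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

channel = np.random.rand(480, 640)
ii = integral_image(channel)
assert np.isclose(rect_sum(ii, 10, 20, 32, 64), channel[20:84, 10:42].sum())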
1.2. Contributions

Our main result is a novel object detector, which we here demonstrate to provide high quality pedestrian detection at 135 fps. This detector is 20 times faster than previous results of equal quality [6, 8], and has only half as many false positives on the INRIA dataset compared to previous results obtained at equal speed [17]. This result is based on two core contributions.

Object detection without image resizing Common detection methods require resizing the input image and computing features multiple times. Viola and Jones [18] showed the benefits of "scaling the features not the images"; however, such an approach cannot be applied directly to HOG-like features because of blurring effects. To the best of our knowledge, we present the first detector based on orientation gradients that requires no image resizing (see section 3). This can be seen as a "better classifier". This improvement provides a ∼ 3.5 times algorithmic speed-up on the features computation stage. We show state-of-the-art (monocular) detections running at 50 Hz on GPU; this is 10 times faster than previous HOG+SVM GPU results and two times better quality-wise (see section 6).

Object detection using stixels Depth information is known to be a strong cue for detections; however, depth maps are too slow to compute. For the first time, we show that a recently introduced fast depth estimation method (the "stixel world", see figure 1) [3] can be used to accelerate object detection in practice (see section 5). Using the stixel world ("better prior knowledge"), the detection search space is reduced by a factor 44 (factor 5 with respect to using ground plane constraints only), enabling a significant speed-up in practice. We obtain (stereo-based) detections at 135 Hz without quality loss, while running on a single CPU+GPU desktop machine. The same algorithm runs at 80 Hz on a high-end laptop (see section 6).

In the next section, we succinctly describe our baseline detector and proceed to describe our contributions in sections 3 and 5. Section 6 discusses our experiments, and spells out the effects on the speed and quality of pedestrian detection that the different novelties bring. Section 7 concludes the paper with some general considerations.
2. Baseline detector

Our baseline detector is the ChnFtrs detector of Dollar et al. [7]. Based on the exhaustive evaluations presented in [8], this detector yields state-of-the-art results, on par with the popular part-based detectors [10]. Since then, we are aware of only two quality improvements over this baseline. The release of LatSvmV4 [10] seems to present significant improvements over LatSvmV2. Also, Park et al. [14] presented a multi-scales detector that evaluates multiple detectors (including part-based models) at multiple scales to improve results. We are comfortable claiming that ChnFtrs provides state-of-the-art results for single part templates (significantly outperforming HOG+SVM [4]), and is competitive with the initial versions of part-based detectors. Our own detector will improve on top of ChnFtrs.

The ChnFtrs detector is based on the idea of "Integral Channel Features", which are simple rectangular features that sum a filter response over a given image area. For pedestrian detection it was shown that using 6 quantized orientations, 1 gradient magnitude and 3 LUV color channels is enough to obtain state-of-the-art results (see figure 5, upper row). On top of these rectangular features a set of level-two decision trees (three stump classifiers per tree) is constructed and then linearly weighted to obtain a strong classifier. The set of decision trees and their weights is learned via discrete Adaboost.

Unless specified otherwise, we use the same setup as used for the best results in the original paper [7]. The strong classifier consists of 2000 weak classifiers, and the features are selected from a random pool of 30000 rectangles. The training starts with a set of 5000 random negative samples and then bootstraps twice, each time adding 5000 additional hard negative samples. The classifier is trained and evaluated using the INRIA pedestrians dataset. For faster training and evaluation we also shrink the features by a factor 4 (after computing the feature responses, before creating the integral image), as described in [7, addendum].

The results presented here correspond to a complete re-implementation of the original method. Details of the feature computation and the bootstrapping procedure turned out to have a significant impact on the final results. An all too naive implementation may lose up to about 10% in performance when compared to the reported results.

We produced two implementations of the ChnFtrs detector, a CPU version and a compatible GPU version. Evaluated over 640 × 480 pixel images (evaluating all shrunk pixels, all 55 scales, and without using a cascade), the CPU version runs at about 0.08 Hz on an 8-core Intel Core i7 870 machine; the GPU version runs roughly 15 times faster, at 1.38 Hz on an Nvidia GeForce GTX 470. At test time the GPU code spends roughly half of the time resizing the images and computing the integral channels, and half of the time computing the feature responses and detection scores.

Another relevant feature of this detector is that the training is fairly fast. In our implementation the full training from raw training dataset to final classifier (including the bootstrapping stages) takes about three hours, on a single CPU+GPU machine. Importantly, the training time and memory consumption stay stable even if the learned model has larger dimensions; this is an enabler for the approach described in section 3.

Comparison with HOG+SVM At first glance it may be surprising that such a simple classifier is able to compete with sophisticated approaches such as HOG part-based models [10]. A key difference is the use of learned features versus hand-designed features. Whereas Dalal and Triggs [4] chose to place the HOG cells uniformly, the ChnFtrs detector instead learns where to place the features so as to maximize its discriminating power.
3. Improving multi-scale handling

Ideally, a class-specific object detector yields the correct number of object instances, as well as their positions and scales. A naive approach would create a classifier for each position and scale, and make them compete against each other. The strongest responses would then be selected. Since responses overlap, some non-maximum suppression should determine the number of instances. This is an abstract description of the most commonly used object detector architecture (the sliding-windows type).

Due to the pixel discretization, the object appearance at different scales changes by more than just scaling. At small scales objects appear "blurry", while bigger scales provide more detailed images. Assuming that object appearance is invariant to translations in the image, to implement the naive approach we should train as many models as there are scales, see figure 2a. The number of scales N is usually in the order of ∼ 50.

Training 50 models seems like a daunting task. The traditional approach for object detection at multiple scales (used by [7, 10]) is to train a single model for one canonical scale, and then rescale the image N times, see figure 2b. A detection with the canonical model scale on a resized image becomes equivalent to a detection at a different scale.

This traditional approach has been shown to be effective, but nonetheless it poses two problems. First, training a canonical scale is delicate, as one needs to find the optimal size and learn a model that will trade off between the rich high resolution scales and the blurry low resolution scales. Secondly, at run-time one needs to resize the input image 50 times, and recompute the image features 50 times too. In the rest of the section we will explain how to sidestep these issues, without having to naively train N models.

3.1. Approximating nearby scales

Recently, Dollar et al. [6] proposed a new approach for fast pedestrian detection, named FPDW. Instead of rescaling the input image N times, they propose to rescale it only N/K times, see figure 2c. Each rescaled image is used to compute the image features, and these image features are then in turn used to approximate the feature responses at the remaining N − N/K scales. By reducing the number of image resizings and feature computations by a factor K (∼ 10), the total detection time is significantly reduced.

The core insight of the FPDW approach is that the feature responses of nearby scales can be approximated accurately enough (up to half an octave). This empirical approximation can be described as follows (see [6] for details),

    r(s) = \begin{cases} a_u \cdot s^{b_u} & \text{if } s > 1 \\ a_d \cdot s^{b_d} & \text{otherwise} \end{cases}    (1)

where s is the scaling factor (the new height of the detection window is equal to the old height times s), r(s) is the ratio between a feature response at scale 1 versus scale s, and a_u, b_u, a_d, b_d are parameters empirically estimated for the up-scaling and down-scaling cases.

In our implementation we use a_u = 1, b_u = 0, a_d = 0.89, b_d = 1.586 for the orientation channels, and a_u = a_d = 1, b_u = b_d = 2 for the LUV color channels, following the empirical evaluation from [6].

3.2. Object detection without image resizing

The core idea of our paper is to move the resizing of the image from test time to training time. To do so we will use the insight of the FPDW detector and reverse it. Since we can approximate the feature responses across scales, we can decide how to adjust a given stump classifier to classify correctly, as if the feature response had been computed at a different scale.

The strong classifier is built from a set of decision trees, with each decision tree containing three stump classifiers. Each stump classifier is defined by a channel index, a rectangle over such a channel, and a decision threshold τ. When rescaling a stump by a relative scale factor s, we keep the channel index constant, scale the rectangle by s, and update the threshold as τ′ = τ · r(s).
Figure 2: Different approaches to detecting pedestrians at multiple scales. (a) Naive approach: N models, 1 image scale. (b) Traditional approach: 1 model, N image scales. (c) FPDW approach: 1 model, N/K image scales. (d) Our approach: N/K models, 1 image scale.

We can now take a canonical classifier, and convert it into K classifiers for slightly different scales. Based on this, we then proceed to train N/K (∼ 5) classifiers, one for each octave (scale 0.5, 1, 2, etc.), see figures 2d and 5. Given that training our baseline detector takes three hours from beginning to end on a single desktop computer, we can easily train our five classifiers in a few hours.

At test time we use the described approximation to transform our N/K classifiers into N classifiers (one per scale), we compute the integral channel features on the original input image, and then compute the response for each scale using the N classifiers. The proposed approach effectively enables the use of the naive approach initially described (figure 2a). We name our "no image rescaling approach" the VeryFast detector.

Algorithmic speed-up Being able to skip the effort of computing the features multiple times is clearly interesting speed-wise. Assuming half of the time is spent computing features for 50 scales and half of the time evaluating classifier responses, computing features only once would provide at best a speed-up of 1.9 times. Compared to FPDW, assuming canonical scales 0.5, 1, 2 and 4, avoiding image resizing reduces the features computation time by a factor 3.75. Using VeryFast instead of FPDW then provides a theoretical 1.57× speed-up.
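The arithmetic behind these factors is essentially Amdahl's law; under the fifty-fifty assumption above (our rendering, in LaTeX):

    \text{speed-up}(k) = \frac{F + S}{F/k + S} \;\stackrel{F=S}{=}\; \frac{2}{1/k + 1}

where F and S are the time fractions spent on features and on scores, and k is the reduction factor of the feature stage: k = 50 gives ≈ 1.96 (the "at best" bound is 2), and k = 3.75 gives ≈ 1.58, close to the ∼ 1.9× and 1.57× figures quoted above.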
Measured speed-up In our GPU code the VeryFast method is twice as fast as the ChnFtrs detector (2.68 Hz versus 1.38 Hz), while at the same time showing a slight quality improvement (results presented in section 6). As expected, our VeryFast detector is also faster than FPDW (2.68 Hz versus 1.55 Hz).

After modifying the handling of scales, the GPU code now spends only 5% of the time computing features, and the remaining 95% is solely dedicated to computing the feature responses at different scales and positions. Even more, the code no longer needs to alternate between computing detection scores and computing the features; having a more streamlined execution path has a significant impact in practice. This creates ideal conditions to further speed up our detector, as described in section 4.

3.3. Training the multi-scale classifier

To train the N/K classifiers we rescale the positive training images to fit the desired object size. All large pedestrians can be converted into small training examples; however, when rescaling small examples into large sizes, blurring artefacts appear. The INRIA Persons dataset contains only few examples (< 600) of pedestrians taller than 256 pixels, so training the larger scales using only appropriate example sizes risks leading to poor classifiers for these larger scales.

In the experiments presented here we rescaled all examples to all scales, without taking any measure to lessen the negative effect of blurred examples. We acknowledge that a better handling of scales during training will certainly lead to further improved quality. However, we focus on speed more than quality.

Another issue to handle during training is calibrating the different detectors amongst themselves. Here, again, we take a simplistic approach that leaves room for improvement. We simply normalize the maximum possible score of all detectors to 1.

4. Soft cascade design

Up to now we have discussed speed(-up) results when using all of the 2000 stages of our base Adaboost classifier. Dollar et al. suggested to use a soft-cascade to accelerate the detections. The soft-cascade aborts the evaluation of non-promising detections if the score at a given stage drops below a learned threshold. The suggested method [19] sets each stage threshold at the minimal score of all accepted detections on a training or validation set.

In our experiments, building such a cascade over the training set leads to over-fitting of the thresholds and poor detections at test time. Instead we first adjusted quality results without using the soft-cascade, and then tuned the soft-cascade to keep the exact same quality results while providing the desired speed-up. In practice, we use the INRIA test set as a validation set to adjust the soft-cascade that will be used to accelerate the results on the Bahnhof dataset (see section 6).
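A minimal sketch of such a soft-cascade evaluation (ours, reusing the hypothetical tree_response helper from section 2's sketch):

def cascaded_score(trees, stage_thresholds, channel_integrals, wx, wy):
    # Accumulate the Adaboost score stage by stage; abort as soon as
    # the running score drops below the per-stage threshold, returning
    # None for a rejected window.
    score = 0.0
    for tree, threshold in zip(trees, stage_thresholds):
        score += tree_response(tree, channel_integrals, wx, wy)
        if score < threshold:
            return None
    return score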
After using the INRIA test set, adding a small offset (10% of the lowest threshold) allowed us to make the cascade run as desired (higher speed, same quality). In the VeryFast method, each model has its own specific soft-cascade thresholds.

Algorithmic speed-up The speed-up of a soft-cascade is content dependent and hard to predict; however, a ten times speed-up is expected. The soft-cascade should equally benefit the detection scores stage of ChnFtrs, FPDW, and VeryFast. Since the latter spends a larger portion of the time computing scores (and a lower portion computing features), we expect it to benefit the most from the soft-cascade.

Measured speed-up When using the cascade our VeryFast implementation has a 20× speed gain, reaching 50 Hz (see section 6 for speed evaluation details). In comparison, ChnFtrs and FPDW barely reach a 5× speed gain (∼ 10 Hz). This is mainly due to the need to alternate between features computation and detection scores, which significantly hinders the GPU speed. Even if this was not a factor, VeryFast would still be faster, since it requires less computation by design (as seen in section 3).
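The threshold calibration described at the start of this section can be sketched as follows (our reading of the procedure; the sign convention of the 10% offset is our assumption):

import numpy as np

def calibrate_soft_cascade(stage_scores, offset_fraction=0.10):
    # stage_scores: (num_accepted_detections, num_stages) array of the
    # cumulative classifier score of each accepted validation detection
    # after every stage. Per [19], each stage threshold is the minimum
    # cumulative score over accepted detections; following the text
    # above, a small offset (10% of the lowest threshold, taken in
    # magnitude here) is then added.
    thresholds = stage_scores.min(axis=0)
    return thresholds + offset_fraction * abs(thresholds.min())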
5. Exploiting geometry

Previous work has shown that using scene geometry as a prior for object detection can improve both the quality (by re-weighting the detection scores) [9, 14] and the speed (by reducing the detection search space) [17, 3].

Common approaches based on processing dense stereo depth maps are a no-go, since producing depth maps at 100 Hz is a challenge in itself. Instead we follow the approach of Benenson et al. [3], where objects above the ground are modelled using the so-called "stixel world model" (stixel ≈ sticks above the ground in the image) [1]. For each column in the image, the bottom pixel, top pixel and distance to the (unclassified) object are estimated (see figure 1). The key feature of this approach is that the stixel world model can be estimated directly from the stereo images quickly, without having to compute the full depth map.

In our implementation we are able to estimate the ground plane and the stixels at about 135 Hz, using only a CPU, 80 disparities, and a fixed stixel height of 1.75 m.

Although Benenson et al. [3] presented the idea of coupling a detector with stixels, they did not realize such a coupling. Showing the actual speed impact of tightly coupling stixel estimation and object detection is a contribution of this paper.

In section 6.2 we compare the unconstrained detections using our detector, those constrained by a ground plane estimated for each stereo frame, and those using stixels to constrain the detections.

When using a ground plane estimate, the detections are required to touch the ground plane with their bottom within a margin (e.g. ±30 pixels). When using stixels, they are required to fit the bottom of the stixel crossing the detection centre (up to the same margin as in the ground plane case). We also limit the scales of detection to a small range around the scale indicated by the central stixel distance (e.g. ±5 scales, when the scale step is 1.05).

Algorithmic speed-up When processing a 640 × 480 pixels image over 55 scales, using stixels with ±30 pixels and ±5 scales provides a 44× reduction in search space (since we only consider 640 × 60 pixels · 10 scales). In comparison, using ground plane constraints only provides an 8× reduction in search space (since we only consider 640 × 60 pixels · 55 scales). We show in section 6.2 that these parameter values provide no relevant degradation in the detection quality.

Measured speed-up Unconstrained detections run at 50 Hz (see section 4); however, we may still want faster detections when additional computations are done in the real-time system. When using the ground plane we reach 100 Hz (the ground plane computation itself runs at 300 Hz on CPU). When using stixels, the detections run at 145 Hz on GPU, but the stixel estimation itself runs at 135 Hz on CPU, making detections CPU bound at 135 fps.

For both ground plane and stixel constraints the speed-ups obtained (see table 1) are lower than the candidate window search space reduction, because the discarded areas are also those where the soft-cascade (section 4) is most effective.

The details of the speed evaluation and the detection quality are presented in section 6.
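Spelling out the search-space reduction factors quoted in the Algorithmic speed-up paragraph above (our arithmetic):

    \frac{640 \times 480 \times 55}{640 \times 60 \times 10} = \frac{480}{60} \cdot \frac{55}{10} = 44, \qquad \frac{640 \times 480 \times 55}{640 \times 60 \times 55} = \frac{480}{60} = 8,

which also yields the quoted factor ∼ 5 of stixels over ground-plane-only constraints (44/8 = 5.5).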
6. Pedestrian detection at 100 fps

In the past sections we have presented the evolution from a baseline CPU detector running at 0.08 Hz up to GPU detections at 135 Hz. We summarize this evolution in table 1.

Speed-wise, our monocular results at 50 Hz are more than 7 times faster than the reported 6.5 Hz on CPU from Dollar et al. [8]. Also, our result is 10 times faster than the cudaHOG results reported from the GPU implementation in [17], and at the same time the quality is twice as good as cudaHOG (see section 6.1).

Detector aspect                  Relative speed   Absolute speed
Baseline detector (§2)                1×               1.38 Hz
+Single scale detector (§3)           2×               2.68 Hz
+Soft-cascade (§4)                   20×                 50 Hz
+Estimated ground plane (§5)          2×                100 Hz
+Estimated stixels (§5)            1.35×                135 Hz
Our monocular detector                -                  50 Hz
Our stereo (stixels) detector         -                 135 Hz

Table 1: Speed-up of each aspect of the proposed detector; each relative factor is with respect to the previous row.
Speed measurement We measure the time taken by the CPU+GPU starting when the rectified stereo images are available both to the CPU and the GPU. The measured time includes all CPU computations, GPU computations and the time to download the GPU results and run the non-maximum suppression on the CPU. The ground plane and stixels are estimated at frame t−1 and fed to the GPU computations at frame t. All speed results are given when computing over the Bahnhof images (640 × 480 pixels) over 55 scales (unless otherwise specified), averaged over the 1000 frames of the sequence.

As previously indicated, our desktop computer is equipped with an Intel Core i7 870 and an Nvidia GeForce GTX 470. Our fastest result, VeryFast+stixels, is CPU bound (the GPU runs at 145 Hz, the CPU at 135 Hz); however, the current CPU stixels code is sub-optimal and we believe it should be amenable to further speed-up (to match the GPU speed).

When running on a high-end laptop (Intel Core i7-2630QM @ 2.00GHz, Nvidia GeForce GTX 560M), we reach 20 Hz for VeryFast, 38 Hz for VeryFast+ground plane, and 80 Hz for VeryFast+stixels.

Figure 3: Results on the INRIA persons dataset (miss rate versus false positives per image). (a) Quality of our detector variants (and reference detectors): HOG (23.1%), Ours-FPDW (12.6%), FPDW (9.3%), ChnFtrs (8.7%), Ours-ChnFtrs (8.7%), Ours-MultipleScales (6.8%), Ours-VeryFast (6.8%). (b) Comparison with other methods: VJ (47.5%), HOG (23.1%), LatSvm-V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%), Ours-VeryFast (6.8%).
6.1. INRIA Persons dataset results

We use the INRIA dataset to train our detector and to evaluate its quality. Although this dataset is rather small, the diversity of its content helps to highlight the differences in performance of various methods. As a matter of fact, the relative ordering of methods seems roughly preserved across different pedestrian datasets [8].

In figure 3a we present the results of the different detector variants discussed in section 3. We also evaluate using the N/K detectors while still rescaling the input image to compute the feature responses at different scales (i.e. we do not use the FPDW approximation); this variant is named the MultipleScales detector.

Figure 3b compares our detector with other state-of-the-art methods. Our detector is competitive in terms of detection quality with respect to ChnFtrs and provides a significant improvement over HOG+SVM.

6.2. Bahnhof sequence results

The Bahnhof sequence is a challenging stereo sequence, acquired from a stroller moving along a crowded side-walk. This sequence allows us to evaluate the benefits of using stereo information and its impact on detection quality. We use the PASCAL VOC evaluation criterion.

The evaluation from Dollar et al. [8] showed that on this sequence the gap between different methods is significantly reduced (due to the low intra-variance of the dataset). On this sequence we expect ChnFtrs to be only marginally better than HOG+SVM from Dalal and Triggs [4].

In figure 4 we present the results obtained with the methods described in section 5. We observe that the quality of our detector stays roughly constant when using ground plane and stixels, despite the 2.7× speed-up and reaching 135 fps. Equally important, we show that our VeryFast detector is above HOG+SVM, confirming the expected quality gain.
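For reference, the PASCAL VOC criterion counts a detection as correct when its intersection-over-union with a ground-truth box exceeds 0.5; a minimal sketch (ours, not the paper's code):

def iou(a, b):
    # Boxes as (x0, y0, x1, y1); intersection area over union area.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def is_correct(detection, ground_truth, threshold=0.5):
    return iou(detection, ground_truth) > threshold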
Figure 4: Results on the Bahnhof stereo sequence.

In [3] the authors presented higher quality detections when using stixels than when using the ground plane. We do not observe this in figure 4 (ground plane and stixels have overlapping quality). This can be explained by a few factors. First, our detector makes different kinds of errors than HOG+SVM, which changes the stixel-related gain. Second, we use the stixels to limit the scale search (to gain more speed), while [3] did not consider this in their black box. Thirdly, and more importantly, to reach our desired speed we use the stixels from frame t − 1 to guide the detections at frame t; this has a slight negative impact on quality. All factors together, using stixels provides a pure speed gain, with no noticeable quality loss.

7. Conclusion

We presented a novel pedestrian detector running at 135 fps on one CPU+GPU enabled desktop computer. The core novelties of our approach are reverting the FPDW detector of Dollar et al. [6] in order to avoid resizing the input image at multiple scales, and using a recent method to quickly access geometric information from stereo [3].

Our approach tallies with the Viola and Jones idea of "scale the features not the images" [18], applied to HOG-like features.

Given the high parallelism of our solution, it will directly benefit from future hardware improvements. We wish to improve the quality of the classifier training (see section 3.3), and to extend the current system to the multi-class/multi-view detection of cars, bikes and other mobile objects. We are also interested in exploring a monocular equivalent of the current stereo stixel estimation.

Acknowledgement Work partly supported by the Toyota Motor Corporation, the EU project EUROPA (FP7-231888) and the ERC grant COGNIMUND.

References
[1] H. Badino, U. Franke, and D. Pfeiffer. The stixel world - a compact medium level representation of the 3d-world. In DAGM, 2009.
[2] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L. H. Matthies. A fast stereo-based system for detecting and tracking pedestrians from a moving vehicle. IJRR, 28:1466-1485, 2009.
[3] R. Benenson, R. Timofte, and L. Van Gool. Stixels estimation without depthmap computation. In ICCV, CVVT workshop, 2011.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
[6] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[7] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
[8] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2011.
[9] A. Ess. Visual Urban Scene Analysis by Moving Platforms. PhD thesis, ETH Zurich, October 2009.
[10] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
[11] C. Keller, D. Fernandez, and D. Gavrila. Dense stereo-based roi generation for pedestrian detection. In DAGM, 2009.
[12] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: object localization by efficient subwindow search. In CVPR, 2008.
[13] A. Lehmann, P. Gehler, and L. Van Gool. Branch&rank: Non-linear object detection. In BMVC, 2011.
[14] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, 2010.
[15] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approach for fast deformable object detection. In CVPR, 2011.
[16] V. Sreekanth, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Generalized RBF feature maps for efficient detection. In BMVC, 2010.
[17] P. Sudowe and B. Leibe. Efficient use of geometric constraints for sliding-window object detection in video. In ICVS, 2011.
[18] P. Viola and M. Jones. Robust real-time face detection. IJCV, 2004.
[19] C. Zhang and P. Viola. Multiple-instance pruning for learning efficient cascade detectors. In NIPS, 2007.
[20] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan. Fast human detection using a cascade of histograms of oriented gradients. In CVPR, 2006.
Figure 5: Visualization of the trained multi-scales model. First row shows an example of the different features used, one
per column. Each row below shows the trained model, one scale per row. Each model column is individually normalized to
maximize contrast (relative influence not visible). Red and blue indicate positive and negative contributions to the detection
score, respectively. Scale one has size 64 × 128 pixels.
