and containing a significant amount of people far away.

II. RELATED WORK

Pedestrian detection is an active area of research, with dominant approaches utilizing monocular cameras [1][2] and a more limited amount of work on the use of stereo cameras [3][6]. The most common approach, used as a baseline for comparison, is one that feeds Histogram of Oriented Gradients (HOG) features to a linear SVM classifier. This approach has since been augmented with other features such as parts-based features [2] or motion features [5]. Several works have explored combining multiple features to achieve performance that is better than any single feature [7]; see [4] for a survey of the methods that have been explored. While several of these methods outperform a HOG-based classifier on the tested datasets, they are often much slower than HOG features, which are amenable to efficient implementations, including through the use of CUDA. Similar to this paper, [8] uses a cascade to increase computational speed. Unlike that system, however, this work proposes automatically learned features and cascades classifiers based on distance, as opposed to traditional AdaBoost-based techniques. [9] also uses automatically learned features through a convolutional network classifier, but is not targeted towards longer-range detections. Several feature types, such as parts-based detections, are also difficult to obtain from lower-resolution image patches. Stereo-based detectors have also been used in, e.g., [3], which segments the scene based on range and applies various shape and appearance features to produce detections. Even when using stereo, however, these works are extremely limited in their ability to detect pedestrians when the number of pixels on target is low.

III. SYSTEM DESIGN

The overall system design is depicted in Figure 2, consisting of a stereo detector and a cascade of three convolutional network classifiers. We now describe each portion of the system in detail.
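To make the data flow concrete, the following is a minimal sketch of the overall cascade described in the remainder of this section. All names (Roi, the component callables, the threshold keys) are illustrative placeholders supplied by the rest of the system, not the authors' released code.

from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np

@dataclass
class Roi:
    x: int
    y: int
    w: int
    h: int

def detect_pedestrians(
    image: np.ndarray,
    disparity: np.ndarray,
    propose_rois: Callable[[np.ndarray], List[Roi]],          # stereo-based detector
    edge_score: Callable[[np.ndarray, Roi], float],           # vertical-edge response
    small_net: Callable[[np.ndarray], float],                 # cheap low-resolution classifier
    bimodal_net: Callable[[np.ndarray, np.ndarray], float],   # image + disparity classifier
    long_range_net: Callable[[np.ndarray], float],            # appearance-only long-range classifier
    thresholds: dict,
) -> List[Tuple[Roi, float]]:
    detections = []
    for roi in propose_rois(disparity):
        if edge_score(image, roi) < thresholds["edge"]:
            continue                                           # prune ROIs with weak vertical edges
        patch = image[roi.y:roi.y + roi.h, roi.x:roi.x + roi.w]
        dpatch = disparity[roi.y:roi.y + roi.h, roi.x:roi.x + roi.w]
        if small_net(patch) < thresholds["small"]:
            continue                                           # early reject of easy negatives
        if roi.h <= 40:                                        # route by pixel height
            score, thr = long_range_net(patch), thresholds["long_range"]
        else:
            score, thr = bimodal_net(patch, dpatch), thresholds["bimodal"]
        if score >= thr:
            detections.append((roi, score))
    return detections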
A. Camera Calibration and Stereo Detection

In order to compute accurate stereo disparity maps, the cameras must first be calibrated to obtain intrinsic camera parameters and extrinsic stereo parameters. For this paper, two types of camera lenses were used. The first was a 180-degree fisheye lens, for which we used the Scaramuzza MATLAB toolbox to obtain intrinsic parameters [11]. Given these parameters, we estimated the extrinsic parameters using images containing a checkerboard pattern visible in both views, and finally mapped the fisheye image onto a cylindrical image. The second camera pair used an 80-degree lens, which was calibrated using a traditional camera model and checkerboard pattern.
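For the conventional 80-degree lenses, this step amounts to standard checkerboard-based stereo calibration. The sketch below shows one way to do it with OpenCV; it assumes pre-loaded grayscale image pairs and a known board geometry, and is not the specific toolchain used in the paper (the fisheye pair used the Scaramuzza MATLAB toolbox instead).

import numpy as np
import cv2

# Hedged sketch: checkerboard-based intrinsic and extrinsic stereo calibration.
# `left_imgs`/`right_imgs` are assumed to be lists of grayscale images showing
# the same checkerboard in both views; board size and square size are assumptions.

def calibrate_stereo(left_imgs, right_imgs, board=(9, 6), square=0.10):
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

    obj_pts, left_pts, right_pts = [], [], []
    for li, ri in zip(left_imgs, right_imgs):
        ok_l, cl = cv2.findChessboardCorners(li, board)
        ok_r, cr = cv2.findChessboardCorners(ri, board)
        if ok_l and ok_r:
            obj_pts.append(objp)
            left_pts.append(cl)
            right_pts.append(cr)

    size = left_imgs[0].shape[::-1]  # (width, height) for grayscale inputs
    # Per-camera intrinsics, then the extrinsic rotation/translation (R, T)
    # between the two cameras from a joint optimization.
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    _, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return K1, d1, K2, d2, R, T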
Given a calibrated stereo pair, we compute stereo disparity maps using a fast CUDA implementation [14]. The resulting disparity image is then used to find vertical structures in the scene that could potentially correspond to pedestrians. Specifically, we discretize the image into a fixed number of patches and estimate the ground plane, converted into three-dimensional Cartesian space, using the disparity information. This is done by building a histogram of disparity values for each horizontal scanline in the patch and finding the best orientation estimate for the patch across all of these histograms using a Hough transform. This orientation is used to estimate the ground as a function of disparity for each patch. A mask of all disparity pixels that lie above this estimated ground is then created. Connected-component analysis is used to group these above-ground pixels together, separated into multiple disparity levels (distances), to provide a final estimate of all vertical objects. An ROI (region of interest) in image space is created from the bounding box of each component.
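The sketch below illustrates the above-ground masking and connected-component grouping on a disparity image. The ground model is simplified to a single linear mapping from disparity to ground row, standing in for the paper's per-patch Hough-based orientation estimate, and all parameter values are assumptions.

import numpy as np
import cv2

# Illustrative sketch of the stereo detection stage: mask disparity pixels that
# rise above a ground model, then group them into ROIs with connected
# components, binned by disparity (i.e. distance). Thresholds, bin edges, and
# the linear ground model are assumptions, not the paper's exact procedure.

def vertical_structure_rois(disparity, ground_row_of_disp, margin=2.0,
                            disp_bins=(0, 8, 16, 32, 64, 256), min_area=50):
    h, w = disparity.shape
    rows = np.arange(h, dtype=np.float32)[:, None]
    # Expected image row of the ground for each pixel's disparity value.
    expected_ground = ground_row_of_disp(disparity.astype(np.float32))
    above_ground = (rows < expected_ground - margin) & (disparity > 0)

    rois = []
    for lo, hi in zip(disp_bins[:-1], disp_bins[1:]):        # one layer per distance band
        layer = (above_ground & (disparity >= lo) & (disparity < hi)).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(layer, connectivity=8)
        for i in range(1, n):                                 # label 0 is background
            x, y, bw, bh, area = stats[i]
            if area >= min_area:
                rois.append((x, y, bw, bh))
    return rois

# Example ground model: for a forward-facing camera over a flat plane, larger
# disparities (nearer ground) map to lower image rows, roughly linearly.
ground_row_of_disp = lambda d: 240.0 + 6.0 * d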
Fig. 3. Edge-response example, with a negative instance mistakenly detected due to the curb and stereo noise having very little vertical edge response (left), while a positive instance has a strong vertical edge response (right).

Fig. 4. Architecture of the two-input convolutional neural network.

The technique described above finds all objects that are estimated to be above local ground patches, and hence can return a large number of ROIs. For example, large structures such as buildings and trees can be picked up, although they are contiguous structures that are unlikely to be pedestrians. In order to further prune these detections, it was observed that some of these false positives present no vertical edges in the camera image, while pedestrian silhouettes contain strong edges. Hence, we also apply a Sobel edge detection filter to the entire image and prune ROIs that have little vertical edge response. Section IV will show that this significantly reduces the number of detections with very little loss of recall. Figure 3 shows an example of a false positive with little edge response and a typical true positive that presents strong edges.
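A minimal version of this pruning step is sketched below: it computes the vertical-edge response (horizontal Sobel gradient) and keeps only ROIs whose mean response exceeds a threshold. The threshold value is an assumption, not the one used in the paper.

import numpy as np
import cv2

# Sketch of edge-based ROI pruning: vertical edges respond to the horizontal
# derivative (dx), so ROIs with a weak mean |Sobel_x| response, such as bare
# curbs or stereo noise, are discarded. The threshold is illustrative only.

def prune_by_vertical_edges(gray, rois, thresh=20.0):
    sobel_x = cv2.Sobel(gray, cv2.CV_32F, dx=1, dy=0, ksize=3)
    edge_mag = np.abs(sobel_x)
    kept = []
    for (x, y, w, h) in rois:
        response = float(edge_mag[y:y + h, x:x + w].mean())
        if response >= thresh:
            kept.append((x, y, w, h))
    return kept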
B. Classification using Convolutional Networks

After the stereo-based detection, a classifier is used to determine whether each ROI contains a pedestrian or not. While previous appearance-based classifiers have achieved some success in cases where there are more than 50 pixels on a pedestrian, they typically degrade heavily beyond this range [4]. While we mitigate this effect somewhat through our use of higher-resolution cameras with an 80-degree lens, the fisheye lens yields significantly fewer pixels on target, and even the narrower field of view results in pedestrians at our maximum target range (approximately 100 meters) occupying around 30 pixels or less. In fact, we will show in the experimental section that approximately a third of our dataset consists of detections with less than 50 pixels in height.

Instead of hand-crafted features, we use convolutional neural networks, a deep learning method that has achieved success on a wide array of modalities for tasks such as object recognition [12] and speech [13]. It is considered a deep learning technique due to its use of a hierarchical architecture that consists of alternating filters (which are learned) and pooling, resulting in the processing of data at multiple levels. In addition to learning the underlying filters and connection weights, labeled training data can be used to learn a discrimination function that is able to classify data into multiple classes.

There are several advantages to using this technique over traditional approaches. First, it is a flexible framework that supports the use of multiple modalities, in this case EO images and stereo disparity images. We design a neural network architecture that utilizes an asymmetrical connection map with the inputs to ensure that features are learned from each modality independently as well as jointly over both modalities. In order to incorporate this additional information, there is no need to hand-design additional features for the new modality. A second advantage is that this framework allows the features to be learned for the particular task of pedestrian detection, taking long-range instances into account when learning them. As mentioned earlier, hand-designed features have thus far not been as successful in these situations, and we will show in Section IV that the convolutional network classifier outperforms standard HOG features. Finally, the processing time required at run-time (i.e., on test data after training) lends itself to close to real-time application.

Figure 4 shows the architecture. The inputs to the classifier consist of two channels: an 80x40 8-bit grayscale camera image and a second 80x40 integer disparity map. Both inputs are normalized to zero mean and unit variance before being passed to the subsequent layers. Beyond the input, a 6-layer hierarchy was designed, consisting of 3 convolutional layers, 2 pooling layers, and a fully connected layer that produces two numerical outputs representing scores for each class (pedestrian and non-pedestrian). In all, there are 8,000 trainable parameters describing the weights of the connections. These network parameters are optimized using stochastic gradient descent given a training set consisting of labeled positive and negative instances.
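As a concrete illustration, the sketch below defines a small two-channel convolutional network with the same general shape (3 convolutional layers, 2 pooling layers, a fully connected layer, and two output scores) in PyTorch. The filter counts and kernel sizes are guesses for illustration; they are not tuned to reproduce the paper's 8,000-parameter network or its asymmetrical modality connection map.

import torch
import torch.nn as nn

# Illustrative two-channel (grayscale + disparity) convolutional classifier,
# mirroring the 3-conv / 2-pool / 1-FC layout described above. Layer sizes are
# assumptions, not the paper's exact architecture.

class BimodalConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 6, kernel_size=5),    # conv 1 over both modalities
            nn.Tanh(),
            nn.MaxPool2d(2),                   # pool 1
            nn.Conv2d(6, 12, kernel_size=5),   # conv 2
            nn.Tanh(),
            nn.MaxPool2d(2),                   # pool 2
            nn.Conv2d(12, 16, kernel_size=3),  # conv 3
            nn.Tanh(),
        )
        self.classifier = nn.Linear(16 * 15 * 5, 2)   # pedestrian / non-pedestrian scores

    def forward(self, x):                      # x: (N, 2, 80, 40), zero-mean unit-variance
        f = self.features(x)
        return self.classifier(f.flatten(1))

# One SGD step on a stand-in batch of normalized 80x40 image+disparity patches.
net = BimodalConvNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x = torch.randn(8, 2, 80, 40)
y = torch.randint(0, 2, (8,))                  # 1 = pedestrian, 0 = background
loss = nn.functional.cross_entropy(net(x), y)
loss.backward()
opt.step()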
In addition, in order to further increase the classifier speed, we trained a second classifier that takes 28x14 inputs instead of the original 80x40. The threshold for this classifier is conservatively chosen to achieve a high recall. While its false alarm rate will not be as low as that of the full-sized classifier, it is able to filter out a significant portion of the easier negatives. Since the smaller classifier is faster, a net time savings is gained. Note that we also run the two classifiers across multiple threads, since the ROIs can be classified independently, yielding further speed savings on multi-core systems.
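The cascade logic and the per-ROI parallelism can be expressed compactly; the sketch below uses a thread pool to score ROIs independently, rejecting each one early with the cheap classifier before invoking the expensive one. The two score_* callables and the threshold values are placeholders.

from concurrent.futures import ThreadPoolExecutor

# Sketch of the two-stage cascade with per-ROI parallelism. `score_small`
# (cheap, 28x14 input) and `score_full` (bimodal 80x40 input) are placeholder
# callables returning a pedestrian score; thresholds are illustrative.

def classify_roi(patch, dpatch, score_small, score_full,
                 small_thresh=-0.5, full_thresh=0.0):
    # Conservative early reject: keep recall high, drop only easy negatives.
    if score_small(patch) < small_thresh:
        return None
    score = score_full(patch, dpatch)
    return score if score >= full_thresh else None

def classify_rois(patches, dpatches, score_small, score_full, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(classify_roi, p, d, score_small, score_full)
                   for p, d in zip(patches, dpatches)]
        return [f.result() for f in futures]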
C. Long-Range Classifier

There are several challenging aspects to classifying detections of pedestrians at long range. First, there are few pixels on target (e.g., fewer than 50), and hence the resulting stereo disparity map will be of much lower quality. This can result in errors in the ground estimation as well as in the stereo calculations themselves, which require some texture that can be eroded by blur. Second, these problems can cause detection windows that are not fully aligned with the pedestrians, a situation that can reduce the classifier's success.
Fig. 5. Left: Recall/FPPW curve demonstrating the performance of convolutional networks ("ConvNet") over HOG features combined with linear SVMs [1] ("HOG") on our dataset ("CID - 80x40"). We also demonstrate the difficulty of the dataset by showing the results of the HOG algorithm on the INRIA dataset; our dataset is significantly more challenging due to longer-range pedestrians. Right: Recall/FPPW curve for results obtained using the long-range classifier trained on pedestrian samples that were 40 pixels in height or less. We compare our results to the original HOG-based classifier ("HOG"), after optimizing the parameters of the HOG-based classifier ("HOG (Best)"), and after upsampling the 36x18 samples to 128x64, for which the HOG classifier was originally designed ("HOG (128x64)").
In order to mitigate these problems, we use a second, appearance-only classifier trained exclusively on pedestrians with 40 pixels or less on target. This classifier is similar in architecture to the previously described classifier, except that it is unimodal (stereo disparity maps are not used) and the input image is a smaller 36x18 ROI. In addition, the training data used to train the classifier differs. Unlike the earlier classifier, the training windows are not resized or resampled, and pedestrians that are smaller or larger than the window are allowed. In other words, the center of the detection is taken, and a window the size of the classifier input (in our case 36x18) is extracted around that center. This training regime specifically builds in robustness to partial overlap with the pedestrian and to reduced image blur.
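The fixed-size, non-resampled window extraction described above can be implemented as a simple center crop; the sketch below pads the image so out-of-bounds windows remain valid, and the names and output size handling are illustrative.

import numpy as np

# Sketch of the long-range training-window extraction: take the center of a
# ground-truth (or detected) box and cut a fixed 36x18 patch around it, with
# no rescaling, so smaller/larger pedestrians and misalignments are tolerated.

def center_crop(gray, box, out_h=36, out_w=18):
    x, y, w, h = box                       # box in pixel coordinates
    cx, cy = x + w // 2, y + h // 2
    top, left = cy - out_h // 2, cx - out_w // 2

    # Pad with edge values in case the window falls partially outside the image.
    padded = np.pad(gray, ((out_h, out_h), (out_w, out_w)), mode="edge")
    top += out_h
    left += out_w
    return padded[top:top + out_h, left:left + out_w]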
In order to combine the two classifiers, the detections are split based on their height in pixels and fed to the respective classifier. The results are then combined to form the final pedestrian detections. Note that since each classifier has its own characteristics in terms of its operating curve, different thresholds are used for each classifier.
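A minimal version of this height-based split is shown below; the 40-pixel split point and the two threshold values are illustrative, and the classifier callables are placeholders.

# Sketch of the height-based split between the long-range and bimodal
# classifiers, each with its own operating threshold. Callables and values
# are placeholders, not the authors' implementation.

def route_and_score(detections, long_range_net, bimodal_net,
                    split_px=40, lr_thresh=0.2, bm_thresh=0.0):
    final = []
    for det in detections:                 # det: dict with patch data and pixel height
        if det["height"] <= split_px:
            score, thresh = long_range_net(det["patch"]), lr_thresh
        else:
            score, thresh = bimodal_net(det["patch"], det["disparity"]), bm_thresh
        if score >= thresh:
            final.append((det, score))
    return final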
Fig. 6. Sample mid-range inputs showing the image and disparity maps.

Fig. 7. Example filters that were learned, and the feature maps after training the convolutional network.

IV. EXPERIMENTAL RESULTS

In order to test the system, over 45 video sequences were gathered in various outdoor environments over a period of 10 months and subsequently ground-truthed (we use the name "CID" to label this dataset in subsequent graphs). Figure 6 shows examples of mid-range detections, and Table I summarizes some statistics of the dataset. The environments ranged from a helipad spanning approximately 60 meters, a parking lot with long rows of cars, an open area with buildings and vegetation, and a forest environment, to an open parking lot spanning more than 100 meters.

Two sets of stereo cameras, each operating at either 1 Hz or 5 Hz, were used to capture the data, and classifier training used data from both cameras. Both rigs consisted of a stereo pair of Prosilica GB2450 cameras operating at a resolution of 2448x2050. One setup used a 180-degree fisheye lens, resulting in an image of size 2881x657 after conversion to the cylindrical image, while the other rig used an 80-degree lens with more pixels on target, with the image cropped to a size of 2448x625. The pedestrians in the dataset were captured at ranges of 10 to 140 meters. In all, over 39,000 frames with over 160,000 annotated pedestrians resulted, with about a third being 50 pixels or less in height.
Fig. 8. Results on one sequence, evaluated on a per-image basis, breaking down the detection and classification results for the fisheye camera (left) and
80 degree camera (right). Over half of the pedestrians are picked up at greater than 50 meters for the fisheye, representing approximately 30 pixels. The
80 degree camera performs even better, picking up approximately 60% of the pedestrians at over 80 meters.
TABLE I
DESCRIPTION OF DATA SETS

Attribute               Fisheye    80 Degree
Frames                  24,572     14,533
Detections              99,522     64,204
Mean Height (Pixels)    87.78      88.37
% less than 50 pixels   32.1       35.6
% less than 40 pixels   14.4       11.6
The datasets also increase in complexity, with the last set consisting of people performing various maneuvers and running, conditions that did not exist in the training set. We will now describe our results for training the bimodal and long-range classifiers on ground-truthed positives and negatives sampled from the images. The positive and negative instances were extracted from the data sets and separated into training, testing, and validation sets. After showing results on the testing set on a per-instance basis, we will show results on entire sequences on a per-image basis.
A. Bimodal Convolutional Network

To train the convolutional network classifier, 25 sequences were combined and all of the positive and negative image patches were shuffled into training, testing, and validation sets. To obtain positive samples, ground-truth ROIs were extracted from the images. Note that a larger border was used during extraction so that the original positive set could be expanded by applying translational and rotational jitter. This process serves to expand the positive set and make it robust to some transformations. Negative patches were extracted by taking random patches that did not intersect with the ground-truth pedestrian windows but that had similar scales and vertical image positions. In all, after jittering the images, 800,000 labeled positives (ROIs containing pedestrians) and negatives (ROIs with no pedestrians) were used. Training took about 2 days running on a single CPU.
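The jittering step described above is a standard augmentation; the sketch below generates translated and slightly rotated crops from a ground-truth patch extracted with extra border. The jitter ranges and sample counts are assumptions.

import numpy as np
import cv2

# Sketch of the translational/rotational jitter used to expand the positive
# set. `patch_with_border` is a ground-truth crop extracted with enough margin
# that shifted/rotated windows stay inside it; ranges are illustrative.

def jitter_patch(patch_with_border, out_h=80, out_w=40, n=8,
                 max_shift=4, max_deg=5.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = patch_with_border.shape[:2]
    cy, cx = h // 2, w // 2
    samples = []
    for _ in range(n):
        angle = rng.uniform(-max_deg, max_deg)
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
        rotated = cv2.warpAffine(patch_with_border, M, (w, h))
        top = cy - out_h // 2 + dy
        left = cx - out_w // 2 + dx
        samples.append(rotated[top:top + out_h, left:left + out_w])
    return samples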
Figure 7 shows an example of the resulting filters that were learned, and Figure 5 (left) shows the results on the testing set. The results are displayed as Recall/FPPW (false positives per window) curves, where the class decision is made by comparing the two numerical outputs of the network classifier. In this case, a bias on the positive class is applied and varied to produce the curve. For comparison, we have also trained a HOG-based SVM classifier with the same training set, tested it on the same testing set, and scored it with the same methodology. As can be seen, the convolutional network achieves significantly better performance on the testing patches (note that the x-axis is log-scale). Some of this gain may be due to the additional modality, but as we will see in the next subsection, the convolutional network outperforms HOG even when both use only appearance. We also display the results of the HOG-based classifier on the original INRIA dataset to demonstrate that our dataset is significantly more challenging than INRIA; this is due to the significant number of long-range pedestrians.

B. Long-Range Classifier

The long-range classifier used 29 sequences, with a similar methodology for extracting positive samples. In this case, however, the samples were not resampled to the classifier input size; instead, the center of each positive ground-truth window that was 40 pixels in height or less was calculated, and a 36x18 patch around this center was extracted. Negative samples consisted of stereo detections that did not intersect with the ground truth. In all, after expansion through jittering, approximately 53k positive and negative samples each were used for training, and approximately 40k positive and negative samples each were used for testing. Training and testing sets were obtained by randomly shuffling the input data.

Figure 5 (right) shows the per-window results of the long-range classifier. Again, we compare these results to the original HOG features with an SVM classifier. Since the image samples were much smaller than the HOG algorithm was designed for, we also varied the number of cells, the cell size, and the distance metric used in the HOG algorithm.
Fig. 9. Results of the classification system on a per-image basis (FPPI is false positives per image), on 4 subsets of the data. We show results for the
fisheye (left) and 80-degree camera (right), and demonstrate competitive results for long-range detections.
The "HOG (Best)" condition shows the results of the best parameter set, consisting of 4 cells, each with a size of 4 pixels, and the L2-Hys metric. We also tested a second condition in which the smaller patches were upsampled to the original 128x64 resolution that the classifier was designed for. This latter condition performed the best for the HOG-based classifier, but was still outperformed by the convolutional neural network.
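For reference, the sketch below computes HOG descriptors for the small 36x18 patches with scikit-image and trains a linear SVM, exposing the cell size and block normalization that were varied in this comparison; the specific values shown are assumptions rather than the exact grid searched in the paper.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Sketch of the HOG + linear SVM baseline on 36x18 grayscale patches, with the
# cell size and block normalization exposed as tunable parameters. Values
# below (4x4-pixel cells, 2x2-cell blocks, L2-Hys) are illustrative.

def hog_features(patches, cell=(4, 4), block=(2, 2), norm="L2-Hys"):
    return np.stack([
        hog(p, orientations=9, pixels_per_cell=cell,
            cells_per_block=block, block_norm=norm)
        for p in patches
    ])

def train_hog_svm(pos_patches, neg_patches):
    X = hog_features(list(pos_patches) + list(neg_patches))
    y = np.r_[np.ones(len(pos_patches)), np.zeros(len(neg_patches))]
    return LinearSVC(C=0.01).fit(X, y)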
C. Results on Data Sequences

Although the results on a per-window basis are informative, the practical deployment of a detection system is determined by its per-image performance. As a result, we have taken four subsets of the data, based on the day they were captured, with increasing levels of difficulty. As mentioned earlier, the last set consisted of people running and performing various maneuvers. Figure 8 shows the results for one sequence only and at one particular operating point (resulting in approximately 1.6 false positives per frame), focusing on the recall as the estimated distance, obtained from stereo, varies. Note that the results in this section did not utilize the long-range classifier; results for the long-range classifier are shown in the next section. We show results for the fisheye camera (left) and the 80-degree camera (right), showing that the latter can detect pedestrians at farther ranges. Even the fisheye, however, can detect pedestrians at greater than 50 meters, corresponding to approximately 30 pixels in height for this lens. The 80-degree camera can detect greater than half of the pedestrians beyond 80 meters.

Figure 9 shows our Recall/FPPI (false positives per image) results for the four subsets using the fisheye camera (left) and for the latter three subsets using the 80-degree camera (right). The same scene for each data subset was recorded by each robot at the same time, although the robots were placed in different positions, so the images and pedestrian distances may vary. Note that the first subset was only collected using the fisheye camera. Overall, competitive detection rates at fewer than 1 false positive per frame can be obtained, especially when using the 80-degree camera and on the earlier datasets that did not involve running and other motions.
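The per-image metrics used here can be computed with a simple greedy overlap matching between detections and ground truth on each frame; the sketch below uses an intersection-over-union criterion of 0.5, which is an assumption about the exact matching rule.

import numpy as np

# Sketch of per-image evaluation: recall and false positives per image (FPPI)
# from per-frame detections and ground-truth boxes, matched greedily by IoU.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def recall_fppi(frames):
    """frames: list of (detections, ground_truths), boxes given as (x, y, w, h)."""
    tp = fp = n_gt = 0
    for dets, gts in frames:
        n_gt += len(gts)
        unmatched = list(gts)
        for d in dets:
            scores = [iou(d, g) for g in unmatched]
            if scores and max(scores) >= 0.5:
                unmatched.pop(int(np.argmax(scores)))
                tp += 1
            else:
                fp += 1
    return tp / max(n_gt, 1), fp / max(len(frames), 1)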
In order to show that the long-range classifier can increase performance on a per-image basis, we ran the entire pipeline on a challenging sequence from Set 4. Figure 10 shows results at different pixel-height splits for one operating point (top) and the entire Recall/FPPI curve (bottom) for the sequence, demonstrating that higher operating points can be achieved on the performance curve when the long-range classifier is used for detections with a height of 40 pixels or less. There is a steeper dropoff at very low FPPI levels, however, possibly resulting from the fact that the long-range classifier does not utilize stereo information.

Figure 11 shows results on a subset of the sequences to show the effect of edge-based filtering of stereo detections, for the three metrics. The left column shows results after the detection stage, and the right column shows results for the classification stage. As can be seen, edge-based filtering significantly reduced the number of detections while causing only very slight decreases in recall. This significantly decreases the number of detections that have to go through the classification stage. After the classifier, recall is very slightly decreased while some false positives that would have passed through the classifier are removed.

Finally, we show timing results for the cascade, in which the lower-resolution (28x14) classifier is used to filter out easy examples early on, avoiding having to run them through the more computationally expensive bimodal classifier. Note that we do not use the long-range classifier or edge-based filtering for these timing results; these would only improve the speed, as the long-range classifier is faster. Figure 12 shows timing results on a quad-core i7 Dell Precision M6500 laptop. Although we do not show it here, there is little to no effect on the accuracy metrics after the cascade, as the lower-resolution classifier has extremely high recall. As can be seen from the timing results, cascading and multi-threading the classification of the detections obtained from stereo significantly reduce the running time, resulting in rates that are close to the capture rate of 5 Hz.
Fig. 11. Results of edge-based filtering of detections in terms of precision (top left), recall (top right), and false positives per frame (bottom), after stereo-based detection and classification. As can be seen, a drastic reduction in the number of detections can be achieved for these datasets.

Fig. 10. Results of incorporating the long-range classifier with varying pixel sizes used to split which detections go to which classifier, at one operating point (black dot, top) and the entire curve when using a pixel height split of 40 pixels (bottom). As can be seen, a boost in performance can be achieved over the original system when the long-range classifier handles detections at 40 pixels or less.

Fig. 12. Timing results on sample frames using the base system, the two-classifier cascade, and after threading.
V. CONCLUSIONS AND FUTURE WORK

This paper demonstrated a system designed for the detection of pedestrians at varying distances, including ranges that have typically not been explored in the literature. The design of the system is targeted towards these longer-range detections through several novel techniques. First, a stereo-based detector is used to avoid having to classify the entire image at many scales, a method that has typically been employed but that is computationally expensive given the higher resolution of the images and the smaller scales of the pedestrians. A second, edge-based filtering step is then employed to significantly reduce the number of false positives.
A cascade of three convolutional network classifiers is then employed, each of which is designed for a particular subproblem. A bimodal classifier is first trained to leverage both appearance and stereo disparity information; we use a deep learning approach to obviate the need to hand-design new stereo features, which are instead automatically learned as part of the optimization process. A second classifier that takes in much smaller inputs is also trained, designed to quickly filter out easier false positives. Finally, a classifier designed specifically for longer ranges, where there are 40 pixels on target or less, is trained in order to increase robustness to the image blur and misalignment of the underlying detections that occur in such cases. We have shown that each of these design decisions has led to a system that is either more accurate or computationally faster. We have also shown that the bimodal and long-range classifiers perform significantly better on a per-window basis than the HOG-based SVM classifier.

There are several avenues of future work that remain. First, there are more sophisticated methods of combining the short-range and long-range classifiers that may yield additional improvements. Second, while this work has made significant headway in the area of pedestrian detection, as mentioned in recent surveys [4] it remains a significant challenge to deploy such systems for applications such as safety, where very few false positives are tolerable and very high recalls are desired.
One topic that has not been explored in this paper is tracking, where temporal information is used to reduce false positives. Another significant challenge for pedestrian detection is the presence of persistent false alarms, for example on objects that resemble humans, such as poles, signs, etc. Here, additional modalities such as dense LIDAR, or classification using multiple views obtained by a mobile robot, may be helpful.
REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", in CVPR, 2005.
[2] P. Felzenszwalb, R. Girshick, and D. McAllester, "Cascade Object Detection with Deformable Part Models", in CVPR, 2010.
[3] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L. H. Matthies, "A fast stereo-based system for detecting and tracking pedestrians from a moving vehicle", The International Journal of Robotics Research, vol. 28, no. 11-12, pp. 1466-1485, 2009.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art", PAMI, 2011.
[5] S. Walk, N. Majer, K. Schindler, and B. Schiele, "New features and insights for pedestrian detection", in IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[6] S. Walk, K. Schindler, and B. Schiele, "Disparity statistics for pedestrian detection: Combining appearance, motion and stereo", in European Conf. Computer Vision, 2010.
[7] C. Wojek and B. Schiele, "A performance evaluation of single and multi-feature people detection", in DAGM Symposium Pattern Recognition, 2008.
[8] S. Paisitkriangkrai, C. Shen, and J. Zhang, "Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features", IEEE Transactions on Circuits and Systems for Video Technology.
[9] A. Yoshizawa, M. Yamamoto, and J. Ogata, "Pedestrian detection with convolutional neural networks", in Proceedings of the Intelligent Vehicles Symposium, 2005.
[10] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning", in Predicting Structured Data, G. Bakir, T. Hofman, B. Scholkopf, A. Smola, and B. Taskar, Eds., MIT Press, 2006.
[11] D. Scaramuzza, A. Martinelli, and R. Siegwart, "A Toolbox for Easily Calibrating Omnidirectional Cameras", in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, October 2006.
[12] M. A. Ranzato, F. J. Huang, Y. L. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition", in CVPR, 2007.
[13] H. Lee, P. T. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks", in NIPS, pp. 1096-1104, 2009.
[14] M. Sizintsev, S. Kuthirummal, S. Samarasekera, R. Kumar, H. S. Sawhney, and A. Chaudry, "GPU accelerated realtime stereo for augmented reality", in Proceedings of the 5th International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), 2010.