and containing a significant amount of people far away.

II. RELATED WORK

Pedestrian detection is an active area of research, with dominant approaches utilizing monocular cameras [1][2] and a more limited amount of work on the use of stereo cameras [3][6]. The most common approach, used as a baseline for comparison, is one that feeds Histogram of Oriented Gradients (HOG) features to a linear SVM classifier. This approach has since been augmented with other features such as parts-based features [2] or motion features [5]. Several works have explored combining multiple features to achieve performance that is better than any single feature [7]; see [4] for a survey of the methods that have been explored. While several of these methods outperform a HOG-based classifier on the tested datasets, they are often much slower than HOG features, which are amenable to efficient implementations, including through the use of CUDA. Similar to this paper, [8] uses a cascade to increase computational speed. Unlike that system, however, this work proposes automatically learned features and cascades classifiers based on distance, as opposed to traditional AdaBoost-based techniques. [9] also uses automatically learned features through a convolutional network classifier, but is not targeted towards longer-range detections. Several feature types, such as parts-based detections, are also difficult to obtain from lower-resolution image patches. Stereo-based detectors have also been used in, e.g., [3], which segments the scene based on range and applies various shape and appearance features to produce detections. Even when using stereo, however, these works are extremely limited in their ability to detect pedestrians when the number of pixels on target is low.

III. SYSTEM DESIGN

The overall system design is depicted in Figure 2, consisting of a stereo detector and a cascade of three convolutional network classifiers. We now describe each portion of the system in detail.
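To make the data flow concrete, the following is a minimal sketch of the overall cascade described in the remainder of this section. All names (Roi, the component callables, the threshold keys) are illustrative placeholders supplied by the rest of the system, not the authors' released code.

from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np

@dataclass
class Roi:
    x: int
    y: int
    w: int
    h: int

def detect_pedestrians(
    image: np.ndarray,
    disparity: np.ndarray,
    propose_rois: Callable[[np.ndarray], List[Roi]],          # stereo-based detector
    edge_score: Callable[[np.ndarray, Roi], float],           # vertical-edge response
    small_net: Callable[[np.ndarray], float],                 # cheap low-resolution classifier
    bimodal_net: Callable[[np.ndarray, np.ndarray], float],   # image + disparity classifier
    long_range_net: Callable[[np.ndarray], float],            # appearance-only long-range classifier
    thresholds: dict,
) -> List[Tuple[Roi, float]]:
    detections = []
    for roi in propose_rois(disparity):
        if edge_score(image, roi) < thresholds["edge"]:
            continue                                           # prune ROIs with weak vertical edges
        patch = image[roi.y:roi.y + roi.h, roi.x:roi.x + roi.w]
        dpatch = disparity[roi.y:roi.y + roi.h, roi.x:roi.x + roi.w]
        if small_net(patch) < thresholds["small"]:
            continue                                           # early reject of easy negatives
        if roi.h <= 40:                                        # route by pixel height
            score, thr = long_range_net(patch), thresholds["long_range"]
        else:
            score, thr = bimodal_net(patch, dpatch), thresholds["bimodal"]
        if score >= thr:
            detections.append((roi, score))
    return detections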
A. Camera Calibration and Stereo Detection

In order to compute accurate stereo disparity maps, the cameras must first be calibrated to obtain intrinsic camera parameters and extrinsic stereo parameters. For this paper, two types of camera lenses were used. The first was a 180-degree fisheye lens, for which we used the Scaramuzza MATLAB toolbox to obtain intrinsic parameters [11]. Given these parameters, we estimated the extrinsic parameters using images containing a checkerboard pattern visible in both views, and finally mapped the fisheye image onto a cylindrical image. The second camera pair used an 80-degree lens, which was calibrated using a traditional camera model and checkerboard pattern.
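For the conventional 80-degree lenses, this step amounts to standard checkerboard-based stereo calibration. The sketch below shows one way to do it with OpenCV; it assumes pre-loaded grayscale image pairs and a known board geometry, and is not the specific toolchain used in the paper (the fisheye pair used the Scaramuzza MATLAB toolbox instead).

import numpy as np
import cv2

# Hedged sketch: checkerboard-based intrinsic and extrinsic stereo calibration.
# `left_imgs`/`right_imgs` are assumed to be lists of grayscale images showing
# the same checkerboard in both views; board size and square size are assumptions.

def calibrate_stereo(left_imgs, right_imgs, board=(9, 6), square=0.10):
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

    obj_pts, left_pts, right_pts = [], [], []
    for li, ri in zip(left_imgs, right_imgs):
        ok_l, cl = cv2.findChessboardCorners(li, board)
        ok_r, cr = cv2.findChessboardCorners(ri, board)
        if ok_l and ok_r:
            obj_pts.append(objp)
            left_pts.append(cl)
            right_pts.append(cr)

    size = left_imgs[0].shape[::-1]  # (width, height) for grayscale inputs
    # Per-camera intrinsics, then the extrinsic rotation/translation (R, T)
    # between the two cameras from a joint optimization.
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    _, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return K1, d1, K2, d2, R, T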
Given a calibrated stereo pair, we compute stereo disparity maps using a fast CUDA implementation [14]. The resulting disparity image is then used to find vertical structures in the scene that could potentially correspond to pedestrians. Specifically, we discretize the image into a fixed number of patches and estimate the ground plane, converted into three-dimensional Cartesian space, using the disparity information. This is done by building a histogram of disparity values for each horizontal scanline in the patch and finding the best orientation estimate for the patch across all of these histograms using a Hough transform. This orientation is used to estimate the ground as a function of disparity for each patch. A mask of all disparity pixels that lie above this estimated ground is then created. Connected-component analysis is used to group these above-ground pixels together, separated into multiple disparity levels (distances), to provide a final estimate of all vertical objects. An ROI (region of interest) in image space is created from the bounding box of each component.
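The sketch below illustrates the above-ground masking and connected-component grouping on a disparity image. The ground model is simplified to a single linear mapping from disparity to ground row, standing in for the paper's per-patch Hough-based orientation estimate, and all parameter values are assumptions.

import numpy as np
import cv2

# Illustrative sketch of the stereo detection stage: mask disparity pixels that
# rise above a ground model, then group them into ROIs with connected
# components, binned by disparity (i.e. distance). Thresholds, bin edges, and
# the linear ground model are assumptions, not the paper's exact procedure.

def vertical_structure_rois(disparity, ground_row_of_disp, margin=2.0,
                            disp_bins=(0, 8, 16, 32, 64, 256), min_area=50):
    h, w = disparity.shape
    rows = np.arange(h, dtype=np.float32)[:, None]
    # Expected image row of the ground for each pixel's disparity value.
    expected_ground = ground_row_of_disp(disparity.astype(np.float32))
    above_ground = (rows < expected_ground - margin) & (disparity > 0)

    rois = []
    for lo, hi in zip(disp_bins[:-1], disp_bins[1:]):        # one layer per distance band
        layer = (above_ground & (disparity >= lo) & (disparity < hi)).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(layer, connectivity=8)
        for i in range(1, n):                                 # label 0 is background
            x, y, bw, bh, area = stats[i]
            if area >= min_area:
                rois.append((x, y, bw, bh))
    return rois

# Example ground model: for a forward-facing camera over a flat plane, larger
# disparities (nearer ground) map to lower image rows, roughly linearly.
ground_row_of_disp = lambda d: 240.0 + 6.0 * d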
Fig. 3. Edge-response example, with a negative instance mistakenly detected due to the curb and stereo noise having very little vertical edge response (left), while a positive instance has a strong vertical edge response (right).

Fig. 4. Architecture of the two-input convolutional neural network.

The technique described above finds all objects that are estimated to be above local ground patches, and hence can return a large number of ROIs. For example, large structures such as buildings and trees can be picked up, although they are contiguous structures that are unlikely to be pedestrians. In order to further prune these detections, it was observed that some of these false positives present no vertical edges in the camera image, while pedestrian silhouettes contain strong edges. Hence, we also apply a Sobel edge detection filter to the entire image and prune ROIs that have little vertical edge response. Section IV will show that this significantly reduces the number of detections with very little loss of recall. Figure 3 shows an example of a false positive with little edge response and a typical true positive that presents strong edges.
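A minimal version of this pruning step is sketched below: it computes the vertical-edge response (horizontal Sobel gradient) and keeps only ROIs whose mean response exceeds a threshold. The threshold value is an assumption, not the one used in the paper.

import numpy as np
import cv2

# Sketch of edge-based ROI pruning: vertical edges respond to the horizontal
# derivative (dx), so ROIs with a weak mean |Sobel_x| response, such as bare
# curbs or stereo noise, are discarded. The threshold is illustrative only.

def prune_by_vertical_edges(gray, rois, thresh=20.0):
    sobel_x = cv2.Sobel(gray, cv2.CV_32F, dx=1, dy=0, ksize=3)
    edge_mag = np.abs(sobel_x)
    kept = []
    for (x, y, w, h) in rois:
        response = float(edge_mag[y:y + h, x:x + w].mean())
        if response >= thresh:
            kept.append((x, y, w, h))
    return kept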
B. Classification using Convolutional Networks

After the stereo-based detection, a classifier is used to determine whether each ROI contains a pedestrian or not. While previous appearance-based classifiers have achieved some success in cases where there are more than 50 pixels on a pedestrian, they typically degrade heavily beyond this range [4]. While we mitigate this effect somewhat through our use of higher-resolution cameras with an 80-degree lens, the fisheye lens yields significantly fewer pixels on target, and even the narrower field of view results in pedestrians at our maximum target range (approximately 100 meters) occupying around 30 pixels or less. In fact, we will show in the experimental section that approximately a third of our dataset consists of detections with less than 50 pixels in height.

Instead of hand-crafted features, we use convolutional neural networks, a deep learning method that has achieved success on a wide array of modalities for tasks such as object recognition [12] and speech [13]. It is considered a deep learning technique due to its use of a hierarchical architecture that consists of alternating filters (which are learned) and pooling, resulting in the processing of data at multiple levels. In addition to learning the underlying filters and connection weights, labeled training data can be used to learn a discrimination function that is able to classify data into multiple classes.

There are several advantages to using this technique over traditional approaches. First, it is a flexible framework that supports the use of multiple modalities, in this case EO images and stereo disparity images. We design a neural network architecture that utilizes an asymmetrical connection map with the inputs to ensure that features are learned from each modality independently as well as jointly over both modalities. In order to incorporate this additional information, there is no need to hand-design additional features for the new modality. A second advantage is that this framework allows the features to be learned for the particular task of pedestrian detection, taking long-range instances into account when learning them. As mentioned earlier, hand-designed features have thus far not been as successful in these situations, and we will show in Section IV that the convolutional network classifier outperforms standard HOG features. Finally, the processing time required at run-time (i.e., on test data after training) lends itself to close to real-time application.

Figure 4 shows the architecture. The inputs to the classifier consist of two channels: an 80x40 8-bit grayscale camera image and a second 80x40 integer disparity map. Both inputs are normalized to zero mean and unit variance before being passed to the subsequent layers. Beyond the input, a 6-layer hierarchy was designed, consisting of 3 convolutional layers, 2 pooling layers, and a fully connected layer that produces two numerical outputs representing scores for each class (pedestrian and non-pedestrian). In all, there are 8,000 trainable parameters describing the weights of the connections. These network parameters are optimized using stochastic gradient descent given a training set consisting of labeled positive and negative instances.
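As a concrete illustration, the sketch below defines a small two-channel convolutional network with the same general shape (3 convolutional layers, 2 pooling layers, a fully connected layer, and two output scores) in PyTorch. The filter counts and kernel sizes are guesses for illustration; they are not tuned to reproduce the paper's 8,000-parameter network or its asymmetrical modality connection map.

import torch
import torch.nn as nn

# Illustrative two-channel (grayscale + disparity) convolutional classifier,
# mirroring the 3-conv / 2-pool / 1-FC layout described above. Layer sizes are
# assumptions, not the paper's exact architecture.

class BimodalConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 6, kernel_size=5),    # conv 1 over both modalities
            nn.Tanh(),
            nn.MaxPool2d(2),                   # pool 1
            nn.Conv2d(6, 12, kernel_size=5),   # conv 2
            nn.Tanh(),
            nn.MaxPool2d(2),                   # pool 2
            nn.Conv2d(12, 16, kernel_size=3),  # conv 3
            nn.Tanh(),
        )
        self.classifier = nn.Linear(16 * 15 * 5, 2)   # pedestrian / non-pedestrian scores

    def forward(self, x):                      # x: (N, 2, 80, 40), zero-mean unit-variance
        f = self.features(x)
        return self.classifier(f.flatten(1))

# One SGD step on a stand-in batch of normalized 80x40 image+disparity patches.
net = BimodalConvNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x = torch.randn(8, 2, 80, 40)
y = torch.randint(0, 2, (8,))                  # 1 = pedestrian, 0 = background
loss = nn.functional.cross_entropy(net(x), y)
loss.backward()
opt.step()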
In addition, in order to further increase the classifier speed, we trained a second classifier that takes 28x14 inputs instead of the original 80x40. The threshold for this classifier is conservatively chosen to achieve a high recall. While its false alarm rate will not be as low as that of the full-sized classifier, it is able to filter out a significant portion of the easier negatives. Since the smaller classifier is faster, a net time savings is gained. Note that we also run the two classifiers across multiple threads, since the ROIs can be classified independently, yielding further speed savings on multi-core systems.
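The cascade logic and the per-ROI parallelism can be expressed compactly; the sketch below uses a thread pool to score ROIs independently, rejecting each one early with the cheap classifier before invoking the expensive one. The two score_* callables and the threshold values are placeholders.

from concurrent.futures import ThreadPoolExecutor

# Sketch of the two-stage cascade with per-ROI parallelism. `score_small`
# (cheap, 28x14 input) and `score_full` (bimodal 80x40 input) are placeholder
# callables returning a pedestrian score; thresholds are illustrative.

def classify_roi(patch, dpatch, score_small, score_full,
                 small_thresh=-0.5, full_thresh=0.0):
    # Conservative early reject: keep recall high, drop only easy negatives.
    if score_small(patch) < small_thresh:
        return None
    score = score_full(patch, dpatch)
    return score if score >= full_thresh else None

def classify_rois(patches, dpatches, score_small, score_full, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(classify_roi, p, d, score_small, score_full)
                   for p, d in zip(patches, dpatches)]
        return [f.result() for f in futures]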
C. Long-Range Classifier

There are several challenging aspects to classifying detections of pedestrians at long range. First, there are few pixels on target (e.g., fewer than 50), and hence the resulting stereo disparity map will be of much lower quality. This can result in errors in the ground estimation as well as in the stereo calculations themselves, which require some texture that can be eroded by blur. Second, these problems can cause detection windows that are not fully aligned with the pedestrians, a situation that can reduce the classifier's success.
Fig. 5. Left: Recall/FPPW curve demonstrating the performance of convolutional networks ("ConvNet") over HOG features combined with linear SVMs [1] ("HOG") on our dataset ("CID - 80x40"). We also demonstrate the difficulty of the dataset by showing the results of the HOG algorithm on the INRIA dataset; our dataset is significantly more challenging due to longer-range pedestrians. Right: Recall/FPPW curve for results obtained using the long-range classifier trained on pedestrian samples that were 40 pixels in height or less. We compare our results to the original HOG-based classifier ("HOG"), after optimizing the parameters of the HOG-based classifier ("HOG (Best)"), and after upsampling the 36x18 samples to 128x64, for which the HOG classifier was originally designed ("HOG (128x64)").
In order to mitigate these problems, we use a second, appearance-only classifier trained exclusively on pedestrians with 40 pixels or less on target. This classifier is similar in architecture to the previously described classifier, except that it is unimodal (stereo disparity maps are not used) and the input image is a smaller 36x18 ROI. In addition, the training data used to train the classifier differs. Unlike the earlier classifier, the training windows are not resized or resampled, and pedestrians that are smaller or larger than the window are allowed. In other words, the center of the detection is taken, and a window the size of the classifier input (in our case 36x18) is extracted around that center. This training regime specifically builds in robustness to partial overlap with the pedestrian and to reduced image blur.
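The fixed-size, non-resampled window extraction described above can be implemented as a simple center crop; the sketch below pads the image so out-of-bounds windows remain valid, and the names and output size handling are illustrative.

import numpy as np

# Sketch of the long-range training-window extraction: take the center of a
# ground-truth (or detected) box and cut a fixed 36x18 patch around it, with
# no rescaling, so smaller/larger pedestrians and misalignments are tolerated.

def center_crop(gray, box, out_h=36, out_w=18):
    x, y, w, h = box                       # box in pixel coordinates
    cx, cy = x + w // 2, y + h // 2
    top, left = cy - out_h // 2, cx - out_w // 2

    # Pad with edge values in case the window falls partially outside the image.
    padded = np.pad(gray, ((out_h, out_h), (out_w, out_w)), mode="edge")
    top += out_h
    left += out_w
    return padded[top:top + out_h, left:left + out_w]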
In order to combine the two classifiers, the detections are split based on their height in pixels and fed to the respective classifier. The results are then combined to form the final pedestrian detections. Note that since each classifier has its own characteristics in terms of its operating curve, different thresholds are used for each classifier.
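A minimal version of this height-based split is shown below; the 40-pixel split point and the two threshold values are illustrative, and the classifier callables are placeholders.

# Sketch of the height-based split between the long-range and bimodal
# classifiers, each with its own operating threshold. Callables and values
# are placeholders, not the authors' implementation.

def route_and_score(detections, long_range_net, bimodal_net,
                    split_px=40, lr_thresh=0.2, bm_thresh=0.0):
    final = []
    for det in detections:                 # det: dict with patch data and pixel height
        if det["height"] <= split_px:
            score, thresh = long_range_net(det["patch"]), lr_thresh
        else:
            score, thresh = bimodal_net(det["patch"], det["disparity"]), bm_thresh
        if score >= thresh:
            final.append((det, score))
    return final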
Fig. 6. Sample mid-range inputs showing the image and disparity maps.

Fig. 7. Example filters that were learned, and the feature maps after training the convolutional network.

IV. EXPERIMENTAL RESULTS

In order to test the system, over 45 video sequences were gathered in various outdoor environments over a period of 10 months and subsequently ground-truthed (we use the name "CID" to label this dataset in subsequent graphs). Figure 6 shows examples of mid-range detections, and Table I summarizes some statistics of the dataset. The environments ranged from a helipad spanning approximately 60 meters, a parking lot with long rows of cars, an open area with buildings and vegetation, and a forest environment, to an open parking lot spanning more than 100 meters.

Two sets of stereo cameras, each operating at either 1 Hz or 5 Hz, were used to capture the data, and classifier training used data from both cameras. Both rigs consisted of a stereo pair of Prosilica GB2450 cameras operating at a resolution of 2448x2050. One setup used a 180-degree fisheye lens, resulting in an image of size 2881x657 after conversion to the cylindrical image, while the other rig used an 80-degree lens with more pixels on target, with the image cropped to a size of 2448x625. The pedestrians in the dataset were captured at ranges of 10 to 140 meters. In all, over 39,000 frames with over 160,000 annotated pedestrians resulted, with about a third being 50 pixels or less in height.
Fig. 8. Results on one sequence, evaluated on a per-image basis, breaking down the detection and classification results for the fisheye camera (left) and
80 degree camera (right). Over half of the pedestrians are picked up at greater than 50 meters for the fisheye, representing approximately 30 pixels. The
80 degree camera performs even better, picking up approximately 60% of the pedestrians at over 80 meters.
TABLE I
DESCRIPTION OF DATA SETS

Attribute               Fisheye    80 Degree
Frames                  24,572     14,533
Detections              99,522     64,204
Mean Height (Pixels)    87.78      88.37
% less than 50 pixels   32.1       35.6
% less than 40 pixels   14.4       11.6
The datasets also increase in complexity, with the last set consisting of people performing various maneuvers and running, conditions that did not exist in the training set. We will now describe our results for training the bimodal and long-range classifiers on ground-truthed positives and negatives sampled from the images. The positive and negative instances were extracted from the data sets and separated into training, testing, and validation sets. After showing results on the testing set on a per-instance basis, we will show results on entire sequences on a per-image basis.
A. Bimodal Convolutional Network

To train the convolutional network classifier, 25 sequences were combined and all of the positive and negative image patches were shuffled into training, testing, and validation sets. To obtain positive samples, ground-truth ROIs were extracted from the images. Note that a larger border was used during extraction so that the original positive set could be expanded by applying translational and rotational jitter. This process serves to expand the positive set and make it robust to some transformations. Negative patches were extracted by taking random patches that did not intersect with the ground-truth pedestrian windows but that had similar scales and vertical image positions. In all, after jittering the images, 800,000 labeled positives (ROIs containing pedestrians) and negatives (ROIs with no pedestrians) were used. Training took about 2 days running on a single CPU.
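The jittering step described above is a standard augmentation; the sketch below generates translated and slightly rotated crops from a ground-truth patch extracted with extra border. The jitter ranges and sample counts are assumptions.

import numpy as np
import cv2

# Sketch of the translational/rotational jitter used to expand the positive
# set. `patch_with_border` is a ground-truth crop extracted with enough margin
# that shifted/rotated windows stay inside it; ranges are illustrative.

def jitter_patch(patch_with_border, out_h=80, out_w=40, n=8,
                 max_shift=4, max_deg=5.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = patch_with_border.shape[:2]
    cy, cx = h // 2, w // 2
    samples = []
    for _ in range(n):
        angle = rng.uniform(-max_deg, max_deg)
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
        rotated = cv2.warpAffine(patch_with_border, M, (w, h))
        top = cy - out_h // 2 + dy
        left = cx - out_w // 2 + dx
        samples.append(rotated[top:top + out_h, left:left + out_w])
    return samples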
Figure 7 shows an example of the resulting filters that were learned, and Figure 5 (left) shows the results on the testing set. The results are displayed as Recall/FPPW (false positives per window) curves, where the class decision is made by comparing the two numerical outputs of the network classifier. In this case, a bias on the positive class is applied and varied to produce the curve. For comparison, we have also trained a HOG-based SVM classifier with the same training set, tested it on the same testing set, and scored it with the same methodology. As can be seen, the convolutional network achieves significantly better performance on the testing patches (note that the x-axis is log-scale). Some of this gain may be due to the additional modality, but as we will see in the next subsection, the convolutional network outperforms HOG even when both use only appearance. We also display the results of the HOG-based classifier on the original INRIA dataset to demonstrate that our dataset is significantly more challenging than INRIA; this is due to the significant number of long-range pedestrians.

B. Long-Range Classifier

The long-range classifier used 29 sequences, with a similar methodology for extracting positive samples. In this case, however, the samples were not resampled to the classifier input size; instead, the center of each positive ground-truth window that was 40 pixels in height or less was calculated, and a 36x18 patch around this center was extracted. Negative samples consisted of stereo detections that did not intersect with the ground truth. In all, after expansion through jittering, approximately 53k positive and negative samples each were used for training, and approximately 40k positive and negative samples each were used for testing. Training and testing sets were obtained by randomly shuffling the input data.

Figure 5 (right) shows the per-window results of the long-range classifier. Again, we compare these results to the original HOG features with an SVM classifier. Since the image samples were much smaller than the HOG algorithm was designed for, we also varied the number of cells, the cell size, and the distance metric used in the HOG algorithm.
Fig. 9. Results of the classification system on a per-image basis (FPPI is false positives per image), on 4 subsets of the data. We show results for the
fisheye (left) and 80-degree camera (right), and demonstrate competitive results for long-range detections.
The "HOG (Best)" condition shows the results of the best parameter set, consisting of 4 cells, each with a size of 4 pixels, and the L2-Hys metric. We also tested a second condition in which the smaller patches were upsampled to the original 128x64 resolution that the classifier was designed for. This latter condition performed the best for the HOG-based classifier, but was still outperformed by the convolutional neural network.
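For reference, the sketch below computes HOG descriptors for the small 36x18 patches with scikit-image and trains a linear SVM, exposing the cell size and block normalization that were varied in this comparison; the specific values shown are assumptions rather than the exact grid searched in the paper.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Sketch of the HOG + linear SVM baseline on 36x18 grayscale patches, with the
# cell size and block normalization exposed as tunable parameters. Values
# below (4x4-pixel cells, 2x2-cell blocks, L2-Hys) are illustrative.

def hog_features(patches, cell=(4, 4), block=(2, 2), norm="L2-Hys"):
    return np.stack([
        hog(p, orientations=9, pixels_per_cell=cell,
            cells_per_block=block, block_norm=norm)
        for p in patches
    ])

def train_hog_svm(pos_patches, neg_patches):
    X = hog_features(list(pos_patches) + list(neg_patches))
    y = np.r_[np.ones(len(pos_patches)), np.zeros(len(neg_patches))]
    return LinearSVC(C=0.01).fit(X, y)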
C. Results on Data Sequences

Although the results on a per-window basis are informative, the practical deployment of a detection system is determined by its per-image performance. As a result, we have taken four subsets of the data, based on the day they were captured, with increasing levels of difficulty. As mentioned earlier, the last set consisted of people running and performing various maneuvers. Figure 8 shows the results for one sequence only and at one particular operating point (resulting in approximately 1.6 false positives per frame), focusing on the recall as the estimated distance, obtained from stereo, varies. Note that the results in this section did not utilize the long-range classifier; results for the long-range classifier are shown in the next section. We show results for the fisheye camera (left) and the 80-degree camera (right), showing that the latter can detect pedestrians at farther ranges. Even the fisheye, however, can detect pedestrians at greater than 50 meters, corresponding to approximately 30 pixels in height for this lens. The 80-degree camera can detect greater than half of the pedestrians beyond 80 meters.

Figure 9 shows our Recall/FPPI (false positives per image) results for the four subsets using the fisheye camera (left) and for the latter three subsets using the 80-degree camera (right). The same scene for each data subset was recorded by each robot at the same time, although the robots were placed in different positions, so the images and pedestrian distances may vary. Note that the first subset was only collected using the fisheye camera. Overall, competitive detection rates at fewer than 1 false positive per frame can be obtained, especially when using the 80-degree camera and on the earlier datasets that did not involve running and other motions.
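The per-image metrics used here can be computed with a simple greedy overlap matching between detections and ground truth on each frame; the sketch below uses an intersection-over-union criterion of 0.5, which is an assumption about the exact matching rule.

import numpy as np

# Sketch of per-image evaluation: recall and false positives per image (FPPI)
# from per-frame detections and ground-truth boxes, matched greedily by IoU.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def recall_fppi(frames):
    """frames: list of (detections, ground_truths), boxes given as (x, y, w, h)."""
    tp = fp = n_gt = 0
    for dets, gts in frames:
        n_gt += len(gts)
        unmatched = list(gts)
        for d in dets:
            scores = [iou(d, g) for g in unmatched]
            if scores and max(scores) >= 0.5:
                unmatched.pop(int(np.argmax(scores)))
                tp += 1
            else:
                fp += 1
    return tp / max(n_gt, 1), fp / max(len(frames), 1)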
In order to show that the long-range classifier can increase performance on a per-image basis, we ran the entire pipeline on a challenging sequence from Set 4. Figure 10 shows results at different pixel-height splits for one operating point (top) and the entire Recall/FPPI curve (bottom) for the sequence, demonstrating that higher operating points can be achieved on the performance curve when the long-range classifier is used for detections with a height of 40 pixels or less. There is a steeper dropoff at very low FPPI levels, however, possibly resulting from the fact that the long-range classifier does not utilize stereo information.

Figure 11 shows results on a subset of the sequences to show the effect of edge-based filtering of stereo detections, for the three metrics. The left column shows results after the detection stage, and the right column shows results for the classification stage. As can be seen, edge-based filtering significantly reduced the number of detections while causing only very slight decreases in recall. This significantly decreases the number of detections that have to go through the classification stage. After the classifier, recall is very slightly decreased while some false positives that would have passed through the classifier are removed.

Finally, we show timing results for the cascade, in which the lower-resolution (28x14) classifier is used to filter out easy examples early on, avoiding having to run them through the more computationally expensive bimodal classifier. Note that we do not use the long-range classifier or edge-based filtering for these timing results; these would only improve the speed, as the long-range classifier is faster. Figure 12 shows timing results on a quad-core i7 Dell Precision M6500 laptop. Although we do not show it here, there is little to no effect on the accuracy metrics after the cascade, as the lower-resolution classifier has extremely high recall. As can be seen from the timing results, cascading and multi-threading the classification of the detections obtained from stereo significantly reduce the running time, resulting in rates that are close to the capture rate of 5 Hz.
Fig. 11. Results of edge-based filtering of detections in terms of precision (top left), recall (top right), and false positives per frame (bottom), after stereo-based detection and classification. As can be seen, a drastic reduction in the number of detections can be achieved for these datasets.

Fig. 10. Results of incorporating the long-range classifier with varying pixel sizes used to split which detections go to which classifier, at one operating point (black dot, top) and the entire curve when using a pixel height split of 40 pixels (bottom). As can be seen, a boost in performance can be achieved over the original system when the long-range classifier handles detections at 40 pixels or less.

Fig. 12. Timing results on sample frames using the base system, the two-classifier cascade, and after threading.
V. CONCLUSIONS AND FUTURE WORK

This paper demonstrated a system designed for the detection of pedestrians at varying distances, including ranges that have typically not been explored in the literature. The design of the system is targeted towards these longer-range detections through several novel techniques. First, a stereo-based detector is used to avoid having to classify the entire image at many scales, a method that has typically been employed but that is computationally expensive given the higher resolution of the images and the smaller scales of the pedestrians. A second, edge-based filtering step is then employed to significantly reduce the number of false positives.
A cascade of three convolutional network classifiers is then employed, each of which is designed for a particular subproblem. A bimodal classifier is first trained to leverage both appearance and stereo disparity information; we use a deep learning approach to obviate the need to hand-design new stereo features, which are instead automatically learned as part of the optimization process. A second classifier that takes in much smaller inputs is also trained, designed to quickly filter out easier false positives. Finally, a classifier designed specifically for longer ranges, where there are 40 pixels on target or less, is trained in order to increase robustness to the image blur and misalignment of the underlying detections that occur in such cases. We have shown that each of these design decisions has led to a system that is either more accurate or computationally faster. We have also shown that the bimodal and long-range classifiers perform significantly better on a per-window basis than the HOG-based SVM classifier.

There are several avenues of future work that remain. First, there are more sophisticated methods of combining the short-range and long-range classifiers that may yield additional improvements. Second, while this work has made significant headway in the area of pedestrian detection, as mentioned in recent surveys [4] it remains a significant challenge to deploy such systems for applications such as safety, where very few false positives are tolerable and very high recalls are desired.
One topic that has not been explored in this paper is tracking, where temporal information is used to reduce false positives. Another significant challenge for pedestrian detection is the presence of persistent false alarms, for example on objects that resemble humans, such as poles, signs, etc. Here, additional modalities such as dense LIDAR, or classification using multiple views obtained by a mobile robot, may be helpful.
REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", in CVPR, 2005.
[2] P. Felzenszwalb, R. Girshick, and D. McAllester, "Cascade Object Detection with Deformable Part Models", in CVPR, 2010.
[3] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L. H. Matthies, "A fast stereo-based system for detecting and tracking pedestrians from a moving vehicle", The International Journal of Robotics Research, vol. 28, no. 11-12, pp. 1466-1485, 2009.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art", PAMI, 2011.
[5] S. Walk, N. Majer, K. Schindler, and B. Schiele, "New features and insights for pedestrian detection", in IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[6] S. Walk, K. Schindler, and B. Schiele, "Disparity statistics for pedestrian detection: Combining appearance, motion and stereo", in European Conf. Computer Vision, 2010.
[7] C. Wojek and B. Schiele, "A performance evaluation of single and multi-feature people detection", in DAGM Symposium Pattern Recognition, 2008.
[8] S. Paisitkriangkrai, C. Shen, and J. Zhang, "Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features", IEEE Transactions on Circuits and Systems for Video Technology.
[9] A. Yoshizawa, M. Yamamoto, and J. Ogata, "Pedestrian detection with convolutional neural networks", in Proceedings of the Intelligent Vehicles Symposium, 2005.
[10] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning", in Predicting Structured Data, G. Bakir, T. Hofman, B. Scholkopf, A. Smola, and B. Taskar, Eds., MIT Press, 2006.
[11] D. Scaramuzza, A. Martinelli, and R. Siegwart, "A Toolbox for Easily Calibrating Omnidirectional Cameras", in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, October 2006.
[12] M. A. Ranzato, F. J. Huang, Y. L. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition", in CVPR, 2007.
[13] H. Lee, P. T. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks", in NIPS, pp. 1096-1104, 2009.
[14] M. Sizintsev, S. Kuthirummal, S. Samarasekera, R. Kumar, H. S. Sawhney, and A. Chaudry, "GPU accelerated realtime stereo for augmented reality", in Proceedings of the 5th International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), 2010.