Pedestrian Detection For Autonomous Vehicle Using Multi-Spectral Cameras
Abstract—Pedestrian detection is a critical feature of an autonomous vehicle or advanced driver assistance system. This paper presents a novel instrument for pedestrian detection that combines stereo vision cameras with a thermal camera. A new dataset for vehicle applications is built from data recorded by a test vehicle driving on city roads. Data received from the multiple cameras are aligned using the trifocal tensor with pre-calibrated parameters. Candidates are generated from each image frame using sliding windows across multiple scales. A reconfigurable detector framework is proposed, in which feature extraction and classification are two separate stages. The input to the detector can be the color image, disparity map, thermal data, or any of their combinations. When applied to convolutional channel features, feature extraction utilizes the first three convolutional layers of a pre-trained convolutional neural network cascaded with an AdaBoost classifier. The evaluation results show that this significantly outperforms traditional histogram of oriented gradients features. The proposed pedestrian detector with multi-spectral cameras can achieve a 9% log-average miss rate. The experimental dataset is made available at http://computing.wpi.edu/dataset.html.

Index Terms—Multi-spectral camera, autonomous vehicle, pedestrian detection, machine learning.

Manuscript received February 15, 2018; revised June 9, 2018 and September 4, 2018; accepted November 27, 2018. Date of publication March 20, 2019; date of current version May 22, 2019. This work was supported by the U.S. NSF under Grant CNS-1626236. (Corresponding author: Xinming Huang.) The authors are with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIV.2019.2904389

I. INTRODUCTION

Automatic and reliable detection of pedestrians is an important function of an autonomous vehicle or advanced driver assistance system (ADAS). Research works on pedestrian detection depend heavily on data, as different data and methods may yield different evaluation results. The most commonly used sensor in data collection is a regular color camera, and many datasets have been built, such as the INRIA person dataset [1] and the Caltech Pedestrian Detection Benchmark [2]. Thermal cameras have also been considered lately, and different methods of pedestrian detection were developed based on thermal data [3]. It is worth investigating whether the methods developed for one type of sensor data are applicable to other types of sensors. A method may not work anymore once the nature of the data has changed; e.g., finding certain hot objects by an intensity-value threshold on a thermal image is not applicable to a regular color image. Some methods, such as gradient- and shape-based feature extraction, may still be applicable, since an object has similar silhouettes in both color and thermal images. In addition, data from different sensors may contain complementary information, and combining them may result in better performance. Multiple cameras can form stereo vision, which provides additional disparity and depth information. An example of combining stereo vision color cameras and a thermal camera for pedestrian detection can be found in [4].

The data collection environment is also very important. Unlike static cameras for surveillance applications, cameras mounted on a moving vehicle may observe much more complex backgrounds and distance-varied pedestrians. Therefore, it calls for pedestrian detection algorithms different from those used in surveillance camera applications. To use multiple sensors on a vehicle, a cooperative multi-sensor system needs to be designed, and new algorithms that can coherently process multi-sensor data need to be investigated. The contributions of this paper are listed as follows:

1) A multi-spectral camera instrument is designed and assembled on a moving vehicle to collect data for pedestrian detection. These data contain many complex scenarios that are challenging for detection and classification. The experimental dataset is made available at http://computing.wpi.edu/dataset.html.

2) The multi-spectral data are aligned using the trifocal tensor. It is then possible to combine features from different sources and compare their performance.

3) A machine learning based algorithm is employed for pedestrian detection by combining stereo vision and thermal images. Evaluation results show satisfactory performance.

The rest of the paper is organized as follows. Section II provides a summary of related work. Section III describes our instrumental setup for data collection. In Section IV, we propose a framework that combines stereo vision color cameras and a thermal camera for pedestrian detection using different feature extraction methods and classifiers. Performance evaluations are presented in Section V, followed by further discussion in Section VI and conclusions in Section VII.

II. RELATED WORK

There are many existing works on pedestrian detection. The Caltech Pedestrian Detection Benchmark [2] has been widely used by researchers. It contains frames from a single vision camera with pedestrians annotated. Based on
2379-8858 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: MKSSS CUMMINS COLLEGE OF ENGINEERING FOR WOMEN. Downloaded on January 08,2022 at 13:41:07 UTC from IEEE Xplore. Restrictions apply.
212 IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, VOL. 4, NO. 2, JUNE 2019
the CVPR2015 snapshot of the results on the Caltech-USA pedestrian benchmark, it was stated in [5] that at ~95% recall, the state-of-the-art detectors made ten times more errors than the human-eye baseline, which is still a huge gap that calls for research attention. Overall, detector performance has improved as new methods were introduced in recent years. Traditional methods such as Viola–Jones (VJ) [6] and Histogram of Oriented Gradients (HOG) [1] were often included as the baseline. A total of 44 methods were listed in [7] for the Caltech-USA dataset, and 30 of them made use of HOG or HOG-like features. Channel features [8] and convolutional neural networks [9]–[11] also achieved impressive performance on pedestrian detection. The Convolutional Channel Features (CCF) [12], which combine a boosting forest model and low-level features from a CNN, achieved as low as a 19% log-average miss rate (MR) on the Caltech Pedestrian Detection Benchmark. Despite the progressive improvement of detection results on these datasets, color cameras still have many limitations. For instance, color cameras are sensitive to the lighting condition. Most of these detection methods may fail if the image quality is impaired under poor lighting conditions.

Thermal cameras can be employed to overcome some limitations of color cameras, because they are not affected by lighting conditions. Several research works using thermal data for pedestrian detection and tracking were summarized in [3]. Background subtraction was applied in [13] for people detection, since the camera was static. HOG features and a Support Vector Machine (SVM) were employed for classification in [14]. A two-layered representation was described in [15], where the still background layer and the moving foreground layer were separated; shape and appearance cues were used to detect and locate pedestrians. In [16], a window based screening procedure was proposed for potential candidate selection. The Contour Saliency Map (CSM) was used to represent the edges of a pedestrian, followed by AdaBoost classification with adaptive filters. Assuming the region occupied by a pedestrian has a hot spot, candidates were selected based on thermal intensity values in [17] and then classified by an SVM; in addition, both Kalman filter prediction and mean shift tracking were incorporated for further improvement. A new contrast invariant descriptor [18] was introduced for far infrared images, which outperformed HOG features by 7% at 10^-4 FPPW for people detection. The Shape Context Descriptor (SCD) was also used for pedestrian detection in [19], followed by an AdaBoost classifier; the HOG features were considered not suitable for this task because of the small size of the targets, variations of pixel intensities, and lack of texture information. Probabilistic models for pedestrian detection in far infrared images were presented in [20]. The method in [21] found the head regions at the initial stage, then confirmed the detection of a pedestrian by the histograms of Sobel edges in the region.

Stereo vision can provide additional information, such as a disparity map, to better detect people in the frame. RGB-D cameras were used for indoor people detection or tracking in [22], [23], and stereo thermal cameras were used in [24] for pedestrian detection, with the image pixel registration done using a 3D point cloud. The combination of stereo vision cameras and a thermal camera was used in [4]. The trifocal tensor was used to align the thermal image with the color and disparity images. Candidates were selected based on disparity, and HOG features were extracted from color, thermal and disparity images. Concatenated HOG features were then fed to a radial basis function (RBF) SVM classifier to obtain the final decision. An indoor people detection system using stereo vision cameras and a thermal camera was presented in [25]. Instead of the trifocal tensor, 3D point cloud projection was used for image point registration between thermal and color images.

For ADAS applications, pedestrian detection is often challenging because the camera is moving with the vehicle, and the pedestrians are often very small on images due to the distance and image resolution. Several pedestrian detection research works were summarized in [26], including the use of color cameras and thermal cameras, as well as sensor fusion such as radar and stereo vision cameras. A benchmark for multi-spectral pedestrian detection was presented in [27] and several methods were analyzed. However, the color-thermal pairs were manually annotated, and it is unclear if any automatic point registration algorithms were used. Furthermore, more sophisticated applications or systems can be built upon pedestrian detection, such as pedestrian tracking across multiple driving recorders [28] and crowd movement analysis [29].

III. DATA COLLECTION AND EXPERIMENTAL SETUP

A. Data Collection Equipment

To collect on-road data for pedestrian detection, we design and assemble a custom test equipment rig. This design enables the data collection system to be mobile on the test vehicle as well as maintaining calibration between data collection runs. The completed system can be seen in Figure 1.

The stereo vision camera, a StereoLabs ZED, is chosen for providing color images as well as disparity information. The ZED camera can capture high resolution side-by-side video that contains synchronized left and right video streams, and can create a disparity map of the environment in real time using the graphics processing unit (GPU) in the host computer. Furthermore, an easy to use SDK is provided, which allows for camera controls and output configuration. In addition, the on-board cameras are pre-calibrated and come with known intrinsic parameters. This makes image rectification and disparity map generation easier. The rectified images and the disparity map can be obtained by using the SDK, and the point correspondence between the two stereo images can be calculated as x_left = x_right - disparity(x_right, y), where (x_left, y) is the point location in the left image, (x_right, y) is the point location in the right image, and disparity() is the disparity value at the given location.

The thermal camera is a FLIR Vue Pro, a long-wavelength infrared (LWIR) camera. It is an uncooled vanadium-oxide microbolometer touting a 640 x 512 resolution at a full 30 Hz, paired with a 13 mm germanium lens providing a 45 x 35 degree field of view (FOV). This IR camera has a wide -20 to 50 degree operating range, which allows for rugged outdoor use. The thermal camera also provides Bluetooth wireless control and video data recording via its on-board microSD card, as well as an analog video output.
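The left-right correspondence relation above can be sketched in a few lines. This is only an illustration of the stated formula: the function name and the [row, col] indexing convention of the disparity map are our assumptions, not the ZED SDK API.

```python
import numpy as np

def left_point_from_right(x_right, y, disparity_map):
    """Map a pixel in the rectified right image to its match in the left
    image using x_left = x_right - disparity(x_right, y), as stated in the
    text. disparity_map is assumed indexed as [row, col] in right-image
    coordinates (an assumption; the actual SDK convention may differ)."""
    d = disparity_map[y, x_right]
    return x_right - d, y

# Toy example: a constant-disparity map of 8 pixels at 640 x 480.
disparity_map = np.full((480, 640), 8, dtype=np.int32)
x_left, y = left_point_from_right(100, 240, disparity_map)
```

With a disparity of 8 at (100, 240), the matching left-image point is (92, 240).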
CHEN AND HUANG: PEDESTRIAN DETECTION FOR AUTONOMOUS VEHICLE USING MULTI-SPECTRAL CAMERAS 213
Fig. 4. Proper alignment of color and thermal images using trifocal tensor.

... often used for thermal images, because pedestrians are often hotter than the surrounding environment. The ROIs are segmented based on the pixel intensity values. However, we find that the ROI extraction on thermal images does not always work well. The assumption that the pedestrians are hotter is not always true, for various reasons. For instance, a pedestrian wearing heavy layers of clothing does not appear with distinctively high pixel intensity values in a thermal image, and thus a pedestrian cannot be located by simple morphological operations. As another example, a road surface exposed to intense sunlight can have a higher temperature than the human body. Although false positives introduced by hot objects such as vehicle engines can be filtered in later steps, the loss of true positives becomes a serious problem. As a result, we consider the sliding window detection method more reliable in such complex scenarios. The classifier can analyze the windowed samples thoroughly and make an accurate decision. Figure 5 shows some examples of our ...

Fig. 5. Examples of pedestrians in color and thermal images.

D. Detection

In this paper, we only compare the HOG and CCF methods for the task of pedestrian detection. The reason is explained in Section VI-A. The HOG method used in this paper is based on [1].

The HOG features have been widely used in object detection. The method defines overlapped blocks in a windowed sample, and cells within blocks. The histograms of the unsigned gradients in several different directions are computed in all blocks and concatenated as features. The HOG features are often combined with an SVM and the sliding window method for detection at different scaling levels.

At the training stage, the positive samples are manually labeled. The initial negative samples are randomly selected from the training images as long as they do not overlap with the positive samples. All samples are scaled to a standard window size of 20 x 40 for training. The size of the minimum sample in our data is 11 x 22. After the initial training, the detector is tested on the training set and more false positives are added back to the negative sample set. These false positives are often called hard negatives, and this procedure is often called hard negative mining. It can be repeated a few times until the performance improvement becomes marginal.

Once the detector is trained, it is ready to perform detection on the test dataset and give a decision score for each window. Each frame, with an original size of 640 x 480, is scaled into different sizes. The detector with a fixed size of 20 x 40 is then applied to the scaled images to find pedestrians of various sizes at different locations in a frame.

CCF uses low level features from a pre-trained CNN model, cascaded with a boosting forest model such as Real AdaBoost [32] as a classifier. The lower level features from the first few CNN layers are considered generic descriptors for objects, which contain richer information than channel features. Meanwhile, the boosting forest model replaces the remaining parts of the CNN. Thus we avoid training a complete end-to-end CNN model for a specific object detection application, which would require large resources of computation, storage and time. In our experiment, we apply similar settings as described in [12], except for the parameters of the scales and number of octaves, in order to detect pedestrians far away that are as small as 20 x 40 pixels.
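The sliding-window detection loop described above can be sketched as follows. This is a deliberately simplified illustration: hog_like_features computes a single orientation histogram over the window rather than the full block-and-cell normalized HOG, and the score function is a stand-in for the trained SVM.

```python
import numpy as np

def hog_like_features(window, n_bins=9):
    """Very simplified HOG-style descriptor: one histogram of unsigned
    gradient orientations over the whole window, magnitude-weighted and
    L2-normalized. Real HOG uses overlapping blocks of cells."""
    gy, gx = np.gradient(window.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-9)

def sliding_window_detect(image, score_fn, win=(40, 20), stride=10, thresh=0.5):
    """Scan one pyramid scale with a fixed 20 x 40 (w x h) window and
    return (x, y, score) for every window the scorer accepts."""
    h, w = win
    detections = []
    for y in range(0, image.shape[0] - h + 1, stride):
        for x in range(0, image.shape[1] - w + 1, stride):
            s = score_fn(hog_like_features(image[y:y + h, x:x + w]))
            if s > thresh:
                detections.append((x, y, s))
    return detections

# Toy usage with a stand-in scorer; a trained linear SVM would go here.
img = np.random.default_rng(0).random((120, 160))
dets = sliding_window_detect(img, score_fn=lambda f: float(f.max()))
```

In the full pipeline this scan is repeated on each level of the image pyramid, so the fixed-size detector finds pedestrians of different sizes.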
Fig. 6. The relationship between the mean disparity and the height of an object.

Fig. 7. Performance of different input data combinations, all using HOG features.

E. Information Fusion

The idea of combining the information from the color image, disparity map and thermal data for decision making is referred to as information fusion. One approach is to concatenate these features together [4]. A single classifier can be trained on the concatenated features, and the final decisions on the test instances can be obtained from the classifier. This approach has the disadvantage that classifier training becomes a challenge as the dimension of the features increases. Furthermore, if a new type of feature needs to be added or an existing feature needs to be removed, the classifier needs to be re-trained, which is time consuming.

An alternative approach to information fusion is to employ multiple classifiers; an example can be found in [35]. Each classifier makes a decision on a certain type or subset of features, and the final result is obtained by using a decision fusion technique such as majority voting or the sum rule [36]. This approach has the advantage that the structure of the system is reconfigurable. Without re-training the classifiers, adding or removing different types of features becomes very convenient. Therefore, we choose the latter approach to make our system reconfigurable, so that it can evaluate various settings and methods. Specifically, an SVM is used at the decision fusion stage and its inputs are the confidence scores from the classifiers in the previous stage, which is more appropriate than commonly used statistical decision fusion methods in the case of multi-source data [37], [38]. The data from different sources are often not equally reliable, and so are the classifiers. The confidence scores must be weighted when obtaining the final decision from information fusion.
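As an illustration of this weighted decision-fusion stage, the sketch below learns per-modality weights over classifier confidence scores. A least-squares linear scorer is used here as a stand-in for the fusion SVM, and all scores and labels are made-up toy values, not data from the paper.

```python
import numpy as np

# Each row: confidence scores from per-modality classifiers
# (color, disparity, thermal) for one candidate window; labels are
# 1 for pedestrian, 0 for background. All values are illustrative.
scores = np.array([
    [0.9, 0.4, 0.8],   # pedestrian: strong color and thermal response
    [0.8, 0.3, 0.9],
    [0.2, 0.5, 0.1],   # background: weak color/thermal response
    [0.1, 0.6, 0.2],
], dtype=float)
labels = np.array([1.0, 1.0, 0.0, 0.0])

# Fit per-modality weights (plus a bias) by least squares; the learned
# weights play the role of the fusion stage's reliability weighting.
X = np.hstack([scores, np.ones((len(scores), 1))])
w, *_ = np.linalg.lstsq(X, labels, rcond=None)

def fuse(confidences):
    """Weighted fusion of modality confidences into one decision score."""
    return float(np.append(confidences, 1.0) @ w)

decision = fuse([0.85, 0.35, 0.9]) > 0.5  # classify a new candidate
```

The point of the design is visible in the fitted weights: a modality whose scores track the labels poorly (here, disparity) receives little weight in the final decision.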
F. Additional Constraints

1) Disparity-size: Besides the features extracted from an image frame, additional constraints can be incorporated into the decision fusion stage to further improve the detector performance. An example is the disparity-size relationship. Figure 6 shows the disparity and height relationship of the positive samples in the form of a linear regression line d = [h 1] x B, where d is the mean disparity, h is the height of the sample, and B is a 2 x 1 coefficient matrix. Given a pair of mean disparity d_hat and height h_hat of a sample, the residual r = |d_hat - [h_hat 1] x B| can be used to estimate whether this sample is possibly a pedestrian or not.

From Figure 6 we can see that a number of samples have very small mean disparity and are far below the regression line. This is because the disparity information is not accurate when an object is far away from the camera. In fact, the stereo vision camera we use automatically clamps the disparity value at a certain distance. Objects beyond that distance result in zero disparity, which makes the estimation for small-size samples inaccurate.

2) Road Horizon: During detection, a few reasonable assumptions can be made to filter out more false positives while retaining the true positives. The assumptions vary depending on the application, including color, shape, position, etc. One assumption here is that pedestrians stand on the road, i.e., the lower bound of a pedestrian must be below the road horizon. The road horizon can be automatically detected in an image. This kind of simple constraint may or may not improve the detector performance, and experiments should be carried out to determine its effectiveness.

V. PERFORMANCE EVALUATION

There are a total of 58 labeled video sequences in our dataset. We use 39 of them for training and the remaining 19 for testing. Figure 7 shows the performance of different settings, including the disparity map, color image, thermal data, and their combinations, all based on HOG features.
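The log-average miss rate (MR) used as the comparison metric throughout this evaluation can be computed as in the sketch below. It follows the standard Caltech-style protocol of averaging the miss rate over nine FPPI points log-spaced in [10^-2, 10^0]; the step-wise sampling of the curve is our assumption, not a detail taken from the paper, and the toy curve values are made up.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, n=9):
    """Caltech-style log-average MR: sample the miss-rate-vs-FPPI curve
    at n reference points evenly spaced in log space over [lo, hi] and
    average in the log domain. The curve is assumed sorted by FPPI."""
    ref = np.logspace(np.log10(lo), np.log10(hi), n)
    samples = []
    for r in ref:
        # Step-wise sampling: take the miss rate at the largest FPPI
        # on the curve that does not exceed the reference point.
        idx = np.searchsorted(fppi, r, side="right") - 1
        samples.append(miss_rate[max(idx, 0)])
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))

# Toy detector curve: miss rate falls as FPPI grows.
fppi = np.array([0.01, 0.03, 0.1, 0.3, 1.0])
mr = np.array([0.5, 0.35, 0.2, 0.12, 0.09])
lamr = log_average_miss_rate(fppi, mr)
```

Averaging in the log domain keeps the metric from being dominated by the high-FPPI end of the curve, which is why it is the conventional single-number summary for pedestrian detectors.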
Generally, the more types of information are used, the better the performance that is achieved. The disparity-only setup performs the worst. The color image only is better, followed by the combination of color and disparity. Note that the thermal-only setup outperforms the combination of color and disparity: the heat signature of pedestrians seems more recognizable in thermal images. The combination of color, thermal and disparity information achieves the best performance, with about a 36% log-average miss rate.

Fig. 8. Performance improvement by adding disparity-size and road horizon constraints.

Figure 8 shows the performance of the HOG features with the disparity-size information and the road horizon constraint added. The road horizon improves the log-average MR by about 5%. Despite the little improvement provided by adding the disparity-size information alone, the combination of both provides nearly a 7% improvement in log-average MR.

Fig. 9. Performance of different input data combinations, all using CCF.

Figure 9 shows the performance of different settings using CCF. The performance of disparity only is the worst. Thermal ...

VI. DISCUSSION

A. Why HOG and CCF?

While there are more advanced deep learning networks that have better performance on the Caltech-USA dataset, we only compare the HOG and CCF methods for the task of pedestrian detection, for the following reasons:

1) The HOG method was always included as a baseline on the Caltech-USA dataset. Among the 44 methods reported on the Caltech-USA dataset [7], 30 employed HOG or HOG-like features.

2) The CCF achieved good performance on the Caltech-USA dataset. The idea of combining low-level CNN features and a boosting forest model avoids training a CNN from end to end, which requires a huge amount of data and is time consuming. The advantage of CCF is especially obvious in this paper, where our dataset is relatively small and different combinations of features are used as input data. Training different versions of CNNs to find the best combination can be done when more data become available in the future.

3) The goal of this paper is to investigate the combination of multi-spectral cameras and its improvement on pedestrian detection. We make our dataset public, so other researchers can continue this study and discover better solutions in the future.

B. How to Interpret the Results?

As shown in Figure 9 and explained in Section V, the best performance is achieved when combining color and thermal data, and introducing disparity as an additional feature does not improve the performance. However, this does not mean the disparity information is useless, nor that stereo vision is unnecessary. As described in Section IV-B, the trifocal tensor must be employed to align the thermal and color data, which requires disparity information. It is impossible to align the color data with the thermal data using a single color camera and a single thermal camera, because the entire image cannot be transformed using point matching techniques, due to the difference in nature between color and thermal data.

On the other hand, the performance is still highly dependent on the instrument. Our thermal camera has a resolution
of 640 x 480, which is relatively low. To accommodate the resolution and FOV of the thermal camera, the color cameras have to be set to the same resolution. In addition, color cameras are sensitive to the lighting condition, and therefore the quality of the image sometimes cannot be guaranteed. Figure 10 shows an example, with a bounding box drawn on the detected pedestrian in both the color and thermal images. It is obvious that the thermal image provides much better information about the presence of the pedestrian, while it is hardly identifiable in the color image due to the shadow.

Fig. 10. A pedestrian is embedded in the shadow of a color image.

Although thermal images seem to be dominant in our experiment, their reliability still needs improvement. Figure 11 shows a thermal image taken on a hot sunny day. The two pedestrians circled are not bright enough compared to the surroundings, which contradicts the assumption of distinct thermal intensity made in many existing research works. In this case, methods or operations based on pixel intensity values become unreliable, such as intensity thresholding, head recognition using hot spots, etc. On the contrary, some shape or gradient based methods may still perform well, such as the HOG and CCF described in this paper.

Fig. 11. An example thermal image with two pedestrians.

Finally, it is worth noting that possible camera parameter estimation errors may have an impact on the performance. The feature extraction across images requires accurate image point registration, which cannot be done without accurate camera parameters. Therefore, the possible camera parameter estimation errors should be minimized during the calibration stage, possibly by using more point pairs over a set of images.

VII. CONCLUSIONS

In this paper, a novel pedestrian detection instrument is designed using both thermal and RGB-D stereo cameras. Data are collected from on-road driving, and an experimental dataset is built with pedestrians labeled as ground truth. A reconfigurable multi-stage detection framework is proposed. The trifocal tensor is used to align data from the multiple cameras. It is then possible to combine features from different sources and compare their performance. Both HOG and CCF based detection methods are evaluated using the multi-spectral dataset with various combinations of thermal, color, and disparity information. The experimental results show that CCF significantly outperforms the HOG features. The combination of color and thermal images using the CCF method results in the best performance of a 9% log-average miss rate. For future work, other advanced feature extraction and classification methods will be considered to further improve the pedestrian detector performance.

REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, vol. 1, pp. 886-893.
[2] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 304-311.
[3] R. Gade and T. B. Moeslund, "Thermal cameras and applications: A survey," Mach. Vis. Appl., vol. 25, no. 1, pp. 245-262, 2014.
[4] S. J. Krotosky and M. M. Trivedi, "On color-, infrared-, and multimodal-stereo approaches to pedestrian detection," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 4, pp. 619-629, Dec. 2007.
[5] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" CoRR, vol. abs/1602.01237, 2016.
[6] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137-154, 2004.
[7] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, "Ten years of pedestrian detection, what have we learned?" CoRR, vol. abs/1411.4304, 2014.
[8] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in Proc. British Mach. Vis. Conf., 2009, pp. 91.1-91.11.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1097-1105.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[11] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1-9.
[12] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Convolutional channel features for pedestrian, face and edge detection," CoRR, vol. abs/1504.07339, 2015.
[13] W. Li, D. Zheng, T. Zhao, and M. Yang, "An effective approach to pedestrian detection in thermal imagery," in Proc. 8th Int. Conf. Natural Comput., May 2012, pp. 325-329.
[14] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian detection using infrared images and histograms of oriented gradients," in Proc. IEEE Intell. Veh. Symp., 2006, pp. 206-212.
[15] C. Dai, Y. Zheng, and X. Li, "Pedestrian detection and tracking in infrared imagery using shape and appearance," Comput. Vis. Image Understanding, vol. 106, no. 2/3, pp. 288-299, 2007.
[16] J. W. Davis and M. A. Keck, "A two-stage template approach to person detection in thermal imagery," in Proc. 7th IEEE Workshop Appl. Comput. Vis. / IEEE Workshop Motion Video Comput., 2005, vol. 1, pp. 364–369.
[17] F. Xu, X. Liu, and K. Fujimura, "Pedestrian detection and tracking with night vision," IEEE Trans. Intell. Transp. Syst., vol. 6, no. 1, pp. 63–71, Mar. 2005.
[18] D. Olmeda, A. de la Escalera, and J. M. Armingol, "Contrast invariant features for human detection in far infrared images," in Proc. IEEE Intell. Veh. Symp., Jun. 2012, pp. 117–122.
[19] W. Wang, J. Zhang, and C. Shen, "Improved human detection and classification in thermal images," in Proc. IEEE Int. Conf. Image Process., Sep. 2010, pp. 2313–2316.
[20] M. Bertozzi, A. Broggi, C. H. Gomez, R. I. Fedriga, G. Vezzoni, and M. DelRose, "Pedestrian detection in far infrared images based on the use of probabilistic templates," in Proc. IEEE Intell. Veh. Symp., Jun. 2007, pp. 327–332.
[21] T. T. Zin, H. Takahashi, and H. Hama, "Robust person detection using far infrared camera for image fusion," in Proc. 2nd Int. Conf. Innovative Comput., Inf. Control, Sep. 2007, pp. 310–310.
[22] L. Spinello and K. O. Arras, "People detection in RGB-D data," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sep. 2011, pp. 3838–3843.
[23] M. Munaro, F. Basso, and E. Menegatti, "Tracking people within groups with RGB-D data," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 2101–2107.
[24] M. Bertozzi, A. Broggi, A. Lasagni, and M. D. Rose, "Infrared stereo vision-based pedestrian detection," in Proc. IEEE Intell. Veh. Symp., Jun. 2005, pp. 24–29.
[25] I. R. Spremolla, M. Antunes, D. Aouada, and B. E. Ottersten, "RGB-D and thermal sensor fusion-application in person tracking," in Proc. Int. Joint Conf. Comput. Vis., Imag. Comput. Graphics Theory Appl., 2016, vol. 3, pp. 612–619.
[26] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, "Survey of pedestrian detection for advanced driver assistance systems," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1239–1258, Jul. 2010.
[27] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, "Multispectral pedestrian detection: Benchmark dataset and baseline," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1037–1045.
[28] K. H. Lee and J. N. Hwang, "On-road pedestrian tracking across multiple driving recorders," IEEE Trans. Multimedia, vol. 17, no. 9, pp. 1429–1438, Sep. 2015.
[29] W. Liu, R. W. H. Lau, X. Wang, and D. Manocha, "Exemplar-AMMS: Recognizing crowd movements from pedestrian trajectories," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2398–2406, Dec. 2016.
[30] R. Brehar, C. Vancea, T. Mariţa, I. Giosan, and S. Nedevschi, "Pedestrian detection in the context of multiple-sensor data alignment for far-infrared and stereo vision sensors," in Proc. IEEE Int. Conf. Intell. Comput. Commun. Process., Sep. 2015, pp. 385–392.
[31] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[32] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Mach. Learn., vol. 37, no. 3, pp. 297–336, 1999. [Online]. Available: http://dx.doi.org/10.1023/A:1007614523901
[33] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[34] Y. Jia et al., "CAFFE: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[35] M. Rohrbach, M. Enzweiler, and D. M. Gavrila, "High-level fusion of depth and intensity for pedestrian classification," in Proc. Joint Pattern Recognit. Symp., 2009, pp. 101–110.
[36] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[37] B. Waske and J. A. Benediktsson, "Fusion of support vector machines for classification of multisensor data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.
[38] R. Pouteau, B. Stoll, and S. Chabrier, "Support vector machine fusion of multisensor imagery in tropical ecosystems," in Proc. 2nd Int. Conf. Image Process. Theory Tools Appl., Jul. 2010, pp. 325–329.

Zhilu Chen received the B.E. degree in microelectronics from Xi'an Jiaotong University, Xi'an, China, in 2011, and the M.S. degree in electrical and computer engineering in 2013 from Worcester Polytechnic Institute, Worcester, MA, USA, where he is currently working toward the Ph.D. degree at the Department of Electrical and Computer Engineering. His research interests include computer vision, machine learning, and GPU acceleration for advanced driver assistance systems.

Xinming Huang (M'01–SM'09) received the Ph.D. degree in electrical engineering from Virginia Tech, Blacksburg, VA, USA, in 2001. From 2001 to 2003, he was a member of Technical Staff with Bell Labs of Lucent Technologies. He is currently a Professor with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA, USA. His research interests include integrated circuits and embedded systems, with emphasis on reconfigurable computing, wireless communications, information security, computer vision, and machine learning.