A Multispectral Feature Fusion Network for Robust Pedestrian Detection (Alexandria Engineering Journal, 2020)
School of Electronic and Information Engineering, Xi’an Technological University, Xi’an 710021, China
KEYWORDS: Multispectral feature fusion; Pedestrian detection; Visible light images; Infrared images

Abstract: The multispectral information, including both visible and infrared information, can describe the detection target in a comprehensive manner. Deep learning (DL)-based detectors that fuse multispectral features can detect pedestrians robustly in various environments. Therefore, this paper puts forward a robust multispectral feature fusion network (MSFFN) for pedestrian detection, which fully integrates the features extracted from the visible light and infrared channels. Specifically, multiscale semantic features were extracted by two core modules, namely multiscale feature extraction of visible images (MFEV) and multiscale feature extraction of infrared images (MFEI), and fused by the improved YOLOv3 network for pedestrian recognition. Experiments on the KAIST dataset prove that the MSFFN model detects pedestrians more accurately than either the MFEV or the MFEI alone on daytime and nighttime images across multiple scales. The experimental results on the KAIST multispectral dataset also show that the proposed MSFFN model is superior to a number of state-of-the-art multispectral pedestrian detection methods in accuracy and speed, strikes a good balance between the two, and performs well on small input images. The research results shed important new light on the design of self-driving vehicles.

© 2020 Production and hosting by Elsevier B.V. on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
pedestrian detectors are solely based on visible information, and thus sensitive to ambient brightness [4]. Against this backdrop, infrared imagers have aroused much interest, because infrared images are not susceptible to light. But infrared pedestrian detectors face a severe defect: infrared images have a low resolution and few color features [5].

To overcome the above defect, multispectral information (e.g. visible information and infrared information) should be considered in pedestrian detectors, such that the detectors can adapt to various brightness levels. In the past few years, many multispectral pedestrian detectors have been developed, making pedestrian detection accurate and stable around the clock [6].

The remainder of this paper is organized as follows: Section 2 reviews the literature on pedestrian detection; Section 3 improves the YOLOv3 network; Section 4 sets up the proposed model; Section 5 validates our model through experiments; Section 6 puts forward the conclusions.

2. Literature review

Currently, pedestrians are detected either by traditional approaches or based on deep learning (DL).

The traditional approaches mainly express features according to the relationship between adjacent pixels of the image. Dalal and Triggs [7] proposed the histogram of oriented gradients (HOG), which counts occurrences of gradient orientation in localized portions of an image. However, the real-time performance of the HOG is poor, for the HOG features are high-dimensional, pushing up the computing load. Kim and Cho [8] developed the integral histogram to speed up the computation of HOG features, but failed to reduce the dimensionality of such features. Ojala et al. [9] put forward the local binary pattern (LBP) operator, which constructs a rotation- and grayscale-invariant feature map through pairwise comparison between neighboring pixels. Mu et al. [10] detected pedestrians accurately with the LBP operator as the feature descriptor. Gavrila [11] developed a rapid and accurate method for pedestrian detection called template matching, but this method requires manual marking of templates and generalizes poorly. Overall, the traditional approaches rely on manual design, failing to capture the changing features of pedestrians in complex scenes.
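To make these hand-crafted descriptors concrete, the sketch below computes HOG and LBP features for a grayscale window with scikit-image; the 128*64 window, 8*8 cells and 8-neighbour LBP are classic illustrative settings, not parameters taken from [7–10].

```python
# Hand-crafted HOG and LBP features for a pedestrian-sized window (illustrative sketch;
# the 128x64 window and 8x8 cells follow the classic Dalal-Triggs setup, not this paper).
import numpy as np
from skimage import data, transform
from skimage.feature import hog, local_binary_pattern

window = transform.resize(data.camera(), (128, 64))          # stand-in for a detection window

hog_vec = hog(window, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), block_norm='L2-Hys')    # high-dimensional gradient histogram

gray_u8 = (window * 255).astype(np.uint8)
lbp_map = local_binary_pattern(gray_u8, P=8, R=1, method='uniform')  # grayscale/rotation-invariant codes
lbp_hist, _ = np.histogram(lbp_map, bins=np.arange(11), density=True)

print(hog_vec.shape, lbp_hist.shape)    # (3780,) (10,)
```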
In recent years, DL has been widely applied in pedestrian detection, thanks to its application potential in self-driving vehicles and target tracking. Unlike the traditional approaches, DL can extract high-level semantics of images through deep convolutional neural networks (CNNs), and thus describe pedestrians better. The two most advanced DL-based pedestrian detection tools are the faster region-based CNN (Faster R-CNN) [12,13] and the You Only Look Once (YOLO) network [14]. The Faster R-CNN, extended from a CNN with multiple feature layers, provides a universal framework for pedestrian detection under the lack of small-scale features [15,16]. Under this two-level framework, each candidate region is inputted to the CNN for convolution, which slows down the detection and pushes up the computing load [12]. These disadvantages are fatal to self-driving vehicles, whose operation requires high real-timeliness and precision.

The YOLO network adopts a single-stage testing framework: the region proposal network (RPN) is coupled with the Faster R-CNN to generate the candidate regions; then, the target boundary is identified through classification of the candidate regions; the entire image is fed in by modifying the network inputs; finally, the location and category of the bounding box are regressed in the output layer. Xu et al. [17] designed the YOLOv2, a cascaded network for unified and real-time target detection, which achieves both high recall and high efficiency in pedestrian detection in complex scenes. The YOLOv2 relies on the RPN to compute the candidate regions of pedestrians, and employs the boosted cascaded forest [11] to classify the candidate regions by reweighting the samples. Through end-to-end training and real-time detection [18], the YOLO network is much faster than the Faster R-CNN [19], providing a truly real-time target detector based on DL.

Most studies on pedestrian detection for self-driving vehicles focus on the visible light spectrum of images [20]. However, the visible light sensing technique may cease to be effective at night or under dim light. Hence, it is imperative to install thermal infrared sensors on the vehicle, which differentiate the target(s) from the background based on their difference in thermal radiation. Thermal infrared images have distinctive advantages for pedestrian detection at night, for the wavelength of human body radiation falls near 9.3 μm. Albeit lacking the texture and color of the background [21], infrared images can be produced throughout the day, and help identify vehicles at night, facilitating real-time decisions on collision avoidance. The thermal infrared and visible light reflections from the background should therefore be combined to improve the performance of target detection.

To sum up, the multispectral information, including both visible information and infrared information, can describe the target in a comprehensive manner. DL-based detectors that fuse the multispectral features can detect pedestrians robustly in various environments. In the light of the above, this paper firstly improves the YOLOv3, a robust DL framework for target detection. On this basis, multiscale feature extraction of visible images (MFEV) and multiscale feature extraction of infrared images (MFEI) were combined into a multispectral feature fusion network (MSFFN). Based on multispectral information, the MSFFN can detect pedestrians accurately in real time, especially if the pedestrian images are small and occluded. The research results enjoy a great application potential in self-driving vehicles.

3. Improvement of YOLOv3 network

The YOLOv3 is a robust network to detect two or several targets that are close to each other [22]. This section improves the structure of the traditional YOLOv3 to realize pedestrian detection in various environments.

3.1. The structure of YOLOv3 network

The YOLOv3 network adopts the one-stage detection method [12]. The overall framework of the network is illustrated in Fig. 1.
The YOLOv3 network down-samples the feature map with a fully convolutional layer, rather than the pooling layer used in YOLOv2. Besides, the residual structure is introduced to prevent vanishing and exploding gradients, which are unfavorable to network training. In this way, it is possible to build up and train a 53-layer deep network with a high accuracy. The YOLOv3 network can be trained with samples on multiple scales. The detection accuracy of the network is positively correlated with the fineness of the grids. However, the network cannot work rapidly and accurately at the same time.
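The two building blocks mentioned here can be sketched in a few lines of PyTorch: a convolution–batch-normalization–LeakyReLU unit whose stride-2 convolution replaces pooling, and a residual unit that adds its input back to its output. The channel widths below are illustrative assumptions, not the authors' released implementation.

```python
# DarkNet-53-style building blocks (sketch): a stride-2 convolution replaces pooling,
# and residual connections ease the training of a deep backbone.
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Residual(nn.Module):
    """1x1 bottleneck followed by a 3x3 convolution, added back to the input."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, 1),
            conv_bn_leaky(channels // 2, channels, 3),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 64, 208, 208)
down = conv_bn_leaky(64, 128, 3, stride=2)   # down-sampling without a pooling layer
y = Residual(128)(down(x))
print(y.shape)                               # torch.Size([1, 128, 104, 104])
```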
3.2. The structure of improved YOLOv3 network

Using the end-to-end detection method, this paper adds multiscale prediction to improve the YOLOv3 network. Three scales were selected for targets of different sizes, and pedestrian information was added to the location information of the bounding box. The overall structure of the improved network is presented in Fig. 2.

The improved network combines the merits of the YOLOv2 network and the feature pyramid network (FPN), aiming to enhance the recognition ability of the multiscale network on small targets. To identify targets, the FPN extracts pixels accurately from the top of each layer, rather than randomly selecting semantic information. Several convolutional layers are added to extract deep features. Thus, the FPN utilizes the semantic information of low-level features and high-level features at the same time. The features from different layers are fused to achieve a good predictive effect. Hence, the network can learn more meaningful semantic information. The scale is selected based on the size of the target to be detected: in general, small targets are detected on a large scale, and large targets on a small scale. The multiscale feature map is obtained through up-sampling. In this way, the improved network can detect small targets in an accurate and fast manner.
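The up-sample-and-fuse idea can be illustrated with the following PyTorch sketch, which assumes three backbone maps of 256, 512 and 1024 channels at 52*52, 26*26 and 13*13; the lateral channel width, the simple concatenation and the output depth are assumptions for illustration rather than the exact improved-YOLOv3 graph.

```python
# FPN-style multiscale fusion (sketch): the deepest map is up-sampled and concatenated
# with shallower maps, so each prediction scale sees both semantics and spatial detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHead(nn.Module):
    def __init__(self, c_small=1024, c_mid=512, c_large=256, out_ch=255):
        super().__init__()
        self.lateral_small = nn.Conv2d(c_small, 256, 1)
        self.lateral_mid = nn.Conv2d(c_mid + 256, 256, 1)
        self.lateral_large = nn.Conv2d(c_large + 256, 256, 1)
        self.heads = nn.ModuleList([nn.Conv2d(256, out_ch, 1) for _ in range(3)])

    def forward(self, f52, f26, f13):
        p13 = self.lateral_small(f13)                                        # 13x13: large targets
        p26 = self.lateral_mid(torch.cat([f26, F.interpolate(p13, scale_factor=2)], dim=1))
        p52 = self.lateral_large(torch.cat([f52, F.interpolate(p26, scale_factor=2)], dim=1))
        return [head(p) for head, p in zip(self.heads, (p13, p26, p52))]

outs = MultiScaleHead()(torch.randn(1, 256, 52, 52),
                        torch.randn(1, 512, 26, 26),
                        torch.randn(1, 1024, 13, 13))
print([o.shape[-1] for o in outs])   # [13, 26, 52]
```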
4. Design of the MSFFN

Currently, pedestrian detection is mainly realized based on visible information. But the detection effect is poor at night or under dim light. To solve the problem, the visible light and infrared images were fused for pedestrian detection under complex environments.

4.1. MFEV and MFEI modules

YOLO is a simple network with a fast speed. The YOLOv3 network can detect two or more targets that are close to each other in a robust manner. In the above section, the YOLOv3 network was improved with the FPN to better recognize targets of different sizes.

As mentioned before, the multispectral information, including visible light information and infrared information, can greatly promote target recognition. To extract multiscale features, visible light and infrared images are imported to the MFEV and MFEI modules. The two modules are the core of the MSFFN design. Fig. 3 explains the structures of the two modules.
As shown in Fig. 3(a), the MFEV module receives a 416*416 input image with 3 channels, and implements the multiscale training of the YOLOv3 network. A series of 3*3 and 1*1 convolutional kernels are deployed in the module. The 3*3 kernels increase the number of channels of the feature map, while the 1*1 kernels compress the feature map processed by the corresponding 3*3 kernels. Each convolutional layer performs a batch normalization (BN) operation on its input. Each input image goes through 53 convolutional layers and 5 residual layers with different scales and depths, and then through layers 75–105 of the YOLOv3 network. The feature fusion layer is divided into three scales (13*13, 26*26 and 52*52). The features on each scale are accumulated into a feature map, and then convoluted by 3*3 and 1*1 kernels. Hence, the features on different scales are mapped to channels Yv1 (13*13), Yv2 (26*26) and Yv3 (52*52), respectively. In this way, the MFEV can extract feature maps of multiscale information, laying a solid basis for accurate detection. The MFEI works on a similar principle as the MFEV.

4.2. MSFFN

The MSFFN is a new hybrid pedestrian detection network that extracts multiscale features under various light scenes. Fig. 4 introduces the MSFFN structure with a pair of visible light and infrared images. Firstly, the semantic features of a single channel (Yv1, Yv2, Yv3, YI1, YI2 and YI3) are extracted through the MFEV and MFEI modules, respectively. Next, the visible light and infrared images are fused through the extraction of multiscale features and the fusion of multispectral feature maps. On this basis, the MSFFN judges whether pedestrians exist in the scene. The detection results are given in the green box at the top of Fig. 4. In the visible channel, the pedestrian at the bottom is likely to be ignored, due to the shadowing effect of the tree. On the contrary, the image in the infrared channel is crystal clear, enhancing the reliability of pedestrian detection. Therefore, the MSFFN can extract meaningful multiscale feature maps to achieve accurate detection of pedestrians.
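A minimal sketch of this per-scale fusion is given below, assuming the visible and infrared feature maps (Yv1–Yv3 and YI1–YI3) are concatenated along the channel axis and mixed by a 1*1 convolution on each scale; the actual MSFFN fusion layer may differ in detail.

```python
# Per-scale fusion of visible (Yv) and infrared (YI) feature maps (sketch):
# channel-wise concatenation followed by a 1x1 convolution on each of the
# three scales (13x13, 26x26, 52x52).
import torch
import torch.nn as nn

class SpectralFusion(nn.Module):
    def __init__(self, channels=(1024, 512, 256)):
        super().__init__()
        self.mix = nn.ModuleList([nn.Conv2d(2 * c, c, kernel_size=1) for c in channels])

    def forward(self, visible_feats, infrared_feats):
        # visible_feats / infrared_feats: lists [Yv1, Yv2, Yv3] and [YI1, YI2, YI3]
        return [m(torch.cat([v, i], dim=1))
                for m, v, i in zip(self.mix, visible_feats, infrared_feats)]

sizes = (13, 26, 52)
vis = [torch.randn(1, c, s, s) for c, s in zip((1024, 512, 256), sizes)]
ir  = [torch.randn(1, c, s, s) for c, s in zip((1024, 512, 256), sizes)]
fused = SpectralFusion()(vis, ir)
print([f.shape for f in fused])   # same shapes as the single-channel maps
```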
Fig. 3 Structures of the MFEV (a) and MFEI (b) modules. Each branch takes a 416*416*3 input (visible or infrared image) and stacks repeated 1*1 and 3*3 convolutional blocks with residual connections (1×, 2×, 8×, 8× and 4× repetitions).
The visible light and infrared images from the KAIST dataset were adopted to test the effectiveness of the MSFFN (Fig. 5). The KAIST dataset is a public database of 2252 pairs of images with ,356 pedestrian annotations. These images are captured by normal cameras or infrared sensors in the daytime or at nighttime.

Our model adopts two fusion modules for multispectral features (MFEV and MFEI). Based on the inputted visible images, the MFEV can extract visible light features on multiple scales (Yv1, Yv2 and Yv3) through the improved YOLOv3 structure. Similarly, the MFEI automatically extracts the infrared image features on multiple scales. The features of the visible and infrared images are then fused into the multiscale features of the MSFFN, paving the way for multispectral pedestrian detection.
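As a usage illustration for the paired input described above, the sketch below loads one visible/infrared image pair and resizes both to the 416*416 network input; the file names are hypothetical placeholders, not the actual KAIST directory layout.

```python
# Load one visible/infrared image pair and resize it to the 416x416 network input
# (sketch; the file names are made up for illustration).
import cv2
import numpy as np

def load_pair(visible_path, infrared_path, size=416):
    vis = cv2.imread(visible_path, cv2.IMREAD_COLOR)        # H x W x 3 (BGR)
    ir = cv2.imread(infrared_path, cv2.IMREAD_GRAYSCALE)    # H x W
    vis = cv2.resize(vis, (size, size)).astype(np.float32) / 255.0
    ir = cv2.resize(ir, (size, size)).astype(np.float32) / 255.0
    return vis, ir[..., None]                                # add a channel axis to the IR map

vis, ir = load_pair("set00_V000_I00000_visible.jpg", "set00_V000_I00000_lwir.jpg")
print(vis.shape, ir.shape)    # (416, 416, 3) (416, 416, 1)
```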
The MFEV and MFEI modules, the core parts of our model, rely on the improved YOLOv3 network for dataset training. Here, the proposed model is trained by the YOLOv3 network, which adopts the structure of DarkNet-53. The trained model can automatically collect and fuse the features from visible light and infrared images on three different scales, laying the basis for pedestrian detection under various environments.

Before the training, the xmin and xmax of the bounding box were divided by the width of the image, and the ymin and ymax were divided by the height of the image. These division operations normalize the coordinates, making the training independent of the pixel size of each image. Since the input image is not square, the x coordinate is divided by a different number from the y coordinate. The divisor of each image varies with its size and aspect ratio, which affects the treatment of the coordinates of the bounding box and the prior box.
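The normalization described above amounts to the small helper sketched below (the function and variable names are ours, not from the paper).

```python
# Normalize box corners to [0, 1] by dividing x by the image width and y by the
# image height, so training does not depend on the pixel size of each frame.
def normalize_box(xmin, ymin, xmax, ymax, img_w, img_h):
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

# Example: a pedestrian box in a 640x512 visible frame.
print(normalize_box(300, 180, 420, 440, img_w=640, img_h=512))
# (0.46875, 0.3515625, 0.65625, 0.859375)
```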
Here, the pre-trained model of the improved YOLOv3 network is adopted. Take the visible light channel for example. The series of image frames received from the camera were preprocessed to adjust the image size to 416*416. Despite squeezing the image, this adjustment approach is simple to implement, especially if most images have similar, non-extreme aspect ratios.
In our design, the pedestrian region is directly predicted by the CNN. Then, the predicted values are converted into bounding boxes. Since the ground-truth box is available in the dataset, the training requires a loss function that compares the prediction box with the ground-truth box. During the training, each detector was matched with one of the ground-truth boxes, because the number of real boxes and their locations vary from image to image.

During the matching, each detector was trained to predict various targets: some are large, some are small, some are in the corners, and some are in the middle. Hence, detection models with a fixed-size network were adopted, where each detector is responsible for detecting targets of a specific size at a specific location. In the YOLOv3, each target in the image was predicted by only one detector. The grid cell at the center of the bounding box was found, and the other grid cells were penalized by the loss function.

The output of the improved YOLOv3 network is a tensor of 13*13*125. Therefore, the target tensor of the loss function is of the size 13*13*125. The number 125 comes from 5 detectors, each of which predicts 20 class probabilities, the four coordinates of the bounding box, and one confidence. For a positive sample, the target tensor contains the coordinates of the bounding box of the target, the one-hot encoded class vector, and a confidence of 1.0, because it is 100% certain that the sample is a real target. For a negative sample, all values of the target tensor are 0: the coordinates of the bounding box and the class vector are unimportant because they are ignored by the loss function, and the confidence is zero, because it is 100% certain that the sample is not a target.

Each training iteration requires an image tensor of the size 416*416*3 and a target tensor of the size 13*13*125. Most elements of the target tensor are zeros, for most detectors are not responsible for predicting a particular target. A prediction is considered a true positive only if the coordinates, confidence and class are all correct.
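A sketch of how such a 13*13*125 target tensor can be filled for one positive sample is given below; the detector assignment is simplified (the box is placed in the grid cell containing its centre and always uses the first detector slot), so it should be read as an illustration of the encoding rather than the exact training code.

```python
# Build a 13x13x125 YOLO-style target tensor: 5 detectors per grid cell, each with
# 4 box coordinates + 1 confidence + 20 class probabilities (5 * 25 = 125).
import numpy as np

S, B, C = 13, 5, 20
target = np.zeros((S, S, B * (5 + C)), dtype=np.float32)

# One positive sample: normalized box (cx, cy, w, h) and class index.
cx, cy, w, h, cls = 0.48, 0.63, 0.10, 0.35, 7
col, row = int(cx * S), int(cy * S)        # grid cell containing the box centre
det = 0                                    # simplified: always use the first detector slot

off = det * (5 + C)
target[row, col, off:off + 4] = [cx, cy, w, h]   # box coordinates
target[row, col, off + 4] = 1.0                  # confidence: 100% certain it is a target
target[row, col, off + 5 + cls] = 1.0            # one-hot class vector

print(target.shape, target[row, col, off:off + 9])
```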
Before training, the input images were normalized to 416*416, and the sigmoid function was used in YOLOv3 to limit the predicted confidence to the interval [0, 1].
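The role of the sigmoid mentioned here can be sketched as follows: the raw confidence logit is squashed into [0, 1] and scored against the 0/1 objectness target with binary cross-entropy, a simplified stand-in for the full YOLOv3 loss.

```python
# Squash the raw confidence logit into [0, 1] with a sigmoid and score it against
# the 0/1 objectness target (a simplified stand-in for the full YOLOv3 loss).
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.5, 0.3])        # raw confidence outputs of three detectors
targets = torch.tensor([1.0, 0.0, 0.0])        # 1 = real pedestrian, 0 = background

conf = torch.sigmoid(logits)                   # confidences limited to the interval [0, 1]
loss = F.binary_cross_entropy(conf, targets)   # same as binary_cross_entropy_with_logits(logits, targets)

print(conf, loss.item())
```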
Table 2 compares the detection effects of the MFEV, MFEI and MSFFN for input images of different sizes (640×512, 480×384 and 320×256), i.e. the average precisions (APs) at different input resolutions.

The results in Table 2 generally fulfill the detection requirements. The MFEV, as a visible light channel, achieved a satisfactory effect in the daytime or under bright light; the MFEI, as an infrared channel, achieved a much better effect than the MFEV, especially at nighttime or under dim light. The MSFFN boasted the best detection effect, because the visible light and infrared features are fused into multispectral features for prediction. The excellence of the MSFFN is particularly obvious when the input images are small.

The good performance of the MSFFN is attributable to the FPN in the improved YOLOv3. The FPN makes simple changes to the network connections, which greatly improves small-target detection performance without increasing the computing load.
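For reference, AP figures such as those in Table 2 hinge on the intersection-over-union (IoU) test between a predicted box and a ground-truth box; a minimal IoU helper (in normalized corner coordinates) is sketched below, with 0.5 as the conventional threshold rather than a value stated in the paper.

```python
# Intersection-over-union between two boxes given as (xmin, ymin, xmax, ymax);
# a prediction is usually counted as correct when IoU exceeds a threshold such as 0.5.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred = (0.40, 0.30, 0.60, 0.80)
gt   = (0.45, 0.35, 0.65, 0.85)
print(iou(pred, gt), iou(pred, gt) >= 0.5)   # ~0.509 True
```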
5.4. Model testing

To test the effectiveness of the proposed model, 20% of the pedestrian images from the KAIST dataset were extracted as the test set. The MFEV, MFEI and MSFFN were separately applied to detect pedestrians in images with no occlusion, part occlusion and heavy occlusion on the near, medium and far scales, respectively. The results on daytime or well-lighted images are displayed in Fig. 7, and those on nighttime or poorly-lighted images are displayed in Fig. 8.
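The 20% hold-out described here can be reproduced along the following lines (a sketch; the pair identifiers and the random seed are placeholders).

```python
# Hold out 20% of the image pairs as the test set (sketch with placeholder IDs).
import random

pair_ids = [f"pair_{i:05d}" for i in range(2252)]   # one ID per visible/infrared pair
random.seed(0)
random.shuffle(pair_ids)

split = int(0.8 * len(pair_ids))
train_ids, test_ids = pair_ids[:split], pair_ids[split:]
print(len(train_ids), len(test_ids))   # 1801 451
```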
5.4.1. Daytime experiments

Figs. 7–9 show the detection results on the near, medium and far scales, respectively. The first column is the detection results of the MFEV on visible light images; the second column is the detection results of the MFEI on infrared images; the third column is the detection results of the MSFFN that fuses the visible light and infrared features.

As shown in Fig. 7, the MFEI mistook the head of a car as a pedestrian; the mistake was not made by the MSFFN. The gray pedestrians in the lower left corner which are heavily occluded (Figs. 7(c) and 8(c)) were ignored by the MFEV, but accurately identified by the MFEI. Similarly, the three heavily occluded pedestrians in the lower left corner were not identified accurately by the MFEV, yet pinpointed by the MSFFN, thanks to the fusion of multispectral features. In Fig. 9(a), the MFEV failed to identify the pedestrian covered by the shade across the road, while the MFEI and MSFFN could recognize the pedestrian accurately.

5.4.2. Nighttime experiments

Figs. 10–12 show the detection results on the near, medium and far scales, respectively. The first column is the detection results of the MFEV on visible light images; the second column is the detection results of the MFEI on infrared images; the third column is the detection results of the MSFFN that fuses the visible light and infrared features.

Overall, the three models achieved poorer effects on the visible light channel in the nighttime experiments than in the daytime experiments, especially on the medium scale. Under the dim light at night, the images lack exposure, and the visible light cannot provide a good reference. In the case of occlusion, especially heavy occlusion (Fig. 12(c)), the MFEI could not identify the specific pedestrians well.

To sum up, the MFEV greatly outperformed the MFEI in the daytime or under bright light, while the MFEI performed better at nighttime or under dim light. The MSFFN achieved even better effects than either single channel, especially for occluded targets. The experimental results fully verify the effectiveness of our MSFFN model in target detection under changing scenes.

5.4.3. Evaluation of multispectral feature fusion schemes

The multispectral feature fusion schemes designed in this paper are shown in Figs. 2 and 3. The MFEV extracts multiscale visible light image features through the improved YOLOv3 structure from the visible light input, and the corresponding MFEI automatically extracts multiscale infrared image features. The visible and infrared image features are then fused on each scale, from which the multispectral pedestrian detection results are obtained.
Fig. 7 Near Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 8 Medium Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 9 Far scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 10 Near Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 11 Medium Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 12 Far scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
(1) Daytime: (a) Visible object (b) R-CNN (c) Faster R-CNN (d) SSD (e) MSFFN (ours)
(2) Nighttime: (a) Visible object (b) R-CNN (c) Faster R-CNN (d) SSD (e) MSFFN (ours)
Fig. 13 Comparison of pedestrian detection results by different state-of-the-art methods in the daytime (1) and nighttime (2) scenarios. In each panel, the first row is the near scale, the second row the medium scale, and the last row the far scale; the first column is the original visible light image, and the second to fifth columns are the recognition results at different distances using R-CNN, Faster R-CNN, SSD and the proposed MSFFN. As can be seen from Fig. 13 and Table 3, the different methods have different advantages.
(1) Analysis and comparison of recognition accuracy performance.

We compare the proposed MSFFN model with a number of commonly used state-of-the-art pedestrian detectors, namely R-CNN, Faster R-CNN, and SSD. The simulation conditions are the same as in the previous experiments. The recognition parameters of R-CNN follow [2], those of Faster R-CNN follow [12], and the SSD parameters follow [5]. The comparison results of pedestrian detection in different scenarios are shown in Fig. 13 and Table 3. These results show that the multispectral pedestrian detector proposed in this paper is more robust under different monitoring conditions.

As shown in Fig. 13, the first method, R-CNN, is the simplest: it uses a CNN for feature extraction combined with an SVM for classification, and its computing load is very large, resulting in a slow detection speed.
Table 3 The comparison performance of the MSFFN with the current state-of-the-art methods.

Model           Reasonable all   Reasonable day   Reasonable night   Near scale   Medium scale   Far scale   No occlusion   Part occlusion   Heavy occlusion
R-CNN           0.832            0.835            0.701              0.732        0.693          0.145       0.680          0.0252           0.101
Faster R-CNN    0.849            0.858            0.725              0.782        0.756          0.159       0.790          0.328            0.131
SSD             0.844            0.849            0.798              0.792        0.765          0.133       0.815          0.316            0.132
MSFFN (ours)    0.854            0.865            0.836              0.797        0.785          0.166       0.818          0.373            0.152
Science and Technology Cooperation Program (2018KW-022), Shaanxi Natural Science Foundation (2018JQ5009 and 18JK0388), and the independent intelligent control research and innovation team supported this work.

References

[1] W.C. Ma, Y.W. Wu, F. Cen, G.H. Wang, MDFN: Multi-scale deep feature learning network for object detection, Pattern Recogn. 100 (2019) 1–13, https://doi.org/10.1016/j.patcog.2019.107149.
[2] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 142–158, https://doi.org/10.1109/TPAMI.2015.2437384.
[3] J. Hosang, R. Benenson, P. Dollár, B. Schiele, What makes for effective detection proposals, IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2015) 814–830, https://doi.org/10.1109/TPAMI.2015.2465908.
[4] Y.L. Hou, Y. Song, X. Hao, Y. Shen, M. Qian, H. Chen, Multispectral pedestrian detection based on deep convolutional neural networks, Infrared Phys. Technol. 94 (2018) 69–77, https://doi.org/10.1016/j.infrared.2018.08.029.
[5] X. Jin, Q. Jiang, S. Yao, D. Zhou, R. Nie, J. Hai, K. He, A survey of infrared and visual image fusion methods, Infrared Phys. Technol. 85 (2017) 478–501, https://doi.org/10.1016/j.infrared.2017.07.010.
[6] Z. Zhao, Y. Zhang, L. Bai, Y. Zhang, J. Han, Multispectral target detection based on the space–spectrum structure constraint with the multi-scale hierarchical model, Signal Process. Image Commun. 68 (2018) 58–67, https://doi.org/10.1016/j.image.2018.06.014.
[7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893, https://doi.org/10.1109/CVPR.2005.177.
[8] S. Kim, K. Cho, Trade-off between accuracy and speed for pedestrian detection using HOG feature, in: 2013 IEEE Third International Conference on Consumer Electronics - Berlin (ICCE-Berlin), 2013, pp. 207–209, https://doi.org/10.1109/ICCE-Berlin.2013.6698033.
[9] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recogn. 29 (1) (1996) 51–59, https://doi.org/10.1016/0031-3203(95)00067-4.
[10] Y. Mu, S. Yan, Y. Liu, T. Huang, B. Zhou, Discriminative local binary patterns for human detection in personal album, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8, https://doi.org/10.1109/CVPR.2008.4587800.
[11] D.M. Gavrila, A Bayesian, exemplar-based approach to hierarchical shape matching, IEEE Trans. Pattern Anal. Mach. Intell. 29 (8) (2007) 1408–1421, https://doi.org/10.1109/TPAMI.2007.1062.
[12] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. (2015) 1–9.
[13] M.A. Wajeed, V. Sreenivasulu, Image based tumor cells identification using convolutional neural network and auto encoders, Traitement du Signal 36 (5) (2019) 445–453, https://doi.org/10.18280/ts.360510.
[14] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788, https://doi.org/10.1109/CVPR.2016.91.
[15] A.X. Guo, B.Q. Yin, Y. Li, Small-size pedestrian detection via deep convolutional neural network, Inf. Technol. Netw. Security 37 (7) (2018) 50–53.
[16] K. Gorur, M.R. Bozkurt, M.S. Bascil, F. Temurtas, GKP signal processing using deep CNN and SVM for tongue-machine interface, Traitement du Signal 36 (4) (2019) 319–329, https://doi.org/10.18280/ts.360404.
[17] W. Xu, K. Shawn, G. Wang, Toward learning a unified many-to-many mapping for diverse image translation, Pattern Recogn. 93 (2019) 570–580, https://doi.org/10.1016/j.patcog.2019.05.017.
[18] Y. Wu, Z. Zhang, G. Wang, Unsupervised deep feature transfer for low resolution image classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1–5.
[19] X. Song, S. Gao, K. Cao, J. Huang, A new hybrid method in global dynamic path planning of mobile robot, Int. J. Comput. Commun. Control 13 (6) (2018) 1032–1046.
[20] X. Song, H. Chen, Y. Xue, Stabilization precision control methods of photoelectric aim-stabilized system, Opt. Commun. 351 (2015) 115–120, https://doi.org/10.1016/j.optcom.2015.04.056.
[21] Q. Xiao, S. Liu, Motion retrieval based on dynamic Bayesian network and canonical time warping, Soft Comput. 21 (1) (2017) 267–280, https://doi.org/10.1007/s00500-015-1889-9.
[22] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.