A Multispectral Feature Fusion Network for Robust Pedestrian Detection (Alexandria Engineering Journal, 2020)
School of Electronic and Information Engineering, Xi’an Technological University, Xi’an 710021, China
KEYWORDS: Multispectral feature fusion; Pedestrian detection; Visible light images; Infrared images

Abstract: The multispectral information, including both visible and infrared information, can describe the detection target in a comprehensive manner. Deep learning (DL)-based detectors that fuse multispectral features can detect pedestrians robustly in various environments. Therefore, this paper puts forward a robust multispectral feature fusion network (MSFFN) for pedestrian detection, which fully integrates the features extracted from the visible light and infrared channels. Specifically, multiscale semantic features were extracted by two core modules, namely multiscale feature extraction of visible images (MFEV) and multiscale feature extraction of infrared images (MFEI), and fused by the improved YOLOv3 network for pedestrian recognition. Experiments on the KAIST dataset prove that the MSFFN model detects pedestrians more accurately than either the MFEV or the MFEI alone on daytime and nighttime images across multiple scales. The experimental results on the KAIST multispectral dataset also show that the proposed MSFFN model is superior to a number of state-of-the-art multispectral pedestrian detection methods in accuracy and speed, strikes a good balance between the two, and performs well on small input images. The research results shed important new light on the design of self-driving vehicles.

© 2020 Production and hosting by Elsevier B.V. on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
pedestrian detectors are solely based on visible information, and thus sensitive to ambient brightness [4]. Against this backdrop, infrared imagers have aroused much interest, because infrared images are not susceptible to light. But infrared pedestrian detectors face a severe defect: infrared images have a low resolution and few color features [5].

To overcome the above defect, multispectral information (e.g. visible information and infrared information) should be considered in pedestrian detectors, such that the detectors can adapt to various brightness levels. In the past few years, many multispectral pedestrian detectors have been developed, making pedestrian detection accurate and stable around the clock [6].

The remainder of this paper is organized as follows: Section 2 reviews the literature on pedestrian detection; Section 3 improves the YOLOv3 network; Section 4 sets up the proposed model; Section 5 validates our model through experiments; Section 6 puts forward the conclusions.

2. Literature review

Currently, pedestrians are detected either by traditional approaches or based on deep learning (DL).

The traditional approaches mainly express features according to the relationship between adjacent pixels of the image. Dalal and Triggs [7] proposed the histogram of oriented gradients (HOG), which counts occurrences of gradient orientation in localized portions of an image. However, the real-time performance of the HOG is poor, for the HOG features are high-dimensional, pushing up the computing load. Kim and Cho [8] developed the integral histogram to speed up the computation of HOG features, but failed to reduce the dimensionality of such features. Ojala et al. [9] put forward the local binary pattern (LBP) operator, which constructs a rotation- and grayscale-invariant feature map through pairwise comparison between neighboring pixels. Mu et al. [10] detected pedestrians accurately with the LBP operator as the feature descriptor. Gavrila [11] developed a rapid and accurate method for pedestrian detection called template matching, but this method requires manual marking of templates and generalizes poorly. Overall, the traditional approaches rely on manual design, failing to capture the changing features of pedestrians in complex scenes.
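To make these hand-crafted descriptors concrete, the sketch below computes HOG and LBP features for a grayscale window with scikit-image; the 128*64 window, 8*8 cells and 8-neighbour LBP are classic illustrative settings, not parameters taken from [7–10].

```python
# Hand-crafted HOG and LBP features for a pedestrian-sized window (illustrative sketch;
# the 128x64 window and 8x8 cells follow the classic Dalal-Triggs setup, not this paper).
import numpy as np
from skimage import data, transform
from skimage.feature import hog, local_binary_pattern

window = transform.resize(data.camera(), (128, 64))          # stand-in for a detection window

hog_vec = hog(window, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), block_norm='L2-Hys')    # high-dimensional gradient histogram

gray_u8 = (window * 255).astype(np.uint8)
lbp_map = local_binary_pattern(gray_u8, P=8, R=1, method='uniform')  # grayscale/rotation-invariant codes
lbp_hist, _ = np.histogram(lbp_map, bins=np.arange(11), density=True)

print(hog_vec.shape, lbp_hist.shape)    # (3780,) (10,)
```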
In recent years, DL has been widely applied in pedestrian detection, thanks to its application potential in self-driving vehicles and target tracking. Unlike the traditional approaches, DL can extract high-level semantics of images through deep convolutional neural networks (CNNs), and thus describe pedestrians better. The two most advanced DL-based pedestrian detection tools are the faster region-based CNN (Faster R-CNN) [12,13] and the You Only Look Once (YOLO) network [14]. The Faster R-CNN, extended from a CNN with multiple feature layers, provides a universal framework for pedestrian detection under the lack of small-scale features [15,16]. Under this two-level framework, each candidate region is inputted to the CNN for convolution, which slows down the detection and pushes up the computing load [12]. These disadvantages are fatal to self-driving vehicles, whose operation requires high real-timeliness and precision.

The YOLO network adopts a single-stage testing framework: the region proposal network (RPN) is coupled with the Faster R-CNN to generate the candidate regions; then, the target boundary is identified through classification of the candidate regions; the entire image is fed in by modifying the network inputs; finally, the location and category of the bounding box are regressed in the output layer. Xu et al. [17] designed the YOLOv2, a cascaded network for unified and real-time target detection, which achieves both high recall and high efficiency in pedestrian detection in complex scenes. The YOLOv2 relies on the RPN to compute the candidate regions of pedestrians, and employs the boosted cascaded forest [11] to classify the candidate regions by reweighting the samples. Through end-to-end training and real-time detection [18], the YOLO network is much faster than the Faster R-CNN [19], providing a truly real-time target detector based on DL.

Most studies on pedestrian detection for self-driving vehicles focus on the visible light spectrum of images [20]. However, the visible light sensing technique may cease to be effective at night or under dim light. Hence, it is imperative to install thermal infrared sensors on the vehicle, which differentiate the target(s) from the background based on their difference in thermal radiation. Thermal infrared images have distinctive advantages for pedestrian detection at night, for the wavelength of human body radiation falls near 9.3 μm. Albeit lacking the texture and color of the background [21], infrared images can be produced throughout the day, and help identify vehicles at night, facilitating real-time decisions on collision avoidance. The thermal infrared and visible light reflections from the background should therefore be combined to improve the performance of target detection.

To sum up, the multispectral information, including both visible information and infrared information, can describe the target in a comprehensive manner. DL-based detectors that fuse the multispectral features can detect pedestrians robustly in various environments. In the light of the above, this paper firstly improves the YOLOv3, a robust DL framework for target detection. On this basis, multiscale feature extraction of visible images (MFEV) and multiscale feature extraction of infrared images (MFEI) were combined into a multispectral feature fusion network (MSFFN). Based on multispectral information, the MSFFN can detect pedestrians accurately in real time, especially if the pedestrian images are small and occluded. The research results enjoy a great application potential in self-driving vehicles.

3. Improvement of YOLOv3 network

The YOLOv3 is a robust network to detect two or several targets that are close to each other [22]. This section improves the structure of the traditional YOLOv3 to realize pedestrian detection in various environments.

3.1. The structure of YOLOv3 network

The YOLOv3 network adopts the one-stage detection method [12]. The overall framework of the network is illustrated in Fig. 1.
The YOLOv3 network down-samples the feature map with a fully convolutional layer, rather than the pooling layer used in YOLOv2. Besides, the residual structure is introduced to prevent vanishing and exploding gradients, which are unfavorable to network training. In this way, it is possible to build up and train a 53-layer deep network with a high accuracy. The YOLOv3 network can be trained with samples on multiple scales. The detection accuracy of the network is positively correlated with the fineness of the grids. However, the network cannot work rapidly and accurately at the same time.
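The two building blocks mentioned here can be sketched in a few lines of PyTorch: a convolution–batch-normalization–LeakyReLU unit whose stride-2 convolution replaces pooling, and a residual unit that adds its input back to its output. The channel widths below are illustrative assumptions, not the authors' released implementation.

```python
# DarkNet-53-style building blocks (sketch): a stride-2 convolution replaces pooling,
# and residual connections ease the training of a deep backbone.
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Residual(nn.Module):
    """1x1 bottleneck followed by a 3x3 convolution, added back to the input."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, 1),
            conv_bn_leaky(channels // 2, channels, 3),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 64, 208, 208)
down = conv_bn_leaky(64, 128, 3, stride=2)   # down-sampling without a pooling layer
y = Residual(128)(down(x))
print(y.shape)                               # torch.Size([1, 128, 104, 104])
```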
3.2. The structure of improved YOLOv3 network

Using the end-to-end detection method, this paper adds multiscale prediction to improve the YOLOv3 network. Three scales were selected for targets of different sizes, and pedestrian information was added to the location information of the bounding box. The overall structure of the improved network is presented in Fig. 2.

The improved network combines the merits of the YOLOv2 network and the feature pyramid network (FPN), aiming to enhance the recognition ability of the multiscale network on small targets. To identify targets, the FPN extracts pixels accurately from the top of each layer, rather than randomly selecting semantic information. Several convolutional layers are added to extract deep features. Thus, the FPN utilizes the semantic information of low-level features and high-level features at the same time. The features from different layers are fused to achieve a good predictive effect. Hence, the network can learn more meaningful semantic information. The scale is selected based on the size of the target to be detected: in general, small targets are detected on a large scale, and large targets on a small scale. The multiscale feature map is obtained through up-sampling. In this way, the improved network can detect small targets in an accurate and fast manner.
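The up-sample-and-fuse idea can be illustrated with the following PyTorch sketch, which assumes three backbone maps of 256, 512 and 1024 channels at 52*52, 26*26 and 13*13; the lateral channel width, the simple concatenation and the output depth are assumptions for illustration rather than the exact improved-YOLOv3 graph.

```python
# FPN-style multiscale fusion (sketch): the deepest map is up-sampled and concatenated
# with shallower maps, so each prediction scale sees both semantics and spatial detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHead(nn.Module):
    def __init__(self, c_small=1024, c_mid=512, c_large=256, out_ch=255):
        super().__init__()
        self.lateral_small = nn.Conv2d(c_small, 256, 1)
        self.lateral_mid = nn.Conv2d(c_mid + 256, 256, 1)
        self.lateral_large = nn.Conv2d(c_large + 256, 256, 1)
        self.heads = nn.ModuleList([nn.Conv2d(256, out_ch, 1) for _ in range(3)])

    def forward(self, f52, f26, f13):
        p13 = self.lateral_small(f13)                                        # 13x13: large targets
        p26 = self.lateral_mid(torch.cat([f26, F.interpolate(p13, scale_factor=2)], dim=1))
        p52 = self.lateral_large(torch.cat([f52, F.interpolate(p26, scale_factor=2)], dim=1))
        return [head(p) for head, p in zip(self.heads, (p13, p26, p52))]

outs = MultiScaleHead()(torch.randn(1, 256, 52, 52),
                        torch.randn(1, 512, 26, 26),
                        torch.randn(1, 1024, 13, 13))
print([o.shape[-1] for o in outs])   # [13, 26, 52]
```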
4. Design of the MSFFN

Currently, pedestrian detection is mainly realized based on visible information. But the detection effect is poor at night or under dim light. To solve the problem, the visible light and infrared images were fused for pedestrian detection under complex environments.

4.1. MFEV and MFEI modules

YOLO is a simple network with a fast speed. The YOLOv3 network can detect two or more targets that are close to each other in a robust manner. In the above section, the YOLOv3 network was improved with the FPN to better recognize targets of different sizes.

As mentioned before, the multispectral information, including visible light information and infrared information, can greatly promote target recognition. To extract multiscale features, visible light and infrared images are imported to the MFEV and MFEI modules. The two modules are the core of the MSFFN design. Fig. 3 explains the structures of the two modules.
As shown in Fig. 3(a), the MFEV module receives a 416*416 input image with 3 channels, and implements the multiscale training of the YOLOv3 network. A series of 3*3 and 1*1 convolutional kernels are deployed in the module. The 3*3 kernels increase the number of channels of the feature map, while the 1*1 kernels compress the feature map processed by the corresponding 3*3 kernels. Each convolutional layer performs a batch normalization (BN) operation on its input. Each input image goes through 53 convolutional layers and 5 residual layers with different scales and depths, and then through layers 75–105 of the YOLOv3 network. The feature fusion layer is divided into three scales (13*13, 26*26 and 52*52). The features on each scale are accumulated into a feature map, and then convoluted by 3*3 and 1*1 kernels. Hence, the features on different scales are mapped to channels Yv1 (13*13), Yv2 (26*26) and Yv3 (52*52), respectively. In this way, the MFEV can extract feature maps of multiscale information, laying a solid basis for accurate detection. The MFEI works on a similar principle as the MFEV.

4.2. MSFFN

The MSFFN is a new hybrid pedestrian detection network that extracts multiscale features under various light scenes. Fig. 4 introduces the MSFFN structure with a pair of visible light and infrared images. Firstly, the semantic features of a single channel (Yv1, Yv2, Yv3, YI1, YI2 and YI3) are extracted through the MFEV and MFEI modules, respectively. Next, the visible light and infrared images are fused through the extraction of multiscale features and the fusion of multispectral feature maps. On this basis, the MSFFN judges whether pedestrians exist in the scene. The detection results are given in the green box at the top of Fig. 4. In the visible channel, the pedestrian at the bottom is likely to be ignored, due to the shadowing effect of the tree. On the contrary, the image in the infrared channel is crystal clear, enhancing the reliability of pedestrian detection. Therefore, the MSFFN can extract meaningful multiscale feature maps to achieve accurate detection of pedestrians.
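A minimal sketch of this per-scale fusion is given below, assuming the visible and infrared feature maps (Yv1–Yv3 and YI1–YI3) are concatenated along the channel axis and mixed by a 1*1 convolution on each scale; the actual MSFFN fusion layer may differ in detail.

```python
# Per-scale fusion of visible (Yv) and infrared (YI) feature maps (sketch):
# channel-wise concatenation followed by a 1x1 convolution on each of the
# three scales (13x13, 26x26, 52x52).
import torch
import torch.nn as nn

class SpectralFusion(nn.Module):
    def __init__(self, channels=(1024, 512, 256)):
        super().__init__()
        self.mix = nn.ModuleList([nn.Conv2d(2 * c, c, kernel_size=1) for c in channels])

    def forward(self, visible_feats, infrared_feats):
        # visible_feats / infrared_feats: lists [Yv1, Yv2, Yv3] and [YI1, YI2, YI3]
        return [m(torch.cat([v, i], dim=1))
                for m, v, i in zip(self.mix, visible_feats, infrared_feats)]

sizes = (13, 26, 52)
vis = [torch.randn(1, c, s, s) for c, s in zip((1024, 512, 256), sizes)]
ir  = [torch.randn(1, c, s, s) for c, s in zip((1024, 512, 256), sizes)]
fused = SpectralFusion()(vis, ir)
print([f.shape for f in fused])   # same shapes as the single-channel maps
```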
Fig. 3 Structures of the MFEV (a) and MFEI (b) modules. Each branch takes a 416*416*3 input (visible or infrared image) and stacks repeated 1*1 and 3*3 convolutional blocks with residual connections (1×, 2×, 8×, 8× and 4× repetitions).
The visible light and infrared images from the KAIST dataset were adopted to test the effectiveness of the MSFFN (Fig. 5). The KAIST dataset is a public database of 2252 pairs of images with ,356 pedestrian annotations. These images are captured by normal cameras or infrared sensors in the daytime or at nighttime.

Our model adopts two fusion modules for multispectral features (MFEV and MFEI). Based on the inputted visible images, the MFEV can extract visible light features on multiple scales (Yv1, Yv2 and Yv3) through the improved YOLOv3 structure. Similarly, the MFEI automatically extracts the infrared image features on multiple scales. The features of the visible and infrared images are then fused into the multiscale features of the MSFFN, paving the way for multispectral pedestrian detection.
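As a usage illustration for the paired input described above, the sketch below loads one visible/infrared image pair and resizes both to the 416*416 network input; the file names are hypothetical placeholders, not the actual KAIST directory layout.

```python
# Load one visible/infrared image pair and resize it to the 416x416 network input
# (sketch; the file names are made up for illustration).
import cv2
import numpy as np

def load_pair(visible_path, infrared_path, size=416):
    vis = cv2.imread(visible_path, cv2.IMREAD_COLOR)        # H x W x 3 (BGR)
    ir = cv2.imread(infrared_path, cv2.IMREAD_GRAYSCALE)    # H x W
    vis = cv2.resize(vis, (size, size)).astype(np.float32) / 255.0
    ir = cv2.resize(ir, (size, size)).astype(np.float32) / 255.0
    return vis, ir[..., None]                                # add a channel axis to the IR map

vis, ir = load_pair("set00_V000_I00000_visible.jpg", "set00_V000_I00000_lwir.jpg")
print(vis.shape, ir.shape)    # (416, 416, 3) (416, 416, 1)
```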
The MFEV and MFEI modules, the core parts of our model, rely on the improved YOLOv3 network for dataset training. Here, the proposed model is trained by the YOLOv3 network, which adopts the structure of DarkNet-53. The trained model can automatically collect and fuse the features from visible light and infrared images on three different scales, laying the basis for pedestrian detection under various environments.

Before the training, the xmin and xmax of the bounding box were divided by the width of the image, and the ymin and ymax were divided by the height of the image. These division operations normalize the coordinates, making the training independent of the pixel size of each image. Since the input image is not square, the x coordinate is divided by a different number from the y coordinate. The divisor of each image varies with its size and aspect ratio, which affects the treatment of the coordinates of the bounding box and the prior box.
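The normalization described above amounts to the small helper sketched below (the function and variable names are ours, not from the paper).

```python
# Normalize box corners to [0, 1] by dividing x by the image width and y by the
# image height, so training does not depend on the pixel size of each frame.
def normalize_box(xmin, ymin, xmax, ymax, img_w, img_h):
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

# Example: a pedestrian box in a 640x512 visible frame.
print(normalize_box(300, 180, 420, 440, img_w=640, img_h=512))
# (0.46875, 0.3515625, 0.65625, 0.859375)
```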
Here, the pre-trained model of the improved YOLOv3 network is adopted. Take the visible light channel for example. The series of image frames received from the camera were preprocessed to adjust the image size to 416*416. Despite squeezing the image, this adjustment approach is simple to implement, especially if most images have similar, non-extreme aspect ratios.
In our design, the pedestrian region is directly predicted by the CNN. Then, the predicted values are converted into bounding boxes. Since the ground-truth box is available in the dataset, the training requires a loss function that compares the prediction box with the ground-truth box. During the training, each detector was matched with one of the ground-truth boxes, because the number of real boxes and their locations vary from image to image.

During the matching, each detector was trained to predict various targets: some are large, some are small, some are in the corners, and some are in the middle. Hence, detection models with a fixed-size network were adopted, where each detector is responsible for detecting targets of a specific size at a specific location. In the YOLOv3, each target in the image was predicted by only one detector. The grid cell at the center of the bounding box was found, and the other grid cells were penalized by the loss function.

The output of the improved YOLOv3 network is a tensor of 13*13*125. Therefore, the target tensor of the loss function is of the size 13*13*125. The number 125 comes from 5 detectors, each of which predicts 20 class probabilities, the four coordinates of the bounding box, and one confidence. For a positive sample, the target tensor contains the coordinates of the bounding box of the target, the one-hot encoded class vector, and a confidence of 1.0, because it is 100% certain that the sample is a real target. For a negative sample, all values of the target tensor are 0: the coordinates of the bounding box and the class vector are unimportant because they are ignored by the loss function, and the confidence is zero, because it is 100% certain that the sample is not a target.

Each training iteration requires an image tensor of the size 416*416*3 and a target tensor of the size 13*13*125. Most elements of the target tensor are zeros, for most detectors are not responsible for predicting a particular target. A prediction is considered a true positive only if the coordinates, confidence and class are all correct.
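A sketch of how such a 13*13*125 target tensor can be filled for one positive sample is given below; the detector assignment is simplified (the box is placed in the grid cell containing its centre and always uses the first detector slot), so it should be read as an illustration of the encoding rather than the exact training code.

```python
# Build a 13x13x125 YOLO-style target tensor: 5 detectors per grid cell, each with
# 4 box coordinates + 1 confidence + 20 class probabilities (5 * 25 = 125).
import numpy as np

S, B, C = 13, 5, 20
target = np.zeros((S, S, B * (5 + C)), dtype=np.float32)

# One positive sample: normalized box (cx, cy, w, h) and class index.
cx, cy, w, h, cls = 0.48, 0.63, 0.10, 0.35, 7
col, row = int(cx * S), int(cy * S)        # grid cell containing the box centre
det = 0                                    # simplified: always use the first detector slot

off = det * (5 + C)
target[row, col, off:off + 4] = [cx, cy, w, h]   # box coordinates
target[row, col, off + 4] = 1.0                  # confidence: 100% certain it is a target
target[row, col, off + 5 + cls] = 1.0            # one-hot class vector

print(target.shape, target[row, col, off:off + 9])
```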
Before training, the input images were normalized to 416*416, and the sigmoid function was used in YOLOv3 to limit the predicted confidence to the interval [0, 1].
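The role of the sigmoid mentioned here can be sketched as follows: the raw confidence logit is squashed into [0, 1] and scored against the 0/1 objectness target with binary cross-entropy, a simplified stand-in for the full YOLOv3 loss.

```python
# Squash the raw confidence logit into [0, 1] with a sigmoid and score it against
# the 0/1 objectness target (a simplified stand-in for the full YOLOv3 loss).
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.5, 0.3])        # raw confidence outputs of three detectors
targets = torch.tensor([1.0, 0.0, 0.0])        # 1 = real pedestrian, 0 = background

conf = torch.sigmoid(logits)                   # confidences limited to the interval [0, 1]
loss = F.binary_cross_entropy(conf, targets)   # same as binary_cross_entropy_with_logits(logits, targets)

print(conf, loss.item())
```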
Table 2 compares the detection effects of the MFEV, MFEI and MSFFN for input images of different sizes (640×512, 480×384 and 320×256), i.e. the average precisions (APs) at different input resolutions.

The results in Table 2 generally fulfill the detection requirements. The MFEV, as a visible light channel, achieved a satisfactory effect in the daytime or under bright light; the MFEI, as an infrared channel, achieved a much better effect than the MFEV, especially at nighttime or under dim light. The MSFFN boasted the best detection effect, because the visible light and infrared features are fused into multispectral features for prediction. The excellence of the MSFFN is particularly obvious when the input images are small.

The good performance of the MSFFN is attributable to the FPN in the improved YOLOv3. The FPN makes simple changes to the network connections, which greatly improves small-target detection performance without increasing the computing load.
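For reference, AP figures such as those in Table 2 hinge on the intersection-over-union (IoU) test between a predicted box and a ground-truth box; a minimal IoU helper (in normalized corner coordinates) is sketched below, with 0.5 as the conventional threshold rather than a value stated in the paper.

```python
# Intersection-over-union between two boxes given as (xmin, ymin, xmax, ymax);
# a prediction is usually counted as correct when IoU exceeds a threshold such as 0.5.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred = (0.40, 0.30, 0.60, 0.80)
gt   = (0.45, 0.35, 0.65, 0.85)
print(iou(pred, gt), iou(pred, gt) >= 0.5)   # ~0.509 True
```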
5.4. Model testing

To test the effectiveness of the proposed model, 20% of the pedestrian images from the KAIST dataset were extracted as the test set. The MFEV, MFEI and MSFFN were separately applied to detect pedestrians in images with no occlusion, part occlusion and heavy occlusion on the near, medium and far scales, respectively. The results on daytime or well-lighted images are displayed in Fig. 7, and those on nighttime or poorly-lighted images are displayed in Fig. 8.
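The 20% hold-out described here can be reproduced along the following lines (a sketch; the pair identifiers and the random seed are placeholders).

```python
# Hold out 20% of the image pairs as the test set (sketch with placeholder IDs).
import random

pair_ids = [f"pair_{i:05d}" for i in range(2252)]   # one ID per visible/infrared pair
random.seed(0)
random.shuffle(pair_ids)

split = int(0.8 * len(pair_ids))
train_ids, test_ids = pair_ids[:split], pair_ids[split:]
print(len(train_ids), len(test_ids))   # 1801 451
```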
5.4.1. Daytime experiments

Figs. 7–9 show the detection results on the near, medium and far scales, respectively. The first column is the detection results of the MFEV on visible light images; the second column is the detection results of the MFEI on infrared images; the third column is the detection results of the MSFFN that fuses the visible light and infrared features.

As shown in Fig. 7, the MFEI mistook the head of a car as a pedestrian; the mistake was not made by the MSFFN. The gray pedestrians in the lower left corner which are heavily occluded (Figs. 7(c) and 8(c)) were ignored by the MFEV, but accurately identified by the MFEI. Similarly, the three heavily occluded pedestrians in the lower left corner were not identified accurately by the MFEV, yet pinpointed by the MSFFN, thanks to the fusion of multispectral features. In Fig. 9(a), the MFEV failed to identify the pedestrian covered by the shade across the road, while the MFEI and MSFFN could recognize the pedestrian accurately.

5.4.2. Nighttime experiments

Figs. 10–12 show the detection results on the near, medium and far scales, respectively. The first column is the detection results of the MFEV on visible light images; the second column is the detection results of the MFEI on infrared images; the third column is the detection results of the MSFFN that fuses the visible light and infrared features.

Overall, the three models achieved poorer effects on the visible light channel in the nighttime experiments than in the daytime experiments, especially on the medium scale. Under the dim light at night, the images lack exposure, and the visible light cannot provide a good reference. In the case of occlusion, especially heavy occlusion (Fig. 12(c)), the MFEI could not identify the specific pedestrians well.

To sum up, the MFEV greatly outperformed the MFEI in the daytime or under bright light, while the MFEI performed better at nighttime or under dim light. The MSFFN achieved even better effects than either single channel, especially for occluded targets. The experimental results fully verify the effectiveness of our MSFFN model in target detection under changing scenes.

5.4.3. Evaluation of multispectral feature fusion schemes

The multispectral feature fusion schemes designed in this paper are shown in Figs. 2 and 3. The MFEV extracts multiscale visible light image features through the improved YOLOv3 structure from the visible light input, and the corresponding MFEI automatically extracts multiscale infrared image features. The visible and infrared image features are then fused on each scale, from which the multispectral pedestrian detection results are obtained.
Fig. 7 Near Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 8 Medium Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 9 Far scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 10 Near Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 11 Medium Scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
Fig. 12 Far scale (a) No occlusion (b) Part occlusion (c) Heavy occlusion.
(1) Daytime: (a) Visible object (b) R-CNN (c) Faster R-CNN (d) SSD (e) MSFFN (ours)
(2) Nighttime: (a) Visible object (b) R-CNN (c) Faster R-CNN (d) SSD (e) MSFFN (ours)
Fig. 13 Comparison of pedestrian detection results by different state-of-the-art methods in the daytime (1) and nighttime (2) scenarios. In each panel, the first row is the near scale, the second row the medium scale, and the last row the far scale; the first column is the original visible light image, and the second to fifth columns are the recognition results at different distances using R-CNN, Faster R-CNN, SSD and the proposed MSFFN. As can be seen from Fig. 13 and Table 3, the different methods have different advantages.
(1) Analysis and comparison of recognition accuracy performance.

We compare the proposed MSFFN model with a number of commonly used state-of-the-art pedestrian detectors, namely R-CNN, Faster R-CNN, and SSD. The simulation conditions are the same as in the previous experiments. The recognition parameters of R-CNN follow [2], those of Faster R-CNN follow [12], and the SSD parameters follow [5]. The comparison results of pedestrian detection in different scenarios are shown in Fig. 13 and Table 3. These results show that the multispectral pedestrian detector proposed in this paper is more robust under different monitoring conditions.

As shown in Fig. 13, the first method, R-CNN, is the simplest: it uses a CNN for feature extraction combined with an SVM for classification, and its computing load is very large, resulting in a slow detection speed.
Table 3 The comparison performance of the MSFFN with the current state-of-the-art methods.

Model           Reasonable all   Reasonable day   Reasonable night   Near scale   Medium scale   Far scale   No occlusion   Part occlusion   Heavy occlusion
R-CNN           0.832            0.835            0.701              0.732        0.693          0.145       0.680          0.0252           0.101
Faster R-CNN    0.849            0.858            0.725              0.782        0.756          0.159       0.790          0.328            0.131
SSD             0.844            0.849            0.798              0.792        0.765          0.133       0.815          0.316            0.132
MSFFN (ours)    0.854            0.865            0.836              0.797        0.785          0.166       0.818          0.373            0.152
Science and Technology Cooperation Program (2018KW-022), Shaanxi Natural Science Foundation (2018JQ5009 and 18JK0388), and the independent intelligent control research and innovation team supported this work.

References

[1] W.C. Ma, Y.W. Wu, F. Cen, G.H. Wang, MDFN: Multi-scale deep feature learning network for object detection, Pattern Recogn. 100 (2019) 1–13, https://doi.org/10.1016/j.patcog.2019.107149.
[2] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 142–158, https://doi.org/10.1109/TPAMI.2015.2437384.
[3] J. Hosang, R. Benenson, P. Dollár, B. Schiele, What makes for effective detection proposals, IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2015) 814–830, https://doi.org/10.1109/TPAMI.2015.2465908.
[4] Y.L. Hou, Y. Song, X. Hao, Y. Shen, M. Qian, H. Chen, Multispectral pedestrian detection based on deep convolutional neural networks, Infrared Phys. Technol. 94 (2018) 69–77, https://doi.org/10.1016/j.infrared.2018.08.029.
[5] X. Jin, Q. Jiang, S. Yao, D. Zhou, R. Nie, J. Hai, K. He, A survey of infrared and visual image fusion methods, Infrared Phys. Technol. 85 (2017) 478–501, https://doi.org/10.1016/j.infrared.2017.07.010.
[6] Z. Zhao, Y. Zhang, L. Bai, Y. Zhang, J. Han, Multispectral target detection based on the space–spectrum structure constraint with the multi-scale hierarchical model, Signal Process. Image Commun. 68 (2018) 58–67, https://doi.org/10.1016/j.image.2018.06.014.
[7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893, https://doi.org/10.1109/CVPR.2005.177.
[8] S. Kim, K. Cho, Trade-off between accuracy and speed for pedestrian detection using HOG feature, in: 2013 IEEE Third International Conference on Consumer Electronics - Berlin (ICCE-Berlin), 2013, pp. 207–209, https://doi.org/10.1109/ICCE-Berlin.2013.6698033.
[9] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recogn. 29 (1) (1996) 51–59, https://doi.org/10.1016/0031-3203(95)00067-4.
[10] Y. Mu, S. Yan, Y. Liu, T. Huang, B. Zhou, Discriminative local binary patterns for human detection in personal album, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8, https://doi.org/10.1109/CVPR.2008.4587800.
[11] D.M. Gavrila, A Bayesian, exemplar-based approach to hierarchical shape matching, IEEE Trans. Pattern Anal. Mach. Intell. 29 (8) (2007) 1408–1421, https://doi.org/10.1109/TPAMI.2007.1062.
[12] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. (2015) 1–9.
[13] M.A. Wajeed, V. Sreenivasulu, Image based tumor cells identification using convolutional neural network and auto encoders, Traitement du Signal 36 (5) (2019) 445–453, https://doi.org/10.18280/ts.360510.
[14] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788, https://doi.org/10.1109/CVPR.2016.91.
[15] A.X. Guo, B.Q. Yin, Y. Li, Small-size pedestrian detection via deep convolutional neural network, Inf. Technol. Netw. Security 37 (7) (2018) 50–53.
[16] K. Gorur, M.R. Bozkurt, M.S. Bascil, F. Temurtas, GKP signal processing using deep CNN and SVM for tongue-machine interface, Traitement du Signal 36 (4) (2019) 319–329, https://doi.org/10.18280/ts.360404.
[17] W. Xu, K. Shawn, G. Wang, Toward learning a unified many-to-many mapping for diverse image translation, Pattern Recogn. 93 (2019) 570–580, https://doi.org/10.1016/j.patcog.2019.05.017.
[18] Y. Wu, Z. Zhang, G. Wang, Unsupervised deep feature transfer for low resolution image classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1–5.
[19] X. Song, S. Gao, K. Cao, J. Huang, A new hybrid method in global dynamic path planning of mobile robot, Int. J. Comput. Commun. Control 13 (6) (2018) 1032–1046.
[20] X. Song, H. Chen, Y. Xue, Stabilization precision control methods of photoelectric aim-stabilized system, Opt. Commun. 351 (2015) 115–120, https://doi.org/10.1016/j.optcom.2015.04.056.
[21] Q. Xiao, S. Liu, Motion retrieval based on dynamic Bayesian network and canonical time warping, Soft Comput. 21 (1) (2017) 267–280, https://doi.org/10.1007/s00500-015-1889-9.
[22] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.