Deep Learning-Based Pedestrian Detection Using RGB Images and Sparse LiDAR Point Clouds
works have improved the original target detection models for specific pedestrian detection challenges.
On the other hand, data sources play an essential role in pedestrian detection. There are detection methods based on a single data source [4], [5], [6], [7], [15], [16] and based on multiple data sources [8], [17], [18], [19], [20], [21] as well. Common data sources include RGB image sensors [4], [5], [6], [7], RGB-depth (RGB-D) cameras [22], [23], light detection and ranging (LiDAR) sensors [15], [16], and RGB-thermal infrared (RGB-T) sensors [24], [25]. It is important to note that RGB imaging is a passive imaging technique. An RGB camera captures the light reflected from the surface of an object and converts the incoming light signals into digital signals that are then stored. The image quality is therefore heavily dependent on external factors such as weather and lighting, and the performance of RGB image-based detection methods can degrade significantly in poor lighting conditions at night or in foggy environments. LiDAR, in contrast, is an active sensing technology: it measures the distance and reflected intensity of an object through multiple embedded transmitter–receiver pairs. LiDAR is less affected by environmental factors such as light and weather. However, LiDAR data are sparse compared with RGB camera data; they lack semantic information about the surrounding environment and do not model the detailed appearance of objects well.
Target detection methods based on LiDAR point cloud data can be classified into direct point-based methods, voxel-based methods, and projection-based methods. In 2017, Qi et al. [15], [16] successively proposed PointNet and PointNet++, which applied deep neural networks (DNNs) directly to point cloud data for the first time and achieved good results on classification and segmentation tasks. Subsequently, Shi et al. [26] proposed PointRCNN, a two-stage detection model applied directly to raw point cloud data. The voxel-based approach first partitions the disordered point cloud into voxels through spatial voxel partitioning; the voxels are then encoded, and CNNs are used to extract features and predict target boxes. VoxelNet is a pioneering work in this area [27]. It uses a feature learning network to extract the features of each point within a voxel, obtains the voxel features by pooling operations, and then uses a CNN for feature extraction and target prediction. The sparsely embedded convolutional detection (SECOND) network improves on VoxelNet by exploiting the sparsity of the point cloud distribution and accelerates model convergence [28]. Similar work has been conducted with PointPillars [29], among others. The projection-based methods are discussed together with the multimodal approaches below. Although good target localization accuracy can be obtained, dense high line-beam LiDAR point clouds lead to a huge computational effort and make deployment difficult. In addition, LiDAR point data lack semantic information and do not reflect dense appearance information about the object.
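As a rough illustration of the voxel partitioning step described above (not the implementation used by VoxelNet or SECOND), the following sketch assigns raw LiDAR points to a regular voxel grid; the grid resolution and point-cloud ranges are hypothetical values.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Group LiDAR points (N x 4: x, y, z, intensity) into voxels.

    Returns a dict mapping voxel grid indices (ix, iy, iz) to the points
    that fall inside that voxel. Ranges and sizes are illustrative only;
    a per-voxel feature (e.g., a mean or learned encoding) would then be
    fed to a CNN for detection.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # Keep only points inside the crop range.
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
            (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    points = points[mask]
    # Integer voxel coordinates for each remaining point.
    coords = ((points[:, :3] - np.array([x_min, y_min, z_min])) /
              np.array(voxel_size)).astype(np.int32)
    voxels = {}
    for pt, c in zip(points, map(tuple, coords)):
        voxels.setdefault(c, []).append(pt)
    return voxels
```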
Both sensors have their advantages and disadvantages, and researchers have therefore used them in combination and proposed multimodal detection methods with good results. Qi et al. [18] proposed a method for joint target detection using RGB images and LiDAR point clouds. This approach processes the point clouds directly with neural networks, which incurs a large computational overhead, and once the illumination conditions change drastically, its strategy of generating 2-D regions from RGB images can fail. Chen et al. [19] extracted features from RGB images and from the top view and front view of the point cloud data simultaneously, and then fused them for object detection. Ku et al. [20] designed a two-branch target detection network for point cloud data and RGB images, in which the RGB branch extracts features from ordinary RGB images and the point cloud branch extracts features from bird's-eye view (BEV) maps of the point cloud data; the fusion step is then performed on these features to obtain the target detection boxes. Gao et al. [21] designed a multimodal dual-branch target detection network based on the Faster RCNN detector and introduced a channel attention mechanism to fully extract feature information from RGB images and point cloud projection maps. The three works above transform the LiDAR data, for example by converting the point cloud into a front view, a top view, or a depth map, to improve the speed of target detection. However, they are all based on high line-beam LiDAR point cloud datasets and do not specifically consider pedestrian detection. Recently, Song et al. [8] proposed an improved two-branch detection network, named MS-YOLO, based on the YOLO V5 model. This model also uses two separate feature extraction branches to extract features from both data modes simultaneously and then performs feature-level data fusion; its distinctive feature is that millimeter-wave radar is used and the radar data are converted into pseudochannel images. However, the feature maps of the two data modes are directly concatenated without carefully considering the association and the difference between the feature maps of the two data sources. In addition, several of these studies were conducted only at the level of the detection method itself, with little work done on aspects such as simultaneous data acquisition.
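Several of the works above project LiDAR points into 2-D maps (front view, BEV, or depth maps) before feature extraction. As a generic illustration of the front-view case, and not the specific procedure of any cited method or of this article, the sketch below projects 3-D LiDAR points into a camera image using hypothetical intrinsic and extrinsic calibration matrices.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, T_cam_lidar, img_shape):
    """Project LiDAR points into the image plane.

    points_xyz: (N, 3) points in the LiDAR frame.
    K: (3, 3) camera intrinsic matrix (hypothetical values).
    T_cam_lidar: (4, 4) LiDAR-to-camera extrinsic transform.
    Returns pixel coordinates and depths of the points that land inside
    the image; these can be rasterized into a depth/projection map.
    """
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]          # transform into camera frame
    in_front = pts_cam[:, 2] > 0.1                      # keep points in front of the camera
    pts_cam = pts_cam[in_front]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                         # perspective divide
    h, w = img_shape
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < h))
    return uv[inside], pts_cam[inside, 2]               # pixel positions and depths
```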
To address these issues, this article focuses on both designing a data acquisition platform and developing detection models. The details are as follows. A multimodal data acquisition platform is designed and built. The platform is configured with an RGB camera and a sparse LiDAR so that it can capture the road surroundings. A data soft-synchronous acquisition scheme is also designed and implemented; the platform achieves synchronous acquisition between the two sensors through signal triggering. This data acquisition platform is used to collect pedestrian data for several scenes over multiple periods, and a multimodal dataset for pedestrian detection is produced from the collected data. The dataset contains RGB images and LiDAR data that are matched to each other. Taking into account the data characteristics and performance requirements, a two-branch multimodal multilayer fusion pedestrian detection network (MM-Net) based on an advanced single-stage detector and an attention mechanism is proposed in this article. The proposed model has two innovative parts. One is the design of a two-branch network backbone that supports multimodal data feature extraction. The other is the use of a channel attention mechanism in the synchronized fusion of feature maps from different modal data.
Extensive comparison experiments are performed on two datasets: our dataset and the KITTI dataset [30]. The experimental results show that MM-Net outperforms the advanced comparison models in terms of detection accuracy and speed. In addition, we have conducted an exploratory study on the fusion strategy to further validate the effectiveness of the designed model structure and fusion module.
The main contributions of this article are as follows:
1) A pedestrian data acquisition platform equipped with a sparse LiDAR (only 16 line beams) and an RGB camera is designed and built. A soft-synchronous data acquisition scheme is designed and deployed in the platform to achieve the acquisition of road environment data.
2) A multimodal pedestrian detection dataset is captured and produced. To the best of our knowledge, this is the first dataset that uses sparse LiDAR (only 16 line beams) and focuses on pedestrian detection. This dataset will be made publicly available to researchers.
3) Focusing on the data characteristics and performance requirements, a one-stage deep learning-based multimodal pedestrian detection model is proposed. The detection model can fully extract effective features from different modal data and can fuse features at different levels, such as high, medium, and low, to utilize the favorable information of different modal data for pedestrian detection.
4) Extensive simulations are conducted on two datasets. The simulation results demonstrate the challenging nature of the homemade dataset, as well as the superior performance of MM-Net.
The rest of this article is organized as follows. Section II describes the overall architecture of the data acquisition platform. Section III introduces our proposed MM-Net and the feature-level fusion modules. Section IV describes in detail the construction of the data acquisition platform and the details of the experiments. Finally, Section V concludes this article.

II. DATA ACQUISITION PLATFORM

This section introduces the data acquisition platform, consisting of a computing unit and multisource sensors, which enables data acquisition, transmission, and processing.

Fig. 1. Platform architecture.

Dense line-beam LiDAR systems are associated with high costs, which has become one of the bottlenecks for large-scale commercialization of autonomous driving. Therefore, many companies, such as Audi [31] and Xiaopeng [32], have used low-cost 16-beam LiDAR for autonomous driving research and development. Similarly, due to cost considerations, the other major sensor deployed in this platform is the 16-beam LiDAR produced by Velodyne [33], which has been widely used in autonomous driving fleets operated by high-tech companies such as Baidu. This particular model transmits sparse LiDAR data through automotive Ethernet (100BASE-T1) over shielded twisted pairs. An interface box sits between the sensor and the data processing module for Ethernet protocol conversion so that a local area network (LAN) link can be established and LiDAR data can be streamed.

An Xavier kit [34] serves as the data processing module owing to its adequate computing power and mature tool chains; this particular chip has already been installed in production passenger cars by the EV OEM NIO [35], precisely in the field of perception for autonomous driving. The Xavier kit acts as the preprocessing unit: it connects all sensors, synchronizes and collects all the sensor data, and performs distortion correction and joint calibration. In addition, the actual model inference process is performed on a high-performance computing platform (NVIDIA RTX 3090 GPU).
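The article does not give implementation details of the soft-synchronous acquisition scheme beyond its use of signal triggering. As a generic illustration of how frames from the two sensors can be paired after capture, the following sketch matches each camera frame to the LiDAR sweep with the nearest timestamp within a tolerance window; all function names and the 50 ms threshold are hypothetical.

```python
from bisect import bisect_left

def pair_by_timestamp(camera_stamps, lidar_stamps, max_gap=0.05):
    """Pair camera frames with LiDAR sweeps by nearest timestamp.

    camera_stamps, lidar_stamps: sorted lists of capture times in seconds.
    max_gap: maximum allowed time difference (here 50 ms, i.e., half the
             100 ms period of a 10 Hz LiDAR sweep).
    Returns a list of (camera_index, lidar_index) pairs.
    """
    pairs = []
    for ci, t in enumerate(camera_stamps):
        j = bisect_left(lidar_stamps, t)
        # Candidate sweeps immediately before and after the frame time.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(lidar_stamps)]
        if not candidates:
            continue
        li = min(candidates, key=lambda k: abs(lidar_stamps[k] - t))
        if abs(lidar_stamps[li] - t) <= max_gap:
            pairs.append((ci, li))
    return pairs
```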
into account the characteristics of both modal data. It enables the model to autonomously select valid information from the features extracted from both branches. The effective integration of multiscale features has been proven to be beneficial for target detection [38], [39]. MM-Net accomplishes the full integration of different layers of features in the Neck module. The Neck module is also stacked from convolution and sampling operations. It includes the FPN structure on the left and the PAN structure on the right. The FPN transfers semantic information from the high-level feature map to the low-level feature map from the top down, and the PAN transfers information such as texture from the low-level to the high-level feature maps. The Neck module receives features from the Backbone module at different levels and fully integrates them. MM-Net uses an advanced Rep-PAN [40] structure in the Neck module to accomplish effective feature fusion. The Detect module consists of convolution operations for obtaining the final detection results. It receives features from the Neck module and generates three sets of predictions at different scales: 20 × 20, 40 × 40, and 80 × 80, respectively. The efficient decoupled head used by MM-Net is an improved decoupled detection head structure that separates the classification and localization of targets into separate operations. Finally, some postprocessing operations (nonmaximum suppression, etc.) are performed to obtain the final detection results. The final detection results can be used to assist the vehicle control system in making decisions.

B. Two-Branch Feature Extraction Module

The recently proposed YOLO V6 integrates a large number of excellent CNN design ideas and performs well on the Common Objects in Context (COCO) public dataset [40]. Due to the real-time requirement of pedestrian target detection, we improved the YOLO V6 target detection algorithm and designed a two-branch network structure for extracting features from both modalities, as shown in Fig. 3. One branch is used to extract features from RGB images, and the LiDAR branch is used to extract features from the point cloud data. Both branches have the same network structure, and both use a backbone network called EfficientRep to extract the effective features of their respective modal data. The module can be expressed as follows:

Output_RGB = EffRep_RGB(Input_RGB) (4)

Output_LiDAR = EffRep_LiDAR(Input_LiDAR) (5)

where EffRep_RGB denotes the EfficientRep module for RGB images; EffRep_LiDAR denotes the EfficientRep module for LiDAR point clouds; Input_RGB and Input_LiDAR denote the RGB images and LiDAR point clouds input to MM-Net, respectively; and Output_RGB and Output_LiDAR denote the outputs of the two branches, respectively. Both Output_RGB and Output_LiDAR contain three feature maps at different scales: low-level, mid-level, and high-level features. The lower-level feature maps contain more information about image texture, color, edges, etc., whereas the higher-level feature maps contain more semantic information. The two branches are independent of each other in the feature extraction process, and no information interference occurs between them.
EfficientRep was proposed by Li et al. [40] based on the idea of hardware-aware neural network design. They introduced the RepVGG structure into the model to further improve accuracy while speeding up model inference. Its most basic building block is the RepVGGBlock, whose structure in the training phase can be expressed as follows:

O = Add(Bn(Conv1(I)), Bn(Conv3(I)), Bn(I)) (6)

where I and O denote the input feature map and output feature map during model training, respectively; Conv1 and Conv3 denote the 1×1 and 3×3 convolution operations, respectively; Bn denotes batch normalization; and Add denotes the summation operation. This network structure decouples the training and inference phases through a structural re-parameterization technique, which allows more efficient use of computationally intensive hardware and achieves a better balance between accuracy and speed. In the model inference phase, all the parallel layers are converted into a single Conv3 module by a fusion strategy, whose structure can be expressed as follows:

O = Bn(Conv3(I)) (7)

where I and O denote the input feature map and output feature map during model inference, respectively; Conv3 denotes the 3×3 convolution operation; and Bn denotes batch normalization. The RepBlock module is formed by several RepVGGBlocks in series. The structure of a RepBlock consisting of three RepVGGBlock modules can be represented as follows:

O = Rvb_oc^oc(Rvb_oc^oc(Rvb_ic^oc(I))) (8)

where I and O denote the input feature map and the output feature map, respectively; Rvb_oc^oc indicates a RepVGGBlock whose number of input channels equals its number of output channels, and Rvb_ic^oc indicates a RepVGGBlock whose number of input channels differs from its number of output channels.
tain three feature maps at different scales: low-level features, different layers of the MM-Net network backbone. We try to
mid-level features, and high-level features. The lower-level insert this fusion module in the high, medium, and low levels of
feature maps contain more information about image texture, the two-branch feature extraction module. We want the model
color, edges, etc. The higher-level feature map contains more to be able to adaptively select features of interest for fusion
information about semantics. The two branches are independent after extracting multiple levels of feature maps for two different
of each other in the feature extraction process and no information modal data, rather than simply adding the features together. The
interference occurs. EfficientRep was proposed by Li et al. [40] structure of the single-level fusion module is shown in Fig. 4.
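The article describes the fusion module only at the level of Fig. 4 (a channel-attention-based, adaptively weighted combination of the two branches' feature maps). The sketch below is therefore only one plausible squeeze-and-excitation-style realization of that idea [43], not the exact module used in MM-Net; the reduction ratio of 16 and the projection back to the original channel count are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Fuse same-scale RGB and LiDAR feature maps with channel attention.

    The two maps are concatenated along the channel axis, per-channel weights
    are predicted from globally pooled statistics (squeeze-and-excitation style),
    and a 1x1 convolution projects the reweighted maps back to the original
    channel count. A sketch, not the exact MM-Net module.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                  # squeeze: B x 2C x 1 x 1
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                                             # excitation: channel weights
        )
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, lidar_feat):
        x = torch.cat([rgb_feat, lidar_feat], dim=1)                  # B x 2C x H x W
        x = x * self.attn(x)                                          # reweight each channel
        return self.project(x)                                        # back to B x C x H x W

# One fusion module per level (low, mid, high); here the mid level as an example.
fuse_mid = ChannelAttentionFusion(channels=64)
out = fuse_mid(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80))
```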
TABLE II. Comparison of Detection Accuracy of Different Models.
Fig. 6. Loss curve during model training. (a) Class_loss. (b) Iou_loss.
C. Experimental Settings

The dataset is divided into different acquisition scenarios to conduct experiments. The training set includes data from five daytime scenes and six nighttime scenes, with a total of 1977 image pairs. The test set includes data from two daytime scenes and four nighttime scenes, with a total of 1054 image pairs. The experiments in this article are conducted on a workstation equipped with an Nvidia RTX 3090 GPU, with models implemented in PyTorch (version 1.10.1). When training our own model, stochastic gradient descent is chosen as the optimizer, and the batch size is set to eight. All experiments are repeated three times, and the results are then averaged. The model is evaluated according to three different COCO evaluation metrics, AP, AP (IoU = 0.5), and AP (IoU = 0.75), referred to as AP, AP50, and AP75, respectively.
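These three metrics are the standard COCO box-detection metrics. Assuming the ground truth and detections have been exported to COCO-format JSON files (the file names below are hypothetical), they can be computed with pycocotools as in this sketch.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names: COCO-style annotations and detection results.
coco_gt = COCO("pedestrian_test_annotations.json")
coco_dt = coco_gt.loadRes("mmnet_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] = AP (averaged over IoU 0.50:0.95), stats[1] = AP50, stats[2] = AP75.
ap, ap50, ap75 = evaluator.stats[0], evaluator.stats[1], evaluator.stats[2]
print(f"AP: {ap:.3f}  AP50: {ap50:.3f}  AP75: {ap75:.3f}")
```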
D. Analysis of Experimental Results

Fig. 6 shows the decline of the loss during model training. Two models are plotted in the figure. The first one is the baseline model: the original two-branch network is chosen as the baseline (i.e., the features extracted from the two branches are directly summed element by element). The second one is MM-Net (i.e., the multilevel feature-level data fusion module designed in this research is used). The loss values of both models decrease steadily with the number of iterations, and no dramatic fluctuations occur. It is noteworthy that the loss values of MM-Net drop faster than those of the baseline model after 15 rounds of training. This indicates that the convergence speed of model training is also accelerated after adding the multilevel feature-level data fusion module.

To test the detection performance of MM-Net and the effectiveness of adding LiDAR information, a variety of detection models are selected for comparison. The models utilized for comparison include six single-mode models and four multimode models [8], [21], [45], [46]. Table II shows the detection accuracy of the different models. The AP50 metric is more commonly used in the field of target detection [47], [48], [49], [50]; moreover, the more stringent AP75 and AP metrics are also used in the validation. Fig. 7 visualizes the detection accuracy of the different models. Two models, TPH-YOLOv5 [45] and Cascade RCNN [46], utilize only single-mode RGB images as input. The other models are available in two versions: single-mode (using only RGB images) and multimodal (using both RGB images and LiDAR information).

Fig. 7. Comparison of detection accuracy of different models. (a) AP50. (b) AP75. (c) AP. (d) P–R curves.
Fig. 8. Detection results for several models using multimodal data. The first row shows the detection results of MM-Net, and the second row shows the detection results of the improved model based on Faster RCNN. The third, fourth, and fifth rows show the detection results of the three versions of MS-YOLO (S, M, and L), respectively. The faces have been blurred. (a) Day. (b) Night.

Our dataset is divided into two scenes, daytime and nighttime. The detection accuracy of each model is higher in the daytime scenes than in the nighttime scenes. The nighttime scenes are severely disturbed by the lighting conditions, which poses a great challenge to the detection models. As can be seen from the table, the TPH-YOLOv5 and Cascade RCNN models perform poorly on the overall dataset, especially on the nighttime data. This is mainly because these two models only use RGB images as input. As shown in Fig. 8, RGB images are usually very blurred in night scenes, which leads to failures of the detection model. In addition, several other models improve their detection accuracy to different degrees after adding LiDAR information, especially in the nighttime scenes. This indicates that the LiDAR information can supplement the RGB images with effective features. The RGB images are severely affected by the illumination in the night scenes, whereas LiDAR, as an active detection method, is not greatly affected. The MM-Net network using multimodal data achieves the highest AP50, AP75, and AP on the overall dataset (including both day and night scenes) and satisfactory detection results in both the day and night subscenes. MM-Net achieves an AP50 of 72.9% on the full dataset, which is 6.5%, 9.9%, 5.2%, and 3.1% higher than No.4, No.6, No.8, and No.10, respectively. AP75 is more stringent than AP50 and better reflects the target localization ability of the model. MM-Net also performs well on the AP75 metric, where it is 17.4%, 3.2%, 4.2%, and 7.6% higher than No.4, No.6, No.8, and No.10, respectively. The AP metric integrates the detection performance of the model under different IoU thresholds and is the most rigorous accuracy evaluation metric. MM-Net also performs well in this metric, scoring 9.3%, 5.1%, 3.7%, and 4.6% higher than the other four models, respectively. In the daytime scenario, MM-Net using multimodal data outperforms the other four models in all three accuracy evaluation metrics, although it is lower than MM-Net using only single-mode data in the AP75 and AP metrics. However, MM-Net using only single-mode data could not achieve satisfactory detection results in the nighttime scenes, whereas MM-Net with multimodal data also achieves top-ranked detection performance in all three evaluation metrics in the night scenes. Combined with the above analysis, MM-Net using multimodal data performs the best. This can mainly be attributed to the reasonable model design of MM-Net. First, the designed two-branch feature extraction module fully extracts the features of both modal data, and the two branches do not cause information interference between each other. It has been shown that fusing data from different modalities at an early stage may have a bad effect on model performance [51]. Second, MM-Net adaptively fuses the features from the two modalities at three different levels (high, medium, and low), taking into account both the data modality and the fusion level.

In addition, Table II also lists the running speed and number of parameters of each detection model. It is clear from the table that MM-Net achieves high detection accuracy while maintaining a relatively fast detection speed. MM-Net is the fastest of all models that utilize multimodal data. This can mainly be attributed to the advanced feature extraction network: MM-Net's two-branch feature extraction module uses structural re-parameterization to decouple the training and inference phases, achieving a better balance between accuracy and speed. We also tested the inference speed of MM-Net on the Nvidia Jetson AGX Xavier and Orin platforms. Thanks to the advanced architecture and performance of the Nvidia Jetson series products, the inference speed of the detection model is approximately 0.0750 s per frame on Xavier, equivalent to 13.3 frames per second (fps), and approximately 0.0374 s per frame on Orin, equivalent to 26.7 fps. These results indicate that the model's inference speed on Xavier and Orin exceeds the 10 fps sampling rate of the VLP-16 LiDAR, thus meeting the real-time requirements. The precision–recall (P–R) curves of the different detection models are shown in Fig. 7(d). In a P–R plot, the more a curve bulges toward the upper right corner, the better the detection performance of the corresponding model. The P–R curve of MM-Net lies closer to the upper right corner, and the area it encloses with the coordinate axes is larger than that of the other detection models.
the overall dataset (including both day and night scenes) and Fig. 8 shows the detection results of some models by utilizing
satisfactory detection results in both day and night subscenes. multimodal data. Since the dataset is collected on the campus,
MM-Net achieves an AP50 metric of 72.9% on the full dataset, the pedestrian distribution of it is not very dense. The dataset
which is 6.5%, 9.9%, 5.2%, and 3.1% higher than No.4, No.6, is richer in scenes and contains data collected under different
No.8, and No.10, respectively. AP75 is more stringent than periods. This dataset poses a great challenge for detection mod-
AP50 and better reflects the target localization ability of the els. In the daytime scenes, the RGB images provide richer color
model. MM-Net also performs well on the AP75 metric, which texture information of the pedestrians, and the LiDAR point
Fig. 10. Detection results for several models using multimodal data. The first row shows the detection results of MM-Net, and the second row
shows the detection results of the improved model based on Faster RCNN. The third, fourth, and fifth rows show the detection results of the three
versions of MS-YOLO (S, M, and L), respectively.
TABLE V. Ablation Experiments on the KITTI Dataset.

shows the performance of the different detection models on the three accuracy metrics more visually in the form of a bar chart. Both the improved Faster RCNN-based model and MS-YOLO (including the S, M, and L versions) show different degrees of improvement in detection accuracy after using multimodal data. This further demonstrates the effectiveness of adding LiDAR information. LiDAR detects pedestrian targets by transmitting and receiving laser signals that are virtually unaffected by lighting conditions, so the LiDAR point cloud data provide valuable auxiliary information, which further improves the detection performance of those single-mode models. In addition, compared with the other detection models, MM-Net using multimodal data achieves the best accuracy in the AP50, AP75, and AP metrics. MM-Net achieves 91.6% in the AP50 metric on this dataset, which is 10.7%, 8.3%, 8.5%, and 10.7% higher than No.4, No.6, No.8, and No.10, respectively. In addition, MM-Net achieves 52.7% and 51.0% on AP75 and AP, which are 1.3% and 2.9% higher than the second place, respectively. This further demonstrates the applicability of MM-Net. The P–R curves for the different models are shown in Fig. 9(b), and the P–R curve of MM-Net is the best. Fig. 10 shows the detection results of several detection models utilizing multimodal data. Some pedestrian targets affected by lighting conditions are better detected by the models using multimodal data. Above all, by taking advantage of multiscale feature fusion, MM-Net is able to achieve accurate extraction of pedestrian targets in different scenes and is more sensitive to smaller-size pedestrian targets. In addition, ablation experiments were conducted to demonstrate the effectiveness of the designed fusion module. The results of these experiments are presented in Table V. Based on the table, it can be observed that, overall, the accuracy of the seven models with the fusion module is higher than that of the baseline model.

V. CONCLUSION

In this article, we describe our research work on the core problem of fusing RGB cameras and sparse LiDAR for pedestrian detection. Our research work is carried out in two aspects: hardware platform construction and software algorithm research. Specifically, for the hardware platform, we design and build a data acquisition platform using hardware devices including an IMX-307 RGB camera, a VLP-16 sparse LiDAR, and a Xavier development board. We deploy the hardware with reasonable connections, correct and calibrate the two different sensors, and design a soft-synchronous data acquisition scheme. In terms of software algorithms, first, we use the data acquisition platform to collect pedestrian datasets from multiple scenes and form a multimodal pedestrian detection dataset after preprocessing and manual labeling. Second, we design the two-branch MM-Net. We build a dual-branch feature extraction module for efficiently extracting features of both data modes simultaneously. To fully consider the differences between the features of the different modal data, we construct a multilevel feature-level data fusion module and insert it into three different positions of the model: high, middle, and low. We carry out comparison experiments, and MM-Net shows the best detection results compared with the comparison models. We also validate the effectiveness of the designed multilevel feature-level data fusion module and further explore the reasonable fusion positions. Finally, to further test the performance of MM-Net, we conduct comparison experiments using the well-known KITTI dataset. The experimental results illustrate the good performance of MM-Net. Further improving the detection accuracy of the model and exploring more reasonable fusion strategies will be the focus of future research work.

REFERENCES

[1] D. Parekh et al., “A review on autonomous vehicles: Progress, methods and challenges,” Electronics, vol. 11, no. 14, 2022, Art. no. 2162.
[2] X. Li et al., “A unified framework for concurrent pedestrian and cyclist detection,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 2, pp. 269–281, Feb. 2016.
[3] D. Ridel, E. Rehder, M. Lauer, C. Stiller, and D. Wolf, “A literature review on the prediction of pedestrian behavior in urban scenarios,” in Proc. 21st Int. Conf. Intell. Transp. Syst., 2018, pp. 3105–3112.
[4] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, “Scale-aware fast R-CNN for pedestrian detection,” IEEE Trans. Multimedia, vol. 20, no. 4, pp. 985–996, Apr. 2018.
[5] W.-Y. Hsu and W.-Y. Lin, “Adaptive fusion of multi-scale YOLO for pedestrian detection,” IEEE Access, vol. 9, pp. 110063–110073, 2021.
[6] S. Zhang, J. Yang, and B. Schiele, “Occluded pedestrian detection through guided attention in CNNs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6995–7003.
[7] G. Brazil, X. Yin, and X. Liu, “Illuminating pedestrians via simultaneous detection & segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4960–4969.
[8] Y. Song, Z. Xie, X. Wang, and Y. Zou, “MS-YOLO: Object detection based on YOLOv5 optimized fusion millimeter-wave radar and machine vision,” IEEE Sensors J., vol. 22, no. 15, pp. 15435–15447, Aug. 2022.
[9] S. Iftikhar, Z. Zhang, M. Asim, A. Muthanna, A. Koucheryavy, and A. A. Abd El-Latif, “Deep learning-based pedestrian detection in autonomous vehicles: Substantial issues and challenges,” Electronics, vol. 11, no. 21, 2022, Art. no. 3551.
[10] L. Wang, J. Yan, L. Mu, and L. Huang, “Knowledge discovery from remote sensing images: A review,” Wiley Interdiscipl. Rev.: Data Mining Knowl. Discov., vol. 10, no. 5, 2020, Art. no. e1371.
[11] M. M. Islam, A. R. Newaz, and A. Karimoddini, “Pedestrian detection for autonomous cars: Inference fusion of deep neural networks,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 23358–23368, Dec. 2022.
[12] W. Han et al., “Methods for small, weak object detection in optical high-resolution remote sensing images: A survey of advances and challenges,” IEEE Geosci. Remote Sens. Mag., vol. 9, no. 4, pp. 8–34, Dec. 2021.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 1–9.
[15] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
[16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 5105–5114.
[17] D. J. Yeong, G. Velasco-Hernandez, J. Barry, and J. Walsh, “Sensor and sensor fusion technology in autonomous vehicles: A review,” Sensors, vol. 21, no. 6, 2021, Art. no. 2140.
[18] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3D object detection from RGB-D data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 918–927.
[19] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1907–1915.
[20] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 1–8.
[21] X. Gao, G. Zhang, and Y. Xiong, “Multi-scale multi-modal fusion for object detection in autonomous driving based on selective kernel,” Measurement, vol. 194, 2022, Art. no. 111001.
[22] J. Tang, L. Jin, Z. Li, and S. Gao, “RGB-D object recognition via incorporating latent data structure and prior knowledge,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1899–1908, Nov. 2015.
[23] F. Pala, R. Satta, G. Fumera, and F. Roli, “Multimodal person reidentification using RGB-D cameras,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 4, pp. 788–799, Apr. 2016.
[24] W. Zhou, Y. Pan, J. Lei, L. Ye, and L. Yu, “DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24540–24549, Dec. 2022.
[25] D. Guan, Y. Cao, J. Yang, Y. Cao, and M. Y. Yang, “Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection,” Inf. Fusion, vol. 50, pp. 148–157, 2019.
[26] S. Shi, X. Wang, and H. Li, “PointRCNN: 3D object proposal generation and detection from point cloud,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 770–779.
[27] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4490–4499.
[28] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, 2018, Art. no. 3337.
[29] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12697–12705.
[30] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3354–3361.
[31] J. Geyer et al., “A2D2: Audi autonomous driving dataset,” 2020, arXiv:2004.06320.
[32] “xpilot.” Accessed: Jan. 5, 2024. [Online]. Available: https://www.heyxpeng.com/intelligent/xpilot
[33] Velodyne, “VLP-16.” Accessed: Jan. 5, 2024. [Online]. Available: https://velodynelidar.com/products/puck/
[34] “NVIDIA Jetson AGX Xavier.” Accessed: Jan. 5, 2024. [Online]. Available: https://www.nvidia.cn/autonomous-machines/embedded-systems/jetson-agx-xavier/
[35] “NIO.” Accessed: Jan. 5, 2024. [Online]. Available: https://www.nio.com/
[36] C. Ricolfe-Viala and A.-J. Sanchez-Salmeron, “Lens distortion models evaluation,” Appl. Opt., vol. 49, no. 30, pp. 5914–5928, 2010.
[37] S. Xie, D. Yang, K. Jiang, and Y. Zhong, “Pixels and 3-D points alignment method for the fusion of camera and LiDAR data,” IEEE Trans. Instrum. Meas., vol. 68, no. 10, pp. 3661–3676, Oct. 2019.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–2125.
[39] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8759–8768.
[40] C. Li et al., “YOLOv6: A single-stage object detection framework for industrial applications,” 2022, arXiv:2209.02976.
[41] Z. Li, Y. Sun, L. Zhang, and J. Tang, “CTNet: Context-based tandem network for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9904–9917, Dec. 2022.
[42] X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 510–519.
[43] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[44] A. Pfeuffer and K. Dietmayer, “Optimal sensor data fusion architecture for object detection in adverse weather conditions,” in Proc. 21st Int. Conf. Inf. Fusion, 2018, pp. 1–8.
[45] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, “TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2778–2788.
[46] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6154–6162.
[47] T. Ye, W. Qin, Z. Zhao, X. Gao, X. Deng, and Y. Ouyang, “Real-time object detection network in UAV-vision based on CNN and transformer,” IEEE Trans. Instrum. Meas., vol. 72, pp. 1–13, 2023.
[48] W. Han et al., “A context-scale-aware detector and a new benchmark for remote sensing small weak object detection in unmanned aerial vehicle images,” Int. J. Appl. Earth Observation Geoinformation, vol. 112, 2022, Art. no. 102966.
[49] C. Zheng et al., “Multiscale fusion network for rural newly constructed building detection in unmanned aerial vehicle imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 9160–9173, Sep. 27, 2022.
[50] W. Han et al., “A survey on methods of small weak object detection in optical high-resolution remote sensing images,” IEEE Geosci. Remote Sens. Mag., vol. 9, no. 4, pp. 8–34, Dec. 2021.
[51] Z. Xinyu, Z. Zhen-Hong, L. Zhi-Wei, L. Hua-Ping, and L. Jun, “Deep multi-modal fusion in object detection for autonomous driving,” CAAI Trans. Intell. Syst., vol. 15, no. 4, pp. 758–771, 2020.

Haoran Xu is currently working toward the Ph.D. degree in geoscience information engineering with China University of Geosciences, Wuhan, China. His research interests include Internet of Things, unmanned systems, and data analysis.
Shuo Huang received the B.E. degree in electronic engineering from Fudan University, Shanghai, China, in 2007. He is currently an Engineering Director with China FAW (Nanjing) Technology Development Company Ltd., Nanjing, China. His research interests include autonomous driving and edge computing.

Xiaodao Chen received the B.E. degree in telecommunication from the Wuhan University of Technology, Wuhan, China, in 2006, and the M.S. degree in electrical engineering and the Ph.D. degree in computer engineering from Michigan Technological University, Houghton, MI, USA, in 2008 and 2012, respectively. He is an Associate Professor with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include high performance computing and CAD design for cyber-physical systems.