
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 20, NO. 5, MAY 2024

Deep Learning-Based Pedestrian Detection Using RGB Images and Sparse LiDAR Point Clouds

Haoran Xu, Shuo Huang, Yixin Yang, Xiaodao Chen, and Shiyan Hu, Senior Member, IEEE

Abstract—One of the fundamental tasks in autonomous driving is environment perception for pedestrian detection. Fused pedestrian detection using camera and light detection and ranging (LiDAR) information is challenging because the alignment, compensation, and fusion of the two data modes are difficult, and the simultaneous acquisition of data from two different modalities adds further difficulty. This work addresses these challenges from both the hardware and the software dimensions. First, a multimodal pedestrian data acquisition platform is designed and constructed using an RGB camera, a sparse LiDAR, and a data processing module, covering hardware connection and deployment, sensor distortion correction and joint calibration, and data acquisition synchronization. Pedestrian data from multiple scenes are then collected with this platform to produce a dedicated multimodal pedestrian detection dataset. Further, a two-branch multimodal multilevel fusion pedestrian detection network (MM-Net) is proposed, which includes a two-branch feature extraction module and a feature-level data fusion module. Extensive experiments are performed on the multimodal pedestrian detection dataset and the KITTI dataset for comparison with existing models. The experimental results demonstrate the superior performance of MM-Net.

Index Terms—Data fusion, pedestrian detection, RGB camera, sparse light detection and ranging (LiDAR).

Manuscript received 29 July 2023; revised 9 November 2023; accepted 30 December 2023. Date of publication 31 January 2024; date of current version 6 May 2024. This work was supported by the National Natural Science Foundation of China under Grant 42242105 and Grant 42311530065. Paper no. TII-23-2864. (Corresponding author: Xiaodao Chen.)

This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the School of Computer Science, China University of Geosciences (Wuhan).

Haoran Xu, Yixin Yang, and Xiaodao Chen are with the School of Computer Science, China University of Geosciences, Wuhan 430078, China (e-mail: [email protected]; [email protected]; [email protected]).

Shuo Huang is with China FAW (Nanjing) Technology Development Company Ltd., Nanjing 211106, China (e-mail: [email protected]).

Shiyan Hu is with the School of Electronics and Computer Science, University of Southampton, SO17 1BJ Southampton, U.K. (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TII.2024.3353845.

Digital Object Identifier 10.1109/TII.2024.3353845

I. INTRODUCTION

PERCEPTION is one of the most fundamental tasks in autonomous driving [1]. The purpose of perception is to obtain information about the surroundings of the vehicle using a wide variety of onboard sensors. Vulnerable road users, such as pedestrians, are highly susceptible to injury because of their high proportion and the absence of special protective devices [2]. According to official statistics from the World Health Organization, over 270 000 pedestrians die in road traffic accidents every year [3]. Pedestrian detection is therefore an important part of perception [4], [5], [6], [7]: it can assist vehicles in reducing traffic accidents.

From a methodological point of view, pedestrian detection methods can be classified into two categories, traditional approaches and deep learning-based approaches [8]. Traditional detection approaches extract pedestrians in images using handcrafted features such as histogram of oriented gradients, local binary pattern, and scale-invariant feature transform features [9]. Deep learning-based detection methods are becoming increasingly popular [10], [11]. They can efficiently extract pedestrian features by stacking multiple feature extraction layers, and they take advantage of powerful computing resources to achieve better detection quality. Recently, a large number of research works with quality solutions have been proposed based on deep learning techniques. These works can be divided into two categories: one-stage and two-stage methods [12]. One-stage methods incorporate the entire detection process into a single network, and the "you only look once" (YOLO) family [13] is one of the representative works in this category. Two-stage methods first generate regions of interest and then classify and regress these regions for targets, and the region-based convolutional neural network (RCNN) family [14] is one of the representative works in this category. Li et al. [4] introduced the philosophy of divide and conquer into fast RCNN to develop a scale-aware fast R-CNN (SAF R-CNN). Hsu et al. [5] utilized the divide-and-conquer idea in YOLO to fuse image ratio information and effectively integrate pedestrian features of various resolutions. Zhang et al. [6] employed the attention mechanism and explored the effect of different attention mechanisms. Brazil et al. [7] added a semantic segmentation task to the fast RCNN to form a framework that synchronizes detection and segmentation, which improves the accuracy of pedestrian detection. In general, most recent research works have improved the original target detection models for specific pedestrian detection challenges.


On the other hand, data sources play an essential role in pedestrian detection. There are detection methods based on a single data source [4], [5], [6], [7], [15], [16] as well as methods based on multiple data sources [8], [17], [18], [19], [20], [21]. Common data sources include RGB image sensors [4], [5], [6], [7], RGB-depth (RGB-D) cameras [22], [23], light detection and ranging (LiDAR) sensors [15], [16], and RGB-thermal infrared (RGB-T) sensors [24], [25]. It is important to note that RGB imaging is a passive imaging technique: an RGB camera captures the light reflected from the surface of an object and converts the light signals into digital signals that are then stored. The image quality is therefore heavily dependent on external factors such as weather and lighting, and the performance of RGB image-based detection methods can degrade significantly in poor lighting conditions at night or in foggy environments. LiDAR, in contrast, is an active detection technology. It provides distance and reflected-intensity information measured from an object to the sensor through multiple embedded transmitter–receivers, and it is much less affected by environmental factors such as light and weather. However, LiDAR data are sparse compared with RGB camera data; they lack semantic information about the surrounding environment and do not model the detailed appearance of objects well.

Target detection methods based on LiDAR point cloud data can be classified into direct point-based methods, voxel-based methods, and projection-based methods. In 2017, Qi et al. [15], [16] successively proposed PointNet and PointNet++, which applied deep neural networks (DNNs) directly to point cloud data for the first time and achieved good results on classification and segmentation tasks. Subsequently, Shi et al. [26] proposed PointRCNN, a two-stage detection model applied directly to raw point cloud data. The voxel-based approach first assigns the disordered point cloud to voxels by spatial partitioning; the voxels are then encoded, and CNNs are used to extract features and predict target boxes. VoxelNet is a pioneering work in this area [27]. It uses a feature learning network to extract the features of each point within a voxel, obtains voxel features by pooling, and then uses a CNN for feature extraction and target prediction. The sparsely embedded convolutional detection (SECOND) network improves on VoxelNet by taking the sparsity of the point cloud distribution into account and speeds up model convergence [28]. Similar work has been conducted with PointPillars [29], among others. The projection-based methods are summarized below. Although good target localization accuracy can be obtained, dense high line-beam LiDAR point cloud data lead to a huge computational effort and make deployment difficult. In addition, LiDAR point data lack semantic information and do not capture dense appearance information about the object.

Both sensors have their advantages and disadvantages. Researchers have therefore used them in combination and proposed multimodal detection methods with good results. Qi et al. [18] proposed a method for joint target detection using RGB images and LiDAR point clouds. This approach processes point clouds directly with neural networks, which incurs a large computational overhead, and once the illumination conditions change drastically, its strategy of generating 2-D regions from RGB images can fail. Chen et al. [19] extracted features from RGB images and from the top view and front view of point cloud data simultaneously, and then fused them for object detection. Ku et al. [20] designed a two-branch target detection network for point cloud data and RGB images, in which the RGB branch extracts useful features from ordinary RGB images and the point cloud branch extracts useful features from bird's-eye view (BEV) maps of the point cloud data; fusion is then performed to obtain the target detection box. Gao et al. [21] designed a multimodal dual-branch target detection network based on the faster RCNN detector and introduced a channel attention mechanism to fully extract feature information from RGB images and point cloud projection maps. The three works described above transform the LiDAR data, including converting the point cloud into a front view, a top view, or a depth map, to improve detection speed. However, they are all based on high line-beam LiDAR point cloud datasets and do not specifically target pedestrian detection. Recently, Song et al. [8] proposed MS-YOLO, an improved two-branch detection network based on the YOLO V5 model. This model also uses two separate feature extraction branches to extract features from both data modes simultaneously and then performs feature-level data fusion. Its particular feature is that millimeter-wave LiDAR is used and its data are converted into pseudochannel images. However, the feature maps of the two data modes are directly concatenated without carefully considering the association and the difference between the feature maps of the two data sources. In addition, several of these studies have been conducted only at the level of specific detection methods, with little work done on aspects such as simultaneous data acquisition.

To address these issues, this article focuses on both designing a data acquisition platform and developing detection models. The details are as follows. A multimodal data acquisition platform is designed and built. The platform is configured with an RGB camera and a sparse LiDAR to capture the road surroundings. A data soft-synchronous acquisition scheme is also designed and implemented; the platform achieves synchronous acquisition between the two sensors using signal triggering. This data acquisition platform is used to collect pedestrian data for several scenes over multiple periods. Based on the collected data, a multimodal dataset for pedestrian detection is produced. The dataset contains RGB images and LiDAR data that are matched to each other. Taking into account the data characteristics and performance requirements, a two-branch multimodal multilevel fusion pedestrian detection network (MM-Net) based on an advanced single-stage detector and an attention mechanism is proposed. There are two innovative parts in the proposed model. One is the design of a two-branch network backbone for multimodal data feature extraction. The other is the use of a channel attention mechanism in the synchronized fusion of feature maps from different modal data. Extensive comparison experiments are performed on two datasets, our dataset and the KITTI dataset [30]. The experimental results show that MM-Net outperforms advanced comparison models in terms of detection accuracy and speed. In addition, we conduct an exploratory study on the fusion strategy to further validate the effectiveness of the designed model structure and fusion module.


The main contributions of this article are as follows:
1) A pedestrian data acquisition platform equipped with a sparse LiDAR (only 16 line beams) and an RGB camera is designed and built. A soft-synchronous data acquisition scheme is designed and deployed on the platform to achieve the acquisition of road environment data.
2) A multimodal pedestrian detection dataset is captured and produced. To the best of our knowledge, this is the first dataset that uses sparse LiDAR (only 16 line beams) and focuses on pedestrian detection. This dataset will be made publicly available to researchers.
3) Focusing on the data characteristics and performance requirements, a one-stage deep learning-based multimodal pedestrian detection model is proposed. The detection model can fully extract effective features from the different modal data and can fuse features at different levels (high, medium, and low) to exploit the favorable information of each modality for pedestrian detection.
4) Extensive experiments are conducted on two datasets. The results demonstrate the challenging nature of the homemade dataset, as well as the superior performance of MM-Net.

The rest of this article is organized as follows. Section II describes the overall architecture of the data acquisition platform. Section III introduces the proposed MM-Net and the feature-level fusion modules. Section IV describes in detail the construction of the data acquisition platform and the experiments. Finally, Section V concludes this article.

II. DATA ACQUISITION PLATFORM

This section introduces the data acquisition platform, consisting of a computing unit and multisource sensors, which enables data acquisition, transmission, and processing.

A. Platform Architecture

Fig. 1. Platform architecture.

Automotive grade components are chosen for this platform in order to mimic an in-vehicle environment, both in terms of the sensors and in the way that sensor data are transmitted and processed. The data acquisition platform includes three major components, as shown in Fig. 1: the data acquisition module (camera and LiDAR), the data transmission module, and the data processing module.

An RGB camera is used as one of the two major data acquisition components in this platform, and a data transmission standard from the automotive industry (FPD-Link III) is adopted. Camera raw data are pipelined onto the CSI-2 interface, then serialized and transmitted through coaxial cable in order to maintain signal integrity and reliability and thereby extend the signal transmission distance. The data processing module then deserializes the camera data and consumes it.

Dense line-beam LiDAR systems are associated with high costs, which has become one of the bottlenecks for large-scale commercialization of autonomous driving. Therefore, many companies, such as Audi [31] and Xiaopeng [32], have used low-cost 16-beam LiDAR for autonomous driving research and development. Similarly, due to cost considerations, the other major sensor deployed in this platform is the 16-beam LiDAR produced by Velodyne [33], which has been widely used in autonomous driving fleets operated by high-tech companies such as Baidu. This particular model transmits sparse LiDAR data through automotive Ethernet, also called 100BASE-T1, over shielded twisted pairs. An interface box sits between the sensor and the data processing module for Ethernet protocol conversion, so that a local area network (LAN) link can be established and LiDAR data can be streamed.

The Xavier kit [34] serves as the data processing module owing to its adequate computing power and mature tool chains; this particular chip has already been installed on production passenger cars by the EV OEM NIO [35], precisely in the field of perception for autonomous driving. The Xavier kit plays the role of the preprocessing unit: it connects all sensors, synchronizes and collects all the sensor data, and performs aberration correction and joint calibration. In addition, the actual model inference is performed on a high-performance computing platform (NVIDIA RTX 3090 GPU).

B. Aberration Correction and Joint Calibration

1) Aberration Correction: The deviation from perfect optics is called aberration. More specifically, manufacturing inaccuracy and assembly deviations cause aberration phenomena, i.e., a deviation of a ray from the behavior predicted by simplified geometric rules. Radial aberration and tangential aberration [36] are the two that need to be considered, formulated, and eventually corrected so that object stretching and distortion around the image disappear. The correction equations are as follows [36]:

$$x_0 = x\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2p_1 xy + p_2\left(r^2 + 2x^2\right) \quad (1)$$

$$y_0 = y\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2p_2 xy \quad (2)$$

where p1 and p2 are the tangential distortion correction parameters to be solved; k1, k2, and k3 are the radial distortion correction parameters; (x, y) is the normalized coordinate in the undistorted image coordinate system; (x0, y0) is the coordinate in the image coordinate system after distortion; and r^2 = x^2 + y^2. Aberration correction solves for the normal coordinates of the image by fitting the five aberration parameters (k1, k2, k3, p1, and p2) in the equations above and compensating for their impact.
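The correction in (1) and (2) follows the standard radial/tangential (Brown-Conrady) camera model, so once the five parameters are known it can be applied with off-the-shelf tooling. The sketch below uses OpenCV; the intrinsics and coefficients are placeholder values, not the platform's calibrated ones.

```python
# Illustrative sketch (not code from the article): applying the radial/tangential
# correction of (1)-(2) with OpenCV. The intrinsics and the five distortion
# parameters below are placeholders; in practice they come from the checkerboard
# calibration described in Section IV-A.
import cv2
import numpy as np

# Hypothetical intrinsic matrix for a 1920x1080 sensor.
K = np.array([[1.0e3, 0.0, 960.0],
              [0.0, 1.0e3, 540.0],
              [0.0, 0.0, 1.0]])
# OpenCV orders the coefficients as (k1, k2, p1, p2, k3).
dist = np.array([-0.35, 0.12, 1.0e-3, -5.0e-4, -0.02])

raw = cv2.imread("frame.jpg")              # distorted RGB frame (hypothetical file)
undistorted = cv2.undistort(raw, K, dist)  # compensates the distortion of (1) and (2)
cv2.imwrite("frame_undistorted.jpg", undistorted)
```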


Fig. 2. Schematic diagram of the joint calibration.

2) Joint Calibration: A transformation relationship needs to be solved because sensors of different types and mountings are deployed on the platform, as shown in Fig. 1. With the help of the calibration plate shown in Fig. 2, the rotation and translation between the camera and LiDAR spatial coordinate systems can be correlated and calculated, and the joint calibration can thus be determined [17], with the conversion equation [37]:

$$\begin{bmatrix} P_{\mathrm{RGB}} \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P_{\mathrm{LiDAR}} \\ 1 \end{bmatrix} \quad (3)$$

where P_RGB denotes the coordinates of a point P in the camera coordinate system, P_LiDAR denotes the coordinates of the same point in the LiDAR coordinate system, and R and T represent the rotation matrix and translation vector between the two coordinate systems, respectively. Joint calibration is accomplished by solving for these parameters.
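A minimal sketch of applying (3) is given below, assuming R and T are already available (for example, the values reported in Section IV-A); the function name and dummy values are illustrative only.

```python
# Illustrative sketch (not code from the article): applying the extrinsic
# transform of (3) to move LiDAR points into the camera coordinate system.
import numpy as np

def lidar_to_camera(points_lidar: np.ndarray, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """points_lidar: (N, 3) xyz in the LiDAR frame -> (N, 3) xyz in the camera frame."""
    return points_lidar @ R.T + T   # P_RGB = R * P_LiDAR + T for each point

# Example with dummy values (identity rotation, 10 cm lateral offset).
R = np.eye(3)
T = np.array([0.10, 0.0, 0.0])
pts = np.array([[5.0, 1.0, 0.2], [12.0, -2.0, 0.5]])
print(lidar_to_camera(pts, R, T))
```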
C. Data Acquisition Synchronization

Sensor synchronization is another critical issue that needs to be addressed between different sensors. A synchronization scheme has been designed for data acquisition because the camera and the LiDAR have different sampling frequencies, 30 and 10 Hz, respectively. A pulse-per-second (PPS) analog signal is used as the trigger for simultaneous data acquisition. The synchronization workflow is shown in Fig. 1, where a two-step mechanism is established. A PPS signal is generated by the camera while capturing the first frame and transmitted to the data processing module, where the PPS is passed through and eventually reaches the LiDAR to trigger the start of laser scanning. This mechanism is repeated every second, so that the camera and the LiDAR are synchronized accordingly. Defective point cloud frames and poor-quality image data are manually filtered out to ensure temporal matching between camera and LiDAR data.
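Because the two sensors run at 30 and 10 Hz, an offline step is still needed to associate each LiDAR sweep with its closest camera frame. The sketch below assumes frames are named with microsecond Unix timestamps, as described in Section IV-B; the tolerance value is an assumption, not a figure from the article.

```python
# Illustrative sketch (an assumption, not the platform's exact tooling): pairing
# each 10 Hz LiDAR sweep with the closest 30 Hz camera frame by timestamp.
def pair_by_timestamp(camera_ts, lidar_ts, max_gap_us=50_000):
    """camera_ts, lidar_ts: sorted lists of integer microsecond timestamps.
    Returns (lidar_t, camera_t) pairs whose gap is within max_gap_us."""
    pairs, i = [], 0
    for lt in lidar_ts:
        # advance the camera pointer while the next frame is at least as close
        while i + 1 < len(camera_ts) and abs(camera_ts[i + 1] - lt) <= abs(camera_ts[i] - lt):
            i += 1
        if abs(camera_ts[i] - lt) <= max_gap_us:
            pairs.append((lt, camera_ts[i]))
    return pairs

print(pair_by_timestamp([0, 33_333, 66_666, 100_000], [1_000, 99_000]))
# [(1000, 0), (99000, 100000)]
```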


Fig. 3. Network structure of the MM-Net.

III. MM-NET FOR PEDESTRIAN DETECTION

This section details the proposed two-branch MM-Net. Section III-A introduces the overall structure of MM-Net and the design concept. The innovative design of MM-Net consists of two main parts, which are described in detail in Sections III-B and III-C, respectively. Fig. 3 shows the entire framework of MM-Net.

A. Model Overview

The proposed MM-Net model consists of three main parts: a two-branch feature extraction module (Backbone), an intermediate layer (Neck), and a detection layer (Detect). The Backbone module consists of two branches for processing the different modal data, each built from several advanced operations. The two branches are relatively independent: one extracts features from RGB images and the other extracts features from LiDAR point cloud data. As shown in Fig. 3, the extracted features cover three levels, high, medium, and low. Features at different levels contain different types of information: lower level features contain more information related to edges and textures, while higher level features contain more information related to semantics. Data fusion using deep learning techniques is still a difficult task, and researchers have proposed many different fusion strategies. MM-Net fuses features from the two modal data at three different feature levels. To fully fuse the feature information from the different modal data at different levels, an attention-based multilevel feature-level fusion module is designed and added to the three fusion parts of the Backbone, namely Fusion Block_1, Fusion Block_2, and Fusion Block_3 in Fig. 3. This fusion module takes into account the characteristics of both modal data and enables the model to autonomously select valid information from the features extracted by the two branches. The effective integration of multiscale features has been proven to be beneficial for target detection [38], [39]. MM-Net accomplishes the full integration of the different feature levels in the Neck module. The Neck module is also stacked from convolution and sampling operations. It includes an FPN structure on the left and a PAN structure on the right: the FPN transfers semantic information from the high-level feature map to the low-level feature map from the top down, and the PAN transfers information such as texture from the low level to the high level. The Neck module receives features from the Backbone module at the different levels and fully integrates them; MM-Net uses an advanced Rep-PAN [40] structure in the Neck module to accomplish effective feature fusion. The Detect module consists of convolution operations for obtaining the final detection results. It receives features from the Neck module and generates three sets of predictions at different scales, 20 × 20, 40 × 40, and 80 × 80. The efficient decoupled head used by MM-Net is an improved decoupled detection head structure that separates the classification and localization of targets into separate operations. Finally, some postprocessing operations (nonmaximum suppression, etc.) are performed to obtain the final detection results, which can be used to assist the vehicle control system in making decisions.

B. Two-Branch Feature Extraction Module

The recently proposed YOLO V6 integrates a large number of excellent CNN design ideas and performs well on the Common Objects in Context (COCO) public dataset [40]. Due to the real-time requirement of pedestrian detection, we improved the YOLO V6 target detection algorithm and designed a two-branch network structure for extracting features from both modalities, as shown in Fig. 3. One branch is used to extract features from RGB images, and the LiDAR branch is used to extract features from the point cloud data. Both branches have the same network structure, and both use a backbone network called EfficientRep to extract the effective features of their respective modal data. The module can be expressed as follows:

$$\mathrm{Output}_{\mathrm{RGB}} = \mathrm{EffRep}_{\mathrm{RGB}}(\mathrm{Input}_{\mathrm{RGB}}) \quad (4)$$

$$\mathrm{Output}_{\mathrm{LiDAR}} = \mathrm{EffRep}_{\mathrm{LiDAR}}(\mathrm{Input}_{\mathrm{LiDAR}}) \quad (5)$$

where EffRep_RGB denotes the EfficientRep module for RGB images; EffRep_LiDAR denotes the EfficientRep module for LiDAR point clouds; Input_RGB and Input_LiDAR denote the RGB images and LiDAR point clouds input to MM-Net, respectively; and Output_RGB and Output_LiDAR denote the outputs of the two branches, respectively. Both Output_RGB and Output_LiDAR contain three feature maps at different scales: low-level, mid-level, and high-level features. The lower-level feature maps contain more information about image texture, color, edges, etc., while the higher-level feature maps contain more semantic information. The two branches are independent of each other in the feature extraction process, so no information interference occurs. EfficientRep was proposed by Li et al. [40] based on the idea of hardware-aware neural network design. They introduced the RepVGG structure into the model to further improve accuracy while speeding up inference. The most basic module is the RepVGGBlock, whose structure in the training phase can be expressed as follows:

$$O = \mathrm{Add}\big(\mathrm{Bn}(\mathrm{Conv1}(I)),\ \mathrm{Bn}(\mathrm{Conv3}(I)),\ \mathrm{Bn}(I)\big) \quad (6)$$

where I and O denote the input and output feature maps during model training, respectively; Conv1 and Conv3 denote the 1×1 and 3×3 convolution operations, respectively; Bn denotes batch normalization; and Add denotes the summation operation. This network structure decouples the training and inference phases through a structural reparameterization technique, which allows more efficient use of computationally intensive hardware and achieves a better balance between accuracy and speed. In the model inference phase, all network layers are converted into a single Conv3 module by a fusion strategy, whose structure can be expressed as follows:

$$O' = \mathrm{Bn}(\mathrm{Conv3}(I')) \quad (7)$$

where I' and O' denote the input and output feature maps during model inference, respectively. The RepBlock module is formed by several RepVGGBlocks in series. The structure of a RepBlock consisting of three RepVGGBlock modules can be represented as follows:

$$O = \mathrm{Rvb}^{oc}_{oc}\big(\mathrm{Rvb}^{oc}_{oc}\big(\mathrm{Rvb}^{oc}_{ic}(I)\big)\big) \quad (8)$$

where I and O denote the input and output feature maps, respectively; Rvb with superscript oc and subscript oc indicates a RepVGGBlock whose number of input channels equals its number of output channels, and Rvb with superscript oc and subscript ic indicates a RepVGGBlock whose number of input channels differs from its number of output channels.
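For concreteness, the following is a minimal PyTorch sketch of a RepVGG-style block in its training form (6), together with a three-block RepBlock as in (8). It illustrates the idea only; it is not the authors' MM-Net or YOLO V6 code, and the algebraic branch folding of (7) is omitted.

```python
# Minimal sketch of a RepVGG-style block (training form of (6)); not the authors' code.
# At deployment the three branches can be folded into a single 3x3 convolution as in (7);
# that folding step is omitted here.
import torch
import torch.nn as nn

class RepVGGBlockSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(out_ch))               # Bn(Conv3(I))
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                   nn.BatchNorm2d(out_ch))               # Bn(Conv1(I))
        self.bn_id = nn.BatchNorm2d(out_ch) if in_ch == out_ch else None  # Bn(I)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv3(x) + self.conv1(x)
        if self.bn_id is not None:          # identity branch only when shapes match
            out = out + self.bn_id(x)
        return self.act(out)

def rep_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # RepBlock of (8): one channel-changing block followed by two channel-preserving ones.
    return nn.Sequential(RepVGGBlockSketch(in_ch, out_ch),
                         RepVGGBlockSketch(out_ch, out_ch),
                         RepVGGBlockSketch(out_ch, out_ch))

x = torch.randn(1, 3, 640, 640)
print(rep_block(3, 32)(x).shape)   # torch.Size([1, 32, 640, 640])
```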


C. Multilevel Feature-Level Data Fusion Module

The two-branch feature extraction network can sufficiently and efficiently extract useful feature information from the data of the two modalities, thus providing a large amount of useful information for pedestrian detection. However, fusing the features of the two modalities with simple operations, such as summation or channel concatenation, may introduce redundant information. The attention mechanism is widely used and provides an effective tool for better fusion of data features from different modalities [41], [42], [43]. Based on the channel attention mechanism, we design a multilevel feature-level data fusion module that can be easily deployed in three different layers of the MM-Net backbone. We insert this fusion module at the high, medium, and low levels of the two-branch feature extraction module, so that after extracting multiple levels of feature maps for the two modal data, the model can adaptively select features of interest for fusion rather than simply adding the features together. The structure of the single-level fusion module is shown in Fig. 4.

Fig. 4. Single-level feature-level data fusion module.

Existing research work [44] shows that it is feasible to project distance information from LiDAR point cloud data onto 2-D front-view (FV) images and extract useful features of external scenes from them. The difference in data structure between an RGB image and a point cloud depth projection image would normally be too large: compared with the RGB image, the point cloud image contains very sparse information. Therefore, according to the multimodal characteristics of the RGB images and the 2-D FV images of the LiDAR point cloud, multiscale convolution is used to process the data of the different modalities, which effectively diminishes the adverse effects arising from mapping the multimodal data into a shared feature space. Inspired by this, we introduce a multiscale convolution structure [21], [42] at each fusion level of MM-Net: a 3 × 3 convolution is used for RGB images and a larger 5 × 5 convolution is used for point cloud images, and the weights associated with the bimodal data are then calculated. Thus, the features of the two modalities can be adaptively fused.

First, the feature maps of the two input data modes are convolved separately, which is equivalent to the "Split" operation in the original SKNet [42], and the resulting feature maps are recorded as U_RGB and U_LiDAR. The two are then added element by element as follows:

$$U = U_{\mathrm{RGB}} + U_{\mathrm{LiDAR}} \quad (9)$$

where U denotes the feature map obtained by element-by-element summation.

Next, the 1-D vector S is obtained by averaging U over the spatial dimensions, channel by channel:

$$S_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U_c(i, j) \quad (10)$$

where H, W, and c represent the height, width, and channel index of U, respectively.

For a better selection of useful features by the network, S is passed through a simple fully connected layer to obtain a more compact feature vector z:

$$z = \delta\big(\mathrm{Bn}(W S)\big) \quad (11)$$

where δ denotes the rectified linear unit (ReLU) activation function, and Bn denotes batch normalization. In addition, $W \in \mathbb{R}^{d \times C}$, where C is the number of channels of the input feature map. The value of d is controlled according to the following equation:

$$d = \max(C/r, L), \quad L = 32 \quad (12)$$

where r is the scaling factor, L denotes the minimum value of d, and C is defined as in the previous equation. The value of L is taken with reference to [42].

Based on z, the weights of the two modal fusions are calculated using the following equations:

$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}} \quad (13)$$

$$b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}} \quad (14)$$

where a_c and b_c denote the respective weight parameters of the two branches for the cth channel. In addition, $A, B \in \mathbb{R}^{C \times d}$; A and B are weights obtained from model training.

Finally, based on the calculated fusion weights, the feature maps of the bimodal data are weighted to obtain the final fused features:

$$\mathrm{Output}_c = a_c\, U_{\mathrm{RGB},c} + b_c\, U_{\mathrm{LiDAR},c} \quad (15)$$

$$a_c + b_c = 1 \quad (16)$$

where U_RGB,c and U_LiDAR,c denote the feature maps of the cth channel of the two branches, and Output_c denotes the fused feature map of the cth channel.

The fusion module fully takes into account the characteristics of the two data modes and makes good use of the two-path structure of the SKNet [42], enabling the network to adaptively weight the fusion of the data features from the different modalities.
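As an illustration of (9)-(16), the following is a minimal PyTorch sketch of one single-level fusion block. The layer names and sizes are assumptions for illustration; this is not the authors' implementation.

```python
# Minimal sketch of the SKNet-style two-modal fusion of (9)-(16); illustration only.
import torch
import torch.nn as nn

class TwoModalFusionSketch(nn.Module):
    def __init__(self, channels: int, r: int = 16, L: int = 32):
        super().__init__()
        d = max(channels // r, L)                                      # (12)
        self.conv_rgb = nn.Conv2d(channels, channels, 3, padding=1)    # 3x3 for RGB features
        self.conv_pc = nn.Conv2d(channels, channels, 5, padding=2)     # 5x5 for point-cloud image features
        self.squeeze = nn.Sequential(nn.Linear(channels, d), nn.BatchNorm1d(d), nn.ReLU())  # (11)
        self.fc_a = nn.Linear(d, channels, bias=False)                 # rows of A in (13)
        self.fc_b = nn.Linear(d, channels, bias=False)                 # rows of B in (14)

    def forward(self, f_rgb, f_pc):
        u_rgb, u_pc = self.conv_rgb(f_rgb), self.conv_pc(f_pc)         # "Split"
        u = u_rgb + u_pc                                               # (9)
        s = u.mean(dim=(2, 3))                                         # (10): global average pooling
        z = self.squeeze(s)                                            # (11)
        logits = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)      # (N, 2, C)
        w = torch.softmax(logits, dim=1)                               # (13)-(14), so a_c + b_c = 1
        a = w[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = w[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u_rgb + b * u_pc                                    # (15)

fusion = TwoModalFusionSketch(channels=64)
out = fusion(torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40))
print(out.shape)   # torch.Size([2, 64, 40, 40])
```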

IV. EXPERIMENTS

This section describes the construction of the data acquisition platform and the experiments in detail.

A. Platform Construction

1) Hardware Selection: The hardware devices utilized in this article are shown in Fig. 1. The RGB camera is equipped with a 2.0 megapixel Sony star-grade CMOS sensor, the IMX307LQD-C. The default resolution is 1920 × 1080 and the frame rate is 30 fps. The output raw image data are in uncompressed UYVY format. A small 3-D LiDAR device, the Velodyne VLP-16 [33], is utilized to capture distance and reflected-intensity information. The multiple laser transmitter–receivers of the LiDAR measure the distance and reflected intensity from the sensor to the surrounding objects. Although its measurement range is up to one hundred meters, a LiDAR with few line beams cannot capture detailed information, such as object contours, very well. The collected data are sent to the Nvidia Jetson AGX Xavier [34], which is the data carrier, in the form of User Datagram Protocol (UDP) packets through the network port.

2) Hardware Connection and Synchronization: The hardware devices mentioned above are applied in the platform construction. The connections between the devices are shown in Fig. 1. Specifically, the IMX-307 camera is connected to the DS90UB953 serializer via a parallel data line and to the DS90UB954 deserializer via a serial line. It is then connected to the ADP-N3 adapter module using a parallel data cable and finally to Xavier's CSI data interface. The official interface box provides the data interface for acquiring the VLP-16 LiDAR data, and a LAN is utilized to transfer the LiDAR point cloud data to Xavier. Fig. 1 presents the workflow of the soft synchronization between the two sensors. The transmission of the PPS signal is triggered when Xavier first receives data from the IMX-307. The PPS signal is transmitted to the VLP-16 LiDAR, and when the VLP-16 receives a valid PPS signal, data sampling is triggered on its rising edge. The PPS signal is an analog output from Xavier's general-purpose input/output (GPIO) interface, specifically the GPIO 12 pin. The high level of the Xavier GPIO output signal is 3.3 V and the low level is 0 V; to make the PPS signal stable and valid, a booster is added to raise the high level to 5 V.
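The sketch below is a rough illustration, under stated assumptions, of how a 1 Hz PPS-like pulse could be driven from a Jetson board pin with the Jetson.GPIO library; the pulse width is an assumption and this is not the platform's actual firmware.

```python
# Illustrative sketch only (not the platform's firmware): emitting a 1 Hz
# PPS-like pulse on board pin 12 of a Jetson with the Jetson.GPIO library.
import time
import Jetson.GPIO as GPIO

PPS_PIN = 12                       # board-numbered GPIO pin used for the PPS output

GPIO.setmode(GPIO.BOARD)
GPIO.setup(PPS_PIN, GPIO.OUT, initial=GPIO.LOW)
try:
    while True:
        GPIO.output(PPS_PIN, GPIO.HIGH)   # rising edge triggers LiDAR sampling
        time.sleep(0.1)                   # 100 ms high pulse (assumed width)
        GPIO.output(PPS_PIN, GPIO.LOW)
        time.sleep(0.9)                   # complete the 1 s period
finally:
    GPIO.cleanup()
```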
3) Aberration Correction and Joint Calibration: A black-and-white checkerboard calibration board is utilized to correct the aberration of the camera with the help of the camera_calibration tool in the Robot Operating System toolset [17]. The calibration process requires moving and rotating the board up, down, left, and right. The calibration tool detects the corner points in the checkerboard grid of the calibration board during each operation, and the aberration parameters are then calculated from the corner points. The calibration results are shown in Table I.

TABLE I. RGB CAMERA DISTORTION PARAMETERS

The joint calibration of the two sensors is completed by using the calibration tool provided by Autoware, a well-known open-source software project in autonomous driving, to solve for the calibration parameters between the RGB camera and the LiDAR device. The results of the joint calibration are as follows:

$$R = \begin{bmatrix} 0.0816131 & -0.996592 & 0.0119078 \\ -0.0758579 & -0.0181243 & -0.996953 \\ 0.993773 & 0.0804612 & -0.0770786 \end{bmatrix}$$

$$T = \begin{bmatrix} 0.0634140 & 0.109179 & -0.434289 \end{bmatrix}$$

where R and T represent the rotation matrix and translation vector between the camera coordinate system and the LiDAR coordinate system, respectively.

B. Data Collection and Dataset

1) Data Collection: In order to meet the hard requirement of rapid acquisition of environmental information in the field of autonomous driving, we implement accelerated encoding and saving of images on the Xavier platform. Specifically, the data collection procedure can be divided into three steps: 1) reading raw data based on Linux V4L2; 2) hardware-accelerated JPG compression encoding on the GPU; and 3) saving the JPG data. The image compression algorithm first uses the discrete cosine transform to compress the original data and then uses 8-bit Huffman coding to convert the format. According to experiments on the Xavier platform, the average encoding time for converting a 1080p UYVY image to a JPG file with a downsampling ratio of 40 is 1-2 ms. Since this research studies the fusion of RGB images and LiDAR point clouds for pedestrian detection in an offline state, it involves the storage of the compressed JPG images. The compressed JPG images are saved and named with the timestamp of their acquisition moment; Unix timestamps accurate to the microsecond are used as file names. Upon careful analysis, calculation, and measurement, the average delay is estimated to be around 8 ms, and the sampling delay does not exceed 8% of the sampling period. These findings confirm that the synchronized acquisition of camera and LiDAR data falls within an acceptable range of time delay.

The LiDAR point cloud data are saved in PCAP format on the data carrier Xavier. In addition, the LiDAR point cloud data are mapped to a single-channel image with 1920 × 1080 resolution. The specific steps are as follows (a code sketch of this mapping is given at the end of Section IV-B):
1) Initialize an unsigned 16-bit single-channel picture of 1920 × 1080 with pixel values set to zero.
2) Using the results of the joint calibration, calculate the pixel coordinates corresponding to the point cloud data in a single frame.
3) Remove the part of the point cloud that is not within the image range or whose distance is greater than 20 m.

Fig. 5. Some samples in the dataset. (The faces have been blurred.)

Some samples in the dataset are shown in Fig. 5. In particular, we have obtained official permission for collecting the data, and faces in the images are blurred to protect personal privacy.

2) Dataset: To better reflect the actual working scenario of pedestrian detection, the pedestrian dataset is collected in the field using the above acquisition platform for 17 scenes at different times of the day and night, including 7 daytime scenes and 10 nighttime scenes; we name it the multimodal pedestrian detection dataset MM-PedDat. The data collection site is situated in a major city in China, specifically at a latitude of 30.52° N. The data were collected during two seasons, summer and fall, specifically in July and November. The collection encompassed a wide range of specific times, including midday (approximately 12:00-13:00), evening (approximately 19:00), and night (approximately 20:00-21:00). The size of the data collected varied slightly across scenarios, with the number of instances ranging from approximately 100 to 300. The diverse nature of the dataset stems from the inclusion of different scenarios, collection times, and weather conditions. The LabelImg tool is utilized to annotate the dataset with only one category, person (pedestrian).

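The following sketch illustrates the three-step mapping of Section IV-B1 using the reported R and T. The camera intrinsic matrix and the millimeter encoding of the pixel values are assumptions, since the article does not specify them; this is not the authors' exact code.

```python
# Illustrative sketch of the three-step point-cloud-to-image mapping of Section IV-B1.
import numpy as np

R = np.array([[0.0816131, -0.996592, 0.0119078],
              [-0.0758579, -0.0181243, -0.996953],
              [0.993773, 0.0804612, -0.0770786]])
T = np.array([0.0634140, 0.109179, -0.434289])
K = np.array([[1.0e3, 0.0, 960.0],     # hypothetical intrinsics (not given in the article)
              [0.0, 1.0e3, 540.0],
              [0.0, 0.0, 1.0]])

def render_depth_image(points, width=1920, height=1080, max_range=20.0):
    """points: (N, 3) LiDAR xyz in meters -> unsigned 16-bit single-channel image."""
    img = np.zeros((height, width), dtype=np.uint16)     # step 1: zero-initialized picture
    cam = points @ R.T + T                               # step 2: LiDAR -> camera frame (3)
    rng = np.linalg.norm(points, axis=1)
    keep = (cam[:, 2] > 0) & (rng <= max_range)          # step 3: drop far or behind-camera points
    cam, rng = cam[keep], rng[keep]
    uv = cam @ K.T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # encode range in millimeters so it fits the 16-bit image (assumed encoding)
    img[v[inside], u[inside]] = (rng[inside] * 1000).astype(np.uint16)
    return img

depth = render_depth_image(np.random.uniform(-10, 10, size=(5000, 3)))
print(depth.shape, depth.dtype)    # (1080, 1920) uint16
```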

TABLE II. COMPARISON OF DETECTION ACCURACY OF DIFFERENT MODELS

C. Experimental Settings

The dataset is divided by acquisition scenario to conduct the experiments. The training set includes data from five daytime scenes and six nighttime scenes, with a total of 1977 image pairs. The test set includes data from two daytime scenes and four nighttime scenes, with a total of 1054 image pairs. The experiments are conducted on a workstation equipped with an Nvidia RTX 3090 GPU and implemented in the PyTorch framework (version 1.10.1). When training our own model, stochastic gradient descent is chosen as the optimizer, and the batch size is set to eight. All experiments are repeated three times, and the results are averaged. The model is evaluated according to three different COCO evaluation metrics, AP, AP (IoU = 0.5), and AP (IoU = 0.75), referred to as AP, AP50, and AP75, respectively.

D. Analysis of Experimental Results

Fig. 6. Loss curve during model training. (a) Class_loss. (b) Iou_loss.

Fig. 6 shows the decline of the loss during model training. Two models are plotted in the figure. The first is the baseline model: the original two-branch network in which the features extracted from the two branches are directly summed element by element. The second is MM-Net, which uses the multilevel feature-level data fusion module designed in this research. The loss values of both models decrease steadily with the number of iterations, and no dramatic fluctuations occur. It is noteworthy that the loss values of MM-Net drop faster than those of the baseline model after 15 rounds of training, which indicates that adding the multilevel feature-level data fusion module also accelerates the convergence of model training.

Fig. 7. Comparison of detection accuracy of different models. (a) AP50. (b) AP75. (c) AP. (d) P–R curves.

To test the detection performance of MM-Net and the effectiveness of adding LiDAR information, a variety of detection models are selected for comparison. The models utilized for comparison include six single-mode models and four multimode models [8], [21], [45], [46]. Table II shows the detection accuracy of the different models. The AP50 metric is the most commonly used in the field of target detection [47], [48], [49], [50]; the more stringent AP75 and AP metrics are also utilized in validation. Fig. 7 visualizes the detection accuracy of the different models. Two models, TPH-YOLOv5 [45] and Cascade RCNN [46], utilize only single-mode RGB images as input. The other models are available in two versions, single-mode (using only RGB images) and multimodal (using both RGB images and LiDAR information).
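The AP, AP50, and AP75 values reported here follow the standard COCO protocol. As a minimal sketch, they can be reproduced with pycocotools given COCO-format ground-truth and detection files; the file names below are hypothetical, and this is not the authors' evaluation script.

```python
# Minimal sketch (assumed tooling) of computing COCO-style AP, AP50, and AP75
# with pycocotools from COCO-format ground-truth and detection JSON files.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("mm_peddat_test_gt.json")             # hypothetical ground-truth file
coco_dt = coco_gt.loadRes("mmnet_detections.json")   # hypothetical detections file

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

ap, ap50, ap75 = evaluator.stats[0], evaluator.stats[1], evaluator.stats[2]
print(f"AP={ap:.3f}  AP50={ap50:.3f}  AP75={ap75:.3f}")
```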


Fig. 8. Detection results for several models using multimodal data. The first row shows the detection results of MM-Net, and the second row shows the detection results of the improved model based on Faster RCNN. The third, fourth, and fifth rows show the detection results of the three versions of MS-YOLO (S, M, and L), respectively. The faces have been blurred. (a) Day. (b) Night.

Our dataset is divided into two scene groups, daytime and nighttime. The detection accuracy of each model is higher in the daytime scenes than in the nighttime scenes. The nighttime scenes are severely disturbed by lighting, which poses a great challenge to the detection models. As can be seen from the table, the TPH-YOLOv5 and Cascade RCNN models perform poorly on the overall dataset, especially on the nighttime portion, mainly because these two models only use RGB images as input. As shown in Fig. 8, RGB images are usually very blurred in night scenes, which leads to the failure of the detection model. In addition, several other models show different degrees of improvement in detection accuracy after adding LiDAR information, especially in the nighttime scenes. This indicates that the LiDAR information can supplement the RGB images with effective features: the RGB images are severely affected by illumination in the night scenes, while LiDAR, as an active detection method, is not greatly affected. The MM-Net network using multimodal data achieves the highest AP50, AP75, and AP on the overall dataset (including both day and night scenes) and satisfactory detection results in both the day and night subscenes. MM-Net achieves an AP50 of 72.9% on the full dataset, which is 6.5%, 9.9%, 5.2%, and 3.1% higher than No.4, No.6, No.8, and No.10, respectively. AP75 is more stringent than AP50 and better reflects the target localization ability of the model. MM-Net also performs well on the AP75 metric, on which it is 17.4%, 3.2%, 4.2%, and 7.6% higher than No.4, No.6, No.8, and No.10, respectively. The AP metric integrates the detection performance of the model under different IoU thresholds and is the most rigorous accuracy evaluation metric. MM-Net also performs well on this metric, scoring 9.3%, 5.1%, 3.7%, and 4.6% higher than the other four models, respectively. In the daytime scenario, MM-Net using multimodal data outperforms the other four models on all three accuracy evaluation metrics, although it is lower than the MM-Net variant that uses only single-mode data on the AP75 and AP metrics. However, MM-Net using only single-mode data cannot achieve satisfactory detection results in the nighttime scenes, whereas MM-Net with multimodal data also achieves top-ranked detection performance on the three evaluation metrics in the night scenes. Combining the above analysis, MM-Net using multimodal data performs the best. This can mainly be attributed to the reasonable model design of MM-Net. First, the designed two-branch feature extraction module fully extracts the features of both modal data, and the two branches do not cause information interference with each other; it has been shown that fusing data from different modalities at too early a stage may harm model performance [51]. Second, MM-Net adaptively fuses the features from the two modal data at three different levels, high, medium, and low, taking into account both the data modality and the fusion level.

In addition, Table II also lists the running speed and the number of parameters of each detection model. It is clear from the table that MM-Net achieves high detection accuracy while maintaining a relatively fast detection speed; MM-Net is the fastest of all models that utilize multimodal data. This can mainly be attributed to the advanced feature extraction network: MM-Net's two-branch feature extraction module uses structural reparameterization to decouple the training and inference phases, achieving a better balance between accuracy and speed. We also tested the inference speed of MM-Net on the Nvidia Jetson AGX Xavier and Orin platforms. Thanks to the advanced architecture and performance of the Nvidia Jetson series, the inference speed of the detection model is approximately 0.0750 s per frame on Xavier, equivalent to 13.3 frames per second (fps), and approximately 0.0374 s per frame on Orin, equivalent to 26.7 fps. The test results indicate that the model's inference speed on Xavier and Orin exceeds the 10 fps sampling rate of the VLP-16 LiDAR, thus meeting the real-time requirements. The precision-recall (P–R) curves of the different detection models are shown in Fig. 7(d). In the P–R curves, the more convex the curve is toward the upper right corner, the better the detection effect of the corresponding model. The P–R curve of MM-Net tends more toward the upper right corner, and the area it encloses with the coordinate axes is larger than that of the other detection models.


Fig. 8 shows the detection results of some models utilizing multimodal data. Since the dataset is collected on a campus, its pedestrian distribution is not very dense, but it is rich in scenes and contains data collected during different periods, which poses a great challenge for detection models. In the daytime scenes, the RGB images provide rich color and texture information about the pedestrians, and the LiDAR point cloud converted images play a secondary role, providing useful distance information for the model. In the night scenes, RGB images are heavily influenced by the lighting conditions, while the LiDAR point cloud converted image can provide the approximate contour information of the pedestrian, thus aiding detection. As can be seen in Table II, the four models using multimodal data show more significant accuracy gains in the nighttime scenes, especially MM-Net. This further illustrates the need to add LiDAR information and the great potential of MM-Net when fusing data from both modalities. In Fig. 8, MM-Net performs the best, with the fewest false detections and missed detections. Since the MM-Net network fuses the feature information of RGB images and LiDAR data at three different levels, high, medium, and low, it is better able to detect pedestrian targets at different scales.

E. Study of Fusion Strategy

TABLE III. RESULTS OF THE FUSION STRATEGY STUDY

To verify the effectiveness of the designed fusion module and to explore more deeply the effect of the fusion phase on the detection results, a large number of comparative experiments are conducted. The original two-branch network is chosen as the baseline (i.e., the features extracted from the two branches are subjected directly to an element-by-element summation). The single-level data fusion module is embedded in each of the three stages (high, medium, and low) for the experiments, and the experimental results are shown in Table III.

The data fusion strategy has been an active research topic: how to fuse and where to fuse have been widely studied and discussed [25]. First, observing Tables II and III together, it can be seen that the different variants of the MM-Net model all perform relatively well, which illustrates the superiority of the overall structure of the proposed MM-Net. Second, observing Table III separately, adding the feature-level fusion module at different locations produces differences in the detection performance. It is obvious from the table that the model outperforms the baseline model on the AP50 metric after inserting the feature-level data fusion module at any of the positions, but fluctuates slightly on the AP75 and AP metrics. The model outperforms the baseline model on all three metrics only when the fusion module is inserted at the C4 position. It has been shown that fusion at intermediate levels tends to yield the best results, and that fusing too early or too late may have a negative impact on model performance [51]. This may be because the intermediate-level features lie between the bottom-level and top-level features and thus balance the different kinds of information contained in both.

F. Results on the KITTI Dataset

TABLE IV. RESULTS ON THE KITTI DATASET

KITTI, a well-known open-source dataset in the field of autonomous driving, is utilized to verify the performance of MM-Net [30]. It was jointly collected and produced by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. Its collection scenarios are very rich, including urban, rural, and highway scenes, and a large number of pedestrian targets are included.

For the pedestrian detection task in this article, the original KITTI dataset is processed to conduct the experiments. The processing mainly includes the following:
1) Filter the RGB images containing pedestrian targets from the training and test sets and label the pedestrian targets.
2) Similar to the processing of our dataset, convert the point cloud data into single-channel images using the calibration parameters between the RGB camera and the LiDAR.

After the above processing, 1046 pairs of training samples and 670 pairs of test samples are obtained. Each pair consists of an RGB image and a matched single-channel point cloud image. Note that the KITTI data we use are all collected during the daytime, which is quite different from our dataset. However, this dataset contains many pedestrian targets that are affected by factors such as tree shading, which poses a challenge to the detection model.

Extra comparison experiments are conducted on this dataset. The experimental results are shown in Table IV.


Fig. 10. Detection results for several models using multimodal data. The first row shows the detection results of MM-Net, and the second row
shows the detection results of the improved model based on Faster RCNN. The third, fourth, and fifth rows show the detection results of the three
versions of MS-YOLO (S, M, and L), respectively.

TABLE V
ABLATION EXPERIMENTS ON THE KITTI DATASET

Fig. 9(a) shows the performance of the different detection models on the three accuracy metrics more visually, in the form of a bar chart. Both the improved Faster RCNN-based model and MS-YOLO (including the S, M, and L versions) show different degrees of improvement in detection accuracy after using multimodal data, which further demonstrates the effectiveness of adding LiDAR information. LiDAR detects pedestrian targets by transmitting and receiving laser signals, which are virtually unaffected by lighting conditions, so the point cloud data provide valuable auxiliary information and further improve the detection performance of these single-mode models. In addition, compared with the other detection models, MM-Net using multimodal data achieves the best accuracy in the AP50, AP75, and AP metrics. MM-Net achieves 91.6% in AP50 on this dataset, which is 10.7%, 8.3%, 8.5%, and 10.7% higher than No. 4, No. 6, No. 8, and No. 10, respectively. MM-Net also achieves 52.7% and 51.0% on AP75 and AP, which are 1.3% and 2.9% higher than the second-place model, respectively. This further demonstrates the applicability of MM-Net. The P-R curves of the different models are shown in Fig. 9(b), and the P-R curve of MM-Net is the best. Fig. 10 shows the detection results of several detection models utilizing multimodal data; some pedestrian targets affected by lighting are better detected by the models using multimodal data. Above all, by taking advantage of multiscale feature fusion, MM-Net achieves accurate extraction of pedestrian targets in different scenes and is more sensitive to smaller pedestrian targets. In addition, ablation experiments are conducted to demonstrate the effectiveness of the designed fusion module. The results of these experiments are presented in Table V, from which it can be observed that, overall, the accuracy of the seven models with the fusion module is higher than that of the baseline model.

V. CONCLUSION

In this article, we describe our research work on the core problem of fusing RGB cameras and sparse LiDAR for pedestrian detection. The work is carried out in two aspects: hardware platform construction and software algorithm research. Specifically, for the hardware platform, we design and build a data acquisition platform using an IMX-307 RGB camera, a VLP-16 sparse LiDAR, and a Xavier development board. We deploy the hardware with reasonable connections, correct and calibrate the two different sensors, and design a soft-synchronous data acquisition scheme. In terms of software algorithms, first, we use the data acquisition platform to collect pedestrian data from multiple scenes and form a multimodal pedestrian detection dataset after preprocessing and manual labeling. Second, we design the two-branch MM-Net. We build a dual-branch feature extraction module for efficiently extracting features of both data modes simultaneously, and, to fully consider the differences between the features of the different modalities, we construct a multilevel feature-level data fusion module and insert it into three different positions of the model: high, middle, and low. We carry out comparison experiments, in which MM-Net shows the best detection results among the compared models. We also validate the effectiveness of the designed multilevel feature-level data fusion module and further explore reasonable fusion positions. Finally, to further test the performance of MM-Net, we conduct comparison experiments using the well-known KITTI dataset, and the experimental results illustrate the good performance of MM-Net. Further improving the detection accuracy of the model and exploring more reasonable fusion strategies are the focus of future research work.
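Although the complete MM-Net implementation is beyond the scope of this article, the following PyTorch-style sketch illustrates the general idea of a feature-level fusion block that can be inserted at a chosen backbone stage (high, middle, or low, e.g., C3/C4/C5). The concatenation-plus-1x1-convolution design, the class name FeatureLevelFusion, and the channel and resolution numbers are simplifying assumptions made for illustration; they are not the exact module used in MM-Net.

import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    # Fuses same-resolution RGB and LiDAR feature maps at one backbone stage.
    def __init__(self, rgb_channels, lidar_channels, out_channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(rgb_channels + lidar_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat, lidar_feat):
        # Concatenate along the channel dimension, then mix with a 1x1 convolution.
        return self.reduce(torch.cat([rgb_feat, lidar_feat], dim=1))

# Example: hypothetical C4-level features (stride 16) of a 640x640 input.
rgb_c4 = torch.randn(1, 256, 40, 40)
lidar_c4 = torch.randn(1, 128, 40, 40)
fusion = FeatureLevelFusion(256, 128, 256)
print(fusion(rgb_c4, lidar_c4).shape)  # torch.Size([1, 256, 40, 40])

Changing the stage at which such a block is attached corresponds to the high/middle/low fusion positions explored in the ablation study.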


REFERENCES

[1] D. Parekh et al., "A review on autonomous vehicles: Progress, methods and challenges," Electronics, vol. 11, no. 14, 2022, Art. no. 2162.
[2] X. Li et al., "A unified framework for concurrent pedestrian and cyclist detection," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 2, pp. 269–281, Feb. 2016.
[3] D. Ridel, E. Rehder, M. Lauer, C. Stiller, and D. Wolf, "A literature review on the prediction of pedestrian behavior in urban scenarios," in Proc. 21st Int. Conf. Intell. Transp. Syst., 2018, pp. 3105–3112.
[4] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," IEEE Trans. Multimedia, vol. 20, no. 4, pp. 985–996, Apr. 2018.
[5] W.-Y. Hsu and W.-Y. Lin, "Adaptive fusion of multi-scale YOLO for pedestrian detection," IEEE Access, vol. 9, pp. 110063–110073, 2021.
[6] S. Zhang, J. Yang, and B. Schiele, "Occluded pedestrian detection through guided attention in CNNs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6995–7003.
[7] G. Brazil, X. Yin, and X. Liu, "Illuminating pedestrians via simultaneous detection & segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4960–4969.
[8] Y. Song, Z. Xie, X. Wang, and Y. Zou, "MS-YOLO: Object detection based on YOLOv5 optimized fusion millimeter-wave radar and machine vision," IEEE Sensors J., vol. 22, no. 15, pp. 15435–15447, Aug. 2022.
[9] S. Iftikhar, Z. Zhang, M. Asim, A. Muthanna, A. Koucheryavy, and A. A. Abd El-Latif, "Deep learning-based pedestrian detection in autonomous vehicles: Substantial issues and challenges," Electronics, vol. 11, no. 21, 2022, Art. no. 3551.
[10] L. Wang, J. Yan, L. Mu, and L. Huang, "Knowledge discovery from remote sensing images: A review," Wiley Interdiscipl. Rev.: Data Mining Knowl. Discov., vol. 10, no. 5, 2020, Art. no. e1371.
[11] M. M. Islam, A. R. Newaz, and A. Karimoddini, "Pedestrian detection for autonomous cars: Inference fusion of deep neural networks," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 23358–23368, Dec. 2022.
[12] W. Han et al., "Methods for small, weak object detection in optical high-resolution remote sensing images: A survey of advances and challenges," IEEE Geosci. Remote Sens. Mag., vol. 9, no. 4, pp. 8–34, Dec. 2021.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[14] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 1–9.
[15] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
[16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 5105–5114.
[17] D. J. Yeong, G. Velasco-Hernandez, J. Barry, and J. Walsh, "Sensor and sensor fusion technology in autonomous vehicles: A review," Sensors, vol. 21, no. 6, 2021, Art. no. 2140.
[18] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 918–927.
[19] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1907–1915.
[20] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3D proposal generation and object detection from view aggregation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 1–8.
[21] X. Gao, G. Zhang, and Y. Xiong, "Multi-scale multi-modal fusion for object detection in autonomous driving based on selective kernel," Measurement, vol. 194, 2022, Art. no. 111001.
[22] J. Tang, L. Jin, Z. Li, and S. Gao, "RGB-D object recognition via incorporating latent data structure and prior knowledge," IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1899–1908, Nov. 2015.
[23] F. Pala, R. Satta, G. Fumera, and F. Roli, "Multimodal person reidentification using RGB-D cameras," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 4, pp. 788–799, Apr. 2016.
[24] W. Zhou, Y. Pan, J. Lei, L. Ye, and L. Yu, "DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24540–24549, Dec. 2022.
[25] D. Guan, Y. Cao, J. Yang, Y. Cao, and M. Y. Yang, "Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection," Inf. Fusion, vol. 50, pp. 148–157, 2019.
[26] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 770–779.
[27] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4490–4499.
[28] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, 2018, Art. no. 3337.
[29] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12697–12705.
[30] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3354–3361.
[31] J. Geyer et al., "A2D2: Audi autonomous driving dataset," 2020, arXiv:2004.06320.
[32] "xpilot." Accessed: Jan. 5, 2024. [Online]. Available: https://www.heyxpeng.com/intelligent/xpilot
[33] Velodyne, "VLP-16." Accessed: Jan. 5, 2024. [Online]. Available: https://velodynelidar.com/products/puck/
[34] "NVIDIA Jetson AGX Xavier." Accessed: Jan. 5, 2024. [Online]. Available: https://www.nvidia.cn/autonomous-machines/embedded-systems/jetson-agx-xavier/
[35] "NIO." Accessed: Jan. 5, 2024. [Online]. Available: https://www.nio.com/
[36] C. Ricolfe-Viala and A.-J. Sanchez-Salmeron, "Lens distortion models evaluation," Appl. Opt., vol. 49, no. 30, pp. 5914–5928, 2010.
[37] S. Xie, D. Yang, K. Jiang, and Y. Zhong, "Pixels and 3-D points alignment method for the fusion of camera and LiDAR data," IEEE Trans. Instrum. Meas., vol. 68, no. 10, pp. 3661–3676, Oct. 2019.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–2125.
[39] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8759–8768.
[40] C. Li et al., "YOLOv6: A single-stage object detection framework for industrial applications," 2022, arXiv:2209.02976.
[41] Z. Li, Y. Sun, L. Zhang, and J. Tang, "CTNet: Context-based tandem network for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9904–9917, Dec. 2022.
[42] X. Li, W. Wang, X. Hu, and J. Yang, "Selective kernel networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 510–519.
[43] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[44] A. Pfeuffer and K. Dietmayer, "Optimal sensor data fusion architecture for object detection in adverse weather conditions," in Proc. 21st Int. Conf. Inf. Fusion, 2018, pp. 1–8.
[45] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2778–2788.
[46] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6154–6162.
[47] T. Ye, W. Qin, Z. Zhao, X. Gao, X. Deng, and Y. Ouyang, "Real-time object detection network in UAV-vision based on CNN and transformer," IEEE Trans. Instrum. Meas., vol. 72, pp. 1–13, 2023.
[48] W. Han et al., "A context-scale-aware detector and a new benchmark for remote sensing small weak object detection in unmanned aerial vehicle images," Int. J. Appl. Earth Observation Geoinformation, vol. 112, 2022, Art. no. 102966.
[49] C. Zheng et al., "Multiscale fusion network for rural newly constructed building detection in unmanned aerial vehicle imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 9160–9173, 2022.
[50] W. Han et al., "A survey on methods of small weak object detection in optical high-resolution remote sensing images," IEEE Geosci. Remote Sens. Mag., vol. 9, no. 4, pp. 8–34, Dec. 2021.
[51] Z. Xinyu, Z. Zhen-Hong, L. Zhi-Wei, L. Hua-Ping, and L. Jun, "Deep multi-modal fusion in object detection for autonomous driving," CAAI Trans. Intell. Syst., vol. 15, no. 4, pp. 758–771, 2020.

Haoran Xu is currently working toward the Ph.D. degree in geoscience information engineering with China University of Geosciences, Wuhan, China.
His research interests include Internet of Things, unmanned systems, and data analysis.


Shuo Huang received the B.E. degree in electronic engineering from Fudan University, Shanghai, China, in 2007.
He is currently an Engineering Director with China FAW (Nanjing) Technology Development Company Ltd., Nanjing, China. His research interests include autonomous driving and edge computing.

Yixin Yang is pursuing the M.S. degree in computer technology with China University of Geosciences (Wuhan), Wuhan, China. His research interests include UAV remote sensing, Big Data, and unmanned systems.

Xiaodao Chen received the B.E. degree in telecommunication from the Wuhan University of Technology, Wuhan, China, in 2006, and the M.S. degree in electrical engineering and the Ph.D. degree in computer engineering from Michigan Technological University, Houghton, MI, USA, in 2008 and 2012, respectively.
He is an Associate Professor with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include high performance computing and CAD design for cyber-physical systems.

Shiyan Hu (Senior Member, IEEE) received the Ph.D. degree in computer engineering from Texas A&M University, College Station, TX, USA, in 2008.
He is a Professor and the Chair in Cyber-Physical System Security with the University of Southampton, U.K. He has published more than 180 refereed papers on cyber-physical systems.
Prof. Hu is a Fellow of IET, a Fellow of the British Computer Society, and a Member of the European Academy of Sciences and Arts.

