Scholarworks at Utrgv Scholarworks at Utrgv
Scholarworks at Utrgv Scholarworks at Utrgv
ScholarWorks @ UTRGV
Computer Science Faculty Publications and College of Engineering and Computer Science
Presentations
1-2020
Pengpeng Sun
Zhigang Xu
Haigen Min
Hongkai Yu
The University of Texas Rio Grande Valley
Recommended Citation
Zhao, X., Sun, P., Xu, Z., Min, H., & Yu, H. (2020). Fusion of 3D LIDAR and Camera Data for Object Detection
in Autonomous Vehicle Applications. Ieee Sensors Journal, 20(9), 4901–4913. https://doi.org/10.1109/
JSEN.2020.2966034
This Article is brought to you for free and open access by the College of Engineering and Computer Science at
ScholarWorks @ UTRGV. It has been accepted for inclusion in Computer Science Faculty Publications and
Presentations by an authorized administrator of ScholarWorks @ UTRGV. For more information, please contact
[email protected], [email protected].
Page 1 of 14
Submitted to IEEE Sensors Journal 1
1
2 Fusion of 3D LIDAR and Camera Data for Object
3
4
5
Detection in Autonomous Vehicle Applications
6 Xiangmo Zhao, Pengpeng Sun, Zhigang Xu, Haigen Min, Hongkai Yu
7
8
9 Abstract—It’s critical for an autonomous vehicle to acquire of view, with precise depth information, and long-range and
10 accurate and real-time information of the objects in its vicinity, night-vision capabilities in target recognition [2-4]. In the
11 which will fully guarantee the safety of the passengers and vehicle object detection task, 3D LIDAR has certain advantages over
12 in various environment. 3D LIDAR can directly obtain the cameras in acquiring the pose and shape of the detected objects,
13 position and geometrical structure of the object within its
since laser scans contain spatial coordinates of the point clouds
14 detection range, while vision camera is very suitable for object
by nature [5]. However, the distribution of 3D LIDAR point
15 recognition. Accordingly, this paper presents a novel object
detection and identification method fusing the complementary clouds become more and more sparse as the distance from the
16 scanning center increases, which brings difficulties for a 3D
information of two kind of sensors. We first utilize the 3D LIDAR
17 data to generate accurate object-region proposals effectively. LIDAR to detect specific objects in the classification step.
18 Then, these candidates are mapped into the image space where the Cameras can provide high resolution images for precise
19 regions of interest (ROI) of the proposals are selected and input to classification, and the classification methods have been widely
20 a convolutional neural network (CNN) for further object used in recent years with extensive research of deep learning in
21 recognition. In order to identify all sizes of objects precisely, we the field of image recognition. Such methods usually first use
22 combine the features of the last three layers of the CNN to extract
an object-proposal generation method to generate box
23 multi-scale features of the ROIs. The evaluation results on the
proposals, such as the sliding-window [6], edge box [7], select
24 KITTI dataset demonstrate that : (1) Unlike sliding windows that
search [8], or multi-scale combinatorial grouping (MCG) [9],
25 produce thousands of candidate object-region proposals, 3D
LIDAR provides an average of 86 real candidates per frame and and then use the CNN pipeline [10, 11] to perform
26 object-region based recognition. A common disadvantage of
the minimal recall rate is higher than 95%, which greatly lowers
27 the proposals extraction time; (2) The average processing time for those approaches is the high computational costs associated
28 each frame of the proposed method is only 66.79ms, which meets with generating substantial candidate region proposals. Besides,
29 the real-time demand of autonomous vehicles; (3) The average camera suffers from varying illumination and lacking
30 identification accuracies of our method for car and pedestrian on information of the 3D location, orientation and geometry of the
31 the moderate level are 89.04% and 78.18% respectively, which
object, resulting in imprecise object-region proposals.
32 outperform most previous methods.
In order to obtain highly accurate object location and
33 Index Terms—Autonomous Vehicle, Object detection, Object
Identification, 3D LIDAR, CNN, Sensor Fusion classification in driving environments, one possible approach is
34
to take full advantage of the complementary information
35
I. INTRODUCTION between 3D LIDAR and cameras. For this purpose, we present
36
a multi-object detection methodology, applying the 3D
37
38 A UTONOMOUS vehicles can fundamentally improve the
safety and comfort of the driving population while
LIDAR-based object-region proposal generator on the point
clouds and combining a state-of-the-art CNN classifier on the
39 reducing the impact of automobiles on the environment [1]. To
camera data. The main contributions of this work are three-fold:
40 develop such a vehicle, the perceptual system is one of the
(1) we present a real-time multi-object detecting system, which
41 indispensable components allowing the vehicle to understand
performs long-range and high-precision object detection, and (2)
42 the driving environment, including the position, orientation and
propose a fast and accurate method for generating object-region
43 classification of the surrounding obstructions. Therefore,
proposals based on the 3D LIDAR data, while maintaining a
44 sensors such as LIDAR, cameras, radar, sonar have been
higher recall rate, and (3) implement a multi-scale CNN model
45 widely used in the environment sensing system of autonomous
to detect the tiny objects effectively. We are concerned on the
46 vehicles.
representative objects on the road, such as vehicles, pedestrians
47 3D LIDAR is one of the most prevalent sensors used in the
and bicycles, and the approach can also be extended to some
48 autonomous vehicle perceptual systems, and it has a wide range
other traffic elements around the moving autonomous vehicles.
49
To quickly and accurately generate the object-region
50
Manuscript submitted for review August 23, 2019. This work was supported proposals from 3D LIDAR point clouds, we first encode the
51 in part by the 111 Project on Information of Vehicle-Infrastructure Sensing and unordered original sparse point clouds into a multi-channel
52 ITS under Grant No. B14043, and in part by the Fundamental Research Funds
matrix according to the time stamp and vertical orientation of
53 for the Central Universities under Grant No. 300102248715 (Corresponding
author: P. P. Sun, e-mail: [email protected]). each laser beam, and extract ground points by analyzing the
54 P.P. Sun and H.G. Min,are with the Traffic Information Engineering & range difference between two adjacent beams. The non-ground
55 Control Department, Chang’an University, Xian 710064, China.
points were clustered using an adaptive threshold-based cluster
56 X.M Zhao, Z.G. Xu are with the Joint Laboratory for Internet of Vehicles,
Ministry of Education-China Mobile Communications Corporation, Chang’an algorithm and the bounding box of the clustering will be
57 University, Xi’an 710064, China. calculated. Thus, we can reduce the number of pesudo-targets
58
59
60
Page 2 of 14
Submitted to IEEE Sensors Journal 2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17 Fig. 1: Overview of the proposed framework for the multi-object detection algorithm.
18 based on the predefined position and the size of objects. Then, voxel grids built on the LIDAR point cloud for vehicle
19 on the basis of the corresponding spatial coordinates between detection. Zhou et al. [16] presented an efficient deep network
20 the 3D LIDAR and the camera, the detected bounding boxes architecture called VoxelNet for point cloud, which extracts
21 were projected back into the image space to create the 2D features directly on sparse points of the 3D voxel grid and
22 object-region proposals in the image. In this way, we can achieves remarkable results in the KITTI benchmark. One of
23 narrow the search range in the image and speed up the detection the advantages of object region proposals based on a voxel
24 algorithm. Those candidate regions were then processed by a representation is that the computational cost is only
25 CNN classifier for multi-object recognition. The architecture of proportional to the total number of voxels contained in the grid
26 the proposed multi-object detection algorithm can be seen in rather than the number of points. However, the precision of the
27 Fig. 1. object detection results is slightly reduced due to the fact that
28 The remainder of this article is arranged as follows: The the grid size in the map is much lower than the distance
29 section II surveys the previous related works. The section III accuracy of the 3D LIDAR data. In addition to operating
30 depicts the proposed multi-object detection method in detail. directly on the voxel grid map, some of the previous algorithms
31 The section IV gives the related metrics to evaluate the first projected the 3D point clouds onto 2D surfaces as the depth
32 performance of the proposed method, and discusses the map and then used some image-like methods to generate region
33 experimental results on the KITTI benchmark dataset [12]. proposals [5, 17]. Li et al. [5] detects a car by projecting the 3D
34 Conclusions are made in section V at last. point clouds into the front view to obtain the depth map, and
35 then applys a fully convolutional network to the map to predict
36 II. RELATED WORK the 3D box of vehicles, and obtained a comparable performance
37 on the KITTI object benchmark dataset [12]. Minemura et al.
This section gives a concise review of previous works related
38 [17] proposed an improved method called LMNet, which
to 3D-LIDAR-based object detection, camera-based object
39 represents the point cloud as five frontal-view maps (i.e.,
detection and multiple sensor fusion for object detection.
40 Reflection, Range, Distance, Side, Height) and is used to input
41 A. 3D LIDAR Object Detection Approaches LMNet for multiclass detection. However, projecting the 3D
42 There exists many works on autonomous vehicles covering point clouds to a 2D view will lose a lot of important
43 object detection using 3D LIDAR. Usually, the object detection information, and this information could be critical for robust
44 task based on LIDAR can be divided into two steps: extraction detection of objects, especially for detecting objects in crowded
45 of object region proposals and classification of the objects. scenes. Another method widely used is to divide points into
46 For the sake of extracting the object region proposals, it is clusters with characteristics. For example, when dealing with
47 usually to encode 3D point clouds which are captured from the the 3D point clouds captured by an autonomous vehicle, simply
48 3D LIDAR using a voxel grid [13-16]. Wang et al. [13] removing the ground points and aggregating the remaining
49 encoded the point clouds into 3D feature grid. Then, the 3D points can produce a reasonable segmentation [18]. Finer
50 detection window slides in the feature grid, and the score of the segmentation can be achieved by forming graphics on the point
51 object is directly voted to the discrete position of the sliding clouds [19, 20]. Recently, PointNet [21], PointNet++ [22] were
52 window. An improved approach based on voting strategy can proposed for processing point sets, and have shown to work
53 be found in [14]. This work performs object detection in 3D reliably well in indoor environments. Such approaches do not
54 point clouds with a convolutional neural network constructed need to carry on any kind of mapping transformation of the
55 from sparse convolutional layers based on the voting scheme point clouds, and operates at the point clouds level. Thus, those
56 and it obtains a faster speed. Li et al. [15] extends fully methods are more versatile and can use various 3D LIDAR
57 convolutional network (FCN) to 3D and designed a 3D-FCN on sensors.
58
59
60
Page 3 of 14
Submitted to IEEE Sensors Journal 3
1
To classify the object-region proposals, some early studies algorithm. Faster R-CNN [32] is the first framework to unify
2
mainly concentrated upon the hand-crafted features which the generation of object candidate region, feature extraction and
3
come from the spatial relation among the LIDAR points or the object classification into a convolutional neural network, which
4
intensity characteristics of them, e.g., spin image [23], fast improves the efficiency of the whole object detection system.
5
point feature histogram (FPFH) [24], and traditional However, this method is does not achieve good performance on
6
classification techniques, e.g., SVM [25], MLP and Ada Boost small object detection. To address this issue, Li et al. [33]
7
[26, 27]. In reference [25], a classifier based on SVM is developed a scale aware Fast R-CNN pipeline, which embeds
8
proposed, which divides the clusters into ground, vegetation, multiple built-in sub-networks and can detecte pedestrians from
9
construction and vehicles. A total of 13 features are extracted as a scale that does not intersect.
10
the input to the SVM classifier. However, these traditional In the cases of the end-to-end method, object detection is
11
classifiers have weak generalization ability and low recognition modeled as a regression problem to attempt to discard the links
12
precision, which can’t meet the requirement of the recognition that generated the object-region proposals [34-36]. In YOLO
13
accuracy of the perception system of autonomous vehicle. The (You Only Look Once) [34, 35], the image is divided into a
14
recently developed deep learning object detection algorithms, fixed size of grid, for each grid the object position and the
15
such as VeloDeep [14], VoxelNet [16] are more general and confidence degree will be predicted. The network output layer
16
robust than the above methods because they can identify more is mapped to the above results of the grids, thus achieving
17
object categories [28]. However, with the increase of the end-to-end training. The network of Fast YOLO [35] is further
18
amount of point clouds data involved in computing 3D network simplified, speed up the detection algorithm to 155 frames per
19
model, the computational power and memory requirements for second (fps). An improved method for the tiny object detection,
20
the computation of the 3D network model are increased in namely, SSD (single shot detector) [36] evaluates the candidate
21
cubic terms. object-region and category confidence maps by using different
22
layer features in the convolutional layer and achieves higher
23 B. Camera-based Object Detection Approaches
detection accuracy. The detection rate of these speeding up
24 Following the conventional learning or feature-based object methods can reach more than 30 fps. However, the speed of the
25 detection paradigm, deep learning has shown excellent algorithm comes at the cost of accuracy.
26 performance in the field of object detection using cameras for
27 intelligent transportation systems (ITS) application. The C. 3D-LIDAR and Camera Fusion Approaches
28 state-of-the-art methods of object recognition using deep Different sensors have their own merits but there are also
29 learning can be roughly divided into two categories: the some problems. 3D LIDAR is mainly used for 3D measurement
30 region-based method and the end-to-end method. The general and can’t be affected by the ambient lighting, but it provides
31 process of a region-based approach is to generate a large little information about the appearance of objects. In contrast,
32 number of candidate bounding boxes from the image using cameras can provide rich texture information of the detected
33 common methods like a sliding window [6], a selective search objects, but their performance greatly depends on illumination
34 [10], and the features of each object-region box would be conditions. Therefore, multi-sensor information fusion is
35 extracted and classified by a convolutional neural network critical for accurate object detection, but the fusion of sensor
36 model [29-31]. R-CNN [30] is a milestone applying CNN information should be based on accurate sensor calibration.
37 approach to object detection, and it achieves excellent object Recently, many studies are emerging on multi-sensor data
38 detection accuracy. On this method, a selective search [10] was fusion, and a survey can be referenced in [37]. Normally, the
39 used to generate region proposals, and the object image fusion techniques can be divided into three categories based on
40 extracted by the proposal was normalized as the standard input the level of abstraction that occurs, including (1) fusion on the
41 of the CNN. However, in classification, it needs to extract pixel level which combines the measurements to create a new
42 features from each extracted proposal of the test image, and the type of data [38], (2) fusion on the feature level that integrates
43 repetitive feature extraction leads to a huge computational features coming from data from different sensors [39-43] and (3)
44 waste. He et al. [31] improved the efficiency of R-CNN [30] by fusion on the decision level which combines the classified
45 accelerating the feature extraction link. In his method, the results from the data of each sensor [2, 44]. Schoenberg et al.
46 convolution feature map of the whole input image is calculated, [38] fused the LIDAR with the camera image on a pixel-level,
47 and then the feature vectors extracted from the shared feature and for each LIDAR point there is a pixel in the image
48 map are used to classify each object. This method is like to corresponding to it. Therefore, each point is added a pixel of
49
R-CNN [30], the training process of the network is still isolated, color intensity information. This method only uses of the
50
i.e., extracting the candidate regions, calculating CNN features intensity information and suffered from non-overlapping region
51
and SVM classification are carried out separately. This method problems. An improved approach presented by Cho et al. [39],
52
needs to pass a large number of intermediate results in the who extracted the data features of each sensor respectively, and
53
network besides the overall training parameters. In the fast combines them to classify and track the moving objects. The
54
R-CNN [29], a breakthrough idea was put forward, which work in [40] performed a pedestrian detection task by
55
combines the classification and bounding box regression. The combining the 3D-LIDAR data and the RGB image on different
56
training process is unified with further integration of the levels of the convolutional nets. The point clouds were first
57
multiple loss layer, which improves the accuracy of the converted into horizontal disparity, height, angle (HHA) maps,
58
59
60
Page 4 of 14
Submitted to IEEE Sensors Journal 4
1
and then the HHA maps and image were passed to two different object clustering. One common method of ground extraction
2
CNN models for classification. Chen et al. [41] proposed a and removal is to discard all points within a certain height [45].
3
multi-view network (MV3D) for 3D object detection, which Such method may play well in simple scenarios, but fails when
4
combines multiple views of LIDAR point cloud and images for the vehicle is moving in complex road environment. Li et al.
5
3D object proposals and object identification. An improved [46] introduced an improved method by projecting
6
deep model called AVOD [42] is proposed for small object measurements into a polar grid cell, where if both the mean
7
classes that multi-modally fuses features generated by point height and the standard deviation are within the predefined
8
clouds and RGB images to generate high-resolution feature thresholds, the region within the grid cell will be considered to
9
maps to generate reliable 3D object proposals. Liang et al. [43] belong to the ground set. However, even with this approach, an
10
exploits continuous convolutions to fuse image and LIDAR off-road environment may still be a challenge, and the
11
feature maps at different levels of resolution for 3D object operation could also be time consuming. The distance between
12
detector. Oh et al. [44] proposed an object detection method adjacent rings is more sensitive than the vertical displacement
13
based on the decision-level fusion, which fused the for measurement of the terrain slope [1]. The analysis of the
14
classification outputs from 3D LIDAR and the image data and range difference between adjacent rings provides a new idea for
15
obtained a classification performance of 77.72%. Instead of reliably detecting obstacles that are not even obvious to the
16
detecting the objects separately from the 3D LIDAR point vertical threshold algorithm [47, 48]. Choi et al. [47] compares
17
clouds or the image, it fuses the final results detected by the two the radius difference between adjacent beams with the given
18
sensors. In this paper, we just use the 3D LIDAR data to extract threshold to identify the ground points. Since the actual radial
19
object-region proposals to obtain the object’s initial location, difference between adjacent beams varies with the attitude of
20
and use a CNN network model to extract the feature from the the vehicle, it is very challenging to set an appropriate threshold.
21
corresponding image region and identify the object in the Hata et al. [48] identified curb-like points by checking whether
22
region. The superiority of our method is to take full advantage the ring distance between beams is within a given interval,
23
of the ability of 3D LIDAR to locate object quickly and which is based on a fixed ring distance on the plane. In this
24
accurately, and the merit of image for object recognition. paper, we still identify ground points analyzed the radius
25
distance between adjacent rings, but in different forms. We use
26
III. OBJECT DETECTION SYSTEM the ratio of the actual measured range difference to the
27
28 The framework of the proposed object detection algorithm is estimated range difference between adjacent rings to avoid the
29 shown in Fig. 1. This approach has two modalities of input, inconsistent variations in the range difference of adjacent rings
30 including 3D point cloud captured by a Velodyne 64E LIDAR of 3D LIDAR at different positions. In addition, the estimated
31 and color images captured by a CCD sensor, which are derived range difference between adjacent rings is not a fixed value, but
32 from the KITTI benchmark dataset [12]. The dataset was varies with the road conditions.
33 already calibrated by providing synchronized and calibrated One of the major challenges in processing the 3D LIDAR
34 data. The proposed framework is made up of two parts: (1) the data is that the 3D point cloud’s elements are represented by
35 generation of object-region proposals, including the Cartesian coordinates p = [ px , py , pz , pI ] , which contain a large
36 pre-processing of 3D LIDAR point clouds, extraction and number of discrete and unordered 3D points of the scenes. It is
37 removal of ground points, clustering non-ground obstacles, a time-consuming procedure to execute the search and index
38 calculating the 3D bounding boxes (BBs) of clustered obstacles operations among the points. Therefore, it is necessary to
39 and projecting the BBs onto an image to generate 2D reorganize the original disordered sparse 3D LIDAR point
40 object-region proposals, and (2) a multi-scale CNN-based clouds into the ordered point clouds.
41 classifier used to classify the object-region proposal. Actually, the raw output data of the 3D LIDAR is based on a
42 spherical coordinate system, which mainly includes the
A. Object-Region Proposal Generation Using 3D LIDAR Data
43 azimuth angle , the pitch angle of each beam , the
44 When an autonomous vehicle is moving, it may encounter
measurement distance d and the reflect intensity I . Therefore,
45 various sized objects from all directions and locations. To
we can encode the disordered sparse point cloud P into a
46 accelerate the detection process, the state-of-the-art approaches
multi-channel dense matrix M according to the rotation angle
47 generally use a proposal generator to generate a set of candidate
of the points and the number of the rings that the points belong
48 regions instead of exhaustive window search. The presented
to (i.e., the ID of the source laser beam), as illustrated in Fig. 2.
49 method only utilizes 3D spatial information provided by a 3D
The number of rows is defined by the numbers of rings in the
50 LIDAR to generate the object-region proposals, which can be
3D LIDAR frame. The number of columns depends on the
51 divided into 3 steps as below.
rotation rate of the Velodyne LIDAR, which is 10 Hz. And for
52 (1) Ground Point Extraction and Removal: In the 3D point
each rotation, the LIDAR sensor generate 64 2048 laser
53 cloud captured by 3D LIDAR, all the points that hit the
points.
54 obstacles on the ground, such as cars, trees, vegetation are
We first aggregated the point cloud P into the cells of matrix
55 always connected to the points on the ground. In order to
b r , c by the similar method from the previous work [5], which
56 improve the quality of the object-region proposals and to
reduce unnecessary computation, we need to remove the can be described through Eq. (1) to Eq. (5).
57
ground points from the raw point cloud before performing p =atan2(p y ,p x ) (1)
58
59
60
Page 5 of 14
Submitted to IEEE Sensors Journal 5
1
2 pdepth = px 2 + p 2 y (7)
3 An example of transformation of 3D LIDAR point cloud
4 from the KITTI benchmark in the multi-channel dense matrix is
5 shown in Fig. 2, where each row represents the measurement of
6 a single laser beam done during one rotation of the sensor. Each
7 column contains the measurements of all 64 laser beams
8 captured at a specific rotational angle at the same time. This
9 transformation provides an image-like coordinate frame to
10 organize discrete points and it also keeps the spatial
11
relationship between the points.
On the ideal flat horizontal plane, it is assumed that the
12
height of a 3D LIDAR installation and the pitch angle of each
13
laser beam are known and the expected depth difference
14
between the two adjacent beams can be computed. The
15 Fig. 2. Example of 3D LIDAR point clouds from the KITTI benchmark dataset
difference in this range decreases with the rising elevation of
16 followed by the corresponding depth mapping, height mapping and reflectance
17
mapping. Each row represents the measurement of a single laser beam done the surface. A geometrical model of ground extraction
during one rotation of the sensor. Each column contains measurements for all 64 algorithm is shown in Fig. 3.
18 laser beams captured at a specific rotational angle at the same time.
19 Suppose that the symbol bi +1, j is used to represent the cell of
p = arcsin(p z / p x 2 + p y 2 + p z 2 ) (2) (i + 1) th row and jth column of the matrix, and the symbol
20
21 pr = ( p + 180) / (3) pdepth i +1, j is used to represent the depth value of points in bi +1, j .
22
pc = p / (4) In order to determine if the points in bi +1, j are ground points, we
23
b r ,c = { p P | pr = r pc = c} first estimate the depth difference between the previous cell of
24 (5)
25 the same column (i.e., bi , j ), and use the symbol Ed (bi , j , bi +1, j ) to
Where p = [ px , py , pz , pI ] represents to a 3D point, ( p , p )
26 represent the estimated depth difference. The 3D LIDAR points
represents the rotation angle and pitch angle of the point, of two adjacent scan lines on the plane will form a concentric
27 ( pr , pc ) represents the row and column indices of a point in the
28 circle, and the depth difference between the two adjacent scan
matrix, represents the average rotation angle resolution, and lines depends on the installation height of 3D LIDAR and the
29
represents the vertical angle resolution of the continuous pitch angle in the vertical direction of the laser line. The
30
31 beam transmitter. In fact, the row also corresponds to the Ed (bi , j , bi +1, j ) value is a constant, and its value depends on the
32 number of laser beams and all the points that allocated to the pitch angle of the adjacent ( ith and (i + 1)th ) scan lines in the
same row are captured by the same laser beam.
33 vertical direction and the installation height of the LIDAR. The
Since the horizontal representation of our encoding is equal
34 actual depth difference between the adjacent cells bi , j and bi +1, j
to the original Velodyne resolution, then a few points may fall
35 is called the measured depth difference, and is represented by
into the same cell b r , c , in which case the point closest to the
36 M d (bi,j , bi+1,j ) . The measured depth difference M d (bi,j , bi+1,j ) on
37 observer is retained. We reduce the number of channels and
populate the cell b r , c with the 3-channel data m(b r , c ) which can the ground seldom changes, and the estimated depth difference
38
Ed (bi , j , bi +1, j ) is approximately equal to the measured value
39 be expressed by Eq. (6)
40 m(b r , c ) = ( pz , pI , pdepth ) (6) M d (bi,j , bi+1,j ) . However, when the points of 3D LIDAR in the
41 Where pz , pI , pdepth represent the height, intensity value and cell bi +1, j hit the obstacle as shown in Fig. 3, the depth of the
42 depth value of a point, respectively. The depth value is defined points are truncated by the obstacles, resulting in a sudden
43 in Eq. (7) decrease in the depth distance between the two adjacent points
44 of two adjacent laser line. It wills lead to an obvious difference
45 between the estimated depth difference Ed (bi , j , bi +1, j ) and the
46 measured depth difference M d (bi, j , bi +1,j ) . Therefore, we can
47
compare the values of Ed (bi , j , bi +1, j ) and M d (bi, j , bi +1,j ) to
48
49 determine whether the points in the cell bi +1, j are ground points
50 0 or obstacle points. The LIDAR point cloud is approximately
51 ri+1 i+1 i h concentrically distributed on the ground, and the farther the
52 ri
adjacent rings are from the origin of LIDAR, the greater the
Md
53 w value of Ed (bi , j , bi +1, j ) . The range of absolute difference between
54 bi+1 Ed bi b1
Rd
b0 M d (bi, j , bi +1,j ) and Ed (bi , j , bi +1, j ) is [0, Ed (bi , j , bi +1, j )] , thus this
55 range varies with position, and it is difficult to find a suitable
56 Fig. 3: The geometrical model for ground extraction is established by threshold to distinguish the category of LIDAR point cloud, but
57 comparing the expected range difference Ed with the measured range difference
the proportional range of M d (bi, j , bi +1,j ) and Ed (bi , j , bi +1, j ) at any
md between the two adjacent 3D LIDAR rays on the ground.
58
59
60
Page 6 of 14
Submitted to IEEE Sensors Journal 6
1
position is always [0,1] . Therefore, Instead of using the Where i +1 represents the vertical pitch angle of the (i + 1)th
2
3
absolute difference, we adopt a proportional method to avoid scan line. According to the geometrical relation, can be
the inconsistent variations in the depth difference of two calculated by
4
adjacent laser lines of 3D LIDAR at different positions. ri R
5 = d (12)
6 Accordingly, and the ground attestation of cell bi +1, j can be sin sin i
7 calculated by Eq.(8). Rd 2 = h 2 + ri 2 − 2hri cos i (13)
8 P(bi +1, j )=
M d (bi,j , bi+1,j )
Where i represents the vertical pitch angle of the ith scan line,
(8)
9 Ed (bi,j , bi+1,j ) and ri represents the radical distance of the points in the cell bi,j .
10 Where M d (bi,j , bi+1,j ) is the actual depth difference between Joint Eq. (3) to Eq. (6), the estimated range difference between
11
adjacent cells bi,j and bi+1,j , and Ed (bi,j , bi+1,j ) is the estimated two adjacent cells bi +1, j and bi , j can be calculated by Eq. (14).
12
13 depth difference between the adjacent cells. The M d (bi,j , bi+1,j ) ri sin
Ed (bi,j , bi+1,j ) =
14 value is calculated with Eq. (9) as below. h sin i
sin[arcsin( ) − i +1 ] (14)
15 M d (bi,j , bi+1,j ) = pdepth i +1, j − pdepth i , j (9) h + ri 2 − 2hri cos i
2
16 Where pdepth i +1, j represents the depth value of points in the cell The closer the value P(bi +1, j ) is to 1, the greater the probability
17
bi+1,j . The geometrical model of the ground extraction that the points in the cell bi +1, j belong to the ground set. All the
18
19 algorithm is illustrated in Fig. 3. Due to the variety of terrain, ground cells in the matrix are sequentially extracted by the
the vehicle may encounter flat, undulating, hillsides or other above method, then we convert those ground cells into point
20
roadways. The extension line of the LIDAR’s axis is clouds through Eq. (15):
21
perpendicular to the surface flat road, i.e., the angle between the pz
22 extension line of the LIDAR’s axis and the ground surface is px = sin(c ) cos(r ) cos(c )
23 90º. However, for a sloping road, the extension line of the
24 pz
LIDAR’s axis is no longer perpendicular to the road surface py = cos(r ) sin(c ) (15)
25 sin(c )
due to the pitch of the vehicle, as shown in Fig.3. In order to
26 p = p
make the proposed algorithm adaptive to different roads, it is z z
27 not always assumed that the extended line of the LIDAR axis is
28 perpendicular to the ground plane when calculating the After removal of the ground points, we get all the points
29 expected radial distance between two adjacent scanning lines. belong to the obstacle set. Some examples of 3D LIDAR
30 Here, the angle between the extension line of the LIDAR’s axis ground point cloud extracted from the KITTI benchmark
31 and the ground surface is defined as a variable , which varies dataset are shown in Fig. 4, and the white dots indicate the
32 with the pitch angle of the vehicle. extraction of ground points.
33 (2) Non-Ground Segmentation: After removing the ground
According to the geometrical relation, Ed (bi , j , bi +1, j ) can be
34 points, the rest of point cloud needs further segmentation. The
calculated by Eq. (10). Euclidean clustering method [49] is one of the most used
35 Ed (bi,j , bi+1,j ) h methods dividing points into individual clusters. This method
36 = (10)
sin sin requires a fixed radius threshold. However, the point cloud
37 captured from the 3D LIDAR is dense horizontally while sparse
38 Where h represents the installation height of 3D LIDAR,
represents the vertical angle resolution of 3D LIDAR, vertically, which causes the distribution of the points of the
39 object is fairly irregular. Therefore, under a fixed threshold, the
40 represents the angle between the ground surface and the segmentation of non-ground points will result in an
41 (i + 1)th scan line, and can be calculated by Eq. (11). under-segmentation or over-segmentation problem.
42 To avoid this problem, the non-ground points are segmented
43 = − i+1 − (11) in two steps. We first use a small azimuth difference threshold
44 to cluster the non-ground points into several groups, as
45 illustrated in Fig. 5 (b), and then an adaptive threshold method
46 is used to further segment the clustered groups, as illustrated in
47 Fig. 5 (c).
48 The segmentation process is described as the following pseudo
49 code. The input is a set of non-ground point clouds P captured
50 from a 3D LIDAR and the output is a set of clusters , in which
51 each cluster contains a set of non-ground points that belong to a
52 single object.
53
54 Algorithm: Segmentation of non-ground points
55 1 INPUT: non-ground points P from 3DLIDAR, the difference
56 azimuth threshold similarity
Fig. 4. Examples of 3D LIDAR ground point cloud extractions from the KITTI
57 benchmark dataset, and the white dots indicate the extraction of ground points. 2 OUTPUT: object segments = {C1 ,C2 ,...,Cn } , set of clusters
58
59
60
Page 7 of 14
Submitted to IEEE Sensors Journal 7
1
Algorithm: Segmentation of non-ground points
2
3 3 INITIALLY: as the set of clusters to keep
4 4 Foreach pi P do
5 5 isInserted false
6
7 6 Foreach C do
8 7 Foreach p j C do
9 8 If d _ azimuth(pi , p j ) similarity Then
10 (a)
11 9 If d _position(pi , p j ) d (p i ) Then
12 10 C C {p j }
13 11 isInserted true
14
15 12 Break;
16 13 End
17 14 End
18 (b)
19 15 End
20 16 End
21 17 If isInserted false Then
22
C {p j }
23 18
24 19 {C}
25 20 End
26
27 21 End (c)
Fig. 5. Illustration of the non-ground segmentation method: (a) shows the
28 original non-ground point clouds; (b) shows the clustering results using the
29 Initially, the first point is categorized to the first group. The azimuth difference threshold; and (c) shows the final non-ground
30 3D LIDAR gives the scanning data in the order of azimuth, thus segmentation results using two criterions.
31 the azimuth angle of the LIDAR point hitting the same object is similarity depends on the horizontal angle resolution of the
32 continuously distributed. If the difference of the azimuth of the
LIDAR . We take 3 as the threshold similarity in order to
33 two points is smaller than the threshold, they probably come
from the same object. For a point pi P,(i 1) that is not eliminate the influence of isolated noise points. The function
34 d _position(.) is used to calculate the Euler distance between
35 assigned to any other cluster, we first calculate the azimuth of
absolute difference d _ azimuth( pi , p j ) relative to the other two points. The adaptive threshold d ( pi ) is designed as a
36
37 elements pj C . If the difference is less than similarity , it means linear function of the depth values in this point, and can be
calculated by:
38 that pi is in the same azimuth zone with cluster C , and then we
d ( pi ) = Dxy ( pi ) u2 + u1 (16)
39 will further determine whether pi should be inserted into C by
40 comparing the Euler distance between the two points with the The function Dxy (.) refers to the depth value between the
41 adaptive threshold d ( pi ) . The value of the threshold current point and the origin on the x-y plane. The parameter u2
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56 Fig.6 Segmentation results of non-ground point clouds in some typical scenarios, including vehicles in the shade of the trees, darker vehicles, and denser scenes. The
57 proposed algorithm can segment the scene target well.
58
59
60
Page 8 of 14
Submitted to IEEE Sensors Journal 8
1
is obtained by analyzing the regular relationship between two to the CNN model for recognition.
2 adjacent points in the same laser beam. Considering that the
3 horizontal resolution of the Velodyne HDL-64E is 0.09ºwhen
B. CNN-based Feature Extraction and Classification
4 running at 10 Hz, and the interval between two adjacent points The CNN model is used to extract the features of the
5 extracted bounding boxes and classify the object in the
in the same laser beam is 0.09 Dxy / 360o theoretically. As a
6 bounding boxes. The CNN model has achieved remarkable
threshold, the value in this paper is magnified appropriately to
7 success in the field of object classification due to its ability to
8 triple as parameter u2 . The parameter u1 serves as the maximum learn to express and estimate objects directly. We present a
9 tolerance distance between two obstacles, and this value is CNN architecture to accurately classify the object-region
10 also used to distinguish two objects with different horizontal proposals, as illustrated in the Fig. 8.
rotation angles, and we use two times the horizontal resolution
11 The aim is to be able to detect objects that are captured under
angle of the 3D LIDAR. challenging conditions in which the scale of the object varies
12
13
If pi cannot meet the above conditions, a new cluster is dramatically. Although the previous region-based CNN models
14 created and pi assigned to a new cluster. Following the same e.g., Fast-RCNN [29], does not require the proposal box to have
15 criteria, it can separate non-ground objects and complete the a fixed size, but it is difficult to detect the tiny objects robustly.
16 entire segmentation. An example of a non-ground segmentation The main reason is that those networks perform ROI pooling
results in some typical scenarios is shown in Fig. 6, including only in the last feature map. However, after multiple
17
vehicles in the shade of the trees, darker vehicles, and denser convolution and pooling operations for the candidate region of
18
scenes. The proposed algorithm can segment the scene target a tiny object, there is very little information of the object in the
19
well. last layer of convolution feature layer. For example, in the
20 VGG-16 model [50], the global strides of ‘Conv5’ is 16, and
21 (3) Region Proposal Generation: The different processing
steps to generate object-region proposals in an image using 3D when given a bounding box area of less than 16 16 pixel size,
22 the feature of the final output is just one pixel. Under these
LIDAR data are shown in Fig. 7.
23 circumstances, even though the candidate area contains an
To generate more accurate object-region proposals and
24 object, it is difficult to locate and identify the object according
ensure better performance of the detector module, we compute
25 the 3D bounding box of each cluster and filter out some dummy to this feature.
26 objects based on empirical information. When the LIDAR To address this issue, the CNN model proposed in this paper
27 scanning distance exceeds 60 m, few points will be capture. does not carry out ROI pooling just on the last convolution
28 Therefore, we will abandon the candidate box beyond this feature map. Instead, the region proposal is projected into
29 scope. Besides, the bounding box will be discarded if the width multiple layers of feature maps, and the ROI pooling operation
30 of the bounding box is greater than 3 m, or the length exceeds is executed in each layer. More specifically, our model is based
31 10 m, or the height is lower than 0.5 m or greater than 2.5 m. on the VGG16 [50]. Rather than performing ROI pooling only
32 Next, according to the coordinate calibration relationship of the on the last convolutional layer, we execute ROI pooling after
33 3D LIDAR and the camera, the remaining 3D boundary boxes Conv3, Conv4 and Conv5 layers. Each layer will generate a
34 are mapped to the corresponding image space. The 3D fixed-size feature tensor. In order to bring the feature maps
35 boundary boxes beyond the image space are discarded, and the from different convolution layers to the same scale, we
36 2D candidate boundary rectangle are generated from the normalize the feature tensor using L2 normalization for
37 mapping area of each 3D boundary box in the image. To robustness of the detection system, and concatenate all the
38 guarantee the performance of the detector module, we enlarge normalized feature tensors similar to [51]. The normalization is
39 the rectangle by 15% so that the entire object is inside the conducted within each pixel of the feature maps, and all the
rectangle. The resulting rectangle areas of the image are passed feature maps are treated independently the normalization
40
procedure is expressed with Eq. (17) and Eq. (18) as below.
41
42
43
44
45
46
47
48
49
50
51
52
53
54 (a) (b) (c)
55 Fig. 7. Illustration of the object-region proposals generated in an image using 3D LIDAR data: (a) the results of non-ground clustering; (b) the rest of the 3D
56 bounding boxes after filtering with the experimental information; (c) the projection of the 3D bounding boxes in the 2D image space and obtained the final 2D
57 object-region proposals in the image space after enlarging the 2D bounding boxes
58
59
60
Page 9 of 14
Submitted to IEEE Sensors Journal 9
1
2
Object proposal Conv1 Conv2 Conv3 Conv4 Conv5
3 Input image
4 generation
5
6
7 ROI ROI ROI
8 Regions of Pooling Pooling Pooling
9 Interest
10 (RoIs) L2
L2 L2
11
3D point clouds normalized normalized normalized
12
13
14 concatenation
15 softmax
1 1 fc fc
16
17 Conv
bbox
18
19 Fig. 8. Structure of the convolutional neural network. The image and the acquired 2D candidate regions are used as input to the proposed network model. The
20 architecture is based on the VGG16 model [50], which consists of five sets of convolution layers: Conv1 to Conv5. We add ROI pooling layers and L2
21 normalization after Conv3-Conv5 layers to get multi-scale information. Then a 1 1 convolution is used to integrate the information and dimension reduction of the
concatenated features. Then we estimate the bounding boxes and class confidence by following two fully connected layers and multitask function.
22
23 x denoted as b = {bcx ,bcy ,b w ,b h } , which represents predicted
24 x= (17)
x2 bounding box location for each of the K object classes.
25
bcx ,bcy ,b w and b h denote the two coordinates of the predicted
26
d
x 2 = ( xi )1/ 2 (18) bounding box center, width and height respectively. For
27 i=1
28 instance, we assume that the ground-truth class label
Where x represents the original features and x represents
29 distribution is denoted as a vector q = {q 0 ,q1,...q i ,...q K } , where
the normalized features. In Eq. (17), d represents the
30 dimension of the feature from each convolution layer. In the q i −1 = 1 when the sample belong to category i , and the other
31 training process, the feature normalization step will redress the elements of the vector are 0. We assume the location of the
32 scale factor using the updated scale factors. For each channel of ground-truth bounding-box location is g = {g cx ,g cy ,g w ,g h } . For
33 the feature map, the scale factor is calculated by Eq. (19). object classification and bounding box regression, we defined
34
yi = i x i (19) the multi-task loss (classification loss and bounding box
35 regression loss) function on the ROI during the training phase
36 Where y i represents the re-scaled feature value. According following [29] as Eq. (23):
37 to the back-propagation rule, the scale factor i can be L( p,q,b,g) = Lcls (p,q) + [q bg]Lloc (b,g) (23)
38 renovated by Eq. (19) to Eq. (22).
39 The classification loss Lcls (p,q) is cross entropy loss, and
dl dl
40 = (20) calculated as follows:
dx dy N K
41 Lcls = − q i , j log(pi , j )
T (24)
42 dl
=
dl I
( −
xx
) i =1 j = 0
(21)
43
3
dx d x x 2 x2 Where N is the number of samples, K is the number of
44 categories, pi, j is the probability that the model predicts sample
dl dl
45 = xi (22)
46 d i yi yi i belong to the category j, and q i, j is the probability that the
47 sample i belong to category j. For the bounding box regression
Where y = [y1 , y2 , , yd ]T .
48 loss Lloc (b,g) as Eq. (24), we use a Smooth L1 loss between the
To match the original size of the ROI pooling feature map,
49 we use 11 convolution to narrow the connected feature predicated bounding box location and the ground-truth
50 dimensions. The final feature tensor is then passed to the two bounding box location defined in [29]. When q represents the
51 fully connected layers for object positioning and recognition background ROIs, we ignore Lloc (b,g) , i.e., q bg .
52 based on the feature tensor. N
Lloc (b,g) = smooth L1 (bij − g ij ) (25)
53 The output of the network model consists of two parts. One is i j{x,y,w,h}
54 a vector of K+1 dimension output by One-hot encoding,
55 denoted as p = {p0 ,p1,p 2 ,...p K } , which represents the probability IV. EXPERIMENTAL SETUP AND EVALUATION
56 distribution of which category a sample belongs to. Other
57 outputs a vector representing 4 parameterized coordinates, This section first introduces the object detection benchmark
58 and evaluation metrics. Then the experiments are carried out
59
60
Page 10 of 14
Submitted to IEEE Sensors Journal 10
1
and experimental results are analyzed and discussed. All We evaluated the proposed approach for all 9 object classes in
2
experiments were conducted using an Intel (R) Core (TM) the KITTI validation dataset [12]. We compared our proposed
3
i7-4790 3.6 GHz processor, with 64 GB RAM. The graphics method with other conventional ones such as sliding window
4
card for convolutional network training and testing is a Titan X [6], edge box [7], selective search [8] and MCG [9], and the
5
with 12 GB of memory. The CNN model was implemented detection results are limited to 60 m. The comparison of the
6
using C++ on the Ubuntu 14.04+ROS operating system and recall rates of all methods in generating different object-regions
7
trained on the Caffe platform [52]. is shown in Fig. 9.
8
We used 1000 object-region proposals to plot the recall rate
9 4.1. KITTI Object Detection Dataset
as a function of the IOU threshold. As observed, the proposed
10 In order to evaluate the performance of the proposed method provides over 95% of recall rate across the entire range
11 multi-object detection algorithm, quantitative and qualitative of IOUs.
12 experiments were conducted on the 2012 2D KITTI object The main reason is that all baseline methods generate
13 detection benchmark [12]. The dataset consists of a object-region proposals from 2D image space, while the
14 synchronized stereo camera image and a 3D LIDAR frame object-region overlap often appears in the image space, and it is
15 captured from an autonomous vehicle. The camera image is difficult to distinguish them. However, in the 3D point cloud
16 cropped to pixels and rectified to pixels. Specifically, the 3D captured from the 3D LIDAR, the object-regions can be
17 LIDAR frames are captured from HDL-64E with 64 scanning distinguished by the object depth feature, which is not easy to
18 lines, and can perform 360 scans. If it rotates at a 10 Hz distinguish in the image space. In addition, the region proposal
19 frequency, it can generate 1 million points per second. framework based on visual information can only provide a
20 The dataset provides 7,481 frames of training and 7,518 rough bounding box position. Thus, the recall rate declines
21 frames of testing. Since the labels in the test set were not rapidly when the higher overlap is required, while the 3D
22 disclosed, we adhered to [14], and divided the training data into LIDAR has obvious advantage over the camera on achieving
23 a training set (80%) and a validation set (20%). The training the posture and shape of the detected objects, since the laser
24 data contains 9 different categories of 51,867 labels: 'car’, scans contain the spatial coordinates of the point clouds by
25 ‘pedestrian’, cyclist’, ‘van’, ‘truck’, ‘sitting person’, ‘tram’, nature.
26 ‘miscellaneous’ and ‘don’t care’ and show road scene of
27 various appearances. In addition, based on the size of the 2D TABLE1
28 bounding box in the image space and the occlusion conditions,
THE RESULTS OF RUNTIME (MS) AND AVERAGE PRECISION (AP%) ON THE
KITTI DATA SET IN OUR STUDY COMPARED WITH FOUR STATE-OF-THE-ART
29 the object samples in the KITTI benchmark are divided into PROPOSAL GENERATION METHODS.
30 three difficulty levels: easy, moderate and hard.
31 Method NF
Runtime/ AP/%
ms Cars Pedestrian
32 4.2. Evaluation Sliding window [6] 58.8
2000 524 42.5
33 Firstly, the performance of the object-region generation +Fast-RCN
Selective search[8]
34 method based on 2D recall of the ground truth annotation is +Fast-RCN
2000 221 73.7 55.9
35 evaluated. We used the provided calibration file to project the MCG [9]
+Fast-RCN
2000 350 81.3 62.2
36 proposed object onto the 2D image plane and discarded any Edge boxes [7]
2000 139 78.3 62.4
+Fast-RCN
37 detections outside the image. The intersection-over-union (IOU) Our method
86 53 87.8 70.7
38 metric is used as the evaluation criterion to evaluate +Fast-RCN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 Fig. 10. Precision-recall curves for the three object classes evaluated at three difficulty levels using the KITTI validation set. All precision-recall curves were
16 obtained by our CNN model (solid-line) and VGG16 model (dashed-line).
17
18 As can be seen from TABLE 1, our object-region proposal method and applied the PASCAL VOC [53] evaluation tool kit
19 method generates on average 86 non-duplicated proposals per to calculate the average precision. Fig.10 shows the
20 frame (NF), which is smaller than other methods (2000 NF). precision-recall curve of the baseline method and our method.
21 However, due to our method of providing fewer errors and The area below the precision-recall curve is the AP value. By
22 higher recall rates, we achieved approximately 87% of AP for comparing the precision-recall curves, we can clearly see that
23 the cars category achieving better performance than most of the our approach greatly exceeds the baseline approach for each
24 state-of-the-art methods. At the same time, we outperformed grade of difficulty in the three object categories and still
25 the other methods in each category of moderate level by 89.8% performs better with increasing difficulty. This result
26 and 70.7% for cars, pedestrians respectively, while greatly demonstrates that the information loss can be reduced by
27 reduced the calculation time. This clearly shows that the point combining multiple convolutional feature layers. The results
28 cloud of 3D LIDAR can be applied to precisely extract object show that by combining the features of multiple convolution
29 regions at the object level. layers, the drop of information can be effective decreased and
30 To verify the quality of the proposed CNN model, we used the tiny objects can be detected more effectively, and we have
31 the generated region proposal as input and set the original achieved 89.04% and 78.18% of the AP in moderate level for
32 VGG16 [50] model as the baseline. In the experiment, the cars and pedestrians respectively, which is superior to most of
33 proposed CNN model was trained on the KITTI benchmark [12] the published object detection methods. This is the concrete
34 training set, and the employment categories consisting of cars, evidence to prove that the proposed method has achieved very
35 pedestrians and backgrounds. In the training phase, we first competitive results against state-of-the-art methods. Fig. 12
36 initialized the parameters using a pre-trained VGG-16 with the shows some examples of detection in the KITTI dataset.
37 Image Net [54], and then fine-tuned them using the ground Although there are some serious obstructions and small size
38 -truth annotations and the generated candidate regions obtained objects in the image, the proposed detection method can still be
39 from the KITTI benchmark training set. A sampled candidate accurately detected. At the same time, we also get the distance
40 region is considered as positive if and only if the candidate information of the target. In order to evaluate the runtime of our
41 region overlaps the ground truth annotation by more than 50%. proposed approach, we performed a total of 7481 frames of
42 Otherwise, the candidate region will be treated as a background. KITTI training and validation datasets. Fig. 11 shows the
43 The positive samples are a quarter of the total samples. The runtime results of the proposed approach in the experiment.
44 Nesterov Accelerated Gradient (NAG) [55] algorithm is used From Fig. 11 it can be seen that the average period is
45 for the optimization of the CNN training. NAG is one of the approximately 66.79 ms, which means that our multi-object
46 most popular algorithms to optimize neural networks. This detection pipeline has a faster frame rate than the 3D LIDAR
47 method is adaptively updated according to the slope of the loss
48 function in each learning process to accelerate the convergence.
49 We use a NAG optimizer to fine-tune the CNN model, with an
50 initial learning rate of 0.001, a batch size of 16 and a
51 momentum coefficient of 0.9. In addition, instead of
52 fine-tuning all the layers in the experiments, we keep the
53 parameters of the first two sets of the convolution layer
54 unchanged and fine-tune the other layers with maximum
55 number of iterations of 200, 000. After training, we tested the
56 object detection performance of our model’s and the baseline Fig. 11. Runtime for the proposed approach on the KITTI training and
57 approach on the KITTI validation sets using the standard validation datasets [12], the average running time of our algorithm is nearly
58 precision-recall (PR) curve. We followed KITTI’s assessment 66.79 ms, which is much lower than the TuSimple’s [56] running time.
59
60
Page 12 of 14
Submitted to IEEE Sensors Journal 12
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Fig. 12. Examples of object detection results using our proposed method on the KITTI benchmark dataset [12], including pedestrians and cars at various difficulty levels.
V. CONCLUSION AND FUTURE WORK

In this work, we proposed a novel and fast multi-object detection approach that fully utilizes the complementarity of 3D LIDAR and camera data to robustly identify multiple objects around an autonomous vehicle. The experimental results on the KITTI benchmark show that the method yields an average of 86 non-repeating object candidate regions per frame, which is far fewer spurious candidate regions than conventional proposal methods generate. While also providing object distance information, the proposed method reached average precisions of 89.04% and 78.18% when detecting vehicles and pedestrians, respectively, at the moderate difficulty level, which is better than most published methods. The average runtime per frame of our method is about 66.79 ms, meaning that it can be executed rapidly and deployed online. Its performance is very competitive compared with current popular methods.

Although 3D LIDAR is unaffected by changes in environmental illumination, few points are captured when the scanning range exceeds 60 meters, which makes it difficult to generate accurate and complete object-region proposals. This limitation of the 3D LIDAR scan range degrades the performance of the proposed method when detecting small objects at the moderate and hard difficulty levels. To address this problem, we will use millimeter-wave radar in future work to supply additional measurements for generating sufficient object-region proposals. Another limitation is that the method only outputs 2D bounding boxes; we will make fuller use of the complementary sensor information to output full 3D bounding boxes in future work.
[35] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517-6525.
[36] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, et al., "SSD: Single shot multibox detector," in European Conference on Computer Vision, 2016, pp. 21-37.
[37] F. Garcia, D. Martin, A. de la Escalera, and J. M. Armingol, "Sensor fusion methodology for vehicle detection," IEEE Intelligent Transportation Systems Magazine, vol. 9, pp. 123-133, 2017.
[38] J. R. Schoenberg, A. Nathan, and M. Campbell, "Segmentation of dense range information in complex urban scenes," in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 2033-2038.
[39] H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar, "A multi-sensor fusion system for moving object detection and tracking in urban driving environments," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1836-1843.
[40] J. Schlosser, C. K. Chow, and Z. Kira, "Fusing LIDAR and images for pedestrian detection using convolutional neural networks," in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 2198-2205.
[41] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in IEEE CVPR, 2017, p. 3.
[42] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, "Joint 3D proposal generation and object detection from view aggregation," arXiv preprint arXiv:1712.02294, 2017.
[43] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 641-656.
[44] S.-I. Oh and H.-B. Kang, "Object detection and classification by decision-level fusion for intelligent vehicle systems," Sensors, vol. 17, p. 207, 2017.
[45] C. Mertz, L. E. Navarro-Serment, R. MacLachlan, P. Rybski, A. Steinfeld, A. Suppé, et al., "Moving object detection with laser scanners," Journal of Field Robotics, vol. 30, pp. 17-43, 2013.
[46] Q. Li, L. Zhang, Q. Mao, Q. Zou, P. Zhang, S. Feng, et al., "Motion field estimation for a dynamic scene using a 3D LiDAR," Sensors, vol. 14, pp. 16672-16691, 2014.
[47] A. Petrovskaya and S. Thrun, "Model based vehicle detection and tracking for autonomous urban driving," Autonomous Robots, vol. 26, pp. 123-139, 2009.
[48] A. Y. Hata, F. S. Osorio, and D. F. Wolf, "Robust curb detection and vehicle localization in urban environments," in 2014 IEEE Intelligent Vehicles Symposium Proceedings, 2014, pp. 1257-1262.
[49] Y. Zhou, D. Wang, X. Xie, Y. Ren, G. Li, Y. Deng, et al., "A fast and accurate segmentation method for ordered LiDAR point cloud of large-scale scenes," IEEE Geoscience and Remote Sensing Letters, vol. 11, pp. 1981-1985, 2014.
[50] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[51] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, "CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection," arXiv preprint arXiv:1606.05413, 2016.
[52] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, et al., "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675-678.
[53] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303-338, 2010.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[55] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)," in Doklady AN USSR, 1983, pp. 543-547.
[56] F. Yang, W. Choi, and Y. Lin, "Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2129-2137.

Xiangmo Zhao, IEEE Member, received the B.S. degree from Chongqing University, China, in 1987, and the M.S. and Ph.D. degrees from Chang'an University, China, in 2002 and 2005, respectively. He is currently a professor and the vice president of Chang'an University, China. He has authored or co-authored over 130 publications and has received many technical awards for his contributions to the research and development of intelligent transportation systems. His research interests include intelligent transportation systems, distributed computer networks, wireless communications, and signal processing.

Pengpeng Sun received the bachelor's degree in computer science and technology from Chang'an University, Xi'an, China, in 2014, where he is currently pursuing the Ph.D. degree in traffic information engineering & control. His current research interests include 3D LIDAR point cloud data processing, object detection based on multi-sensor fusion for autonomous vehicles, computational intelligence, and image understanding.

Zhigang Xu, IEEE Member, received the B.S. degree in automation and the M.S. and Ph.D. degrees in traffic information engineering and control from Chang'an University, China, in 2002, 2005, and 2012, respectively. He is currently an associate professor at Chang'an University, China. His research focuses on connected and automated vehicles, intelligent transportation systems, and nondestructive testing of infrastructure.

Haigen Min, IEEE Student Member, received the B.S. and M.S. degrees from the Department of Computer Science and is currently pursuing the Ph.D. degree in the Department of Traffic Information Engineering & Control at Chang'an University, China. His research interests include localization and navigation systems for intelligent vehicles and test methodology for intelligent and connected vehicles.

Hongkai Yu received the Ph.D. degree in computer science and engineering from the University of South Carolina, Columbia, SC, USA, in 2018. He then joined the Department of Computer Science at the University of Texas Rio Grande Valley, Edinburg, TX, USA, as an assistant professor. His research interests include computer vision, machine learning, deep learning, and intelligent transportation systems. He is a member of the IEEE.