
Vehicle Detection from 3D Lidar Using Fully Convolutional Network
Bo Li, Tianlei Zhang and Tian Xia
Baidu Research Institute for Deep Learning
{libo24, zhangtianlei, xiatian}@baidu.com

arXiv:1608.07916v1 [cs.CV] 29 Aug 2016

Abstract: Convolutional network techniques have recently achieved great success in vision-based detection tasks. This paper introduces the recent development of our research on transplanting the fully convolutional network technique to the detection task on 3D range scan data. Specifically, the scenario is set as the vehicle detection task from the range data of a Velodyne 64E lidar. We propose to present the data in a 2D point map and use a single 2D end-to-end fully convolutional network to predict the objectness confidence and the bounding boxes simultaneously. By carefully designing the bounding box encoding, the network is able to predict full 3D bounding boxes even though it only uses 2D convolutions. Experiments on the KITTI dataset show the state-of-the-art performance of the proposed method.

I. INTRODUCTION

Over years of robotics research, 3D lidars have been widely used on different kinds of robotic platforms. Typical 3D lidar data present the environment information as a 3D point cloud organized in a range scan. A large amount of research has been done on exploiting range scan data in robotic tasks including localization, mapping, object detection and scene parsing [16].

In the task of object detection, range scans have a specific advantage over camera images in localizing the detected objects. Since range scans contain the spatial coordinates of the 3D point cloud by nature, it is easier to obtain the pose and shape of the detected objects. On a robotic system including both perception and control modules, e.g. an autonomous vehicle, accurately localizing the obstacle vehicles in 3D coordinates is crucial for the subsequent planning and control stages.

In this paper, we design a fully convolutional network (FCN) to detect and localize objects as 3D boxes from range scan data. FCNs have achieved notable performance in computer vision based detection tasks. This paper transplants the FCN to the detection task on 3D range scans. We restrict our scenario to 3D vehicle detection for an autonomous driving system, using a Velodyne 64E lidar. The approach can be generalized to other object detection tasks on similar lidar devices.

II. RELATED WORKS

A. Object Detection from Range Scans

Traditional object detection algorithms propose candidates in the point cloud and then classify them as objects. A common category of algorithms proposes candidates by segmenting the point cloud into clusters. In some early works, rule-based segmentation is suggested for specific scenes [10, 20, 5]. For example, when processing the point cloud captured by an autonomous vehicle, simply removing the ground plane and clustering the remaining points can generate reasonable segmentation [10, 5]. More delicate segmentation can be obtained by forming graphs on the point cloud [32, 14, 21, 29, 30]. The subsequent object detection is done by classifying each segment and is thus sometimes vulnerable to incorrect segmentation. To avoid this issue, Behley et al. [2] suggest segmenting the scene hierarchically and keeping segments of different scales. Other methods directly exhaust the range scan space to propose candidates, which avoids incorrect segmentation. For example, Johnson and Hebert [13] randomly sample points from the point cloud as correspondences. Wang and Posner [31] scan the whole space with a sliding window to generate proposals.

To classify the candidate data, some early research assumes a known shape model and matches the model to the range scan data [6, 13]. In recent machine learning based detection works, a number of features have been hand-crafted to classify the candidates. Triebel et al. [29], Wang et al. [32] and Teichman et al. [28] use spin images, shape factors and shape distributions. Teichman et al. [28] also encode the object's moving-track information for classification. Papon et al. [21] use FPFH. Other features include normal orientation, distribution histograms, etc. A comparison of features can be found in [1]. Besides hand-crafted features, Deuge et al. [4] and Lai et al. [15] explore learning feature representations of point clouds via sparse coding.

We would also like to mention that object detection on RGBD images [3, 17] is closely related to the topic of object detection on range scans. The depth channel can be interpreted as a range scan and naturally applies to some detection algorithms designed for range scans. On the other hand, numerous studies have exploited both depth and RGB information in object detection tasks. We omit a detailed introduction of the traditional literature on RGBD data here, but the proposed algorithm in this paper can also be generalized to RGBD data.

B. Convolutional Neural Network on Object Detection

The Convolutional Neural Network (CNN) has achieved notable success in the areas of object classification and detection on images. We mention some state-of-the-art CNN based detection frameworks here. R-CNN [8] proposes candidate regions and uses a CNN to verify candidates as valid objects.
Fig. 1. Data visualization generated at different stages of the proposed approach. (a) The input point map, with the d channel visualized. (b) The output confidence map of the objectness branch at o^a_p. Red denotes higher confidence. (c) Bounding box candidates corresponding to all points predicted as positive, i.e. high-confidence points in (b). (d) Remaining bounding boxes after non-max suppression. Red points are the groundtruth points on vehicles for reference.

OverFeat [25], DenseBox [11] and YOLO [23] use end-to-end unified FCN frameworks which predict the objectness confidence and the bounding boxes simultaneously over the whole image. Some research has also focused on applying CNNs to 3D data. For example, on RGBD data, one common approach is to treat the depth maps as image channels and use a 2D CNN for classification or detection [9, 24, 26]. For 3D range scans, some works discretize the point cloud along 3D grids and train a 3D CNN structure for classification [33, 19]. These classifiers can be integrated with region proposal methods like sliding windows [27] for detection tasks. The 3D CNN preserves more 3D spatial information from the data than a 2D CNN, while the 2D CNN is computationally more efficient.

In this paper, our approach projects range scans as 2D maps similar to the depth map of RGBD data. The frameworks of Huang et al. [11] and Sermanet et al. [25] are transplanted to predict the objectness and the 3D object bounding boxes in a unified end-to-end manner.

III. APPROACH

A. Data Preparation

We consider the point cloud captured by the Velodyne 64E lidar. Like other range scan data, points from a Velodyne scan can be roughly projected and discretized into a 2D point map, using the following projection function:

  θ = atan2(y, x)
  φ = arcsin(z / √(x² + y² + z²))
  r = ⌊θ / Δθ⌋
  c = ⌊φ / Δφ⌋        (1)

where p = (x, y, z)^T denotes a 3D point and (r, c) denotes the 2D map position of its projection. θ and φ denote the azimuth and elevation angle when observing the point. Δθ and Δφ are the average horizontal and vertical angle resolution between consecutive beam emitters, respectively. The projected point map is analogous to cylindrical images. We fill the element at (r, c) in the 2D point map with 2-channel data (d, z), where d = √(x² + y²). Note that x and y are coupled as d for rotation invariance around z. An example of the d channel of the 2D point map is shown in Figure 1a. Rarely, some points might be projected into the same 2D position, in which case the point nearer to the observer is kept. Elements in 2D positions into which no 3D points are projected are filled with (d, z) = (0, 0).
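To make the projection of (1) and the (d, z) filling concrete, here is a small Python/NumPy sketch (our own illustration, not code from the paper). The map size, the vertical field of view, the derived angular resolutions and the nearest-point tie-breaking order are assumptions; rows are laid out over elevation and columns over azimuth purely for illustration.

```python
import numpy as np

def project_to_point_map(points, height=64, width=512, v_fov=(-24.9, 2.0)):
    """Build a 2-channel (d, z) point map from an (N, 3) lidar point cloud.

    A sketch of the projection in Eq. (1); map size, field of view and the
    resulting angular resolutions are illustrative assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    theta = np.arctan2(y, x)                                  # azimuth in [-pi, pi]
    phi = np.arcsin(z / np.linalg.norm(points, axis=1))       # elevation
    d = np.sqrt(x ** 2 + y ** 2)                              # horizontal range, d = sqrt(x^2 + y^2)

    phi_min, phi_max = np.radians(v_fov[0]), np.radians(v_fov[1])
    d_theta = 2 * np.pi / width                               # assumed average azimuth resolution
    d_phi = (phi_max - phi_min) / height                      # assumed average elevation resolution

    col = np.clip(np.floor((theta + np.pi) / d_theta), 0, width - 1).astype(int)
    row = np.clip(np.floor((phi - phi_min) / d_phi), 0, height - 1).astype(int)

    point_map = np.zeros((height, width, 2), dtype=np.float32)  # empty cells stay (d, z) = (0, 0)
    order = np.argsort(-d)               # write far points first so nearer points overwrite them
    point_map[row[order], col[order], 0] = d[order]
    point_map[row[order], col[order], 1] = z[order]
    return point_map
```

Cells that receive no point keep (d, z) = (0, 0), and when several points fall into the same cell the nearest one wins, matching the description above.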

Fig. 2. The proposed FCN structure to predict vehicle objectness and bounding box simultaneously. The output feature maps of conv1/deconv5a, conv1/deconv5b and conv2/deconv4 are first concatenated and then ported to their consecutive layers, respectively. deconv6a outputs the objectness map (o^a_p) and deconv6b outputs the bounding box map (o^b_p).

B. Network Architecture

The trunk part of the proposed CNN architecture is similar to Huang et al. [11] and Long et al. [18]. As illustrated in Figure 2, the CNN feature map is down-sampled consecutively in the first 3 convolutional layers and up-sampled consecutively in deconvolutional layers. Then the trunk splits at the 4th layer into an objectness classification branch and a 3D bounding box regression branch. We describe its implementation details as follows:

- The input point map, output objectness map and bounding box map are of the same width and height, to provide point-wise prediction. Each element of the objectness map predicts whether its corresponding point is on a vehicle. If the corresponding point is on a vehicle, its corresponding element in the bounding box map predicts the 3D bounding box of the belonging vehicle. Section III-C explains how the objectness and bounding box are encoded.
- In conv1, the point map is down-sampled by 4 horizontally and 2 vertically. This is because for a point map captured by the Velodyne 64E, we have approximately Δφ ≈ 2Δθ, i.e. points are denser in the horizontal direction. Similarly, the feature map is up-sampled by the same factor of (4, 2) in deconv6a and deconv6b, respectively. The remaining conv/deconv layers all have equal horizontal and vertical resolution and use square strides of (2, 2) when up-sampling or down-sampling.
- The output feature map pairs of conv3/deconv4, conv2/deconv5a, conv2/deconv5b are of the same sizes, respectively. We concatenate these output feature map pairs before passing them to the subsequent layers. This follows the idea of Long et al. [18]. Combining features from lower layers and higher layers improves the prediction of small objects and object edges.

C. Prediction Encoding

We now describe how the output feature maps are defined. The objectness map deconv6a consists of 2 channels corresponding to foreground, i.e. the point is on a vehicle, and background. The 2 channels are normalized by a softmax to denote the confidence.

The encoding of the bounding box map requires some extra conversion. Consider a lidar point p = (x, y, z) on a vehicle. Its observation angles are (θ, φ) by (1). We first denote a rotation matrix R as

  R = R_z(θ) R_y(φ)        (2)

where R_z(θ) and R_y(φ) denote rotations around the z and y axes respectively. If we denote the columns of R as (r_x, r_y, r_z), r_x has the same direction as p and r_y is parallel to the horizontal plane. Figure 3a illustrates an example of how R is formed. A bounding box corner c_p = (x_c, y_c, z_c) is thus transformed as:

  c'_p = R^T (c_p - p)        (3)

Our proposed approach uses c'_p to encode the bounding box corner of the vehicle to which p belongs. The full bounding box is thus encoded by concatenating the 8 corners in a 24d vector as

  b'_p = (c'^T_{p,1}, c'^T_{p,2}, ..., c'^T_{p,8})^T        (4)

Corresponding to this 24d vector, deconv6b outputs a 24-channel feature map accordingly.

The transform (3) is designed for the following two reasons:

- Translation part: Compared to c_p, which distributes over the whole lidar perception range, e.g. [-100m, 100m] × [-100m, 100m] for the Velodyne, the corner offset c_p - p distributes in a much smaller range, e.g. within the size of a vehicle. Experiments show that it is easier for the CNN to learn the latter case.
- Rotation part: R^T ensures the rotation invariance of the corner coordinate encoding. When a vehicle is moving around a circle and one observes it from the center, the appearance of the vehicle does not change in the observed range scan, but the bounding box coordinates vary in the range scan coordinate system. Since we would like to ensure that the same appearance results in the same bounding box prediction encoding, the bounding box coordinates are rotated by R^T to be invariant. Figure 3b illustrates a simple case. Vehicles A and B have the same appearance for an observer at the center, i.e. the right side is observed. Vehicle C has a different appearance, i.e. the rear-right part is observed. With the conversion of (3), the bounding box encodings b'_p of A and B are the same but that of C is different.

Fig. 3. (a) Illustration of (3). For each vehicle point p, we define a specific coordinate system centered at p. The x axis (r_x) of the coordinate system is along the ray from the Velodyne origin to p (dashed line). (b) An example illustration of the rotation invariance when observing a vehicle. Vehicles A and B have the same appearance. See (3) in Section III-C for details.
D. Training Phase

1) Data Augmentation: Similar to the training phase of a CNN for images, data augmentation significantly enhances the network performance. For the case of images, training data are usually augmented by randomly zooming or rotating the original images to synthesize more training samples. For the case of range scans, simply applying these operations results in variable Δθ and Δφ in (1), which violates the geometry property of the lidar device. To synthesize geometrically correct 3D range scans, we randomly generate a 3D transform near identity. Before projecting the point cloud by (1), the random transform is applied to the point cloud. The translation component of the transform results in a zooming effect in the synthesized range scan. The rotation component results in a rotation effect in the range scan.

2) Multi-Task Training: As illustrated in Section III-B, the proposed network consists of one objectness classification branch and one bounding box regression branch. We respectively denote the losses of the two branches in the training phase. As notation, denote o^a_p and o^b_p as the feature map outputs of deconv6a and deconv6b corresponding to point p, respectively. Also denote P as the point cloud and V ⊆ P as all points on all vehicles.

The loss of the objectness classification branch corresponding to a point p is denoted as a softmax loss

  L_obj(p) = -log(p_p)
  p_p = exp(o^a_{p,l_p}) / Σ_{l ∈ {0,1}} exp(o^a_{p,l})        (5)

where l_p ∈ {0, 1} denotes the groundtruth objectness label of p, i.e. 0 as background and 1 as a point on vehicles. o^a_{p,l} denotes the deconv6a feature map output of channel l for point p.

The loss of the bounding box regression branch corresponding to a point p is denoted as an L2-norm loss

  L_box(p) = ||o^b_p - b'_p||²        (6)

where b'_p is the 24d vector denoted in (4). Note that L_box is only computed for points on vehicles. For non-vehicle points, the bounding box loss is omitted.

3) Training strategies: Compared to positive points on vehicles, negative (background) points account for the majority of the point cloud. Thus if we simply pass all objectness losses in (5) in the backward procedure, the network prediction will significantly bias towards negative samples. To avoid this effect, the losses of positive and negative points need to be balanced. A similar balance strategy can be found in Huang et al. [11], where redundant negative losses are randomly discarded. In our training procedure, the balance is done by keeping all negative losses but re-weighting them using

  w1(p) = k|V| / (|P| - |V|)    if p ∈ P \ V
  w1(p) = 1                     if p ∈ V        (7)

which means that the re-weighted negative losses are on average equivalent to the losses of k|V| negative samples. In our case we choose k = 4. Compared to randomly discarding samples, the proposed balance strategy keeps more information about negative samples.

Additionally, near vehicles usually account for a larger portion of points than far vehicles and occluded vehicles. Thus vehicle samples at different distances also need to be balanced. This helps avoid the prediction being biased towards near vehicles and neglecting far vehicles or occluded vehicles. Denote n(p) as the number of points belonging to the same vehicle as p. Since the 3D range scan points are almost uniquely projected onto the point map, n(p) is also the area of the vehicle of p on the point map. Denote n̄ as the average number of points of vehicles in the whole dataset. We re-weight L_obj(p) and L_box(p) by w2 as

  w2(p) = n̄ / n(p)    if p ∈ V
  w2(p) = 1            if p ∈ P \ V        (8)

Using the losses and weights designed above, we accumulate losses over deconv6a and deconv6b for the final training loss

  L = Σ_{p ∈ P} w1(p) w2(p) L_obj(p) + w_box Σ_{p ∈ V} w2(p) L_box(p)        (9)

with w_box used to balance the objectness loss and the bounding box loss.
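The following Python sketch (ours, with assumed tensor shapes and argument names) shows how the re-weighting of (7)-(9) can be assembled from an objectness label map and per-point vehicle sizes. It illustrates the weighting scheme only and is not the authors' Caffe implementation.

```python
import numpy as np

def loss_weights(label_map, vehicle_area_map, k=4, mean_vehicle_area=None):
    """Compute per-point weights w1 (Eq. 7) and w2 (Eq. 8).

    label_map        : (H, W) int array, 1 for vehicle points, 0 for background.
    vehicle_area_map : (H, W) int array, n(p) = point count of the vehicle containing p
                       (ignored where label_map == 0). Shapes/names are assumptions.
    """
    is_vehicle = label_map == 1
    num_pos = is_vehicle.sum()
    num_neg = (~is_vehicle).sum()

    # Eq. (7): keep all negatives but scale them to be equivalent to k*|V| samples.
    w1 = np.ones_like(label_map, dtype=np.float32)
    if num_neg > 0:
        w1[~is_vehicle] = k * num_pos / float(num_neg)

    # Eq. (8): balance vehicles of different apparent sizes (distance / occlusion).
    if mean_vehicle_area is None:
        mean_vehicle_area = vehicle_area_map[is_vehicle].mean() if num_pos > 0 else 1.0
    w2 = np.ones_like(label_map, dtype=np.float32)
    w2[is_vehicle] = mean_vehicle_area / np.maximum(vehicle_area_map[is_vehicle], 1)
    return w1, w2

def total_loss(l_obj, l_box, label_map, w1, w2, w_box=1.0):
    """Accumulate the final training loss of Eq. (9) from per-point loss maps."""
    obj_term = (w1 * w2 * l_obj).sum()
    box_term = w_box * (w2 * l_box * (label_map == 1)).sum()   # box loss only on vehicle points
    return obj_term + box_term
```

In practice n̄ would be computed over the whole training set; the per-frame fallback above is only a convenience for the sketch.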
E. Testing Phase

During the test phase, a range scan is fed to the network to produce the objectness map and the bounding box map. For each point which is predicted as positive in the objectness map, the corresponding output o^b_p of the bounding box map is split into c'_{p,i}, i = 1, ..., 8. Each c'_{p,i} is then converted to a box corner c_{p,i} by the inverse transform of (3). We denote each bounding box candidate as a 24d vector b_p = (c^T_{p,1}, c^T_{p,2}, ..., c^T_{p,8})^T. The set of all bounding box candidates is denoted as B = {b_p | o^a_{p,1} > o^a_{p,0}}. Figure 1c shows the bounding box candidates of all the points predicted as positive.

We next cluster the bounding boxes and prune outliers by a non-max suppression strategy. Each bounding box b_p is scored by counting its neighbor bounding boxes in B within a distance δ, denoted as #{x ∈ B : ||x - b_p|| < δ}. Bounding boxes are picked from high score to low score. After one box is picked, we find all points inside the bounding box and remove their corresponding bounding box candidates from B. Bounding box candidates whose score is lower than 5 are discarded as outliers. Figure 1d shows the picked bounding boxes for Figure 1a.
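A compact Python sketch of this score-and-suppress step follows (our reconstruction, not the paper's code). The candidates are assumed to have already been decoded into world-frame 24-d vectors, e.g. with the hypothetical decode_box helper above; the distance threshold and the axis-aligned point-in-box test are assumptions, while the minimum score of 5 follows the text.

```python
import numpy as np

def nms_by_neighbor_count(candidates, points, min_score=5, dist_thresh=1.0):
    """Greedy suppression of 24-d box candidates, as described in Section III-E.

    candidates : (M, 24) decoded box candidates b_p (one per positive point).
    points     : (M, 3) lidar points that produced the candidates.
    Returns a list of selected (8, 3) corner arrays.
    """
    cand = np.asarray(candidates, dtype=float)
    pts = np.asarray(points, dtype=float)
    # Score every candidate by counting neighbors within dist_thresh in the 24-d space.
    dists = np.linalg.norm(cand[:, None, :] - cand[None, :, :], axis=2)
    scores = (dists < dist_thresh).sum(axis=1)

    alive = np.ones(len(cand), dtype=bool)
    picked = []
    for i in np.argsort(-scores):          # pick from high score to low score
        if not alive[i]:
            continue
        if scores[i] < min_score:
            break                          # remaining candidates are treated as outliers
        box = cand[i].reshape(8, 3)
        picked.append(box)
        # Remove candidates whose source point lies inside the picked box
        # (axis-aligned containment used for simplicity; an assumption).
        lo, hi = box.min(axis=0), box.max(axis=0)
        inside = np.all((pts >= lo) & (pts <= hi), axis=1)
        alive[inside] = False
        alive[i] = False
    return picked
```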

Fig. 4. More examples of the detection results. See Section IV-A for details. (a) Detection result on a congested traffic scene. (b) Detection result on far vehicles.

TABLE I
PERFORMANCE IN AVERAGE PRECISION AND AVERAGE ORIENTATION SIMILARITY FOR THE OFFLINE EVALUATION

                      Easy     Moderate   Hard
Image Space (AP)      74.1%    71.0%      70.0%
Image Space (AOS)     73.9%    70.9%      69.9%
World Space (AP)      77.3%    72.4%      69.4%
World Space (AOS)     77.2%    72.3%      69.4%

Fig. 5. Precision-recall curve in the offline evaluation, measured by the world space criterion. See Section IV-A.

IV. EXPERIMENTS

Our proposed approach is evaluated on the vehicle detection task of the KITTI object detection benchmark [7]. This benchmark originally aims to evaluate object detection of vehicles, pedestrians and cyclists from images. It contains not only image data but also the corresponding Velodyne 64E range scan data. The groundtruth labels include both 2D object bounding boxes on the images and the corresponding 3D bounding boxes, which provides sufficient information to train and test detection algorithms on range scans. The KITTI training dataset contains 7500+ frames of data. We randomly select 6000 frames in our experiments to train the network and use the remaining 1500 frames for detailed offline validation and analysis. The KITTI online evaluation is also used to compare the proposed approach with previous related works.

For simplicity of the experiments, we focus only on the Car category of the data. In the training phase, we first label all 3D points inside any of the groundtruth car 3D bounding boxes as foreground vehicle points. Points from objects of categories like Truck or Van are labeled to be ignored from P since they might confuse the training. The rest of the points are labeled as background. This forms the label l_p in (5). For each foreground point, its belonging bounding box is encoded by (4) to form the label b'_p in (6).

The experiments are based on the Caffe [12] framework. In the KITTI object detection benchmark, images are captured from the front camera and range scans perceive a 360° FoV of the environment. The benchmark groundtruth is only provided for vehicles inside the image. Thus in our experiments we only use the front part of a range scan which overlaps with the FoV of the front camera.

The KITTI benchmark divides object samples into three difficulty levels according to the size and the occlusion of the 2D bounding boxes in the image space. A detection is accepted if its image space 2D bounding box has at least 70% overlap with the groundtruth. Since the proposed approach naturally predicts the 3D bounding boxes of the vehicles, we evaluate the approach in both the image space and the world space in the offline validation. Compared to the image space, the metric in the world space is more crucial in the scenario of autonomous driving, because, for example, many navigation and planning algorithms take the bounding box in world space as input for obstacle avoidance. Section IV-A describes the evaluation in both image space and world space in our offline validation. In Section IV-B, we compare the proposed approach with several previous range scan detection algorithms via the KITTI online evaluation system.
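As an illustration of the label-preparation step described above, the sketch below (our own, with a hypothetical oriented-box representation of the groundtruth rather than the actual KITTI file format) marks points inside car boxes as foreground, points inside Truck/Van boxes as ignored, and everything else as background.

```python
import numpy as np

FOREGROUND, BACKGROUND, IGNORED = 1, 0, -1

def points_in_box(points, center, size, yaw):
    """Boolean mask of points inside an upright oriented 3D box (center, (l, w, h), yaw)."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    local = points - center
    x = c * local[:, 0] - s * local[:, 1]     # rotate into the box frame
    y = s * local[:, 0] + c * local[:, 1]
    z = local[:, 2]
    l, w, h = size
    return (np.abs(x) <= l / 2) & (np.abs(y) <= w / 2) & (np.abs(z) <= h / 2)

def label_points(points, boxes):
    """Assign per-point labels l_p from a list of groundtruth boxes.

    boxes: iterable of dicts {"center", "size", "yaw", "category"}; this dict layout
    is an assumption for illustration only.
    """
    labels = np.full(len(points), BACKGROUND, dtype=np.int8)
    for box in boxes:
        mask = points_in_box(points, box["center"], box["size"], box["yaw"])
        if box["category"] == "Car":
            labels[mask] = FOREGROUND
        elif box["category"] in ("Truck", "Van"):
            labels[mask & (labels != FOREGROUND)] = IGNORED   # do not overwrite car points
        # other categories fall through and stay background
    return labels
```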
A. Performance Analysis on Offline Evaluation

We analyze the detection performance on our custom offline evaluation data selected from the KITTI training dataset, whose groundtruth labels are publicly accessible. To obtain an equivalent 2D bounding box for the original KITTI criterion in the image space, we project the 3D bounding box into the image space and take the minimum 2D bounding rectangle as the 2D bounding box. For the world space evaluation, we project the detected and the groundtruth 3D bounding boxes onto the ground plane and compute their overlap. The world space criterion also requires at least 70% overlap to accept a detection. The performance of the approach is measured by the Average Precision (AP) and the Average Orientation Similarity (AOS) [7]. The AOS is designed to jointly measure the precision of detection and orientation estimation.

Table I lists the performance evaluation. Note that the world space criterion results in slightly better performance than the image space criterion. This is because the user-labeled 2D bounding box tends to be tighter than the 2D projection of the 3D bounding box in the image space, especially for vehicles observed from their diagonal directions. This size difference diminishes the overlap between the detection and the groundtruth in the image space.

Like most detection approaches, there is a noticeable drop in performance from the easy evaluation to the moderate and hard evaluations. The minimal pixel height for easy samples is 40px, which approximately corresponds to vehicles within 28m. The minimal height for moderate and hard samples is 25px, corresponding to a minimal distance of 47m. As shown in Figure 4 and Figure 1, some vehicles farther than 40m are scanned by very few points and are even difficult for a human to recognize. This results in the performance drop for the moderate and hard evaluations.

Figure 5 shows the precision-recall curve of the world space criterion as an example. Precision-recall curves of the other criteria are similar and omitted here. Figure 4a shows the detection result on a congested traffic scene with more than 10 vehicles in front of the lidar. Figure 4b shows the detection results on cars farther than 50m. Note that our algorithm predicts the complete bounding box even for vehicles which are only partly visible. This significantly differs from previous proposal-based methods and can contribute to more stable object tracking and path planning results. For the easy evaluation, the algorithm detects almost all vehicles, even occluded ones. This is also illustrated in Figure 5, where the maximum recall rate is higher than 95%. The approach produces false-positive detections in some occluded scenes, as illustrated in Figure 4a for example.

B. Related Work Comparison on the Online Evaluation

There have been several previous works on range scan based detection evaluated on the KITTI platform. Readers might find that the performance of these works ranks much lower than that of the state-of-the-art vision-based approaches. We explain this by two reasons. First, the image data have much higher resolution, which significantly enhances the detection performance for far and occluded objects. Second, the image space based criterion does not reflect the advantage of range scan methods in localizing objects in the full 3D world space. A related explanation can also be found in Wang and Posner [31]. Thus in this experiment, we only compare the proposed approach with the range scan methods of Wang and Posner [31], Behley et al. [2] and Plotkin [22]. These three methods all use traditional features for classification. Wang and Posner [31] perform a sliding window based strategy to generate candidates, while Behley et al. [2] and Plotkin [22] segment the point cloud to generate detection candidates.

TABLE II
PERFORMANCE COMPARISON IN AVERAGE PRECISION AND AVERAGE ORIENTATION SIMILARITY FOR THE ONLINE EVALUATION

                               Easy     Moderate   Hard
Image Space (AP)    Proposed   60.3%    47.5%      42.7%
                    Vote3D     56.8%    48.0%      42.6%
                    CSoR       34.8%    26.1%      22.7%
                    mBoW       36.0%    23.8%      18.4%
Image Space (AOS)   Proposed   59.1%    45.9%      41.1%
                    CSoR       34.0%    25.4%      22.0%

Table II shows the performance of the methods in AP and AOS reported on the KITTI online evaluation. The detection AP of our approach outperforms the other methods in the easy task, which well illustrates the advantage of a CNN in representing rich features on near vehicles. In the moderate and hard detection tasks, our approach performs with similar AP to Wang and Posner [31], because vehicles in these tasks consist of too few points for the CNN to embed complicated features. For the joint detection and orientation estimation evaluation, only our approach and CSoR support orientation estimation, and our approach significantly wins the comparison in AOS.

V. CONCLUSIONS

Although attempts have been made in a few previous studies to apply deep learning techniques to sensor data other than images, there is still a gap between these state-of-the-art computer vision techniques and robotic perception research. To the best of our knowledge, the proposed approach is the first to introduce FCN detection techniques into perception on range scan data, which results in a neat and end-to-end detection framework. In this paper we only evaluate the approach on 3D range scans from the Velodyne 64E, but the approach can also be applied to 3D range scans from similar devices. By accumulating more training data and designing a deeper network, the detection performance can be further improved.

VI. ACKNOWLEDGEMENT

The authors would like to acknowledge the help from Ji Liang, Lichao Huang, Degang Yang, Haoqi Fan and Yifeng Pan in the research of deep learning. Thanks also go to Ji Tao, Kai Ni and Yuanqing Lin for their support.
REFERENCES

[1] Jens Behley, Volker Steinhage, and Armin B Cremers. Performance of Histogram Descriptors for the Classification of 3D Laser Range Data in Urban Environments. 2012 IEEE International Conference on Robotics and Automation, pages 4391-4398, 2012.
[2] Jens Behley, Volker Steinhage, and Armin B. Cremers. Laser-based segment classification using a mixture of bag-of-words. IEEE International Conference on Intelligent Robots and Systems, (1):4195-4200, 2013.
[3] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. Advances in Neural Information Processing Systems, pages 424-432, 2015.
[4] Mark De Deuge, F Robotics, and Alastair Quadros. Unsupervised Feature Learning for Classification of Outdoor 3D Scans. Araa.Asn.Au, pages 2-4, 2013.
[5] B. Douillard, J. Underwood, N. Kuntz, V. Vlaskine, A. Quadros, P. Morton, and A. Frenkel. On the segmentation of 3D lidar point clouds. Proceedings - IEEE International Conference on Robotics and Automation, pages 2798-2805, 2011.
[6] O.D. Faugeras and M. Hebert. The Representation, Recognition, and Locating of 3-D Objects. The International Journal of Robotics Research, 5(3):27-52, 1986.
[7] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3354-3361, 2012.
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, U C Berkeley, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014, pages 2-9, 2014.
[9] S Gupta, R Girshick, P Arbelaez, and J Malik. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. arXiv preprint arXiv:1407.5736, pages 1-16, 2014.
[10] Michael Himmelsbach, Felix V Hundelshausen, and Hans-Joachim Wünsche. Fast segmentation of 3D point clouds for ground vehicles. Intelligent Vehicles Symposium (IV), 2010 IEEE, pages 560-565, 2010.
[11] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying Landmark Localization with End to End Object Detection. pages 1-13, 2015.
[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. ACM Multimedia, 2:4, 2014.
[13] Andrew E Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(5):433-449, 1999.
[14] Klaas Klasing, Dirk Wollherr, and Martin Buss. A clustering method for efficient segmentation of 3D laser data. Conference on Robotics and Automation, ICRA 2008. IEEE International, pages 4043-4048, 2008.
[15] Kevin Lai, Liefeng Bo, and Dieter Fox. Unsupervised Feature Learning for 3D Scene Labeling. IEEE International Conference on Robotics and Automation (ICRA 2014), pages 3050-3057, 2014.
[16] J. Levinson and S. Thrun. Robust vehicle localization in urban environments using probabilistic maps. Robotics and Automation (ICRA), 2010 IEEE International Conference on, 2010.
[17] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. Proceedings of the IEEE International Conference on Computer Vision, pages 1417-1424, 2013.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
[19] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. pages 922-928, 2015.
[20] Frank Moosmann, Oliver Pink, and Christoph Stiller. Segmentation of 3D lidar data in non-flat urban environments using a local convexity criterion. IEEE Intelligent Vehicles Symposium, Proceedings, pages 215-220, 2009.
[21] Jeremie Papon, Alexey Abramov, Markus Schoeler, and Florentin Wörgötter. Voxel cloud connectivity segmentation - Supervoxels for point clouds. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2027-2034, 2013.
[22] Leonard Plotkin. PyDriver: Entwicklung eines Frameworks für räumliche Detektion und Klassifikation von Objekten in Fahrzeugumgebung. Bachelor's thesis (Studienarbeit), Karlsruhe Institute of Technology, Germany, March 2015.
[23] Joseph Redmon, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv, 2015.
[24] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features. IEEE International Conference on Robotics and Automation (ICRA), (May), 2015.
[25] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. arXiv preprint arXiv:1312.6229, pages 1-15, 2013.
[26] Richard Socher, Brody Huval, Bharath Bath, Christopher D Manning, and Andrew Y Ng. Convolutional-recursive deep learning for 3D object classification. Advances in Neural Information Processing Systems, pages 665-673, 2012.
[27] Shuran Song and Jianxiong Xiao. Sliding shapes for 3D object detection in depth images. pages 634-651, 2014.
[28] Alex Teichman, Jesse Levinson, and Sebastian Thrun. Towards 3D object recognition via classification of arbitrary object tracks. Proceedings - IEEE International Conference on Robotics and Automation, pages 4034-4041, 2011.
[29] Rudolph Triebel, Jiwon Shin, and Roland Siegwart. Segmentation and Unsupervised Part-based Discovery of Repetitive Objects. Robotics: Science and Systems, 2006.
[30] Rudolph Triebel, Richard Schmidt, Óscar Martínez Mozos, and Wolfram Burgard. Instance-based AMN classification for improved object recognition in 2D and 3D laser range data. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2225-2230, 2007.
[31] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. Proceedings of Robotics: Science and Systems, Rome, Italy, 2015.
[32] Dominic Zeng Wang, Ingmar Posner, and Paul Newman. What could move? Finding cars, pedestrians and bicyclists in 3D laser data. Proceedings - IEEE International Conference on Robotics and Automation, pages 4038-4044, 2012.
[33] Zhirong Wu and Shuran Song. 3D ShapeNets: A Deep Representation for Volumetric Shapes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pages 1-9, 2015.
