
SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for

Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud


Bichen Wu, Alvin Wan, Xiangyu Yue and Kurt Keutzer
UC Berkeley
{bichen, alvinwan, xyyue, keutzer}@berkeley.edu

arXiv:1710.07368v1 [cs.CV] 19 Oct 2017

Abstract— In this paper, we address semantic segmentation of road-objects from 3D LiDAR point clouds. In particular, we wish to detect and categorize instances of interest, such as cars, pedestrians and cyclists. We formulate this problem as a point-wise classification problem, and propose an end-to-end pipeline called SqueezeSeg based on convolutional neural networks (CNN): the CNN takes a transformed LiDAR point cloud as input and directly outputs a point-wise label map, which is then refined by a conditional random field (CRF) implemented as a recurrent layer. Instance-level labels are then obtained by conventional clustering algorithms. Our CNN model is trained on LiDAR point clouds from the KITTI [1] dataset, and our point-wise segmentation labels are derived from 3D bounding boxes from KITTI. To obtain extra training data, we built a LiDAR simulator into Grand Theft Auto V (GTA-V), a popular video game, to synthesize large amounts of realistic training data. Our experiments show that SqueezeSeg achieves high accuracy with astonishingly fast and stable runtime (8.7 ± 0.5 ms per frame), highly desirable for autonomous driving applications. Furthermore, additional training on synthesized data boosts validation accuracy on real-world data. Our source code and synthesized data will be open-sourced.

Fig. 1: An example of SqueezeSeg segmentation results. Our predicted result is on the right ("Predicted segmentation") and the ground truth is on the left ("Ground truth segmentation"). Cars are annotated in red, pedestrians in green and cyclists in blue.

I. INTRODUCTION

Autonomous driving systems rely on accurate, real-time and robust perception of the environment. An autonomous vehicle needs to accurately categorize and locate "road-objects", which we define to be driving-related objects such as cars, pedestrians, cyclists, and other obstacles. Different autonomous driving solutions may have different combinations of sensors, but the 3D LiDAR scanner is one of the most prevalent components. LiDAR scanners directly produce distance measurements of the environment, which are then used by vehicle controllers and planners. Moreover, LiDAR scanners are robust under almost all lighting conditions, whether it be day or night, with or without glare and shadows. As a result, LiDAR-based perception tasks have attracted significant research attention.

In this work, we focus on road-object segmentation using (Velodyne-style) 3D LiDAR point clouds. Given point cloud output from a LiDAR scanner, the task aims to isolate objects of interest and predict their categories, as shown in Fig. 1. Previous approaches comprise or use parts of the following stages: remove the ground, cluster the remaining points into instances, extract (hand-crafted) features from each cluster, and classify each cluster based on its features. This paradigm, despite its popularity [2], [3], [4], [5], has several disadvantages: a) Ground segmentation in the above pipeline usually relies on hand-crafted features or decision rules – some approaches rely on a scalar threshold [6] and others require more complicated features such as surface normals [7] or invariant descriptors [4], all of which may fail to generalize and the latter of which require significant preprocessing. b) Multi-stage pipelines see aggregate effects of compounded errors, and classification or clustering algorithms in the pipeline above are unable to leverage context, most importantly the immediate surroundings of an object. c) Many approaches for ground removal rely on iterative algorithms such as RANSAC (random sample consensus) [5], GP-INSAC (Gaussian Process Incremental Sample Consensus) [2], or agglomerative clustering [2]. The runtime and accuracy of these algorithmic components depend on the quality of random initializations and, therefore, can be unstable. This instability is not acceptable for many embedded applications such as autonomous driving. We take an alternative approach: use deep learning to extract features, develop a single-stage pipeline and thus sidestep iterative algorithms.
In this paper, we propose an end-to-end pipeline based on convolutional neural networks (CNN) and conditional random fields (CRF). CNNs and CRFs have been successfully applied to segmentation tasks on 2D images [8], [9], [10], [11]. To apply CNNs to 3D LiDAR point clouds, we designed a CNN that accepts transformed LiDAR point clouds and outputs a point-wise map of labels, which is further refined by a CRF model. Instance-level labels are then obtained by applying conventional clustering algorithms (such as DBSCAN) on points within a category. To feed 3D point clouds to a 2D CNN, we adopt a spherical projection to transform sparse, irregularly distributed 3D point clouds to dense, 2D grid representations. The proposed CNN model draws inspiration from SqueezeNet [12] and is carefully designed to reduce parameter size and computational complexity, with an aim to reduce memory requirements and achieve real-time inference speed for our target embedded applications. The CRF model is reformulated as a recurrent neural network (RNN) module as in [11] and can be trained end-to-end together with the CNN model. Our model is trained on LiDAR point clouds from the KITTI dataset [1] and point-wise segmentation labels are converted from 3D bounding boxes in KITTI. To obtain even more training data, we leveraged Grand Theft Auto V (GTA-V) as a simulator to retrieve LiDAR point clouds and point-wise labels.

Experiments show that SqueezeSeg achieves high accuracy and is extremely fast and stable, making it suitable for autonomous driving applications. We additionally find that supplementing our dataset with artificial, noise-injected simulation data further boosts validation accuracy on real-world data.
II. RELATED WORK

A. Semantic segmentation for 3D LiDAR point clouds

Previous work saw a wide range of granularity in LiDAR segmentation, handling anything from specific components to the whole pipeline. [7] proposed mesh-based ground and object segmentation relying on local surface convexity conditions. [2] summarized several approaches based on iterative algorithms such as RANSAC (random sample consensus) and GP-INSAC (Gaussian process incremental sample consensus) for ground removal. Recent work also focused on algorithmic efficiency. [5] proposed efficient algorithms for ground segmentation and clustering, while [13] bypassed ground segmentation to directly extract foreground objects. [4] expanded its focus to the whole pipeline, including segmentation, clustering and classification. It proposed to directly classify point patches into background and foreground objects of different categories, then use EMST-RANSAC [5] to further cluster instances.

B. CNN for 3D point clouds

CNN approaches consider LiDAR point clouds in either two or three dimensions. Work with two-dimensional data considers raw images with projections of LiDAR point clouds top-down [14] or from a number of other views [15]. Other work considers three-dimensional data itself, discretizing the space into voxels and engineering features such as disparity, mean, and saturation [16]. Regardless of data preparation, deep learning methods consider end-to-end models that leverage 2D convolutional [17] or 3D convolutional [18] neural networks.

C. Semantic Segmentation for Images

Both CNNs and CRFs have been applied to semantic segmentation tasks for images. [8] proposed transforming CNN models, trained for classification, into fully convolutional networks to predict pixel-wise labels. [9] proposed a CRF formulation for image segmentation and solved it approximately with the mean-field iteration algorithm. CNNs and CRFs are combined in [10], where the CNN is used to produce an initial probability map and the CRF is used to refine and restore details. In [11], mean-field iteration is reformulated as a recurrent neural network (RNN) module.

D. Data Collection through Simulation

Obtaining annotations, especially point-wise or pixel-wise annotations, for computer vision tasks is usually very difficult. As a consequence, synthetic datasets have seen growing interest. In the autonomous driving community, the video game Grand Theft Auto has been used to retrieve data for object detection and segmentation [19], [20].

III. METHOD DESCRIPTION

A. Point Cloud Transformation

Conventional CNN models operate on images, which can be represented by 3-dimensional tensors of size H × W × 3. The first two dimensions encode spatial position, where H and W are the image height and width, respectively. The last dimension encodes features, most commonly RGB values. However, a 3D LiDAR point cloud is usually represented as a set of cartesian coordinates, (x, y, z). Extra features can also be included, such as intensity or RGB values. Unlike the distribution of image pixels, the distribution of LiDAR point clouds is usually sparse and irregular. Therefore, naively discretizing a 3D space into voxels results in excessively many empty voxels. Processing such sparse data is inefficient, wasting computation.

To obtain a more compact representation, we project the LiDAR point cloud onto a sphere for a dense, grid-based representation as

\theta = \arcsin \frac{z}{\sqrt{x^2 + y^2 + z^2}}, \quad \tilde{\theta} = \lfloor \theta / \Delta\theta \rfloor,
\phi = \arcsin \frac{y}{\sqrt{x^2 + y^2}}, \quad \tilde{\phi} = \lfloor \phi / \Delta\phi \rfloor.    (1)

φ and θ are azimuth and zenith angles, as shown in Fig. 2 (A). Δθ and Δφ are the resolutions for discretization and (θ̃, φ̃) denotes the position of a point on a 2D spherical grid. Applying equation (1) to each point in the cloud, we can obtain a 3D tensor of size H × W × C. In this paper, we consider data collected from a Velodyne HDL-64E LiDAR with 64 vertical channels, so H = 64. Limited by data annotations from the KITTI dataset, we only consider the front view area of 90° and divide it into 512 grids, so W = 512. C is the number of features for each point. In our experiments, we used 5 features for each point: 3 cartesian coordinates (x, y, z), an intensity measurement, and range r = \sqrt{x^2 + y^2 + z^2}. An example of a projected point cloud can be found in Fig. 2 (B). As can be seen, such a representation is dense and regularly distributed, resembling an ordinary image, Fig. 2 (C). This featurization allows us to avoid hand-crafted features, bettering the odds that our representation generalizes.

Fig. 2: LiDAR Projections. Panels: (A) LiDAR Point Cloud; (B) Projected Point Cloud (obtained via the spherical projection); (C) Camera View. Note that each channel reflects structural information in the camera-view image.
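For concreteness, the projection in Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the assumptions of Sec. III-A (64 × 512 grid, 90° front view, 5 channels per cell), not the authors' released code; when several points fall into the same cell, the last one simply wins here.

import numpy as np

def spherical_project(points, intensity, H=64, W=512, fov_h=np.pi / 2):
    # points: (N, 3) cartesian coordinates (x, y, z); intensity: (N,)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                 # range
    theta = np.arcsin(z / r)                        # zenith angle
    phi = np.arcsin(y / np.sqrt(x**2 + y**2))       # azimuth angle (front view)

    # Discretize: map the vertical span to H rows and the 90-degree
    # horizontal front view to W columns (resolutions play the role of Δθ, Δφ).
    theta_res = (theta.max() - theta.min()) / H
    phi_res = fov_h / W
    row = np.clip(((theta.max() - theta) / theta_res).astype(int), 0, H - 1)
    col = np.clip(((phi + fov_h / 2) / phi_res).astype(int), 0, W - 1)

    # 5 channels per grid cell: x, y, z, intensity, range.
    grid = np.zeros((H, W, 5), dtype=np.float32)
    grid[row, col] = np.stack([x, y, z, intensity, r], axis=1)
    return grid

The resulting 64 × 512 × 5 tensor can then be fed to the 2D CNN described next.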

B. Network structure

Our convolutional neural network structure is shown in Fig. 3. SqueezeSeg is derived from SqueezeNet [12], a light-weight CNN that achieved AlexNet [21] level accuracy with 50X fewer parameters.

The input to SqueezeSeg is a 64 × 512 × 5 tensor as described in the previous section. We ported layers (conv1a to fire9) from SqueezeNet for feature extraction. SqueezeNet used max-pooling to down-sample intermediate feature maps in both width and height dimensions, but since our input tensor's height is much smaller than its width, we only down-sample the width. The output of fire9 is a down-sampled feature map that encodes the semantics of the point cloud.

To obtain full-resolution label predictions for each point, we used deconvolution modules (more precisely, "transposed convolutions") to up-sample feature maps in the width dimension. We used skip connections to add up-sampled feature maps to lower-level feature maps of the same size, as shown in Fig. 3. The output probability map is generated by a convolutional layer (conv14) with softmax activation. The probability map is further refined by a recurrent CRF layer, which will be discussed in the next section.

In order to reduce the number of model parameters and computation, we replaced convolution and deconvolution layers with fireModules [12] and fireDeconvs. The architecture of both modules is shown in Fig. 4. In a fireModule, the input tensor of size H × W × C is first fed into a 1x1 convolution to reduce the channel size to C/4. Next, a 3x3 convolution is used to fuse spatial information. Together with a parallel 1x1 convolution, they recover the channel size of C. The input 1x1 convolution is called the squeeze layer, and the parallel 1x1 and 3x3 convolutions together are called the expand layer. Given matching input and output size, a 3x3 convolutional layer requires 9C² parameters and 9HWC² computations, while the fireModule only requires (3/2)C² parameters and (3/2)HWC² computations. In a fireDeconv module, the deconvolution layer used to up-sample the feature map is placed between the squeeze and expand layers. To up-sample the width dimension by 2, a regular 1x4 deconvolution layer must contain 4C² parameters and 4HWC² computations. With the fireDeconv, however, we only need (7/4)C² parameters and (7/4)HWC² computations.
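To make the module layout concrete, below is a hypothetical PyTorch sketch of a fireModule and a fireDeconv with the C/4 squeeze and C/2 expand channels described above. The paper's own implementation is in TensorFlow; the layer names and the placement of ReLUs here are illustrative assumptions.

import torch
import torch.nn as nn

class FireModule(nn.Module):
    # Squeeze to C/4, then expand back to C with parallel 1x1 and 3x3 convolutions.
    def __init__(self, c):
        super().__init__()
        self.squeeze = nn.Conv2d(c, c // 4, kernel_size=1)
        self.expand1x1 = nn.Conv2d(c // 4, c // 2, kernel_size=1)
        self.expand3x3 = nn.Conv2d(c // 4, c // 2, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return self.relu(torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1))

class FireDeconv(nn.Module):
    # Same squeeze/expand structure, with a 1x4 transposed convolution in
    # between that doubles the width (stride 2 in the width dimension only).
    def __init__(self, c):
        super().__init__()
        self.squeeze = nn.Conv2d(c, c // 4, kernel_size=1)
        self.deconv = nn.ConvTranspose2d(c // 4, c // 4, kernel_size=(1, 4),
                                         stride=(1, 2), padding=(0, 1))
        self.expand1x1 = nn.Conv2d(c // 4, c // 2, kernel_size=1)
        self.expand3x3 = nn.Conv2d(c // 4, c // 2, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.deconv(self.relu(self.squeeze(x))))
        return self.relu(torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1))

Ignoring biases, the fireModule above has C·(C/4) + (C/4)(C/2) + 9(C/4)(C/2) = (3/2)C² weights, and the fireDeconv adds a 1x4 transposed convolution with 4(C/4)² = C²/4 weights, for (7/4)C² in total, matching the counts quoted above.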
Fig. 3: Network structure of SqueezeSeg.

Fig. 4: Structure of a FireModule (left) and a fireDeconv (right).

Fig. 5: Conditional Random Field (CRF) as an RNN layer.


C. Conditional Random Field

With image segmentation, label maps predicted by CNN models tend to have blurry boundaries. This is due to the loss of low-level details in down-sampling operations such as max-pooling. Similar phenomena are also observed with SqueezeSeg.

Accurate point-wise label prediction requires understanding not only the high-level semantics of the object and scene but also low-level details. The latter are crucial for the consistency of label assignments. For example, if two points in the cloud are next to each other and have similar intensity measurements, it is likely that they belong to the same object and thus have the same label. Following [10], we used a conditional random field (CRF) to refine the label map generated by the CNN. For a given point cloud and a label prediction c, where c_i denotes the predicted label of the i-th point, a CRF model employs the energy function

E(c) = \sum_i u_i(c_i) + \sum_{i,j} b_{i,j}(c_i, c_j).    (2)

The unary potential term u_i(c_i) = -\log P(c_i) considers the predicted probability P(c_i) from the CNN classifier. The binary potential terms define the "penalty" for assigning different labels to a pair of similar points and are defined as b_{i,j}(c_i, c_j) = \mu(c_i, c_j) \sum_{m=1}^{M} w_m k^m(f_i, f_j), where μ(c_i, c_j) = 1 if c_i ≠ c_j and 0 otherwise, k^m is the m-th Gaussian kernel that depends on the features f of points i and j, and w_m is the corresponding coefficient. In our work, we used two Gaussian kernels

w_1 \exp\left( -\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|x_i - x_j\|^2}{2\sigma_\beta^2} \right) + w_2 \exp\left( -\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2} \right).    (3)

The first term depends on both the angular position p(θ̃, φ̃) and the cartesian coordinates x(x, y, z) of two points. The second term only depends on angular positions. σ_α, σ_β and σ_γ are three hyperparameters chosen empirically. Extra features such as intensity and RGB values can also be included.

Minimizing the above CRF energy function yields a refined label assignment. Exact minimization of equation (2) is intractable, but [9] proposed a mean-field iteration algorithm to solve it approximately and efficiently. [11] reformulated the mean-field iteration as a recurrent neural network (RNN). We refer readers to [9] and [11] for a detailed derivation of the mean-field iteration algorithm and its formulation as an RNN. Here, we provide just a brief description of our implementation of the mean-field iteration as an RNN module, as shown in Fig. 5. The output of the CNN model is fed into the CRF module as the initial probability map. Next, we compute Gaussian kernels based on the input features as in equation (3). The values of the above Gaussian kernels drop very fast as the distance (in the 3D cartesian space and the 2D angular space) between two points increases. Therefore, for each point, we limit the kernel size to a small region of 3 × 5 on the input tensor. Next, we filter the initial probability map using the above Gaussian kernels. This step is also called message passing in [11], since it essentially aggregates the probabilities of neighboring points. It can be implemented as a locally connected layer with the above Gaussian kernels as parameters. Next, we re-weight the aggregated probability and use a "compatibility transformation" to decide how much it changes each point's distribution. This step can be implemented as a 1x1 convolution whose parameters are learned during training. Next, we update the initial probability by adding it to the output of the 1x1 convolution and use softmax to normalize it. The output of the module is a refined probability map, which can be further refined by applying this procedure iteratively. In our experiments, we used 3 iterations to achieve an accurate label map. This recurrent CRF module together with the CNN model can be trained end-to-end. With a single-stage pipeline, we sidestep the threat of propagated errors present in multi-stage workflows and leverage contextual information accordingly.
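The iteration just described can be summarized by the following simplified NumPy sketch of a single mean-field step. It is a toy reconstruction, not the released recurrent CRF layer: the Gaussian kernels of Eq. (3) are assumed to be precomputed (with their w_m re-weighting folded in) as per-point weights over the 3 × 5 window, and the compatibility transformation is a placeholder K × K matrix.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_step(unary_probs, kernels, compat):
    # unary_probs: (H, W, K) initial probability map from the CNN
    # kernels:     (H, W, 3, 5) precomputed Gaussian weights over a 3x5 window
    # compat:      (K, K) compatibility transformation (acts like a 1x1 convolution)
    H, W, K = unary_probs.shape
    kh, kw = 3, 5
    pad = np.pad(unary_probs, ((kh // 2,) * 2, (kw // 2,) * 2, (0, 0)))
    # Message passing: locally connected filtering of neighboring probabilities.
    msg = np.zeros_like(unary_probs)
    for dy in range(kh):
        for dx in range(kw):
            msg += kernels[:, :, dy, dx, None] * pad[dy:dy + H, dx:dx + W, :]
    # Compatibility transform, then update the initial probability and renormalize.
    update = msg @ compat
    return softmax(unary_probs + update)

Applying mean_field_step three times, as in the experiments, corresponds to unrolling the recurrent CRF layer for 3 iterations.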
D. Data collection

Our initial data is from the KITTI raw dataset, which provides images, LiDAR scans and 3D bounding boxes organized in sequences. Point-wise annotations are converted from the 3D bounding boxes: all points within an object's 3D bounding box are considered part of the target object, and we then assign the corresponding label to each point. An example of such a conversion can be found in Fig. 2 (A, B). Using this approach, we collected 10,848 images with point-wise labels.

In order to obtain more training samples (both point clouds and point-wise labels), we built a LiDAR simulator in GTA-V. The framework of the simulator is based on DeepGTAV (https://github.com/ai-tor/DeepGTAV), which uses Script Hook V (http://www.dev-c.com/gtav/scripthookv/) as a plugin.

We mounted a virtual LiDAR scanner atop an in-game car, which is then set to drive autonomously. The system collects both LiDAR point clouds and the game screen. In our setup, the virtual LiDAR and game camera are placed at the same position, which offers two advantages: First, we can easily run sanity checks on the collected data, since the points and images need to be consistent. Second, the points and images can be exploited for other research projects, e.g., sensor fusion.

We use ray casting to simulate each laser ray that the LiDAR emits. The direction of each laser ray is based on several parameters of the LiDAR setup: vertical field of view (FOV), vertical resolution, pitch angle, and the index of the ray in the point cloud scan. Through a series of APIs, the following data associated with each ray can be obtained: a) the coordinates of the first point the ray hits, b) the class of the object hit, c) the instance ID of the object hit (which is useful for instance-wise segmentation, etc.), and d) the center and bounding box of the object hit.

Fig. 6: Left: Image of game scene from GTA-V. Right: LiDAR point cloud corresponding to the game scene.
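As a rough illustration of how ray directions can be derived from the listed parameters, consider the sketch below. It is a hypothetical, simplified geometry helper; the GTA-V ray-casting APIs themselves and the simulator's exact angle conventions are not reproduced here, and the default vertical FOV only approximates the HDL-64E.

import numpy as np

def lidar_ray_directions(v_fov=(-24.8, 2.0), n_channels=64,
                         h_fov=(-45.0, 45.0), n_azimuth=512, pitch=0.0):
    # One unit direction vector per (vertical channel, azimuth step).
    # v_fov / h_fov are in degrees; pitch tilts the whole scanner.
    elev = np.radians(np.linspace(v_fov[0], v_fov[1], n_channels) + pitch)
    azim = np.radians(np.linspace(h_fov[0], h_fov[1], n_azimuth))
    elev, azim = np.meshgrid(elev, azim, indexing="ij")        # (64, 512)
    dirs = np.stack([np.cos(elev) * np.cos(azim),              # x (forward)
                     np.cos(elev) * np.sin(azim),              # y (left)
                     np.sin(elev)], axis=-1)                   # z (up)
    return dirs  # (n_channels, n_azimuth, 3); each ray is cast from the sensor origin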
Using this simulator, we built a synthesized dataset with 8,585 samples, roughly doubling our training set size. To make the data more realistic, we further analyzed the distribution of noise across KITTI point clouds (Fig. 7). We took empirical frequencies of noise at each radial coordinate and normalized to obtain a valid probability distribution: 1) Let P_i be a 3D tensor in the format described earlier in Section III-A, denoting the spherically projected "pixel values" of the i-th KITTI point cloud. For each of the n KITTI point clouds, consider whether or not the pixel at the (θ̃, φ̃) coordinate contains "noise." For simplicity, we consider "noise" to be missing data, where all pixel channels are zero. Then, the empirical frequency of noise at the (θ̃, φ̃) coordinate is

\epsilon(\tilde{\theta}, \tilde{\phi}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ P_i[\tilde{\theta}, \tilde{\phi}] = 0 \}.

2) We can then augment the synthesized data using the distribution of noise in the KITTI data. For any point cloud in the synthetic dataset, at each (θ̃, φ̃) coordinate of the point cloud, we randomly add noise by setting all feature values to 0 with probability ε(θ̃, φ̃).

Fig. 7: Fixing distribution of noise in synthesized data.

It is worth noting that GTA-V used very simple physical models for pedestrians, often reducing people to cylinders. In addition, GTA-V does not encode a separate category for cyclists, instead marking people and vehicles separately on all accounts. For these reasons, we decided to focus on the "car" class for KITTI evaluation when training with our synthesized dataset.
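The two augmentation steps above amount to the following short NumPy sketch (illustrative only; tensor shapes follow Sec. III-A, and "noise" means all channels equal to zero, as defined above).

import numpy as np

def empirical_noise_frequency(kitti_tensors):
    # kitti_tensors: (n, H, W, C) stack of spherically projected KITTI scans.
    # A cell counts as noise (missing data) when all of its channels are zero.
    missing = np.all(kitti_tensors == 0, axis=-1)      # (n, H, W) boolean
    return missing.mean(axis=0)                        # epsilon(theta, phi), shape (H, W)

def inject_noise(synthetic_tensor, eps, rng=None):
    # Zero out each (theta, phi) cell of a synthetic scan with probability eps.
    if rng is None:
        rng = np.random.default_rng(0)
    drop = rng.random(eps.shape) < eps                 # (H, W) boolean mask
    out = synthetic_tensor.copy()
    out[drop] = 0.0
    return out

empirical_noise_frequency corresponds to step 1 and inject_noise to step 2.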
IV. EXPERIMENTS

A. Evaluation metrics

We evaluate our model's performance on both class-level and instance-level segmentation tasks. For class-level segmentation, we compare predicted with ground-truth labels, point-wise, and evaluate precision, recall and IoU (intersection-over-union) scores, which are defined as follows:

Pr_c = \frac{|P_c \cap G_c|}{|P_c|}, \quad recall_c = \frac{|P_c \cap G_c|}{|G_c|}, \quad IoU_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|},

where P_c and G_c respectively denote the predicted and ground-truth point sets that belong to class-c. |·| denotes the cardinality of a set. The IoU score is used as the primary accuracy metric in our experiments.

For instance-level segmentation, we first match each predicted instance-i with a ground truth instance. This index matching procedure can be denoted as M(i) = j, where i ∈ {1, ..., N} is the predicted instance index and j ∈ {∅, 1, ..., M} is the ground truth index. If no ground truth is matched to instance-i, then we set M(i) to ∅. The matching procedure M(·) 1) sorts ground-truth instances by number of points and 2) for each ground-truth instance, finds the predicted instance with the largest IoU. The evaluation script will be released together with the source code.

For each class-c, we compute instance-level precision, recall, and IoU scores as

Pr_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|P_c|}, \quad recall_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|G_c|}, \quad IoU_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|P_c \cup G_c|}.

P_{i,c} denotes the i-th predicted instance that belongs to class-c. Different instance sets are mutually exclusive, thus \sum_i |P_{i,c}| = |P_c|. Likewise for G_{M(i),c}. If no ground truth instance is matched with prediction-i, then G_{M(i),c} is an empty set.
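For reference, the class-level metrics can be computed point-wise as in the following minimal NumPy sketch (this is not the authors' evaluation script, which additionally performs the instance matching M(·) described above).

import numpy as np

def class_level_metrics(pred, gt, c):
    # pred, gt: integer label maps of identical shape; c: class id.
    p = (pred == c)
    g = (gt == c)
    inter = np.logical_and(p, g).sum()
    precision = inter / max(p.sum(), 1)
    recall = inter / max(g.sum(), 1)
    iou = inter / max(np.logical_or(p, g).sum(), 1)
    return precision, recall, iou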
B. Experimental Setup

Our primary dataset is the converted KITTI dataset described above. We split the publicly available raw dataset into a training set with 8,057 frames and a validation set with 2,791 frames. Note that KITTI LiDAR scans can be temporally correlated if they are from the same sequence. In our split, we ensured that frames in the training set do not appear in validation sequences. Our training/validation split will be released as well. We developed our model in TensorFlow [22] and used NVIDIA TITAN X GPUs for our experiments. Since the KITTI dataset only provides reliable 3D bounding boxes for front-view LiDAR scans, we limit our horizontal field of view to the forward-facing 90°. Details of our model training protocols will be released in our source code.

C. Experimental Results

Segmentation accuracy of SqueezeSeg is summarized in Table I. We compared two variations of SqueezeSeg, one with the recurrent CRF layer and one without. Although our proposed metric is very challenging – as a high IoU requires point-wise correctness – SqueezeSeg still achieved high IoU scores, especially for the car category. Note that both class-level and instance-level recalls for the car category are higher than 90%, which is desirable for autonomous driving, as false negatives are more likely to lead to accidents than false positives. We attribute the lower performance on the pedestrian and cyclist categories to two reasons: 1) there are many fewer instances of pedestrians and cyclists in the dataset, and 2) pedestrian and cyclist instances are much smaller in size and have much finer details, making them more difficult to segment.

By combining our CNN with a CRF, we increased accuracy (IoU) for the car category significantly. The performance boost mainly comes from improvement in precision, since the CRF better filters mis-classified points on the borders. At the same time, we also noticed that the CRF resulted in slightly worse performance for the pedestrian and cyclist segmentation tasks. This may be due to a lack of CRF hyperparameter tuning for pedestrians and cyclists.

TABLE I: Segmentation Performance of SqueezeSeg

                         Class-level            Instance-level
                         P      R      IoU      P      R      IoU
car         w/ CRF       66.7   95.4   64.6     63.4   90.7   59.5
            w/o CRF      62.7   95.5   60.9     60.0   91.3   56.7
pedestrian  w/ CRF       45.2   29.7   21.8     43.5   28.6   20.8
            w/o CRF      52.9   28.6   22.8     50.8   27.5   21.7
cyclist     w/ CRF       35.7   45.8   25.1     30.4   39.0   20.6
            w/o CRF      35.2   51.1   26.4     30.1   43.7   21.7

Summary of SqueezeSeg's segmentation performance. P, R, IoU in the header row respectively represent precision, recall and intersection-over-union. IoU is used as the primary accuracy metric. All the values in this table are in percentages.

Runtimes of the two SqueezeSeg models are summarized in Table II. On a TITAN X GPU, SqueezeSeg without CRF only takes 8.7 ms to process one LiDAR point cloud frame. Combined with a CRF layer, the model takes 13.5 ms per frame. This is much faster than the sampling rate of most LiDAR scanners today. The maximum rotation rate of the Velodyne HDL-64E LiDAR, for example, is 20 Hz. On vehicle embedded processors, where computational resources are more limited, SqueezeSeg comfortably allows trade-offs between speed and other practical concerns such as energy efficiency or processor cost. Also, note that the standard deviation of runtime for both SqueezeSeg models is very small, which is crucial for the stability of the entire autonomous driving system. However, our instance-wise segmentation currently relies on conventional clustering algorithms such as DBSCAN (we used the implementation from http://scikit-learn.org/0.15/modules/generated/sklearn.cluster.DBSCAN.html), which in comparison takes much longer and has much larger variance. A more efficient and stable clustering implementation is necessary, but it is out of the scope of this paper.

TABLE II: Runtime Performance of SqueezeSeg Pipeline

                         Average runtime (ms)    Standard deviation (ms)
SqueezeSeg w/o CRF       8.7                     0.5
SqueezeSeg               13.5                    0.8
DBSCAN clustering        27.3                    45.8
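Since instance-wise segmentation relies on off-the-shelf clustering, a hypothetical scikit-learn call suffices to illustrate the step; the eps and min_samples values below are placeholders, not the paper's settings.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(points, labels, target_class, eps=0.5, min_samples=10):
    # points: (N, 3) cartesian coordinates; labels: (N,) predicted class ids.
    mask = labels == target_class
    instance_ids = np.full(len(points), -1, dtype=int)
    if mask.any():
        # DBSCAN groups nearby points of the same class into instances;
        # points it marks as noise keep the id -1.
        instance_ids[mask] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[mask])
    return instance_ids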

We tested our model's accuracy on KITTI data when trained on GTA simulated data; the results are summarized in Table III. Our GTA simulator is currently still limited in its ability to provide realistic labels for pedestrians and cyclists, so we consider only segmentation performance for cars. Additionally, our simulated point cloud does not contain intensity measurements; we therefore excluded intensity as an input feature to the network. To quantify the effects of training on synthesized data, we trained a SqueezeSeg model on the KITTI training set, without using intensity measurements, and validated it on the KITTI validation set. The model's performance is shown in the first row of the table. Compared with Table I, the IoU score is worse, due to the loss of the intensity channel. If we train the model completely on GTA simulated data, we see significantly worse performance. However, combining the KITTI training set with our GTA-simulated dataset, we see significantly increased accuracy that is even better than Table I.

Fig. 8: Visualization of SqueezeSeg's prediction on a projected LiDAR depth map. For comparison, visualization of the ground-truth labels is plotted below the predicted ones. Notice that SqueezeSeg additionally and accurately segments objects that are unlabeled in the ground truth.
A visualization of the segmentation result by SqueezeSeg vs. the ground truth labels can be found in Fig. 8. For most of the objects, the predicted result is almost identical to the ground truth, save for the ground beneath target objects. Also notice that SqueezeSeg additionally and accurately segments objects that are unlabeled in the ground truth. These objects may be obscured or too small, and are therefore placed in the "Don't Care" category for the KITTI benchmark.

TABLE III: Segmentation Performance on the Car Category with Simulated Data

                  Class-level            Instance-level
                  P      R      IoU      P      R      IoU
KITTI             58.9   95.0   57.1     56.1   90.5   53.0
GTA               30.4   86.6   29.0     29.7   84.6   28.2
KITTI + GTA       69.6   92.8   66.0     66.6   88.8   61.4

V. CONCLUSIONS

We propose SqueezeSeg, an accurate, fast and stable end-to-end approach for road-object segmentation from LiDAR point clouds. Addressing the deficiencies of previous approaches that were discussed in the Introduction, our deep learning approach 1) does not rely on hand-crafted features, but utilizes convolutional filters learned through training; 2) uses a deep neural network and therefore has no reliance on iterative algorithms such as RANSAC, GP-INSAC, and agglomerative clustering; and 3) reduces the pipeline to a single stage, sidestepping the issue of propagated errors and allowing the model to fully leverage object context. The model accomplishes very high accuracy at faster-than-real-time inference speeds with small variance, as required for applications such as autonomous driving. Additionally, we synthesize large quantities of simulated data, then demonstrate a significant boost in performance when training with synthesized data and validating on real-world data. We use select classes as a proof-of-concept, granting synthesized data a potential role in self-driving datasets of the future.

ACKNOWLEDGEMENT

This work was partially supported by the DARPA PERFECT program, Award HR0011-12-2-0016, together with ASPIRE Lab sponsor Intel, as well as lab affiliates HP, Huawei, Nvidia, and SK Hynix. This work has also been partially sponsored by individual gifts from BMW, Intel, and the Samsung Global Research Organization.

REFERENCES

[1] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
[2] B. Douillard, J. Underwood, N. Kuntz, V. Vlaskine, A. Quadros, P. Morton, and A. Frenkel, "On the segmentation of 3D LiDAR point clouds," in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 2798–2805.
[3] M. Himmelsbach, A. Mueller, T. Lüttel, and H.-J. Wünsche, "LIDAR-based 3D object perception," in Proceedings of 1st International Workshop on Cognition for Technical Systems, vol. 1, 2008.
[4] D. Z. Wang, I. Posner, and P. Newman, "What could move? Finding cars, pedestrians and bicyclists in 3D laser data," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 4038–4044.
[5] D. Zermas, I. Izzat, and N. Papanikolopoulos, "Fast segmentation of 3D point clouds: A paradigm on LiDAR data for autonomous vehicle applications," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 5067–5073.
[6] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.
[7] F. Moosmann, O. Pink, and C. Stiller, "Segmentation of 3D lidar data in non-flat urban environments using a local convexity criterion," in Intelligent Vehicles Symposium, 2009 IEEE. IEEE, 2009, pp. 215–220.
[8] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[9] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," arXiv preprint arXiv:1606.00915, 2016.
[11] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, "Conditional random fields as recurrent neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
[12] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
[13] M.-O. Shin, G.-M. Oh, S.-W. Kim, and S.-W. Seo, "Real-time and accurate segmentation of 3-D point clouds based on Gaussian process regression," IEEE Transactions on Intelligent Transportation Systems, 2017.
[14] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, "Fast LIDAR-based road detection using fully convolutional neural networks," in Intelligent Vehicles Symposium (IV), 2017 IEEE. IEEE, 2017, pp. 1019–1024.
[15] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," arXiv preprint arXiv:1611.07759, 2016.
[16] J. Schlosser, C. K. Chow, and Z. Kira, "Fusing LIDAR and images for pedestrian detection using convolutional neural networks," in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 2198–2205.
[17] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," arXiv preprint arXiv:1608.07916, 2016.
[18] D. Maturana and S. Scherer, "3D convolutional neural networks for landing zone detection from LiDAR," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 3471–3478.
[19] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in European Conference on Computer Vision (ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906. Springer International Publishing, 2016, pp. 102–118.
[20] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, and R. Vasudevan, "Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?" CoRR, vol. abs/1610.01983, 2016. [Online]. Available: http://arxiv.org/abs/1610.01983
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," Google Technical Report, 2015.
