Fig. 2: LiDAR Projections. Note that each channel reflects structural information in the camera-view image.
[Figure: recurrent CRF module, showing LiDAR input, Gaussian filters, 1x1 and 3x3 convolutions (C/4 and C/2 channels), deconvolution with 2x upsampling, unary update, concatenation, and iteration.]
First, we can easily run sanity checks on the collected data, since the points and images need to be consistent. Second, the points and images can be exploited for other research projects, e.g., sensor fusion.
We use ray casting to simulate each laser ray that LiDAR
emits. The direction of each laser ray is based on several
parameters of the LiDAR setup: vertical field of view (FOV),
vertical resolution, pitch angle, and the index of the ray
in the point cloud scan. Through a series of APIs, the
following data associated with each ray can be obtained: a)
the coordinates of the first point the ray hits, b) the class
of the object hit, c) the instance ID of the object hit (which
is useful for instance-wise segmentation, etc.), d) the center
and bounding box of the object hit.
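For illustration, a minimal sketch of how such ray directions can be generated from the parameters above is given below (Python/NumPy; the function name, defaults, and parameters are our own illustrative assumptions, not the simulator's actual API; the default vertical FOV roughly follows a Velodyne HDL-64E):

```python
import numpy as np

def lidar_ray_directions(v_fov=(-24.9, 2.0), n_beams=64,
                         h_fov=(-45.0, 45.0), n_cols=512, pitch_deg=0.0):
    """Unit direction vectors for a simulated LiDAR sweep.

    v_fov: vertical field of view in degrees (min, max)
    n_beams: number of laser channels (vertical resolution)
    h_fov: horizontal field of view in degrees
    n_cols: number of firings per channel within h_fov
    pitch_deg: mounting pitch angle of the sensor, in degrees
    """
    # Elevation of each laser channel, shifted by the sensor pitch.
    elev = np.radians(np.linspace(v_fov[0], v_fov[1], n_beams) + pitch_deg)
    # Azimuth of each firing index within the horizontal FOV.
    azim = np.radians(np.linspace(h_fov[0], h_fov[1], n_cols))
    elev, azim = np.meshgrid(elev, azim, indexing="ij")
    # Spherical-to-Cartesian conversion; rows index channels, columns index firings.
    x = np.cos(elev) * np.cos(azim)
    y = np.cos(elev) * np.sin(azim)
    z = np.sin(elev)
    return np.stack([x, y, z], axis=-1)  # shape (n_beams, n_cols, 3)
```

Each unit vector returned here would then be passed to the game engine's ray-casting call, which reports the hit point and the object attributes listed above.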
Fig. 6: Left: Image of a game scene from GTA-V. Right: LiDAR point cloud corresponding to the game scene.

Fig. 7: Fixing distribution of noise in synthesized data

Using this simulator, we built a synthesized dataset with 8,585 samples, roughly doubling our training set size. To make the data more realistic, we further analyzed the distribution of noise across KITTI point clouds (Fig. 7). We took empirical frequencies of noise at each radial coordinate and normalized them to obtain a valid probability distribution: 1) Let $P_i$ be a 3D tensor in the format described earlier in Section III-A, denoting the spherically projected "pixel values" of the $i$-th KITTI point cloud. For each of the $n$ KITTI point clouds, consider whether or not the pixel at the $(\tilde{\theta}, \tilde{\phi})$ coordinate contains "noise." For simplicity, we consider "noise" to be missing data, where all pixel channels are zero. Then, the empirical frequency of noise at the $(\tilde{\theta}, \tilde{\phi})$ coordinate is

$$\epsilon(\tilde{\theta}, \tilde{\phi}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{P_i[\tilde{\theta}, \tilde{\phi}] = 0\}.$$

2) We can then augment the synthesized data using the distribution of noise in the KITTI data. For any point cloud in the synthetic dataset, at each $(\tilde{\theta}, \tilde{\phi})$ coordinate of the point cloud, we randomly add noise by setting all feature values to 0 with probability $\epsilon(\tilde{\theta}, \tilde{\phi})$.
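A minimal sketch of these two steps, assuming the projected scans are stacked into NumPy arrays of shape (n, H, W, C); the function names are illustrative:

```python
import numpy as np

def noise_frequency(projected_scans):
    """Step 1: empirical frequency of missing data at each (theta, phi) cell.

    projected_scans: array of shape (n, H, W, C) holding the spherically
    projected KITTI point clouds.
    """
    # A pixel is "noise" when all of its feature channels are zero.
    is_noise = np.all(projected_scans == 0, axis=-1)   # (n, H, W)
    return is_noise.mean(axis=0)                       # epsilon(theta, phi), (H, W)

def add_noise(synth_scan, eps, rng=None):
    """Step 2: zero out each pixel of a synthetic scan with probability eps."""
    rng = np.random.default_rng() if rng is None else rng
    drop = rng.random(eps.shape) < eps                 # (H, W) Bernoulli mask
    out = synth_scan.copy()
    out[drop] = 0.0                                    # zero all feature channels
    return out
```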
It is worth noting that GTA-V uses very simple physical models for pedestrians, often reducing people to cylinders. In addition, GTA-V does not encode a separate category for cyclists, instead marking people and vehicles separately in all cases. For these reasons, we decided to focus on the "car" class for KITTI evaluation when training with our synthesized dataset.

IV. EXPERIMENTS

A. Evaluation Metrics

We evaluate our model's performance on both class-level and instance-level segmentation tasks. For class-level segmentation, we compare predicted labels with ground-truth labels, point-wise, and evaluate precision, recall, and IoU (intersection-over-union) scores, which are defined as follows:

$$\mathrm{Pr}_c = \frac{|P_c \cap G_c|}{|P_c|}, \quad \mathrm{recall}_c = \frac{|P_c \cap G_c|}{|G_c|}, \quad \mathrm{IoU}_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ respectively denote the predicted and ground-truth point sets that belong to class-$c$, and $|\cdot|$ denotes the cardinality of a set. The IoU score is used as the primary accuracy metric in our experiments.

For instance-level segmentation, we first match each predicted instance-$i$ with a ground-truth instance. This index matching procedure can be denoted as $M(i) = j$, where $i \in \{1, \cdots, N\}$ is the predicted instance index and $j \in \{\emptyset, 1, \cdots, M\}$ is the ground-truth index. If no ground truth is matched to instance-$i$, then we set $M(i)$ to $\emptyset$. The matching procedure $M(\cdot)$ 1) sorts ground-truth instances by number of points and 2) for each ground-truth instance, finds the predicted instance with the largest IoU. The evaluation script will be released together with the source code.

For each class-$c$, we compute instance-level precision, recall, and IoU scores as

$$\mathrm{Pr}_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|P_c|}, \quad \mathrm{recall}_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|G_c|}, \quad \mathrm{IoU}_c = \frac{\sum_i |P_{i,c} \cap G_{M(i),c}|}{|P_c \cup G_c|}.$$

Here, $P_{i,c}$ denotes the $i$-th predicted instance that belongs to class-$c$. Different instance sets are mutually exclusive; thus, $\sum_i |P_{i,c}| = |P_c|$, and likewise for $G_{M(i),c}$. If no ground-truth instance is matched with prediction-$i$, then $G_{M(i),c}$ is an empty set.
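The released evaluation script is the authoritative implementation; the sketch below merely illustrates the class-level scores and one possible greedy, one-to-one reading of the matching procedure $M(\cdot)$. Instance IDs are assumed to be positive integers, with 0 marking background:

```python
import numpy as np

def class_scores(pred, gt, c):
    """Point-wise precision, recall, and IoU for class c.

    pred, gt: integer label arrays of identical shape (one label per point).
    """
    p, g = pred == c, gt == c
    inter = np.logical_and(p, g).sum()
    return (inter / max(p.sum(), 1),                       # precision
            inter / max(g.sum(), 1),                       # recall
            inter / max(np.logical_or(p, g).sum(), 1))     # IoU

def match_instances(pred_inst, gt_inst):
    """Greedy reading of M(.): ground-truth instances, largest first,
    each claim the unmatched predicted instance with the highest IoU."""
    match = {}  # predicted instance id -> ground-truth instance id
    gt_ids = sorted(np.unique(gt_inst[gt_inst > 0]),
                    key=lambda j: -(gt_inst == j).sum())
    for j in gt_ids:
        g = gt_inst == j
        best, best_iou = None, 0.0
        for i in np.unique(pred_inst[pred_inst > 0]):
            if i in match:
                continue
            p = pred_inst == i
            iou = np.logical_and(p, g).sum() / np.logical_or(p, g).sum()
            if iou > best_iou:
                best, best_iou = i, iou
        if best is not None:
            match[best] = j
    return match
```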
B. Experimental Setup

Our primary dataset is the converted KITTI dataset described above. We split the publicly available raw dataset into a training set with 8,057 frames and a validation set with 2,791 frames. Note that KITTI LiDAR scans can be temporally correlated if they are from the same sequence. In our split, we ensured that frames in the training set do not appear in validation sequences. Our training/validation split will be released as well. We developed our model in TensorFlow [22] and used NVIDIA TITAN X GPUs for our experiments. Since the KITTI dataset only provides reliable 3D bounding boxes for front-view LiDAR scans, we limit our horizontal field of view to the forward-facing 90°. Details of our model training protocols will be released in our source code.
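The released split is authoritative; the following sketch only illustrates the sequence-level constraint, assuming each frame carries the raw KITTI drive name as its sequence ID:

```python
import random

def split_by_sequence(frame_list, val_ratio=0.25, seed=0):
    """Split frames so that no sequence spans both sets.

    frame_list: iterable of (sequence_id, frame_id) pairs. Whole sequences
    are assigned to one side, so temporally correlated frames never cross
    the train/validation boundary.
    """
    seqs = sorted({s for s, _ in frame_list})
    random.Random(seed).shuffle(seqs)
    n_val = max(1, round(len(seqs) * val_ratio))
    val_seqs = set(seqs[:n_val])
    train = [(s, f) for s, f in frame_list if s not in val_seqs]
    val = [(s, f) for s, f in frame_list if s in val_seqs]
    return train, val
```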
TABLE I: Segmentation Performance of SqueezeSeg

                          Class-level           Instance-level
                          P      R      IoU     P      R      IoU
car         w/ CRF        66.7   95.4   64.6    63.4   90.7   59.5
            w/o CRF       62.7   95.5   60.9    60.0   91.3   56.7
pedestrian  w/ CRF        45.2   29.7   21.8    43.5   28.6   20.8
            w/o CRF       52.9   28.6   22.8    50.8   27.5   21.7
cyclist     w/ CRF        35.7   45.8   25.1    30.4   39.0   20.6
            w/o CRF       35.2   51.1   26.4    30.1   43.7   21.7

Summary of SqueezeSeg's segmentation performance. P, R, and IoU in the header row respectively represent precision, recall, and intersection-over-union. IoU is used as the primary accuracy metric. All values in this table are percentages.

C. Experimental Results

Segmentation accuracy of SqueezeSeg is summarized in Table I. We compared two variations of SqueezeSeg, one with the recurrent CRF layer and one without. Although our proposed metric is very challenging (a high IoU requires point-wise correctness), SqueezeSeg still achieved high IoU scores, especially for the car category. Note that both class-level and instance-level recalls for the car category are higher than 90%, which is desirable for autonomous driving, since false negatives are more likely to lead to accidents than false positives. We attribute the lower performance on the pedestrian and cyclist categories to two reasons: 1) there are many fewer instances of pedestrians and cyclists in the dataset, and 2) pedestrian and cyclist instances are much smaller in size and have much finer details, making them more difficult to segment.

By combining our CNN with a CRF, we increased the accuracy (IoU) for the car category significantly. The performance boost mainly comes from improved precision, since the CRF better filters out mis-classified points on the borders. At the same time, we also noticed that the CRF resulted in slightly worse performance on the pedestrian and cyclist segmentation tasks. This may be due to the lack of CRF hyperparameter tuning for pedestrians and cyclists.

The runtimes of the two SqueezeSeg models are summarized in Table II. On a TITAN X GPU, SqueezeSeg without CRF takes only 8.7 ms to process one LiDAR point cloud frame. Combined with a CRF layer, the model takes 13.5 ms per frame. This is much faster than the scan rate of most LiDAR scanners today; the maximum rotation rate of the Velodyne HDL-64E LiDAR, for example, is 20 Hz, i.e., 50 ms per scan. On vehicle-embedded processors, where computational resources are more limited, SqueezeSeg comfortably allows trade-offs between speed and other practical concerns such as energy efficiency or processor cost. Also, note that the standard deviation of the runtime for both SqueezeSeg models is very small, which is crucial for the stability of the entire autonomous driving system. However, our instance-wise segmentation currently relies on conventional clustering algorithms such as DBSCAN (a minimal sketch follows Table II), which in comparison takes much longer and has a much larger variance. A more efficient and stable clustering implementation is necessary, but it is out of the scope of this paper.
TABLE II: Runtime Performance of SqueezeSeg Pipeline

                        Average runtime (ms)    Standard deviation (ms)
SqueezeSeg w/o CRF       8.7                     0.5
SqueezeSeg              13.5                     0.8
DBSCAN clustering       27.3                    45.8
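As a minimal illustration of the clustering step discussed above, the sketch below applies scikit-learn's DBSCAN to the points of one predicted class; the eps and min_samples values are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(points, class_mask, eps=0.5, min_samples=10):
    """Group points of one predicted class into instances with DBSCAN.

    points: (N, 3) array of x, y, z coordinates.
    class_mask: boolean (N,) array, True where the class-level network
    predicted the target class (e.g., car).
    eps: neighborhood radius in meters; min_samples: density threshold.
    """
    labels = np.full(len(points), -1, dtype=int)
    if class_mask.any():
        labels[class_mask] = DBSCAN(
            eps=eps, min_samples=min_samples).fit_predict(points[class_mask])
    return labels  # -1 marks noise / unassigned points
```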