A Comparative Study of Real-Time Semantic Segmentation for Autonomous Driving

Abstract

Semantic segmentation is a critical module in robotics-related applications, especially autonomous driving. Most of the research on semantic segmentation focuses on improving accuracy, with less attention paid to computationally efficient solutions. The majority of the efficient semantic segmentation algorithms have customized optimizations without scalability, and there is no systematic way to compare them. In this paper, we present a real-time segmentation benchmarking framework and study various segmentation algorithms for autonomous driving. We implemented a generic meta-architecture via a decoupled design where different types of encoders and decoders can be plugged in independently. We provide several example encoders including VGG16, ResNet18, MobileNet, and ShuffleNet, and decoders including SkipNet, UNet, and Dilation Frontend. The framework is scalable for the addition of new encoders and decoders developed in the community for other vision tasks. We performed detailed experimental analysis on the Cityscapes dataset for various combinations of encoder and decoder. The modular framework enabled rapid prototyping of a custom efficient architecture which provides a ∼143x reduction in GFLOPs compared to SegNet and runs in real-time at ∼15 fps on an NVIDIA Jetson TX2. The source code of the framework is publicly available¹.

¹ https://github.com/MSiam/TFSegmentation

Figure 1: Overview of the different components in the framework with the decoupling of feature extraction module and decoding method.

1. Introduction

Semantic segmentation has witnessed tremendous progress with deep learning. The main goal is to perform pixel-wise classification of the image, which serves the purpose of scene understanding. Scene understanding has various benefits in robotics applications [55, 3, 56, 30]; the most prominent benefit is in autonomous driving [61, 42, 4, 12]. Segmentation has also been used in medical applications [11, 66] and augmented reality [36]. The first prominent work in deep semantic segmentation was fully convolutional networks (FCNs) [35], which proposed an end-to-end method to learn pixel-wise classification. That method paved the road to subsequent advances in segmentation accuracy. Multi-scale approaches [7][60], context-aware models [33][65], and temporal models [44] introduced different directions for improving accuracy. All of the above approaches focused on accuracy and robustness of segmentation.

However, some aspects of semantic segmentation, such as computational efficiency, have not been thoroughly studied in the literature, even though for applications such as autonomous driving this would have tremendous impact.
There is little work that addresses segmentation networks' efficiency, such as [63][39]. The survey on semantic segmentation [18] presented a comparative study between different segmentation architectures including ENet [39]. Yet, there is no principled comparison of different networks and meta-architectures. These previous studies compared different networks as a whole, without comparing the effect of different modules. That does not enable researchers and practitioners to pick the best suited design choices for the required task.

In this paper we propose the first framework toward benchmarking real-time architectures in segmentation. Our main contributions are: (1) We provide a modular decoupling of the segmentation architecture into a feature extraction module and a decoding method, which is termed a meta-architecture as shown in Figure 1. The separation helps in understanding the impact of different parts of the network on real-time performance. (2) A detailed ablation study highlighting the trade-off between accuracy and computational efficiency is presented. (3) The modular design of our framework allowed the emergence of two novel segmentation architectures using MobileNet [24] and ShuffleNet [62] with multiple decoding methods. ShuffleNet led to a 143x reduction in GFLOPs in comparison to SegNet. It was able to run in real-time at 15 fps on a Jetson TX2. Our framework is built on top of TensorFlow and is publicly available.
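As an illustrative sketch (not the released implementation), the decoupling can be pictured as a small builder that pairs any registered encoder with any registered decoder; the encoder, decoder, and registry names below are placeholders.

```python
import tensorflow as tf

def toy_encoder(images):
    """Stand-in feature extractor: returns feature maps at output strides 8, 16 and 32."""
    x = images
    for filters in (32, 64, 64):                      # three stride-2 stages -> overall stride 8
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
    feat8 = x
    feat16 = tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(feat8)
    feat32 = tf.keras.layers.Conv2D(256, 3, strides=2, padding='same', activation='relu')(feat16)
    return {8: feat8, 16: feat16, 32: feat32}

def toy_decoder(feats, num_classes):
    """Stand-in decoding method: 1x1 conv to label space, then one-shot 32x upsampling."""
    heat = tf.keras.layers.Conv2D(num_classes, 1)(feats[32])
    return tf.keras.layers.Conv2DTranspose(num_classes, 64, strides=32, padding='same')(heat)

ENCODERS = {'toy': toy_encoder}    # in the framework: VGG16, ResNet18, MobileNet, ShuffleNet
DECODERS = {'toy': toy_decoder}    # in the framework: SkipNet, UNet, Dilation Frontend

def build_meta_architecture(encoder, decoder, input_shape=(512, 1024, 3), num_classes=20):
    images = tf.keras.Input(shape=input_shape)
    logits = DECODERS[decoder](ENCODERS[encoder](images), num_classes)
    return tf.keras.Model(images, logits)

model = build_meta_architecture('toy', 'toy')
```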
2. Semantic Segmentation

In this section a taxonomy of deep semantic segmentation is presented. The literature on semantic segmentation is categorized into three main subcategories: (1) Fully Convolutional Networks. (2) Context Aware Models. (3) Temporal Models. The first category covers the main body of work on semantic segmentation using deep learning. The other two categories include the work exploiting context knowledge and temporal information. Note that both temporal and context aware models are considered under the fully convolutional networks category. However, they are considered as further refinement and are separated into their own categories due to the large body of work under them. Figure 2 summarizes the general taxonomy and literature in semantic segmentation.

Figure 2: Taxonomy of the semantic segmentation literature: FCNs, Context Aware Models, and Temporal Models.

2.1. Fully Convolutional Networks (FCN)

The initial direction in semantic segmentation using convolutional neural networks was towards patch-wise training [14, 19, 2] to yield the final segmentation. Grangier et al. [19] proposed a multi-patch training strategy for convolutional neural networks to perform segmentation. Farabet et al. [14, 15] proposed a multi-scale dense feature extractor. The method used a Laplacian pyramid of the image, where each scale is forwarded through a 3-stage network to extract hierarchical features. For each pixel the features are encoded from a contextual patch around the pixel. The scene is then over-segmented into super pixels and conditional random fields over the super pixels are used. Bell et al. [2] proposed a method to utilize convolutional neural networks to classify each patch in a sliding window fashion.

The dominant direction in deep semantic segmentation is to learn pixel-wise classification in an end-to-end manner [35, 38, 1]. Long et al. [35] started with proposing fully convolutional networks (FCN). The network learned heatmaps that were then upsampled within the network using transposed convolution to get dense predictions. Unlike patch-wise training methods, this method uses the full image to infer dense predictions. The SkipNet architecture was utilized to refine the segmentation using higher resolution feature maps. Noh et al. [38] proposed a deeper decoder network, in which stacked transposed convolution and unpooling layers are used. Badrinarayanan et al. [1] proposed SegNet, which is an encoder-decoder architecture. The decoder network upsampled the feature maps by keeping the maxpooling indices from the corresponding encoder layer. Kendall et al. [28] followed that work by proposing Bayesian SegNet, which incorporates uncertainties in the predictions using dropout during inference. Ronneberger et al. [41] proposed a u-shaped architecture network where feature maps from different encoding layers are concatenated with the upsampled feature maps from the corresponding decoding layers.
Paszke et al. [39] proposed the use of bottleneck modules for a computationally efficient solution that is denoted as ENet. Figure 3 shows the architecture for FCN8s [35] and U-Net [41].

Figure 3: Different decoding methods for fully convolutional networks. Figure reproduced from [35, 41].

2.2. Context Aware Models

Refinements on fully convolutional networks were introduced to improve the segmentation accuracy by incorporating context. In this section we consider only the spatial context that does not include any temporal information. The methods to enforce models to become context aware are mainly categorized into multi-scale support, utilizing conditional random fields, or recurrent neural networks. Farabet et al. [14] handled the scale by introducing multiple rescaled versions of the image to the network. However, with the emergence of end-to-end pixel-wise training, Long et al. [35] proposed the skip architecture to merge heatmaps from different resolutions. Since these architectures include pooling layers to increase the receptive field, this leads to the downsampling of the image with a loss in resolution.

Yu et al. [60] introduced dilated or atrous convolutions, which expanded the receptive field without losing resolution based on the dilation factor. Thus it provided a better solution for handling multiple scales. Wu et al. [59] proposed a shallower network using residual connections that included dilated convolution and outperformed deeper models. Chen et al. [7] proposed DeepLab, which uses atrous spatial pyramid pooling (ASPP) for multi-scale support. This idea builds on utilizing the dilated convolutions. Figure 4 shows dilated convolutions and spatial pyramid pooling as separate methods that can be used to incorporate multi-scale support. Zhao et al. [64] proposed to incorporate global context features from previous layers into the next layers. Chen et al. [8] refined the DeepLab method further by incorporating global context features. Chen et al. [9] provided a way for handling scale by using attention models that provide a means to focus on the most relevant features. This attention model is able to learn a weight map that weighs feature maps pixel-by-pixel from different scales. Eigen et al. [13] proposed a method to sequentially utilize multiple scales to refine the prediction of depth, surface normals, and semantic segmentation.

One of the commonly used models to incorporate context is the conditional random field (CRF). Chen et al. [7] utilized fully connected conditional random fields as a post-processing step. The unary potentials of the CRF are set to the probabilities from their convolutional network, while the pairwise potentials are Gaussian kernels based on the spatial and color features. Lin et al. [34] proposed a method to use pairwise potentials based on convolutional neural network feature maps. In contrast to the previous work that uses conditional random fields as a post-processing refinement step, this work went further in integrating CNNs and CRFs. Zheng et al. [65] formulated the mean field CRF inference algorithm as a recurrent network. Thus, the proposed method enabled end-to-end training of the model.

Another way to incorporate context is using recurrent neural networks (RNN) to capture the long range dependencies of various regions. Visin et al. [57] used a recurrent layer to sweep the image horizontally and vertically, which ensures the usage of contextual information for a better segmentation. One of the main bottlenecks in vanilla RNNs is the vanishing gradients problem; gated recurrent architectures such as LSTMs [23] and GRUs [10] alleviate this problem. Byeon et al. [5] proposed a segmentation method that splits the image into non-overlapping regions, then incorporates context using four separate LSTM blocks. Li et al. [32] proposed a method for context fusion using LSTMs. In their work both RGB and depth information were utilized, and the global context was modeled vertically on both, followed by horizontal fusion. Another bottleneck in vanilla recurrent networks is that they can lead to the loss of spatial relationships. Shuai et al. [46] utilized a directed acyclic graph RNN to incorporate long range dependencies. This directed acyclic graph maintains spatial relationships, unlike chained RNNs.
Finally, Qi et al. [40] proposed the hierarchically gated deep network, which is a multi-scale deep network that incorporates context at various scales. Multiple LSTM memory cells are used in the network between convolutional layers, to learn whether to incorporate spatial context from the lower layer into the higher one.

Figure 4: Atrous Convolution and Spatial Pyramid Pooling for multi-scale support. Figure reproduced from [8, 60].

2.3. Temporal Models

All the discussed work was focused on still image segmentation. Recently some approaches emerged for video semantic segmentation that utilized temporal information [45][16][47][37]. Shelhamer et al. [45] introduced clockwork networks, in which clock signals control the learning of different layers with different rates. Tran et al. [54] proposed a 3D convolutional network trained end-to-end for video semantic segmentation. An issue with 3D convolution is its small extent on the temporal axis, which would not capture long temporal dependencies. Recurrent neural networks can alleviate such a bottleneck. Fayyaz et al. [16] incorporated spatio-temporal features by using a layer grid of Long Short Term Memory models (LSTMs). However, conventional LSTMs as mentioned earlier do not utilize the spatial coherence and would end up with more parameters to learn. Siam et al. [47] proposed a convolutional gated recurrent network to learn temporal information to leverage the semantic segmentation of videos. The gated recurrent unit used in the work was convolutional, which enabled it to learn both spatial and temporal information with fewer parameters. Nilsson et al. [37] combined the power of both convolutional gated architectures and spatial transformers for leveraging video semantic segmentation. However, in an action recognition comparative study [6], two-stream architectures that utilize optical flow information have been shown to perform better than Conv-LSTM models. That motivated more research in the direction of incorporating motion and appearance for video segmentation. Tokmakov et al. [52] proposed a U-Net architecture that takes optical flow information as input to perform video segmentation. Jain et al. [27] proposed a model that fuses both RGB and optical flow information for the final video segmentation prediction. Tokmakov et al. [53] proposed a further improvement by utilizing optical flow information in a two-stream architecture that utilizes convolutional gated recurrent units. Gadde et al. [17] proposed a method for applying feature warping through an intermediate module termed NetWarp in order to incorporate temporal information from videos.

3. Real-time CNNs

In recent years there has been an increasing need for running deep neural networks in real-time on embedded platforms, in various applications. Two main categories in the work on efficient CNNs are discussed: (1) Efficient CNN models that introduce different layers and modules to improve computational efficiency. (2) Model compression and pruning. Other approaches such as model quantization and hardware acceleration are out of the scope of this paper.

3.1. Efficient CNN Models

Convolutional layers are required to learn cross-channel and spatial correlations. This process can be performed in an efficient manner by separating both. Szegedy et al. [50, 51, 49] introduced the inception module and utilized it in Inception V1 and V2, and further refined it in Inception V3 [51] and Inception-ResNet [49]. The main purpose of the inception module is to decouple the cross-channel and spatial convolutions. This separation is performed using 1x1 convolutions for the cross-channel correlations, mapping to 3 or 4 separate spaces, followed by 3x3 and/or 5x5 convolutions for the spatial correlations. The extreme case of the inception module, with one spatial convolution per channel, is what is termed depthwise separable convolution. Figure 5 shows the inception module [50] and depthwise separable convolution, which is an extreme case of inception. Howard et al. presented depthwise separable convolutions as a means to improve efficiency [24] in what is known as MobileNets. Zhang et al. developed a generalized form of separable convolution denoted grouped convolution, while utilizing channel shuffle to ensure the input-output connectivity between different groups [62]. Figure 5 shows the ShuffleNet unit utilized in their model.
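As an illustration of the depthwise separable idea, a single block in TensorFlow/Keras could be sketched as follows (the layer sizes are arbitrary and the code is not the MobileNet reference implementation):

```python
import tensorflow as tf

def depthwise_separable_block(x, out_channels, stride=1):
    """One spatial (depthwise) 3x3 convolution per channel, then a 1x1 pointwise convolution."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(x)   # cross-channel mixing
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input(shape=(512, 1024, 3))
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same')(inputs)  # standard conv stem
x = depthwise_separable_block(x, 64)
x = depthwise_separable_block(x, 128, stride=2)
model = tf.keras.Model(inputs, x)
```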
Huang et al. [25] proposed training a densely connected network with sparsified connections, denoted CondenseNet. The connectivity pattern is implemented efficiently using grouped convolutions. This method is also considered a network pruning method. Most of the research conducted on efficient convolutional networks is directed towards classification and detection. Little attention is given to the computational efficiency of deep neural networks for semantic segmentation. When it comes to applications such as autonomous driving this consideration is extremely important. Some studies, such as the work by Paszke et al. [39], tried to address the issue of segmentation efficiency.
Figure 5: Differences between computationally efficient modules for convolution: (a) Inception Module [50]; (b) Depthwise Separable Convolution [24]; (c) ShuffleNet Unit [62]. GC: Grouped Convolution. DWC: Depth-Wise Convolution.
Sandler et al. [43] proposed the inverted residual module with a linear bottleneck. This takes a low dimensional representation as input, expands it to a higher dimensional space, applies convolution, then maps it back. The convolution operation is performed using the efficient depthwise separable convolutions. This work proposed an efficient segmentation method as well.

3.2. Model Compression and Pruning

There are two main pruning techniques for model compression, namely weight pruning and filter pruning. Han et al. proposed the DeepCompression framework [21], which learns both weights and connections in a three-step process. They make use of a regularization loss which pushes parameters towards zero, thus reducing the number of parameters of AlexNet by a factor of 9. Sparsity can lead to inefficient parallelism. To alleviate the sparsity constraint, Han et al. [20] presented an efficient inference engine relying on sparse matrix-vector multiplication with weight sharing. The resulting computation speed achieves a 189x and 13x gain when compared to CPU and GPU implementations of the same DNN without compression. Model compression also enables networks to fit in the on-chip SRAM, which reduces energy consumption per memory read by a factor of 120x compared to fetching weights from DRAM.

Filter pruning is a similar approach to weight pruning. While weight pruning results in a sparse connectivity pattern, removing entire filters and their associated feature maps preserves dense connectivity. Consequently, computational cost reduction does not rely on sparse convolution libraries or dedicated hardware, and existing efficient BLAS libraries for dense matrix multiplication can be used. Wen et al. [58] proposed filter pruning using model structure learning and group lasso, which is an efficient regularization to learn sparse structures. Their method is even more general than filter regularization, since the Structured Sparsity Learning (SSL) method can regularize any structure (filters, channels, filter shapes, and layer depth) of CNNs. This learning technique acts like a compression method to learn a smaller model from a larger one, reducing the computational cost. Li et al. [31] presented another pruning approach which is not based on filter magnitude. The method relied on reinforcement learning to train a pruning agent which made a set of binary actions to decide whether or not to remove each filter. It maximized a reward function which combined two terms, an accuracy term and an efficiency term. The accuracy term ensured the performance drop is bounded, and the efficiency term encouraged pruning more filters away.
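To make the filter-pruning idea concrete (using the common magnitude criterion rather than the learned agent of [31]), a minimal sketch could rank a layer's filters by their L1 norm and rebuild a smaller dense layer; the helper below is illustrative only.

```python
import numpy as np
import tensorflow as tf

def prune_conv_filters(conv_layer, keep_ratio=0.5):
    """Magnitude-based filter pruning: keep the filters with the largest L1 norm.

    Returns a new dense Conv2D layer with fewer output channels, so no sparse
    kernels or special hardware are needed at inference time.
    """
    kernel, bias = conv_layer.get_weights()              # kernel: (kh, kw, in_ch, out_ch)
    scores = np.abs(kernel).sum(axis=(0, 1, 2))          # one L1 score per output filter
    n_keep = max(1, int(keep_ratio * kernel.shape[-1]))
    keep = np.sort(np.argsort(scores)[-n_keep:])         # indices of the surviving filters
    pruned = tf.keras.layers.Conv2D(n_keep, conv_layer.kernel_size,
                                    strides=conv_layer.strides,
                                    padding=conv_layer.padding)
    pruned.build((None, None, None, kernel.shape[2]))    # same number of input channels
    pruned.set_weights([kernel[..., keep], bias[keep]])
    return pruned, keep   # 'keep' tells the next layer which input channels remain

layer = tf.keras.layers.Conv2D(64, 3, padding='same')
layer.build((None, 128, 128, 32))
smaller, kept = prune_conv_filters(layer, keep_ratio=0.25)
```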
4. Segmentation Benchmarking Framework

In this section a detailed description of the benchmarking framework is presented. We implemented a generic framework through the decoupled encoder-decoder design. This allows extensibility to more encoding and decoding methods. It also allows a principled comparison between different design choices that can aid practitioners.

4.1. Meta-Architectures

Three meta-architectures are integrated in our benchmarking software: (1) SkipNet meta-architecture [35]. (2) U-Net meta-architecture [41]. (3) Dilation Frontend meta-architecture [60]. The meta-architectures for semantic segmentation identify the decoding method for in-network upsampling. All of the network architectures share the same downsampling factor of 32. The downsampling is achieved either by utilizing pooling layers or strides in the convolutional layers. This ensures that different meta-architectures have a unified downsampling factor to assess the effect of the decoding method only.
Figure 6: Different meta-architectures using MobileNet as the feature extraction network. (a) SkipNet architecture. (b) UNet.

The SkipNet architecture denotes an architecture similar to FCN8s [35]. The main idea of the skip architecture is to benefit from feature maps of higher resolution to improve the output segmentation. SkipNet applies transposed convolution on heatmaps in the label space instead of performing it in the feature space. This entails a more computationally efficient decoding method than others. Feature extraction networks have the same downsampling factor of 32, so they follow the 8-stride version of the skip architecture. Higher resolution feature maps are followed by 1x1 convolution to map from feature space to label space, producing heatmaps corresponding to each class. The final heatmap with downsampling factor of 32 is followed by transposed convolution with stride 2. Elementwise addition between these upsampled heatmaps and the higher resolution heatmaps is performed. Finally, the output heatmaps are followed by a transposed convolution for upsampling with stride 8. Figure 6(a) shows the SkipNet architecture utilizing a MobileNet encoder.
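A minimal sketch of this label-space decoding, assuming a hypothetical encoder that exposes feature maps at strides 8, 16, and 32 (shapes below are for a 512x1024 input; the code is illustrative, not the framework's implementation):

```python
import tensorflow as tf

def skipnet_decoder(feat8, feat16, feat32, num_classes=20):
    """FCN8s-style SkipNet decoding: 1x1 convs map features to label space,
    transposed convs upsample the heatmaps, and skips are fused by addition."""
    score32 = tf.keras.layers.Conv2D(num_classes, 1)(feat32)
    score16 = tf.keras.layers.Conv2D(num_classes, 1)(feat16)
    score8 = tf.keras.layers.Conv2D(num_classes, 1)(feat8)

    up16 = tf.keras.layers.Conv2DTranspose(num_classes, 4, strides=2, padding='same')(score32)
    fused16 = tf.keras.layers.Add()([up16, score16])       # elementwise addition in label space
    up8 = tf.keras.layers.Conv2DTranspose(num_classes, 4, strides=2, padding='same')(fused16)
    fused8 = tf.keras.layers.Add()([up8, score8])
    # final 8x upsampling back to the input resolution
    return tf.keras.layers.Conv2DTranspose(num_classes, 16, strides=8, padding='same')(fused8)

f8 = tf.keras.Input(shape=(64, 128, 256))     # stride-8 features
f16 = tf.keras.Input(shape=(32, 64, 512))     # stride-16 features
f32 = tf.keras.Input(shape=(16, 32, 1024))    # stride-32 features
decoder = tf.keras.Model([f8, f16, f32], skipnet_decoder(f8, f16, f32))
```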
The U-Net architecture denotes the method of decoding that upsamples features using transposed convolution corresponding to each downsampling stage [41]. The upsampled features are fused with the corresponding feature maps from the encoder at the same resolution. The stage-wise upsampling provides higher accuracy than one-shot 8x upsampling. The current fusion method used in the framework is element-wise addition. Concatenation as a fusion method can provide better accuracy, as it enables the network to learn a weighted fusion of features. Nonetheless, it increases the computational cost, as the cost is directly affected by the number of channels. The upsampled features are then followed by 1x1 convolution to output the final pixel-wise classification. Figure 6(b) shows the UNet architecture using MobileNet as a feature extraction network.
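One decoding stage of this scheme could be sketched as follows, with features upsampled by 2x in feature space and fused with the same-resolution encoder map by element-wise addition (names and shapes are illustrative):

```python
import tensorflow as tf

def unet_stage(decoder_feat, encoder_feat):
    """Upsample decoder features 2x in feature space, then fuse with the encoder
    feature map of matching resolution by element-wise addition."""
    channels = encoder_feat.shape[-1]
    up = tf.keras.layers.Conv2DTranspose(channels, 4, strides=2, padding='same')(decoder_feat)
    return tf.keras.layers.Add()([up, encoder_feat])   # addition is cheaper than concatenation

dec = tf.keras.Input(shape=(16, 32, 1024))   # decoder features at stride 32
enc = tf.keras.Input(shape=(32, 64, 512))    # encoder features at stride 16
fused = unet_stage(dec, enc)                 # -> (32, 64, 512); a final 1x1 conv yields class scores
```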
The Dilation Frontend architecture utilizes dilated convolution [60] instead of downsampling the feature maps. Dilated convolution enables the network to maintain an adequate receptive field, but without degrading the resolution through pooling or strided convolution. However, a side-effect of this method is that the computational cost increases, since the operations are performed on larger resolution feature maps. The encoder network is modified to incorporate a downsampling factor of 8 instead of 32. The decrease of the downsampling is performed by either removing pooling layers or converting stride-2 convolutions to stride 1. The removed pooling or strided convolutions are then replaced with two dilated convolutions [60] with dilation factors 2 and 4, respectively.
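The tail of such a frontend might be sketched as follows, with the two dilated convolutions operating at stride 8 and a final 8x transposed convolution (an illustration of the idea, not the exact layers used in the framework):

```python
import tensorflow as tf

def dilation_frontend_tail(feat_stride8, num_classes=20):
    """Replace the last two downsampling stages with dilated (atrous) convolutions:
    the receptive field keeps growing while the feature maps stay at stride 8."""
    x = tf.keras.layers.Conv2D(512, 3, padding='same', dilation_rate=2, activation='relu')(feat_stride8)
    x = tf.keras.layers.Conv2D(512, 3, padding='same', dilation_rate=4, activation='relu')(x)
    heat = tf.keras.layers.Conv2D(num_classes, 1)(x)
    # one-shot 8x upsampling back to the input resolution
    return tf.keras.layers.Conv2DTranspose(num_classes, 16, strides=8, padding='same')(heat)

feat = tf.keras.Input(shape=(64, 128, 256))   # stride-8 features for a 512x1024 input
model = tf.keras.Model(feat, dilation_frontend_tail(feat))
```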
4.2. Feature Extraction Architectures

In order to achieve real-time performance, multiple network architectures are integrated in the benchmarking framework. The framework includes four state-of-the-art real-time network architectures for feature extraction. These are: (1) VGG16 [48]. (2) ResNet18 [22]. (3) MobileNet [24]. (4) ShuffleNet [62]. The reason for using VGG16 is to act as a baseline method to compare against, as it was used in [35]. The other architectures have been used in real-time systems for detection and classification. ResNet18 incorporates the usage of residual blocks that direct the network toward learning the residual representation on top of an identity mapping.

The MobileNet architecture is based on depthwise separable convolution [24]. It is considered the extreme case of the inception module, where a separate spatial convolution for each channel is applied, denoted as depthwise convolutions. Then 1x1 convolutions are used, denoted as pointwise convolutions. The separation into depthwise and pointwise convolutions improves the computational efficiency on one hand. On the other hand, it improves the accuracy, as the cross-channel and spatial correlation mappings are learned separately.
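Following the cost analysis in the MobileNet paper [24], the saving can be made explicit for a $D_K \times D_K$ kernel applied to a $D_F \times D_F$ feature map with $M$ input and $N$ output channels:

```latex
\underbrace{D_K^2 \, M \, N \, D_F^2}_{\text{standard convolution}}
\quad\text{vs.}\quad
\underbrace{D_K^2 \, M \, D_F^2}_{\text{depthwise}} + \underbrace{M \, N \, D_F^2}_{\text{pointwise}},
\qquad
\frac{\text{separable}}{\text{standard}} = \frac{1}{N} + \frac{1}{D_K^2}.
```

For 3x3 kernels and a large number of output channels this amounts to roughly an 8-9x reduction in multiply-adds.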
The ShuffleNet encoder is based on grouped convolution, which is a generalization of depthwise separable convolution [62]. It uses channel shuffling to ensure connectivity between input and output channels. This eliminates the connectivity restrictions posed by the grouped convolutions.
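The channel shuffle itself is a cheap reshape-transpose-reshape, sketched below (the group count matches the g = 3 setting used in our experiments; the function is illustrative):

```python
import tensorflow as tf

def channel_shuffle(x, groups=3):
    """Permute channels across groups so the following grouped convolution
    sees inputs from every group (reshape -> transpose -> reshape)."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])            # swap the group and per-group channel axes
    return tf.reshape(x, [-1, h, w, c])

x = tf.random.normal([2, 32, 64, 240])              # 240 channels = 3 groups of 80
y = channel_shuffle(x, groups=3)                    # same shape, channels interleaved across groups
```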
5. Experiments

In this section the experimental setup, a detailed ablation study, and results in comparison to the state of the art are reported.
Table 1: Comparison of different encoders and decoders on Cityscapes validation set. GFLOPs are measured on image size
1024x512.
Decoder Encoder GFLOPs mIoU Road Sidewalk Building Sign Sky Person Car
SkipNet MobileNet 13.8 61.3 95.9 73.6 86.9 57.6 91.2 66.4 89.0
SkipNet ShuffleNet 4.63 55.5 94.8 68.6 83.9 50.5 88.6 60.8 86.5
UNet ResNet18 43.9 57.9 95.8 73.2 85.8 57.5 91.0 66.0 88.6
UNet MobileNet 55.9 61.0 95.2 71.3 86.8 60.9 92.8 68.1 88.8
UNet ShuffleNet 17.9 57.0 95.1 69.5 83.7 54.3 89.0 61.7 87.8
Dilation MobileNet 150 57.8 95.6 72.3 85.9 57.0 91.4 64.9 87.8
Dilation ShuffleNet 71.6 53.9 95.2 68.5 84.1 57.3 90.3 62.9 86.6
Table 2: Comparison of different encoders and decoders on Cityscapes validation set with Coarse annotations pre-training
then using fine annotations.
Decoder Encoder mIoU Road Sidewalk Building Sign Sky Person Car
SkipNet MobileNet 62.4 95.4 73.9 86.6 57.4 91.1 65.7 88.4
SkipNet ShuffleNet 59.3 94.6 70.5 85.5 54.9 90.8 60.2 87.5
5.1. Experimental Setup

Throughout all of our experiments, the weighted cross entropy loss from [39] is used to overcome the class imbalance. The class weight is computed as $w_{class} = \frac{1}{\ln(c + p_{class})}$, where $p_{class}$ is the class probability and $c$ is a constant. The Adam optimizer [29] learning rate is set to 1e-4. Batch normalization [26] is incorporated after all convolutional or transposed convolution layers. L2 regularization with a weight decay rate of 5e-4 is utilized to avoid over-fitting. The feature extractor part of the network is initialized with the corresponding pre-trained encoder trained on ImageNet. A width multiplier of 1 is used for MobileNet throughout all experiments to include all the feature channels. The number of groups used in ShuffleNet is 3; based on previous results [62] on classification and detection, three groups provided adequate accuracy.
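A sketch of how these weights could be derived from label frequencies and used in a weighted cross entropy is shown below, assuming the c = 1.02 value from [39]; the pixel counts and helper names are illustrative:

```python
import numpy as np
import tensorflow as tf

def class_weights(pixel_counts, c=1.02):
    """w_class = 1 / ln(c + p_class), with p_class the pixel frequency of the class."""
    p = pixel_counts / pixel_counts.sum()
    return 1.0 / np.log(c + p)

def weighted_cross_entropy(weights):
    weights = tf.constant(weights, dtype=tf.float32)
    def loss(labels, logits):
        # labels: (N, H, W) integer class ids, logits: (N, H, W, num_classes)
        per_pixel = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
        return tf.reduce_mean(per_pixel * tf.gather(weights, labels))
    return loss

counts = np.array([9e8, 1e8, 5e7, 1e6, 2e8, 4e6, 6e7])   # made-up per-class pixel counts
loss_fn = weighted_cross_entropy(class_weights(counts))
```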
Results are reported on the Cityscapes dataset [12], which contains 5000 images with fine annotations, with 20 classes including the ignored class. Another section of the dataset contains coarse annotations with 20,000 labeled images. These are used in the case of coarse pre-training, which proved to improve the segmentation results. Experiments are conducted on images with a resolution of 512x1024.

5.2. Ablation Study

Semantic segmentation is evaluated using mean intersection over union (mIoU), per-class IoU, and per-category IoU. Table 1 shows the results of the ablation study on different encoder-decoder combinations with mIoU and GFLOPs to demonstrate the accuracy and computation trade-off. The main insight gained from our experiments is that the UNet decoding method provides more accurate segmentation results than the Dilation Frontend. This is mainly due to the 8x transposed convolution at the end of the Dilation Frontend, unlike the UNet stage-wise upsampling method. The SkipNet architecture provides on-par results with the UNet decoding method. In some architectures such as SkipNet-ShuffleNet it is less accurate than its UNet counterpart by 1.5%.

The UNet method of incrementally upsampling within the network provides the best results in terms of accuracy. However, the SkipNet architecture is more computationally efficient, with a 4x reduction in GFLOPs. This is explained by the fact that the transposed convolutions in UNet are applied in the feature space, unlike in SkipNet where they are applied in the label space. Table 2 shows that pre-training with the Cityscapes coarse annotations, then fine-tuning on the fine annotations, improves the segmentation in terms of mIoU by 1-4%. The underrepresented classes are the ones that often benefit from pre-training.

5.3. Embedded Vision Experiments

Experimental results on the Cityscapes test set are shown in Table 3. ENet [39] is compared to SkipNet-ShuffleNet and SkipNet-MobileNet in terms of accuracy and computational cost. SkipNet-ShuffleNet outperforms ENet in terms of GFLOPs, yet it maintains on-par mIoU. Both SkipNet-ShuffleNet and SkipNet-MobileNet outperform SegNet [1] in terms of computational cost and accuracy, with a reduction of up to 143x in GFLOPs. SkipNet-ShuffleNet was deployed on a Jetson TX2 and delivered real-time performance at 15 frames per second on an image resolution of 640x360. Figure 8 shows the comparison of frame-rate and running time in milliseconds across different image resolutions. These were measured on the Jetson TX2 for the SkipNet-ShuffleNet architecture. Figure 7 shows qualitative results for different encoders including MobileNet, ShuffleNet and ResNet18. It shows that MobileNet provides more accurate segmentation results.
Table 3: Comparison to the state of the art segmentation networks on Cityscapes test set. GFLOPs are computed on image
resolution 640x360.
Model GFLOPs Class IoU Class iIoU Category IoU Category iIoU
SegNet[1] 286.03 56.1 34.2 79.8 66.4
ENet[39] 3.83 58.3 24.4 80.4 64.0
SkipNet-VGG16[35] 445.9 65.3 41.7 85.7 70.1
SkipNet-ShuffleNet 2.0 58.3 32.4 80.2 62.2
SkipNet-MobileNet 6.2 61.5 35.2 82.0 63.0
Figure 7: Qualitative results on Cityscapes. (a) Original image. (b) SkipNet-MobileNet pre-trained with coarse annotations. (c) UNet-ResNet18. (d) SkipNet-ShuffleNet pre-trained with coarse annotations.
Figure 8: Frame-rate (fps) and running time (milliseconds) versus image resolution, measured on the Jetson TX2 for the SkipNet-ShuffleNet architecture.

6. Conclusion

[...] architecture allows for extensibility further on to other encoders and decoding methods. Detailed analysis of different image resolutions versus frame-rate on Jetson TX2 is presented. Our benchmarking framework provides researchers and practitioners a mechanism to systematically evaluate [...]
References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
[3] T. M. Bonanni, A. Pennisi, D. Bloisi, L. Iocchi, and D. Nardi. Human-robot collaboration for semantic labeling of the environment. In Proceedings of the 3rd Workshop on Semantic Perception, Mapping and Exploration, 2013.
[4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88-97, 2009.
[5] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with LSTM recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547-3555, 2015.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724-4733. IEEE, 2017.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[8] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[9] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. arXiv preprint arXiv:1511.03339, 2015.
[10] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[11] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424-432. Springer, 2016.
[12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213-3223, 2016.
[13] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650-2658, 2015.
[14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929, 2013.
[15] C. Farabet, N. EDU, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers.
[16] M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, and R. Klette. STFCN: Spatio-temporal FCN for semantic video segmentation. CoRR, abs/1608.05971, 2016.
[17] R. Gadde, V. Jampani, and P. V. Gehler. Semantic video CNNs through representation warping. CoRR, abs/1708.03088, 2017.
[18] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
[19] D. Grangier, L. Bottou, and R. Collobert. Deep convolutional networks for scene parsing. In ICML 2009 Deep Learning Workshop, volume 3. Citeseer, 2009.
[20] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016.
[21] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pages 1135-1143, 2015.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[25] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.
[27] S. D. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv preprint arXiv:1701.05384, 2(3):6, 2017.
[28] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[29] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[30] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg. Joint semantic segmentation and 3D reconstruction from monocular video. In European Conference on Computer Vision, pages 703-718. Springer, 2014.
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[32] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In European Conference on Computer Vision, pages 541-557. Springer, 2016.
[33] G. Lin, C. Shen, A. v. d. Hengel, and I. Reid. Exploring context with deep structured models for semantic segmentation. arXiv preprint arXiv:1603.03183, 2016.
[34] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint arXiv:1504.01013, 2015.
[35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[36] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner, S. Golodetz, S. L. Hicks, P. Pérez, S. Izadi, and P. H. Torr. The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3317-3326. ACM, 2015.
[37] D. Nilsson and C. Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. arXiv preprint arXiv:1612.08871, 2016.
[38] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520-1528, 2015.
[39] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[40] G.-J. Qi. Hierarchically gated deep networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[41] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[42] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234-3243, 2016.
[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[44] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. In Computer Vision - ECCV 2016 Workshops, pages 852-868. Springer, 2016.
[45] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. CoRR, abs/1608.03609, 2016.
[46] B. Shuai, Z. Zuo, B. Wang, and G. Wang. DAG-recurrent neural networks for scene labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3620-3629, 2016.
[47] M. Siam, S. Valipour, M. Jagersand, and N. Ray. Convolutional gated recurrent networks for video segmentation. arXiv preprint arXiv:1611.05435, 2016.
[48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[49] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
[51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[52] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 531-539. IEEE, 2017.
[53] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. arXiv preprint arXiv:1704.05737, 2017.
[54] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep end2end voxel2voxel prediction. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2016 IEEE Conference on, pages 402-409. IEEE, 2016.
[55] A. Valada, G. L. Oliveira, T. Brox, and W. Burgard. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In The 2016 International Symposium on Experimental Robotics (ISER 2016), 2016.
[56] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Perez, and P. H. S. Torr. Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In IEEE International Conference on Robotics and Automation (ICRA), 2015.
[57] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. ReSeg: A recurrent neural network-based model for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 41-48, 2016.
[58] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.
[59] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
[60] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[61] H. Zhang, A. Geiger, and R. Urtasun. Understanding high-level semantics by modeling traffic patterns. In Proceedings of the IEEE International Conference on Computer Vision, pages 3056-3063, 2013.
[62] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
[63] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.
[64] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881-2890, 2017.
[65] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529-1537, 2015.
[66] W. Zhu and X. Xie. Adversarial deep structural networks for mammographic mass segmentation. arXiv preprint arXiv:1612.05970, 2016.