
LEDNET: A LIGHTWEIGHT ENCODER-DECODER NETWORK FOR REAL-TIME SEMANTIC SEGMENTATION

Yu Wang1, Quan Zhou1,2,∗, Jia Liu1, Jian Xiong1, Guangwei Gao3, Xiaofu Wu1, and Longin Jan Latecki4

1 National Engineering Research Center of Communications and Networking, Nanjing University of Posts & Telecommunications, P.R. China.
2 Key Lab. of Broadband Wireless Communications and Sensor Network Technology, Nanjing University of Posts & Telecommunications, P.R. China.
3 Institute of Advanced Technology, Nanjing University of Posts & Telecommunications, P.R. China.
4 Department of Computer and Information Sciences, Temple University, Philadelphia, USA.

arXiv:1905.02423v3 [cs.CV] 13 May 2019

ABSTRACT

The extensive computational burden limits the usage of CNNs in mobile devices for dense estimation tasks. In this paper, we present a lightweight network to address this problem, namely LEDNet, which employs an asymmetric encoder-decoder architecture for the task of real-time semantic segmentation. More specifically, the encoder adopts ResNet as the backbone network, where two new operations, channel split and shuffle, are utilized in each residual block to greatly reduce computation cost while maintaining high segmentation accuracy. On the other hand, an attention pyramid network (APN) is employed in the decoder to further lighten the entire network complexity. Our model has less than 1M parameters and is able to run at over 71 FPS on a single GTX 1080Ti GPU. Comprehensive experiments demonstrate that our approach achieves state-of-the-art results in terms of the speed and accuracy trade-off on the CityScapes dataset.

Index Terms— CNN, Lightweight network, Encoder-decoder network, ResNet, Real-time semantic segmentation

1. INTRODUCTION

Recently, building deeper and larger convolutional neural networks (CNNs) has been a primary trend for solving scene understanding tasks [1, 2, 3, 4, 5]. The most accurate CNNs usually have hundreds of convolutional layers and thousands of feature channels. In spite of achieving higher accuracy, these advances come at the sacrifice of running time and speed. Especially in many real-world scenarios, such as augmented reality, robotics, and self-driving cars to name a few, computationally cheap networks with small size are often required to carry out online estimation in a timely fashion. Therefore, accurate networks requiring enormous resources are not suitable for computationally limited mobile platforms (e.g., drones, robots, and smartphones), which have limited energy overhead, restrictive memory constraints, and reduced computational capabilities. This limitation is particularly prominent for the computationally heavy task of semantic segmentation [4, 5, 6, 7, 8], where the goal is to assign a semantic category label to each image pixel.

To overcome this problem, many lightweight networks have been designed to balance segmentation accuracy and implementation efficiency. They are roughly divided into two categories: network compression [9, 10, 11, 12] and convolution factorization [13, 14, 15]. The first category reduces inference computation by compressing pre-trained networks, including hashing [9], pruning [10], and quantization [11, 12]. To further remove redundancy, an alternative approach to lighten CNNs relies on sparse coding theory [16, 17]. In contrast, motivated by the convolution factorization principle (CFP) that decomposes a standard convolution into group convolution and depthwise separable convolution [3, 15, 18], the second category focuses on directly training networks of smaller size. For example, ENet [13] employs ResNet [2] as a backbone to perform efficient inference. Zhao et al. [19] propose a cascade network that incorporates high-level label guidance to improve performance. In [7, 14, 20], a symmetrical encoder-decoder architecture is adopted, which greatly reduces the number of parameters while maintaining accuracy. Although some works have conducted preliminary research on lightweight network architectures, pursuing the best accuracy under a very limited computational budget is still an open research question for the task of real-time semantic segmentation.

∗ Corresponding author: Quan Zhou, [email protected]. This work is partly supported by NSFC (No. 61876093, 61701258, 61701252, 61671253), NSFJS (No. BK20181393, BK20170906), NSF (No. IIS-1302164), and Huawei Innovation Research Program (HIRP2018).
[Figure 1: block diagram of LEDNet. Legend: downsampling unit, split-shuffle-non-bottleneck (SS-nbt) unit, convolution, point-wise sum, point-wise product, global average pooling, upsampling. The annotated feature-map sizes run from the 1024×512×3 input down to 16×8×128 inside the APN and back up to 1024×512×C at the output.]

Fig. 1. Overall asymmetric architecture of the proposed LEDNet. The encoder employs an FCN-like network, while an APN is adopted in the decoder. C denotes the number of classes. Please refer to the text for more details. (Best viewed in color)

In this paper, we aim at solving this trade-off as a whole, without sitting on only one of its sides. We introduce a novel lightweight network called LEDNet, adopting an asymmetric encoder-decoder architecture for real-time semantic segmentation. As shown in Figure 1, our LEDNet is composed of two parts: an encoder network and a decoder network. Following CFP, the core unit of the encoder is a novel residual module that leverages skip connections and convolutions with channel split and shuffle. While the skip connections allow the convolutions to learn residual functions that facilitate training, the split and shuffle operations enhance information exchange within the feature channels while maintaining computational costs similar to 1D factorized convolutions. In the decoder, instead of complicated dilated convolution [20], we design an attention pyramid network (APN) to extract dense features, where the attention mechanism is utilized to estimate the semantic label of each pixel. Our contributions are three-fold: (1) the asymmetric architecture of our LEDNet leads to a great reduction of network parameters, which accelerates the inference process; (2) the channel split and shuffle operations in our residual layer balance network size against powerful feature representation; in addition, channel shuffle is differentiable, which means it can be embedded into network structures for end-to-end training; (3) the attention mechanism of the feature pyramid is employed to design the APN at our decoder end, further lightening the complexity of the whole network.

2. OUR APPROACH

2.1. Residual Module with Split and Shuffle Operations

We focus on solving the efficiency limitation that is essentially present in the residual block, which is used in recent accurate CNNs for image classification [2, 22, 21] and semantic segmentation [6, 13, 14]. Recent years have witnessed multiple successful instances of lightweight residual layers [13, 15], such as the bottleneck (Figure 2 (a)), non-bottleneck-1D (Figure 2 (b)), and ShuffleNet module (Figure 2 (c)), where the pointwise convolution is widely used. However, [21] claims the contrary: pointwise convolution accounts for most of the computational complexity, which is especially disadvantageous for lightweight models.

[Figure 2: block diagrams of the four residual modules, built from 1×1, 3×3, 3×1, and 1×3 convolutions (some dilated), BN/ReLU layers, and channel split/shuffle operations.]

Fig. 2. Comparison of different residual layer modules. From left to right are (a) bottleneck [13, 15], (b) non-bottleneck-1D [14], (c) ShuffleNet [21], and (d) our SS-nbt module.

To balance performance and efficiency under a limited computational budget, we introduce two simple operators, called channel split and shuffle, in the residual layer. We refer to the proposed module as split-shuffle-non-bottleneck (SS-nbt), as depicted in Figure 2 (d). Motivated by [12, 18], a split-transform-merge strategy is employed in the design of our SS-nbt, approaching the representational power of large and dense layers, but at a considerably lower computational complexity. At the beginning of each SS-nbt, the input is split into two lower-dimensional branches, each of which has half the channels of the input. To avoid pointwise convolution, the transformation is performed using a set of specialized 1D filters (e.g., 1 × 3, 3 × 1), and the convolutional outputs of the two branches are merged using concatenation so that the number of channels stays the same. To facilitate training, the stacked output is added to the input through the identity-mapping branch. The channel shuffle operation [21] is finally used to enable information communication between the two split branches. After the shuffle, the next SS-nbt unit begins. It is clear that our residual module is not only efficient, but also accurate. Firstly, the high efficiency of each SS-nbt unit allows us to use more feature channels. Secondly, in each SS-nbt unit, the merged feature channels are randomly shuffled before joining the next unit. This can be regarded as a kind of feature reuse, which to some extent enlarges network capacity without significantly increasing complexity.
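To make the data flow of the SS-nbt unit concrete, a minimal PyTorch sketch is given below. It follows the description above (channel split, 1D-factorized transforms, concatenation, residual addition, channel shuffle); the exact per-branch layer ordering and the placement of batch normalization are assumptions read off Figure 2 (d), and the class and argument names are our own rather than those of the released implementation.

# Minimal PyTorch sketch of an SS-nbt block (illustrative; layer ordering
# follows one plausible reading of Fig. 2 (d), not an exact reproduction).
import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_shuffle(x, groups=2):
    # Reorder channels so the two branches exchange information.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class SSnbt(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        half = channels // 2
        # Each branch uses only 1D (3x1 / 1x3) convolutions, no pointwise conv.
        self.left = nn.Sequential(
            nn.Conv2d(half, half, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (1, 3), padding=(0, 1)),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (3, 1), padding=(dilation, 0), dilation=(dilation, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (1, 3), padding=(0, dilation), dilation=(1, dilation)),
            nn.BatchNorm2d(half),
        )
        self.right = nn.Sequential(
            nn.Conv2d(half, half, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (3, 1), padding=(1, 0)),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (1, 3), padding=(0, dilation), dilation=(1, dilation)),
            nn.ReLU(inplace=True),
            nn.Conv2d(half, half, (3, 1), padding=(dilation, 0), dilation=(dilation, 1)),
            nn.BatchNorm2d(half),
        )

    def forward(self, x):
        # Split the input into two half-channel branches.
        x1, x2 = x.chunk(2, dim=1)
        # Transform each branch with 1D convolutions, then merge by concatenation.
        out = torch.cat([self.left(x1), self.right(x2)], dim=1)
        # Residual connection, then channel shuffle across the two branches.
        out = F.relu(out + x)
        return channel_shuffle(out, groups=2)


if __name__ == "__main__":
    block = SSnbt(128, dilation=2)
    print(block(torch.randn(1, 128, 64, 128)).shape)  # -> torch.Size([1, 128, 64, 128])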
Table 1. The architecture of LEDNet. "Size" denotes the dimension of the output feature maps; C is the number of classes.

Stage    | Type                          | Size
---------+-------------------------------+----------------
Encoder  | Downsampling Unit             | 512 × 256 × 32
         | 3× SS-nbt Unit                | 512 × 256 × 32
         | Downsampling Unit             | 256 × 128 × 64
         | 2× SS-nbt Unit                | 256 × 128 × 64
         | Downsampling Unit             | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 1)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 2)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 5)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 9)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 2)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 5)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 9)   | 128 × 64 × 128
         | SS-nbt Unit (dilated r = 17)  | 128 × 64 × 128
Decoder  | APN Module                    | 128 × 64 × C
         | Upsampling Unit (×8)          | 1024 × 512 × C

Table 2. Comparison with the state-of-the-art approaches in terms of segmentation accuracy and implementation efficiency.

Method      | Cla  | Cat  | Time (ms) | Speed (FPS) | Para (M)
------------+------+------+-----------+-------------+---------
SegNet [7]  | 57.0 | 79.1 | 67        | 15          | 29.5
ENet [13]   | 58.3 | 80.4 | 34        | 31          | 0.36
ESPNet [20] | 60.3 | 82.2 | 9         | 112         | 0.40
CGNet [25]  | 64.8 | 85.7 | 20        | 50          | 0.50
ICNet [19]  | 69.5 | 86.4 | 33        | 30          | 7.80
Ours        | 70.6 | 87.1 | 14        | 71          | 0.94
2.2. LEDNet Architecture Design
As shown in Table 1, our LEDNet follows an encoder-decoder architecture. Unlike [7], our approach employs an asymmetric sequential architecture, where the encoder produces downsampled feature maps and a subsequent decoder adopts the APN to upsample the feature maps to match the input resolution.

Besides the SS-nbt unit, the encoder also includes a downsampling unit, which stacks the two parallel outputs of a single 3 × 3 convolution with stride 2 and a max-pooling. Downsampling enables a deeper network to gather context while at the same time reducing computation. Note that we postpone downsampling in the encoder, in a similar spirit to [18]. Moreover, the use of dilated convolutions [14, 23] gives our architecture a large receptive field, leading to an improvement in accuracy. Compared to the use of larger kernel sizes, this technique has proven more effective in terms of computational cost and parameters.
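The downsampling unit itself can be sketched as follows. The text only says that the outputs of a stride-2 3 × 3 convolution and a max-pooling are stacked in parallel; the convention that the convolution supplies (out − in) channels so that the concatenation reaches the target width (as in ENet's initial block), and the trailing BN/ReLU, are our assumptions.

# Sketch of the downsampling unit, assuming an ENet-style concatenation of a
# stride-2 conv branch and a max-pooling branch (convention not stated above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Downsampling(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels - in_channels,
                              kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # Two parallel half-resolution views of the input, stacked on channels.
        out = torch.cat([self.conv(x), self.pool(x)], dim=1)
        return F.relu(self.bn(out))


if __name__ == "__main__":
    # e.g. the first stage of Table 1: 1024×512×3 -> 512×256×32
    print(Downsampling(3, 32)(torch.randn(1, 3, 1024, 512)).shape)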
Inspired by the attention mechanism [24], our decoder adopts an APN to perform dense estimation with spatial-wise attention. To increase the receptive field, the APN uses a pyramid attention module that integrates features from three pyramid scales. As shown in Figure 1, we first apply 3 × 3, 5 × 5, and 7 × 7 convolutions with stride 2 to form a multi-scale feature pyramid. The pyramid then fuses information from different scales step by step, which incorporates context from neighboring scales more precisely. Since the high-level feature maps have small resolution, using large kernel sizes does not add much computation. Thereafter, a 1 × 1 convolution is applied to the encoder output, and the resulting feature maps are multiplied pixel-wise by the pyramid attention features. To further enhance performance, a global average pooling branch is introduced to integrate a global context prior. Finally, an upsampling unit is employed to match the resolution of the input image. Benefiting from the pyramid architecture, the APN captures multi-scale context cues and produces pixel-level attention for the convolutional features. Unlike DeepLab [5] and PSPNet [8], which stack multi-scale feature maps, our context information is multiplied pixel-wise with the original convolutional features without introducing much computational budget.
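A rough sketch of the APN is shown next. Only the overall structure is taken from the text (stride-2 pyramid of 3 × 3, 5 × 5, 7 × 7 convolutions, coarse-to-fine fusion, 1 × 1 convolution followed by a pixel-wise product, a global-average-pooling branch, and ×8 upsampling); the channel widths inside the pyramid, the BN/ReLU placement, and the interpolation mode are assumptions, and all names are hypothetical.

# Rough PyTorch sketch of the APN decoder described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(cin, cout, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))


class APN(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # Three pyramid levels built with stride-2 convolutions (Fig. 1).
        self.down1 = conv_bn_relu(in_channels, num_classes, 3, stride=2)
        self.down2 = conv_bn_relu(num_classes, num_classes, 5, stride=2)
        self.down3 = conv_bn_relu(num_classes, num_classes, 7, stride=2)
        # 1x1 convolution applied to the encoder output.
        self.conv1x1 = conv_bn_relu(in_channels, num_classes, 1)
        # Global average pooling branch for the global context prior.
        self.global_branch = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        # Build the pyramid and fuse it from coarse to fine, step by step.
        p1 = self.down1(x)   # 1/2 resolution
        p2 = self.down2(p1)  # 1/4 resolution
        p3 = self.down3(p2)  # 1/8 resolution
        p2 = p2 + F.interpolate(p3, size=p2.shape[2:], mode='bilinear', align_corners=False)
        p1 = p1 + F.interpolate(p2, size=p1.shape[2:], mode='bilinear', align_corners=False)
        attention = F.interpolate(p1, size=(h, w), mode='bilinear', align_corners=False)
        # Pixel-wise product between the attention map and the 1x1-convolved features.
        out = self.conv1x1(x) * attention
        # Add the global context prior from the pooling branch (broadcast over H, W).
        out = out + self.global_branch(F.adaptive_avg_pool2d(x, 1))
        # Upsample (x8) to match the input image resolution.
        return F.interpolate(out, scale_factor=8, mode='bilinear', align_corners=False)


if __name__ == "__main__":
    apn = APN(128, num_classes=20)
    print(apn(torch.randn(1, 128, 64, 128)).shape)  # -> (1, 20, 512, 1024)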
3. EXPERIMENTS

3.1. Implementation Details

We select the widely-used CityScapes dataset [26] to evaluate our LEDNet, which includes 19 object categories and one additional background class. Besides the images with fine pixel-level annotations, which comprise 2,975 training, 500 validation, and 1,525 testing images, we also use the 20K coarsely annotated images for training. We adopt mean intersection-over-union (mIoU), averaged across all classes and categories, to evaluate segmentation accuracy, while running time, speed (FPS), and model size (number of parameters) measure implementation efficiency. To show the advantages of LEDNet, we selected 6 state-of-the-art lightweight networks as baselines, including SegNet [7], ENet [13], ERFNet [14], ICNet [19], CGNet [25], and ESPNet [20]. For a fair comparison, all the methods are run on the same hardware platform, a Dell workstation with a single GTX 1080Ti GPU. We favor a large minibatch size (set to 5) to make full use of the GPU memory. The initial learning rate is 5 × 10−4 and the 'poly' learning rate policy is adopted with power 0.9, while momentum and weight decay are set to 0.9 and 10−4, respectively.
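For reference, the 'poly' schedule quoted above corresponds to lr = base_lr × (1 − iter/max_iter)^0.9. A small sketch follows, assuming an SGD optimizer (the paper only states momentum and weight decay) and a placeholder iteration budget.

# Minimal sketch of the 'poly' learning-rate policy with the quoted
# hyper-parameters. The SGD optimizer and max_iter value are assumptions.
import torch

def poly_lr(base_lr, iteration, max_iter, power=0.9):
    # Decays smoothly from base_lr to 0 over training.
    return base_lr * (1.0 - iteration / max_iter) ** power

model = torch.nn.Conv2d(3, 19, 1)          # stand-in for LEDNet
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4,
                            momentum=0.9, weight_decay=1e-4)

max_iter = 150000                          # assumed; not stated in the paper
for it in (0, 75000, 149999):
    print(it, poly_lr(5e-4, it, max_iter)) # lr at start, midpoint, end
# Inside the real training loop one would set
#   for g in optimizer.param_groups: g['lr'] = poly_lr(5e-4, it, max_iter)
# before each optimizer.step().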
Table 3. Individual category results on the CityScapes test set in terms of class and category mIoU scores. Methods trained
using both fine and coarse data are marked with superscript ‘†’.

Method Roa Sid Bui Wal Fen Pol TLi TSi Veg Ter Sky Ped Rid Car Tru Bus Tra Mot Bic Cla Cat
SegNet [7] 96.4 73.2 84.0 28.4 29.0 35.7 39.8 45.1 87.0 63.8 91.8 62.8 42.8 89.3 38.1 43.1 44.1 35.8 51.9 57.0 79.1
ENet [13] 96.3 74.2 75.0 32.2 33.2 43.4 34.1 44.0 88.6 61.4 90.6 65.5 38.4 90.6 36.9 50.5 48.1 38.8 55.4 58.3 80.4
ESPNet [20] 97.0 77.5 76.2 35.0 36.1 45.0 35.6 46.3 90.8 63.2 92.6 67.0 40.9 92.3 38.1 52.5 50.1 41.8 57.2 60.3 82.2
CGNet [25] 95.5 78.7 88.1 40.0 43.0 54.1 59.8 63.9 89.6 67.6 92.9 74.9 54.9 90.2 44.1 59.5 25.2 47.3 60.2 64.8 85.7
ERFNet [14] 97.2 80.0 89.5 41.6 45.3 56.4 60.5 64.6 91.4 68.7 94.2 76.1 56.4 92.4 45.7 60.6 27.0 48.7 61.8 66.3 85.2
ICNet [19] 97.1 79.2 89.7 43.2 48.9 61.5 60.4 63.4 91.5 68.3 93.5 74.6 56.1 92.6 51.3 72.7 51.3 53.6 70.5 69.5 86.4
Ours 97.1 78.6 90.4 46.5 48.1 60.9 60.4 71.1 91.2 60.0 93.2 74.3 51.8 92.3 61.0 72.4 51.0 43.3 70.2 69.2 86.8
Ours† 98.1 79.5 91.6 47.7 49.9 62.8 61.3 72.8 92.6 61.2 94.9 76.2 53.7 90.9 64.4 64.0 52.7 44.4 71.6 70.6 87.1

Fig. 3. The visual comparison on CityScapes val dataset. From left to right are input images, ground truth, segmentation outputs
from SegNet [7], ENet [13], ERFNet [14], ESPNet [20], ICNet [19], CGNet [25], and our LEDNet. (Best viewed in color)

3.2. Evaluation Results

Table 2 and Table 3 report the comparison results, demonstrating that LEDNet achieves the best available trade-off in terms of accuracy and efficiency. Among all the approaches, our LEDNet yields 70.6% class mIoU and 87.1% category mIoU, obtaining the best score on 13 out of the 19 individual classes. Regarding efficiency, LEDNet is nearly 5× faster and 30× smaller than SegNet [7]. Although ENet [13], another efficient network, is about 1.5× more efficient and has 3× fewer parameters, it delivers poor segmentation accuracy, more than 10% below our LEDNet. Figure 3 shows some visual examples of segmentation outputs on the CityScapes validation set. Compared with the baselines, our LEDNet not only correctly classifies objects at different scales, but also produces consistent qualitative results for all classes.

4. CONCLUSION AND FUTURE WORK

This paper has described the LEDNet model, which adopts an asymmetric encoder-decoder architecture for real-time semantic segmentation. The encoder uses channel split and shuffle operations in its residual layers, enhancing information communication in the manner of feature reuse. The decoder employs an APN, whose spatial pyramid structure enlarges the receptive field without introducing a significant computational budget. The entire network is trained end-to-end. The experimental results show that our LEDNet achieves the best trade-off on the CityScapes dataset in terms of segmentation accuracy and implementation efficiency. Future work includes decomposing the standard convolutions in the APN into 1D convolutions, yielding an even more lightweight network while retaining segmentation accuracy.
5. REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014, pp. 580–587.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1–9.

[4] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE TPAMI, vol. 39, no. 4, pp. 640–651, 2017.

[5] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE TPAMI, vol. 40, no. 4, pp. 834–848, 2018.

[6] G. Lin, A. Milan, C. Shen, and I. Reid, "Refinenet: Multi-path refinement networks for high-resolution semantic segmentation," in CVPR, 2017, pp. 5168–5177.

[7] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE TPAMI, vol. 39, no. 12, pp. 2481–2495, 2017.

[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in CVPR, 2016, pp. 6230–6239.

[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in ICML, 2015.

[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," in ICLR, 2016.

[11] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016, pp. 5168–5177.

[12] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016.

[13] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.

[14] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE TITS, vol. 19, no. 1, pp. 263–272, 2018.

[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[16] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016, pp. 2074–2082.

[17] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in CVPR, 2015, pp. 806–814.

[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016, pp. 2818–2826.

[19] H. S. Zhao, X. J. Qi, X. Y. Shen, J. P. Shi, and J. Y. Jia, "Icnet for real-time semantic segmentation on high-resolution images," arXiv preprint arXiv:1704.08545v2, 2018.

[20] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," arXiv preprint arXiv:1803.06815v3, 2018.

[21] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," in CVPR, 2018, pp. 6848–6856.

[22] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017, pp. 5987–5995.

[23] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016.

[24] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," arXiv preprint arXiv:1709.01507, 2017.

[25] T. Y. Wu, S. Tang, R. Zhang, and Y. D. Zhang, "Cgnet: A light-weight context guided network for semantic segmentation," arXiv preprint arXiv:1811.08201v1, 2018.

[26] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in CVPR, 2016, pp. 3213–3223.
