LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation
Yu Wang1, Quan Zhou1,2,∗, Jia Liu1, Jian Xiong1, Guangwei Gao3, Xiaofu Wu1, and Longin Jan Latecki4
1 National Engineering Research Center of Communications and Networking, Nanjing University of Posts & Telecommunications, P.R. China.
2 Key Lab. of Broadband Wireless Communications and Sensor Network Technology,
[Fig. 1 diagram: feature-map sizes run from the 1024×512×3 input, through encoder stages of 512×256×32, 256×128×64, and 128×64×128, to pyramid levels of 32×16×128 and 16×8×128, and finally the 1024×512×C output.]
Fig. 1. Overall asymmetric architecture of the proposed LEDNet. The encoder employs an FCN-like network, while an APN is adopted in the decoder. C denotes the number of classes. Please refer to the text for more details. (Best viewed in color)
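To make the decoder side concrete, below is a minimal PyTorch sketch of an APN-style head: a cascade of strided convolutions builds a spatial pyramid, the fused pyramid output re-weights a 1×1-convolution branch as pixel-wise attention, and a global-pooling branch is added back. The class name, kernel sizes, and fusion order are our assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPyramidHead(nn.Module):
    """Sketch of an APN-style decoder head (hypothetical layer sizes).

    A cascaded pyramid of strided convolutions enlarges the receptive
    field cheaply; its fused output acts as pixel-wise attention over a
    1x1-conv branch, and a global-pooling branch adds scene context.
    """

    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.mid = nn.Conv2d(in_ch, num_classes, 1)            # main 1x1 branch
        self.down1 = nn.Conv2d(in_ch, 1, 7, stride=2, padding=3)
        self.down2 = nn.Conv2d(1, 1, 5, stride=2, padding=2)
        self.down3 = nn.Conv2d(1, 1, 3, stride=2, padding=1)
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, num_classes, 1))

    def forward(self, x):
        h, w = x.shape[2:]
        d1 = self.down1(x)   # 1/2 resolution
        d2 = self.down2(d1)  # 1/4 resolution
        d3 = self.down3(d2)  # 1/8 resolution
        # Fuse the pyramid top-down back to the input resolution.
        up = F.interpolate(d3, size=d2.shape[2:], mode='bilinear',
                           align_corners=False) + d2
        up = F.interpolate(up, size=d1.shape[2:], mode='bilinear',
                           align_corners=False) + d1
        att = F.interpolate(up, size=(h, w), mode='bilinear',
                            align_corners=False)
        out = self.mid(x) * att       # pixel-wise attention re-weighting
        return out + self.glob(x)     # add broadcast global context
```

In this sketch, feeding the encoder's 128-channel feature map (e.g. a 1×128×64×128 tensor) yields a C-channel score map at the same resolution, which the decoder then upsamples to the 1024×512×C output shown in Fig. 1.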
Table 2. Per-class IoU (%), class mIoU (Cla), and category mIoU (Cat) on CityScapes. Column abbreviations: Road (Roa), Sidewalk (Sid), Building (Bui), Wall (Wal), Fence (Fen), Pole (Pol), Traffic Light (TLi), Traffic Sign (TSi), Vegetation (Veg), Terrain (Ter), Sky, Pedestrian (Ped), Rider (Rid), Car, Truck (Tru), Bus, Train (Tra), Motorcycle (Mot), Bicycle (Bic).
Method Roa Sid Bui Wal Fen Pol TLi TSi Veg Ter Sky Ped Rid Car Tru Bus Tra Mot Bic Cla Cat
SegNet [7] 96.4 73.2 84.0 28.4 29.0 35.7 39.8 45.1 87.0 63.8 91.8 62.8 42.8 89.3 38.1 43.1 44.1 35.8 51.9 57.0 79.1
ENet [13] 96.3 74.2 75.0 32.2 33.2 43.4 34.1 44.0 88.6 61.4 90.6 65.5 38.4 90.6 36.9 50.5 48.1 38.8 55.4 58.3 80.4
ESPNet [20] 97.0 77.5 76.2 35.0 36.1 45.0 35.6 46.3 90.8 63.2 92.6 67.0 40.9 92.3 38.1 52.5 50.1 41.8 57.2 60.3 82.2
CGNet [25] 95.5 78.7 88.1 40.0 43.0 54.1 59.8 63.9 89.6 67.6 92.9 74.9 54.9 90.2 44.1 59.5 25.2 47.3 60.2 64.8 85.7
ERFNet [14] 97.2 80.0 89.5 41.6 45.3 56.4 60.5 64.6 91.4 68.7 94.2 76.1 56.4 92.4 45.7 60.6 27.0 48.7 61.8 66.3 85.2
ICNet [19] 97.1 79.2 89.7 43.2 48.9 61.5 60.4 63.4 91.5 68.3 93.5 74.6 56.1 92.6 51.3 72.7 51.3 53.6 70.5 69.5 86.4
Ours 97.1 78.6 90.4 46.5 48.1 60.9 60.4 71.1 91.2 60.0 93.2 74.3 51.8 92.3 61.0 72.4 51.0 43.3 70.2 69.2 86.8
Ours† 98.1 79.5 91.6 47.7 49.9 62.8 61.3 72.8 92.6 61.2 94.9 76.2 53.7 90.9 64.4 64.0 52.7 44.4 71.6 70.6 87.1
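The class and category mIoU columns follow the standard CityScapes protocol [26]: a confusion matrix is accumulated over all labeled pixels, per-class IoU is computed as TP/(TP+FP+FN), and the per-class scores are averaged. A minimal NumPy sketch of this metric (the helper names are ours, not from the paper):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_label=255):
    # Accumulate a num_classes x num_classes confusion matrix over flat
    # label arrays, skipping pixels marked with the ignore label.
    mask = gt != ignore_label
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def mean_iou(conf):
    # IoU_c = TP_c / (TP_c + FP_c + FN_c), then average over classes.
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean(), iou
```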
Fig. 3. Visual comparison on the CityScapes val set. From left to right: input images, ground truth, and segmentation outputs from SegNet [7], ENet [13], ERFNet [14], ESPNet [20], ICNet [19], CGNet [25], and our LEDNet. (Best viewed in color)
Table 2 and Table 3 report the comparison results, demonstrating that LEDNet achieves the best available trade-off between accuracy and efficiency. Among all the approaches, our LEDNet yields 70.6% class mIoU and 87.1% category mIoU, obtaining the best score in 13 of the 19 classes. Regarding efficiency, LEDNet is nearly 5× faster and 30× smaller than SegNet [7]. ENet [13], another efficient network, is 1.5× faster and has 3× fewer parameters, but its segmentation accuracy is about 10% lower than that of our LEDNet. Figure 3 shows some visual examples of segmentation outputs on the CityScapes validation set. Compared with the baselines, our LEDNet not only correctly classifies objects at different scales, but also produces consistent qualitative results across all classes.

4. CONCLUSION

This paper has presented LEDNet, an asymmetric encoder-decoder architecture for real-time semantic segmentation. The encoder adopts channel split and shuffle operations in its residual layers, enhancing information exchange in the manner of feature reuse. The decoder employs an APN, whose spatial pyramid structure enlarges the receptive field without introducing a significant computational budget. The entire network is trained end-to-end. The experimental results show that our LEDNet achieves the best trade-off on the CityScapes dataset in terms of segmentation accuracy and implementation efficiency. Future work includes decomposing the standard convolutions in the APN into 1D convolutions, yielding an even lighter network while maintaining segmentation accuracy.
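As an illustration of the encoder design summarized above, here is a minimal PyTorch sketch of a channel split-and-shuffle residual block with factorized 1D convolutions; the exact kernel sizes, dilation schedule, and normalization placement are our assumptions rather than the paper's precise layer specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # Reshape (N, C, H, W) -> (N, g, C//g, H, W), swap the group and
    # channel axes, and flatten back: channels from different groups end
    # up interleaved, enabling cross-branch information flow.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class SplitShuffleBlock(nn.Module):
    """Residual block with channel split, factorized 1D convs, and shuffle.

    A sketch in the spirit of LEDNet's encoder block; assumes an even
    channel count so the split produces two equal halves.
    """

    def __init__(self, channels, dilation=1):
        super().__init__()
        half = channels // 2

        def branch():
            # 3x3 convolutions decomposed into 3x1 and 1x3 pairs.
            return nn.Sequential(
                nn.Conv2d(half, half, (3, 1), padding=(1, 0)),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (1, 3), padding=(0, 1)),
                nn.BatchNorm2d(half),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (3, 1), padding=(dilation, 0),
                          dilation=(dilation, 1)),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (1, 3), padding=(0, dilation),
                          dilation=(1, dilation)),
                nn.BatchNorm2d(half),
            )

        self.left = branch()
        self.right = branch()

    def forward(self, x):
        l, r = x.chunk(2, dim=1)                        # channel split
        out = torch.cat([self.left(l), self.right(r)], dim=1)
        out = F.relu(out + x)                           # residual connection
        return channel_shuffle(out, groups=2)           # mix the two halves
```

Splitting halves the per-branch convolution cost, and the final shuffle lets the two halves exchange channels, so the split does not block information flow between branches.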
5. REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014, pp. 580–587.

[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.

[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.

[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE TPAMI, vol. 39, no. 4, pp. 640–651, 2017.

[5] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE TPAMI, vol. 40, no. 4, pp. 834–848, 2018.

[6] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in CVPR, 2017, pp. 5168–5177.

[7] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE TPAMI, vol. 39, no. 12, pp. 2481–2495, 2017.

[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017, pp. 6230–6239.

[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in ICML, 2015.

[10] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in ICLR, 2016.

[11] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in CVPR, 2016, pp. 5168–5177.

[12] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in ECCV, 2016.

[13] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.

[14] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE TITS, vol. 19, no. 1, pp. 263–272, 2018.

[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[16] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in NIPS, 2016, pp. 2074–2082.

[17] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in CVPR, 2015, pp. 806–814.

[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016, pp. 2818–2826.

[19] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” arXiv preprint arXiv:1704.08545, 2018.

[20] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” arXiv preprint arXiv:1803.06815, 2018.

[21] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in CVPR, 2018, pp. 6848–6856.

[22] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017, pp. 5987–5995.

[23] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.

[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.

[25] T. Wu, S. Tang, R. Zhang, and Y. Zhang, “Cgnet: A light-weight context guided network for semantic segmentation,” arXiv preprint arXiv:1811.08201, 2018.

[26] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016, pp. 3213–3223.