Networks such as ESPNetv2 [32], ESNet [14], and LEDNet [33] employ lightweight pyramidal multi-resolution strategies similar to our DAPF module. More specifically, LEDNet [33] and ESNet [14] incorporate asymmetric convolutions in their core modules. However, they rely heavily on this technique, which hurts inference speed in comparison to highly optimized standard convolutions [8]. Contrary to these methods, our DAPF and MDA modules utilize asymmetric convolutions only if dilation is applied concurrently, alleviating the reduction in inference speed caused by the atrous convolutions [5]. When compared to our proposals, LEDNet [33] and ESNet [14] are slower and much less accurate.

III. METHODOLOGY

Most of the existing state-of-the-art methods for semantic segmentation are built on top of high-performance baselines for image classification such as ResNet, WiderResNet, or Xception [1], [11], [16]–[18]. Following this trend, we adopt and extend the work of Chao et al. [7] by incorporating the DAPF and MDA modules. The proposed network structure is shown in Figure 2. In the figure, the stem convolution block consists of four consecutive convolution layers. The core element of all encoder and decoder blocks is the HarDBlock (Harmonic Dense Block), proposed in HarDNet [7]. Our proposed DAPF is placed at the end of the encoder, while MDA modules connect each decoder block with its corresponding encoder, in a U-shape fashion. Finally, the last block of the network consists of a single 1×1 convolution for making the final prediction. Bilinear upsampling is used to reestablish the original input size (1024×2048).

A. Network overview

HarDNet (Harmonic DenseNet) [7] is a recent state-of-the-art network inspired by DenseNet (Densely Connected Network) [34]. Compared to ResNet [35] and DenseNet [34], HarDNet achieves comparable accuracy with significantly lower GPU runtime for classification tasks. Its core component, the HarDBlock (Harmonic Dense Block), is specifically designed to address the problem of GPU memory traffic. The HarDBlock follows a concatenation scheme aimed at improving the throughput of the feature maps in the network, avoiding unnecessary DRAM (dynamic random-access memory) accesses. Additionally, the HarDBlock is optimized to increase the density of computation of its layers, defined as the number of Multiply-Accumulate operations (MACs) over the Convolutional Input/Output (CIO). These key improvements are based on the observation that when the density of computation is low, DRAM traffic can influence inference time more substantially than the model size and the number of operations.

Our baseline model, FC-HarDNet-70, is the implementation of HarDNet for the task of semantic segmentation. FC-HarDNet-70 is a U-shape-like architecture [9] with five encoder blocks and four decoder blocks (all of them HarDBlocks). The convolution layers in the last encoder stage of FC-HarDNet process a high number of feature maps at 1/64 of the input resolution, which is essential for classification tasks. However, we believe that this is not always the case for semantic segmentation tasks, where such small feature maps lose track of small objects present in the scene (e.g., for a 1024×2048 image and a downsampling rate of 64, the size becomes 16×32). In our FASSD-Net implementation, the last encoder block and the first decoder block of FC-HarDNet-70 are replaced with our DAPF module, so that the smallest feature maps processed by our network are 1/32 of the input resolution. Similarly, the FC-HarDNet-70 multi-resolution fusion scheme is substituted by our MDA module.

B. Dilated Asymmetric Pyramidal Fusion module

As shown in Figure 3, the ASPP module heavily relies on standard atrous convolutions and produces a fixed number of feature maps Q in each of its five pyramidal branches. The pyramidal branches consist of: 1× Conv 1×1, 1× Pooling + Conv 1×1, and 3× atrous Conv 3×3 with dilation rates r = 12, 24, and 36, respectively.

Our proposed DAPF module draws inspiration from the ASPP module, but with some key differences aimed at reducing its computational burden. Firstly, all 3×3 atrous convolutions are factorized into two consecutive 1D atrous convolutions, specifically, a 3×1 convolution followed by a 1×3 convolution. Secondly, the image pooling branch is removed, since it computes feature maps that might as well be learned through the 1×1 convolutional branch. Lastly, the number of feature maps generated by each pyramidal branch is no longer fixed. Instead, it is defined by the number of input feature maps of the module, K × 1/α, where α serves as the compression factor. In our implementation, we set α = 2. However, α can be adjusted to any other value according to the available computational budget. Figure 3 illustrates the differences between the DAPF and ASPP modules.

Undoubtedly, the major contributing factor to reducing the computational burden in our DAPF module is the use of asymmetric convolutions. For instance, in the pyramidal branches, for each standard 2D convolution we would have to perform K × d × d × F operations, where K is the number of input channels (feature maps), d is the kernel size, and F is the number of output channels. On the other hand, following our asymmetric strategy with α = 2, we perform (K × d × K/2) + (K/2 × d × K/2) operations. For a 3×3 kernel, this factorization strategy only requires 1/2 of the original number of operations, thus saving 50% of the needed computations and parameters in comparison to its non-asymmetric convolution equivalent. Moreover, factorization can also improve the learning capacity of the module as a result of the intermediate activation layers used between the two 1D convolutions [13].

C. Multi-resolution Dilated Asymmetric module

We design our MDA module to simultaneously exploit contextual information and recover spatial information. As shown in Figure 4, the two multi-resolution feature maps
Fig. 2. Block diagram of the proposed network. “s” indicates the downsampling rate of the feature maps with respect to the original input image (e.g., s=32
indicates output feature maps of size 32×64).
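To make the factorization described in Section III-B concrete, the following is a minimal PyTorch sketch of a single DAPF pyramidal branch: a 3×1 followed by a 1×3 atrous convolution, with the branch width compressed to K/α output channels. The class name, the BatchNorm/ReLU placement between the two 1D convolutions, and the test sizes are our assumptions for illustration, not the authors' released implementation.

```python
# Sketch of one DAPF pyramidal branch (Section III-B): factorized 3x1 + 1x3
# atrous convolutions with an intermediate activation, producing K/alpha maps.
import torch
import torch.nn as nn


class DAPFBranch(nn.Module):
    """One dilated asymmetric branch: Conv 3x1 -> Conv 1x3, both atrous."""

    def __init__(self, in_channels: int, dilation: int, alpha: int = 2):
        super().__init__()
        out_channels = in_channels // alpha  # K * 1/alpha feature maps
        self.branch = nn.Sequential(
            # 3x1 convolution: pad/dilate only along the height axis.
            nn.Conv2d(in_channels, out_channels, kernel_size=(3, 1),
                      padding=(dilation, 0), dilation=(dilation, 1), bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),  # intermediate activation between the 1D convs
            # 1x3 convolution: pad/dilate only along the width axis.
            nn.Conv2d(out_channels, out_channels, kernel_size=(1, 3),
                      padding=(0, dilation), dilation=(1, dilation), bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch(x)


if __name__ == "__main__":
    x = torch.randn(1, 224, 32, 64)           # encoder output size from Table II
    branch = DAPFBranch(in_channels=224, dilation=12)
    print(branch(x).shape)                    # torch.Size([1, 112, 32, 64])
    # Parameter check for the 50% saving claimed in Section III-B:
    # standard 3x3: 224*3*3*112 = 225,792 weights; factorized:
    # 224*3*112 + 112*3*112 = 112,896 weights (exactly half).
    print(sum(p.numel() for p in branch.parameters() if p.dim() == 4))
```

A full DAPF module would run several such branches (e.g., different dilation rates plus a 1×1 branch) in parallel and fuse their outputs, as sketched in Figure 3.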
TABLE I
ABLATION STUDY OF OUR PROPOSED MODULES ON THE CITYSCAPES VALIDATION SET.

Method             GFLOPs  No. Parameters  ∆p     FPS   mIoU
FC-HarDNet-70 [7]  35.4    4.10M           -      52.3  76.4
Baseline           32.9    1.90M           0M     56.3  75.2
+ ASPP             36.8    3.85M           1.95M  50.2  75.8
+ DAPF             33.9    2.36M           0.46M  53.9  77.7
+ MDA              44.2    2.38M           0.48M  42.2  77.4
+ ASPP + MDA       48.0    4.33M           2.43M  39.1  76.8
+ DAPF + MDA       45.1    2.85M           0.95M  41.1  78.2

TABLE II
FASSD-NET ARCHITECTURE. L DENOTES THE NUMBER OF CONVOLUTION LAYERS IN THE HARDBLOCK.

Stage        Name        Type                                  Output size
Input        -           -                                     1024×2048×3
Stem Conv    -           Conv 3×3 (s=2)*                       512×1024×16
                         Conv 3×3**                            512×1024×24
                         Conv 3×3 (s=2)                        256×512×32
                         Conv 3×3                              256×512×48
Encoder      Encoder B1  HarDBlock (L=4)                       256×512×64
             Encoder B2  2D Average Pooling, HarDBlock (L=4)   128×256×96
             Encoder B3  2D Average Pooling, HarDBlock (L=8)   64×128×160
             Encoder B4  2D Average Pooling, HarDBlock (L=8)   32×64×224
DAPF         -           -                                     32×64×224
Decoder      MDA         -                                     64×128×192
             Decoder B1  HarDBlock (L=8)                       64×128×160
             MDA         -                                     128×256×119
             Decoder B2  HarDBlock (L=4)                       128×256×78
             MDA         -                                     256×512×63
             Decoder B3  HarDBlock (L=4)                       256×512×48
Output Conv  -           Conv 1×1                              256×512×19
                         Upsampling ×4                         1024×2048×19
*Changes to (s=3) for FASSD-Net-L1
**Changes to (s=2) for FASSD-Net-L2

IV. EXPERIMENTS

We evaluate our proposed network architectures on the Cityscapes benchmark [36]. Performance is mainly measured in mean Intersection over Union accuracy (mIoU) and Frames per Second (FPS). In addition, we report the number of parameters and the computational complexity in GFLOPs. All experiments are conducted using the publicly available Cityscapes dataset [36]. This dataset consists of 5,000 finely annotated 1024×2048 images: 2,975 for training, 1,525 for testing, and another 500 images for validation. Additionally, 19,998 images with coarse annotations are also provided. For a fair comparison, we only use the finely annotated images and do not employ any augmentation technique for testing, such as multi-scale or multi-crop, which increases the accuracy at the cost of inference time.

A. Implementation Details

We use PyTorch 1.0 with CUDA 10.2 for all experiments. The same training setting is used for all models, where Stochastic Gradient Descent (SGD) with weight decay 5 × 10^-4 and momentum 0.9 is used as the optimizer. We employ the "poly" learning rate strategy lr = initial_lr × (1 − iter/total_iter)^0.9, with an initial learning rate of 0.02. Cross-entropy loss is computed following the online bootstrapping strategy [37]. Data augmentation consists of random horizontal flip, random scaling in the range [0.5, 2], and random cropping with a 1024×1024 crop size. We train all models for 90k iterations with batch size 16. For the final models, we follow the same training protocol for 30k more iterations, setting the batch size to 24 and the initial learning rate to 0.001.

All networks are pre-trained on the ImageNet dataset [38], and the inference speed (in FPS) is measured on an Intel Core i7-9700K desktop with one NVIDIA GTX 1080ti card unless specified otherwise. For all experiments, the speed is calculated from the average FPS rate over 10,000 iterations measured on images of size 1024×2048×3.

B. Ablation Study

We show the performance comparison between our proposed DAPF and DeepLabV3+'s ASPP module [11]. We also evaluate the effectiveness of our MDA module and its combination with DAPF and ASPP. Table I summarizes the corresponding results. Baseline denotes the modified FC-HarDNet-70, where the last encoder and the first decoder blocks are removed, as previously described in Section III. Note that, for a fair comparison, all methods shown in Table I are trained without further fine-tuning.

Our DAPF module outperforms DeepLabV3+'s ASPP in all four metrics (GFLOPs, parameters, FPS, and mIoU). In addition, the increase in parameters (∆p) introduced by DAPF over the Baseline is significantly smaller than that of the ASPP module (0.46M vs 1.95M). In summary, our module is more than four times lighter than ASPP and yields a 1.9% higher mIoU.

It can be observed that the addition of our MDA strategy improves the mIoU of the baseline network to almost the same degree as DAPF. Likewise, our MDA strategy consumes roughly the same number of parameters. However, since MDA is used at three different levels of the decoder, the FPS drop becomes evident compared to DAPF.

The combination of our two proposals achieves the best mIoU accuracy. Specifically, it outperforms the baseline and the state-of-the-art results of FC-HarDNet-70 [7] by 3% and 1.8%, respectively. Additionally, the increase in parameters (∆p) of our proposal is less than half of that observed when ASPP is used. On top of that, the total number of parameters is significantly lower than that needed by FC-HarDNet-70 (2.85M vs 4.10M). Note that Baseline + DAPF + MDA corresponds to our final model, called FASSD-Net, as shown in Figure 2.
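For concreteness, the training protocol described in Section IV-A corresponds roughly to the PyTorch setup sketched below: SGD with momentum and weight decay, and the "poly" learning-rate schedule. The model, data loader, and function names are placeholders of ours, and the loss here is plain cross-entropy rather than the online bootstrapping variant; this is a sketch, not the authors' training script.

```python
# Rough sketch of the Section IV-A training setup: SGD (momentum 0.9,
# weight decay 5e-4) and the "poly" policy
#   lr = initial_lr * (1 - iter / total_iter) ** 0.9.
import torch

INITIAL_LR = 0.02
TOTAL_ITERS = 90_000  # 90k iterations, batch size 16 (Section IV-A)


def poly_lr(iteration: int, initial_lr: float = INITIAL_LR,
            total_iters: int = TOTAL_ITERS, power: float = 0.9) -> float:
    """Polynomial learning-rate decay used in the paper."""
    return initial_lr * (1.0 - iteration / total_iters) ** power


def train(model: torch.nn.Module, train_loader, device: str = "cuda") -> None:
    optimizer = torch.optim.SGD(model.parameters(), lr=INITIAL_LR,
                                momentum=0.9, weight_decay=5e-4)
    # Plain cross-entropy; the paper additionally uses online bootstrapping
    # (hard-pixel mining) [37]. 255 is the usual Cityscapes ignore label.
    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
    iteration = 0
    while iteration < TOTAL_ITERS:
        for images, labels in train_loader:
            if iteration >= TOTAL_ITERS:
                break
            for group in optimizer.param_groups:
                group["lr"] = poly_lr(iteration)  # update lr every iteration
            logits = model(images.to(device))
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            iteration += 1
```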
TABLE III
PER-CLASS mIoU SCORE COMPARISON OF OUR PROPOSALS ON THE CITYSCAPES VALIDATION SET.
Method road s.walk build. wall fence pole t.light t.sign veg. terr. sky person rider car truck bus train mbik bike mIoU
FC-HarDNet-70 [7] 98.1 84.6 92.6 60.0 63.5 64.9 69.7 78.8 92.2 62.8 95.0 81.4 60.7 95.0 73.9 82.4 77.4 61.7 76.2 77.4
FASSD-Net 98.3 86.0 92.9 59.6 63.9 67.1 71.7 79.6 92.5 63.7 95.0 82.1 63.2 95.4 80.8 87.8 79.6 61.3 76.8 78.8
FASSD-Net-L1 98.2 84.6 92.4 54.7 61.3 63.3 68.2 77.1 92.1 61.8 94.9 80.0 59.8 94.9 76.0 82.7 74.7 59.1 74.6 76.3
FASSD-Net-L2 97.9 83.1 91.6 55.6 57.0 58.0 62.5 71.8 91.7 63.0 94.4 77.3 57.0 93.8 75.5 81.9 71.1 53.1 70.8 74.1
Fig. 5. Qualitative results of the proposed networks. Regions of improvement are highlighted with yellow squares.
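For reference, the per-class scores in Table III and the mIoU values reported throughout follow the standard intersection-over-union definition. Below is a small sketch of computing per-class IoU and mIoU from an accumulated confusion matrix; it is our illustration of the metric, not the official Cityscapes evaluation script, and assumes predictions already mapped to the 19 train IDs with 255 as the ignore label.

```python
# Sketch of the mIoU metric: per-class IoU = TP / (TP + FP + FN) accumulated
# over the whole validation set, mIoU = mean over the 19 Cityscapes classes.
import numpy as np

NUM_CLASSES = 19
IGNORE_LABEL = 255


def update_confusion(conf: np.ndarray, pred: np.ndarray, gt: np.ndarray) -> None:
    """Accumulate a NUM_CLASSES x NUM_CLASSES confusion matrix in place."""
    mask = gt != IGNORE_LABEL
    idx = NUM_CLASSES * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES,
                                                                 NUM_CLASSES)


def per_class_iou(conf: np.ndarray) -> np.ndarray:
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as class c but labeled otherwise
    fn = conf.sum(axis=1) - tp   # labeled as class c but predicted otherwise
    return tp / np.maximum(tp + fp + fn, 1)


# Usage: accumulate over all validation images, then report 100 * IoU.
conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
# for pred, gt in predictions: update_confusion(conf, pred, gt)
iou = per_class_iou(conf)
print("mIoU (%):", 100.0 * iou.mean())
```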
C. Light variations of FASSD-Net

In addition to our network FASSD-Net, we introduce two light versions designed to maintain a better tradeoff between speed and accuracy. We call these networks FASSD-Net-L1 and FASSD-Net-L2.

Table II shows the detailed architecture of FASSD-Net, including the output size and number of channels for each element. From Table II, FASSD-Net-L1 differs only in the first convolution layer, where the convolution stride is increased from 2 to 3. Such a modification preserves the same number of parameters of the network and leads to a faster inference speed at the cost of a small drop in accuracy. Specifically, it is 1.9× faster and 2.5% less accurate. FASSD-Net-L2, on the other hand, is designed to be the fastest among our three proposals. It adopts an additional convolution stride of 2 in the second convolution layer of the stem block. In addition, all the HarDBlocks in the decoder are replaced by conventional 3×3 convolution layers of 64 channels. Thus, FASSD-Net-L2 is 3.2× faster than FASSD-Net with only a 4.7% drop in accuracy.

We report per-class mIoU accuracy in Table III. FASSD-Net obtains the highest score in 17 out of 19 categories, outperforming the current state-of-the-art model, FC-HarDNet-70. The most significant improvements occur in the truck and bus classes, with 6.9% and 5.4%, respectively. Similarly, FASSD-Net-L1 and FASSD-Net-L2 also obtain better results overall in these two classes compared to FC-HarDNet-70, despite being less accurate models. Qualitative results of our FASSD-Net variations are shown in Figure 5. As the quantitative results suggest, the most significant improvements occur on pixels belonging to large objects, such as trucks and buses. Compared to FC-HarDNet-70, all three FASSD-Net variations better differentiate between the car, bus, and truck classes. For a fair comparison, we have conducted the evaluation of our final models against FC-HarDNet-70 with its official weights from its open-source implementation¹.

D. Comparison with state-of-the-art real-time methods

Table IV shows the overall comparison of our network proposals versus other state-of-the-art methods for real-time semantic segmentation. The table is divided into three categories, based on the inference speed directly comparable to each of our three proposed networks.

¹ github.com/PingoLH/FCHarDNet
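To illustrate how the light variants of Section IV-C differ in the stem, the sketch below parameterizes the four stem convolutions by their strides, using the channel widths from Table II. The class structure, the exact stride combinations, and our reading of Table II's footnotes (L1 changes only the first stride to 3; L2 instead adds a stride of 2 to the second convolution) are our assumptions, not the released code.

```python
# Sketch of the four-layer stem from Table II, with the stride changes that
# distinguish the light variants. Helper and variant names are ours.
import torch.nn as nn


def conv_bn_relu(cin: int, cout: int, stride: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


STEM_STRIDES = {
    "FASSD-Net":    (2, 1, 2, 1),  # baseline stem, total downsampling x4
    "FASSD-Net-L1": (3, 1, 2, 1),  # first conv: s=2 -> s=3 (Table II, *)
    "FASSD-Net-L2": (2, 2, 2, 1),  # second conv gains s=2 (Table II, **)
}


def build_stem(variant: str = "FASSD-Net") -> nn.Sequential:
    s1, s2, s3, s4 = STEM_STRIDES[variant]
    widths = (3, 16, 24, 32, 48)  # channel progression from Table II
    return nn.Sequential(
        conv_bn_relu(widths[0], widths[1], s1),
        conv_bn_relu(widths[1], widths[2], s2),
        conv_bn_relu(widths[2], widths[3], s3),
        conv_bn_relu(widths[3], widths[4], s4),
    )
```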
TABLE IV
COMPARISON BETWEEN STATE-OF-THE-ART NETWORKS FOR REAL-TIME SEMANTIC SEGMENTATION.
Method Input Size GPU GFLOPs No. Parameters FPS FPS (norm.) mIoU
DeepLabV3+ [11] 512×1024 Titan X (P) - - ≈1 ≈1 79.6
PSPNet [15] 713×713 Titan X (P) - - <1 <1 79.7
Fast-SCNN [25] 1024×2048 Titan XP - 1.11M 123.5 110.3 69.2
SwiftNetRN-18 [8] 512×1024 GTX 1080ti 26.0 11.8M 134.9 134.9 70.2
FC-HarDNet-70 (L2) 1024×2048 GTX 1080ti 6.6 2.86M 152.8 152.8 72.1
FASSD-Net-L2 (ours) 1024×2048 GTX 1080ti 8.7 2.3M 133.1 133.1 74.1
ERFNet [13] 512×1024 Titan X (M) - 2.10M 41.7 68.4 71.5
CAS [28] 768×1536 Titan XP - - 108.0 96.4 71.6
DF1-Seg-d8 [26] 1024×2048 GTX 1080ti - - 136.9* 82.9 72.4
FasterSeg [12] 1024×2048 GTX 1080ti 28.2 4.4M 163.9* 99.3 73.1
FC-HarDNet-70 (L1) 1024×2048 GTX 1080ti 15.7 4.1M 93.4 93.4 74.8
FASSD-Net-L1 (ours) 1024×2048 GTX 1080ti 20.0 2.85M 78.0 78.0 76.3
ICNet [2] 1024×2048 Titan X (M) 28.3 26.5M 30.3 49.7 67.7
DABNet [5] 1024×2048 GTX 1080ti 41.8 0.76M 27.7 27.7 69.1
GUN [39] 512×1024 Titan XP - - 33.3 29.7 69.6
BiSeNet [3] 768×1536 Titan XP - 49M 65.5 58.5 74.8
SwiftNetRN-18 [8] 1024×2048 GTX 1080ti 104.0 11.8M 39.9 39.9 75.4
FC-HarDNet-70 [7] 1024×2048 GTX 1080ti 35.4 4.1M 53.0 53.0 77.4
FASSD-Net (ours) 1024×2048 GTX 1080ti 45.1 2.85M 41.1 41.1 78.8
* Speed measured with TensorRT acceleration
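The FPS (norm.) column in Table IV rescales each reported speed to a GTX 1080ti equivalent using the per-GPU factors given in the following paragraphs, and additionally removes the assumed 1.65× TensorRT acceleration where it applies. The small sketch below shows this normalization; the helper name and dictionary keys are ours, and the example values reproduce two table entries only approximately.

```python
# Sketch of the FPS (norm.) estimate in Table IV: measured FPS is divided by
# a per-GPU scaling factor (relative to a GTX 1080ti) and, when a method
# reports TensorRT-accelerated speed, by an additional 1.65x factor.
GPU_FACTOR = {
    "GTX 1080ti": 1.00,
    "Titan X (Maxwell)": 0.61,
    "Titan X (Pascal)": 1.03,
    "Titan XP": 1.12,
}
TENSORRT_FACTOR = 1.65


def normalized_fps(fps: float, gpu: str, tensorrt: bool = False) -> float:
    """Estimate the FPS the method would reach on a GTX 1080ti without TensorRT."""
    fps /= GPU_FACTOR[gpu]
    if tensorrt:
        fps /= TENSORRT_FACTOR
    return fps


# Examples reproducing Table IV entries (approximately):
print(round(normalized_fps(41.7, "Titan X (Maxwell)"), 1))           # ~68.4 (ERFNet)
print(round(normalized_fps(163.9, "GTX 1080ti", tensorrt=True), 1))  # ~99.3 (FasterSeg)
```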
For a more complete and fair comparison with respect to our baseline, FC-HarDNet-70 (L1) and FC-HarDNet-70 (L2) are our modified implementations of FC-HarDNet-70 that closely resemble our two light networks, following the same changes and the same training protocol. FC-HarDNet-70 (L1) and FC-HarDNet-70 (L2) are directly modified from the official source code of FC-HarDNet-70.

For a fair comparison under different GPU architectures, we follow the same protocol as Orsic et al. [8] and let the column FPS (norm.) in Table IV provide a speed estimate of each model running on a GTX 1080ti GPU. The scaling factors are: 1.0 for GTX 1080ti, 0.61 for Titan X (Maxwell), 1.03 for Titan X (Pascal), and 1.12 for Titan XP.

Our main network, FASSD-Net, surpasses by a considerable margin the mIoU score of all other methods for real-time semantic segmentation, requiring 1.44× fewer parameters and being 1.4% more accurate than the closest competitor, FC-HarDNet-70.

Our second network, FASSD-Net-L1, resembles BiSeNet in mIoU accuracy and FPS. However, the speed of BiSeNet was originally measured on 768×1536 images with an NVIDIA Titan XP card. For a fair comparison, and according to Zhuang et al. [40], we let its speed be 37 FPS when evaluated on 1024×2048 images with an NVIDIA GTX 1080ti card. This results in our network being about 2.1× faster, 1.5% more accurate, and requiring 17.2× fewer parameters.

Similarly, FASSD-Net-L2 can be compared to FasterSeg and DF1-Seg-d8, which were designed and optimized by NAS methodologies. Both methods utilize TensorRT acceleration [41] to increase their speed performance. For a fair comparison, we let 1.65× be the acceleration factor of TensorRT. This value is approximated from works that present results with and without TensorRT, such as FasterSeg [12]. Under these assumptions, our FASSD-Net-L2 is faster than FasterSeg and DF1-Seg-d8, while being 1% and 1.7% more accurate, respectively.

V. CONCLUSION

In this paper, we focus on reducing the accuracy gap between real-time and non-real-time semantic segmentation networks. For this purpose, we have proposed the DAPF and MDA modules. These modules exploit contextual information in several stages of the decoder and boost the accuracy of the baseline network while keeping a relatively low computational cost. Using our two modules jointly, we have designed three network variations that can be chosen depending on the computational budget. Our main network, FASSD-Net, sets a new state-of-the-art mIoU accuracy for real-time semantic segmentation on the Cityscapes validation set. In addition, our proposed FASSD-Net-L2 ranks as the fastest network when evaluated on 1024×2048 images without using additional network acceleration techniques. As future work, we would like to evaluate our proposals in different scenarios, such as indoor understanding or medical images. We would also like to apply acceleration techniques, such as network quantization or network distillation, to further increase the speed of our models.

ACKNOWLEDGMENT

The authors would like to thank the University of Electro-Communications, the National Polytechnic Institute ESIME-Culhuacan, and CONACyT (Mexico) for providing financial support for the development of this work. This work was also supported by JSPS KAKENHI Grant Numbers 15H05915, 17H01745, 17H06100, and 19H04929.
REFERENCES

[1] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-Cross Attention for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[2] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for Real-Time Semantic Segmentation on High-Resolution Images," in The European Conference on Computer Vision (ECCV), September 2018.
[3] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation," in The European Conference on Computer Vision (ECCV), September 2018.
[4] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, "A survey on deep learning techniques for image and video semantic segmentation," Applied Soft Computing, vol. 70, pp. 41–65, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494618302813
[5] G. Li and J. Kim, "DABNet: Depth-wise Asymmetric Bottleneck for Real-time Semantic Segmentation," in British Machine Vision Conference, 2019.
[6] H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[7] P. Chao, C.-Y. Kao, Y.-S. Ruan, C.-H. Huang, and Y.-L. Lin, "HarDNet: A Low Memory Traffic Network," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[8] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In Defense of Pre-Trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[10] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1511.07122
[11] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," arXiv e-prints, p. arXiv:1802.02611, Feb 2018.
[12] W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang, "FasterSeg: Searching for Faster Real-time Semantic Segmentation," in International Conference on Learning Representations, 2020.
[13] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, pp. 263–272, 2018.
[14] Y. Wang, Q. Zhou, and X. Wu, "ESNet: An Efficient Symmetric Network for Real-time Semantic Segmentation," arXiv e-prints, p. arXiv:1906.09826, Jun 2019.
[15] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[16] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[17] Z. Tian, T. He, C. Shen, and Y. Yan, "Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[18] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, "Expectation-Maximization Attention Networks for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[19] Y. Yuan, X. Chen, and J. Wang, "Object-contextual representations for semantic segmentation," arXiv, vol. abs/1909.11065, 2019.
[20] R. P. K. Poudel, U. Bonde, S. Liwicki, and C. Zach, "ContextNet: Exploring context and detail for semantic segmentation in real-time," in BMVC, 2018.
[21] F. Chollet, "Xception: Deep Learning With Depthwise Separable Convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv e-prints, p. arXiv:1704.04861, Apr 2017.
[23] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[24] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, "Searching for MobileNetV3," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] R. P. Poudel, S. Liwicki, and R. Cipolla, "Fast-SCNN: Fast semantic segmentation network," arXiv preprint arXiv:1902.04502, 2019.
[26] X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial Order Pruning: For Best Speed/Accuracy Trade-Off in Neural Architecture Search," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[27] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[28] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, "Customizable Architecture Search for Semantic Segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[29] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, "Improving neural network quantization without retraining using outlier channel splitting," in International Conference on Machine Learning, 2019, pp. 7543–7552.
[30] X. Zheng, R. Ji, L. Tang, Y. Wan, B. Zhang, Y. Wu, Y. Wu, and L. Shao, "Dynamic distribution pruning for efficient network architecture search," arXiv preprint arXiv:1905.13543, 2019.
[31] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware," in International Conference on Learning Representations, 2019. [Online]. Available: https://arxiv.org/pdf/1812.00332.pdf
[32] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, "ESPNetv2: A Light-Weight, Power Efficient, and General Purpose Convolutional Neural Network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[33] Y. Wang, Q. Zhou, J. Liu, J. Xiong, G. Gao, X. Wu, and L. J. Latecki, "LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation," in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 1860–1864.
[34] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[36] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[37] Z. Wu, C. Shen, and A. van den Hengel, "High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks," CoRR, vol. abs/1604.04339, 2016. [Online]. Available: http://arxiv.org/abs/1604.04339
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[39] D. Mazzini, "Guided Upsampling Network for Real-Time Semantic Segmentation," arXiv e-prints, p. arXiv:1807.07466, Jul 2018.
[40] J. Zhuang, J. Yang, L. Gu, and N. Dvornek, "ShelfNet for Fast Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[41] NVIDIA, "TensorRT," September 2010, [Online; accessed February 14, 2020]. [Online]. Available: https://developer.nvidia.com/tensorrt