
2020 25th International Conference on Pattern Recognition (ICPR)

Milan, Italy, Jan 10-15, 2021

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias∗, Gibran Benitez-Garcia†, José Portillo-Portillo∗, Gabriel Sánchez-Pérez∗ and Keiji Yanai†

∗Instituto Politecnico Nacional, ESIME Culhuacan, Mexico City, Mexico
Email: [email protected], [email protected], [email protected]

†Department of Informatics, The University of Electro-Communications, Tokyo, Japan
Email: [email protected], [email protected]

Abstract—Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed at designing a high-performance decoder for real-time semantic segmentation, reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. Our second module, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network, entitled "FASSD-Net", reaches 78.8% mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full-resolution images (1024×2048). Besides, with a light version of our network, we reach 74.1% mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at github.com/GibranBenitez/FASSD-Net.

Fig. 1. Speed and accuracy comparison between state-of-the-art methods for real-time semantic segmentation on the Cityscapes validation set. For a fair comparison, the speeds of methods marked by (*) are approximated without TensorRT acceleration. PSPNet and FC-HarDNet-L2 speeds are placed on the x-axis edges for the sake of better visualization.

I. INTRODUCTION

The research of semantic segmentation is considered a fundamental task in computer vision [1]–[3]. It aims to assign semantic class labels to each pixel in a given input image. In recent years, due to the development of new deep learning techniques, semantic segmentation has been widely applied to several challenging fields, including autonomous driving, robot sensing, medical imaging, augmented reality, and video surveillance, to name a few [4].

Some applications require the inference speed to be as fast as possible with the maximum possible accuracy. Furthermore, according to the available hardware or budget, factors such as low energy consumption and memory usage also become crucial [5]. In particular, applications such as autonomous driving require keeping a balance between high-accuracy prediction and low inference time to be able to take action rapidly [3], [6]. However, speed and accuracy are two factors that seemingly contradict each other, making real-time semantic segmentation a challenging task [2].

In order to increase the accuracy performance, some state-of-the-art networks for real-time semantic segmentation [6]–[8] use U-shape-like architectures [9] to recover hierarchical features from previous stages of the network. Nevertheless, their accuracy performance is still significantly lower in comparison to non-real-time semantic segmentation networks.

In either the encoder or decoder stages of the network, one common way to achieve a further increment in accuracy is leveraging the use of dilated (atrous) convolutions, which enlarge the receptive field of the convolution kernel [10]. The problem is that methods designed to exploit this property require a considerable number of floating-point operations, such as the Atrous Spatial Pyramid Pooling (ASPP) module [11]. On top of that, the ASPP module heavily relies on dilated convolutions, which by themselves are slow to compute due to framework optimization constraints [8].



The slow-down caused by the use of dilated convolutions can be alleviated by implementing convolution factorization, as demonstrated in [5]. Therefore, in this paper, we introduce two modules designed to increase the prediction accuracy by exploiting contextual information in multiple stages of the network. We name these two modules the Dilated Asymmetric Pyramidal Fusion (DAPF) module and the Multi-resolution Dilated Asymmetric (MDA) module, respectively.

In order to increase the kernel receptive field while keeping the computational complexity low, we carefully employ 3×3 dilated convolutions factorized into two consecutive 1D dilated convolutions. Hereafter, we refer to this type of convolution as dilated asymmetric convolutions, due to the asymmetric nature of the convolution kernel and the dilation implementation.
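To make the idea concrete, the following is a minimal PyTorch sketch of such a dilated asymmetric convolution. This is our illustration rather than the authors' exact implementation; in particular, the placement of batch normalization and ReLU after each 1D convolution is our assumption.

```python
import torch
import torch.nn as nn

class DilatedAsymmetricConv(nn.Module):
    """A 3x3 dilated convolution factorized into a 3x1 and a 1x3
    dilated convolution (a sketch of the idea described above)."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # 3x1: dilate and pad only along the height axis
        self.conv3x1 = nn.Conv2d(in_ch, out_ch, (3, 1),
                                 padding=(dilation, 0),
                                 dilation=(dilation, 1), bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        # 1x3: dilate and pad only along the width axis
        self.conv1x3 = nn.Conv2d(out_ch, out_ch, (1, 3),
                                 padding=(0, dilation),
                                 dilation=(1, dilation), bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv3x1(x)))
        return self.relu(self.bn2(self.conv1x3(x)))
```

The padding mirrors the dilation on each axis, so spatial resolution is preserved and the block is a drop-in replacement for a standard 3×3 dilated convolution.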
Our DAPF module is designed to significantly increase the receptive field of the last stage of the encoder network, obtaining richer contextual features. We follow a pyramidal scheme similar to DeepLabV3+'s ASPP module [11], but all 3×3 dilated convolutions are replaced with dilated asymmetric convolutions. Our design allows the number of pyramidal feature maps of DAPF to vary according to the number of input feature maps, which further reduces the computational complexity of the module.

Similarly, our MDA module fuses multi-resolution feature maps coming from previous encoder and decoder levels of the network. In this module, feature maps are processed simultaneously by two parallel branches: the asymmetric branch and the non-asymmetric branch. The asymmetric convolutional branch exploits the contextual information of the input feature maps, whereas the non-asymmetric branch focuses on recovering details. This design allows a simultaneous refinement of detail and contextual information in multiple stages of the decoder.

Our proposed network, entitled FASSD-Net, combines the DAPF and MDA modules and effectively increases the semantic segmentation accuracy with a relatively low computational cost. Additionally, we present two light variations that provide a balanced trade-off between accuracy and inference speed, namely, FASSD-Net-L1 and FASSD-Net-L2.

Overall, our proposed networks bridge the accuracy gap between existing real-time (around 70% mIoU) [2], [5], [6], [8], [12]–[14] and non-real-time networks (about 80% mIoU) for semantic segmentation [1], [11], [15]–[20]. This new gap is illustrated in Figure 1.

Our main contributions are summarized as follows:
• We introduce DAPF, an efficient plug-and-play spatial pyramidal fusion module inspired by ASPP [11]. DAPF demands far less computational complexity, which enables its use for real-time applications.
• We introduce the MDA module, which allows better learning from two different stages of the network, simultaneously refining spatial and contextual information.
• As shown in Figure 1, our proposed networks obtain state-of-the-art mIoU results on the Cityscapes validation set for the task of real-time semantic segmentation. Furthermore, FASSD-Net is comparable to non-real-time methods such as DeepLabV3+ [11] and PSPNet [15] in terms of accuracy, while being about 40× faster.

II. RELATED WORK

Models such as PSPNet [15] and DeepLabV3+ [11] exploit contextual information by processing the same set of feature maps. PSPNet [15] does it by downsampling the feature maps at four different rates, performing a series of convolutions on them, and finally performing a fusion process. Likewise, DeepLabV3+ [11] processes the feature maps by applying atrous convolutions at different rates. These models have achieved top results on several segmentation benchmarks by leveraging the use of multi-scale information in a pyramidal fashion. However, even on modern GPUs, the computational resources required by these methods are prohibitive [13], making them unfeasible for real-time applications. In contrast, our proposed FASSD-Net follows the same pyramidal strategy used in DeepLabV3+ [11] but with dilated asymmetric convolutions, allowing its use in real-time applications.

In addition, fusion strategies for multi-resolution feature maps have been used in recent works such as HarDNet [7], SwiftNet [8] and FasterSeg [12]. These networks either concatenate or add two sets of feature maps and further process them with a single convolution. This straightforward fusion strategy requires a small amount of computation but does not exploit contextual information. In contrast, our MDA module concatenates two sets of feature maps and processes them with two parallel branches, simultaneously refining features rich in detail information and features rich in context.

On the other hand, techniques for reducing the computational complexity of networks, such as depth-wise separable convolutions [21]–[25], zoomed convolutions [12], or convolution factorization [5], [13], [14], have been proposed and applied to the task of real-time semantic segmentation. Networks that employ these techniques, such as Fast-SCNN [25], ERFNet [13], and FasterSeg [12], achieve real-time performance, usually at the cost of significantly lower accuracy compared to non-real-time methods. Similarly, our three network proposals rely on the factorized (asymmetric) convolutions used in the DAPF and MDA modules. However, as shown in Figure 1, our proposed networks outperform Fast-SCNN [25], ERFNet [13] and FasterSeg [12] in terms of accuracy, keeping comparable speed performance.

Additional methods for accelerating neural networks include filter or channel pruning, network distillation, Neural Architecture Search (NAS), and neural network quantization [12], [26]–[29]. Such methods mainly reduce the number of parameters and the weight of the model, boosting the inference speed. However, most of them either utilize sophisticated methodologies, require a considerable amount of memory, or cannot be directly applied to more elaborate network architectures [30], [31]. Specifically, FasterSeg [12] and CAS [28] obtain state-of-the-art inference speed by utilizing NAS techniques. By comparison, our proposed models are designed manually and still outperform these NAS networks in accuracy and speed.
Networks such as ESPNetv2 [32], ESNet [14], and LEDNet [33] employ lightweight pyramidal multi-resolution strategies similar to our DAPF module. More specifically, LEDNet [33] and ESNet [14] incorporate asymmetric convolutions in their core modules. However, they heavily rely on this technique, which hurts the inference speed in comparison to highly optimized standard convolutions [8]. Contrary to these methods, our DAPF and MDA modules utilize asymmetric convolutions only if dilation is applied concurrently, alleviating the slow-down caused by the atrous convolutions [5]. When compared to our proposals, LEDNet [33] and ESNet [14] are slower and much less accurate.

III. METHODOLOGY

Most of the existing state-of-the-art methods for semantic segmentation are built on top of high-performance baselines for image classification such as ResNet, WiderResNet, or Xception [1], [11], [16]–[18]. Following this trend, we adopt and extend the work of Chao et al. [7] by incorporating the DAPF and MDA modules. The proposed network structure is shown in Figure 2. In the figure, the stem convolution block consists of four consecutive convolution layers. The core element of all encoder and decoder blocks is the HarDBlock (Harmonic Dense Block), proposed in HarDNet [7]. Our proposed DAPF is placed at the end of the encoder, while MDA modules connect each decoder block with its corresponding encoder, in a U-shape fashion. Finally, the last block of the network consists of a single 1×1 convolution for making the final prediction. Bilinear upsampling is used to reestablish the original input size (1024×2048).

A. Network overview

HarDNet (Harmonic DenseNet) [7] is a recent state-of-the-art network inspired by DenseNet (Densely Connected Network) [34]. Compared to ResNet [35] and DenseNet [34], HarDNet achieves comparable accuracy with significantly lower GPU runtime for classification tasks. Its core component, the HarDBlock (Harmonic Dense Block), is specifically designed to address the problem of GPU memory traffic. The HarDBlock follows a concatenation scheme aimed at improving the throughput of the feature maps in the network, avoiding unnecessary DRAM (dynamic random-access memory) accesses. Additionally, the HarDBlock is optimized to increase the density of computations of the layers, defined as the number of Multiply-Accumulate operations (MACs) over the Convolutional Input/Output (CIO). These key improvements are based on the observation that when the density of computation is low, DRAM traffic can influence inference time more substantially than the model size and the number of operations.

Our baseline model, FC-HarDNet-70, is the implementation of HarDNet for the task of semantic segmentation. FC-HarDNet-70 is a U-shape-like architecture [9] with five encoder blocks and four decoder blocks (all of them HarDBlocks). The convolution layers in the last encoder stage of FC-HarDNet process a high number of feature maps at 1/64 of the input resolution, which is essential for classification tasks. However, we believe that this is not always the case for semantic segmentation tasks, where such small feature maps lose track of small objects present in the scene (e.g., for a 1024×2048 image and a downsampling rate of 64, the size becomes 16×32). In our FASSD-Net implementation, the last encoder block and the first decoder block of FC-HarDNet-70 are replaced with our DAPF module, so that the smallest feature maps processed by our network are 1/32 of the input resolution. Similarly, the FC-HarDNet-70 multi-resolution fusion scheme is substituted by our MDA module.

B. Dilated Asymmetric Pyramidal Fusion module

As shown in Figure 3, the ASPP module heavily relies on standard atrous convolutions and produces a fixed number of feature maps Q in each of its five pyramidal branches. The pyramidal branches consist of: 1× Conv 1×1, 1× Pooling + Conv 1×1, and 3× atrous Conv 3×3 with dilation rates r = 12, 24, and 36, respectively.

Our proposed DAPF module draws inspiration from the ASPP module, but with some key differences aimed at reducing its computational burden. Firstly, all 3×3 atrous convolutions are factorized into two consecutive 1D atrous convolutions, specifically, a 3×1 convolution followed by a 1×3 convolution. Secondly, the image pooling branch is removed, since it computes feature maps that might as well be learned through the 1×1 convolutional branch. Lastly, the number of feature maps generated by each pyramidal branch is no longer fixed. Instead, it is defined by the number of input feature maps of the module, K × 1/α, where α serves as a compression factor. In our implementation, we set α = 2. However, α can be adjusted to any other value according to the available computational budget. Figure 3 illustrates the differences between the DAPF and ASPP modules.
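A minimal sketch of the DAPF topology, as we read it from the description above and Figure 3 (this is not the official code): we assume DAPF keeps the three ASPP dilation rates r = 12, 24, 36, that each branch emits K/α channels, and that a final 1×1 convolution fuses the concatenated pyramid back to K channels, which matches the DAPF output size in Table II (32×64×224). `DilatedAsymmetricConv` is the block sketched earlier.

```python
class DAPF(nn.Module):
    """Dilated Asymmetric Pyramidal Fusion (sketch). One 1x1 branch
    plus three dilated asymmetric branches; no image-pooling branch."""
    def __init__(self, in_ch, alpha=2, rates=(12, 24, 36)):
        super().__init__()
        mid_ch = in_ch // alpha  # branch width follows the input: K * 1/alpha
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [DilatedAsymmetricConv(in_ch, mid_ch, r) for r in rates])
        # fuse the concatenated pyramid back to the input width
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch * (1 + len(rates)), in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [self.branch1x1(x)] + [b(x) for b in self.branches]
        return self.project(torch.cat(feats, dim=1))
```

Because every branch width scales with the input (K/α instead of a fixed Q), the module shrinks automatically when it is attached to a narrower encoder.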
Undoubtedly, the major contributing factor for reducing the computational burden in our DAPF module is the use of asymmetric convolutions. For instance, in the pyramidal branches, for each standard 2D convolution we would have to perform K × d × d × F operations, where K is the number of input channels (feature maps), d is the kernel size, and F is the number of output channels. On the other hand, following our asymmetric strategy with α = 2, we perform (K × d × (1/2)K) + ((1/2)K × d × (1/2)K) operations. For a 3×3 kernel, this factorization strategy only requires 1/2 of the original number of operations, thus saving 50% of the needed computations and parameters in comparison to its non-asymmetric convolution equivalent. Moreover, factorization can also improve the learning capacity of the module as a result of the intermediate activation layers used between the two 1D convolutions [13].
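As a concrete check, take the DAPF input from Table II, K = 224, with d = 3 and F = K/2 = 112 output channels per branch. Per output position, the standard convolution costs 224 × 3 × 3 × 112 = 225,792 MACs, while the factorized pair costs (224 × 3 × 112) + (112 × 3 × 112) = 75,264 + 37,632 = 112,896 MACs, which is exactly half.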
C. Multi-resolution Dilated Asymmetric module

We design our MDA module to simultaneously exploit contextual information and recover spatial information. As shown in Figure 4, the two multi-resolution feature maps K and Q are concatenated and fused together with a 1×1 convolution. This convolution reduces the number of feature maps by half to reduce the number of computations in the module and subsequent stages.
Fig. 2. Block diagram of the proposed network. "s" indicates the downsampling rate of the feature maps with respect to the original input image (e.g., s=32 indicates output feature maps of size 32×64).

Fig. 4. Multi-resolution Dilated Asymmetric module. "C" denotes the number of output feature maps, "||" indicates concatenation and "D" indicates dilated convolution.

Fig. 3. ASPP and DAPF modules comparison. Each convolution is followed by its respective batch normalization and activation layers. "C" denotes the number of feature maps.

After an additional 3×3 convolution that refines the initially fused feature maps, two parallel branches process the output feature maps. The asymmetric convolutional branch aims to exploit the contextual information present in the feature maps by leveraging the use of dilated convolutions. In contrast, the non-asymmetric branch focuses on refining the details. The resulting feature maps are concatenated and processed by a 1×1 convolution to match the number of feature maps of the first 1×1 convolution. Finally, the feature maps of both 1×1 convolutions are summed up through a residual connection, helping to improve the gradient flow.

The dilation rate r of the asymmetric branch gradually decreases in every MDA block, from the deepest to the shallowest stage of the decoder (see Figure 2). Specifically, the dilation rates are r = (8, 4, 2) from Decoder B1 to Decoder B3, respectively. The intuition behind this idea is that the inner feature maps of the network are richer in contextual information and can be leveraged by atrous convolutions with larger dilation rates.
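Following the description above, a minimal sketch of the MDA topology (our interpretation of Figure 4: the bilinear upsampling of the deeper map and the even channel split between the two branches are our assumptions, and the per-stage channel counts in Table II differ slightly from this even split):

```python
import torch.nn.functional as F

class MDA(nn.Module):
    """Multi-resolution Dilated Asymmetric module (sketch)."""
    def __init__(self, ch_k, ch_q, dilation):
        super().__init__()
        out_ch = (ch_k + ch_q) // 2  # first 1x1 conv halves the channels
        self.fuse = nn.Sequential(
            nn.Conv2d(ch_k + ch_q, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # contextual branch: dilated asymmetric convolution
        self.ctx = DilatedAsymmetricConv(out_ch, out_ch // 2, dilation)
        # detail branch: plain (non-asymmetric) 3x3 convolution
        self.detail = nn.Sequential(
            nn.Conv2d(out_ch, out_ch // 2, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch // 2), nn.ReLU(inplace=True))
        # second 1x1 conv matches the width of the first one
        self.project = nn.Sequential(
            nn.Conv2d(2 * (out_ch // 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, k, q):
        # bring the deeper, lower-resolution map q up to the size of k
        q = F.interpolate(q, size=k.shape[2:], mode='bilinear',
                          align_corners=False)
        fused = self.fuse(torch.cat([k, q], dim=1))
        x = self.refine(fused)
        x = torch.cat([self.ctx(x), self.detail(x)], dim=1)
        return torch.relu(self.project(x) + fused)  # residual sum
```

For the deepest MDA block (the 160-channel Encoder B3 skip plus the 224-channel DAPF output), this gives (160 + 224)/2 = 192 output channels, matching the first MDA row in Table II.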

TABLE I. Ablation study of our proposed modules on the Cityscapes validation set.

Method            | GFLOPs | No. Parameters | ∆p    | FPS  | mIoU
FC-HarDNet-70 [7] | 35.4   | 4.10M          | -     | 52.3 | 76.4
Baseline          | 32.9   | 1.90M          | 0M    | 56.3 | 75.2
+ ASPP            | 36.8   | 3.85M          | 1.95M | 50.2 | 75.8
+ DAPF            | 33.9   | 2.36M          | 0.46M | 53.9 | 77.7
+ MDA             | 44.2   | 2.38M          | 0.48M | 42.2 | 77.4
+ ASPP + MDA      | 48.0   | 4.33M          | 2.43M | 39.1 | 76.8
+ DAPF + MDA      | 45.1   | 2.85M          | 0.95M | 41.1 | 78.2

TABLE II. FASSD-Net architecture. L denotes the number of convolution layers in the HarDBlock.

Stage       | Name       | Type                                 | Output size
Input       | -          | -                                    | 1024×2048×3
Stem Conv   |            | Conv 3×3 (s=2)*                      | 512×1024×16
            |            | Conv 3×3**                           | 512×1024×24
            |            | Conv 3×3 (s=2)                       | 256×512×32
            |            | Conv 3×3                             | 256×512×48
Encoder     | Encoder B1 | HarDBlock (L=4)                      | 256×512×64
            | Encoder B2 | 2D Average Pooling + HarDBlock (L=4) | 128×256×96
            | Encoder B3 | 2D Average Pooling + HarDBlock (L=8) | 64×128×160
            | Encoder B4 | 2D Average Pooling + HarDBlock (L=8) | 32×64×224
DAPF        | -          | -                                    | 32×64×224
Decoder     | MDA        | -                                    | 64×128×192
            | Decoder B1 | HarDBlock (L=8)                      | 64×128×160
            | MDA        | -                                    | 128×256×119
            | Decoder B2 | HarDBlock (L=4)                      | 128×256×78
            | MDA        | -                                    | 256×512×63
            | Decoder B3 | HarDBlock (L=4)                      | 256×512×48
Output Conv |            | Conv 1×1                             | 256×512×19
            |            | Upsampling ×4                        | 1024×2048×19

*Changes to (s=3) for FASSD-Net-L1
**Changes to (s=2) for FASSD-Net-L2
IV. EXPERIMENTS

We evaluate our proposed network architectures on the Cityscapes benchmark [36]. Performance is mainly measured in mean Intersection over Union (mIoU) accuracy and Frames per Second (FPS). Besides, we report the number of parameters and the computational complexity in GFLOPs. All experiments are conducted using the publicly available Cityscapes dataset [36]. This dataset consists of 5,000 finely annotated 1024×2048 images: 2,975 for training, 1,525 for testing, and another 500 images for validation. Additionally, 19,998 images with coarse annotations are also provided. For a fair comparison, we only use the finely annotated images and do not employ any augmentation technique for testing, such as multi-scale or multi-crop, which increases the accuracy at the cost of inference time.
A. Implementation Details

We use PyTorch 1.0 with CUDA 10.2 for all experiments. The same training setting is used for all models, where Stochastic Gradient Descent (SGD) with weight decay 5 × 10^-4 and momentum 0.9 is used as the optimizer. We employ the "poly" learning rate strategy, lr = initial_lr × (1 − iter/total_iter)^0.9, with an initial learning rate of 0.02. The cross-entropy loss is computed following the online bootstrapping strategy [37]. Data augmentation consists of random horizontal flip, random scale in the range [0.5, 2], and random cropping with a 1024×1024 crop size. We trained all models for 90k iterations with batch size 16. For the final models, we follow the same training protocol for 30k more iterations, setting the batch size to 24 and the initial learning rate to 0.001.
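The optimizer and "poly" schedule above can be reproduced with a few lines of PyTorch. This is a sketch under the stated hyperparameters; the LambdaLR wiring and the stand-in model are ours, and the bootstrapped cross-entropy loss [37] is left as a comment.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=5e-4)

total_iter = 90_000  # plus 30k more iterations for the final models
# "poly" policy: lr = initial_lr * (1 - iter / total_iter) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / total_iter) ** 0.9)

for it in range(total_iter):
    # forward pass, bootstrapped cross-entropy loss [37], backward pass
    optimizer.step()
    scheduler.step()
```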
All networks are pre-trained on the ImageNet dataset [38], and the inference speed (in FPS) is measured on an Intel Core i7-9700K desktop with one NVIDIA GTX 1080Ti card unless specified otherwise. For all experiments, the speed is calculated from the average FPS rate over 10,000 iterations measured on images of size 1024×2048×3.
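The measurement protocol can be sketched as follows (our code; the warm-up length is an assumption):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, iters=10_000, shape=(1, 3, 1024, 2048)):
    """Average FPS over `iters` forward passes at full resolution."""
    model.eval().cuda()
    x = torch.randn(shape, device='cuda')
    for _ in range(50):           # warm-up, excluded from the timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()      # wait for all GPU work to finish
    return iters / (time.time() - start)
```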
B. Ablation Study

We show the performance comparison between our proposed DAPF and DeepLabV3+'s ASPP module [11], and we evaluate the effectiveness of our MDA module and its combination with DAPF and ASPP. Table I summarizes the corresponding results. Baseline denotes the modified FC-HarDNet-70, where the last encoder and first decoder blocks are removed, as previously described in Section III. Note that, for a fair comparison, all methods shown in Table I are trained without further fine-tuning.

Our DAPF module outperforms DeepLabV3+'s ASPP in all four metrics (GFLOPs, parameters, FPS and mIoU). In addition, the increase in parameters (∆p) introduced by DAPF over the Baseline is significantly smaller than that of the ASPP module (0.46M vs 1.95M). In summary, our module is more than four times lighter than ASPP and presents an increase of 1.9% in mIoU.

It can be observed that the addition of our MDA strategy improves the mIoU of the baseline network to almost the same degree as DAPF. Likewise, our MDA strategy consumes roughly the same number of parameters. However, since MDA is used at three different levels of the decoder, the FPS drop becomes evident compared to DAPF.

The combination of our two proposals achieves the best mIoU accuracy. Specifically, it outperforms the baseline and the state-of-the-art results of FC-HarDNet-70 [7] by 3% and 1.8%, respectively. Additionally, the increase in parameters (∆p) of our proposal is less than half of that when ASPP is used. On top of that, the total number of parameters is significantly lower than that needed by FC-HarDNet-70 (2.85M vs 4.10M). Note that Baseline + DAPF + MDA corresponds to our final model, called FASSD-Net, as shown in Figure 2.

C. Light variations of FASSD-Net

In addition to our network FASSD-Net, we introduce two light versions designed to maintain a better trade-off between speed and accuracy. We call these networks FASSD-Net-L1 and FASSD-Net-L2.

TABLE III. Per-class mIoU score comparison of our proposals on the Cityscapes validation set.

Method            | road | s.walk | build. | wall | fence | pole | t.light | t.sign | veg. | terr. | sky  | person | rider | car  | truck | bus  | train | mbik | bike | mIoU
FC-HarDNet-70 [7] | 98.1 | 84.6   | 92.6   | 60.0 | 63.5  | 64.9 | 69.7    | 78.8   | 92.2 | 62.8  | 95.0 | 81.4   | 60.7  | 95.0 | 73.9  | 82.4 | 77.4  | 61.7 | 76.2 | 77.4
FASSD-Net         | 98.3 | 86.0   | 92.9   | 59.6 | 63.9  | 67.1 | 71.7    | 79.6   | 92.5 | 63.7  | 95.0 | 82.1   | 63.2  | 95.4 | 80.8  | 87.8 | 79.6  | 61.3 | 76.8 | 78.8
FASSD-Net-L1      | 98.2 | 84.6   | 92.4   | 54.7 | 61.3  | 63.3 | 68.2    | 77.1   | 92.1 | 61.8  | 94.9 | 80.0   | 59.8  | 94.9 | 76.0  | 82.7 | 74.7  | 59.1 | 74.6 | 76.3
FASSD-Net-L2      | 97.9 | 83.1   | 91.6   | 55.6 | 57.0  | 58.0 | 62.5    | 71.8   | 91.7 | 63.0  | 94.4 | 77.3   | 57.0  | 93.8 | 75.5  | 81.9 | 71.1  | 53.1 | 70.8 | 74.1

Fig. 5. Qualitative results of the proposed networks. Regions of improvement are highlighted with yellow squares.

Table II shows the detailed architecture of FASSD-Net, including the output size and number of channels for each element. From Table II, FASSD-Net-L1 differs only in the first convolution layer, where the convolution stride is increased from 2 to 3. Such a modification preserves the same number of parameters of the network and leads to a faster inference speed at the cost of a small drop in accuracy. Specifically, it is 1.9× faster and 2.5% less accurate. FASSD-Net-L2, on the other hand, is designed to be the fastest among our three proposals. It adopts an additional convolution stride of 2 in the second convolution layer of the stem block. In addition, all the HarDBlocks in the decoder are replaced by conventional 3×3 convolution layers of 64 channels. Thus, FASSD-Net-L2 is 3.2× faster than FASSD-Net with only a 4.7% drop in accuracy performance.
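The two light variants therefore differ from FASSD-Net in the stem configuration of Table II (L2 additionally swaps the decoder HarDBlocks for plain convolutions, which is not covered here). A sketch of the corresponding stem, with our code and batch normalization and activations omitted for brevity:

```python
def build_stem(variant='base'):
    """Four-convolution stem (Table II) with the L1/L2 stride tweaks."""
    s1 = 3 if variant == 'L1' else 2   # L1: first stride 2 -> 3
    s2 = 2 if variant == 'L2' else 1   # L2: extra stride 2 in 2nd conv
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=s1, padding=1),
        nn.Conv2d(16, 24, 3, stride=s2, padding=1),
        nn.Conv2d(24, 32, 3, stride=2, padding=1),
        nn.Conv2d(32, 48, 3, stride=1, padding=1))
```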
We report the per-class mIoU accuracy in Table III. FASSD-Net obtains the highest score in 17 out of 19 categories, outperforming the current state-of-the-art model, FC-HarDNet-70. The most significant improvements occur in the truck and bus classes, with 6.9% and 5.4%, respectively. Similarly, FASSD-Net-L1 and FASSD-Net-L2 also obtain better results overall in these two classes compared to FC-HarDNet-70, despite being less accurate models. Qualitative results of our FASSD-Net variations are shown in Figure 5. As the quantitative results suggest, the most significant improvements occur on pixels belonging to large objects, such as trucks and buses. Compared to FC-HarDNet-70, all three FASSD-Net variations better differentiate between the car, bus, and truck classes. For a fair comparison, we have conducted the evaluation of our final models against FC-HarDNet-70 with its official weights from its open-source implementation (github.com/PingoLH/FCHarDNet).

D. Comparison with state-of-the-art real-time methods

Table IV shows the overall comparison of our network proposals versus other state-of-the-art methods for real-time semantic segmentation. The table is divided into three categories, based on the inference speed directly comparable to each of our three proposed networks.
TABLE IV. Comparison between state-of-the-art networks for real-time semantic segmentation.

Method              | Input Size | GPU         | GFLOPs | No. Parameters | FPS    | FPS (norm.) | mIoU
DeepLabV3+ [11]     | 512×1024   | Titan X (P) | -      | -              | ≈1     | ≈1          | 79.6
PSPNet [15]         | 713×713    | Titan X (P) | -      | -              | <1     | <1          | 79.7
Fast-SCNN [25]      | 1024×2048  | Titan XP    | -      | 1.11M          | 123.5  | 110.3       | 69.2
SwiftNetRN-18 [8]   | 512×1024   | GTX 1080Ti  | 26.0   | 11.8M          | 134.9  | 134.9       | 70.2
FC-HarDNet-70 (L2)  | 1024×2048  | GTX 1080Ti  | 6.6    | 2.86M          | 152.8  | 152.8       | 72.1
FASSD-Net-L2 (ours) | 1024×2048  | GTX 1080Ti  | 8.7    | 2.3M           | 133.1  | 133.1       | 74.1
ERFNet [13]         | 512×1024   | Titan X (M) | -      | 2.10M          | 41.7   | 68.4        | 71.5
CAS [28]            | 768×1536   | Titan XP    | -      | -              | 108.0  | 96.4        | 71.6
DF1-Seg-d8 [26]     | 1024×2048  | GTX 1080Ti  | -      | -              | 136.9* | 82.9        | 72.4
FasterSeg [12]      | 1024×2048  | GTX 1080Ti  | 28.2   | 4.4M           | 163.9* | 99.3        | 73.1
FC-HarDNet-70 (L1)  | 1024×2048  | GTX 1080Ti  | 15.7   | 4.1M           | 93.4   | 93.4        | 74.8
FASSD-Net-L1 (ours) | 1024×2048  | GTX 1080Ti  | 20.0   | 2.85M          | 78.0   | 78.0        | 76.3
ICNet [2]           | 1024×2048  | Titan X (M) | 28.3   | 26.5M          | 30.3   | 49.7        | 67.7
DABNet [5]          | 1024×2048  | GTX 1080Ti  | 41.8   | 0.76M          | 27.7   | 27.7        | 69.1
GUN [39]            | 512×1024   | Titan XP    | -      | -              | 33.3   | 29.7        | 69.6
BiSeNet [3]         | 768×1536   | Titan XP    | -      | 49M            | 65.5   | 58.5        | 74.8
SwiftNetRN-18 [8]   | 1024×2048  | GTX 1080Ti  | 104.0  | 11.8M          | 39.9   | 39.9        | 75.4
FC-HarDNet-70 [7]   | 1024×2048  | GTX 1080Ti  | 35.4   | 4.1M           | 53.0   | 53.0        | 77.4
FASSD-Net (ours)    | 1024×2048  | GTX 1080Ti  | 45.1   | 2.85M          | 41.1   | 41.1        | 78.8

* Speed measured with TensorRT acceleration

For a more complete and fair comparison with respect to our baseline, FC-HarDNet-70 (L1) and FC-HarDNet-70 (L2) are our modified implementations of FC-HarDNet-70 that closely resemble our two light networks, following the same changes and the same training protocol. FC-HarDNet-70 (L1) and FC-HarDNet-70 (L2) are directly modified from the official source code of FC-HarDNet-70.

For a fair comparison under different GPU architectures, we follow the same protocol as Orsic et al. [8] and let the column FPS (norm.) in Table IV provide a speed estimate of each model running on a GTX 1080Ti GPU. The scaling factors are: 1.0 for GTX 1080Ti, 0.61 for Titan X (Maxwell), 1.03 for Titan X (Pascal), and 1.12 for Titan XP.
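The normalization is a division by the per-GPU scaling factor; a two-line check against the table values (our code):

```python
SCALE = {'GTX 1080Ti': 1.00, 'Titan X (M)': 0.61,
         'Titan X (P)': 1.03, 'Titan XP': 1.12}

def normalized_fps(fps, gpu):
    """Estimate the FPS the model would reach on a GTX 1080Ti."""
    return fps / SCALE[gpu]

print(round(normalized_fps(41.7, 'Titan X (M)'), 1))  # ERFNet: 68.4
print(round(normalized_fps(123.5, 'Titan XP'), 1))    # Fast-SCNN: 110.3
```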
Our main network, FASSD-Net, surpasses by a considerable margin the mIoU score of all other methods for real-time semantic segmentation, requiring 1.44× fewer parameters and being 1.4% more accurate than the closest competitor, FC-HarDNet-70.

Our second network, FASSD-Net-L1, resembles BiSeNet in mIoU accuracy and FPS. However, the speed of BiSeNet was originally measured on 768×1536 images with an NVIDIA Titan XP card. For a fair comparison, and according to Zhuang et al. [40], we let its speed be 37 FPS when evaluated on 1024×2048 images on an NVIDIA GTX 1080Ti card. This results in our network being about 2.1× faster, 1.5% more accurate, and requiring 17.2× fewer parameters.

Similarly, FASSD-Net-L2 can be compared to FasterSeg and DF1-Seg-d8, which were designed and optimized by NAS methodologies. Both methods utilize TensorRT acceleration [41] to increase their speed performance. For a fair comparison, we let 1.65× be the acceleration factor of TensorRT. This value is approximated from works that present results with and without TensorRT, such as FasterSeg [12]. Under these assumptions, our FASSD-Net-L2 is faster than FasterSeg and DF1-Seg-d8, while being 1% and 1.7% more accurate, respectively.

V. CONCLUSION

In this paper, we focus on reducing the accuracy gap between real-time and non-real-time semantic segmentation networks. For this purpose, we have proposed the DAPF and MDA modules. These modules exploit the contextual information in several stages of the decoder and boost the accuracy performance of the baseline network while keeping a relatively low computational cost. Using our two modules jointly, we have designed three network variations that can be chosen depending on the computational budget. Our main network, FASSD-Net, sets the new state-of-the-art mIoU accuracy for the task of real-time semantic segmentation on the Cityscapes validation set. In addition, our proposed FASSD-Net-L2 ranks as the fastest network when evaluated on 1024×2048 images without using additional network acceleration techniques. As future work, we would like to evaluate our proposals in different scenarios, such as indoor scene understanding or medical images. We would also like to apply acceleration techniques, such as network quantization or network distillation, to further increase the speed of our models.

ACKNOWLEDGMENT

The authors would like to thank the University of Electro-Communications, the National Polytechnic Institute ESIME-Culhuacan, and CONACyT (Mexico) for providing financial support for the development of this work. This work was also supported by JSPS KAKENHI Grant Numbers 15H05915, 17H01745, 17H06100 and 19H04929.
REFERENCES

[1] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-Cross Attention for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[2] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for Real-Time Semantic Segmentation on High-Resolution Images," in The European Conference on Computer Vision (ECCV), September 2018.
[3] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation," in The European Conference on Computer Vision (ECCV), September 2018.
[4] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, "A survey on deep learning techniques for image and video semantic segmentation," Applied Soft Computing, vol. 70, pp. 41–65, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494618302813
[5] G. Li and J. Kim, "DABNet: Depth-wise Asymmetric Bottleneck for Real-time Semantic Segmentation," in British Machine Vision Conference, 2019.
[6] H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[7] P. Chao, C.-Y. Kao, Y.-S. Ruan, C.-H. Huang, and Y.-L. Lin, "HarDNet: A Low Memory Traffic Network," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[8] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In Defense of Pre-Trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[10] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1511.07122
[11] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," arXiv e-prints, p. arXiv:1802.02611, Feb 2018.
[12] W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang, "FasterSeg: Searching for Faster Real-time Semantic Segmentation," in International Conference on Learning Representations, 2020.
[13] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, pp. 263–272, 2018.
[14] Y. Wang, Q. Zhou, and X. Wu, "ESNet: An Efficient Symmetric Network for Real-time Semantic Segmentation," arXiv e-prints, p. arXiv:1906.09826, Jun 2019.
[15] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[16] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[17] Z. Tian, T. He, C. Shen, and Y. Yan, "Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[18] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, "Expectation-Maximization Attention Networks for Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[19] Y. Yuan, X. Chen, and J. Wang, "Object-contextual representations for semantic segmentation," ArXiv, vol. abs/1909.11065, 2019.
[20] R. P. K. Poudel, U. Bonde, S. Liwicki, and C. Zach, "ContextNet: Exploring context and detail for semantic segmentation in real-time," in BMVC, 2018.
[21] F. Chollet, "Xception: Deep Learning With Depthwise Separable Convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv e-prints, p. arXiv:1704.04861, Apr 2017.
[23] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[24] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, "Searching for MobileNetV3," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] R. P. Poudel, S. Liwicki, and R. Cipolla, "Fast-SCNN: Fast semantic segmentation network," arXiv preprint arXiv:1902.04502, 2019.
[26] X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial Order Pruning: For Best Speed/Accuracy Trade-Off in Neural Architecture Search," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[27] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[28] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, "Customizable Architecture Search for Semantic Segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[29] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, "Improving neural network quantization without retraining using outlier channel splitting," in International Conference on Machine Learning, 2019, pp. 7543–7552.
[30] X. Zheng, R. Ji, L. Tang, Y. Wan, B. Zhang, Y. Wu, Y. Wu, and L. Shao, "Dynamic distribution pruning for efficient network architecture search," arXiv preprint arXiv:1905.13543, 2019.
[31] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware," in International Conference on Learning Representations, 2019. [Online]. Available: https://arxiv.org/pdf/1812.00332.pdf
[32] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, "ESPNetv2: A Light-Weight, Power Efficient, and General Purpose Convolutional Neural Network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[33] Y. Wang, Q. Zhou, J. Liu, J. Xiong, G. Gao, X. Wu, and L. J. Latecki, "LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation," in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 1860–1864.
[34] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[36] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[37] Z. Wu, C. Shen, and A. van den Hengel, "High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks," CoRR, vol. abs/1604.04339, 2016. [Online]. Available: http://arxiv.org/abs/1604.04339
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[39] D. Mazzini, "Guided Upsampling Network for Real-Time Semantic Segmentation," arXiv e-prints, p. arXiv:1807.07466, Jul 2018.
[40] J. Zhuang, J. Yang, L. Gu, and N. Dvornek, "ShelfNet for Fast Semantic Segmentation," in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[41] NVIDIA, "TensorRT," September 2010, [Online; accessed February 14, 2020]. [Online]. Available: https://developer.nvidia.com/tensorrt

