
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3359425


DMPNet: Distributed Multi-scale Pyramid Network for Real-time Semantic Segmentation
NADEEM ATIF1, SAQUIB MAZHAR1, SHAIK RAFI AHAMED1, (Senior Member, IEEE), M.K. BHUYAN1, (Senior Member, IEEE), SULTAN ALFARHOOD2, and MEJDL SAFRAN2
1 Indian Institute of Technology Guwahati, Guwahati, Assam 781039, India (e-mail: {atif176102103, saquibmazhar, rafiahamed, mkb}@iitg.ac.in)
2 Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia (e-mail: {sultanf, mejdl}@ksu.edu.sa)
Corresponding author: Sultan Alfarhood (e-mail: [email protected]).
This research is funded by the Researchers Supporting Project Number (RSPD2024R890), King Saud University, Riyadh, Saudi Arabia.

ABSTRACT In semantic segmentation, an input image is partitioned into multiple meaningful segments, each corresponding to a specific object or region. Multi-scale context plays a vital role in the accurate recognition of objects of different sizes and hence is key to overall accuracy enhancement. To achieve this goal, we introduce a novel strategy called Distributed Multi-scale Pyramid Pooling (DMPP) to extract multi-scale context at multiple levels of the feature hierarchy. More specifically, we employ Pyramid Pooling Modules (PPM) in a distributed fashion after all three stages of the encoding phase. This enhances the feature representation capability of the network and leads to better performance. To extract context at a more granular level, we propose an Efficient Multi-scale Context Aggregation (EMCA) module, which uses a combination of small and large kernels with large and small dilation rates, respectively. This alleviates the problem of sparse sampling and leads to consistent recognition of different regions. Apart from model accuracy, small model size and efficient execution are critically important for real-time mobile applications. To this end, we employ a resource-friendly combination of depthwise and factorized convolutions in the EMCA module to drastically reduce the number of parameters without significantly compromising accuracy. Based on the EMCA module and DMPP, we propose a lightweight and real-time Distributed Multi-scale Pyramid Network (DMPNet) that achieves an excellent accuracy-efficiency trade-off. We also conducted extensive experiments on both driving datasets (i.e., Cityscapes and CamVid) and a general-purpose dataset (i.e., ADE20K) to show the effectiveness of the proposed method.

INDEX TERMS Semantic segmentation, deep learning, real-time processing, autonomous driving, resource-constrained.

I. INTRODUCTION
Semantic segmentation is one of the most fundamental visual recognition tasks in computer vision. Its aim is to label each pixel of an input image with a class from amongst a set of predefined classes. This leads to a complete partition of the input image into multiple segments that correspond to different regions and objects present in the image. It has a wide range of applications including, but not limited to, medical imaging, autonomous driving, robotic vision, satellite/aerial imagery, virtual/augmented reality, and so on [1]–[4].
Different objects that appear in a typical road-scene scenario are of different sizes. Also, the same object can appear to have different sizes depending upon its distance from the camera; the closer the object, the bigger it appears. So, an effective visual recognition system must be able to extract and represent features at multiple scales in order to recognize them correctly, irrespective of their actual sizes. Using dilated convolutions with multiple dilation rates is one of the most common and efficient methods to capture multi-scale context [5]–[7]. It increases the receptive field without increasing the number of parameters by inserting zeroes between different entries of the filter. However, using large dilation rates to increase the receptive field leads to a very sparse probing (or sampling) of the input features. This causes accuracy degradation because of the introduction of a large number of holes
VOLUME 4, 2016 1

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

in the filter. Another common technique to aggregate long-range context is to employ plug-and-play context modules such as PPM [8] or ASPP [9], [10]. While the dedicated context modules enhance accuracy by capturing multi-scale context, their localized application causes only the high-level feature maps (i.e., the last stage) to be taken into account. So, to maximize the benefit of dedicated PPMs, they must be applied to the low and mid-level feature maps as well.
Apart from enhancing segmentation quality, another very important dimension of network design, especially for autonomous driving, is model efficiency. The networks are also required to be fast and small in size to achieve compatibility with resource-constrained platforms for mobile applications. Hence, to ensure both segmentation quality and efficient execution, designing networks with a decent trade-off between accuracy and efficiency is of paramount importance. Although many early works achieved lightweight models, the model accuracies were significantly compromised [5], [11]–[13]. This restricts their practical scope, especially in risky scenarios, e.g., autonomous driving. As a result, it is currently one of the most vibrant sub-fields within semantic segmentation research [14]–[17].
Based on the above analysis, we propose a Distributed Multi-scale Pyramid Network (DMPNet) which is based on two different context aggregation mechanisms. To address the issue of sparse probing by kernels with large dilation rates, which leads to accuracy degradation, we introduce an Efficient Multi-scale Context Aggregation (EMCA) module. It employs a large kernel with a relatively smaller dilation rate to extract dense context and thus avoid the loss of short-range spatial details. To reduce the number of parameters required by the large kernel, we factorize it into two asymmetric kernels. Apart from the asymmetric dense kernel, we also use two symmetric small convolutions; one for capturing local finer details without any dilation and the other for capturing long-range context with a large dilation rate. It should be noted that the large dense kernel compensates for the sparsity of the small dilated kernel, especially along the horizontal and vertical directions. As a result, the proposed EMCA module offers a very efficient and effective solution to sparse sampling and at the same time enables the network to capture wider context. Apart from the EMCA module, to capture low, mid, and high-level context from the different stages of the encoder, we introduce a Distributed Multi-scale Pyramid Pooling (DMPP) strategy. It employs multiple PPMs in a distributed fashion, each operating at a different scale. More specifically, we use a PPM after each stage with bin sizes proportional to the spatial resolutions that they operate at. The proposed DMPNet is designed to have a U-shape encoder-decoder architecture. This allows our model to progressively upsample the low-resolution feature maps while hierarchically fusing finer details from the high-resolution layers of the encoder. This enables the proposed network to recover the finer details that are lost as a result of repeated downsampling operations during the encoding phase.
The main contributions of this work can be summarized as follows:
• We propose a novel EMCA module that is designed to capture multi-scale context in a highly efficient way by employing a smart combination of factorized, depthwise, and standard convolutions. More specifically, one convolution captures short-range context (from immediate neighbors), while the other two are meant to capture long-range sparse and dense context.
• We also introduce a strategy to apply PPMs in a distributed fashion. The proposed technique, called DMPP, employs multiple PPMs distributed across the encoder stages to extract multi-scale context at multiple levels of the feature hierarchy.
• Based on the EMCA module and DMPP, we propose a lightweight and real-time network, called DMPNet. It achieves an excellent trade-off between accuracy and efficiency. More specifically, with only 0.65 million parameters, it achieves 71.1% and 69.2% mIoU on the Cityscapes and CamVid datasets, respectively.

II. RELATED WORK
Our objective is to develop a lightweight and real-time segmentation model capable of modeling multi-scale feature representation both on the high semantic level and the granular level. To be consistent with the theme of this article, we organize the existing methods based on the approach they take to model multi-scale context. As a result, we categorize them into three distinct categories: dilated convolution-based pyramid, pooling-based pyramid, and granular context extraction methods. Moreover, we also discuss methods that take an efficiency-driven approach to develop lightweight and real-time segmentation networks.

A. DILATION BASED PYRAMIDS
To release the constraint of fixed-size inputs and to represent features at multiple scales, Spatial Pyramid Pooling (SPP) was proposed by [18]. However, this revolutionary work was meant to solve the tasks of image classification and object detection. DeepLab [19] introduced dilated convolutions in the SPP and proposed Atrous SPP (ASPP), which probes input features with different dilation rates to increase the network receptive field. Its later follow-ups [20], [21] introduced pointwise convolution and image pooling to further enhance the performance. To make the ASPP lightweight, and hence compatible with efficiency-oriented networks, FASSD-Net [10] factorized the symmetric kernels and proposed a Dilated Asymmetric Pyramidal Fusion (DAPF) module. Apart from these, employing non-local context, which is primarily based on attention mechanisms, is also effective for semantic segmentation [22], [23].

B. SPATIAL POOLING BASED PYRAMIDS
Another research trend to improve receptive fields and aggregate wider context is to use pooling-based pyramid techniques. PSPNet [8] introduced a Pyramid Pooling Module (PPM) to pool features using different pooling windows. This

allows multi-scale feature modeling and results in more robust and consistent scene labeling. APCNet [24] proposed Global-guided Local Affinity (GLA) to construct effective context features. With the help of multiple ACMs, it relies on the adaptive construction of multi-scale context representations. While the above methods give great performances, they pool features from a square window, which is more effective for isotropic patterns. To capture anisotropic shapes, SPNet [25] introduced a novel strategy called Strip Pooling (SP). It pools features in a long and narrow band along the horizontal and vertical directions. Based on this technique, they proposed the SP module. They also designed a composite of SPM and PPM, called the Mixed Pooling Module (MPM), to model long-range relations.

C. GRANULAR CONTEXT EXTRACTION
The context modules, both dilated convolution-based (ASPP, DAPF) and pooling-based (PPM, MPM), operate at a high semantic level. They can be used as plug-and-play modules and can be attached to a segmentation network to enhance accuracy. A more granular approach to aggregating context is to use dilated convolutions within the basic building blocks themselves. ESPNet [5] proposed an ESP module with 5 parallel convolutions; four are dilated with multiple dilation rates. This results in multi-scale context aggregation at a lower level. DABNet [6] also uses dilated convolutions with factorized kernels for efficient context modeling. Similarly, CGNet [26], MFNet [16], and FBSNet [14], among others [7], [27], employ dilated convolutions with and/or without factorization in their networks.

D. EFFICIENCY ORIENTED NETWORKS
Another major trend in semantic segmentation research is to develop lightweight and real-time models for resource-constrained devices. ENet [28] was the first lightweight and real-time network. Unfortunately, it achieved a lightweight model at the heavy expense of model accuracy. Nonetheless, this work introduced a new direction of research, i.e., real-time semantic segmentation. ENet has less than half a million parameters; 0.36 million to be more specific. ContextNet [29] developed a highly efficient two-pathway model to extract both high-level contextual information and low-level spatial details. Inspired by a two-pathway structure, FBSNet [14] introduces a separate spatial branch in an encoder-decoder context branch. The authors of LETNet [6] proposed a highly efficient transformer to model non-local dependencies in an efficient manner. By using a combination of depthwise and factorized convolutions in their basic blocks, both LETNet and FBSNet achieved lightweight models. Other efficiency-oriented works include, but are not limited to, ESPNet [5] and CGNet [26], [12], among others [27], [30].

III. PROPOSED METHOD
In this section, the building blocks as well as the complete network architecture are presented.

FIGURE 1. (a) A 3 × 3 convolution with a large dilation rate achieves a large receptive field but introduces large regions of holes. (b) A 5 × 5 convolution with a smaller dilation rate solves the problem of sparsity but does not take features along diagonal directions into consideration. (c) Combining the sparse and dense schemes leads to partially symmetric yet dense sampling.

A. EMCA MODULE
Using dilated convolutions in the basic blocks to enlarge receptive fields is a very common and effective technique to capture wider context. However, most of the existing methods use small kernels, e.g., 3 × 3, to reduce the number of parameters. Also, larger dilation rates lead to sparse filtering and suffer from segmentation quality degradation. We propose an alternative approach and show that a large factorized convolution with relatively smaller dilation rates can achieve 1) a large receptive field, 2) parameter efficiency, and, more importantly, 3) dense context.
Fig. 1 shows that a large kernel with a smaller dilation rate can achieve enhanced context density with the same receptive field as that of a small kernel with a large dilation rate, with a negligible increase in parameters. We use a combination of the sparse and dense convolutions in the proposed EMCA module to achieve both a high receptive field and dense feature sampling. A comparison of the proposed EMCA module with the modules of other methods is shown in Fig. 2. The EMCA module first uses a pointwise convolution to internally compress the input tensor by a factor equal to the number of branches. Most modules use a compression factor equal to 2, as shown in Fig. 2(a-c), whereas we compress the input tensor by a factor of 3, which lowers the computational burden. The channel widths of the layers of different stages are usually kept as powers of 2, for example, 32, 64, 128, 256, etc. However, since our design compresses the channel width by a factor of 3, we choose the channel width as an integer multiple of 3. This is done to facilitate a direct concatenation without needing a projection layer (1×1) to match the input and output feature widths of an EMCA module. The compressed feature tensor is then subjected to three parallel convolution operations. Two of them (the left and right branches) are symmetric convolutions, which are responsible for extracting short-range and long-range context. The middle branch also extracts long-range context, albeit in a dense fashion. To execute these operations efficiently, we employ a very balanced combination of symmetric, asymmetric, and depthwise convolutions. More specifically, the symmetric convolution operations are instantiated by depthwise kernels, and the asymmetric convolution is carried out by standard kernels. The outputs from these three branches are then concatenated, and the result is then added with the input

FIGURE 2. Comparison of different modules. (a) AF module in MFNet [16], (b) DAB module in DABNet [6], (c) CG module in CGNet [26], and (d) is the proposed
EMCA module. C is the number of channels and D represents dilated convolution. Also, gray, orange, and pink colors represent regular, depthwise, and pointwise
convolution, respectively.

feature to enable residual learning. The same EMCA module is also used as a downsampling module at the beginning of the second and third stages (cf. Fig. 4 and Table 2). When the EMCA module is operating as a downsampling module, all the convolutions are executed with stride = 2 and the identity connection is dropped. Our dilation rate policy is presented in Table 1.

TABLE 1. Dilation rate policy for the asymmetric and symmetric kernels of the EMCA module. d1 and d2 correspond to the dilation rates of the asymmetric kernel and the symmetric kernel, respectively. L stands for layer in a given stage.

                   Stage-2            Stage-3
Dilation rates   L1-L2   L3-L4   L1-L3   L4-L6   L7-L10
(d1, d2)         (2,4)   (4,8)   (2,4)   (4,8)   (8,16)
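The three-branch structure described above can be sketched in PyTorch. The following is a minimal, illustrative implementation of the regular (non-downsampling) form under our reading of the text; the class and attribute names are ours, and normalization/activation layers are omitted since the text does not specify their placement:

```python
import torch
import torch.nn as nn

class EMCA(nn.Module):
    """Sketch of an EMCA-style block (Eqs. (1)-(5)): pointwise compression by
    the branch count (3), three parallel context branches, concatenation, and
    a residual connection."""

    def __init__(self, channels: int, d1: int = 2, d2: int = 4):
        super().__init__()
        assert channels % 3 == 0, "channel width is chosen as a multiple of 3"
        c = channels // 3
        # Eq. (1): pointwise compression by a factor of 3
        self.compress = nn.Conv2d(channels, c, kernel_size=1)
        # Eq. (2): symmetric depthwise 3x3, no dilation (short-range context)
        self.short = nn.Conv2d(c, c, 3, padding=1, groups=c)
        # Eq. (3): symmetric depthwise 3x3 with the larger dilation d2
        # (long-range but sparse context)
        self.sparse = nn.Conv2d(c, c, 3, padding=d2, dilation=d2, groups=c)
        # Eq. (4): large kernel factorized into standard 5x1 and 1x5 convs
        # with the smaller dilation d1 (long-range dense context)
        self.dense = nn.Sequential(
            nn.Conv2d(c, c, (5, 1), padding=(2 * d1, 0), dilation=(d1, 1)),
            nn.Conv2d(c, c, (1, 5), padding=(0, 2 * d1), dilation=(1, d1)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fc = self.compress(x)
        # Eq. (5): concatenate the three branches and add the residual input
        out = torch.cat([self.short(fc), self.sparse(fc), self.dense(fc)], dim=1)
        return out + x

x = torch.randn(1, 36, 64, 128)
y = EMCA(36, d1=2, d2=4)(x)
print(y.shape)  # spatial size and channel width are preserved
```

The padding values (`d2` for the dilated 3×3, `2*d1` for the dilated 5-tap factors) keep the spatial size unchanged, so the residual addition is valid; the default `(d1, d2) = (2, 4)` mirrors the first row of Table 1.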

The operations executed by the proposed EMCA module can be expressed by the following set of equations:

F_c = C_{1×1}(F_in)                               (1)
F_s = C_{3×3}(F_c)                                (2)
F_{d,s} = C_{d,3×3}(F_c)                          (3)
F_{d,as} = C_{d,1×5}(C_{d,5×1}(F_c))              (4)
F_out = Concat(F_s, F_{d,s}, F_{d,as}) + F_in     (5)

where F_in and F_out represent the input and output features of the EMCA module, respectively. The subscripts c, d, s, and as denote compressed, dilated, symmetric, and asymmetric operations, respectively. C_{m×n} means a convolutional operation with a kernel of size m × n.

FIGURE 3. (a) PPM. F_in and F_out correspond to input and output feature maps, respectively. "C" represents the number of channels. UP stands for bilinear upsampling to the input feature resolution. (1/4) with the pointwise convolution represents the compression factor. (b) Schematic diagram showing the four pooling bins of sizes {1, 2, 3, 6}. (c) Distributed multi-scale pyramid pooling (DMPP) strategy. "O. R." stands for operating resolution with respect to the resolution of the input image.

B. COMPLETE NETWORK ARCHITECTURE
The complete network architecture of the proposed DMPNet is shown in Fig. 4. We employ a U-shape encoder-decoder architecture to recover fine-grained spatial details from the initial layers while progressively upsampling the low-resolution feature maps. The encoder of the proposed DMPNet has three stages, so the spatial resolution of the third-stage feature maps is 1/8 of that of the input image. Moreover, multiple PPMs have been used in the encoder in a distributed fashion to aggregate multi-scale context at multiple levels of the feature hierarchy. The details of our distributed pyramid scheme are presented next.

1) DISTRIBUTED MULTI-SCALE PYRAMID POOLING
Conventionally, the PPM is used once after the lowest-resolution feature maps containing the semantic information. However, we argue that spatial pooling of features from

FIGURE 4. The overall architecture of the proposed DMPNet. The downsampler in stage 1 is a standard 3 × 3 convolution with stride 2. The downsamplers in
stage-2 and stage-3 are EMCA modules with all the convolution operations (except the first pointwise conv.) with stride 2. "P-1/x" stands for a pointwise
convolution-based projection layer to compress the tensor width by a factor of x. The cyan-colored diamond-shaped box represents the concatenation operation.

all the resolutions (or stages) causes more effective context aggregation. Therefore, we employ a PPM after each stage with bin sizes proportional to the spatial resolution of the features in the corresponding stage. Fig. 3(a) and (b) show the PPM and the set of pooling windows that it uses, respectively. To include the low and mid-level features for spatial pooling, we employ two additional PPMs; one in stage-1 and the other in stage-2, as shown in Fig. 4. The schematic diagram of the three different sets of scales at which the distributed PPMs operate is shown in Fig. 3. The bin sizes of the PPM used after stage 3 are set to those of the PPM of PSPNet [8], i.e., {1, 2, 3, 6}. As we move back towards the initial stage, we keep increasing the bin sizes by a factor of 2, in accordance with the increasing resolution of the feature maps. So, the bin sizes of the PPMs of stage-2 and stage-1 are {2, 4, 6, 12} and {4, 8, 12, 24}, respectively. In this way, our method effectively captures multi-scale context at multiple levels of the network. Lastly, the channel width of the output tensor of a PPM is twice that of the input tensor. This drastically increases the number of parameters and the model size, as we are using a PPM after every stage. To address this issue, we use a projection layer after every PPM to compress the output width by a factor of 2. This leads to the efficient execution of the distributed pyramid scheme.

2) CHANNEL ATTENTION
While the channels of a feature tensor carry different types of extracted feature information, they are also prone to interference noise [14], [31]. So, we use a lightweight Channel Attention Module (CAM) [32] at the beginning of the decoder and after every feature fusion of the decoder to give more emphasis to meaningful features. Besides, it also suppresses unnecessary features, thus leading to enhanced learning. To extract global context, CAM employs global average pooling, and an attention map is generated for feature extraction guidance. Moreover, it causes only a slight increment in computational cost and thus is a very efficient way of improving model performance. The channel attention mechanism can be expressed as follows:

M_C(F_in) = σ(f_{k×k}(T(GAP(F_in))))

where M_C ∈ R^{C×1×1} and F_in ∈ R^{C×H×W} are the channel attention map and input feature maps, respectively. f_{k×k} denotes a standard convolution operation with filter size k. T represents compression and re-weighting operations. GAP and σ stand for global average pooling and the sigmoid function, respectively.

In a deep network, contextual information is captured in deep layers, whereas shallow layers are responsible for extracting low-level finer details, like edges and boundaries. So, to enhance the segmentation quality, models are required to effectively capture and aggregate these two different, albeit complementary, types of features. To mix low-level finer details with high-level semantic information, we design our network as a U-shape structure, as shown in Fig. 4. To reduce the number of parameters, we compress the tensors from stage-2 and stage-1 by factors of 4 and 2, respectively, before we fuse them with the decoder features. The upsampler block is a sequence of a deconvolution block, an activation layer, and a batch-normalization layer. The details of the network architecture are presented in Table 2.

IV. EXPERIMENTS
In this section, we first present the datasets that are used to evaluate the performance of our network. Then, we present the experimental setup and the implementation protocol. Next, we present the extensive experiments that have been conducted to carry out the ablation studies on the Cityscapes validation set. The ablation studies are done to show the effectiveness of the proposed method and to prove the optimality of the final network configuration. Finally, we compare our results with those of the state-of-the-art models in terms of different accuracy and efficiency metrics. To measure accuracy, we use mean intersection-over-union (mIoU), and for efficiency, we use the number of parameters

TABLE 2. Detailed architecture of the proposed DMPNet. "Out Ch.": number of feature channels at the layer's output. "Out Res.": output spatial resolution of feature tensors for input resolution 512 × 1024.

         Layer  Operation             Out Ch.  Out Res.
ENCODER
         1      Downsample (Conv-3)    36      256 × 512
         2-3    Conv-3                 36      256 × 512
         4      PPM-1                  72      256 × 512
         5      Projection             36      256 × 512
         6      Downsample (EMCA)      72      128 × 256
         7-10   4× EMCA module         72      128 × 256
         11     PPM-2                 144      128 × 256
         12     Projection             72      128 × 256
         13     Downsample (EMCA)     144       64 × 128
         14-23  10× EMCA module       144       64 × 128
         24     PPM-3                 288       64 × 128
         25     Projection            144       64 × 128
DECODER
         26     CAM-1                 144       64 × 128
         27     Upsampler              72      128 × 256
         28-29  2× EMCA                72      128 × 256
         30     CAM-2                  90      128 × 256
         31     Upsampler              45      256 × 512
         32-33  2× EMCA                45      256 × 512
         34     CAM-3                  63      256 × 512
         35     Projection (Deconv)     C      512 × 1024

FIGURE 5. Baseline network based on the EMCA module. P× and Q× mean repeating a block P and Q times, respectively. A pointwise convolution is used as a projection layer.

TABLE 3. Evaluation results for different depths of the baseline model.

P   Q    mIoU (%)  Parameters (k)
2    8    58.4      326.790
2   10    59.1      389.478
4    8    58.1      343.158
4   10    60.0      405.846
6    8    58.9      359.526
6   10    59.4      422.214

in millions (M) and the speed of the network in frames per second (FPS).

A. DATASETS
1) CITYSCAPES
The Cityscapes dataset is one of the most popular urban scene datasets for semantic segmentation, especially for autonomous driving applications. It contains 5000 finely labeled high-resolution (1024 × 2048) images, which are divided into training, validation, and testing sets having 2975, 500, and 1525 images, respectively. There are 19 semantic categories present in the dataset.

2) CAMVID
Another very popular urban scene dataset for autonomous driving applications is the CamVid dataset. There are a total of 701 images, which are divided into a training set (367 images), a validation set (101 images), and a test set (233 images). It contains 11 semantic categories, and the images have a spatial resolution of 360 × 480.

3) ADE20K
Unlike urban road-scene datasets such as Cityscapes and CamVid, ADE20K [33] is a highly comprehensive general-purpose dataset. It contains dense labels belonging to 150 stuff/object categories. It contains 20000, 2000, and 3000 images for training, validation, and testing, respectively.

B. IMPLEMENTATION DETAILS
For training the networks, we have used a Tesla V100 GPU with the PyTorch framework with CUDA support and cuDNN backends. Inference speed evaluation is carried out on a single GTX 1080Ti card. The average run-time over 100 frames is used to compute the inference speed of the network. Mini-batch stochastic gradient descent (SGD) [34] is used with momentum 0.9 and weight decay 5 × 10^−4. We employed the "poly" learning rate policy [19], which is given by

lr = lr_init × (1 − iter/max_iter)^power

where lr_init corresponds to the initial learning rate, which is 0.001, with power 0.9. We use the online hard example mining (OHEM) loss function [35]. The training is carried out in two stages: the encoder part in the first stage and the complete network in the second. The batch size is set to 12 and 6 for training the encoder and the complete network, respectively. To train the network on the Cityscapes dataset, the original images are sub-sampled by a factor of 2, i.e., the training resolution is 512 × 1024. For the ADE20K dataset, the training resolution is 480 × 480. We trained the encoder for 500 epochs and the complete network for 1000 epochs. For data augmentation, standard strategies such as horizontal flipping, cropping, and scaling have been employed during training [5]. Following [28], a class weighting scheme has also been used to mitigate the class-imbalance problem. As per this scheme, different weights are assigned to different classes during training, giving more weight to rare classes and less weight to dominant classes. To be more specific, each class is assigned the following weight:

w_class = 1 / ln(c + p_class)

where p_class is the normalized frequency of the class and c is a constant that is set to 1.02.


C. ABLATION STUDIES
In this subsection, we present extensive experiments to demonstrate how the network has been developed and to show the effectiveness of the final network design. The ablation study is conducted on the Cityscapes validation set. Firstly, we design a baseline model based on the proposed EMCA module. The baseline model consists only of the encoder, as shown in Fig. 5. A projection layer is used at the end of the encoder to compress the width of the output feature maps to the number of classes, and bilinear upsampling by a factor of 8 then directly upsamples the low-resolution output feature maps to the original resolution. It should be noted that only the yellow region of the baseline model will be used to develop the final network (i.e., DMPNet); the layers in the blue region are discarded once an effective encoder has been achieved. After developing an effective encoder, we conduct several experiments to design a decoder that enhances the performance of the complete network.

1) ABLATION STUDY FOR BASELINE DEPTH
To find a suitable depth of the proposed DMPNet, we conducted several experiments on the baseline model. Table 3 presents the details of the experiments. P and Q represent the number of EMCA modules in the second and third stages, respectively. As can be seen, the most suitable choice for the baseline depth is P = 4 and Q = 10; therefore, we use this combination in the proposed DMPNet.

2) ABLATION STUDY FOR KERNEL SIZES AND DILATION RATES
The proposed EMCA module has three convolutions: a non-dilated symmetric, a dilated asymmetric, and a dilated symmetric convolution. The dilation rate policy of the asymmetric and symmetric convolutions in the EMCA module is shown in Table 1. In this subsection, we demonstrate the effectiveness of using a large asymmetric kernel with a smaller dilation rate, i.e., half of that of the symmetric convolution. We do not change the dilation rate of the symmetric kernel, i.e., d2 remains consistent with Table 1. Table 4 demonstrates the effects of varying the kernel size and dilation rate of the asymmetric kernel.

TABLE 4. Effect of varying kernel size and dilation rate of the asymmetric kernel on the model performance.

Kernel size   Dilation rate (d1)   mIoU (%)   Param (k)
3             d2                   53.1       292.950
3             d2/2                 54.8       292.950
5             d2                   58.5       405.846
5             d2/2                 60         405.846

Experimental results show that using a large asymmetric kernel with a small dilation rate gives the best performance. This is because this combination, on the one hand, enlarges the receptive field by virtue of the large kernel and, on the other hand, increases context density by virtue of the small dilation rate.

3) ABLATION STUDY FOR DISTRIBUTED PPM
Conventionally, PPM is used only once after the encoder, i.e., at the lowest-resolution feature maps. In this work, we show that employing PPMs with different scales in a distributed fashion further enhances the accuracy with a negligible increment in computational cost. In this subsection, we present two sets of experiments. In the first one, shown in Table 5, we show the effectiveness of distributing identically scaled PPMs (i.e., with pooling bin sizes {1, 2, 3, 6}) across all three stages. The experimental results show that employing PPMs in a distributed fashion, i.e., in all the stages, gives better performance than using one only after the final stage. Specifically, the distributed application of PPM increases accuracy by 0.7% compared to the conventional approach. In the second set of experiments, shown in Table 6, the effectiveness of distributing multi-scale PPMs across the three stages is shown. In the multi-scale scheme, pooling with bin sizes {1, 2, 3, 6}, {2, 4, 6, 12}, and {4, 6, 12, 24} is used in stage-3, stage-2, and stage-1 of the encoder, respectively. It can be observed that applying PPMs with different scales gives better performance. More specifically, in comparison to the conventional approach, it increases the mIoU by 1.1% with only a slight increment in parameters. This shows the effectiveness of the proposed distributed multi-scale pyramid pooling.

TABLE 5. Evaluation results of using PPMs in a distributed fashion with a single scale, i.e., {1, 2, 3, 6}.

Encoder    Stage-1   Stage-2   Stage-3   mIoU (%)   Param (k)
Baseline   -         -         -         60         405.846
Baseline   -         -         ✓         61.6       468.630
Baseline   -         ✓         ✓         62         484.470
Baseline   ✓         ✓         ✓         62.3       488.502

TABLE 6. Evaluation results of using PPMs in a distributed fashion with three different scales.

Encoder    Stage-1   Stage-2   Stage-3   mIoU (%)   Param (k)
Baseline   -         -         -         60         405.846
Baseline   -         -         PPM-3     61.6       468.630
Baseline   -         PPM-2     PPM-3     62.1       484.470
Baseline   PPM-1     PPM-2     PPM-3     62.7       488.502

4) ABLATION STUDY FOR DISTRIBUTED ASPP
Apart from PPM, ASPP [21] is another very commonly used long-range context extraction module. So, we also conduct extensive experiments to see its effects on the performance of the proposed DMPNet. Like the distributed PPM experiments, this section also includes two sets of experiments; one

for identically scaled ASPPs and the other for multi-scale ASPPs. More specifically, the identically scaled distributed ASPP scheme has convolutions with dilation rates {12, 24, 36}. The multi-scale scheme, on the other hand, executes ASPPs with three different sets of dilation rates: ASPP-1 with {3, 6, 9}, ASPP-2 with {6, 12, 18}, and ASPP-3 with {12, 24, 36}. The evaluation results of single-scale and multi-scale distributed ASPP are presented in Table 7 and Table 8, respectively. It is interesting to note that although a single ASPP gives slightly better accuracy than a single PPM (see Table 5), it increases the model size by more than 2×, which leads to a significant decrease in parameter efficiency. Moreover, when employed in a distributed fashion, the PPM-based DMPNet outperforms the ASPP-based one despite having significantly fewer parameters. The superiority of the distributed PPM scheme over ASPP can be attributed to the fact that the functionality of the latter is already implemented to a small extent by the proposed EMCA module, thanks to its multi-scale dilated convolutions; using ASPPs in a distributed (repeated) fashion therefore brings relatively less improvement. The PPM, on the other hand, aggregates context by pooling features, which enables the network to incorporate context in two different ways: through dilated convolutions in the EMCA module and through pooled features in the PPM. This leads to a more diverse approach to contextual information extraction and hence explains the superiority of the distributed PPM-based DMPNet over the distributed ASPP-based one.

TABLE 7. Evaluation results of using ASPPs in a distributed fashion with a single scale, i.e., with convolutions having dilation rates {12, 24, 36}.

Encoder    Stage-1   Stage-2   Stage-3   mIoU (%)   Param (k)
Baseline   -         -         -         60         405.846
Baseline   -         -         ✓         61.4       1091.574
Baseline   -         ✓         ✓         61.9       1263.366
Baseline   ✓         ✓         ✓         62.2       1306.494

TABLE 8. Evaluation results of using ASPPs in a distributed fashion with three different scales.

Encoder    Stage-1   Stage-2   Stage-3   mIoU (%)   Param (k)
Baseline   -         -         -         60         405.846
Baseline   -         -         ASPP-3    61.8       1091.574
Baseline   -         ASPP-2    ASPP-3    62.2       1263.366
Baseline   ASPP-1    ASPP-2    ASPP-3    62.6       1306.494

5) ABLATION STUDY FOR DECODER
After conducting a series of experiments, we chose "Baseline + PPM-1 + PPM-2 + PPM-3" to be the encoder of the proposed DMPNet; we refer to it as DMP-Enc. To find a suitable decoding strategy, we again conducted a series of experiments. As for the number of EMCA modules in different stages of the decoder, we have followed one of the most common schemes, i.e., we employ two EMCA modules in the first two stages of the decoder. The third stage does not employ any EMCA module after upsampling, to reduce the computational burden. Table 9 presents the experimental details with regard to different decoders. Decoders are used to progressively upsample the low-resolution feature maps generated by the encoder to recover the low-level details that are lost as a result of multiple downsampling operations in the encoder. While some works have employed a sequential approach, the U-shape approach is the most common and most effective [10]. So, we have adopted a U-shape decoder design in our final architecture. Moreover, after the encoder and after each of the two fusion points of the decoder, we use the channel attention module. It enhances the accuracy by giving more emphasis to the channels and features that are more meaningful. Table 9 shows that using the channel attention scheme results in 0.4 and 0.6% mIoU increments in the sequential and U-shape approaches, respectively. More importantly, the U-shape decoder achieves 0.9% more mIoU than the sequential approach without the channel attention scheme; with channel attention, the U-shape decoder is 1.1% more accurate than the sequential approach. This proves the effectiveness of using the U-shape decoder in conjunction with channel attention.

TABLE 9. Evaluation results of employing different decoding strategies. "Seq." stands for a sequential approach as adopted by [36]. "Ch. Att." stands for channel attention.

Seq.   U-shape   Ch. Att.   mIoU (%)   Param (k)
✓      -         -          69.9       629.282
✓      -         ✓          70.3       629.291
-      ✓         -          70.8       655.643
-      ✓         ✓          71.4       655.652

D. COMPARISON WITH STATE-OF-THE-ARTS
In this subsection, we compare our method with other state-of-the-art methods on the Cityscapes and CamVid datasets.

1) COMPARISON WITH STATE-OF-THE-ARTS ON CITYSCAPES
Accuracy and model-size comparison: The class-wise and the comprehensive comparisons of the proposed DMPNet with other methods are shown in Table 10 and Table 11, respectively. Some recent models such as PIDNet [40], HyperSeg [41], and DDRNet [35] achieve excellent accuracies at the cost of a huge number of parameters. To be more specific, PIDNet, HyperSeg, and DDRNet achieve 78.6, 78.1, and 77.4% mIoU with 7.6, 10.2, and 5.7 million parameters, respectively. It should be noted that PIDNet [40] achieves 0.5% more mIoU than HyperSeg despite having 2.6 million fewer parameters. This demonstrates that an increase in parameters alone does not always lead to a proportional accuracy improvement. We can observe the same phenomenon when we compare the mid-size models with the large-size ones. For example, FASSDNet is 0.1% more accurate than

TABLE 10. Evaluation results (per-class basis) of different methods on the Cityscapes test set. Columns follow the standard Cityscapes class order.

Method  | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic light | Traffic sign | Vegetation | Terrain | Sky  | Pedestrian | Rider | Car  | Truck | Bus  | Train | Motorbike | Bicycle | mIoU (%)
ENet    | 96.3 | 74.2     | 75       | 32.2 | 33.2  | 43.4 | 34.1          | 44           | 88.6       | 61.4    | 90.6 | 65.5       | 38.4  | 90.6 | 36.9  | 50.5 | 48.1  | 38.8      | 55.4    | 58.3
ESPNet  | 97   | 77.5     | 76.2     | 35   | 36.1  | 45   | 35.6          | 46.3         | 90.8       | 63.2    | 92.6 | 67         | 40.9  | 92.3 | 38.1  | 52.5 | 50.1  | 41.8      | 57.2    | 60.3
CGNet   | 95.5 | 78.7     | 88.1     | 40   | 43    | 54.1 | 59.8          | 63.9         | 89.6       | 67.6    | 92.9 | 74.9       | 54.9  | 90.2 | 44.1  | 59.5 | 25.2  | 47.3      | 60.2    | 64.8
DABNet  | 97.9 | 82       | 90.6     | 45.5 | 50.1  | 59.3 | 63.5          | 67.7         | 91.8       | 70.1    | 92.8 | 78.1       | 57.8  | 93.7 | 52.8  | 63.7 | 56    | 51.3      | 66.8    | 70.1
FBSNet  | 98   | 83.2     | 91.5     | 50.9 | 53.5  | 62.5 | 67.6          | 71.5         | 92.7       | 70.5    | 94.4 | 82.5       | 63.8  | 93.9 | 50.5  | 56   | 37.6  | 56.2      | 70.1    | 70.9
LEDNet  | 97.1 | 78.6     | 90.4     | 46.5 | 48.1  | 60.9 | 60.4          | 71.1         | 91.2       | 60      | 93.2 | 74.3       | 51.8  | 92.3 | 61    | 72.4 | 51    | 43.3      | 70.2    | 69.2
DMPNet  | 97.8 | 82.2     | 91       | 47.8 | 44.2  | 58.5 | 64            | 68.8         | 91.2       | 69      | 94.5 | 79.5       | 60.1  | 93.7 | 57.2  | 70.2 | 63.1  | 52.7      | 66      | 71.1
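Table 10 reports per-class IoU values together with their mean (mIoU). As a reminder of how these scores are produced from predictions, here is a minimal sketch computing IoU from a confusion matrix; the 3-class pixel counts below are made up for illustration and are not taken from any experiment in this paper:

```python
def per_class_iou(conf):
    """conf[i][j] = number of pixels with true class i predicted as class j.
    For each class i: IoU_i = TP_i / (TP_i + FP_i + FN_i)."""
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                       # pixels of class i missed
        fp = sum(conf[j][i] for j in range(n)) - tp  # pixels wrongly labeled i
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious

# Hypothetical 3-class confusion matrix (pixel counts).
conf = [[50, 3, 2],
        [2, 30, 1],
        [1, 1, 10]]

ious = per_class_iou(conf)
miou = sum(ious) / len(ious)  # mean over classes, as in the mIoU column
```

In practice the confusion matrix is accumulated over the whole test set before the per-class IoUs are computed, so that small images and rare classes are weighted by pixel count rather than by image.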

TABLE 11. Comparison with other methods on the Cityscapes test set.

Methods            Pretrain (ImageNet)   mIoU (%)   Param. (M)   Speed (FPS)
FCN-8s [37]        ✓                     65.3       134.5        -
ICNet [38]         ✓                     69.5       26.5         30.3
SegNet [39]        ✓                     57         29.5         16.7
PSPNet [8]         ✓                     81.2       250.8        0.78
DeepLab [19]       ✓                     63.5       262.10       0.25
DeepLabv3+ [21]    ✓                     75.2       15.40        8.40
BiSeNetV1 [11]     ✗                     68.4       5.8          105.8
PIDNet [40]        ✗                     78.6       7.6          93.2
HyperSeg [41]      ✓                     78.1       10.2         16.1
DDRNet [35]        ✗                     77.4       5.7          101.6
ERFNet [36]        ✗                     68.0       2.2          41.7
BiSeNetV2 [42]     ✗                     75.3       4.59         47.3
FASSDNet [10]      ✓                     77.5       2.85         41.1
MSCFNet [43]       ✗                     71.9       1.15         50
BiAttenNet [44]    ✗                     74.7       2.2          89.2
MLFNet [31]        ✗                     71.5       3.99         90.8
MFNet [16]         ✗                     72.1       1.34         116
DABNet [6]         ✗                     70.1       0.76         104.2
ContextNet [29]    ✗                     66.1       0.88         65.5
EDANet [30]        ✗                     67.3       0.68         81.3
LEDNet [27]        ✗                     70.6       0.94         71
BANet [17]         ✗                     70.1       0.72         83.2
FBSNet [14]        ✗                     70.9       0.62         90
LETNet [7]         ✗                     72.8       0.95         150
ENet [28]          ✗                     58.3       0.36         74.9
ESPNet [5]         ✗                     60.3       0.36         112
NDNet [12]         ✗                     65.3       0.5          101.1
CGNet [26]         ✗                     64.8       0.5          17.6
DMPNet (ours)      ✗                     71.1       0.65         105.1

TABLE 12. Speed and run-time measurement of different methods at different spatial resolutions. For a fair comparison, the networks are executed on the same platform, i.e., a GTX 1080Ti, so the results may slightly differ from those reported by the original works.

Model            256 × 512 (ms / fps)   512 × 1024 (ms / fps)   1024 × 2048 (ms / fps)
ENet [28]        10 / 99.8              13 / 74.9               44 / 22.9
SegNet [39]      16 / 64.2              56 / 17.9               - / -
ERFNet [36]      7 / 147.8              21 / 48.2               74 / 13.5
ICNet [38]       9 / 107.9              15 / 67.2               40 / 25.1
ESPNet [5]       5 / 182.5              9 / 115.2               30 / 33.3
DABNet [6]       6 / 170.2              10 / 104.2              36 / 27.7
FBSNet [14]      7 / 142.3              12 / 77                 45 / 22.1
DMPNet (ours)    5 / 178                9 / 105.1               31 / 31.8

DDRNet while being 2× smaller. We can observe a similar effect between BiSeNetV1 and BiSeNetV2 [42]: in spite of having 1.21 million fewer parameters, BiSeNetV2 achieves almost 7% more mIoU than BiSeNetV1.

These observations show that small-size networks can achieve similar or even higher accuracies than multiple-times larger networks by means of smart architecture design. Leveraging this possibility leads to highly efficient networks with decent accuracies, which in turn provides practical solutions for resource-constrained platforms. To achieve this goal, we introduce the lightweight real-time DMPNet, which achieves 71.1% mIoU on the Cityscapes test set. It achieves better accuracy than most lightweight networks despite having a smaller model size. More specifically, our method is 1%, 5%, and 3.8% more accurate while having 110k, 230k, and 30k fewer parameters compared to DABNet, ContextNet, and EDANet, respectively. Compared to LETNet, a very recent state-of-the-art network, the mIoU of the proposed method is very close (only 1.7% less) despite having 300k fewer parameters. This shows the effectiveness of our method. It is also very interesting to compare our method with some of the mid-size methods, as shown in the mid-section of Table 11. Experimental results show that our method gives 2.1% better accuracy compared to ERFNet despite being more than 3× smaller. Also, DMPNet achieves very close accuracy compared to MLFNet and MFNet despite being 6× and 2× smaller. Furthermore, even when we compare our model with some of the large-size models (>5 million parameters), it achieves better accuracy than BiSeNetV1 [11] while being more than 8× smaller. The qualitative results of the proposed method on the Cityscapes validation set are shown in Fig. 6.

Speed comparison: Inference speed highly depends upon the device and the input sizes. For a fair comparison, it is therefore very important to run all the methods on the same platform. All the experiments to compute the inference speeds have been done on a single NVIDIA GTX 1080Ti GPU. We present the comparison of the speed and run-time of our method with other state-of-the-art methods in Table 12. We conduct experiments with three different resolutions: quarter,

FIGURE 6. Qualitative results of the proposed method on the Cityscapes validation set. Regions shown by yellow bounding boxes demonstrate how the model progressively improves as we move from the baseline to the complete network, i.e., DMPNet.
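The speed comparison above measures every network on the same GPU at fixed input resolutions, averaging over many forward passes. A minimal timing harness of that kind can be sketched in plain Python; `dummy_model` is a hypothetical stand-in for a network forward pass, not the paper's benchmarking code:

```python
import time

def measure_fps(model, inputs, warmup: int = 10, iters: int = 100) -> float:
    """Average frames per second of `model` over `iters` calls."""
    for _ in range(warmup):   # warm-up passes exclude one-time setup costs
        model(inputs)
    start = time.perf_counter()
    for _ in range(iters):
        model(inputs)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Stand-in "model": any callable. A real benchmark would run the network
# on a fixed-size input (e.g., 512 x 1024) on the target device.
dummy_model = lambda x: [v * 2 for v in x]
fps = measure_fps(dummy_model, [1, 2, 3])
```

On CUDA devices the GPU must additionally be synchronized before each clock read (e.g., `torch.cuda.synchronize()` in PyTorch), since kernel launches are asynchronous and unsynchronized timings would be meaningless.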

half, and full resolution on Cityscapes images. The proposed DMPNet can process an incoming stream of images with resolution 512×1024 at a speed of 105.1 fps. This makes our method one of the fastest, second only to ESPNet, which achieves 115.2 fps. ESPNet, however, achieves only 60.3% mIoU, which is insufficient for practical deployment, especially in risky scenarios such as autonomous driving. The accuracy- and speed-related experimental results show that our method achieves an excellent accuracy-efficiency balance, which shows its effectiveness in real-time scenarios.

TABLE 13. Comparison with other methods on the CamVid test set.

Methods           Pretrain (ImageNet)   mIoU (%)   Parameters (M)
FCN-8s [37]       ✓                     57         134.5
SegNet [39]       ✓                     55.6       29.5
ICNet [38]        ✓                     67.1       26.5
BiSeNetV1 [11]    ✗                     65.6       5.8
HyperSeg [41]     ✓                     78.4       9.9
DDRNet [35]       ✗                     74.7       5.7
BiSeNetV2 [42]    ✗                     72.4       4.59
FASSDNet [10]     ✓                     69.3       2.85
MSCFNet [43]      ✗                     69.3       1.15
MFNet [16]        ✗                     71.5       1.34
DABNet [6]        ✗                     66.4       0.76
EDANet [30]       ✗                     64.6       0.68
FBSNet [14]       ✗                     68.9       0.62
LETNet [7]        ✗                     70.5       0.95
ENet [28]         ✗                     51.3       0.36
ESPNet [5]        ✗                     55.6       0.36
NDNet [12]        ✗                     57.2       0.5
CGNet [26]        ✗                     65.6       0.5
DMPNet (ours)     ✗                     69.2       0.65

Accuracy-efficiency trade-off comparison: Existing semantic segmentation models can be categorized into two broad categories: 1) lightweight networks with high processing speed but insufficient accuracy, and 2) methods that achieve decent accuracies but at the cost of increased model size and slow speed. PSPNet [8], DeepLabv3+ [21], HyperSeg [41], and DDRNet [35], among others [40], [42], belong to the second category. These networks generally achieve very high accuracies (>75% mIoU) but are bulky (>5 million parameters in most cases). Moreover, most of these networks do not achieve real-time performance; specifically, the speeds of PSPNet, DeepLab, DeepLabv3+, and HyperSeg are 0.78, 0.25, 8.40, and 16.1 FPS, respectively. Hence, these are not suitable for real-time applications on resource-constrained platforms. However, efficient design does enable some of these large-scale networks to achieve real-time performance, e.g., PIDNet and DDRNet. ENet [28], ESPNet [5], CGNet [26], NDNet [12], etc., belong to the first category, which is computationally efficient but achieves insufficient accuracy for practical application, especially in risky scenarios such as autonomous driving. A third, narrower category is an emerging area of research that includes methods achieving decent accuracies with lightweight model size and high processing speed. Examples include DABNet [6], FBSNet

[14], LETNet [7], BANet [17], etc. Our DMPNet is 1% more accurate than both DABNet and BANet despite having 110k and 70k fewer parameters, respectively. Also, our method is faster than most of the methods in this third (trade-off-oriented) category. More specifically, the DMPNet achieves 0.9, 21.9, and 15.1 higher FPS than DABNet, BANet, and FBSNet, respectively. LETNet achieves 1.7% more mIoU than DMPNet but with 200k more parameters. Although FBSNet achieves an excellent trade-off between accuracy and model size, it is 28 FPS slower than our model when executed on the same platform (see Table 12). This shows that the proposed DMPNet achieves a decent balance between accuracy, model size (parameters), and inference speed, and thus gives a highly competitive performance compared to recent state-of-the-art methods.

2) COMPARISON WITH STATE-OF-THE-ARTS ON CAMVID
To show the effectiveness of our design and its generalizable nature, we also train and evaluate our model on the CamVid dataset. The comparison of our method with other state-of-the-art methods is presented in Table 13. Compared with the early benchmark methods, the proposed DMPNet is 12.2%, 13.6%, and 2.1% more accurate while being 206×, 45×, and 40× smaller than FCN-8s, SegNet, and ICNet, respectively. With respect to lightweight works (<1 million parameters), the proposed method achieves better accuracy than all of them except LETNet. To be more specific, our method achieves 2.8% and 4.6% more mIoU while having 110k and 30k fewer parameters than DABNet and EDANet, respectively. Compared to other ultra-lightweight works, such as ESPNet and CGNet, the proposed model achieves 13.6% and 3.6% higher mIoU with 290k and 150k more parameters. So, it can be observed that the proposed DMPNet achieves an excellent accuracy-efficiency trade-off.

3) COMPARISON WITH STATE-OF-THE-ARTS ON ADE20K
Most real-time semantic segmentation works proposed for autonomous driving evaluate their methods on road-scene datasets such as Cityscapes and CamVid. However, to show the generalizability of the proposed DMPNet, we also train and evaluate our method on the highly challenging and comprehensive general-purpose ADE20K dataset. The comparison of our method with other state-of-the-art methods is presented in Table 14. Since it is mostly the large-scale general-purpose networks that report evaluation results on ADE20K, Table 14 includes large-scale models with tens of millions of parameters. The experimental results show that a carefully designed network with fewer parameters can achieve better performance than its larger counterparts. SegFormer [45], despite being 3× smaller, achieves 1.5% better accuracy than SASD [46]. The proposed DMPNet achieves accuracy very close to that of SASD despite being 15× smaller. Apart from SASD, our method outperforms many classical methods as well. The proposed DMPNet is 5.81%, 13.56%, and 2.89% more accurate than FCN-8s, SegNet, and DilatedNet, respectively, despite being 181×, 64.35×, and 84× smaller. This shows that our method achieves an excellent accuracy-efficiency trade-off.

TABLE 14. Comparison with other methods on the ADE20K validation set.

Methods           Backbone (ImageNet)   pixAcc (%)   mIoU (%)   Param (M)
FCN-8s [37]       -                     71.32        29.39      134.5
SegNet [39]       -                     71           21.64      47.62
DilatedNet [47]   -                     73.55        32.31      62.74
PSPNet [8]        ResNet50              79.65        40.79      44.62
RefineNet [48]    ResNet152             -            40.7       -
EncNet [49]       ResNet50              79.73        41.11      -
SASD [46]         ResNet18              77.13        35.82      11.38
SegFormer [45]    -                     -            37.4       3.8
DMPNet (ours)     -                     76.8         35.2       0.74

V. CONCLUSION
In this work, we propose a novel Efficient Multi-scale Context Aggregation (EMCA) module to capture contextual information at multiple receptive fields. It uses a small symmetric kernel with a large dilation rate and a large asymmetric kernel with a relatively smaller dilation rate to efficiently extract dense and sparse features. Apart from these, a standard convolution without any dilation is also used to preserve local details. We also introduce a distributed multi-scale pyramid pooling (DMPP) strategy to extract context from all three levels of the feature hierarchy: low, mid, and high. Based on the EMCA module and the DMPP strategy, we propose a lightweight, real-time network, called DMPNet, that achieves an excellent accuracy-efficiency trade-off. More specifically, the proposed DMPNet achieves 71.1 and 69.2% mIoU on the Cityscapes and CamVid datasets with only 0.65 million parameters. Furthermore, it is capable of processing an incoming stream of high-resolution images at a speed of 105.1 frames per second (FPS).

VI. FUTURE SCOPE
Although our method achieves a decent trade-off between accuracy and efficiency, it suffers from the problem of class imbalance to a certain extent. Moreover, to recover finer details in the high-resolution segmentation map, deconvolution is conventionally employed, which negatively affects the processing speed of the model. Future extensions of the proposed method will therefore include developing more advanced techniques to counter the class-imbalance issue more effectively. Also, effective decoding strategies that reduce the dependency on deconvolution blocks will be explored.

ACKNOWLEDGMENT
The authors extend their appreciation to King Saud University for funding this research through Researchers Supporting

Project Number (RSPD2024R890), King Saud University, Riyadh, Saudi Arabia.

REFERENCES
[1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[2] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Part III. Springer, 2015, pp. 234–241.
[3] W. Zhou, Y. Liu, C. Wang, Y. Zhan, Y. Dai, and R. Wang, "An automated learning framework with limited and cross-domain data for traffic equipment detection from surveillance videos," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24891–24903, 2022.
[4] X. Sun, Y. Qian, R. Cao, P. Tuerxun, and Z. Hu, "Bgfnet: Semantic segmentation network based on boundary guidance," IEEE Geoscience and Remote Sensing Letters, 2023.
[5] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 552–568.
[6] G. Li, I. Yun, J. Kim, and J. Kim, "Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation," arXiv preprint arXiv:1907.11357, 2019.
[7] G. Xu, J. Li, G. Gao, H. Lu, J. Yang, and D. Yue, "Lightweight real-time semantic segmentation network with efficient transformer and cnn," IEEE Transactions on Intelligent Transportation Systems, 2023.
[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[9] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, "Denseaspp for semantic segmentation in street scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[10] L. Rosas-Arias, G. Benitez-Garcia, J. Portillo-Portillo, J. Olivares-Mercado, G. Sanchez-Perez, and K. Yanai, "Fassd-net: Fast and accurate real-time semantic segmentation for embedded systems," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14349–14360, 2021.
[11] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "Bisenet: Bilateral segmentation network for real-time semantic segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[12] Z. Yang, H. Yu, Q. Fu, W. Sun, W. Jia, M. Sun, and Z.-H. Mao, "Ndnet: Narrow while deep network for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 9, pp. 5508–5519, 2020.
[13] X. Zhang, Z. Chen, Q. J. Wu, L. Cai, D. Lu, and X. Li, "Fast semantic segmentation for scene perception," IEEE Transactions on Industrial Informatics, vol. 15, no. 2, pp. 1183–1192, 2018.
[14] G. Gao, G. Xu, J. Li, Y. Yu, H. Lu, and J. Yang, "Fbsnet: A fast bilateral symmetrical network for real-time semantic segmentation," IEEE Transactions on Multimedia, 2022.
[15] S. Mazhar, N. Atif, M. Bhuyan, and S. R. Ahamed, "Rethinking dabnet: Light-weight network for real-time semantic segmentation of road scenes," IEEE Transactions on Artificial Intelligence, 2023.
[16] M. Lu, Z. Chen, C. Liu, S. Ma, L. Cai, and H. Qin, "Mfnet: Multi-feature fusion network for real-time semantic segmentation in road scenes," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 20991–21003, 2022.
[17] S. Mazhar, N. Atif, M. Bhuyan, and S. R. Ahamed, "Block attention network: A lightweight deep network for real-time semantic segmentation of road scenes in resource-constrained devices," Engineering Applications of Artificial Intelligence, vol. 126, p. 107086, 2023.
[19] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[20] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[21] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[22] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia, "Psanet: Point-wise spatial attention network for scene parsing," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
[23] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, "Ocnet: Object context for semantic segmentation," International Journal of Computer Vision, vol. 129, no. 8, pp. 2375–2398, 2021.
[24] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, "Adaptive pyramid context network for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
[25] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, "Strip pooling: Rethinking spatial pooling for scene parsing," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4003–4012.
[26] T. Wu, S. Tang, R. Zhang, J. Cao, and Y. Zhang, "Cgnet: A light-weight context guided network for semantic segmentation," IEEE Transactions on Image Processing, vol. 30, pp. 1169–1179, 2020.
[27] Y. Wang, Q. Zhou, J. Liu, J. Xiong, G. Gao, X. Wu, and L. J. Latecki, "Lednet: A lightweight encoder-decoder network for real-time semantic segmentation," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 1860–1864.
[28] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "Enet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
[29] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach, "Contextnet: Exploring context and detail for semantic segmentation in real-time," arXiv preprint arXiv:1805.04554, 2018.
[30] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, "Efficient dense modules of asymmetric convolution for real-time semantic segmentation," in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
[31] J. Fan, F. Wang, H. Chu, X. Hu, Y. Cheng, and B. Gao, "Mlfnet: Multi-level fusion network for real-time semantic segmentation of autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 756–767, 2022.
[32] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "Eca-net: Efficient channel attention for deep convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11534–11542.
[33] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[35] H. Pan, Y. Hong, W. Sun, and Y. Jia, "Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 3, pp. 3448–3460, 2022.
[36] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
[37] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[38] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "Icnet for real-time semantic segmentation on high-resolution images," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in [39] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep con-
deep convolutional networks for visual recognition,” IEEE transactions on volutional encoder-decoder architecture for image segmentation,” IEEE
pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, transactions on pattern analysis and machine intelligence, vol. 39, no. 12,
2015. pp. 2481–2495, 2017.

12 VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3359425

[40] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “PIDNet: A real-time semantic segmentation network inspired by PID controllers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19529–19539.
[41] Y. Nirkin, L. Wolf, and T. Hassner, “HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
[42] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
[43] G. Gao, G. Xu, Y. Yu, J. Xie, J. Yang, and D. Yue, “MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25489–25499, 2021.
[44] G. Li, L. Li, and J. Zhang, “BiAttnNet: Bilateral attention for improving real-time semantic segmentation,” IEEE Signal Processing Letters, vol. 29, pp. 46–50, 2021.
[45] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[46] S. An, Q. Liao, Z. Lu, and J.-H. Xue, “Efficient semantic segmentation via self-attention and self-distillation,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15256–15266, 2022.
[47] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[48] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[49] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.

NADEEM ATIF received the bachelor's degree in technology in 2015 and the M.Tech. degree in 2017 from Z.H.C.E.T., A.M.U., Aligarh, India. He is currently pursuing the Ph.D. degree with the EEE Department, Indian Institute of Technology Guwahati, Guwahati, India. His research interests include computer vision, deep learning, and autonomous driving.

SAQUIB MAZHAR received the bachelor's degree in technology from NIT Hamirpur, Himachal Pradesh, India, and the M.Tech. degree in 2017 from Z.H.C.E.T., A.M.U., Aligarh, India. He is currently pursuing the Ph.D. degree with the EEE Department, Indian Institute of Technology Guwahati, Guwahati, India. His research interests include computer vision, deep learning, and autonomous driving.

SHAIK RAFI AHAMED (Senior Member, IEEE) received the B.Tech. and M.Tech. degrees in electronics and communication engineering from Sri Venkateswara University, Tirupati, India, in 1991 and 1993, respectively, and the Ph.D. degree from the Indian Institute of Technology Kharagpur, Kharagpur, India, in 2008. He is currently a Professor with the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India.

M.K. BHUYAN (Senior Member, IEEE) received the Ph.D. degree in electronics and communication engineering from the Indian Institute of Technology (IIT) Guwahati, India, in 2006. He was with the School of Information Technology and Electrical Engineering, University of Queensland, St. Lucia, QLD, Australia, where he was involved in postdoctoral research. Subsequently, he was a Researcher with the SAFE Sensor Research Group, NICTA, Brisbane, QLD. He is currently a Professor with the Department of Electronics and Electrical Engineering, IIT Guwahati, and the Associate Dean of Infrastructure, Planning and Management, IIT Guwahati. In 2014, he was a Visiting Professor with Indiana University and Purdue University, Indiana, USA. He is also a Visiting Professor with the Department of Computer Science, Chubu University, Japan. His current research interests include image/video processing, computer vision, machine and deep learning, human-computer interaction (HCI), virtual reality and augmented reality, and biomedical signal processing. He was a recipient of the National Award for Best Applied Research/Technological Innovation, presented by the Honorable President of India in 2012, the prestigious Fulbright-Nehru Academic and Professional Excellence Fellowship, and the BOYSCAST Fellowship.

SULTAN ALFARHOOD received the Ph.D. degree in computer science from the University of Arkansas. He is an Assistant Professor with the Department of Computer Science, King Saud University (KSU). Since joining KSU in 2007, he has made several contributions to the field of computer science through his research and publications. His research spans a variety of domains, including machine learning, recommender systems, linked open data, text mining, and ML-based IoT systems. His work includes proposing innovative approaches and techniques to enhance the accuracy and effectiveness of these systems. His recent publications have focused on using deep learning and machine learning techniques to address challenges in these domains. His research continues to make significant contributions to the fields of computer science and machine learning, and his work has been published in several high-impact journals and conferences.

MEJDL SAFRAN is a passionate researcher and educator in the field of artificial intelligence, with a focus on deep learning and its applications in various domains. He is currently an Assistant Professor of Computer Science at King Saud University, where he has been a faculty member since 2008. He received the bachelor's degree in computer science from King Saud University in 2007, the master's degree in computer science from Southern Illinois University Carbondale in 2013, and the doctoral degree in computer science from the same university in 2018. His doctoral dissertation was on developing efficient learning-based recommendation algorithms for top-N tasks and top-N workers in large-scale crowdsourcing systems. He has published more than 20 articles in peer-reviewed journals and conference proceedings, such as ACM Transactions on Information Systems, Applied Computing and Informatics, Mathematics, Sustainability, International Journal of Digital Earth, IEEE Access, Biomedicine, Sensors, IEEE International Conference on Cluster, IEEE International Conference on Computer and Information Science, International Conference on Database Systems for Advanced Applications, and International Conference on Computational Science and Computational Intelligence. He has been leading grant projects in the fields of AI in medical imaging and AI in smart farming. His current research interests include developing novel deep learning methods for image processing, pattern recognition, natural language processing, and predictive analytics, as well as modeling and analyzing user behavior and interest in online platforms. He has been working as an AI consultant for several national and international agencies since 2018.
