DMPNet: Distributed Multi-Scale Pyramid Network for Real-Time Semantic Segmentation
ABSTRACT
In semantic segmentation, an input image is partitioned into multiple meaningful segments, each corresponding to a specific object or region. Multi-scale context plays a vital role in the accurate recognition of objects of different sizes and hence is key to overall accuracy enhancement. To achieve this goal, we introduce a novel strategy called Distributed Multi-scale Pyramid Pooling (DMPP) to extract multi-scale context at multiple levels of the feature hierarchy. More specifically, we employ Pyramid Pooling Modules (PPMs) in a distributed fashion after all three stages of the encoding phase. This enhances the feature representation capability of the network and leads to better performance. To extract context at a more granular level, we propose an Efficient Multi-scale Context Aggregation (EMCA) module, which uses a combination of small and large kernels with large and small dilation rates, respectively. This alleviates the problem of sparse sampling and leads to consistent recognition of different regions. Apart from model accuracy, small model size and efficient execution are critically important for real-time mobile applications. To achieve this, we employ a resource-friendly combination of depthwise and factorized convolutions in the EMCA module to drastically reduce the number of parameters without significantly compromising the accuracy. Based on the EMCA module and DMPP, we propose a lightweight and real-time Distributed Multi-scale Pyramid Network (DMPNet) that achieves an excellent accuracy-efficiency trade-off. We also conduct extensive experiments on driving datasets (i.e., Cityscapes and CamVid) and a general-purpose dataset (i.e., ADE20K) to show the effectiveness of the proposed method.
INDEX TERMS Semantic segmentation, deep learning, real-time processing, autonomous driving, resource-constrained.
I. INTRODUCTION
Semantic segmentation is one of the most fundamental visual recognition tasks in computer vision. Its aim is to label each pixel of an input image with a class from amongst a set of predefined classes. This leads to a complete partition of the input image into multiple segments that correspond to the different regions and objects present in the image. It has a wide range of applications including, but not limited to, medical imaging, autonomous driving, robotic vision, satellite/aerial imagery, and virtual/augmented reality [1]–[4].

Different objects that appear in a typical road-scene scenario are of different sizes. Also, the same object can appear to have different sizes depending upon its distance from the camera; the closer the object, the bigger it appears. So, an effective visual recognition system must be able to extract and represent features at multiple scales in order to recognize objects correctly, irrespective of their actual sizes. Using dilated convolutions with multiple dilation rates is one of the most common and efficient methods to capture multi-scale context [5]–[7]. A dilated convolution increases the receptive field without increasing the number of parameters by inserting zeroes between the entries of the filter. However, using large dilation rates to increase the receptive field leads to a very sparse probing (or sampling) of the input features. This causes accuracy degradation because of the introduction of a large number of holes
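This sparsity can be quantified: a k × k kernel with dilation rate d covers an effective extent of k + (k − 1)(d − 1) pixels per side while still sampling only k² positions. The short sketch below (with illustrative values, not taken from the paper) shows how quickly the sampling density collapses as d grows.

```python
# Effective receptive extent of a dilated convolution: a k x k kernel with
# dilation d spans k + (k - 1) * (d - 1) pixels per side but still samples
# only k * k of them, so the fraction of probed pixels drops rapidly with d.
def effective_extent(k: int, d: int) -> int:
    return k + (k - 1) * (d - 1)

for k, d in [(3, 1), (3, 4), (3, 16)]:
    extent = effective_extent(k, d)
    density = k * k / extent ** 2  # fraction of the covered square actually sampled
    print(f"k={k}, d={d}: extent={extent}x{extent}, sampling density={density:.3f}")
```

For k = 3, raising the dilation from 1 to 16 stretches the kernel's footprint from 3 × 3 to 33 × 33, yet fewer than 1% of the covered pixels are actually probed, which is exactly the sparse-sampling problem described above.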
FIGURE 2. Comparison of different modules. (a) AF module in MFNet [16], (b) DAB module in DABNet [6], (c) CG module in CGNet [26], and (d) the proposed EMCA module. C is the number of channels and D represents a dilated convolution. Also, gray, orange, and pink represent regular, depthwise, and pointwise convolutions, respectively.
TABLE 1. Dilation-rate policy for the asymmetric and symmetric kernels of the EMCA module. d1 and d2 correspond to the dilation rates of the asymmetric and symmetric kernels, respectively. L stands for layer in a given stage.

Stage | Layers | (d1, d2)
Stage-2 | L1–L2 | (2, 4)
Stage-2 | L3–L4 | (4, 8)
Stage-3 | L1–L3 | (2, 4)
Stage-3 | L4–L6 | (4, 8)
Stage-3 | L7–L10 | (8, 16)
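For illustration, the following is a minimal PyTorch sketch of an EMCA-style block built from the ingredients named above: a plain depthwise 3 × 3 branch for local details, a small symmetric 3 × 3 kernel with the larger dilation rate d2, and a factorized 5 × 1 / 1 × 5 asymmetric kernel with the smaller dilation rate d1, glued together with pointwise convolutions. The exact branch wiring, channel split, and residual connection are assumptions made for this sketch (the stride-1 variant is shown); the paper's definitive design is given in its Fig. 2(d) and Table 2.

```python
import torch
import torch.nn as nn

class EMCA(nn.Module):
    """Sketch of an Efficient Multi-scale Context Aggregation block.

    Three parallel depthwise branches: a plain 3x3 (local detail), a 3x3
    symmetric kernel with a large dilation d2, and a factorized 5x1 / 1x5
    asymmetric kernel with a smaller dilation d1.  Branch fusion, the
    channel reduction, and the residual connection are assumptions.
    """

    def __init__(self, channels: int, d1: int = 2, d2: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)  # first pointwise conv
        c = channels // 2
        # plain depthwise 3x3: preserves local details (no dilation)
        self.local = nn.Conv2d(c, c, 3, padding=1, groups=c)
        # small symmetric kernel with large dilation d2: sparse long-range context
        self.sym = nn.Conv2d(c, c, 3, padding=d2, dilation=d2, groups=c)
        # large asymmetric kernel (5x1 then 1x5) with smaller dilation d1: dense context
        self.asym = nn.Sequential(
            nn.Conv2d(c, c, (5, 1), padding=(2 * d1, 0), dilation=(d1, 1), groups=c),
            nn.Conv2d(c, c, (1, 5), padding=(0, 2 * d1), dilation=(1, d1), groups=c),
        )
        self.expand = nn.Conv2d(c, channels, kernel_size=1)  # pointwise fusion
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU(channels)

    def forward(self, x):
        y = self.reduce(x)
        y = self.local(y) + self.sym(y) + self.asym(y)  # aggregate the three scales
        return self.act(self.bn(self.expand(y)) + x)    # residual connection (assumed)
```

Instantiating, say, `EMCA(64, d1=2, d2=4)` would correspond to the (2, 4) policy used in the early layers of stage-2 per Table 1; the channel width 64 is a hypothetical value.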
FIGURE 4. The overall architecture of the proposed DMPNet. The downsampler in stage-1 is a standard 3 × 3 convolution with stride 2. The downsamplers in stage-2 and stage-3 are EMCA modules in which all convolution operations (except the first pointwise convolution) use stride 2. "P-1/x" stands for a pointwise-convolution-based projection layer that compresses the tensor width by a factor of x. The cyan diamond-shaped box represents the concatenation operation.
all the resolutions (or stages) causes more effective context aggregation. Therefore, we employ a PPM after each stage, with bin sizes proportional to the spatial resolution of the features in the corresponding stage. Fig. 3(a) and (b) show the PPM and the set of pooling windows that it uses, respectively. To include the low- and mid-level features in the spatial pooling, we employ two additional PPMs: one in stage-1 and the other in stage-2, as shown in Fig. 4. The schematic diagram of the three different sets of scales at which the distributed PPMs operate is shown in Fig. 3. The bin sizes of the PPM used after stage-3 are set to those of the PPM of PSPNet [8], i.e., {1, 2, 3, 6}. As we move back towards the initial stage, we keep increasing the bin sizes by a factor of 2, in accordance with the increasing resolution of the feature maps. So, the bin sizes of the PPMs of stage-2 and stage-1 are {2, 4, 6, 12} and {4, 8, 12, 24}, respectively. In this way, our method effectively captures multi-scale context at multiple levels of the network. Lastly, the channel width of the output tensor of a PPM is twice that of its input tensor. This drastically increases the number of parameters and the model size, as we use a PPM after every stage. To address this issue, we use a projection layer after every PPM to compress the output width by a factor of 2. This leads to the efficient execution of the distributed pyramid scheme.

as follows:

$M_C(F_{in}) = \sigma\big(f_{k \times k}\big(T(\mathrm{GAP}(F_{in}))\big)\big)$

where $M_C \in \mathbb{R}^{C \times 1 \times 1}$ and $F_{in} \in \mathbb{R}^{C \times H \times W}$ are the channel attention map and the input feature maps, respectively; $f_{k \times k}$ denotes a standard convolution operation with filter size $k$; $T$ represents compression and re-weighting operations; and $\mathrm{GAP}$ and $\sigma$ stand for global average pooling and the sigmoid function, respectively.

In a deep network, contextual information is captured in the deep layers, whereas the shallow layers are responsible for extracting low-level finer details such as edges and boundaries. So, to enhance segmentation quality, models are required to effectively capture and aggregate these two different, albeit complementary, types of features. To mix low-level finer details with high-level semantic information, we design our network as a U-shape structure, as shown in Fig. 4. To reduce the number of parameters, we compress the tensors from stage-2 and stage-1 by a factor of 4 and 2, respectively, before we fuse them with the decoder features. The upsampler block is a sequence of a deconvolution block, an activation layer, and a batch-normalization layer. The details of the network architecture are presented in Table 2.
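A minimal PyTorch sketch of the channel attention defined by the formula above is given below. The exact form of the compression-and-re-weighting operator T is not spelled out in this excerpt, so the sketch assumes a squeeze-and-excitation-style pointwise bottleneck; the reduction ratio r and the kernel size k of f_{k×k} are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of M_C(F_in) = sigmoid(f_kxk(T(GAP(F_in)))).

    T ("compression and re-weighting") is assumed to be a pointwise squeeze
    with a non-linearity; f_kxk is a convolution over the squeezed descriptor.
    """

    def __init__(self, channels: int, r: int = 4, k: int = 1):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                  # GAP: C x H x W -> C x 1 x 1
        self.t = nn.Sequential(                             # T: compress, then re-weight
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
        )
        self.f = nn.Conv2d(channels, channels, k, padding=k // 2)  # f_kxk
        self.sigma = nn.Sigmoid()

    def forward(self, x):
        m_c = self.sigma(self.f(self.t(self.gap(x))))  # attention map, C x 1 x 1
        return x * m_c                                 # re-scale the input channels
```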
C. ABLATION STUDIES
In this subsection, we present extensive experiments to demonstrate how the network has been developed and to show the effectiveness of the final network design. To conduct the ablation study, the Cityscapes validation set has been used. Firstly, we design a baseline model that is based on the proposed EMCA module. The baseline model consists only of the encoder, as shown in Fig. 5. A projection layer is used at the end of the encoder to compress the width of the output feature maps to the number of classes. Then, we use bilinear upsampling by a factor of 8 to directly upsample the low-resolution output feature maps to the original resolution. It should be noted that only the yellow region of the baseline model will be used to develop the final network (i.e., DMPNet). The layers in the blue region will be discarded once an effective encoder has been achieved. After we have developed an effective encoder, we conduct several experiments to design a decoder that enhances the performance of the complete network.

2) ABLATION STUDY FOR KERNEL SIZES AND DILATION RATES

TABLE 4. Effect of varying the kernel size and dilation rate of the asymmetric kernel on the model performance.

Kernel size | Dilation rate (d1) | mIoU (%) | Parameters (k)
3 | d2 | 53.1 | 292.950
3 | d2/2 | 54.8 | 292.950
5 | d2 | 58.5 | 405.846
5 | d2/2 | 60.0 | 405.846

Using a larger asymmetric kernel with a smaller dilation rate, on the one hand, increases the receptive field by virtue of the large kernel and, on the other hand, increases context density by virtue of the small dilation rate.

3) ABLATION STUDY FOR DISTRIBUTED PPM
Conventionally, a PPM is used only once after the encoder, i.e., on the lowest-resolution feature maps. In this work, we show that employing PPMs with different scales in a distributed fashion further enhances the accuracy with a negligible increment in computational cost. In this subsection, we present two sets of experiments. In the first one, shown in Table 5, we show the effectiveness of distributing identically scaled PPMs (i.e., with pooling bin sizes {1, 2, 3, 6}) across all three stages. The experimental results show that employing PPMs in a distributed fashion, i.e., in all the stages, gives better performance than using one only after the final stage. Specifically, the distributed application of PPMs increases accuracy by 0.7% compared to the conventional approach. In the second set of experiments, shown in Table 6, the effectiveness of distributing multi-scale PPMs across the three stages is shown. In the multi-scale scheme, pooling bin sizes of {1, 2, 3, 6}, {2, 4, 6, 12}, and {4, 8, 12, 24} are used in stage-3, stage-2, and stage-1 of the encoder, respectively. It can be observed that applying PPMs with different scales gives better performance. More specifically, in comparison to the conventional approach, it increases the mIoU by 1.1% with only a slight increment in parameters. This shows the effectiveness of the proposed distributed multi-scale pyramid pooling.

TABLE 6. Evaluation result of using PPMs in a distributed fashion with three different scales.
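As a concrete illustration of the scheme evaluated above, the following is a minimal PyTorch sketch of one distributed PPM stage: pyramid pooling over the given bin sizes, whose concatenated output is twice the input width, followed by the pointwise projection layer that compresses it back by a factor of 2. The per-branch width of in_ch // 4 mirrors PSPNet's convention and, like the stage channel counts in the usage lines, is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPMWithProjection(nn.Module):
    """Sketch of one distributed PPM stage with the trailing projection layer."""

    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, in_ch // 4, 1))
            for b in bins
        )
        # concatenation doubles the width (in_ch + 4 * in_ch // 4 = 2 * in_ch);
        # the projection layer compresses it back by a factor of 2
        self.project = nn.Conv2d(2 * in_ch, in_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                align_corners=False) for b in self.branches]
        return self.project(torch.cat([x, *pooled], dim=1))

# Bin sizes grow by 2x towards the earlier, higher-resolution stages;
# the channel widths below are hypothetical:
ppm3 = PPMWithProjection(128, bins=(1, 2, 3, 6))    # after stage-3
ppm2 = PPMWithProjection(64, bins=(2, 4, 6, 12))    # after stage-2
ppm1 = PPMWithProjection(32, bins=(4, 8, 12, 24))   # after stage-1
```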
for identically scaled ASPPs and the other for multi-scale ASPPs. More specifically, the identically scaled distributed ASPP scheme has convolutions with dilation rates {12, 24, 36}. The multi-scale scheme, on the other hand, executes ASPPs with three different sets of dilation rates: ASPP-1 with {3, 6, 9}, ASPP-2 with {6, 12, 18}, and ASPP-3 with {12, 24, 36}. The evaluation results of the single-scale and multi-scale distributed ASPP are presented in Table 7 and Table 8, respectively. It is interesting to note that although a single ASPP gives slightly better accuracy than a single PPM (see Table 5), it increases the model size by more than 2×, which leads to a significant decrease in parameter efficiency. Moreover, when employed in a distributed fashion, the PPM-based DMPNet outperforms the ASPP-based one despite having significantly fewer parameters. The superiority of the distributed PPM scheme over ASPP can be attributed to the fact that the functionality of the latter is already implemented, to a small extent, by the proposed EMCA module thanks to its multi-scale dilated convolutions; using ASPPs in a distributed (repeated) fashion therefore leads to relatively less improvement. The PPM, on the other hand, aggregates context by pooling features, which enables the network to incorporate context in two different ways: one through the dilated convolutions in the EMCA module and the other through the pooled features in the PPM. This leads to a more diverse approach to contextual information extraction and hence explains the superiority of the distributed PPM-based DMPNet over the distributed ASPP-based one.

TABLE 7. Evaluation results of using ASPPs in a distributed fashion with a single scale, i.e., with convolutions having dilation rates {12, 24, 36}.

Encoder | ASPP in Stage-1 | ASPP in Stage-2 | ASPP in Stage-3 | mIoU (%) | Param (k)
Baseline | | | | 60.0 | 405.846
Baseline | | | ✓ | 61.4 | 1091.574
Baseline | | ✓ | ✓ | 61.9 | 1263.366
Baseline | ✓ | ✓ | ✓ | 62.2 | 1306.494

TABLE 8. Evaluation results of using ASPPs in a distributed fashion with three different scales.

For the decoder, we have followed one of the most common schemes, i.e., we employ two EMCA modules in the first two stages of the decoder. The third stage does not employ any EMCA module after upsampling, to reduce the computational burden. Table 9 presents the experimental details with regard to the different decoders. Decoders are used to progressively upsample the low-resolution feature maps generated by the encoder in order to recover the low-level details that are lost as a result of the multiple downsampling operations in the encoder. While some works have employed a sequential approach, the U-shape approach is the most common and most effective [10]. So, we have adopted a U-shape decoder design in our final architecture. Moreover, after the encoder and after each of the two fusion points of the decoder, we use the channel attention module. It enhances accuracy by giving more emphasis to the channels and features that are more meaningful. Table 9 shows that using the channel attention scheme results in 0.4% and 0.6% mIoU increments for the sequential and U-shape approaches, respectively. More importantly, the U-shape decoder achieves 0.9% more mIoU than the sequential approach without the channel attention scheme. Similarly, with channel attention, the U-shape decoder is 1.1% more accurate than the sequential approach. This proves the effectiveness of using the U-shape decoder in conjunction with channel attention.

TABLE 9. Evaluation result of employing different decoding strategies. "Seq." stands for a sequential approach as adopted by [36]. "Ch. Att." stands for channel attention.

Seq. | U-shape | Ch. Att. | mIoU (%) | Param (k)
✓ | | | 69.9 | 629.282
✓ | | ✓ | 70.3 | 629.291
| ✓ | | 70.8 | 655.643
| ✓ | ✓ | 71.4 | 655.652
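A minimal sketch of one U-shape fusion point, combining the deconvolution-based upsampler (deconvolution, activation, batch normalization), the pointwise compression of the encoder skip tensor ("P-1/x" in Fig. 4), and a channel-attention re-weighting, is given below. The channel widths, the compression factor, and the simplified attention gate (a bare GAP–conv–sigmoid stand-in for the fuller module sketched earlier) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FusionPoint(nn.Module):
    """Sketch of one U-shape fusion point: deconv -> PReLU -> BatchNorm
    upsampler, pointwise compression of the encoder skip tensor, channel
    concatenation, and a simple channel-attention gate."""

    def __init__(self, dec_ch: int, skip_ch: int, compress: int = 4):
        super().__init__()
        self.upsample = nn.Sequential(                 # upsampler block
            nn.ConvTranspose2d(dec_ch, dec_ch, kernel_size=2, stride=2),
            nn.PReLU(dec_ch),
            nn.BatchNorm2d(dec_ch),
        )
        # "P-1/x": pointwise projection compressing the skip tensor width
        self.compress = nn.Conv2d(skip_ch, skip_ch // compress, kernel_size=1)
        fused = dec_ch + skip_ch // compress
        self.gate = nn.Sequential(                     # simplified channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, dec, skip):
        dec = self.upsample(dec)                       # 2x spatial upsampling
        skip = self.compress(skip)                     # reduce skip width
        y = torch.cat([dec, skip], dim=1)              # fuse encoder and decoder
        return y * self.gate(y)                        # re-weight fused channels

# e.g., fusing stage-2 features (compressed 4x) into the decoder stream;
# the tensors and channel counts here are hypothetical:
# fuse = FusionPoint(dec_ch=64, skip_ch=64, compress=4)
# out = fuse(decoder_feats, stage2_feats)
```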
D. COMPARISON WITH STATE-OF-THE-ARTS
In this subsection, we compare our method with other state-of-the-art methods on the Cityscapes and CamVid datasets.

1) COMPARISON WITH STATE-OF-THE-ARTS ON CITYSCAPES

TABLE 10. Evaluation results (per-class basis) of different methods on the Cityscapes test set.

Method | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic light | Traffic sign | Vegetation | Terrain | Sky | Pedestrian | Rider | Car | Truck | Bus | Train | Motorbike | Bicycle | mIoU (%)
ENet | 96.3 | 74.2 | 75.0 | 32.2 | 33.2 | 43.4 | 34.1 | 44.0 | 88.6 | 61.4 | 90.6 | 65.5 | 38.4 | 90.6 | 36.9 | 50.5 | 48.1 | 38.8 | 55.4 | 58.3
ESPNet | 97.0 | 77.5 | 76.2 | 35.0 | 36.1 | 45.0 | 35.6 | 46.3 | 90.8 | 63.2 | 92.6 | 67.0 | 40.9 | 92.3 | 38.1 | 52.5 | 50.1 | 41.8 | 57.2 | 60.3
CGNet | 95.5 | 78.7 | 88.1 | 40.0 | 43.0 | 54.1 | 59.8 | 63.9 | 89.6 | 67.6 | 92.9 | 74.9 | 54.9 | 90.2 | 44.1 | 59.5 | 25.2 | 47.3 | 60.2 | 64.8
DABNet | 97.9 | 82.0 | 90.6 | 45.5 | 50.1 | 59.3 | 63.5 | 67.7 | 91.8 | 70.1 | 92.8 | 78.1 | 57.8 | 93.7 | 52.8 | 63.7 | 56.0 | 51.3 | 66.8 | 70.1
FBSNet | 98.0 | 83.2 | 91.5 | 50.9 | 53.5 | 62.5 | 67.6 | 71.5 | 92.7 | 70.5 | 94.4 | 82.5 | 63.8 | 93.9 | 50.5 | 56.0 | 37.6 | 56.2 | 70.1 | 70.9
LEDNet | 97.1 | 78.6 | 90.4 | 46.5 | 48.1 | 60.9 | 60.4 | 71.1 | 91.2 | 60.0 | 93.2 | 74.3 | 51.8 | 92.3 | 61.0 | 72.4 | 51.0 | 43.3 | 70.2 | 69.2
DMPNet | 97.8 | 82.2 | 91.0 | 47.8 | 44.2 | 58.5 | 64.0 | 68.8 | 91.2 | 69.0 | 94.5 | 79.5 | 60.1 | 93.7 | 57.2 | 70.2 | 63.1 | 52.7 | 66.0 | 71.1
TABLE 11. Comparison with other methods on the Cityscapes test set.

Methods | Pretrain (ImageNet) | mIoU (%) | Param. (M) | Speed (FPS)
FCN-8s [37] | ✓ | 65.3 | 134.5 | –
ICNet [38] | ✓ | 69.5 | 26.5 | 30.3
SegNet [39] | ✓ | 57.0 | 29.5 | 16.7
PSPNet [8] | ✓ | 81.2 | 250.8 | 0.78
DeepLab [19] | ✓ | 63.5 | 262.10 | 0.25
DeepLabv3+ [21] | ✓ | 75.2 | 15.40 | 8.40
BiSeNetV1 [11] | ✗ | 68.4 | 5.8 | 105.8
PIDNet [40] | ✗ | 78.6 | 7.6 | 93.2
HyperSeg [41] | ✓ | 78.1 | 10.2 | 16.1
DDRNet [35] | ✗ | 77.4 | 5.7 | 101.6
ERFNet [36] | ✗ | 68.0 | 2.2 | 41.7
BiSeNetV2 [42] | ✗ | 75.3 | 4.59 | 47.3
FASSDNet [10] | ✓ | 77.5 | 2.85 | 41.1
MSCFNet [43] | ✗ | 71.9 | 1.15 | 50.0
BiAttenNet [44] | ✗ | 74.7 | 2.2 | 89.2
MLFNet [31] | ✗ | 71.5 | 3.99 | 90.8
MFNet [16] | ✗ | 72.1 | 1.34 | 116.0
DABNet [6] | ✗ | 70.1 | 0.76 | 104.2
ContextNet [29] | ✗ | 66.1 | 0.88 | 65.5
EDANet [30] | ✗ | 67.3 | 0.68 | 81.3
LEDNet [27] | ✗ | 70.6 | 0.94 | 71.0
BANet [17] | ✗ | 70.1 | 0.72 | 83.2
FBSNet [14] | ✗ | 70.9 | 0.62 | 90.0
LETNet [7] | ✗ | 72.8 | 0.95 | 150.0
ENet [28] | ✗ | 58.3 | 0.36 | 74.9
ESPNet [5] | ✗ | 60.3 | 0.36 | 112.0
NDNet [12] | ✗ | 65.3 | 0.5 | 101.1
CGNet [26] | ✗ | 64.8 | 0.5 | 17.6
DMPNet (ours) | ✗ | 71.1 | 0.65 | 105.1

TABLE 12. Speed and run-time measurement of different methods at different spatial resolutions (quarter, half, and full Cityscapes resolution). For a fair comparison, the networks are executed on the same platform, i.e., a GTX 1080Ti, so the results may differ slightly from those reported in the original works.

Model | 256 × 512 (ms / fps) | 512 × 1024 (ms / fps) | 1024 × 2048 (ms / fps)
ENet [28] | 10 / 99.8 | 13 / 74.9 | 44 / 22.9
SegNet [39] | 16 / 64.2 | 56 / 17.9 | – / –
ERFNet [36] | 7 / 147.8 | 21 / 48.2 | 74 / 13.5
ICNet [38] | 9 / 107.9 | 15 / 67.2 | 40 / 25.1
ESPNet [5] | 5 / 182.5 | 9 / 115.2 | 30 / 33.3
DABNet [6] | 6 / 170.2 | 10 / 104.2 | 36 / 27.7
FBSNet [14] | 7 / 142.3 | 12 / 77.0 | 45 / 22.1
DMPNet (ours) | 5 / 178.0 | 9 / 105.1 | 31 / 31.8

DDRNet while being 2× smaller. We can observe a similar effect between BiSeNetV1 and BiSeNetV2 [42]. In spite of having 1.21 million fewer parameters, BiSeNetV2 achieves almost 7% more mIoU than BiSeNetV1.

These observations show that small-size networks can achieve accuracies similar to, or even higher than, those of multiple-times larger networks by means of smart architecture design. Leveraging this possibility leads to highly efficient networks with decent accuracies, which in turn provides practical solutions for resource-constrained platforms. To achieve this goal, we introduce a lightweight real-time DMPNet, which achieves 71.1% mIoU on the Cityscapes test set. It achieves better accuracy compared to most lightweight networks despite having a smaller model size. More specifically, our method is 1%, 5%, and 3.8% more accurate while having 110k, 230k, and 30k fewer parameters compared to DABNet, ContextNet, and EDANet, respectively. Compared to LETNet, a very recent state-of-the-art network, the mIoU of the proposed method is very close (only 1.7% less) despite having 300k fewer parameters. This shows the effectiveness of our method. It is also very interesting to compare our method with some of the mid-size methods, as shown in the mid-section of Table 11. Experimental results show that our method gives 3.1% better accuracy compared to ERFNet despite being more than 3× smaller. Also, DMPNet achieves accuracy very close to MLFNet and MFNet despite being 6× and 2× smaller, respectively. Furthermore, even when we compare our model with some of the large-size models (>5 million parameters), it achieves better accuracy than BiSeNetV1 [11] while being more than 8× smaller. The qualitative results of the proposed methods on the Cityscapes validation set are shown in Fig. 6.

Speed comparison: Inference speed highly depends upon the device and the input size. It is therefore very important for a fair comparison that we run all the methods on the same platform. All the experiments to compute the inference speeds have been done on a single NVIDIA GTX 1080Ti GPU. We present the comparison of the speed and run-time of our method with other state-of-the-art methods in Table 12. We conduct experiments with three different resolutions: quarter, half, and full resolution on Cityscapes images. The proposed DMPNet can process an incoming stream of images with resolution 512 × 1024 at a speed of 105.1 fps. This makes our method one of the fastest, second only to ESPNet, which achieves 115.2 fps. ESPNet, however, only achieves 60.3% mIoU, which is not sufficient for real-time applications, especially in risky scenarios such as autonomous driving. The accuracy- and speed-related experimental results show that our method achieves an excellent accuracy-efficiency balance, which demonstrates its effectiveness in real-time scenarios.
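For reproducibility, a typical measurement protocol consistent with the comparison above is sketched below: single GPU, batch size 1, warm-up iterations, and explicit CUDA synchronization around the timed loop. The warm-up and iteration counts are illustrative assumptions; the paper's exact harness is not given in this excerpt.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution, device="cuda", warmup=50, iters=200):
    """Rough FPS measurement: warm up first, then time `iters` forward passes
    with CUDA synchronization so GPU work is fully accounted for."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, *resolution, device=device)
    for _ in range(warmup):          # warm up kernels and the memory allocator
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)  # frames per second

# Quarter, half, and full Cityscapes resolutions used in Table 12:
# for res in [(256, 512), (512, 1024), (1024, 2048)]:
#     print(res, measure_fps(model, res))
```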
FIGURE 6. Qualitative results of the proposed methods on the Cityscapes validation set. Regions shown by yellow bounding boxes demonstrate how the model is
progressively getting better as we move from the baseline to the complete network, i.e., DMPNet.
TABLE 13. Comparison with other methods on the CamVid test set.

Methods | Pretrain (ImageNet) | mIoU (%) | Parameters (M)
FCN-8s [37] | ✓ | 57.0 | 134.5
SegNet [39] | ✓ | 55.6 | 29.5
ICNet [38] | ✓ | 67.1 | 26.5
BiSeNetV1 [11] | ✗ | 65.6 | 5.8
HyperSeg [41] | ✓ | 78.4 | 9.9
DDRNet [35] | ✗ | 74.7 | 5.7
BiSeNetV2 [42] | ✗ | 72.4 | 4.59
FASSDNet [10] | ✓ | 69.3 | 2.85
MSCFNet [43] | ✗ | 69.3 | 1.15
MFNet [16] | ✗ | 71.5 | 1.34
DABNet [6] | ✗ | 66.4 | 0.76
EDANet [30] | ✗ | 64.6 | 0.68
FBSNet [14] | ✗ | 68.9 | 0.62
LETNet [7] | ✗ | 70.5 | 0.95
ENet [28] | ✗ | 51.3 | 0.36
ESPNet [5] | ✗ | 55.6 | 0.36
NDNet [12] | ✗ | 57.2 | 0.5
CGNet [26] | ✗ | 65.6 | 0.5
DMPNet (ours) | ✗ | 69.2 | 0.65

Accuracy-efficiency trade-off comparison: Existing semantic segmentation models can be categorized into two broad categories: 1) lightweight networks with high processing speed but insufficient accuracy, and 2) methods that achieve decent accuracy but at the cost of increased model size and slow speed. PSPNet [8], DeepLabv3+ [21], HyperSeg [41], and DDRNet [35], among others [40], [42], belong to the latter category. These networks generally achieve very high accuracies (>75% mIoU) but are bulky (>5 million parameters in most cases). Moreover, most of these networks do not achieve real-time performance. Specifically, the speeds of PSPNet, DeepLab, DeepLabv3+, and HyperSeg are 0.78, 0.25, 8.40, and 16.1 FPS, respectively. Hence, these are not suitable for real-time applications on resource-constrained platforms. However, efficient design enables some of these large-scale networks to achieve real-time performance, e.g., PIDNet and DDRNet. ENet [28], ESPNet [5], CGNet [26], NDNet [12], etc., belong to the former category, which is computationally efficient but achieves insufficient accuracy for practical application, especially in risky scenarios such as autonomous driving. A third, narrower category is an emerging area of research that includes methods achieving decent accuracies with lightweight model size and high processing speed. Examples include DABNet [6], FBSNet [14], LETNet [7], and BANet [17].
Our DMPNet is 1% more accurate than both DABNet and BANet despite having 110k and 70k fewer parameters, respectively. Also, our method is faster than most of the methods in this third (trade-off-oriented) category. More specifically, DMPNet achieves 0.9, 21.9, and 15.1 higher FPS than DABNet, BANet, and FBSNet, respectively. LETNet achieves 1.7% more mIoU than DMPNet but with 200k more parameters. Although FBSNet achieves an excellent trade-off between accuracy and model size, it is 28 FPS slower than our model when executed on the same platform (see Table 12). This shows that the proposed DMPNet achieves a decent balance between accuracy, model size (parameters), and inference speed and thus gives a highly competitive performance compared to recent state-of-the-art methods.

2) COMPARISON WITH STATE-OF-THE-ARTS ON CAMVID
To show the effectiveness of our design and its generalizable nature, we train and evaluate our model on the CamVid dataset as well. The comparison of our method with other state-of-the-art methods is presented in Table 13. When we compare our method with the early benchmark methods, the proposed DMPNet is 12.2%, 13.6%, and 2.1% more accurate while being 206×, 45×, and 40× smaller than FCN-8s, SegNet, and ICNet, respectively. With respect to lightweight works (<1 million parameters), the proposed method achieves better accuracy than all the works except LETNet. To be more specific, our method achieves 2.8% and 4.6% more mIoU while having 110k and 30k fewer parameters than DABNet and EDANet, respectively. Compared to other ultra-lightweight works, such as ESPNet and CGNet, the proposed model achieves 13.6% and 3.6% higher mIoU with 290k and 150k more parameters. So, it can be observed that the proposed DMPNet achieves an excellent accuracy-efficiency trade-off.

3) COMPARISON WITH STATE-OF-THE-ARTS ON ADE20K
Most real-time semantic segmentation works proposed for autonomous driving evaluate their methods on road-scene datasets, for example, Cityscapes and CamVid. However, to show the generalizability of the proposed DMPNet, we train and evaluate our method on the highly challenging and comprehensive general-purpose ADE20K dataset as well. The comparison of our method with other state-of-the-art methods is presented in Table 14. Since it is mostly the large-scale general-purpose networks that report evaluation results on the ADE20K dataset, Table 14 includes large-scale models with tens of millions of parameters. The experimental results show that a carefully designed network with fewer parameters can achieve better performance in comparison to its larger counterparts. SegFormer [45], despite being 3× smaller, achieves 1.5% better accuracy than SASD [46]. The proposed DMPNet achieves accuracy very close to that of SASD, despite being 15× smaller. Apart from SASD, our method outperforms many classical methods as well. The proposed DMPNet is 5.81%, 13.56%, and 2.89% more accurate than FCN-8s, SegNet, and DilatedNet, respectively, despite being 181×, 64.35×, and 84× smaller. This shows that our method achieves an excellent accuracy-efficiency trade-off.

TABLE 14. Comparison with other methods on the ADE20K validation set.

Methods | Backbone (ImageNet) | pixAcc (%) | mIoU (%) | Param (M)
FCN-8s [37] | – | 71.32 | 29.39 | 134.5
SegNet [39] | – | 71.00 | 21.64 | 47.62
DilatedNet [47] | – | 73.55 | 32.31 | 62.74
PSPNet [8] | ResNet50 | 79.65 | 40.79 | 44.62
RefineNet [48] | ResNet152 | – | 40.7 | –
EncNet [49] | ResNet50 | 79.73 | 41.11 | –
SASD [46] | ResNet18 | 77.13 | 35.82 | 11.38
SegFormer [45] | – | – | 37.4 | 3.8
DMPNet (ours) | – | 76.8 | 35.2 | 0.74

V. CONCLUSION
In this work, we propose a novel Efficient Multi-scale Context Aggregation (EMCA) module to capture contextual information at multiple receptive fields. It uses a small symmetric kernel with a large dilation rate and a large asymmetric kernel with a relatively smaller dilation rate to efficiently extract sparse and dense context, respectively. Apart from these, a standard convolution without any dilation is also used to preserve local details. We also introduce a distributed multi-scale pyramid pooling (DMPP) strategy to extract context from all three levels of the feature hierarchy: low, mid, and high level. Based on the EMCA module and the DMPP strategy, we propose a lightweight and real-time network, called DMPNet, that achieves an excellent accuracy-efficiency trade-off. More specifically, the proposed DMPNet achieves 71.1% and 69.2% mIoU on the Cityscapes and CamVid datasets, respectively, with only 0.65 million parameters. Furthermore, it is capable of processing an incoming stream of high-resolution images at a speed of 105.1 frames per second (FPS).

VI. FUTURE SCOPE
Although our method achieves a decent trade-off between accuracy and efficiency, it suffers from the problem of class imbalance to a certain extent. Moreover, to recover finer details in the high-resolution segmentation map, deconvolution is conventionally employed, which negatively affects the processing speed of the model. So, future extensions of the proposed method will include developing more advanced techniques to counter the class-imbalance issue more effectively. Also, effective decoding strategies that reduce the dependency on deconvolution blocks will be explored.

ACKNOWLEDGMENT
The authors extend their appreciation to King Saud University for funding this research through Researchers Supporting Project Number (RSPD2024R890), King Saud University, Riyadh, Saudi Arabia.
REFERENCES
[1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Part III. Springer, 2015, pp. 234–241.
[3] W. Zhou, Y. Liu, C. Wang, Y. Zhan, Y. Dai, and R. Wang, “An automated learning framework with limited and cross-domain data for traffic equipment detection from surveillance videos,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24891–24903, 2022.
[4] X. Sun, Y. Qian, R. Cao, P. Tuerxun, and Z. Hu, “BGFNet: Semantic segmentation network based on boundary guidance,” IEEE Geoscience and Remote Sensing Letters, 2023.
[5] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 552–568.
[6] G. Li, I. Yun, J. Kim, and J. Kim, “DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation,” arXiv preprint arXiv:1907.11357, 2019.
[7] G. Xu, J. Li, G. Gao, H. Lu, J. Yang, and D. Yue, “Lightweight real-time semantic segmentation network with efficient transformer and CNN,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[9] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “DenseASPP for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[10] L. Rosas-Arias, G. Benitez-Garcia, J. Portillo-Portillo, J. Olivares-Mercado, G. Sanchez-Perez, and K. Yanai, “FASSD-Net: Fast and accurate real-time semantic segmentation for embedded systems,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14349–14360, 2021.
[11] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[12] Z. Yang, H. Yu, Q. Fu, W. Sun, W. Jia, M. Sun, and Z.-H. Mao, “NDNet: Narrow while deep network for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 9, pp. 5508–5519, 2020.
[13] X. Zhang, Z. Chen, Q. J. Wu, L. Cai, D. Lu, and X. Li, “Fast semantic segmentation for scene perception,” IEEE Transactions on Industrial Informatics, vol. 15, no. 2, pp. 1183–1192, 2018.
[14] G. Gao, G. Xu, J. Li, Y. Yu, H. Lu, and J. Yang, “FBSNet: A fast bilateral symmetrical network for real-time semantic segmentation,” IEEE Transactions on Multimedia, 2022.
[15] S. Mazhar, N. Atif, M. Bhuyan, and S. R. Ahamed, “Rethinking DABNet: Light-weight network for real-time semantic segmentation of road scenes,” IEEE Transactions on Artificial Intelligence, 2023.
[16] M. Lu, Z. Chen, C. Liu, S. Ma, L. Cai, and H. Qin, “MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 20991–21003, 2022.
[17] S. Mazhar, N. Atif, M. Bhuyan, and S. R. Ahamed, “Block attention network: A lightweight deep network for real-time semantic segmentation of road scenes in resource-constrained devices,” Engineering Applications of Artificial Intelligence, vol. 126, p. 107086, 2023.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[19] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[20] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[21] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[22] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia, “PSANet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
[23] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, “OCNet: Object context for semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 8, pp. 2375–2398, 2021.
[24] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid context network for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
[25] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, “Strip pooling: Rethinking spatial pooling for scene parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4003–4012.
[26] T. Wu, S. Tang, R. Zhang, J. Cao, and Y. Zhang, “CGNet: A light-weight context guided network for semantic segmentation,” IEEE Transactions on Image Processing, vol. 30, pp. 1169–1179, 2020.
[27] Y. Wang, Q. Zhou, J. Liu, J. Xiong, G. Gao, X. Wu, and L. J. Latecki, “LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 1860–1864.
[28] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[29] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach, “ContextNet: Exploring context and detail for semantic segmentation in real-time,” arXiv preprint arXiv:1805.04554, 2018.
[30] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Efficient dense modules of asymmetric convolution for real-time semantic segmentation,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
[31] J. Fan, F. Wang, H. Chu, X. Hu, Y. Cheng, and B. Gao, “MLFNet: Multi-level fusion network for real-time semantic segmentation of autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 756–767, 2022.
[32] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11534–11542.
[33] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[35] H. Pan, Y. Hong, W. Sun, and Y. Jia, “Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 3, pp. 3448–3460, 2022.
[36] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
[37] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[38] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.
[39] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[40] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “PIDNet: A real-time semantic segmentation network inspired by PID controllers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19529–19539.
[41] Y. Nirkin, L. Wolf, and T. Hassner, “HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
[42] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
[43] G. Gao, G. Xu, Y. Yu, J. Xie, J. Yang, and D. Yue, “MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25489–25499, 2021.
[44] G. Li, L. Li, and J. Zhang, “BiAttnNet: Bilateral attention for improving real-time semantic segmentation,” IEEE Signal Processing Letters, vol. 29, pp. 46–50, 2021.
[45] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[46] S. An, Q. Liao, Z. Lu, and J.-H. Xue, “Efficient semantic segmentation via self-attention and self-distillation,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15256–15266, 2022.
[47] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[48] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[49] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.

NADEEM ATIF received the bachelor's degree in technology in 2015 and the M.Tech. degree in 2017 from Z.H.C.E.T., A.M.U., Aligarh, India. He is currently pursuing the Ph.D. degree with the EEE Department, Indian Institute of Technology Guwahati, Guwahati, India. His research interests include computer vision, deep learning, and autonomous driving.

SAQUIB MAZHAR received the bachelor's degree in technology from NIT Hamirpur, Himachal Pradesh, India, and the M.Tech. degree in 2017 from Z.H.C.E.T., A.M.U., Aligarh, India. He is currently pursuing the Ph.D. degree with the EEE Department, Indian Institute of Technology Guwahati, Guwahati, India. His research interests include computer vision, deep learning, and autonomous driving.

SHAIK RAFI AHAMED (Senior Member, IEEE) received the B.Tech. and M.Tech. degrees in electronics and communication engineering from Sri Venkateswara University, Tirupati, India, in 1991 and 1993, respectively, and the Ph.D. degree from the Indian Institute of Technology Kharagpur, Kharagpur, India, in 2008. He is currently a Professor with the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India.

M.K. BHUYAN (Senior Member, IEEE) received the Ph.D. degree in electronics and communication engineering from the Indian Institute of Technology (IIT) Guwahati, India, in 2006. He was with the School of Information Technology and Electrical Engineering, University of Queensland, St. Lucia, QLD, Australia, where he was involved in postdoctoral research. Subsequently, he was a Researcher with the SAFE Sensor Research Group, NICTA, Brisbane, QLD. He is currently a Professor with the Department of Electronics and Electrical Engineering, IIT Guwahati, and the Associate Dean of Infrastructure, Planning and Management, IIT Guwahati. In 2014, he was a Visiting Professor with Indiana University and Purdue University, Indiana, USA. He is also working as a Visiting Professor with the Department of Computer Science, Chubu University, Japan. His current research interests include image/video processing, computer vision, machine and deep learning, human-computer interaction (HCI), virtual and augmented reality, and biomedical signal processing. He was a recipient of the National Award for Best Applied Research/Technological Innovation, presented by the Honorable President of India in 2012, the prestigious Fulbright-Nehru Academic and Professional Excellence Fellowship, and the BOYSCAST Fellowship.

SULTAN ALFARHOOD received the Ph.D. degree in computer science from the University of Arkansas. He is an Assistant Professor with the Department of Computer Science, King Saud University (KSU). Since joining KSU in 2007, he has made several contributions to the field of computer science through his research and publications. His research spans a variety of domains, including machine learning, recommender systems, linked open data, text mining, and ML-based IoT systems. His work includes proposing innovative approaches and techniques to enhance the accuracy and effectiveness of these systems, and his recent publications have focused on using deep learning and machine learning techniques to address challenges in these domains. His work has been published in several high-impact journals and conferences.

MEJDL SAFRAN is a researcher and educator in the field of artificial intelligence, with a focus on deep learning and its applications in various domains. He is currently an Assistant Professor of Computer Science at King Saud University, where he has been a faculty member since 2008. He obtained his bachelor's degree in computer science from King Saud University in 2007, his master's degree in computer science from Southern Illinois University Carbondale in 2013, and his doctoral degree in computer science from the same university in 2018. His doctoral dissertation was on developing efficient learning-based recommendation algorithms for top-N tasks and top-N workers in large-scale crowdsourcing systems. He has published more than 20 articles in peer-reviewed journals and conference proceedings, including ACM Transactions on Information Systems, Applied Computing and Informatics, Mathematics, Sustainability, International Journal of Digital Earth, IEEE Access, Biomedicine, Sensors, the IEEE International Conference on Cluster Computing, the IEEE International Conference on Computer and Information Science, the International Conference on Database Systems for Advanced Applications, and the International Conference on Computational Science and Computational Intelligence. He has been leading grant projects in the fields of AI in medical imaging and AI in smart farming. His current research interests include developing novel deep learning methods for image processing, pattern recognition, natural language processing, and predictive analytics, as well as modeling and analyzing user behavior and interest in online platforms. He has been working as an AI consultant for several national and international agencies since 2018.