Keywords:
Autonomous driving
Convolutional neural networks
Road scenes
Semantic segmentation
Real-time

Abstract
Deep-learning-based semantic segmentation networks typically incorporate object classification networks in their backbone. This leads to a loss of context because classification networks have a smaller field of view. The architecture has been extended to recover context with additional downsampling feature maps, a parallel context branch, or pyramid pooling modules after the backbone. However, these extensions increase multiply–accumulate operations and memory requirements, thus making them unsuitable for resource-constrained devices. To overcome this limitation, a novel convolutional building block with attention-based context guidance is proposed. The block is repeated to build an efficient encoder–decoder network. Our network runs in real time, has a lightweight design with only 0.72 million parameters, and achieves 70.1% and 66.3% mean intersection-over-union scores on the highly competitive Cityscapes and CamVid datasets, respectively. An efficient decoder is also designed to replace other semantic segmentation network decoders with minimal performance loss. Performance measures on mobile platforms show that our network suits resource-constrained devices. Further, experimental results show that the proposed method can optimally balance model size, inference speed, and segmentation accuracy.
1. Introduction

Semantic segmentation involves labelling each pixel in an image as belonging to a particular object class. This task serves as the basis for many computer vision applications, including autonomous driving, bio-medical image processing, robot navigation, and satellite imagery (Ronneberger et al., 2015; Saha and Chakraborty, 2018; Jung et al., 2022; Romera et al., 2018). Semantic segmentation methods can be broadly divided into classical and deep learning (DL) based. Classical approaches use thresholding, clustering, texture analyzers, boundary detectors and probabilistic graphical models. All these methods require the selection and handcrafting of features, which becomes more cumbersome as the number of segmentation classes increases. By contrast, deep learning enables convolutional neural networks (CNNs) and transformers to learn features in an image automatically. CNNs have been used effectively for computer vision tasks such as object detection, classification, and segmentation, but they are computationally intensive for pixel labelling. The recent advancement of intelligent vehicles and robotics requires deep learning models that can work in resource-constrained environments with limited computing power. Thus, large-scale models are challenging to implement on such devices, including embedded platforms. To solve this problem, researchers have proposed several lightweight networks (Paszke et al., 2018; Lo et al., 2019; Mehta et al., 2018; Wang et al., 2019). However, they have decreased accuracy due to limited feature representation capability. Thus, the accuracy, speed, and model size must be balanced.

Several novel networks have been designed for road-scene understanding tasks since the fully convolutional network (FCN) (Shelhamer et al., 2017) was first developed as an end-to-end trainable CNN-based semantic segmentation network. These methods were primarily based on backbone networks designed for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Deng et al., 2009), commonly known as ImageNet (e.g., VGG Net Liu and Deng, 2015; GoogLeNet Szegedy et al., 2015; ResNet He et al., 2016). These architectures mainly adopted a backbone using dilated convolutions because road-scene understanding requires a large field of view. However, their deep structure and dense convolutions were unsuitable for real-time deployment on resource-constrained devices.

ENet (Paszke et al., 2018) achieved real-time performance on semantic segmentation by factorizing the convolution kernels at the expense of only 0.37M parameters. Since then, ENet has become a de facto standard at the lower model size limit.
Fig. 1. Block diagram of the proposed network. The colour of the blocks represents their type, given in the top row. The encoder has three levels, represented as L1, L2 and L3. Each level operates at half the resolution of the previous level. The dimension of the feature maps is given at the bottom of the block as 'Height × Width × Channel'. n_Class is the number of classes in the final segmented image. The network consists of the proposed BA module, downsampler and lightweight decoder, the details of which are discussed in subsequent sections and figures.
While ESPNet (Mehta et al., 2018) used a pyramid of factorized convolutions to improve on ENet without increasing the model size, ERFNet (Romera et al., 2018) heavily relied on asymmetric convolutions and dilations, which increased the model size. ContextNet (Poudel et al., 2018) runs an additional context branch at full resolution in parallel to a backbone structure. DABNet (Li et al., 2019a) modified the basic block of ERFNet by dividing the asymmetric convolutions into two parallel branches. Despite these design choices, the networks could not improve performance while keeping the number of network parameters in check. Depth-wise separable convolution drastically reduces the parameter count at the cost of accuracy. Asymmetric convolution reduces the filter-kernel size with a minimal reduction in accuracy. However, large dilations in the filter kernels can significantly reduce the inference speed. Thus, different convolutions must be combined to balance accuracy with speed.

To this end, a new design for a lightweight encoder–decoder network is proposed. The network performs on par with large real-time semantic segmentation networks while keeping the number of parameters small. The encoder is built on a novel attention-guided asymmetric basic block. It has a two-branch structure to simultaneously learn the local and global features essential for the semantic segmentation of images. It utilizes group convolution in the main branch to reduce the parameters without compromising representation capability too much. The first parallel branch factorizes the 3 × 3 convolutions with an asymmetric 3 × 1 kernel followed by a 1 × 3 kernel. We use dilation in these asymmetric convolutions to expand the field of view. The second parallel branch uses dilated kernels, which help learn the local features. Inspired by ResNet (He et al., 2016), a skip connection is added to the basic module. It improves gradient flow during backpropagation. To this skip connection, an attention module is attached, leading to the weighted addition of the feature maps from the previous block to the present one. It will be shown that adding this attention-skip connection reduces the dilation rates to only 2. This leads to a significant speed-up of the network.

The main contributions of this paper are summarized below:

3. A novel, lightweight decoder is designed using efficient 3 × 3 convolutions to include in our network. With only simple skip connections from the encoder stages, it can recover some spatial details lost during the downsampling process.
4. Without any pre-training, post-refinement, or pyramid pooling modules (PPMs), our lightweight architecture achieves competitive results on the Cityscapes and CamVid datasets. Specifically, with only 0.72M parameters, BANet achieves a mIoU of 70.1% and 66.3% on the Cityscapes and CamVid benchmarks, respectively.

2. Related works

Increasing demand for the real-world deployment of semantic segmentation has led to a shift in designs from high-accuracy large models to moderate-accuracy but much smaller models. Starting with ENet, lightweight designs have attracted the research community's attention. Substantial work has been performed to explore the potential of small and lightweight architectures. These architectures can be broadly divided into three categories.

2.1. Context based architectures

Methods developed in recent years rely heavily on dilated backbones developed for dense prediction tasks. As contextual information plays a crucial role in the accurate semantic labelling of pixels, context-based models (Zhao et al., 2017; Chen et al., 2018; Yang et al., 2018; Nirkin et al., 2021) use dilated convolutions in the backbone network or attach a pyramid pooling of dilated convolutions as an end-block in the encoder. This helps to capture context information at different scales. However, convolution kernels at higher dilation rates are slow, as significant time is required for memory restructuring (Arani et al., 2021). Thus, they are sub-optimal for fast networks, which must perform in real-time.
The feature maps from the final output of the encoder have the smallest resolution and the largest depth. However, the decoder is responsible for restoring the reduced resolution and enhancing boundary information. It recovers the spatial information of the encoded and downsampled outputs.

Consequently, transformer-based networks require many floating-point operations (FLOPs) and thus are computationally intensive. This hinders their deployment on resource-constrained embedded devices.
3. Preliminary knowledge
This section introduces the basis of our design: the lightweight kernel decomposition of the standard convolution (Conv2d) and the attention module.

Lightweight Convolutions- Asymmetric, depthwise separable, group, and pointwise convolutions reduce a deep network's size and computational costs. For group convolution, the k × k filter kernel is divided into groups along the channel axis. This reduces the kernel weights and multiply–accumulate operations by a factor of the number of groups, a hyperparameter. Fig. 2 shows an example of group convolution for an input feature map of size Height × Width × (Channels = 6); a group value of 3 gives a three-fold reduction in kernel weights.

Depthwise separable convolution decomposes a Conv2d kernel into a group convolution kernel followed by a pointwise convolution. The number of groups, g, is equal to the number of input channels, so each channel of the H × W feature map is convolved with a single filter kernel of size k × k. For pointwise convolution, a 1 × 1 filter kernel is used instead of the standard k × k kernel. Table 1 presents the total parameter count of these factorized kernels. It shows that pointwise convolution is the lightest for a given input and output channel value. However, due to its sparsity, the hardware kernel calls for pointwise convolution are not as optimized as those for standard convolution.

Table 1
Parameter count of the standard and factorized convolution kernels.
Convolution   Kernel   Parameters
Standard      k × k    k² · Cin · Cout
Groupwise     k × k    k² · Cin · Cout / g
Pointwise     1 × 1    Cin · Cout
Asymmetric    k × 1    k · Cin · Cout
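As a quick illustration of Table 1, the following PyTorch sketch (our own, not part of the paper) instantiates each kernel type for hypothetical channel widths Cin = Cout = 64 and g = 4 and prints its parameter count.

```python
# Minimal sketch comparing parameter counts of the convolution types in Table 1,
# using hypothetical channel sizes C_in = C_out = 64 and g = 4.
import torch.nn as nn

k, c_in, c_out, g = 3, 64, 64, 4

convs = {
    "standard  (k x k)":  nn.Conv2d(c_in, c_out, k, padding=1, bias=False),
    "groupwise (k x k)":  nn.Conv2d(c_in, c_out, k, padding=1, groups=g, bias=False),
    "depthwise (k x k)":  nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),
    "pointwise (1 x 1)":  nn.Conv2d(c_in, c_out, 1, bias=False),
    "asymmetric (k x 1)": nn.Conv2d(c_in, c_out, (k, 1), padding=(1, 0), bias=False),
}

for name, conv in convs.items():
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"{name:>20}: {n_params:>6} parameters")
# standard: k^2*C_in*C_out = 36864, groupwise: 36864/g = 9216,
# depthwise: k^2*C_in = 576, pointwise: C_in*C_out = 4096, asymmetric: k*C_in*C_out = 12288
```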
Attention Module- By assigning varying weights to image pixels, the attention mechanism can selectively emphasize meaningful features while ignoring inconsequential information. Some recent publications (Yu et al., 2018, 2021) have focused on applying attention mechanisms to semantic segmentation to boost the network's feature learning ability and object segmentation accuracy. An improved attention refinement module (ARM), based on Yu et al. (2018), is proposed as shown in Fig. 3.

The ARM is improved for our application by replacing the 1 × 1 pointwise convolution with a 3 × 3 convolution. First, we require a larger receptive field because our attention module operates in the initial as well as the deeper layers. Second, it must learn contextual features for better semantics because it is used in every block. Our module also differs from the ARM in how it combines the weighted features. Directly multiplying the input features with the attention weights leads to the loss of certain features due to zero weights. This problem is solved with an extra connection that adds the input features to the weighted features.
Table 2
Detailed structure of the proposed Block Attention Network.
Block    Layer  Module   (S,D)a  Ch.  Output size
Encoder  1      Down     (2,1)   32   512 × 256
         2      BA       (1,2)   32   512 × 256
         3      BA       (1,2)   32   512 × 256
         4      Down1D   (2,1)   64   256 × 128
         5      BA       (1,2)   64   256 × 128
         6      BA       (1,2)   64   256 × 128
         7      BA       (1,2)   64   256 × 128
         8      Down1D   (2,1)   128  128 × 64
         9      BA       (1,2)   128  128 × 64
         10     BA       (1,2)   128  128 × 64
         11     BA       (1,2)   128  128 × 64
         12     BA       (1,2)   128  128 × 64
Decoder  13     Bi-Intb  ×2      128  256 × 128
         14     Conv3x3  (1,1)   64   256 × 128
         15     Bi-Int   ×2      64   512 × 256
         16     Conv3x3  (1,1)   20   512 × 256
         17     Bi-Int   ×2      20   1024 × 512
a S = Stride, D = Dilation; 'Ch.' denotes channels.
b Bi-Int = Bilinear Interpolation; '×2' denotes the interpolation factor.
The weighted features of the original ARM (Yu et al., 2018) are computed as X_w = X_in ⊗ σ(f_BN(W_0 ∗ f_avg(X_in))), where X_in is the set of feature maps at the input of the ARM, f_avg is the global average pooling function, W_0 is the 1 × 1 convolution kernel weight, f_BN is batch normalization, ⊗ denotes element-wise multiplication and σ is the sigmoid function. In contrast, our improved ARM computes the attention weights with a 3 × 3 kernel in place of W_0 and adds X_in back to the weighted features, so that channels suppressed by near-zero attention weights are not lost.
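A minimal PyTorch sketch of the improved ARM described above is given below. It assumes the global average pooling of the original ARM is retained; the class name, channel handling, and layer ordering are our interpretation of Fig. 3, not the authors' released code.

```python
import torch
import torch.nn as nn

class ImprovedARM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # f_avg: global average pooling
        self.conv = nn.Conv2d(channels, channels, 3,   # 3x3 kernel in place of the 1x1 of Yu et al. (2018)
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)             # f_BN
        self.sigmoid = nn.Sigmoid()                    # sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.sigmoid(self.bn(self.conv(self.pool(x))))   # per-channel attention weights
        return x * w + x        # weighted features plus the extra input connection

if __name__ == "__main__":
    x = torch.randn(2, 64, 64, 128)
    print(ImprovedARM(64)(x).shape)                    # torch.Size([2, 64, 64, 128])
```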
4. Proposed method
4.1. Block-attention module

The basic block of our proposed network, referred to as the block-attention module (BA), is shown in Fig. 4. The efficient bottleneck design of He et al. (2016) is used, as it reduces the number of channels before performing convolution and non-linear operations, thus reducing the computations.

The BA module jointly learns the local and global contextual features, quintessential for accurate semantic pixel labelling (Li et al., 2019a; Wu et al., 2021; Wang et al., 2019; Romera et al., 2018), using a two-branch structure. Local features are extracted in the first branch using a 3 × 3 depth-wise separable convolution without dilation. This is followed by a point-wise fusion operation performed by a 1 × 1 convolution. These design choices drastically reduce the parameter count, reducing the computation and memory requirements. The module uses the channel shuffle operation (Zhang et al., 2017) to improve the accuracy. The second branch is designed to extract global contextual features. Asymmetric convolution is used to minimize the parameter count: an n × n 2D convolution kernel is approximated by n × 1 and 1 × n kernels. For a 3 × 3 convolution, this essentially means a 33% reduction in the number of parameters, from 9 to 6 per channel. Dilation is used in the asymmetric kernels to expand the receptive field and gain global context. Batch normalization and ReLU non-linearity are used after every convolution layer. The feature learning is further enhanced by using a 3 × 3 convolution in the input branch followed by 1 × 1 convolutions.

A skip connection is used to prevent the problem of vanishing gradients (He et al., 2016) during back-propagation in the BA module. However, unlike other networks, an attention-guided skip connection is introduced. This design technique reduces the dilation rates to only 2, thus improving network speed.
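The sketch below assembles one possible realization of the BA module in PyTorch. The bottleneck width, the fusion of the two branches by addition, the position of the channel shuffle, and the form of the attention gate on the skip connection are assumptions made for illustration; Fig. 4 defines the authoritative structure.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # ShuffleNet-style channel shuffle (Zhang et al., 2017).
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

def conv_bn_relu(in_ch, out_ch, kernel, dilation=1, groups=1):
    # Convolution followed by batch normalization and ReLU, with 'same' padding.
    ks = kernel if isinstance(kernel, tuple) else (kernel, kernel)
    dil = dilation if isinstance(dilation, tuple) else (dilation, dilation)
    pad = tuple(d * (k // 2) for k, d in zip(ks, dil))
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, ks, padding=pad, dilation=dil, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class BAModule(nn.Module):
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        mid = channels // 2                                   # bottleneck reduction
        self.reduce = conv_bn_relu(channels, mid, 3)          # 3x3 conv in the input branch
        # branch 1: 3x3 depth-wise separable conv (no dilation) + 1x1 point-wise fusion
        self.local = nn.Sequential(conv_bn_relu(mid, mid, 3, groups=mid),
                                   conv_bn_relu(mid, mid, 1))
        # branch 2: dilated asymmetric 3x1 followed by 1x3 convolutions (global context)
        self.glob = nn.Sequential(conv_bn_relu(mid, mid, (3, 1), dilation=(dilation, 1)),
                                  conv_bn_relu(mid, mid, (1, 3), dilation=(1, dilation)))
        self.expand = conv_bn_relu(mid, channels, 1)          # back to the block width
        # attention gate on the skip connection (improved-ARM style)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels),
                                  nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        y = self.expand(self.local(y) + self.glob(y))         # fuse the two branches
        y = channel_shuffle(y, groups=4)
        return y + x * self.gate(x) + x                       # attention-weighted skip + identity

if __name__ == "__main__":
    x = torch.randn(2, 64, 128, 256)
    print(BAModule(64)(x).shape)                              # torch.Size([2, 64, 128, 256])
```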
4.2. Downsampler

To reduce the spatial resolution of the feature maps and learn scale-invariant features, it is common practice to downsample at each stage in the network. Techniques include max-pooling (Giusti et al., 2013), average pooling (Lin et al., 2013), and strided convolution. ENet showed that using a pooling operation and a strided convolution in parallel reduces the loss of finer details while downsampling. A similar downsampling module, shown in Fig. 5, is designed with 1D asymmetric convolutions to reduce the weights further. This makes our downsampling module extremely lightweight and hence suitable for low-memory devices. A convolution block is added after the concatenation to fuse the features efficiently.
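A corresponding sketch of the 1D-asymmetric downsampler of Fig. 5 is shown below; the split of output channels between the pooling and convolution paths follows the ENet convention and is an assumption on our part.

```python
import torch
import torch.nn as nn

class Downsampler1D(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        conv_ch = out_ch - in_ch                     # pooled channels fill the remainder
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv = nn.Sequential(                   # strided 3x1 then 1x3 asymmetric kernels
            nn.Conv2d(in_ch, conv_ch, (3, 1), stride=(2, 1), padding=(1, 0), bias=False),
            nn.Conv2d(conv_ch, conv_ch, (1, 3), stride=(1, 2), padding=(0, 1), bias=False))
        self.fuse = nn.Sequential(                   # fuses the concatenated features
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.pool(x), self.conv(x)], dim=1))

if __name__ == "__main__":
    x = torch.randn(2, 32, 256, 128)
    print(Downsampler1D(32, 64)(x).shape)            # torch.Size([2, 64, 128, 64])
```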
5. Experiments
Fig. 7. Different variants of BA Module. (a) is a lightweight design having a single 1 × 1 convolution in the main branch. (b) and (c) vary in the position of the channel shuffle
operation.
Table 5
Accuracy comparison for different layer combinations in BANet.
#Encoder layers  #Channelsb   Timea (ms)  #Parameters (M)  mIoU (%)
(–,4,8)          (–,64,128)   17.5        1.20             68.8
(2,4,8)          (32,64,128)  18.3        1.30             69.1
(2,3,6)          (32,64,128)  13.9        0.93             68.2
(2,3,4)          (32,64,128)  12.5        0.72             69.8
(2,2,4)          (32,64,128)  11.3        0.63             68.5
a Denotes the forward-pass time for a single input image of resolution 1024 × 512, in milliseconds.
b Denotes the channels in (Stage1, Stage2, Stage3).

Table 6
Decoder ablation studies. Results of various feature fusion strategies in the decoder.
Decoder  mIoU (%)

Table 7
Decoder ablation studies. Results of our encoder combined with different decoders.
Decoder                                #Parameters (M)  mIoU (%)
ERFNet (Romera et al., 2018)           0.15             64.3
ESPNet (Mehta et al., 2018)            0.04             63.5
LEDNet (Wang et al., 2019)             1.30             65.2
RGPNet (Arani et al., 2021)            2.00             67.5
FASSD-Net (Rosas-Arias et al., 2021)   0.95             66.4
Proposed                               0.09             69.8

Table 8
Accuracy and inference time comparison for different dilation rates and residual connections.
Dilation rates  Residual connection
2*(1) indicates 2 blocks with dilation '1' in a stage.

The channel shuffle layer is put after adding the residual connection; this block design performs the best.
Backbone: As mentioned earlier, the backbone is built using BA
and downsampling modules. In Table 5, it is seen that having (2,
3, 4) BA modules in corresponding stages gives the best accuracy. A
higher number of modules in the initial layer increased inference time
because of the large feature map. Increasing the network depth had an
inhibitory effect on the network performance. It not only increases the
number of parameters but also increases the latency. Given that our
basic module could learn the required features with a shallow setting,
greater depth showed no improvement.
The Decoder: In this ablation study, the BA module-based backbone is fixed and two decoder variants are designed. The first employed bilinear upsampling with 3 × 3 convolution layers and the second employed 3 × 3 deconvolution layers directly. As seen in Table 6, the first combination has the highest speed and accuracy. For the primary input to the decoder, the final encoder stage feature maps are taken, which are 1/8 of the resolution of the input image. For the skip architecture, two auxiliary outputs from the backbone are taken, one after the first stage (1/2 resolution) and another after the second stage (1/4 resolution). This design strategy is adapted from Shelhamer et al. (2017). We have experimented with normalized addition, multiplication, and concatenation of the auxiliary skip connections with the primary input. In the case of addition or multiplication, the first-stage output is applied with [1 × 1, 64] convolutions, where 1 × 1 is the kernel size and 64 is the number of channels, while for concatenation a [1 × 1, 32] convolution is used. Similarly, the second-stage output is applied with [1 × 1, 128] and [1 × 1, 64] convolutions. The channel widths are reduced by half for the concatenation operation while keeping other settings unchanged.

To show the efficiency of our decoder design, we trained the BANet backbone with some of the popularly used decoders from Romera et al. (2018), Mehta et al. (2018), Wang et al. (2019), Arani et al. (2021) and Rosas-Arias et al. (2021). The results in Table 7 show that the proposed decoder design performs the best. Our network is shallower than other networks, thus requiring a custom-designed decoder.
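The decoder and skip-fusion strategy described above can be sketched as follows; the channel widths follow Table 2, while the exact form of the normalized addition (here a simple averaging of the two terms) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BADecoder(nn.Module):
    def __init__(self, n_class: int = 20):
        super().__init__()
        self.proj_s2 = nn.Conv2d(64, 128, 1, bias=False)    # stage-2 skip (1/4 resolution)
        self.proj_s1 = nn.Conv2d(32, 64, 1, bias=False)     # stage-1 skip (1/2 resolution)
        self.conv1 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.conv2 = nn.Conv2d(64, n_class, 3, padding=1)

    def forward(self, x_enc, skip_s2, skip_s1):
        # x_enc: 128 ch at 1/8 res; skip_s2: 64 ch at 1/4 res; skip_s1: 32 ch at 1/2 res
        x = F.interpolate(x_enc, scale_factor=2, mode="bilinear", align_corners=False)
        x = 0.5 * (x + self.proj_s2(skip_s2))               # normalized addition at 1/4 res
        x = self.conv1(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = 0.5 * (x + self.proj_s1(skip_s1))               # normalized addition at 1/2 res
        x = self.conv2(x)
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = BADecoder()(torch.randn(2, 128, 64, 128),
                      torch.randn(2, 64, 128, 256),
                      torch.randn(2, 32, 256, 512))
    print(out.shape)                                        # torch.Size([2, 20, 512, 1024])
```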
Dilation rates: To improve contextual information, semantic segmentation networks use dilation of the convolution kernels in the backbone (Romera et al., 2018; Paszke et al., 2018; Wu et al., 2021). This increases the pixel's field of view at the kernel's centre. Increasing the dilation rate with the depth of the network is common (Yang et al., 2021; Hong et al., 2021; Wang et al., 2019). Table 8 shows the ablation study for different dilation rates in the backbone. Dilations are used only in the asymmetric convolution branch of the BA module. It is observed that, in the presence of an attention-based skip connection in the basic block, a dilation rate of two throughout gives the best accuracy. Increasing the dilation rate further has a detrimental effect. However, if we replace the attention skip connection with the standard skip connection, it is seen that similar accuracy is attained at higher dilation rates. This further reinforces our claim of the efficacy of the BA module.

Fig. 8. Speed, accuracy, and model size comparison on the Cityscapes test set. The size of the bubble represents the number of parameters written on top of each. The proposed method is highlighted in red. It balances the accuracy-speed tradeoff while having a small model size. Methods with high accuracy have large sizes and are significantly slower.

5.4. Comparisons on the Cityscapes benchmark

This section compares the overall performance of the proposed architecture with state-of-the-art semantic segmentation networks on the Cityscapes dataset. The best-performing deep network model, in terms of accuracy, is ViT-Adapter-L (Chen et al., 2022), which is based on the resource-hungry vision transformer (Dosovitskiy et al., 2021). Table 9 shows that state-of-the-art models, which require many parameters and FLOPs (Zhao et al., 2017; Yang et al., 2018; Chen et al., 2022; Xie et al., 2021; Zhang et al., 2022a), report the top mIoU scores. Such models are not readily deployable on resource-constrained embedded devices (see Fig. 8).
Table 9
Comparison of evaluation results of BANet with similar networks on Cityscapes test benchmark. The first section is for high-accuracy methods,
which are usually very large. The second section is for lightweight methods.
Method FLOPs (G) InputSize mIoU (%) FPSa Parameters (M)
Dense ASPP (Yang et al., 2018) 214.7 2048 × 1024 80.6 <1 28.6
SegFormer-B5 (Xie et al., 2021) 1447.60 1024 × 1024 84.0 2.5 84.7
Trans4trans (Zhang et al., 2022a) 94.25 1536 × 768 81.5 27.3 49.5
PSP Net (Zhao et al., 2017) 453.6 2048 × 1024 78.4 <1 65.6
ViT-Adapter-L (Chen et al., 2022) – 896 × 896 84.9 <1 347.9
ENet (Paszke et al., 2018) 3.8 640 × 360 58.3 135.4 0.4
ESPNet (Mehta et al., 2018) 4.0 512 × 1024 60.3 112 0.4
EDANet (Lo et al., 2019) 11.34 512 × 1024 67.3 81.3 0.7
ERFNet (Romera et al., 2018) 21.0 512 × 1024 68.0 41 2.1
LBN-AA (Dong et al., 2021) 49.5 448 × 896 73.6 51 6.2
CFPNet (Lou and Loew, 2021) 14.7 1024 × 2048 70.1 30.0 0.55
ICNet (Zhao et al., 2018) 28.3 1024 × 2048 69.5 30.3 7.8
Hyperseg-M (Nirkin et al., 2021) 7.5 512 × 1024 75.8 36.9 10.1
FDDWNet (Liu et al., 2020) 10.3 512 × 1024 71.5 60 0.8
BiSeNet (Yu et al., 2018) 14.8 768 × 1536 68.4 72.3 5.8
MLFNet-MobileV2 (Fan et al., 2023) 4.67 512 × 1024 71.5 90.8 4.0
ContextNet (Poudel et al., 2018) 6.74 1024 × 2048 66.1 65.5 0.85
FASSDNet (Rosas-Arias et al., 2021) 45.1 1024 × 2048 76.0 41.1 2.85
LEDNet (Wang et al., 2019) 11.44 512 × 1024 70.6 40.0 0.94
CABiNet (Yang et al., 2021) 12 1024 × 2048 75.9 76.5 2.64
CGNet (Wu et al., 2021) 7.14 512 × 1024 64.8 50 0.5
BANet 6.96 512 × 1024 70.1 83.2 0.72
‘‘–’’ indicates that the method does not report the result.
a Reported on different GPUs.
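For reference, the mIoU values reported in Tables 9 and 10 follow the standard definition computed from a per-class confusion matrix; the short sketch below shows this computation with an illustrative 19-class matrix.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    # conf[i, j] = number of pixels of ground-truth class i predicted as class j
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)   # skip absent classes
    return float(np.nanmean(iou))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.integers(0, 1000, size=(19, 19))    # e.g., 19 Cityscapes evaluation classes
    print(f"mIoU = {100 * mean_iou(conf):.1f}%")
```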
To further demonstrate the generalization ability of the proposed model, it is compared with some state-of-the-art networks trained on the CamVid dataset in Table 10. Our network has decent accuracy with a small memory footprint while running in real-time. Specifically, the accuracy of the proposed method is comparable to BiSeNet but with 8× fewer parameters. However, due to the greater usage of 1 × 1 convolution kernels, our network is slower than BiSeNet. EDANet (Lo et al., 2019) performs close to our method but at a significantly lower speed. FASSDNet outperforms in terms of speed and accuracy but with 4× more parameters. LAANet (Zhang et al., 2022b) was among the fastest but worked at a quarter of the resolution of the proposed network. These results show that our method can perform on par with state-of-the-art methods while being lightweight.

5.6. Qualitative results

The segmented images from the Cityscapes validation set and CamVid test set are shown in Figs. 10 and 11, respectively. Fig. 10 shows that the lowest dilation rates in the backbone achieve the best qualitative results. The first row in the figure shows only the encoder output. Without a decoder, the model performs poorly, as seen in these segmented images. The CamVid visualization results show that, while our method can accurately segment large objects like roads, cars, vegetation, buildings, and sky, it misses smaller objects like poles and traffic signs.

The segmented images from the unseen Berkeley Deep Drive (BDD) (Yu et al., 2020) dataset are also visualized in Fig. 12.
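The qualitative results are obtained by reducing the network's per-pixel logits to class labels and mapping them to a colour palette; a minimal sketch of this rendering step is given below, with a random palette standing in for the Cityscapes colours.

```python
import numpy as np
import torch

def colorize(logits: torch.Tensor, palette: np.ndarray) -> np.ndarray:
    # logits: (1, n_class, H, W) -> colour image (H, W, 3)
    labels = logits.argmax(dim=1).squeeze(0).cpu().numpy()
    return palette[labels]

if __name__ == "__main__":
    n_class, h, w = 20, 512, 1024
    palette = np.random.default_rng(0).integers(0, 256, size=(n_class, 3), dtype=np.uint8)
    logits = torch.randn(1, n_class, h, w)        # stand-in for BANet output
    print(colorize(logits, palette).shape)        # (512, 1024, 3)
```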
Fig. 10. Qualitative results on the Cityscapes validation set. First row: Input image; second row: ground truth; third row: BANet output without normalized addition and decoder; fourth row: BANet output with normalized addition and decoder. It can be seen that the method performs the best with the decoder.
Fig. 11. Qualitative results: Validation images on the CamVid dataset. First row: Input image; second row: ground truth; third row: BANet output.
Fig. 12. Qualitative results on the unseen dataset: Validation results on Berkeley Deep Drive (BDD) dataset. First column: Input image; second column: ground truth; third column:
BANet output.
Table 10
Comparison of evaluation results of BANet with similar networks on CamVid test benchmark.
Method InputSize mIoU (%) FPSa Parameters (M)
RTFormer-Slim (Wang et al., 2022) 960 × 720 81.4 190.7 4.8
SegNet (Badrinarayanan et al., 2017) 480 × 360 46.4 49.4 15.2
EDANet (Lo et al., 2019) 480 × 360 66.4 40.75 0.7
DFANet (Li et al., 2019b) 960 × 720 59.3 116 7.8
SwiftNet-MN (Oršić and Šegvić, 2021) 960 × 720 65.0 27.7 11.8
ICNet (Zhao et al., 2018) 960 × 720 67.1 46.7 7.8
BiSeNet (Yu et al., 2018) 960 × 720 65.6 175 5.8
BiSeNetV2-L (Yu et al., 2021) 960 × 720 73.2 32.7 –
MLFNet-Res34 (Fan et al., 2023) 960 × 720 69.0 57.2 4.0
LAANet (Zhang et al., 2022b) 480 × 360 67.9 112.5 0.7
CGNet (Wu et al., 2021) 960 × 720 65.6 59.0 0.5
FASSDNet (Rosas-Arias et al., 2021) 960 × 720 69.3 80.0 2.85
BANet 960 × 720 66.3 68.1 0.72
‘‘–’’ indicates that the method does not report the result.
a As reported by the respective methods on different GPUs.
This dataset was chosen for its class compatibility with Cityscapes. It is observed that the method performs well for large objects, even on unseen road-scene datasets. In addition, to further verify the generalizability of the proposed network, we obtain segmentation results on images captured on the institute campus, shown in Fig. 13. The images are randomly selected from videos recorded through a car-dash-mounted camera at 1080p and 30 fps. The network used is pre-trained on the Cityscapes dataset. The visualization shows that the performance of the network, even on unseen images, is at par with that on the datasets used for training.

5.7. Evaluation on mobile GPU systems

Mobile GPUs are manufactured for resource-constrained devices like laptops and embedded systems. They are designed with fewer CUDA cores and a lower clock frequency. As a result, they have low power consumption, on the order of 20 W. The proposed method is evaluated on two mobile GPUs, the NVIDIA Jetson Xavier and the NVIDIA 940MX. Table 11 presents their technical specifications. Both GPUs are comparable, apart from their speed and memory bandwidth. In Table 12, we present the performance analysis of the proposed model for two different input resolutions. Our model runs faster than SwiftNet, LEDNet, BiSeNet, and CaBiNet on both devices for a 1024 × 512 input image. This can be attributed to the lightweight design. However, ENet is faster than the other methods because of its shallow and extremely small network structure, but it has low accuracy. Despite having a very small size, LEDNet shows an abysmal speed. This could be due to its extremely complex decoder design resulting in many FLOPs (ref. Table 9).

Table 11
Specifications of the mobile-GPU cards used for inference.
GPU card  Cores  Speed    RAM   Mem.B/W    Power
Xavier    384    854 MHz  8 GB  51.2 GBps  15 W
940MX     384    1.2 GHz  4 GB  40.1 GBps  20 W
'RAM': Random Access Memory; 'Mem.B/W': Memory Bandwidth.
Fig. 13. Qualitative results on unseen images: Results on our custom captured images in the institute campus. First/third column: Input images; second/fourth column: BANet
outputs.
Table 12
Inference speed (fps) of the proposed method for different input image resolutions on mobile GPU-based systems.
Method                        Jetson Xavier              940MX                      mIoU (%)  Params (M)
                              2048 × 1024  1024 × 512    2048 × 1024  1024 × 512
SwiftNet (Oršić and Šegvić, 2021) 2.61 9.9 2.4 8.64 70.2 11.8
ENet (Paszke et al., 2018) 13.8 53.82 13.4 45.56 58.3 0.4
ContextNet (Poudel et al., 2018) 10.49 40.91 10.2 35.7 66.1 0.85
Fast-SCNN (Poudel et al., 2019) 11.49 40.21 11.34 37.42 68.4 1.14
LEDNet (Wang et al., 2019) 0.7 2.73 0.5 1.75 70.6 0.94
BiSeNet (Yu et al., 2018) 2.42 9.6 2.1 7.35 68.4 5.8
CaBiNet (Yang et al., 2021) 8.21 30.95 7.9 27.65 76.5 2.64
FASSDNet (Rosas-Arias et al., 2021) 7.3 29.2 7.1 26.27 76.0 2.85
MLFNet (Fan et al., 2023) 8.41 32.79 8.15 30.15 71.5 4.0
BANet 9.02 34.23 8.2 30.12 70.1 0.72
Fast-SCNN is faster than the proposed model but with a lower accuracy score.
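The inference speeds in Table 12 depend on how timing is measured; the sketch below shows a typical way to estimate fps for a given input resolution, with the warm-up and iteration counts chosen by us rather than taken from the paper's protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution=(512, 1024), iters=100, warmup=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, *resolution, device=device)
    for _ in range(warmup):                       # warm-up to stabilize clocks / cudnn
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

if __name__ == "__main__":
    dummy = torch.nn.Conv2d(3, 20, 3, padding=1)  # stand-in for the segmentation model under test
    print(f"{measure_fps(dummy):.1f} fps")
```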
6. Conclusion

This paper proposed a lightweight real-time semantic segmentation model for road-scene understanding, targeting resource-constrained devices. We have developed a novel block-attention module that uses simple attention-based skip connections for enhanced feature learning capabilities. The network based on the block-attention module has reduced latency due to lower dilation rates in the encoder backbone. We have also introduced a lightweight and efficient decoder design in our network, which performs better than popular decoders. Through extensive experiments on the highly competitive Cityscapes and CamVid benchmark datasets, we have presented the efficacy of our design choices. Appropriately balancing size, accuracy, and speed, BANet can outperform some of the state-of-the-art lightweight models while achieving 70.1% and 66.3% mIoU on the Cityscapes and CamVid datasets in real time. The inference speed of 34.23 fps on mobile devices further shows that the proposed model is suitable for resource-constrained devices.

7. Future work

CRediT authorship contribution statement

Saquib Mazhar: Conceptualization, Methodology, Software, Writing – original draft. Nadeem Atif: Validation, Writing – review & editing. M.K. Bhuyan: Validation, Supervision. Shaik Rafi Ahamed: Validation, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgement

We acknowledge the Science and Engineering Research Board, Department of Science and Technology, Government of India, for the financial support for Project CRG/2022/003473.
References

Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495. http://dx.doi.org/10.1109/TPAMI.2016.2644615.
Brostow, G.J., Fauqueur, J., Cipolla, R., 2009. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 30 (2), 88–97. http://dx.doi.org/10.1016/j.patrec.2008.04.005, URL: https://www.sciencedirect.com/science/article/pii/S0167865508001220. Video-based Object and Event Analysis.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y., 2022. Vision transformer adapter for dense predictions. http://dx.doi.org/10.48550/ARXIV.2205.08534, URL: https://arxiv.org/abs/2205.08534.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848. http://dx.doi.org/10.1109/tpami.2017.2699184.
Cordts, M., Omran, M., Ramos, S., Scharwächter, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2015. The cityscapes dataset. In: CVPR Workshop on the Future of Datasets in Vision.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255.
Dong, G., Yan, Y., Shen, C., Wang, H., 2021. Real-time high-performance semantic image segmentation of urban street scenes. IEEE Trans. Intell. Transp. Syst. 22 (6), 3258–3274. http://dx.doi.org/10.1109/TITS.2020.2980426.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Fan, J., Wang, F., Chu, H., Hu, X., Cheng, Y., Gao, B., 2023. MLFNet: Multi-level fusion network for real-time semantic segmentation of autonomous driving. IEEE Trans. Intell. Veh. 8 (1), 756–767. http://dx.doi.org/10.1109/TIV.2022.3176860.
Giusti, A., Cireşan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J., 2013. Fast image scanning with deep max-pooling convolutional neural networks. In: 2013 IEEE International Conference on Image Processing. pp. 4034–4038. http://dx.doi.org/10.1109/ICIP.2013.6738831.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. http://dx.doi.org/10.1109/CVPR.2016.90.
Hong, Y., Pan, H., Sun, W., Jia, Y., 2021. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085.
Jung, H., Choi, H.-S., Kang, M., 2022. Boundary enhancement semantic segmentation for building extraction from remote sensed image. IEEE Trans. Geosci. Remote Sens. 60, 1–12. http://dx.doi.org/10.1109/TGRS.2021.3108781.
Kaur, A., Chauhan, A.P.S., Aggarwal, A.K., 2023. Prediction of enhancers in DNA sequence data using a hybrid CNN-DLSTM model. IEEE/ACM Trans. Comput. Biol. Bioinform. 20 (2), 1327–1336. http://dx.doi.org/10.1109/TCBB.2022.3167090.
Li, H., Xiong, P., Fan, H., Sun, J., 2019b. DFANet: Deep feature aggregation for real-time semantic segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9514–9523. http://dx.doi.org/10.1109/CVPR.2019.00975.
Li, G., Yun, I.Y., Kim, J., Kim, J., 2019a. DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. In: BMVC.
Lin, M., Chen, Q., Yan, S., 2013. Network in network. arXiv preprint. http://dx.doi.org/10.48550/ARXIV.1312.4400.
Liu, S., Deng, W., 2015. Very deep convolutional neural network based image classification using small training sample size. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR). pp. 730–734. http://dx.doi.org/10.1109/ACPR.2015.7486599.
Liu, J., Zhou, Q., Qiang, Y., Kang, B., Wu, X., Zheng, B., 2020. FDDWNet: A lightweight convolutional neural network for real-time semantic segmentation. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2373–2377. http://dx.doi.org/10.1109/ICASSP40776.2020.9053838.
Lo, S.Y., Hang, H.M., Chan, S.W., Lin, J.J., 2019. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In: Proceedings of the ACM Multimedia Asia. pp. 1–6.
Lou, A., Loew, M., 2021. CFPNET: Channel-wise feature pyramid for real-time semantic segmentation. In: 2021 IEEE International Conference on Image Processing (ICIP). pp. 1894–1898. http://dx.doi.org/10.1109/ICIP42928.2021.9506485.
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H., 2018. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV).
Nirkin, Y., Wolf, L., Hassner, T., 2021. HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4060–4069. http://dx.doi.org/10.1109/CVPR46437.2021.00405.
Oršić, M., Šegvić, S., 2021. Efficient semantic segmentation with pyramidal fusion. Pattern Recognit. 110, 107611. http://dx.doi.org/10.1016/j.patcog.2020.107611.
Paszke, A., Chaurasia, A., Kim, S., Culurciello, E., 2018. ENet: A deep neural network architecture for real-time semantic segmentation. In: 4th International Conference on Learning Representations, ICLR.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E.Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Annual Conference on Neural Information Processing Systems, NeurIPS. pp. 8024–8035.
Poudel, R.P.K., Bonde, U.D., Liwicki, S., Zach, C., 2018. ContextNet: Exploring context and detail for semantic segmentation in real-time. In: BMVC.
Poudel, R.P.K., Liwicki, S., Cipolla, R., 2019. Fast-SCNN: Fast semantic segmentation network. In: 30th British Machine Vision Conference, BMVC. BMVA Press, pp. 289–296.
Romera, E., Álvarez, J.M., Bergasa, L.M., Arroyo, R., 2018. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19 (1), 263–272. http://dx.doi.org/10.1109/TITS.2017.2750080.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, Cham, pp. 234–241.
Rosas-Arias, L., Benitez-Garcia, G., Portillo-Portillo, J., Olivares-Mercado, J., Sanchez-Perez, G., Yanai, K., 2021. FASSD-Net: Fast and accurate real-time semantic segmentation for embedded systems. IEEE Trans. Intell. Transp. Syst. 1–12. http://dx.doi.org/10.1109/TITS.2021.3127553.
Saha, M., Chakraborty, C., 2018. Her2Net: A deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation. IEEE Trans. Image Process. 27 (5), 2189–2200. http://dx.doi.org/10.1109/TIP.2018.2795742.
Shelhamer, E., Long, J., Darrell, T., 2017. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651. http://dx.doi.org/10.1109/TPAMI.2016.2572683.
Strudel, R., Garcia, R., Laptev, I., Schmid, C., 2021. Segmenter: Transformer for semantic segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, pp. 7242–7252. http://dx.doi.org/10.1109/ICCV48922.2021.00717.
Sun, K., Xiao, B., Liu, D., Wang, J., 2021. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43 (10), 3349–3364. http://dx.doi.org/10.1109/TPAMI.2020.2983686.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–9. http://dx.doi.org/10.1109/CVPR.2015.7298594.
Wang, J., Gou, C., Wu, Q., Feng, H., Han, J., Ding, E., Wang, J., 2022. RTFormer: Efficient design for real-time semantic segmentation with transformer. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (Eds.), Advances in Neural Information Processing Systems.
Wang, Y., Zhou, Q., Liu, J., Xiong, J., Gao, G., Wu, X., Latecki, L.J., 2019. LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 1860–1864. http://dx.doi.org/10.1109/ICIP.2019.8803154.
Wu, T., Tang, S., Zhang, R., Cao, J., Zhang, Y., 2021. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 30, 1169–1179. http://dx.doi.org/10.1109/TIP.2020.3042065.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P., 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems.
Yang, M.Y., Kumaar, S., Lyu, Y., Nex, F., 2021. Real-time semantic segmentation with context aggregation network. ISPRS J. Photogramm. Remote Sens. 178, 124–134. http://dx.doi.org/10.1016/j.isprsjprs.2021.06.006.
Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K., 2018. DenseASPP for semantic segmentation in street scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3684–3692. http://dx.doi.org/10.1109/CVPR.2018.00388.
Yu, F., Chen, H., Wang, X., Chen, Y., Liu, F., Madhavan, V., Darrell, T., 2020. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2636–2645.
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N., 2021. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 129 (11), 3051–3068. http://dx.doi.org/10.1007/s11263-021-01515-2.
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N., 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing.
Zhang, X., Du, B., Wu, Z., Wan, T., 2022b. LAANet: Lightweight attention-guided asymmetric network for real-time semantic segmentation. Neural Comput. Appl. 34 (5), 3573–3587. http://dx.doi.org/10.1007/s00521-022-06932-z.
Zhang, J., Yang, K., Constantinescu, A., Peng, K., Müller, K., Stiefelhagen, R., 2022a. Trans4Trans: Efficient transformer for transparent object and semantic scene segmentation in real-world navigation assistance. IEEE Trans. Intell. Transp. Syst. 23 (10), 19173–19186. http://dx.doi.org/10.1109/TITS.2022.3161141.
Zhang, X., Zhou, X., Lin, M., Sun, J., 2017. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J., 2018. ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp. 418–434.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6230–6239. http://dx.doi.org/10.1109/CVPR.2017.660.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.S., Zhang, L., 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, pp. 6877–6886. http://dx.doi.org/10.1109/CVPR46437.2021.00681.