Keywords:
Autonomous driving
Convolutional neural networks
Road scenes
Semantic segmentation
Real-time

Abstract
Deep-learning-based semantic segmentation networks typically incorporate object classification networks in their backbone. This leads to a loss of context because classification networks have a smaller field of view. The architecture has been extended to recover context with additional downsampling feature maps, a parallel context branch, or pyramid pooling modules after the backbone. However, these extensions increase multiply–accumulate operations and memory requirements, thus making them unsuitable for resource-constrained devices. To overcome this limitation, a novel convolutional building block with attention-based context guidance is proposed. The block is repeated to build an efficient encoder–decoder network. Our network runs in real time, has a lightweight design with only 0.72 million parameters, and achieves 70.1% and 66.3% mean intersection-over-union scores on the highly competitive Cityscapes and CamVid datasets, respectively. An efficient decoder is also designed to replace other semantic segmentation network decoders with minimal performance loss. Performance measures on mobile platforms show that our network suits resource-constrained devices. Further, experimental results show that the proposed method can optimally balance model size, inference speed, and segmentation accuracy.
1. Introduction

Semantic segmentation involves labelling each pixel in an image as belonging to a particular object class. This task serves as the basis for many computer vision applications, including autonomous driving, bio-medical image processing, robot navigation, and satellite imagery (Ronneberger et al., 2015; Saha and Chakraborty, 2018; Jung et al., 2022; Romera et al., 2018). Semantic segmentation methods can be broadly divided into classical and deep learning (DL) based. Classical approaches use thresholding, clustering, texture analyzers, boundary detectors and probabilistic graphical models. All these methods require the selection and handcrafting of features, which becomes more cumbersome as the number of segmentation classes increases. By contrast, deep learning enables convolutional neural networks (CNNs) and transformers to learn features in an image automatically. CNNs have been used effectively for computer vision tasks such as object detection, classification, and segmentation, but they are computationally intensive for pixel labelling. The recent advancement of intelligent vehicles and robotics requires deep learning models that can work in resource-constrained environments with limited computing power. Thus, large-scale models are challenging to implement on such devices, including embedded platforms. To solve this problem, researchers have proposed several lightweight networks (Paszke et al., 2018; Lo et al., 2019; Mehta et al., 2018; Wang et al., 2019). However, they have decreased accuracy due to limited feature representation capability. Thus, the accuracy, speed, and model size must be balanced.

Several novel networks have been designed for road-scene understanding tasks since the fully convolutional network (FCN) (Shelhamer et al., 2017) was first developed as an end-to-end trainable CNN-based semantic segmentation network. These methods were primarily based on backbone networks designed for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Deng et al., 2009), commonly known as ImageNet (e.g., VGG Net Liu and Deng, 2015; GoogLeNet Szegedy et al., 2015; ResNet He et al., 2016). These architectures mainly adopted a backbone using dilated convolutions because road-scene understanding requires a large field of view. However, their deep structure and dense convolutions were unsuitable for real-time deployment on resource-constrained devices.

ENet (Paszke et al., 2018) achieved real-time performance on semantic segmentation by factorizing the convolution kernels at the expense of only 0.37M parameters. Since then, ENet has become a de facto standard at the lower model size limit.
Fig. 1. Block diagram of the proposed network. The colour of the blocks represents their type, given in the top row. The encoder has three levels, represented as L1, L2 and L3. Each level operates at half the resolution of the previous level. The dimension of the feature maps is given at the bottom of the block as 'Height × Width × Channel'. n_Class is the number of classes in the final segmented image. The network consists of the proposed BA module, downsampler and lightweight decoder, the details of which are discussed in subsequent sections and figures.
While ESPNet (Mehta et al., 2018) used a pyramid of factorized convolutions to improve on ENet without increasing the model size, ERFNet (Romera et al., 2018) heavily relied on asymmetric convolutions and dilations, which increased the model size. ContextNet (Poudel et al., 2018) runs an additional context branch at full resolution in parallel to a backbone structure. DABNet (Li et al., 2019a) modified the basic block of ERFNet by dividing the asymmetric convolutions into two parallel branches. Despite these design choices, the networks could not improve performance while keeping the number of network parameters in check. Depth-wise separable convolution drastically reduces the parameter count at the cost of accuracy. Asymmetric convolution reduces the filter-kernel size with a minimal reduction in accuracy. However, large dilations in the filter kernels can significantly reduce the inference speed. Thus, different convolutions must be combined to balance accuracy with speed.

To this end, a new design for a lightweight encoder–decoder network is proposed. The network performs on par with large real-time semantic segmentation networks while keeping the number of parameters small. The encoder is built on a novel attention-guided asymmetric basic block. It has a two-branch structure to simultaneously learn the local and global features essential for the semantic segmentation of images. It utilizes group convolution in the main branch to reduce the parameters without compromising representation capability too much. The first parallel branch factorizes the 3 × 3 convolutions with an asymmetric 3 × 1 kernel followed by a 1 × 3 kernel. We use dilation in these asymmetric convolutions to expand the field of view. The second parallel branch uses dilated kernels, which help learn the local features. Inspired by ResNet (He et al., 2016), a skip connection is added to the basic module. It improves gradient flow during backpropagation. To this skip connection, an attention module is attached, leading to the weighted addition of the feature maps from the previous block to the present one. It will be shown that adding this attention-skip connection reduces the dilation rates to only 2. This leads to a significant speed-up of the network.

The main contributions of this paper are summarized below:

3. A novel, lightweight decoder is designed using efficient 3 × 3 convolutions to include in our network. With only simple skip connections from the encoder stages, it can recover some spatial details lost during the downsampling process.
4. Without any pre-training, post-refinement, or pyramid pooling modules (PPMs), our lightweight architecture achieves competitive results on the Cityscapes and CamVid datasets. Specifically, with only 0.72M parameters, BANet achieves a mIoU of 70.1% and 66.3% on the Cityscapes and CamVid benchmarks, respectively.

2. Related works

Increasing demand for the real-world deployment of semantic segmentation has led to a shift in designs from high-accuracy large models to moderate-accuracy but much smaller models. Starting with ENet, lightweight designs have attracted the research community's attention. Substantial work has been performed to explore the potential of small and lightweight architectures. These architectures can be broadly divided into three categories.

2.1. Context based architectures

Methods developed in recent years rely heavily on dilated backbones developed for dense prediction tasks. As contextual information plays a crucial role in the accurate semantic labelling of pixels, context-based models (Zhao et al., 2017; Chen et al., 2018; Yang et al., 2018; Nirkin et al., 2021) use dilated convolutions in the backbone network or attach a pyramid pooling of dilated convolutions as an end-block in the encoder. This helps to capture context information at different scales. However, convolution kernels at higher dilation rates are slow, as significant time is required for memory restructuring (Arani et al., 2021). Thus, they are sub-optimal for fast networks, which must perform in real-time.
The feature maps from the final output of the encoder have the smallest resolution and the largest depth. However, the decoder is responsible for restoring the reduced resolution and enhancing boundary information. It recovers the spatial information of the encoded and downsampled outputs.

Consequently, transformer-based networks require many floating-point operations (FLOPs) and thus are computationally intensive. This hinders their deployment on resource-constrained embedded devices.
3. Preliminary knowledge
This section introduces the basis of our design: the lightweight kernel decomposition of the standard convolution (Conv2d) and the attention module.

Lightweight Convolutions- Asymmetric, depthwise separable, group, and pointwise convolutions reduce a deep network's size and computational costs. For group convolution, the k × k filter kernel is divided into groups along the channel axis. This reduces the kernel weights and multiply–accumulate operations by a factor of the number of groups, a hyperparameter. Fig. 2 shows an example of group convolution for an input feature map of size Height × Width × (Channels = 6); a group value of 3 gives a three-fold reduction in kernel weights.

Depthwise separable convolution decomposes a Conv2d kernel into a group convolution kernel followed by a pointwise convolution. The number of groups, g, is equal to the number of input channels, so each channel of the H × W feature map is convolved with a single filter kernel of size k × k. For pointwise convolution, a 1 × 1 filter kernel is used instead of the standard k × k kernel. Table 1 presents the total parameter count of these factorized kernels. It shows that pointwise convolution is the lightest for a given input and output channel value. However, due to its sparsity, the hardware kernel calls for pointwise convolution are not as optimized as those for standard convolution.

Table 1
Parameter count of the standard and factorized convolution kernels.
Convolution   Kernel   Parameters
Standard      k × k    k² · Cin · Cout
Groupwise     k × k    k² · Cin · Cout / g
Pointwise     1 × 1    Cin · Cout
Asymmetric    k × 1    k · Cin · Cout
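As a quick illustration of Table 1, the following PyTorch sketch (our own, not part of the paper) instantiates each kernel type for hypothetical channel widths Cin = Cout = 64 and g = 4 and prints its parameter count.

```python
# Minimal sketch comparing parameter counts of the convolution types in Table 1,
# using hypothetical channel sizes C_in = C_out = 64 and g = 4.
import torch.nn as nn

k, c_in, c_out, g = 3, 64, 64, 4

convs = {
    "standard  (k x k)":  nn.Conv2d(c_in, c_out, k, padding=1, bias=False),
    "groupwise (k x k)":  nn.Conv2d(c_in, c_out, k, padding=1, groups=g, bias=False),
    "depthwise (k x k)":  nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),
    "pointwise (1 x 1)":  nn.Conv2d(c_in, c_out, 1, bias=False),
    "asymmetric (k x 1)": nn.Conv2d(c_in, c_out, (k, 1), padding=(1, 0), bias=False),
}

for name, conv in convs.items():
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"{name:>20}: {n_params:>6} parameters")
# standard: k^2*C_in*C_out = 36864, groupwise: 36864/g = 9216,
# depthwise: k^2*C_in = 576, pointwise: C_in*C_out = 4096, asymmetric: k*C_in*C_out = 12288
```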
Attention Module- By assigning varying weights to image pixels, the attention mechanism can selectively emphasize meaningful features while ignoring inconsequential information. Some recent publications (Yu et al., 2018, 2021) have focused on applying attention mechanisms to semantic segmentation to boost the network's feature learning ability and object segmentation accuracy. An improved attention refinement module (ARM), based on Yu et al. (2018), is proposed as shown in Fig. 3.

The ARM is improved for our application by replacing the 1 × 1 pointwise convolution with a 3 × 3 convolution. First, we require a larger receptive field because our attention module operates in the initial as well as the deeper layers. Second, it must learn contextual features for better semantics because it is used in every block. Our module also differs from the ARM in how it combines the weighted features. Directly multiplying the input features with the attention weights leads to the loss of certain features due to zero weights. This problem is solved with an extra connection that adds the input features to the weighted features.
Table 2
Detailed structure of the proposed Block Attention Network.
Block    Layer  Module   (S,D)a  Ch.  Output size
Encoder  1      Down     (2,1)   32   512 × 256
         2      BA       (1,2)   32   512 × 256
         3      BA       (1,2)   32   512 × 256
         4      Down1D   (2,1)   64   256 × 128
         5      BA       (1,2)   64   256 × 128
         6      BA       (1,2)   64   256 × 128
         7      BA       (1,2)   64   256 × 128
         8      Down1D   (2,1)   128  128 × 64
         9      BA       (1,2)   128  128 × 64
         10     BA       (1,2)   128  128 × 64
         11     BA       (1,2)   128  128 × 64
         12     BA       (1,2)   128  128 × 64
Decoder  13     Bi-Intb  ×2      128  256 × 128
         14     Conv3x3  (1,1)   64   256 × 128
         15     Bi-Int   ×2      64   512 × 256
         16     Conv3x3  (1,1)   20   512 × 256
         17     Bi-Int   ×2      20   1024 × 512
a S = Stride, D = Dilation; 'Ch.' denotes channels.
b Bi-Int = Bilinear Interpolation; '×2' denotes the interpolation factor.
The weighted features of the original ARM (Yu et al., 2018) are computed as X_w = X_in ⊗ σ(f_BN(W_0 ∗ f_avg(X_in))), where X_in is the set of feature maps at the input of the ARM, f_avg is the global average pooling function, W_0 is the 1 × 1 convolution kernel weight, f_BN is batch normalization, ⊗ denotes element-wise multiplication and σ is the sigmoid function. In contrast, our improved ARM computes the attention weights with a 3 × 3 kernel in place of W_0 and adds X_in back to the weighted features, so that channels suppressed by near-zero attention weights are not lost.
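A minimal PyTorch sketch of the improved ARM described above is given below. It assumes the global average pooling of the original ARM is retained; the class name, channel handling, and layer ordering are our interpretation of Fig. 3, not the authors' released code.

```python
import torch
import torch.nn as nn

class ImprovedARM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # f_avg: global average pooling
        self.conv = nn.Conv2d(channels, channels, 3,   # 3x3 kernel in place of the 1x1 of Yu et al. (2018)
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)             # f_BN
        self.sigmoid = nn.Sigmoid()                    # sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.sigmoid(self.bn(self.conv(self.pool(x))))   # per-channel attention weights
        return x * w + x        # weighted features plus the extra input connection

if __name__ == "__main__":
    x = torch.randn(2, 64, 64, 128)
    print(ImprovedARM(64)(x).shape)                    # torch.Size([2, 64, 64, 128])
```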
4. Proposed method
4.1. Block-attention module

The basic block of our proposed network, referred to as the block-attention module (BA), is shown in Fig. 4. The efficient bottleneck design of He et al. (2016) is used, as it reduces the number of channels before performing convolution and non-linear operations, thus reducing the computations.

The BA module jointly learns the local and global contextual features, quintessential for accurate semantic pixel labelling (Li et al., 2019a; Wu et al., 2021; Wang et al., 2019; Romera et al., 2018), using a two-branch structure. Local features are extracted in the first branch using a 3 × 3 depth-wise separable convolution without dilation. This is followed by a point-wise fusion operation performed by a 1 × 1 convolution. These design choices drastically reduce the parameter count, reducing the computation and memory requirements. The module uses the channel shuffle operation (Zhang et al., 2017) to improve the accuracy. The second branch is designed to extract global contextual features. Asymmetric convolution is used to minimize the parameter count: an n × n 2D convolution kernel is approximated by n × 1 and 1 × n kernels. For a 3 × 3 convolution, this essentially means a 33% reduction in the number of parameters, from 9 to 6 per channel. Dilation is used in the asymmetric kernels to expand the receptive field and gain global context. Batch normalization and ReLU non-linearity are used after every convolution layer. The feature learning is further enhanced by using a 3 × 3 convolution in the input branch followed by 1 × 1 convolutions.

A skip connection is used to prevent the problem of vanishing gradients (He et al., 2016) during back-propagation in the BA module. However, unlike other networks, an attention-guided skip connection is introduced. This design technique reduces the dilation rates to only 2, thus improving network speed.
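The sketch below assembles one possible realization of the BA module in PyTorch. The bottleneck width, the fusion of the two branches by addition, the position of the channel shuffle, and the form of the attention gate on the skip connection are assumptions made for illustration; Fig. 4 defines the authoritative structure.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # ShuffleNet-style channel shuffle (Zhang et al., 2017).
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

def conv_bn_relu(in_ch, out_ch, kernel, dilation=1, groups=1):
    # Convolution followed by batch normalization and ReLU, with 'same' padding.
    ks = kernel if isinstance(kernel, tuple) else (kernel, kernel)
    dil = dilation if isinstance(dilation, tuple) else (dilation, dilation)
    pad = tuple(d * (k // 2) for k, d in zip(ks, dil))
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, ks, padding=pad, dilation=dil, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class BAModule(nn.Module):
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        mid = channels // 2                                   # bottleneck reduction
        self.reduce = conv_bn_relu(channels, mid, 3)          # 3x3 conv in the input branch
        # branch 1: 3x3 depth-wise separable conv (no dilation) + 1x1 point-wise fusion
        self.local = nn.Sequential(conv_bn_relu(mid, mid, 3, groups=mid),
                                   conv_bn_relu(mid, mid, 1))
        # branch 2: dilated asymmetric 3x1 followed by 1x3 convolutions (global context)
        self.glob = nn.Sequential(conv_bn_relu(mid, mid, (3, 1), dilation=(dilation, 1)),
                                  conv_bn_relu(mid, mid, (1, 3), dilation=(1, dilation)))
        self.expand = conv_bn_relu(mid, channels, 1)          # back to the block width
        # attention gate on the skip connection (improved-ARM style)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels),
                                  nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        y = self.expand(self.local(y) + self.glob(y))         # fuse the two branches
        y = channel_shuffle(y, groups=4)
        return y + x * self.gate(x) + x                       # attention-weighted skip + identity

if __name__ == "__main__":
    x = torch.randn(2, 64, 128, 256)
    print(BAModule(64)(x).shape)                              # torch.Size([2, 64, 128, 256])
```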
4.2. Downsampler

To reduce the spatial resolution of the feature maps and learn scale-invariant features, it is common practice to downsample at each stage in the network. Techniques include max-pooling (Giusti et al., 2013), average pooling (Lin et al., 2013), and strided convolution. ENet showed that using a pooling operation and a strided convolution in parallel reduces the loss of finer details while downsampling. A similar downsampling module, shown in Fig. 5, is designed with 1D asymmetric convolutions to reduce the weights further. This makes our downsampling module extremely lightweight and hence suitable for low-memory devices. A convolution block is added after the concatenation to fuse the features efficiently.
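A corresponding sketch of the 1D-asymmetric downsampler of Fig. 5 is shown below; the split of output channels between the pooling and convolution paths follows the ENet convention and is an assumption on our part.

```python
import torch
import torch.nn as nn

class Downsampler1D(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        conv_ch = out_ch - in_ch                     # pooled channels fill the remainder
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv = nn.Sequential(                   # strided 3x1 then 1x3 asymmetric kernels
            nn.Conv2d(in_ch, conv_ch, (3, 1), stride=(2, 1), padding=(1, 0), bias=False),
            nn.Conv2d(conv_ch, conv_ch, (1, 3), stride=(1, 2), padding=(0, 1), bias=False))
        self.fuse = nn.Sequential(                   # fuses the concatenated features
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.pool(x), self.conv(x)], dim=1))

if __name__ == "__main__":
    x = torch.randn(2, 32, 256, 128)
    print(Downsampler1D(32, 64)(x).shape)            # torch.Size([2, 64, 128, 64])
```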
5. Experiments
Fig. 7. Different variants of BA Module. (a) is a lightweight design having a single 1 × 1 convolution in the main branch. (b) and (c) vary in the position of the channel shuffle
operation.
Table 5
Accuracy comparison for different layer combinations in BANet.
#Encoder layers  #Channelsb   Timea (ms)  #Parameters (M)  mIoU (%)
(–,4,8)          (–,64,128)   17.5        1.20             68.8
(2,4,8)          (32,64,128)  18.3        1.30             69.1
(2,3,6)          (32,64,128)  13.9        0.93             68.2
(2,3,4)          (32,64,128)  12.5        0.72             69.8
(2,2,4)          (32,64,128)  11.3        0.63             68.5
a Denotes the forward-pass time for a single input image of resolution 1024 × 512, in milliseconds.
b Denotes the channels in (Stage1, Stage2, Stage3).

Table 6
Decoder ablation studies. Results of various feature fusion strategies in the decoder.
Decoder  mIoU (%)

Table 7
Decoder ablation studies. Results of our encoder combined with different decoders.
Decoder                                #Parameters (M)  mIoU (%)
ERFNet (Romera et al., 2018)           0.15             64.3
ESPNet (Mehta et al., 2018)            0.04             63.5
LEDNet (Wang et al., 2019)             1.30             65.2
RGPNet (Arani et al., 2021)            2.00             67.5
FASSD-Net (Rosas-Arias et al., 2021)   0.95             66.4
Proposed                               0.09             69.8

Table 8
Accuracy and inference time comparison for different dilation rates and residual connections.
Dilation rates  Residual connection
2*(1) indicates 2 blocks with dilation '1' in a stage.

The channel shuffle layer is put after adding the residual connection; this block design performs the best.
Backbone: As mentioned earlier, the backbone is built using BA
and downsampling modules. In Table 5, it is seen that having (2,
3, 4) BA modules in corresponding stages gives the best accuracy. A
higher number of modules in the initial layer increased inference time
because of the large feature map. Increasing the network depth had an
inhibitory effect on the network performance. It not only increases the
number of parameters but also increases the latency. Given that our
basic module could learn the required features with a shallow setting,
greater depth showed no improvement.
The Decoder: In this ablation study, the BA module-based backbone is fixed and two decoder variants are designed. The first employed bilinear upsampling with 3 × 3 convolution layers and the second employed 3 × 3 deconvolution layers directly. As seen in Table 6, the first combination has the highest speed and accuracy. For the primary input to the decoder, the final encoder stage feature maps are taken, which are 1/8 of the resolution of the input image. For the skip architecture, two auxiliary outputs from the backbone are taken, one after the first stage (1/2 resolution) and another after the second stage (1/4 resolution). This design strategy is adapted from Shelhamer et al. (2017). We have experimented with normalized addition, multiplication, and concatenation of the auxiliary skip connections with the primary input. In the case of addition or multiplication, the first-stage output is applied with [1 × 1, 64] convolutions, where 1 × 1 is the kernel size and 64 is the number of channels, while for concatenation a [1 × 1, 32] convolution is used. Similarly, the second-stage output is applied with [1 × 1, 128] and [1 × 1, 64] convolutions. The channel widths are reduced by half for the concatenation operation while keeping other settings unchanged.

To show the efficiency of our decoder design, we trained the BANet backbone with some of the popularly used decoders from Romera et al. (2018), Mehta et al. (2018), Wang et al. (2019), Arani et al. (2021) and Rosas-Arias et al. (2021). The results in Table 7 show that the proposed decoder design performs the best. Our network is shallower than other networks, thus requiring a custom-designed decoder.
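The decoder and skip-fusion strategy described above can be sketched as follows; the channel widths follow Table 2, while the exact form of the normalized addition (here a simple averaging of the two terms) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BADecoder(nn.Module):
    def __init__(self, n_class: int = 20):
        super().__init__()
        self.proj_s2 = nn.Conv2d(64, 128, 1, bias=False)    # stage-2 skip (1/4 resolution)
        self.proj_s1 = nn.Conv2d(32, 64, 1, bias=False)     # stage-1 skip (1/2 resolution)
        self.conv1 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.conv2 = nn.Conv2d(64, n_class, 3, padding=1)

    def forward(self, x_enc, skip_s2, skip_s1):
        # x_enc: 128 ch at 1/8 res; skip_s2: 64 ch at 1/4 res; skip_s1: 32 ch at 1/2 res
        x = F.interpolate(x_enc, scale_factor=2, mode="bilinear", align_corners=False)
        x = 0.5 * (x + self.proj_s2(skip_s2))               # normalized addition at 1/4 res
        x = self.conv1(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = 0.5 * (x + self.proj_s1(skip_s1))               # normalized addition at 1/2 res
        x = self.conv2(x)
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = BADecoder()(torch.randn(2, 128, 64, 128),
                      torch.randn(2, 64, 128, 256),
                      torch.randn(2, 32, 256, 512))
    print(out.shape)                                        # torch.Size([2, 20, 512, 1024])
```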
Dilation rates: To improve contextual information, semantic segmentation networks use dilation of the convolution kernels in the backbone (Romera et al., 2018; Paszke et al., 2018; Wu et al., 2021). This increases the pixel's field of view at the kernel's centre. Increasing the dilation rate with the depth of the network is common (Yang et al., 2021; Hong et al., 2021; Wang et al., 2019). Table 8 shows the ablation study for different dilation rates in the backbone. Dilations are used only in the asymmetric convolution branch of the BA module. It is observed that, in the presence of an attention-based skip connection in the basic block, a dilation rate of two throughout gives the best accuracy. Increasing the dilation rate further has a detrimental effect. However, if we replace the attention skip connection with the standard skip connection, it is seen that similar accuracy is attained at higher dilation rates. This further reinforces our claim of the efficacy of the BA module.

Fig. 8. Speed, accuracy, and model size comparison on the Cityscapes test set. The size of the bubble represents the number of parameters written on top of each. The proposed method is highlighted in red. It balances the accuracy-speed tradeoff while having a small model size. Methods with high accuracy have large sizes and are significantly slower.

5.4. Comparisons on the Cityscapes benchmark

This section compares the overall performance of the proposed architecture with state-of-the-art semantic segmentation networks on the Cityscapes dataset. The best-performing deep network model, in terms of accuracy, is ViT-Adapter-L (Chen et al., 2022), which is based on the resource-hungry vision transformer (Dosovitskiy et al., 2021). Table 9 shows that state-of-the-art models, which require many parameters and FLOPs (Zhao et al., 2017; Yang et al., 2018; Chen et al., 2022; Xie et al., 2021; Zhang et al., 2022a), report the top mIoU scores. Such models are not readily deployable on resource-constrained embedded devices (see Fig. 8).
Table 9
Comparison of evaluation results of BANet with similar networks on Cityscapes test benchmark. The first section is for high-accuracy methods,
which are usually very large. The second section is for lightweight methods.
Method FLOPs (G) InputSize mIoU (%) FPSa Parameters (M)
Dense ASPP (Yang et al., 2018) 214.7 2048 × 1024 80.6 <1 28.6
SegFormer-B5 (Xie et al., 2021) 1447.60 1024 × 1024 84.0 2.5 84.7
Trans4trans (Zhang et al., 2022a) 94.25 1536 × 768 81.5 27.3 49.5
PSP Net (Zhao et al., 2017) 453.6 2048 × 1024 78.4 <1 65.6
ViT-Adapter-L (Chen et al., 2022) – 896 × 896 84.9 <1 347.9
ENet (Paszke et al., 2018) 3.8 640 × 360 58.3 135.4 0.4
ESPNet (Mehta et al., 2018) 4.0 512 × 1024 60.3 112 0.4
EDANet (Lo et al., 2019) 11.34 512 × 1024 67.3 81.3 0.7
ERFNet (Romera et al., 2018) 21.0 512 × 1024 68.0 41 2.1
LBN-AA (Dong et al., 2021) 49.5 448 × 896 73.6 51 6.2
CFPNet (Lou and Loew, 2021) 14.7 1024 × 2048 70.1 30.0 0.55
ICNet (Zhao et al., 2018) 28.3 1024 × 2048 69.5 30.3 7.8
Hyperseg-M (Nirkin et al., 2021) 7.5 512 × 1024 75.8 36.9 10.1
FDDWNet (Liu et al., 2020) 10.3 512 × 1024 71.5 60 0.8
BiSeNet (Yu et al., 2018) 14.8 768 × 1536 68.4 72.3 5.8
MLFNet-MobileV2 (Fan et al., 2023) 4.67 512 × 1024 71.5 90.8 4.0
ContextNet (Poudel et al., 2018) 6.74 1024 × 2048 66.1 65.5 0.85
FASSDNet (Rosas-Arias et al., 2021) 45.1 1024 × 2048 76.0 41.1 2.85
LEDNet (Wang et al., 2019) 11.44 512 × 1024 70.6 40.0 0.94
CABiNet (Yang et al., 2021) 12 1024 × 2048 75.9 76.5 2.64
CGNet (Wu et al., 2021) 7.14 512 × 1024 64.8 50 0.5
BANet 6.96 512 × 1024 70.1 83.2 0.72
‘‘–’’ indicates that the method does not report the result.
a Reported on different GPUs.
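For reference, the mIoU values reported in Tables 9 and 10 follow the standard definition computed from a per-class confusion matrix; the short sketch below shows this computation with an illustrative 19-class matrix.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    # conf[i, j] = number of pixels of ground-truth class i predicted as class j
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)   # skip absent classes
    return float(np.nanmean(iou))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.integers(0, 1000, size=(19, 19))    # e.g., 19 Cityscapes evaluation classes
    print(f"mIoU = {100 * mean_iou(conf):.1f}%")
```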
To further demonstrate the generalization ability of the proposed model, it is compared with some state-of-the-art networks trained on the CamVid dataset in Table 10. Our network has decent accuracy with a small memory footprint while running in real-time. Specifically, the accuracy of the proposed method is comparable to BiSeNet but with 8× fewer parameters. However, due to the greater usage of 1 × 1 convolution kernels, our network is slower than BiSeNet. EDANet (Lo et al., 2019) performs close to our method but at a significantly lower speed. FASSDNet outperforms in terms of speed and accuracy but with 4× more parameters. LAANet (Zhang et al., 2022b) was among the fastest but worked at a quarter of the resolution of the proposed network. These results show that our method can perform on par with state-of-the-art methods while being lightweight.

5.6. Qualitative results

The segmented images from the Cityscapes validation set and CamVid test set are shown in Figs. 10 and 11, respectively. Fig. 10 shows that the lowest dilation rates in the backbone achieve the best qualitative results. The first row in the figure shows only the encoder output. Without a decoder, the model performs poorly, as seen in these segmented images. The CamVid visualization results show that, while our method can accurately segment large objects like roads, cars, vegetation, buildings, and sky, it misses smaller objects like poles and traffic signs.

The segmented images from the unseen Berkeley Deep Drive (BDD) (Yu et al., 2020) dataset are also visualized in Fig. 12.
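The qualitative results are obtained by reducing the network's per-pixel logits to class labels and mapping them to a colour palette; a minimal sketch of this rendering step is given below, with a random palette standing in for the Cityscapes colours.

```python
import numpy as np
import torch

def colorize(logits: torch.Tensor, palette: np.ndarray) -> np.ndarray:
    # logits: (1, n_class, H, W) -> colour image (H, W, 3)
    labels = logits.argmax(dim=1).squeeze(0).cpu().numpy()
    return palette[labels]

if __name__ == "__main__":
    n_class, h, w = 20, 512, 1024
    palette = np.random.default_rng(0).integers(0, 256, size=(n_class, 3), dtype=np.uint8)
    logits = torch.randn(1, n_class, h, w)        # stand-in for BANet output
    print(colorize(logits, palette).shape)        # (512, 1024, 3)
```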
Fig. 10. Qualitative results on the Cityscapes validation set. First row: Input image; second row: ground truth; third row: BANet output without normalized addition and decoder; fourth row: BANet output with normalized addition and decoder. It can be seen that the method performs the best with the decoder.
Fig. 11. Qualitative results: Validation images on the CamVid dataset. First row: Input image; second row: ground truth; third row: BANet output.
Fig. 12. Qualitative results on the unseen dataset: Validation results on Berkeley Deep Drive (BDD) dataset. First column: Input image; second column: ground truth; third column:
BANet output.
Table 10
Comparison of evaluation results of BANet with similar networks on CamVid test benchmark.
Method InputSize mIoU (%) FPSa Parameters (M)
RTFormer-Slim (Wang et al., 2022) 960 × 720 81.4 190.7 4.8
SegNet (Badrinarayanan et al., 2017) 480 × 360 46.4 49.4 15.2
EDANet (Lo et al., 2019) 480 × 360 66.4 40.75 0.7
DFANet (Li et al., 2019b) 960 × 720 59.3 116 7.8
SwiftNet-MN (Oršić and Šegvić, 2021) 960 × 720 65.0 27.7 11.8
ICNet (Zhao et al., 2018) 960 × 720 67.1 46.7 7.8
BiSeNet (Yu et al., 2018) 960 × 720 65.6 175 5.8
BiSeNetV2-L (Yu et al., 2021) 960 × 720 73.2 32.7 –
MLFNet-Res34 (Fan et al., 2023) 960 × 720 69.0 57.2 4.0
LAANet (Zhang et al., 2022b) 480 × 360 67.9 112.5 0.7
CGNet (Wu et al., 2021) 960 × 720 65.6 59.0 0.5
FASSDNet (Rosas-Arias et al., 2021) 960 × 720 69.3 80.0 2.85
BANet 960 × 720 66.3 68.1 0.72
‘‘–’’ indicates that the method does not report the result.
a As reported by the respective methods on different GPUs.
This dataset was chosen for its class compatibility with Cityscapes. It is observed that the method performs well for large objects, even on unseen road-scene datasets. In addition, to further verify the generalizability of the proposed network, we obtain segmentation results on images captured on the institute campus, shown in Fig. 13. The images are randomly selected from videos recorded through a car-dash-mounted camera at 1080p and 30 fps. The network used is pre-trained on the Cityscapes dataset. The visualization shows that the performance of the network, even on unseen images, is at par with that on the datasets used for training.

5.7. Evaluation on mobile GPU systems

Mobile GPUs are manufactured for resource-constrained devices like laptops and embedded systems. They are designed with fewer CUDA cores and a lower clock frequency. As a result, they have low power consumption, on the order of 20 W. The proposed method is evaluated on two mobile GPUs, the NVIDIA Jetson Xavier and the NVIDIA 940MX. Table 11 presents their technical specifications. Both GPUs are comparable, apart from their speed and memory bandwidth. In Table 12, we present the performance analysis of the proposed model for two different input resolutions. Our model runs faster than SwiftNet, LEDNet, BiSeNet, and CaBiNet on both devices for a 1024 × 512 input image. This can be attributed to the lightweight design. However, ENet is faster than the other methods because of its shallow and extremely small network structure, but it has low accuracy. Despite having a very small size, LEDNet shows an abysmal speed. This could be due to its extremely complex decoder design resulting in many FLOPs (ref. Table 9).

Table 11
Specifications of the mobile-GPU cards used for inference.
GPU card  Cores  Speed    RAM   Mem.B/W    Power
Xavier    384    854 MHz  8 GB  51.2 GBps  15 W
940MX     384    1.2 GHz  4 GB  40.1 GBps  20 W
'RAM': Random Access Memory; 'Mem.B/W': Memory Bandwidth.
Fig. 13. Qualitative results on unseen images: Results on our custom captured images in the institute campus. First/third column: Input images; second/fourth column: BANet
outputs.
Table 12
Inference speed (fps) of the proposed method for different input image resolutions on mobile GPU-based systems.
Method                        Jetson Xavier              940MX                      mIoU (%)  Params (M)
                              2048 × 1024  1024 × 512    2048 × 1024  1024 × 512
SwiftNet (Oršić and Šegvić, 2021) 2.61 9.9 2.4 8.64 70.2 11.8
ENet (Paszke et al., 2018) 13.8 53.82 13.4 45.56 58.3 0.4
ContextNet (Poudel et al., 2018) 10.49 40.91 10.2 35.7 66.1 0.85
Fast-SCNN (Poudel et al., 2019) 11.49 40.21 11.34 37.42 68.4 1.14
LEDNet (Wang et al., 2019) 0.7 2.73 0.5 1.75 70.6 0.94
BiSeNet (Yu et al., 2018) 2.42 9.6 2.1 7.35 68.4 5.8
CaBiNet (Yang et al., 2021) 8.21 30.95 7.9 27.65 76.5 2.64
FASSDNet (Rosas-Arias et al., 2021) 7.3 29.2 7.1 26.27 76.0 2.85
MLFNet (Fan et al., 2023) 8.41 32.79 8.15 30.15 71.5 4.0
BANet 9.02 34.23 8.2 30.12 70.1 0.72
Fast-SCNN is faster than the proposed model but with a lower accuracy score.
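The inference speeds in Table 12 depend on how timing is measured; the sketch below shows a typical way to estimate fps for a given input resolution, with the warm-up and iteration counts chosen by us rather than taken from the paper's protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution=(512, 1024), iters=100, warmup=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, *resolution, device=device)
    for _ in range(warmup):                       # warm-up to stabilize clocks / cudnn
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

if __name__ == "__main__":
    dummy = torch.nn.Conv2d(3, 20, 3, padding=1)  # stand-in for the segmentation model under test
    print(f"{measure_fps(dummy):.1f} fps")
```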
6. Conclusion

This paper proposed a lightweight real-time semantic segmentation model for road-scene understanding, targeting resource-constrained devices. We have developed a novel block-attention module that uses simple attention-based skip connections for enhanced feature learning capabilities. The network based on the block-attention module has reduced latency due to lower dilation rates in the encoder backbone. We have also introduced a lightweight and efficient decoder design in our network, which performs better than popular decoders. Through extensive experiments on the highly competitive Cityscapes and CamVid benchmark datasets, we have presented the efficacy of our design choices. Appropriately balancing size, accuracy, and speed, BANet can outperform some of the state-of-the-art lightweight models while achieving 70.1% and 66.3% mIoU on the Cityscapes and CamVid datasets in real time. The inference speed of 34.23 fps on mobile devices further shows that the proposed model is suitable for resource-constrained devices.

7. Future work

CRediT authorship contribution statement

Saquib Mazhar: Conceptualization, Methodology, Software, Writing – original draft. Nadeem Atif: Validation, Writing – review & editing. M.K. Bhuyan: Validation, Supervision. Shaik Rafi Ahamed: Validation, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgement

We acknowledge the Science and Engineering Research Board, Department of Science and Technology, Government of India, for the financial support for Project CRG/2022/003473.
References

Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495. http://dx.doi.org/10.1109/TPAMI.2016.2644615.
Brostow, G.J., Fauqueur, J., Cipolla, R., 2009. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 30 (2), 88–97. http://dx.doi.org/10.1016/j.patrec.2008.04.005, URL: https://www.sciencedirect.com/science/article/pii/S0167865508001220. Video-based Object and Event Analysis.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y., 2022. Vision transformer adapter for dense predictions. http://dx.doi.org/10.48550/ARXIV.2205.08534, URL: https://arxiv.org/abs/2205.08534.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848. http://dx.doi.org/10.1109/tpami.2017.2699184.
Cordts, M., Omran, M., Ramos, S., Scharwächter, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2015. The cityscapes dataset. In: CVPR Workshop on the Future of Datasets in Vision.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255.
Dong, G., Yan, Y., Shen, C., Wang, H., 2021. Real-time high-performance semantic image segmentation of urban street scenes. IEEE Trans. Intell. Transp. Syst. 22 (6), 3258–3274. http://dx.doi.org/10.1109/TITS.2020.2980426.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Fan, J., Wang, F., Chu, H., Hu, X., Cheng, Y., Gao, B., 2023. MLFNet: Multi-level fusion network for real-time semantic segmentation of autonomous driving. IEEE Trans. Intell. Veh. 8 (1), 756–767. http://dx.doi.org/10.1109/TIV.2022.3176860.
Giusti, A., Cireşan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J., 2013. Fast image scanning with deep max-pooling convolutional neural networks. In: 2013 IEEE International Conference on Image Processing. pp. 4034–4038. http://dx.doi.org/10.1109/ICIP.2013.6738831.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. http://dx.doi.org/10.1109/CVPR.2016.90.
Hong, Y., Pan, H., Sun, W., Jia, Y., 2021. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085.
Jung, H., Choi, H.-S., Kang, M., 2022. Boundary enhancement semantic segmentation for building extraction from remote sensed image. IEEE Trans. Geosci. Remote Sens. 60, 1–12. http://dx.doi.org/10.1109/TGRS.2021.3108781.
Kaur, A., Chauhan, A.P.S., Aggarwal, A.K., 2023. Prediction of enhancers in DNA sequence data using a hybrid CNN-DLSTM model. IEEE/ACM Trans. Comput. Biol. Bioinform. 20 (2), 1327–1336. http://dx.doi.org/10.1109/TCBB.2022.3167090.
Li, H., Xiong, P., Fan, H., Sun, J., 2019b. DFANet: Deep feature aggregation for real-time semantic segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9514–9523. http://dx.doi.org/10.1109/CVPR.2019.00975.
Li, G., Yun, I.Y., Kim, J., Kim, J., 2019a. DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. In: BMVC.
Lin, M., Chen, Q., Yan, S., 2013. Network in network. arXiv preprint. http://dx.doi.org/10.48550/ARXIV.1312.4400.
Liu, S., Deng, W., 2015. Very deep convolutional neural network based image classification using small training sample size. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR). pp. 730–734. http://dx.doi.org/10.1109/ACPR.2015.7486599.
Liu, J., Zhou, Q., Qiang, Y., Kang, B., Wu, X., Zheng, B., 2020. FDDWNet: A lightweight convolutional neural network for real-time semantic segmentation. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2373–2377. http://dx.doi.org/10.1109/ICASSP40776.2020.9053838.
Lo, S.Y., Hang, H.M., Chan, S.W., Lin, J.J., 2019. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In: Proceedings of the ACM Multimedia Asia. pp. 1–6.
Lou, A., Loew, M., 2021. CFPNET: Channel-wise feature pyramid for real-time semantic segmentation. In: 2021 IEEE International Conference on Image Processing (ICIP). pp. 1894–1898. http://dx.doi.org/10.1109/ICIP42928.2021.9506485.
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H., 2018. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV).
Nirkin, Y., Wolf, L., Hassner, T., 2021. HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4060–4069. http://dx.doi.org/10.1109/CVPR46437.2021.00405.
Oršić, M., Šegvić, S., 2021. Efficient semantic segmentation with pyramidal fusion. Pattern Recognit. 110, 107611. http://dx.doi.org/10.1016/j.patcog.2020.107611.
Paszke, A., Chaurasia, A., Kim, S., Culurciello, E., 2018. ENet: A deep neural network architecture for real-time semantic segmentation. In: 4th International Conference on Learning Representations, ICLR.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E.Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Annual Conference on Neural Information Processing Systems, NeurIPS. pp. 8024–8035.
Poudel, R.P.K., Bonde, U.D., Liwicki, S., Zach, C., 2018. ContextNet: Exploring context and detail for semantic segmentation in real-time. In: BMVC.
Poudel, R.P.K., Liwicki, S., Cipolla, R., 2019. Fast-SCNN: Fast semantic segmentation network. In: 30th British Machine Vision Conference, BMVC. BMVA Press, pp. 289–296.
Romera, E., Álvarez, J.M., Bergasa, L.M., Arroyo, R., 2018. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19 (1), 263–272. http://dx.doi.org/10.1109/TITS.2017.2750080.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, Cham, pp. 234–241.
Rosas-Arias, L., Benitez-Garcia, G., Portillo-Portillo, J., Olivares-Mercado, J., Sanchez-Perez, G., Yanai, K., 2021. FASSD-Net: Fast and accurate real-time semantic segmentation for embedded systems. IEEE Trans. Intell. Transp. Syst. 1–12. http://dx.doi.org/10.1109/TITS.2021.3127553.
Saha, M., Chakraborty, C., 2018. Her2Net: A deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation. IEEE Trans. Image Process. 27 (5), 2189–2200. http://dx.doi.org/10.1109/TIP.2018.2795742.
Shelhamer, E., Long, J., Darrell, T., 2017. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651. http://dx.doi.org/10.1109/TPAMI.2016.2572683.
Strudel, R., Garcia, R., Laptev, I., Schmid, C., 2021. Segmenter: Transformer for semantic segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, pp. 7242–7252. http://dx.doi.org/10.1109/ICCV48922.2021.00717.
Sun, K., Xiao, B., Liu, D., Wang, J., 2021. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43 (10), 3349–3364. http://dx.doi.org/10.1109/TPAMI.2020.2983686.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–9. http://dx.doi.org/10.1109/CVPR.2015.7298594.
Wang, J., Gou, C., Wu, Q., Feng, H., Han, J., Ding, E., Wang, J., 2022. RTFormer: Efficient design for real-time semantic segmentation with transformer. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (Eds.), Advances in Neural Information Processing Systems.
Wang, Y., Zhou, Q., Liu, J., Xiong, J., Gao, G., Wu, X., Latecki, L.J., 2019. LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 1860–1864. http://dx.doi.org/10.1109/ICIP.2019.8803154.
Wu, T., Tang, S., Zhang, R., Cao, J., Zhang, Y., 2021. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 30, 1169–1179. http://dx.doi.org/10.1109/TIP.2020.3042065.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P., 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems.
Yang, M.Y., Kumaar, S., Lyu, Y., Nex, F., 2021. Real-time semantic segmentation with context aggregation network. ISPRS J. Photogramm. Remote Sens. 178, 124–134. http://dx.doi.org/10.1016/j.isprsjprs.2021.06.006.
Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K., 2018. DenseASPP for semantic segmentation in street scenes. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3684–3692. http://dx.doi.org/10.1109/CVPR.2018.00388.
Yu, F., Chen, H., Wang, X., Chen, Y., Liu, F., Madhavan, V., Darrell, T., 2020. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2636–2645.
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N., 2021. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 129 (11), 3051–3068. http://dx.doi.org/10.1007/s11263-021-01515-2.
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N., 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing.
Zhang, X., Du, B., Wu, Z., Wan, T., 2022b. LAANet: Lightweight attention-guided asymmetric network for real-time semantic segmentation. Neural Comput. Appl. 34 (5), 3573–3587. http://dx.doi.org/10.1007/s00521-022-06932-z.
Zhang, J., Yang, K., Constantinescu, A., Peng, K., Müller, K., Stiefelhagen, R., 2022a. Trans4Trans: Efficient transformer for transparent object and semantic scene segmentation in real-world navigation assistance. IEEE Trans. Intell. Transp. Syst. 23 (10), 19173–19186. http://dx.doi.org/10.1109/TITS.2022.3161141.
Zhang, X., Zhou, X., Lin, M., Sun, J., 2017. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J., 2018. ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp. 418–434.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6230–6239. http://dx.doi.org/10.1109/CVPR.2017.660.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.S., Zhang, L., 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, pp. 6877–6886. http://dx.doi.org/10.1109/CVPR46437.2021.00681.