Article
A Multi-Path Semantic Segmentation Network Based on
Convolutional Attention Guidance
Chenyang Feng, Shu Hu and Yi Zhang *
College of Computer Science, Sichuan University, Chengdu 610042, China; [email protected] (C.F.);
[email protected] (S.H.)
* Correspondence: [email protected]
Abstract: Due to the efficiency of self-attention mechanisms in encoding spatial information, Transformer-
based models have recently taken a dominant position among semantic segmentation methods. However,
Transformer-based models have the disadvantages of requiring a large amount of computation and
lacking attention to detail, which motivates us to revisit CNN-based models. In this paper, we propose a multi-path
semantic segmentation network with convolutional attention guidance (dubbed MCAG). It has a multi-
path architecture, and feature guidance from the main path is used in other paths, which forces the
model to focus on the object’s boundaries and details. It also explores multi-scale convolutional features
through spatial attention. Finally, it captures both local and global contexts in spatial and channel
dimensions in an adaptive manner. Extensive experiments were conducted on popular benchmarks,
and it was found that MCAG surpasses other SOTA methods by achieving 47.7%, 82.51% and 43.6%
mIoU on ADE20K, Cityscapes and COCO-Stuff, respectively. Specifically, the experimental results prove
that the proposed model has high segmentation precision for small objects, which demonstrates the
effectiveness of convolutional attention mechanisms and multi-path strategies. The results also show that a
CNN-based model can achieve strong segmentation performance at a lower computational cost.
Keywords: convolutional attention; deep learning; feature guidance; multi-path; semantic segmentation
For the decoder, features from different layers are extracted, and a Hamburger model [9] is
implemented to extract the global context. In addition, a spatial and channel reconstruction
module is incorporated in the main path of our model to enhance feature interaction,
which also removes redundant features. Multi-path encoding allows the model to use
convolutional attention to extract overall features without ignoring details and boundary
information. The main contributions in this paper can be summarized as follows:
(1) A multi-path convolutional self-attention structure is proposed to enhance the learning
of advanced semantic information. It also integrates global information and focuses
more on the boundary information.
(2) A spatial and channel reconstruction module is developed to reinforce feature interac-
tion, which also eliminates redundant information.
(3) Extensive experiments are conducted on mainstream datasets, where our model
exhibits superior performance compared with other popular methods.
The rest of the paper is organized as follows: related works are discussed in Section 2.
The architecture of our model is described in detail in Section 3. The experimental results
and ablation studies are presented in Section 4. A final conclusion is drawn in Section 5.
2. Related Works
2.1. Semantic Segmentation
Semantic segmentation is a fundamental task in computer vision. Since the introduction
of fully convolutional networks (FCNs) [1], convolutional neural networks (CNNs) [10–13]
have achieved tremendous success and have become a popular architecture for semantic
segmentation. Fully convolutional networks keep pushing this field forward via
their end-to-end, per-pixel classification paradigms. They capture multi-scale features, in-
corporate channel attention and introduce self-attention blocks to refine contextual priors.
More recently, Transformer-based methods [5,6,14–16] have demonstrated significant potential
and have outperformed CNN-based approaches. The general structure of a segmentation
network consists of an encoder and a decoder. ResNet [17] and DenseNet [18] are commonly
adopted backbones for the encoder. Meanwhile, different decoders are devised for different
emphases, including achieving multi-scale receptive fields [12], collecting multi-scale seman-
tics [4,6,19], expanding receptive fields [2,20], enhancing edge features [11] and capturing
global contexts [13,21].
Figure 1. The encoder network architecture of MCAG. AGFM stands for attention-guided fusion
module; PFM stands for parallel fusion module; and MP, DP and BP stand for main path, detail
path and boundary path, respectively.
The results of the first stage of the MP are fed into two subsidiary paths: the detail
path (DP) and the boundary path (BP). The DP maintains the same high resolution across
all stages, emphasizing the extraction of detailed features. Starting from the second stage,
guided by the MP, the DP selectively learns high-level semantic information. The BP, on
the other hand, fuses with the lower-resolution MP at each stage, further integrating global
information and focusing on boundary details while maintaining the same high resolution
across all stages. Both paths utilize convolution-based feature extraction, with the resolution
and number of channels remaining constant, but differing in how they leverage guidance
from the MP, as elaborated in Section 3.3.
After completing the fourth stage, a fusion module called the attention-guided fusion
module (AGFM) is employed to merge the results from the three paths, producing the final
result of the encoder section. For the decoder part, we adopt the lightweight Hamburger
model [9] to generate improved segmentation results, as detailed in Section 3.4.
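The data flow described above can be summarized in the following toy sketch. It is illustrative only: the stage widths, strides and stand-in blocks are invented for this example and are not the paper's configuration, and the AGFM guidance and final fusion are reduced to trivial placeholders.

```python
# Illustrative toy sketch of the three-path flow (MP, DP, BP) described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.BatchNorm2d(cout), nn.ReLU())

class ToyMultiPathEncoder(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        self.mp1 = conv_bn_relu(3, c, stride=4)                                # stage 1 (shared stem)
        self.mp_rest = nn.ModuleList([conv_bn_relu(c, c, stride=2) for _ in range(3)])
        self.dp = nn.ModuleList([conv_bn_relu(c, c) for _ in range(3)])        # detail path, high resolution
        self.bp = nn.ModuleList([conv_bn_relu(c, c) for _ in range(3)])        # boundary path
        self.guide = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(3)])     # stand-in for MP guidance

    def forward(self, x):
        mp = self.mp1(x)                       # stage-1 output seeds both subsidiary paths
        dp, bp = mp, mp
        for i in range(3):                     # stages 2-4
            mp = self.mp_rest[i](mp)           # main path keeps downsampling
            dp, bp = self.dp[i](dp), self.bp[i](bp)
            g = F.interpolate(self.guide[i](mp), size=dp.shape[2:], mode="bilinear",
                              align_corners=False)
            dp = dp + g                        # placeholder for the attention-guided fusion (AGFM)
            bp = bp + g                        # MP output added to BP after channel/resolution adjustment
        return mp, dp, bp                      # merged by AGFM/PFM and passed to the decoder

mp, dp, bp = ToyMultiPathEncoder()(torch.randn(1, 3, 128, 128))
```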
Figure 2. (a) The network architecture of MSCAN. (b) The network architecture of MSCA. SRU and CRU stand for spatial reconstruction unit and channel reconstruction unit, respectively.

Additionally, spatial and channel reconstruction units (SRU and CRU) are incorporated in our MSCA module. For the SRU, our aim is to leverage spatial redundancy information using a separate-and-reconstruct operation. The purpose of the separate operation is to extract information-rich feature maps from those with less information. A weight from the group normalization layer is used to assess the information content of different feature maps. Specifically, given an intermediate feature map X with dimensions N × C × H × W, where N is the batch axis, C is the channel axis and H and W are the spatial height and width axes, we firstly normalize the input as shown in Equation (4):

X_out = GN(X) = γ (X − µ)/√(σ² + ε) + β    (4)

Then, the related weights are obtained using the trainable parameters γ in the group normalization layer GN, as shown in Equation (5), representing the importance of different feature maps:

W_γ = {w_i} = γ_i / Σ_{j=1}^{C} γ_j,  (i, j = 1, 2, ..., C)    (5)

The weighted feature map is then mapped to the (0, 1) range through a Sigmoid function, and a threshold gate Gate is applied to set weights above the threshold to 1 for informative weights W1 and to set those below the threshold to 0 for non-informative weights W2. The entire process is represented by Equation (6):

W = Gate(Sigmoid(W_γ(GN(X))))    (6)

Finally, the input features X are multiplied by W1 and W2 to obtain two weighted features: X_1^w, with high information content, and X_2^w, with low information content; features X_2^w with little or no spatial content information are considered as redundant. A cross operation is then applied to thoroughly combine these two differently weighted features, enhancing the information flow between them. The cross-reconstructed features X^w1 and X^w2 are concatenated to obtain the spatially refined features X^w. The entire reconstruction process is represented by Equation (7):

X_i^w = W_i ⊗ X,  (i = 1, 2)
X_ij^w = Split(X_i^w),  (i, j = 1, 2)
X_11^w ⊕ X_22^w = X^w1
X_21^w ⊕ X_12^w = X^w2
X^w1 ∪ X^w2 = X^w    (7)

where ⊗ denotes element-wise multiplication, Split denotes the operation of halving along the channel dimension, ⊕ denotes element-wise summation and ∪ denotes concatenation. After applying the SRU to the intermediate input features X, not only are the features with high information content separated from those with low information content, they are also reconstructed to enhance representative features and suppress redundant features in the spatial dimension.
For CRU, the aim is to exploit channel redundancy in features, utilizing a split–transform–merge strategy to further reduce redundancy along the channel dimension of spatially refined feature maps. Initially, the split operation is applied. For a given X^w ∈ R^(c×h×w), its channel dimension is decomposed into αC and (1 − α)C components, where 0 ≤ α ≤ 1, denoted as X_up and X_low, respectively. This process is expressed through Equations (8) and (9):
Y_1 = M^G X_up + M^P1 X_up    (8)
S_m = P(Y_m) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} Y_c(i, j)    (10)
Here, P represents the pooling operation (m = 1, 2). Next, the upper and lower global
channel descriptors, S1 and S2 , are stacked together, and a channel-wise soft attention
operation is applied to generate a feature importance vector, as shown in Equation (11):
β_1 = e^{s_1} / (e^{s_1} + e^{s_2}),
β_2 = e^{s_2} / (e^{s_1} + e^{s_2}),
β_1 + β_2 = 1.    (11)
Finally, guided by the feature importance vector, the channel-refined feature Y can
be obtained by merging the upper feature Y1 and the lower feature Y2 in a channel-wise
partitioning manner, as expressed in Equation (12):
Y = β_1 Y_1 + β_2 Y_2    (12)
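The spatial and channel reconstruction of Equations (4)–(12) can be sketched as follows. This is a minimal interpretation of the text rather than the authors' implementation: the group count, the 0.5 gate threshold and the lower branch of the CRU (Equation (9) is not reproduced in the excerpt) are assumptions.

```python
# Minimal sketch of the spatial (SRU) and channel (CRU) reconstruction, Eqs. (4)-(12).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRU(nn.Module):
    """Spatial reconstruction: separate informative/redundant maps with GN weights, then cross-reconstruct."""
    def __init__(self, channels, groups=4, gate_threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x):
        gn_x = self.gn(x)                                            # Eq. (4)
        w_gamma = self.gn.weight / self.gn.weight.sum()              # Eq. (5)
        w = torch.sigmoid(gn_x * w_gamma.view(1, -1, 1, 1))          # weights mapped to (0, 1)
        w1 = (w >= self.gate_threshold).float()                      # informative weights, Eq. (6)
        w2 = (w < self.gate_threshold).float()                       # non-informative weights
        x1, x2 = w1 * x, w2 * x                                      # X_i^w = W_i ⊗ X
        x11, x12 = torch.chunk(x1, 2, dim=1)                         # Split along the channel dimension
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x21 + x12], dim=1)              # cross sums, then concatenation, Eq. (7)

class CRU(nn.Module):
    """Channel reconstruction: split-transform-merge over the channel dimension."""
    def __init__(self, channels, alpha=0.5, groups=2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        self.gwc = nn.Conv2d(self.c_up, channels, 3, padding=1, groups=groups)   # M^G in Eq. (8)
        self.pwc1 = nn.Conv2d(self.c_up, channels, 1)                            # M^P1 in Eq. (8)
        self.pwc2 = nn.Conv2d(self.c_low, channels, 1)                           # lower transform (assumed)

    def forward(self, xw):
        x_up, x_low = torch.split(xw, [self.c_up, self.c_low], dim=1)            # split into αC and (1-α)C
        y1 = self.gwc(x_up) + self.pwc1(x_up)                                    # Eq. (8)
        y2 = self.pwc2(x_low)                                                    # stand-in for Eq. (9)
        s1, s2 = F.adaptive_avg_pool2d(y1, 1), F.adaptive_avg_pool2d(y2, 1)      # Eq. (10)
        beta = torch.softmax(torch.stack([s1, s2]), dim=0)                       # Eq. (11)
        return beta[0] * y1 + beta[1] * y2                                       # Eq. (12)

refined = CRU(64)(SRU(64)(torch.randn(2, 64, 32, 32)))   # spatially then channel-wise refined features
```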
Figure 4. The structure diagram of block_BP.
Each stage of DP consists of k = 2 block_DP modules. The initial input is the output of the first stage of the main path (MP), and subsequently, the number of channels in the DP remains unchanged (except for the output channels in the fourth stage, which match those of the main path). The resolution is also consistently maintained, which better preserves the details. In each stage, the DP learns higher-level semantic information under the guidance of the main path to compensate for the loss of advanced information due to the lower number of channels and smaller convolutional kernels. In block_DP, the upper path employs a convolutional kernel with a size of 3 and a stride of 1, while the lower path uses a kernel of size 1. The primary purpose of the lower path is to adjust the number of channels (in the fourth stage) and add these to the original information from the upper path, preserving the original features as much as possible for subsequent stages. Thus, block_DP efficiently extracts features from high-resolution images with lower parameters (fewer channels and smaller convolutional kernels) and selectively learns advanced semantic information under the guidance of the main path.
Each stage of the BP consists of k = 1 block_BP module. Similar to the DP, the initial input for the BP is the output of the first stage of the main path (MP). The number of channels and resolution remain consistent in the second and third stages, and in the last stage, the number of channels matches that of the main path’s fourth stage, maintaining a lower parameter count (fewer channels and smaller convolutional kernels). In contrast to the DP, the upper path’s first convolution block in block_BP doubles the feature channel count using a 1 × 1 convolutional kernel. The second convolutional block extracts features from the high channel count, allowing the model to better capture details and patterns
in the input data, learn more types of features and improve the model’s generalization
ability to different samples, enhancing its robustness. The kernel size for the second
block is 3 × 3. The third convolutional block then restores the channel count to its initial
value using a 1 × 1 convolutional kernel. The lower path’s convolutional block, similar
to block_DP, primarily adjusts the number of channels (in the fourth stage) and adds
the original information from the upper path. Therefore, the BP, composed of block_BP, effectively utilizes lower parameters (fewer channels and smaller convolutional kernels), pays more attention to boundary information and, under the guidance of the main path, further integrates global information on a larger scale.
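A minimal sketch of block_DP and block_BP as described above is given below. Only the kernel sizes and the upper/lower two-path layout follow the text; the BatchNorm/ReLU placement and how the stage-4 channel adjustment is parameterised are assumptions.

```python
# Minimal sketch of block_DP and block_BP (kernel sizes follow the text; the rest is assumed).
import torch
import torch.nn as nn

class BlockDP(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.upper = nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.BatchNorm2d(cout), nn.ReLU())
        self.lower = nn.Conv2d(cin, cout, 1)           # 1x1: adjusts channels (stage 4), keeps original features
    def forward(self, x):
        return self.upper(x) + self.lower(x)           # lower-path features added to the upper path

class BlockBP(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        mid = 2 * cin                                  # first 1x1 conv doubles the channel count
        self.upper = nn.Sequential(
            nn.Conv2d(cin, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, 1, 1), nn.BatchNorm2d(mid), nn.ReLU(),   # 3x3 on the widened features
            nn.Conv2d(mid, cout, 1))                   # 1x1 restores the channel count
        self.lower = nn.Conv2d(cin, cout, 1)
    def forward(self, x):
        return self.upper(x) + self.lower(x)

x = torch.randn(1, 32, 64, 64)
print(BlockDP(32, 32)(x).shape, BlockBP(32, 32)(x).shape)   # resolution is preserved in both blocks
```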
At the end of each stage, the main path guides and provides information to both pathways.
For the DP, due to its low number of stacked convolutional layers and small kernel
sizes, the main path guides its feature extraction at high resolution, allowing it to selectively
learn higher-level semantics. Specifically, the main path’s output from each stage starting
from stage two is combined with the corresponding output of the DP and then fed into the
AGFM module (attention-guided fusion module). The schematic diagram of this module
is shown in Figure 5, where “dp” represents the features from the DP; “mp” represents
the features from the main path; “S” represents the combination operation of sum and
Sigmoid; ⊗ denotes element-wise multiplication, which is the weight allocation; and ⊕
denotes element-wise summation. The main path’s high-level semantic information is
selectively incorporated into the pathway, and the DP retains a significant amount of
high-quality detailed information that ultimately enhances the segmentation results. The
lateral connections used in [26–28] strengthen the information flow between feature maps
of different scales, improving the model’s representational capacity. In the AGFM, the
outputs of the DP and the main path, both passed through convolutional blocks and
channel expansion, are adjusted to the same resolution. Denoting these as dp and mp, the
output of the Sigmoid function can be expressed as Equation (13):
where the computed result S indicates the likelihood of these two pixels belonging to
the same object class, sum represents the summation along the channel dimension and ⊗
denotes element-wise multiplication. When S is higher, there is reason to trust the results
from the main path since it provides rich and accurate semantics, and vice versa. After
obtaining S, we adjust the number of channels and resolution of the main path to match
those of the DP and perform the final addition. Thus, the output of the AGFM module can
be written as Equation (14):
Out_AGFM = S ⊗ mp + (1 − S) ⊗ dp    (14)

Therefore, in the case of deeper feature extraction, the main path can leverage higher semantic information to guide the DP in selectively learning better semantic information while preserving detailed information, ultimately optimizing the segmentation results.
For the BP, at the end of each stage, the output of the main path is directly added to the output of the BP after adjusting the number of channels and the resolution. This integrates global information and focuses on boundary information using the output features of the main path (MP).
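The AGFM fusion can be sketched as follows. Equation (13) is not reproduced in the excerpt, so, based on the description of a channel-wise sum followed by a Sigmoid, S is assumed here to be Sigmoid(sum(dp ⊗ mp)); the 1 × 1 projection and bilinear upsampling used to align mp with dp are likewise assumptions.

```python
# Sketch of the attention-guided fusion module (AGFM), Eqs. (13)-(14), with the assumed form of S.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGFM(nn.Module):
    def __init__(self, mp_channels, dp_channels):
        super().__init__()
        self.proj = nn.Conv2d(mp_channels, dp_channels, 1)      # adjust MP channels to match the DP

    def forward(self, dp, mp):
        mp = self.proj(mp)
        mp = F.interpolate(mp, size=dp.shape[2:], mode="bilinear", align_corners=False)
        s = torch.sigmoid((dp * mp).sum(dim=1, keepdim=True))   # assumed form of Eq. (13)
        return s * mp + (1.0 - s) * dp                          # Eq. (14)

out = AGFM(64, 32)(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 16, 16))
```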
Figure 5. The structural diagram of AGFM.

To construct a better global scene prior, PSPNet introduces a pyramid pooling module (PPM), concatenating multi-scale pooled representations before the convolution layers to capture both local and global contexts. In MCAG, after the last stage of the main path, the output is fed into a parallel fusion module (PFM) to prepare for the fusion of the final three paths. This parallel fusion module enhances the context embedding capability, forming a fusion of local and global contexts to analyze global correlations. PFM processes the output of the last stage of the main path in parallel through four pooling paths, with kernel sizes of 5, 9 and 17 for the first three paths and global average pooling for the last path. It then passes through BN and ReLU layers, followed by a convolutional layer that doubles the number of channels and concatenates the results. Finally, a residual connection is established with the input features of PFM to obtain the final output of PFM, as expressed in Equations (15) and (16):

P = Σ_{i=0}^{3} Pooling_i(input)    (15)

S = Sigmoid(bp)    (17)

Figure 6. Fusion diagram of the main path and branch path.
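The PFM described above can be sketched as follows. Since Equation (16) is not reproduced in the excerpt, the wiring of the channel-doubling convolution and the concatenation is an assumption; the version below keeps the channel count unchanged so that the residual connection to the PFM input is a simple addition.

```python
# Sketch of the PFM: pooling branches (kernels 5, 9, 17 plus global average pooling) summed as
# in Eq. (15), then BN, ReLU and a convolution, with a residual connection back to the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFM(nn.Module):
    def __init__(self, channels, kernels=(5, 9, 17)):
        super().__init__()
        self.kernels = kernels
        self.bn = nn.BatchNorm2d(channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        p = sum(F.avg_pool2d(x, k, stride=1, padding=k // 2) for k in self.kernels)
        p = p + F.adaptive_avg_pool2d(x, 1).expand_as(x)        # global-average-pooling branch, Eq. (15)
        return x + self.conv(F.relu(self.bn(p)))                # residual connection to the PFM input

out = PFM(64)(torch.randn(1, 64, 16, 16))
```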
f = Ham(Cat(S_2, S_3, S_4))    (19)

This study uses ImageNet to pre-train the main path (MP) of the MCAG encoder.
ADE20K [29] is a challenging dataset with 150 semantic classes, consisting of 20,210/2000/
3352 images for training, validation and testing sets, respectively. Cityscapes [30] focuses
on urban scenes, presenting 5000 high-resolution images with 19 categories. The dataset is
divided into 2975/500/1525 images for training, validation and testing. COCO-Stuff [31] is
another challenging dataset, encompassing a total of 172 semantic classes and 164,000 images.
The experiments in this paper were conducted using PyTorch [33] and the mmsegmen-
tation library [34]. The main route of the segmentation model’s encoder was pretrained on
the ImageNet-1K dataset [32]. The mean intersection over union (mIoU) was employed as
the segmentation evaluation metric. All models were trained on nodes equipped with two
RTX 3090 GPUs.
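For reference, mIoU can be computed from a confusion matrix as below; this is the standard definition, not the exact evaluation code of the mmsegmentation library.

```python
# Standard per-class IoU / mIoU from a confusion matrix (reference implementation only).
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    mask = gt != ignore_index
    conf = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)   # NaN for absent classes
    return np.nanmean(iou), iou

miou, per_class_iou = mean_iou(np.random.randint(0, 19, (512, 512)),
                               np.random.randint(0, 19, (512, 512)), num_classes=19)
```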
For the pre-training on ImageNet, the data augmentation methods and training set-
tings were consistent with DeiT [35]. Common data augmentation techniques, including
random horizontal flipping, random scaling (from 0.5 to 2) and random cropping, are
applied for segmentation experiments. The batch size for the Cityscapes dataset is set
to 4, while for the other datasets, it is set to 8. The AdamW optimizer [36] is used for
training. The initial learning rate is set to 0.00006, and a multi-learning rate decay strategy
is employed. The ADE20K model is trained for 160K iterations, and the Cityscapes and
COCO-Stuff models are trained for 80K iterations.
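The optimisation settings listed above can be reproduced in plain PyTorch roughly as follows. The paper trains with mmsegmentation; the polynomial ("poly") per-iteration decay and the weight-decay value below are assumptions, and `model` stands for any segmentation network.

```python
# Rough plain-PyTorch equivalent of the training settings (AdamW, lr 6e-5, per-iteration decay).
import torch

def build_optimizer_and_schedule(model, max_iters=160_000, base_lr=6e-5, power=1.0):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)
    # lr_t = base_lr * (1 - t / max_iters) ** power, stepped once per training iteration
    schedule = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda t: (1.0 - t / max_iters) ** power)
    return optimizer, schedule

model = torch.nn.Conv2d(3, 19, 1)                      # placeholder for the segmentation model
optimizer, schedule = build_optimizer_and_schedule(model)
for _ in range(10):                                    # 160K iterations for ADE20K, 80K otherwise
    optimizer.step()
    schedule.step()
```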
Extensive comparisons are conducted on three popular benchmarks (ADE20K [29], Cityscapes [30] and COCO-Stuff [31]) to demonstrate the superiority of the proposed approach. The multi-scale flipping testing strategy (MS) is employed during the comparison process.
On ADE20K, we compare MCAG with state-of-the-art semantic segmentation models.
As shown in Table 1, MCAG achieves a nearly 1.0% higher mIoU compared to the state-of-
the-art CNN-based model SegNext-B [7], and it outperforms the fully attentional network
FLANet [37] by 0.7% in mIoU. Additionally, MCAG achieves better mIoU values than the
Transformer-based models MPViT [22], FASeg [38] and TSG [39] with fewer parameters.
FASeg introduces a simple and effective query design for semantic segmentation called
dynamic focus-aware position query (DFPQ), which dynamically generates position queries
based on the cross-attention scores of the previous decoder block and the position encoding
of corresponding image features. TSG, on the other hand, utilizes internal attributes of the
attention map in Transformer for multi-scale feature selection in semantic segmentation.
TSG introduces TSGE and TSGD in the encoder and decoder of the Transformer, respectively,
to enhance the semantic segment localization performance. These results demonstrate that
MCAG achieves competitive segmentation performance while introducing a multi-path
self-attention mechanism at a lower computational cost than Transformer models. The
asterisk (*) denotes reproduced results.
Table 1. Comparison with SOTA on ADE20K. The asterisk (*) denotes reproduced results.
Table 2. Comparison with SOTA on Cityscapes. The asterisk (*) denotes reproduced results.
Table 3. Comparison with SOTA on Coco-Stuff. The asterisk (*) denotes reproduced results.
Table 4. Comparison of specific objects with Segnext on ADE20K. The asterisk (*) denotes repro-
duced results.
4.3. Visualization
In Figure 8, the visual results of MCAG on the Cityscapes dataset are presented. The
first column displays the input images, the second column represents the corresponding
ground truth and the third column shows the segmentation results of the MCAG method,
with black rectangular regions indicating detailed displays.
Figure 8. Visualization results of MCAG on the Cityscapes dataset. The first column displays the input images, the second column represents the corresponding ground truth and the third column shows the segmentation results of the MCAG method.
It can be observed that MCAG is more effective at identifying both boundary details and overall information. In the first set of images, MCAG successfully recognizes the railing in front of the central part of the bicycle and the seat of the bicycle, as well as pedestrians next to the utility pole. In the second set of images, above the cyclist, MCAG adeptly identifies the less noticeable lamp posts, and effectively segments the two seated individuals in the center of the image from the background bushes. In the third set of images, MCAG achieves satisfactory results in segmenting pedestrians at the end of the road, paying particular attention to details. In the fourth set of images, the model performs outstanding segmentation between the pedestrians in front of the car and the car itself, with clear distinction between the legs of the pedestrians and the road. These remarkable results stem from the robust long-range dependency-parsing ability of MCAG’s main path, the subsidiary paths’ exceptional focus on image details at high resolutions and the final fusion mechanism’s appropriate handling of features extracted at multiple scales.
Among the modules, “Multiple Paths without Attention Guidance” denotes the variant in which the guidance from the main route to the DP and the guidance to the BP are replaced by simple addition.
It can be observed that each component contributes to the model’s final performance. When
using both multiple paths and attention guidance, the mIoU is 0.84% higher than having
only one main route. If there is no attention guidance mechanism and a simple addition of
main route and branch results is performed, the result is lower by 0.4%. These two findings
indicate that the proposed multi-path and attention guidance from the main route are both
effective and necessary.
Author Contributions: Conceptualization, C.F.; Investigation, C.F.; Methodology, C.F.; Project admin-
istration, S.H. and Y.Z.; Resources, Y.Z.; Software, C.F.; Supervision, S.H. and Y.Z.; Validation, C.F.;
Visualization, C.F.; Writing—original draft, C.F.; Writing—review and editing, Y.Z. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author. The data are not publicly available due to privacy restrictions.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
2. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets
and fully connected crfs. arXiv 2014, arXiv:1412.7062.
3. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017,
arXiv:1706.05587.
4. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image
segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018;
pp. 801–818.
5. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation
from a sequence-to-sequence perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
6. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. Segformer: Simple and efficient design for semantic
segmentation with Transformers. Adv. Neural Inform. Process. Syst. 2021, 34, 12077–12090.
7. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. Segnext: Rethinking convolutional attention design for semantic
segmentation. Adv. Neural Inform. Process. Syst. 2022, 35, 1140–1156.
8. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. arXiv 2022, arXiv:2202.09741. [CrossRef]
9. Geng, Z.; Guo, M.H.; Chen, H.; Li, X.; Wei, K.; Lin, Z. Is attention better than matrix decomposition? In Proceedings of the 2021
International Conference on Learning Representations, Virtual, 3–7 May 2021.
10. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef] [PubMed]
11. Bertasius, G.; Shi, J.; Torresani, L. Semantic segmentation with boundary neural fields. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3602–3610.
12. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
13. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans.
Pattern Anal. Mach. Intell. 2021, 43, 652–662. [CrossRef] [PubMed]
14. Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. Hrformer: High-resolution vision Transformer for dense prediction.
Adv. Neural Inform. Process. Syst. 2021, 34, 7281–7293.
15. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021;
pp. 7262–7272.
16. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for dense prediction. In Proceedings of the 2021 IEEE/CVF
International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 12179–12188.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the
International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October
2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
20. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional
nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [CrossRef] [PubMed]
21. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings
of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019;
pp. 603–612.
22. Lee, Y.; Kim, J.; Willette, J.; Huang, S.J. Mpvit: Multi-path vision Transformer for dense prediction. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7287–7296.
23. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725.
24. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings
of the International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 448–456.
25. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; Volume 5, pp. 510–519.
26. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes.
arXiv 2021, arXiv:2101.06085.
27. Orsic, M.; Kreso, I.; Bevandic, P.; Šegvić, S. In defense of pre-trained imagenet architectures for real-time semantic segmentation
of road-driving images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 15–20 June 2019; pp. 12607–12616.
28. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution
representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [CrossRef] [PubMed]
29. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 633–641.
30. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes
dataset for semantic urban scene understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
31. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the 2018 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1209–1218.
32. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the
2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
33. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch:
An imperative style, high-performance deep learning library. Adv. Neural Inform. Process. Syst. 2019, 32, 8026.
34. Contributors, M. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online:
https://github.com/open-mmlab/mmsegmentation (accessed on 1 July 2022).
35. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data efficient image transformers & distillation
through attention. In Proceedings of the International Conference on Machine Learning (ICML 2021), Online, 18–24 July 2021;
pp. 10347–10357.
36. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
37. Song, Q.; Li, J.; Li, C.; Guo, H.; Huang, R. Fully attentional network for semantic segmentation. In Proceedings of the AAAI
Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 2280–2288.
38. He, H.; Cai, J.; Pan, Z.; Liu, J.; Zhang, J.; Tao, D.; Zhuang, B. Dynamic Focus-aware Positional Queries for Semantic Segmentation.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June
2023; pp. 11299–11308.
39. Shi, H.; Hayat, M.; Cai, J. Transformer scale gate for semantic segmentation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3051–3060.
40. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of the
NeurIPS 2021, Online, 6–14 December 2021.
41. Dong, B.; Wang, P.; Wang, F. Head-free lightweight semantic segmentation with linear transformer. arXiv 2023, arXiv:2301.04648.
[CrossRef]
42. Lin, F.; Liang, Z.; Wu, S.; He, J.; Chen, K.; Tian, S. Structtoken: Rethinking semantic segmentation with structural prior. IEEE
Trans. Circuits Syst. Video Technol. 2023, 33, 5655–5663. [CrossRef]
43. Liu, Q.; Dong, Y.; Jiang, Z.; Pei, Y.; Zheng, B.; Zheng, L.; Fu, Z. Multi-Pooling Context Network for Image Semantic Segmentation.
Remote Sens. 2023, 15, 2800. [CrossRef]
44. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023;
pp. 19529–19539.
45. Huynh, C.; Tran, A.T.; Luu, K.; Hoai, M. Progressive semantic segmentation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16755–16764.
46. Nirkin, Y.; Wolf, L.; Hassner, T. Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4061–4070.
47. Lei, X.; Liang, J.; Gong, Z.; Jiang, Z. LightSeg: Local Spatial Perception Convolution for Real-Time Semantic Segmentation. Appl.
Sci. 2023, 13, 8130. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.