A Survey on Image Semantic Segmentation Using Deep Learning Techniques

Citation: Cheng, J., Li, H., Li, D., Hua, S., & Sheng, V. S. (2023). A survey on image semantic segmentation using deep learning techniques. Computers, Materials & Continua, 74(1). https://doi.org/10.32604/cmc.2023.032757
Citable link: https://hdl.handle.net/2346/93006
© 2023 Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 School of Computer Science and Technology, Hainan University, Haikou, 570228, China
2 School of Cyberspace Security (School of Cryptology), Hainan University, Haikou, 570228, China
3 Hainan Blockchain Technology Engineering Research Center, Hainan University, Haikou, 570228, China
4 Department of Computer Science, Texas Tech University, TX 79409, USA
*Corresponding Author: Hua Li. Email: [email protected]
Received: 28 May 2022; Accepted: 12 July 2022
1 Introduction
Image semantic segmentation is a fundamental task in computer vision. It can be regarded as a pixel-level classification task: by densely predicting a label for every pixel, it achieves fine-grained reasoning in which each pixel is assigned to a specific category. Image semantic segmentation therefore provides not only category predictions but also the spatial location of each class. In recent years, semantic segmentation has been applied increasingly widely, playing an important role in medical image analysis [1], autonomous driving [2], virtual/augmented reality [3], video surveillance [4], and three-dimensional reconstruction [5].
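As a minimal illustration of this pixel-level view, the sketch below (in Python with PyTorch; the random logits stand in for the output of any segmentation network, so nothing here is specific to a particular model) assigns each pixel the class with the highest predicted score:

    import torch

    # Hypothetical segmentation network output: class scores (logits)
    # for every pixel of a batch of images.
    batch, num_classes, height, width = 1, 21, 480, 640
    logits = torch.randn(batch, num_classes, height, width)

    # Semantic segmentation = pixel-level classification: each pixel is
    # assigned the class with the highest score.
    label_map = logits.argmax(dim=1)  # shape: (1, 480, 640)
    print(label_map.shape, int(label_map.min()), int(label_map.max()))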
Reviewing the development of semantic segmentation methods [6], early approaches were mostly based on classical mathematical techniques such as thresholding, k-means clustering, and conditional random fields. Then, with the great success of deep learning in various fields [7], researchers applied deep learning techniques to the semantic segmentation task and designed the fully convolutional network (FCN) [8]. Since then, convolutional neural networks have swept the field and become the mainstream approach. In the past two years, transformers have become popular in computer vision, and the application of MLP techniques in this field has inspired researchers to explore further possibilities for semantic segmentation.
With the rapid emergence of new deep learning-based semantic segmentation methods in recent years, many past reviews have shortcomings. Although they [9,10] introduced common datasets in the field and the technical details of some classical methods, they lacked descriptions of newer techniques (e.g., transformer- and MLP-based methods). To the best of our knowledge, there is no extensive survey covering the full range of semantic segmentation methods: CNN-based, transformer-based, and MLP-based.
The goal of this paper is to summarize and classify current deep learning methods for semantic segmentation, providing a comprehensive reference for scholars and practitioners. Inspired by the work of Zhao et al. [11], this paper compares and analyzes the image segmentation work of the three main neural network architectures in deep learning and proposes a new classification scheme, shown in Fig. 1, based on network architecture. Existing semantic segmentation methods are divided into four categories: CNN-based architectures, transformer-based architectures, MLP-based architectures, and others.
The key contributions of this paper are as follows. It provides a systematic review of image semantic segmentation methods that covers the latest literature in the field. The deep learning algorithms used in image segmentation are described and divided into four categories according to their network architectures. The advantages and limitations of existing segmentation methods are compared and analyzed on popular benchmarks. Finally, the study identifies trends, challenges, and future research directions for deep learning-based semantic segmentation.
The remainder of this survey is organized as follows: Section 2 reviews some of the most popular
image segmentation datasets and their characteristics. Section 3 is the main body of our survey. Section
4 summarizes some common metrics used in the performance evaluation of segmentation models,
and then evaluates and analyzes the performance of the models. Section 5 discusses the main future
research directions and challenges in the field of image segmentation. Finally, Section 6 makes a
summary.
2 Datasets
There are many datasets available for semantic segmentation tasks. This paper introduces ten representative general image segmentation datasets: PASCAL visual object classes (VOC) [12], Cityscapes [13], Microsoft common objects in context (COCO) [14], ADE20K [15], the Cambridge-driving labeled video database (CamVid) [16], COCO-Stuff [17], the Indian driving dataset (IDD) [18], Dark Zurich [19], the adverse conditions dataset with correspondences (ACDC) [20], and PartImageNet [21]. According to their purposes, these datasets can be divided into generic, urban/driving, generic-part, etc. Although related works [9,10] have described datasets in detail, they suffer from partially outdated content and omit recent datasets. Therefore, these image semantic segmentation datasets are briefly summarized here, and detailed information about each (such as purpose, number of classes, training/validation/testing splits, and access hyperlink) is provided. Tab. 1 gives a summarized view of the above datasets; the first five are the more popular datasets, and the last five are the most recent. In addition, some segmentation models select only the classes of interest when training on a dataset instead of using all classes, so the number in brackets in the class column indicates the number of frequently used classes. This summary is intended to give readers a basic understanding of commonly used semantic segmentation datasets; readers can follow the corresponding links for detailed descriptions as needed.
3 Methods

This section, the main body of our survey, divides segmentation methods into four categories according to their architectures: CNN-based, transformer-based, MLP-based, and others, and introduces the typical segmentation methods based on these architectures in detail.
The down-sampling operation of both branches loses part of the spatial information and reduces network performance. Therefore, HRNet starts from a high-resolution subnetwork and gradually down-samples to form subnetworks from high to low resolution. The subnetworks are connected in parallel and continuously exchange information, and all branches are aggregated to produce the output.
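A toy sketch of this parallel high/low-resolution design with a single fusion step is given below (channel widths and layer choices are illustrative, not HRNet's actual configuration):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoBranchFusion(nn.Module):
        """Toy HRNet-style block: a high-resolution and a low-resolution
        branch run in parallel and exchange information at the end."""
        def __init__(self, ch_hi=32, ch_lo=64):
            super().__init__()
            self.hi = nn.Conv2d(ch_hi, ch_hi, 3, padding=1)  # keeps full resolution
            self.lo = nn.Conv2d(ch_lo, ch_lo, 3, padding=1)  # runs at half resolution
            self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)
            self.lo_to_hi = nn.Conv2d(ch_lo, ch_hi, 1)

        def forward(self, x_hi, x_lo):
            h, l = self.hi(x_hi), self.lo(x_lo)
            # Fuse: up-sample the low branch into the high branch, and
            # down-sample the high branch into the low branch.
            up = F.interpolate(self.lo_to_hi(l), size=h.shape[-2:],
                               mode="bilinear", align_corners=False)
            return h + up, l + self.hi_to_lo(h)

    block = TwoBranchFusion()
    hi, lo = block(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
    print(hi.shape, lo.shape)  # (1, 32, 64, 64) and (1, 64, 32, 32)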
Transformers have been successfully applied to visual tasks such as image classification [44], image captioning [45], object detection [46], and segmentation [47]. The Google team also analyzed the training of vision transformers and provided effective guidance for future research on visual transformers [48]. This section mainly introduces self-attention mechanisms and transformer networks (a specific form of self-attention) in the field of image segmentation.
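For reference, the core operation underlying all the models discussed below is scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V; a minimal single-head sketch (dimensions illustrative):

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        """Single-head self-attention over a sequence x of shape (n, d).
        Every position attends to every other position, which is how
        transformers capture long-range (global) dependencies."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (n, n) similarities
        weights = F.softmax(scores, dim=-1)                      # rows sum to 1
        return weights @ v                                       # weighted sum of values

    n, d = 196, 64  # e.g., 14 x 14 patch tokens with 64-dim embeddings
    x = torch.randn(n, d)
    w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([196, 64])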
Next, we introduce transformer networks specifically designed for the image semantic segmentation task.
Zheng et al. [59] first performed semantic segmentation with a transformer, constructing a segmentation transformer network (SETR) to extract global semantic information. Since the input and output of a transformer must be serialized, SETR first divides the two-dimensional image into (H/16) × (W/16) patches, converts them into a one-dimensional sequence of length H × W/256, and then learns a specific encoding for each patch through positional encoding to retain spatial information. The sequence is then fed into a transformer encoder composed of multi-head self-attention (MSA) and multi-layer perceptron (MLP) modules to learn features. In addition, to effectively evaluate the encoder, SETR designed three different decoders: naive up-sampling (naive), progressive up-sampling (PUP), and multi-level feature aggregation (MLA). Inspired by SETR, Strudel et al. [60] designed a pure transformer model named Segmenter for semantic segmentation. Each encoder layer in Segmenter models global context information; a mask transformer then decodes the encoder output together with class embeddings into a three-dimensional segmentation feature map, which is up-sampled to obtain the final segmentation map. Building on the transformer's ability to model global dependencies in images, the segmentation transformer [61] combined the transformer with object-contextual representations (OCR) to enhance feature expression.
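The serialization step that SETR and Segmenter depend on can be sketched as follows (a common implementation uses a strided convolution as the patch projection; the embedding dimension here is illustrative):

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Turn a 2-D image into a 1-D token sequence, SETR-style:
        split into (H/16) x (W/16) patches, linearly project each patch,
        and add a learned positional embedding to retain spatial layout."""
        def __init__(self, img_size=512, patch=16, in_ch=3, dim=768):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
            num_patches = (img_size // patch) ** 2  # H*W / 256 tokens
            self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

        def forward(self, x):                                  # x: (B, 3, H, W)
            tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, HW/256, dim)
            return tokens + self.pos                           # ready for the encoder

    seq = PatchEmbed()(torch.randn(1, 3, 512, 512))
    print(seq.shape)  # torch.Size([1, 1024, 768])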
Considering the key role of multi-scale features in CNN-based segmentation models, researchers began to explore methods that combine the advantages of transformers with multi-scale encoding. For example, Xie et al. [62] proposed the SegFormer model, which pairs a hierarchical transformer encoder with a lightweight MLP decoder. The hierarchical transformer encoder generates multi-level features whose sizes decrease layer by layer: large-scale feature maps provide coarse-grained information, while small-scale feature maps provide fine-grained information. The all-MLP decoder aggregates the different levels of features and combines global and local attention to obtain a more powerful representation, achieving excellent segmentation performance. Liu et al. [63] proposed a shifted-window scheme that limits self-attention computation to non-overlapping local windows, constructing a hierarchical architecture named the Swin Transformer, which merges patch blocks layer by layer to expand the perception range and encodes information at different scales to suit multi-scale vision tasks. Chen et al. [64] proposed a vision transformer adapter (ViT-Adapter), which first uses a spatial prior module to model the local spatial context of the input image, then injects the captured prior features into the patches encoded by the vision transformer backbone (ViT) [65], and finally extracts hierarchical features from the block outputs through a multi-scale feature extractor.
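The window-based attention of [63] can be illustrated by the partitioning step below, which reshapes a feature map into independent window-sized token sequences (the actual Swin Transformer additionally shifts the windows between consecutive layers, which this sketch omits):

    import torch

    def window_partition(x, win=7):
        """Split a feature map (B, H, W, C) into non-overlapping win x win
        windows; self-attention is then computed inside each window only,
        so the cost grows linearly with image size instead of quadratically."""
        b, h, w, c = x.shape
        x = x.view(b, h // win, win, w // win, win, c)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)
        return windows  # (num_windows * B, win*win, C): one token sequence per window

    feat = torch.randn(1, 56, 56, 96)    # a typical early-stage resolution
    print(window_partition(feat).shape)  # torch.Size([64, 49, 96])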
Besides using multi-scale information, effectively utilizing attention mechanisms at different distances to enhance features can also improve model performance. For example, the cross-scale transformer (CrossFormer) [66] proposed long short distance attention (LSDA), which attends not only to dependencies between adjacent embeddings but also to dependencies between embeddings far apart from each other, retaining both small-scale and large-scale embedding characteristics while reducing cost. Yang et al. [67] proposed focal self-attention and constructed a transformer architecture named the Focal Transformer. It builds fine-grained attention in the local scope and coarse-grained attention in the global scope, effectively capturing both short-range and long-range visual dependencies while reducing computation and improving model performance.
3.4 Others
In [11], a unified framework called SPACH was developed to fairly compare the performance of CNNs, transformers, and MLPs. The experimental results show that under the same pre-training conditions, all three architectures can perform the classification task well. CNNs and transformers are complementary: CNN-based structures have the best generalization ability, while transformer-based structures have the largest model capacity.

Evidently, each type of architecture has its own advantages, so future research need not stick to a single architecture. Researchers can integrate the advantages of multiple architectures according to actual task requirements to achieve more efficient performance. This section mainly introduces related research on hybrid architectures.
The convolution operation in CNN-based architectures has low computational cost, but it cannot model long-range dependencies. In contrast, the transformer has a global attention mechanism but captures low-level details insufficiently. Therefore, some works combine CNN and transformer to exploit the advantages of both and achieve a more efficient architecture. The following methods all combine CNN and transformer. The nnFormer [72] built an interleaved architecture based on self-attention and convolution to better combine transformer and CNN. Zhang et al. [73] proposed an architecture called TransFuse that integrates transformer and CNN. TransFuse consists of parallel transformer and CNN branches: the transformer branch captures global information and builds long-range dependencies, while the CNN branch captures rich local information. Guo et al. [74] proposed a method for segmenting objects with transformers (SOTR), which extracts shallow features through a feature pyramid and captures long-range context dependencies with a parallel two-branch transformer.
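The parallel-branch idea behind these hybrids can be sketched as below; note that TransFuse's actual fusion module is considerably more elaborate, so this only illustrates concatenating a local CNN feature with a stand-in global feature:

    import torch
    import torch.nn as nn

    class ParallelFuse(nn.Module):
        """Toy two-branch hybrid: a CNN branch for local detail plus a
        stand-in global branch, fused by concatenation and a 1x1 conv."""
        def __init__(self, ch=64):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
            self.fuse = nn.Conv2d(2 * ch, ch, 1)

        def forward(self, img, global_feat):
            # global_feat stands in for a transformer-branch output that
            # has been reshaped back to a (B, ch, H, W) feature map.
            local_feat = self.cnn(img)
            return self.fuse(torch.cat([local_feat, global_feat], dim=1))

    m = ParallelFuse()
    out = m(torch.randn(1, 3, 64, 64), torch.randn(1, 64, 64, 64))
    print(out.shape)  # torch.Size([1, 64, 64, 64])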
In addition to combining CNN and transformer, researchers have also tried to combine CNN and MLP. MLP-based architectures can encode feature information well, but most require fixed-dimension inputs, which makes them difficult to adapt to downstream tasks (such as object detection and semantic segmentation), and they have a large computational cost and limited performance. Convolutional neural networks, on the other hand, can greatly reduce computation and flexibly adapt to different inputs. Therefore, integrating CNN and MLP can yield a lighter, staged, high-performance architecture. Li et al. [75] proposed hierarchical convolutional MLPs (ConvMLP) for vision, which mainly comprise a convolution stage and a Conv-MLP stage. A tokenizer consisting of convolution, normalization, an activation function, and max pooling extracts initial features, and the convolution stage is responsible for enhancing spatial connections. In the Conv-MLP stage, convolution is used to increase the interaction of adjacent information during patch merging and down-sampling, and a depthwise convolution layer is embedded between the two MLP blocks to further promote the blending of adjacent information, effectively improving model performance.
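A sketch of such a Conv-MLP block, with a depthwise convolution embedded between two channel MLPs (dimensions and layer ordering illustrative, not ConvMLP's exact design):

    import torch
    import torch.nn as nn

    class ConvMLPBlock(nn.Module):
        """Toy Conv-MLP block: two channel MLPs with a depthwise
        convolution in between, so neighbouring positions can mix
        information that pure per-position MLPs would keep separate."""
        def __init__(self, dim=64):
            super().__init__()
            self.mlp1 = nn.Conv2d(dim, dim, 1)  # 1x1 conv == per-position MLP
            self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise: spatial mixing
            self.mlp2 = nn.Conv2d(dim, dim, 1)
            self.act = nn.GELU()

        def forward(self, x):
            return self.mlp2(self.dw(self.act(self.mlp1(x)))) + x  # residual connection

    out = ConvMLPBlock()(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])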
4 Evaluation of Methods
In this section, we first introduce several common metrics for model performance evaluation and then analyze the performance of segmentation models based on these metrics.
A commonly used accuracy metric is the mean intersection over union (MIoU):

MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}

where k + 1 is the number of classes, P_{ij} refers to the number of pixels of category i inferred as category j, P_{ii} refers to true positives, and P_{ij} and P_{ji} refer to false positives and false negatives, respectively.
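A short sketch computing MIoU from predicted and ground-truth label maps via the confusion matrix P (the class count is illustrative; classes absent from both maps contribute an IoU of 0 here):

    import numpy as np

    def mean_iou(pred, gt, num_classes):
        """MIoU from label maps: build the confusion matrix P, then
        IoU_i = P_ii / (sum_j P_ij + sum_j P_ji - P_ii), averaged over classes."""
        p = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                        minlength=num_classes ** 2).reshape(num_classes, num_classes)
        tp = np.diag(p)                             # P_ii: true positives per class
        union = p.sum(axis=1) + p.sum(axis=0) - tp  # TP + FP + FN per class
        iou = tp / np.maximum(union, 1)             # avoid division by zero
        return iou.mean()

    pred = np.random.randint(0, 19, (480, 640))
    gt = np.random.randint(0, 19, (480, 640))
    print(f"MIoU: {mean_iou(pred, gt, 19):.3f}")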
Memory footprint is another important factor for segmentation methods. Models with many parameters and complex computations may not be applicable to edge devices (such as unmanned aerial vehicles (UAVs), self-driving cars, and robots) that have less memory than high-performance servers. Therefore, memory constraints need to be considered in model design, and it can be very useful to report the peak and average memory occupation during model operation.
From the perspective of segmentation accuracy, transformer-based models clearly perform well on multiple benchmarks (e.g., Cityscapes test, ADE20K val) compared to CNN- and MLP-based models. If the goal is to improve accuracy without regard to model size and computational cost, using a transformer to design the segmentation network is a good choice. MLP-based networks are comparable in accuracy to large CNN-based segmentation networks on the ADE20K val set. From the perspective of network size and inference speed, CNN-based methods have an absolute advantage.
The performance achieved by these methods stems from the unique advantages of their respective architectural designs. The convolution operation extracts image features through a fixed-size kernel and only needs to learn the parameters of that fixed-size window, without encoding global information. Therefore, CNN-based networks are more lightweight and have faster inference. However, convolution cannot capture long-range dependencies, such as the relationship between arbitrary pixels in an image. Furthermore, convolution filter weights remain fixed after training and thus cannot dynamically adapt to variations in the input. Current methods for capturing global information mainly include expanding the receptive field and embedding non-local self-attention mechanisms in the network. The self-attention mechanism is an integral part of the transformer; unlike convolution, its weights are computed dynamically and can capture long-range dependencies. Consequently, transformer-based networks have achieved SOTA results on multiple datasets, owing to their unique advantages in encoding global information and their general modeling capability. However, transformer-based models have complex structures, modeling global information requires huge computational overhead, and training them is costly (sufficient computing resources, large-scale datasets). MLP-based segmentation methods are a new attempt at semantic segmentation architecture; their existence indicates that convolution and self-attention are not necessary conditions for good performance and provides further ideas for future development.
In summary, since the emergence of FCN, deep learning-based semantic segmentation methods have made significant progress in both accuracy and speed, with MIoU scores increasing by nearly 30% on multiple datasets. Among all segmentation models, CNN-based models still constitute the majority, while transformer-based methods achieve SOTA accuracy. At present, as new networks from different camps continue to improve accuracy on image segmentation benchmarks, no conclusion can be drawn as to which of CNN, transformer, and MLP performs best or is most suitable for semantic segmentation. In general, each architecture has its own advantages. With the development of deep learning technology, future semantic segmentation models will integrate the advantages of multiple architectures to achieve better performance.
5 Future Research Directions and Challenges

Real-time segmentation is essential for many edge applications, such as self-driving cars, robot dogs, and UAVs. Although large-scale transformer networks and MLP-based networks can achieve high accuracy, they have intensive power and computational requirements, and their high resource costs hinder deployment on such devices. Currently, most models either achieve high segmentation accuracy with long inference times or fast inference with low accuracy, failing to strike a good balance between speed and accuracy. Therefore, future work needs to pay more attention to real-time constraints and efficient hardware designs, continuously improving model performance, simplifying networks, and balancing accuracy against running time to promote the deployment and application of these technologies.
6 Conclusion
Research on deep learning-based semantic segmentation methods has made great progress in recent years. This survey first presents commonly used image segmentation datasets and then reviews pioneering methods in general image semantic segmentation. These methods are divided into four types according to their architectures: CNN-based, transformer-based, MLP-based, and others. To discover and utilize the strengths of the different types of architectures, existing methods are compared and analyzed on evaluation metrics (such as model size, inference speed, and segmentation accuracy), and the key strengths and limitations of each architecture type are reported. In general, CNN-based methods have lighter models and faster inference; transformer-based methods can encode global information; and MLP-based models are simple in design, requiring neither convolution operations nor self-attention mechanisms. At present, no conclusion can be drawn as to which of CNN, transformer, and MLP performs best or is most suitable for semantic segmentation. Finally, possible research directions and challenges are elaborated. We believe that combining the advantages of multiple deep learning architectures to design high-precision, high-efficiency networks is the key to overcoming the current bottleneck, and this is the future scope of our work.
Funding Statement: This work was supported by the Major Science and Technology Project of Hainan Province (Grant No. ZDKJ2020012), the National Natural Science Foundation of China (Grant Nos. 62162024 and 62162022), Key Projects in Hainan Province (Grants ZDYF2021GXJS003 and ZDYF2020040), and the Graduate Innovation Project (Grant No. Qhys2021-187).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the
present study.
References
[1] R. Naqvi, D. Hussain and W. Loh, “Artificial intelligence-based semantic segmentation of ocular regions
for biometrics and healthcare applications,” Computers, Materials & Continua, vol. 66, no. 1, pp. 715–732,
2021.
[2] X. Tang, W. Tu, K. Li and J. Cheng, “DFFNet: An IoT-perceptive dual feature fusion network for general
real-time semantic segmentation,” Information Sciences, vol. 565, pp. 326–343, 2021.
[3] T. Leonardo, P. Piazzolla, F. Porpiglia and E. Vezzetti, “Real-time deep learning semantic segmentation
during intra-operative surgery for 3D augmented reality assistance,” International Journal of Computer
Assisted Radiology and Surgery, vol. 16, no. 9, pp. 1435–1445, 2021.
[4] S. Nedevschi, “Weakly supervised semantic segmentation learning on UAV video sequences,” in Proc. of
29th European Signal Processing Conf. (EUSIPCO), Dublin, Ireland, pp. 731–735, 2021.
[5] T. B. Zhu, D. Wang, Y. H. Li and W. J. Dong, “Three-dimensional image reconstruction for virtual talent
training scene,” Traitement du Signal, vol. 38, no. 6, pp. 1719–1726, 2021.
[6] S. Mahajan and A. K. Pandit, “Image segmentation and optimization techniques: A short overview,”
Medicon Engineering Themes, vol. 2, no. 2, pp. 47–49, 2022.
[7] J. Cheng, Y. Yang, X. Tang, N. Xiong, Y. Zhang et al., “Generative adversarial networks: A literature
review,” KSII Transactions on Internet and Information Systems, vol. 14, no. 12, pp. 4625–4647, 2020.
[8] E. Shelhamer, J. Long and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[9] S. Minaee, Y. Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz et al., “Image segmentation using deep
learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, pp. 1–20,
2021.
[10] F. Cao and Q. Bao, “A survey on image semantic segmentation methods with convolutional neural
network,” in Proc. CISCE, Kuala Lumpur, Malaysia, pp. 458–462, 2020.
[11] Y. Zhao, G. Wang, C. Tang, C. Luo, W. Zeng et al., “A battle of network structures: An empirical study of
CNN, transformer, and MLP,” arXiv preprint, arXiv:2108.13002, 2021.
[12] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn and A. Zisserman, “The pascal visual object
classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler et al., “The cityscapes dataset for semantic
urban scene understanding,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),
Las Vegas, NV, USA, pp. 3213–3223, 2016.
[14] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona et al., “Microsoft COCO: Common objects in context,”
in Proc. of European Conf. Computer Vision (ECCV), Zurich, Switzerland, vol. 8693, pp. 740–755, 2014.
[15] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso et al., “Scene parsing through ADE20K dataset,” in Proc.
of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 5122–5130,
2017.
[16] G. J. Brostow, J. Fauqueur and R. Cipolla, “Semantic object classes in video: A high-definition ground
truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
[17] H. Caesar, J. R. R. Uijlings and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” in Proc. of IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 1209–1218, 2018.
[18] G. Varma, A. Subramanian, A. M. Namboodiri, M. Chandraker and C. V. Jawahar, “IDD: A dataset for
exploring problems of autonomous navigation in unconstrained environments,” in Proc. of IEEE Winter
Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, pp. 1743–1751, 2019.
[19] C. Sakaridis, D. X. Dai and L. V. Gool, “Guided curriculum model adaptation and uncertainty-aware
evaluation for semantic nighttime image segmentation,” in Proc. of IEEE/CVF Int. Conf. on Computer
Vision (ICCV), Seoul, Korea, pp. 7374–7383, 2019.
[20] C. Sakaridis, D. X. Dai and L. V. Gool, “ACDC: The adverse conditions dataset with correspondences
for semantic driving scene understanding,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV),
Montreal, QC, Canada, pp. 10745–10755, 2021.
[21] J. He, S. Yang, S. K. Yang, A. Kortylewski, X. D. Yuan et al., “PartImageNet: A large, high-quality dataset
of parts,” arXiv preprint, arXiv: 2112.00933, 2021.
[22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in
Proc. of Int. Conf. on Learning Representations (ICLR), San Diego, CA, USA, pp. 1–14, 2015.
[23] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778, 2016.
[24] X. Li, W. Wang, X. Hu and J. Yang, “Selective kernel networks,” in Proc. of IEEE Conf. on Computer Vision
and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 510–519, 2019.
[25] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang et al., “ResNeSt: Split-attention networks,” arXiv preprint,
arXiv:2004.08955, 2020.
[26] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov and L. Chen, “MobileNetV2: Inverted residuals and
linear bottlenecks,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake
City, UT, USA, pp. 4510–4520, 2018.
[27] N. Ma, X. Zhang, H. Zheng and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture
design,” in Proc. of European Conf. Computer Vision (ECCV), Munich, Germany, pp. 122–138, 2018.
[28] M. Liu and Y. Zhang, “A hierarchical feature extraction network for fast scene segmentation,” Sensors,
vol. 21, no. 22, pp. 7730–7747, 2021.
[29] X. Zhang, B. Du, Z. Wu and T. Wan, “LAANet: Lightweight attention-guided asymmetric network for
real-time semantic segmentation,” Neural Computing & Applications, vol. 34, no. 5, pp. 3573–3587, 2022.
[30] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen et al., “BiSeNet V2: Bilateral network with guided aggregation for
real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068,
2021.
[31] M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai et al., “Rethinking BiSeNet for real-time semantic segmentation,”
in Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 9716–9725,
2021.
[32] Y. Hong, H. Pan, W. Sun and Y. Jia, “Deep dual-resolution networks for real-time and accurate semantic
segmentation of road scenes,” arXiv preprint, arXiv: 2101.06085, 2021.
[33] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao et al., “High-resolution representations for labeling pixels and
regions,” arXiv preprint, arXiv:1904.04514, 2019.
[34] B. Jiang, W. Tu, C. Yang and J. Yuan, “Context-integrated and feature-refined network for lightweight
object parsing,” IEEE Transactions on Image Processing, vol. 29, pp. 5079–5093, 2020.
[35] J. Cheng, X. Peng, X. Tang, W. Tu and W. Xu, “MIFNet: A lightweight multiscale information fusion
network,” 2021. [Online]. Available: https://doi.org/10.1002/int.22804.
[36] S. Huang, Z. Lu, R. Cheng and C. He, “FAPN: Feature-aligned pyramid network for dense image
prediction,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Montreal, QC, Canada, pp.
844–853, 2021.
[37] H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia, “Pyramid scene parsing network,” in Proc. of IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 6230–6239, 2017.
[38] H. Zhang, K. J. Dana, J. Shi, Z. Zhang, X. Wang et al., “Context encoding for semantic segmentation,” in
Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 7151–
7160, 2018.
[39] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille, “DeepLab: Semantic image segmenta-
tion with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[40] L. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam, “Encoder-decoder with atrous separable
convolution for semantic image segmentation,” in Proc. of European Conf. Computer Vision (ECCV),
Munich, Germany, pp. 833–851, 2018.
[41] M. Yang, K. Yu, C. Zhang, Z. Li and K. Yang, “DenseASPP for semantic segmentation in street scenes,”
in Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp.
3684–3692, 2018.
[42] J. Devlin, M. W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers
for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, vol. 1, pp. 4171–4186, 2019.
[43] B. Y. Xue, J. W. Yu, J. H. Xu, S. S. Liu, S. K. Hu et al., “Bayesian transformer language models for speech
recognition,” in Proc. ICASSP, Toronto, ON, Canada, pp. 7378–7382, 2021.
[44] Z. W. Zhang, T. Li, X. Tang, X. Hu and Y. Peng, “CAEVT: Convolutional autoencoder meets lightweight
vision transformer for hyperspectral image classification,” Sensors, vol. 22, no. 10, pp. 3902–3923, 2022.
[45] Z. Deng, B. Zhou, P. He, J. Huang, O. Alfarraj et al., “A position-aware transformer for image captioning,”
Computers, Materials & Continua, vol. 70, no. 1, pp. 2065–2081, 2022.
[46] Y. N. Dai, J. Y. Yu, D. Zhang, T. H. Hu and X. T. Zheng, “RODFormer: High-precision design for rotating
object detection with transformers,” Sensors, vol. 22, no. 7, pp. 2633–2646, 2022.
[47] Z. Y. Xu, W. C. Zhang, T. X. Zhang, Z. Yang and J. Y. Li, “Efficient transformer for remote sensing image
segmentation,” Remote Sensing, vol. 13, no. 18, pp. 3585–3609, 2021.
[48] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit et al., “How to train your ViT? Data,
augmentation, and regularization in vision transformers,” arXiv preprint, arXiv:2106.10270, 2021.
[49] H. Ahmad, H. U. Khan, S. Ali, S. Ijaz, F. Wahid et al., “Effective video summarization approach based on
visual attention,” Computers, Materials & Continua, vol. 71, no. 1, pp. 1427–1442, 2022.
[50] J. Hu, L. Shen and G. Sun, “Squeeze-and-excitation networks,” in Proc. of IEEE Conf. on Computer Vision
and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 7132–7141, 2018.
[51] X. Wang, R. B. Girshick, A. Gupta and K. He, “Non-local neural networks,” in Proc. of IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 7794–7803, 2018.
[52] Z. L. Huang, X. G. Wang, L. C. Huang, C. Huang, Y. Wei et al., “CCNet: Criss-cross attention for semantic
segmentation,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Seoul, Korea, pp. 603–612,
2019.
[53] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao et al., “Dual attention network for scene segmentation,” in Proc. of
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 3146–3154,
2019.
[54] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao et al., “Scene segmentation with dual relation-aware attention network,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2547–2560, 2021.
[55] A. Sagar, “DMSANet: Dual multi scale attention network,” in Proc. of Int. Conf. on Image Analysis and
Processing (ICIAP), Lecce, Italy, pp. 633–645, 2022.
[56] Y. Huang, W. J. Jia, X. J. He, L. Liu, Y. X. Li et al., “CAA: Channelized axial attention for semantic
segmentation,” arXiv preprint, arXiv:2101.07434, 2021.
[57] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille et al., “Axial-DeepLab: Stand-alone axial-attention for
panoptic segmentation,” in Proc. of European Conf. Computer Vision (ECCV), Glasgow, UK, pp. 108–126,
2020.
[58] Q. Hou, D. Zhou and J. Feng, “Coordinate attention for efficient mobile network design,” in Proc. of IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 13713–13722, 2021.
[59] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo et al., “Rethinking semantic segmentation from a sequence-
to-sequence perspective with transformers,” in Proc. of IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR), Nashville, TN, USA, pp. 6881–6890, 2021.
[60] R. Strudel, R. G. Pinel, I. Laptev and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in
Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Montreal, QC, Canada, pp. 7242–7252, 2021.
[61] Y. Yuan, X. Chen and J. Wang, “Object-contextual representations for semantic segmentation,” in Proc. of
European Conf. Computer Vision (ECCV), Glasgow, UK, pp. 173–190, 2020.
[62] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. Alvarez et al., “Segformer: Simple and efficient design for
semantic segmentation with transformers,” in Proc. of NeurIPS, Online Conference, Virtual Event, pp.
12077–12090, 2021.
[63] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei et al., “Swin transformer: Hierarchical vision transformer using
shifted windows,” in Proc. of Int. Conf. on Computer Vision (ICCV), Montreal, QC, Canada, pp. 9992–
10002, 2021.
[64] Z. Chen, Y. C. Duan, W. H. Wang, J. J. He, T. Lu et al., “Vision transformer adapter for dense predictions,”
arXiv preprint, arXiv:2205.08534, 2022.
[65] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai et al., “An image is worth 16 × 16 words:
Transformers for image recognition at scale,” in Proc. of Int. Conf. on Learning Representations (ICLR),
Virtual Event, Austria, pp. 1–21, 2021.
[66] W. Wang, L. Yao, L. Chen, D. Cai, X. He et al., “CrossFormer: A versatile vision transformer hinging on
cross-scale attention,” arXiv preprint, arXiv:2108.00154, 2021.
[67] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao et al., “Focal attention for long-range interactions in vision
transformers,” in Proc. NeurIPS, Online Conference, Virtual Event, pp. 30008–30022, 2021.
[68] J. Tae, H. Kim and Y. Lee, “MLP Singer: Towards rapid parallel Korean singing voice synthesis,” in Proc.
of IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), Gold Coast, Australia, pp.
1–6, 2021.
[69] T. Yu, X. Li, Y. Cai, M. Sun and P. Li, “S2-MLP: Spatial-shift MLP architecture for vision,” in Proc.
WACV, Waikoloa, HI, USA, pp. 3615–3624, 2022.
[70] S. Chen, E. Xie, C. Ge, D. Liang and P. Luo, “CycleMLP: A MLP-like architecture for dense prediction,”
arXiv preprint, arXiv:2107.10224, 2021.
[71] D. Lian, Z. Yu, X. Sun and S. Gao, “AS-MLP: An axial shifted MLP architecture for vision,” arXiv
preprint, arXiv:2107.08391, 2021.
[72] H. Zhou, J. Guo, Y. Zhang, L. Yu, L. Wang et al., “NnFormer: Interleaved transformer for volumetric
segmentation,” arXiv preprint, arXiv:2109.03201, 2021.
[73] Y. Zhang, H. Liu and Q. Hu, “Transfuse: Fusing transformers and CNNs for medical image segmentation,”
in Proc. MICCAI, Strasbourg, France, pp. 14–24, 2021.
[74] R. Guo, D. Niu, L. Qu and Z. Li, “SOTR: Segmenting objects with transformers,” in Proc. of IEEE/CVF
Int. Conf. on Computer Vision (ICCV), Montreal, QC, Canada, pp. 7137–7146, 2021.
[75] J. Li, A. Hassani, S. Walton and H. Shi, “ConvMLP: Hierarchical convolutional MLPs for vision,” arXiv
preprint, arXiv:2109.04454, 2021.
[76] M. S. Amac, A. Sencan, O. B. Baran, N. Ikizler-Cinbis and R. G. Cinbis, “MaskSplit: Self-supervised meta-
learning for few-shot semantic segmentation,” in Proc. WACV, Waikoloa, HI, USA, pp. 428–438, 2022.
[77] S. Kang and J. Choi, “Unsupervised semantic segmentation method of user interface component of games,”
Intelligent Automation & Soft Computing, vol. 31, no. 2, pp. 1089–1105, 2022.