Paper 1
A B S T R A C T
Image semantic segmentation distinguishes the different kinds of objects in an image by assigning each pixel a category label according to its "semantics". The widely used DeepLabv3+ semantic segmentation method has high computational complexity and large memory consumption, which makes it difficult to deploy on embedded platforms with limited computational power. In addition, DeepLabv3+ does not fully exploit multiscale information when extracting image features, which can lose detail and reduce segmentation accuracy. This paper proposes an improved image semantic segmentation method based on the DeepLabv3+ network, with the lightweight MobileNetV2 serving as the model's backbone. The ECA-Net channel attention mechanism is applied to the low-level features, reducing computational complexity and improving the clarity of target boundaries. The polarized self-attention mechanism is introduced after the ASPP module to improve the spatial feature representation of the feature map. Validated on the VOC2012 dataset, the improved model achieves an MIoU of 69.29% and a mAP of 80.41%; it predicts finer segmentation results and improves the balance between model complexity and segmentation accuracy.
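For readers unfamiliar with the metric, the MIoU quoted above is conventionally computed from a per-class confusion matrix. The following is a minimal sketch under that assumption; the function name `mean_iou` and the choice to skip classes that never occur are illustrative and are not taken from the paper.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_positive = confusion[c][c]
        false_negative = sum(confusion[c]) - true_positive
        false_positive = sum(confusion[r][c] for r in range(num_classes)) - true_positive
        union = true_positive + false_positive + false_negative
        if union > 0:  # assumption: classes absent from labels and predictions are skipped
            ious.append(true_positive / union)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


# Toy two-class example (hypothetical counts, not results from the paper).
toy_confusion = [[3, 1],
                 [1, 5]]
toy_miou = mean_iou(toy_confusion)
```

In the toy matrix, class 0 has IoU 3/5 and class 1 has IoU 5/7, so the mean is their average. Whether absent classes are skipped or counted as zero is a convention that differs between toolkits.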
1. Introduction

The emergence of artificial intelligence (AI) has dramatically changed every aspect of our lives. The idea behind semantic segmentation is simple: a person who looks at a picture readily understands its content, and semantic segmentation lets a machine do the same. Its applications are increasingly extensive, for example scene recognition in autonomous driving, surgical navigation in medical image segmentation, and advertising recommendation, which gives image semantic segmentation high practical value (Iftikhar et al., 2022, 2023).

To date, many semantic segmentation algorithms have been proposed, both traditional and deep learning based. Traditional methods range from thresholding (Otsu, 1979), histogram-based bundling, region growing (Nock and Nielsen, 2004), k-means clustering (Dhanachandra et al., 2015), and watersheds (Najman et al., 1994) to more advanced algorithms such as active contours (Dhanachandra et al., 2015), graph cuts (Najman et al., 1994), conditional and Markov random fields (Kass et al., 2004), and sparsity-based methods (Boykov et al., 2001; Plath et al., 2009). To compensate for the shortcomings of traditional methods, deep learning approaches have been developed; by model structure they fall into two main types, those based on information fusion and those based on an encoder-decoder (Minaee et al., 2021). Information fusion methods improve model utilization by increasing the number of network layers (Starck et al., 2005; Minaee et al., 2017). Representative algorithms include the fully convolutional network (FCN) and a series of improved variants (Biao et al., 2018), such as FCN-32s, FCN-16s, and FCN-8s. Encoder-decoder methods (Liu et al., 2018; Fu et al., 2022) improve network accuracy by adopting different backbone networks and pyramid pooling modules. Representative algorithms include the pyramid scene parsing network (PSPNet) (Sun and Wang, 2018) and the DeepLab series. The current DeepLabv3+ method has high computational complexity and large memory consumption, so it is difficult to deploy on embedded platforms with limited computational power. DeepLabv3+ also cannot fully utilize multiscale information when extracting image features, which easily loses detail and reduces segmentation accuracy. To further improve the ability
* Corresponding author.
** Corresponding author.
E-mail addresses: [email protected] (Y. Liu), [email protected] (X. Bai), [email protected] (J. Wang), [email protected]
(G. Li), [email protected] (J. Li), [email protected] (Z. Lv).
https://doi.org/10.1016/j.engappai.2023.107260
Received 3 May 2023; Received in revised form 15 September 2023; Accepted 3 October 2023
Available online 10 October 2023
0952-1976/© 2023 Elsevier Ltd. All rights reserved.
Y. Liu et al. Engineering Applications of Artificial Intelligence 127 (2024) 107260
of the DeepLabv3 plus network to obtain key category information, this paper makes several improvements to DeepLabv3 plus. The main contributions are summarized as follows.

1. The DeepLabv3+ network is improved to suit the needs of realistic scenarios. Because the original feature extraction network has too many parameters, the model adopts the lightweight MobileNetV2 as the backbone network and is further optimized on that basis to address spatial detail loss and insufficient feature extraction.
2. In DeepLabv3+, the polarized self-attention mechanism (PSA-P, PSA-S) is added after the ASPP module to strengthen the feature map's ability to capture detailed information and thereby improve segmentation accuracy. A channel attention mechanism (ECA-Net) is added after the MobileNetV2 low-level features to recover clearer segmentation boundaries.
3. Strip pooling is used in the ASPP module instead of the original global average pooling to effectively capture long-range dependencies, and hybrid pooling is used instead of the original global average pooling to capture both short-range and long-range interdependencies between different locations, thus improving the efficiency and reliability of the system.

2. DeepLabv3 plus network

The DeepLabv3 plus network (Yang et al., 2020) is shown in Fig. 1. The backbone network, a deep convolutional neural network (DCNN), extracts semantic feature information (Zhao et al., 2017). The ASPP module then processes the backbone output again to obtain sufficient multiscale feature information. ASPP consists of five parallel branches: a 1 × 1 convolution, three 3 × 3 atrous convolutions with dilation rates of 6, 12, and 18, and global average pooling. In the decoder, the low-level backbone features pass through a 1 × 1 convolution, and the ASPP output is upsampled by a factor of 4 and fused with them. The fused features then pass through a 3 × 3 convolution and a second 4× upsampling to recover the size of the input image.

3. Improved DeepLabv3 plus network

The DeepLabv3 plus model is taken as the basis for improvement. For image semantic segmentation based on the DeepLabv3 plus network, this paper uses the lightweight MobileNetV2 as the backbone network. Then, ASPP is used to extract multiscale information from the
3.2. Polarized self-attention mechanism

The concept of attention is familiar (Zeng et al., 2020). When people look at a picture, they cannot attend to all of it at once: the eyes are drawn to the parts of interest and the rest is ignored. The attention mechanism in neural networks exploits the same idea, screening out the useful information from complex input (Chen et al., 2017a). In image processing, the model locks onto the target region of the image and ignores other areas, which improves processing efficiency and avoids unnecessary computation. With the rapid development of attention mechanisms, an increasing number of neural network models have incorporated them (Zhang et al., 2020; Honarbakhsh et al., 2023) to improve model efficiency, with good effect. This paper adds polarized self-attention and channel attention mechanisms to the DeepLabv3 plus network. The two attention mechanisms are added at different locations in the network, and both perform well.

The polarized self-attention mechanism (Hridoy et al., 2021; Liu et al., 2021) has two main forms, sequential and parallel. In the sequential form, the channel self-attention branch and the spatial self-attention branch are connected in series; in the parallel form, the two branches are applied side by side. Together, the two layouts constitute the polarized self-attention mechanism. After the polarized self-attention mechanism is inserted after the ASPP module (Yang, 2020; Zhu et al., 2019), the model extracts more of the important information and its utilization improves. PSA_p and PSA_s can maintain high resolution in the channel and spatial dimensions, which is why they are
adding the ECA-Net module after the shallow features of MobileNetV2. Table 3 compares rows ① and ②, and rows ① and ③. Replacing global average pooling with strip pooling and introducing the polarized self-attention mechanism improved MIoU by 2.51% and 2.89%, respectively. Comparing rows ① and ④, and rows ① and ⑤: in the ASPP module, strip pooling replaces global average pooling, and the polarized self-attention mechanism and ECA-Net are introduced, which increases MIoU by 2.58% and 3.61%, respectively. The table confirms that every module contributes and that all of the improvements described above raise the accuracy of the algorithm.

4.4. Comparison of segmentation results for different categories

The most important accuracy indicator in semantic segmentation is the mean intersection over union (MIoU). The per-category results for the 21 categories are shown in Fig. 7. The improved model is lower than the original algorithm in only 6 categories, and in those 6 the difference is not significant. The remaining 15 categories are all higher than with the original algorithm, with particularly clear gains for categories such as houses, dogs, cats, trains, and sheep. Adding the attention mechanism improves the accuracy of key categories and thereby, to some extent, the accuracy of the original algorithm.

To show the effect of the improvement more clearly, the segmentation prediction maps of the DeepLabv3 plus network and the improved DeepLabv3 plus network are compared in Fig. 8, where (a) is the original image, (b) is the image label, (c) is the DeepLabv3 plus segmentation, and (d) is the improved DeepLabv3 plus segmentation. The results show that the model that integrates strip pooling and the attention mechanisms produces smoother and more complete segmentations. The original DeepLabv3 plus network suffers from misclassification and discontinuous segmentation. The optimized network improves the segmentation quality, refines target boundaries, and achieves better accuracy.

5. Summary

This article proposes a DeepLabv3 plus network based on attention mechanisms. Changing global average pooling to strip pooling in the ASPP module captures global contextual information, while the polarized self-attention mechanism enhances the use of spatial image features. Finally, adding ECA-Net after the low-level features of MobileNetV2 improves the acquisition of shallow features. The experimental results show that embedding the attention modules into DeepLabv3 plus improves the accuracy of key categories and the segmentation accuracy of objects in images. The objective indicator MIoU improved by approximately 2%. This work improves the performance of image semantic segmentation, offering new ideas for autonomous driving, medical imaging, and other fields of computer vision.

Although the improved algorithm performs better, shortcomings remain. Because the attention mechanisms increase model complexity to some extent, further research is needed on model complexity and parameter count. In the future, we will consider model compression methods to optimize the network so that the model balances high accuracy with light weight.

CRediT authorship contribution statement

Yanyan Liu: Conceptualization, Methodology, Experiments. Xiaotian Bai: Experimental results analysis, Writing – review & editing. Jiafei Wang: Conceptualization, Methodology, Experiments. Guoning Li: Supervision. Jin Li: Supervision, Writing – review & editing. Zengming Lv: Experimental results analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615.
Biao, W., Yali, G., Qingchuan, Z., 2018. Research on image semantic segmentation algorithm based on fully convolutional HED-CRF. In: 2018 Chinese Automation Congress (CAC). IEEE, pp. 3055–3058. https://doi.org/10.1109/CAC.2018.8623459.
Boykov, Yuri, Veksler, Olga, Zabih, Ramin, 2001. Fast approximate energy minimization via graph cuts. In: Proceedings of the Seventh IEEE International Conference on Computer Vision 1, pp. 377–384.
Chen, L.C., Papandreou, G., Kokkinos, I., et al., 2017a. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184.
Chen, L.C., Papandreou, G., Schroff, F., et al., 2017b. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. https://doi.org/10.48550/arXiv.1706.05587.
Chen, L.C., Zhu, Y., Papandreou, G., et al., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818.
Dhanachandra, Nameirakpam, Manglem, Khumanthem, Yambem Jina Chanu, 2015. Image segmentation using K-means clustering algorithm and subtractive clustering algorithm. Procedia Comput. Sci. 54, 764–771.
Fu, J., Yi, X., Wang, G., et al., 2022. Research on ground object classification method of high resolution remote-sensing images based on improved DeeplabV3+. Sensors 22 (19), 7477. https://doi.org/10.3390/S22197477.
He, K., Zhang, X., Ren, S., et al., 2016. Deep residual learning for image recognition. Proc. IEEE Conf. on Comput. Vision and Pattern Recogn. 770–778. https://doi.org/10.3390/APP12188972.
Honarbakhsh, V., Siahkoohi, H.R., Rezghi, M., et al., 2023. SeisDeepNET: an extension of Deeplabv3+ for full waveform inversion problem. Expert Syst. Appl. 213, 118848. https://doi.org/10.1016/J.ESWA.2022.118848.
Hou, Qibin, Zhang, Li, Cheng, Ming-Ming, Feng, Jiashi, 2020. Strip pooling: rethinking spatial pooling for scene parsing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4003–4012.
Hridoy, R.H., Habib, T., Jabiullah, I., et al., 2021. Early recognition of betel leaf disease using deep learning with depthwise separable convolutions. In: 2021 IEEE Region 10 Symposium (TENSYMP). IEEE, pp. 1–7. https://doi.org/10.1109/TENSYMP52854.2021.9551009.
Iftikhar, S., Asim, M., Zhang, Z., et al., 2022. Advance generalization technique through 3D CNN to overcome the false positives pedestrian in autonomous vehicles. Telecommun. Syst. 80, 545–557. https://doi.org/10.1007/s11235-022-00930-1.
Iftikhar, Sundas, Asim, Muhammad, Zhang, Zuping, Muthanna, Ammar, Chen, Junhong, El-Affendi, Mohammed, Ahmed, Sedik, Ahmed, A., Abd El-Latif, 2023. Target detection and recognition for traffic congestion in smart cities using deep learning-enabled UAVs: a review and analysis. Appl. Sci. 13 (6), 3995. https://doi.org/10.3390/app13063995.
Kass, Michael, Witkin, Andrew P., Terzopoulos, Demetri, 2004. Snakes: active contour models. Int. J. Comput. Vis. 1, 321–331.
Liu, M z, 2020. Research on Image Semantic Segmentation Algorithm Based on Self-Attention Mechanism. Dalian. Dalian Univ. Technol. 20–35. https://doi.org/10.26991/d.cnki.gdllu.2020.001777.
Liu, A., Yang, Y., Sun, Q., Xu, Q., 2018. A deep fully convolution neural network for semantic segmentation based on adaptive feature fusion. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), Zhengzhou, China, pp. 16–20. https://doi.org/10.1109/ICISCE.2018.00013.
Liu, H., Liu, F., Fan, X., et al., 2021. Polarized self-attention: toward high-quality pixelwise regression. arXiv preprint arXiv:2107.00782. https://doi.org/10.48550/arXiv.2107.00782.
Minaee, Shervin, Wang, Yao, 2017. An ADMM approach to masked signal decomposition using subspace representation. IEEE Trans. Image Process. 28, 3192–3204.
Minaee, S., Boykov, Y., Porikli, F., et al., 2021. Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44 (7), 3523–3542.
Najman, Laurent, Schmitt, Michel, 1994. Watershed of a continuous function. Signal Process. 38, 99–112.
Nock, R., Nielsen, F., 2004. Statistical region merging. IEEE Trans. Pattern Anal. Mach. Intell. 26 (11), 1452–1458. https://doi.org/10.1109/TPAMI.2004.110.
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man, and Cybern. 9 (1), 62–66. https://doi.org/10.1109/TSMC.1979.4310076.
Plath, Nils, Toussaint, Marc, Nakajima, Shinichi, 2009. Multiclass image segmentation using conditional random fields and global classification. International Conference on Machine Learning.
Sehar, U., Naseem, M.L., 2022. How deep learning is empowering semantic segmentation: traditional and deep learning techniques for semantic segmentation: a comparison. Multimed. Tool. Appl. 81 (21), 30519–30544. https://doi.org/10.1007/S11042-022-12821-3.
Starck, J.-L., Elad, M., Donoho, D.L., 2005. Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Process. 14 (10), 1570–1582. https://doi.org/10.1109/TIP.2005.852206.
Sun, W., Wang, R., 2018. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. Geosci. Rem. Sens. Lett. IEEE 15 (3), 474–478. https://doi.org/10.1109/LGRS.2018.2795531.
Sun, Y., Jiang, Q., Hu, J., et al., 2019. Attention mechanism based pedestrian trajectory prediction generation model. J. Comput. Appl. 39 (3), 668. https://doi.org/10.13203/j.whugis20200159.
Yang, X., 2020. An overview of the attention mechanisms in computer vision. In: Journal of Physics: Conference Series. IOP Publishing, 1693 (1), 012173. https://doi.org/10.1088/1742-6596/1693/1/012173.
Yang, Z., Peng, X., Yin, Z., 2020. Deeplab_v3_plus-net for image semantic segmentation with channel compression. In: 2020 IEEE 20th International Conference on Communication Technology (ICCT). IEEE, pp. 1320–1324. https://doi.org/10.1109/ICCT50939.2020.9295748.
Zeng, H., Peng, S., Li, D., 2020. Deeplabv3+ semantic segmentation model based on feature cross attention mechanism. In: Journal of Physics: Conference Series. IOP Publishing, 1678 (1), 012106. https://doi.org/10.1088/1742-6596/1678/1/012106.
Zhang, Z., Huang, J., Jiang, T., et al., 2020. Semantic segmentation of very high-resolution remote sensing image based on multiple band combinations and patchwise scene analysis. J. Appl. Remote Sens. 14 (1), 016502. https://doi.org/10.1117/1.JRS.14.016502.
Zhao, Hengshuang, Shi, Jianping, Qi, Xiaojuan, Wang, Xiaogang, Jia, Jiaya, 2017. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890.
Zhu, Z.L., Rao, Y., Wu, Y., et al., 2019. Research progress of attention mechanism in deep learning. J. Chin. Inf. Process. 33 (6), 1–11. https://doi.org/10.13374/j.issn2095-9389.2021.01.30.005.