
Engineering Applications of Artificial Intelligence 127 (2024) 107260


Image semantic segmentation approach based on DeepLabV3 plus network with an attention mechanism

Yanyan Liu a, Xiaotian Bai b, Jiafei Wang a, Guoning Li b,**, Jin Li c,*, Zengming Lv b

a Department of Electronics and Information Engineering, Changchun University of Science and Technology, Changchun, 130022, China
b Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences (CIOMP), Changchun, 130033, China
c School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing, 100191, China

A B S T R A C T

Image semantic segmentation is a technique that distinguishes different kinds of objects in an image by assigning a label to each point in a target category according to its "semantics". The Deeplabv3+ image semantic segmentation method currently in use has high computational complexity and large memory consumption, making it difficult to deploy on embedded platforms with limited computational power. When extracting image feature information, Deeplabv3+ also struggles to fully utilize multiscale information, which can cause a loss of detail and damage segmentation accuracy. An improved image semantic segmentation method based on the DeepLabv3+ network is proposed, with the lightweight MobileNetV2 serving as the model's backbone. The ECA-Net channel attention mechanism is applied to low-level features, reducing computational complexity and improving target boundary clarity. The polarized self-attention mechanism is introduced after the ASPP module to improve the spatial feature representation of the feature map. Validated on the VOC2012 dataset, the experimental results indicate that the improved model achieves an MIoU of 69.29% and a mAP of 80.41%; it predicts finer semantic segmentation results and effectively balances model complexity and segmentation accuracy.

1. Introduction

The emergence of artificial intelligence (AI) has dramatically changed every aspect of our lives. The concept of semantic segmentation is easy to understand: when people see a picture, they easily understand its content, and semantic segmentation allows a machine to understand the content of a picture in the same way. Its real-world applications are increasingly extensive, for example scene recognition in autonomous driving, surgical navigation in medical image segmentation, and advertising recommendation, so the wide application of image semantic segmentation has high practical value (Iftikhar et al., 2022, 2023).

To date, many different semantic segmentation algorithms have been proposed, both traditional and deep learning based. Traditional methods range from thresholding (Otsu, 1979), histogram-based bundling, region growing (Nock and Nielsen, 2004), k-means clustering (Dhanachandra et al., 2015), and watersheds (Najman et al., 1994) to more advanced algorithms such as active contours (Kass et al., 2004), graph cuts (Boykov et al., 2001), conditional and Markov random fields (Plath et al., 2009), and sparsity-based methods (Starck et al., 2005). To compensate for the shortcomings of the traditional methods, deep learning semantic segmentation methods fall into two main classes by model structure: information-fusion based and encoder-decoder based (Minaee et al., 2021). In the information-fusion methods, model utilization is improved by increasing the number of layers of the network (Starck et al., 2005; Minaee et al., 2017); representative algorithms include the fully convolutional network (FCN) and a series of improved variants (Biao et al., 2018), such as FCN-32s, FCN-16s, and FCN-8s. In the encoder-decoder methods (Liu et al., 2018; Fu et al., 2022), the accuracy of the network is improved by adopting different backbone networks and pyramid pooling modules; representative algorithms include the pyramid scene parsing network (PSPNet) (Sun and Wang, 2018) and the DeepLab series. The current method based on Deeplabv3+ has high computational complexity and large memory consumption, and it is difficult to deploy on embedded platforms with limited computational power. Deeplabv3+ also cannot fully utilize multiscale information when extracting image features, which easily causes a loss of detail and damages segmentation accuracy. To further improve the ability

* Corresponding author.
** Corresponding author.
E-mail addresses: [email protected] (Y. Liu), [email protected] (X. Bai), [email protected] (J. Wang), [email protected]
(G. Li), [email protected] (J. Li), [email protected] (Z. Lv).

https://doi.org/10.1016/j.engappai.2023.107260
Received 3 May 2023; Received in revised form 15 September 2023; Accepted 3 October 2023
Available online 10 October 2023
0952-1976/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. Deeplabv3 plus model.

Fig. 2. Improved DeepLabv3 plus.

of the DeepLabv3 plus network to obtain key category information, improvements are mainly made based on DeepLabv3 plus. The main contributions of this paper are summarized as follows.

1. The DeepLabv3+ network is improved to fit the needs of realistic scenarios. Because the original feature extraction network has too many parameters, the model adopts the lightweight MobileNetV2 as the backbone network, which is further optimized to address the loss of spatial detail and insufficient feature extraction.
2. In DeepLabv3+, the polarized self-attention mechanism (PSA-P, PSA-S) is added after the ASPP module to increase the ability of the feature map to extract detailed information and so improve segmentation accuracy. A channel attention mechanism (ECA-Net) is added after the MobileNetV2 low-level features to recover clearer segmentation boundaries.
3. Strip pooling is used in the ASPP module instead of the original global average pooling to effectively capture long-range dependencies, and hybrid pooling is used instead of the original global average pooling to effectively capture short-range and long-range interdependencies between different locations, thus improving the efficiency and reliability of the system.

2. DeepLabv3 plus network

The DeepLabv3 plus network (Yang et al., 2020) is shown in Fig. 1. The role of the backbone network is to extract semantic feature information (Zhao et al., 2017), and the function of ASPP is to extract feature information from the backbone output again so that sufficient feature information is obtained; DCNN denotes a deep convolutional neural network. The ASPP module consists of five parallel branches: a 1 × 1 convolution; three 3 × 3 atrous convolutions with dilation rates of 6, 12, and 18; and global average pooling. These five parts together constitute the ASPP module. In the decoder, the low-level backbone features pass through a 1 × 1 convolution and are concatenated with the 4× upsampled ASPP output for feature fusion; a 3 × 3 convolution followed by another 4× upsampling then recovers the size of the image.

3. Improved DeepLabv3 plus network

The DeepLabv3 plus model is taken as the main body for improvement. For image semantic segmentation based on the DeepLabv3 plus network, this paper uses the lightweight MobileNetV2 as the backbone network.
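The five parallel ASPP branches described in Section 2 can be sketched at the single-channel level in NumPy. This is a minimal illustrative sketch, not the paper's trained model: the scalar stand-in for the 1 × 1 convolution and the random 3 × 3 kernels are assumptions made only to show the structure (parallel branches, dilation rates 6/12/18, image-level pooling).

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    # x: (H, W) single-channel map; w: (3, 3) kernel.
    # Zero padding equal to the dilation rate keeps the spatial size.
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += w[di, dj] * xp[di * rate : di * rate + H,
                                  dj * rate : dj * rate + W]
    return out

def aspp(x, rates=(6, 12, 18)):
    # Returns the five parallel branch outputs stacked into (5, H, W):
    # a 1x1 convolution (here a scalar scaling as a stand-in), three 3x3
    # atrous convolutions with the given dilation rates, and global
    # average pooling broadcast back to (H, W). In the real network the
    # five maps are then fused by another 1x1 convolution.
    rng = np.random.default_rng(0)
    branches = [0.5 * x]                        # 1x1 conv stand-in
    for r in rates:
        branches.append(dilated_conv3x3(x, rng.standard_normal((3, 3)), r))
    branches.append(np.full_like(x, x.mean()))  # image-level pooling branch
    return np.stack(branches)

feat = np.random.default_rng(1).standard_normal((64, 64))
out = aspp(feat)
print(out.shape)  # (5, 64, 64)
```

Note how every branch preserves the spatial size, so the outputs can be concatenated along the channel axis regardless of dilation rate.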


Fig. 3. Structure of strip pooling.

Fig. 4. PSA in parallel.

Fig. 5. PSA in series.


ASPP is then used to extract multiscale information from the feature maps obtained from the backbone network, while strip pooling is used instead of global pooling to retain more detailed information. An attention mechanism is introduced: the polarized self-attention mechanism weights the feature maps produced by the ASPP module, and ECA-Net is added to fuse the shallow features of MobileNetV2 and improve image segmentation performance. The improved model is shown in Fig. 2.

3.1. Strip pooling

The pooling window of global average pooling is square, which has certain limitations: it is difficult to capture the correlations of the map at different scales in different directions. Strip pooling is more advantageous because its pooling window is rectangular, so it can gather global information along the horizontal and vertical dimensions, expanding the scope over which feature information is obtained (Hou et al., 2020).

Unlike the global average pooling calculation, strip pooling is performed simultaneously along the horizontal and vertical spatial dimensions, and when the two spatial dimensions are pooled, the output for each row or column is the average of its values. The model structure is shown in Fig. 3.

For an input $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width, the row vector output is calculated as

$y_i^h = \frac{1}{W} \sum_{0 \le j < W} X_{i,j}$  (1)

and the column vector output as

$y_j^v = \frac{1}{H} \sum_{0 \le i < H} X_{i,j}$  (2)

$X$ enters the horizontal and vertical paths for pooling, and the outputs in the horizontal and vertical directions are $y^h \in \mathbb{R}^{C \times H}$ and $y^v \in \mathbb{R}^{C \times W}$, respectively. Combining the two, the output is

$y_{c,i,j} = y_{c,i}^h + y_{c,j}^v$  (3)

A convolution and a sigmoid function then produce the characteristic map, which is fused with the original input to obtain the output $z$:

$z = \mathrm{Scale}(X, \sigma(f(y)))$  (4)

where $\mathrm{Scale}(\cdot)$ represents elementwise multiplication, $\sigma$ represents the sigmoid function, and $f$ represents a $1 \times 1$ convolution.

Fig. 6. ECA-Net diagram.

Table 1
Comparison results of ASPP improvement experiments.

Algorithm           Backbone      MIoU     mAP
Deeplabv3 plus      MobileNetV2   66.16%   78.75%
Deeplabv3 plus-SP   MobileNetV2   67.6%    78.6%

Table 2
Comparison of different attention mechanisms.

Backbone      Attention   MIoU     mAP
MobileNetV2   ECA-Net     66.95%   79.64%
MobileNetV2   PSA_p       67.3%    80.34%
MobileNetV2   PSA_s       67.74%   81.3%

Table 3
Comparison of network segmentation accuracy by integrating different modules.

Group   SP   PSA_p   PSA_s   ECA-Net   MIoU     mAP
①       ×    ×       ×       ×         66.16%   78.75%
②       ✓    ✓       ×       ×         68.67%   80.34%
③       ✓    ×       ✓       ×         69.05%   79.65%
④       ✓    ✓       ×       ✓         68.74%   79.01%
⑤       ✓    ×       ✓       ✓         69.77%   79.29%

Fig. 7. Comparison chart of category segmentation accuracy.
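The strip pooling computation of Eqs. (1)-(4) can be sketched in NumPy. In this minimal sketch the 1 × 1 convolution $f$ is replaced by the identity, which is an illustrative simplification and not the paper's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def strip_pool(x):
    # x: (C, H, W) feature map.
    y_h = x.mean(axis=2)  # Eq. (1): average each row over the width  -> (C, H)
    y_v = x.mean(axis=1)  # Eq. (2): average each column over height  -> (C, W)
    # Eq. (3): broadcast-sum the two strips back to the full (C, H, W) grid.
    y = y_h[:, :, None] + y_v[:, None, :]
    # Eq. (4): z = Scale(X, sigma(f(y))); f is taken as the identity here.
    return x * sigmoid(y)

x = np.random.default_rng(0).standard_normal((4, 8, 6))
z = strip_pool(x)
print(z.shape)  # (4, 8, 6)
```

Because each output location mixes a whole-row average with a whole-column average, every pixel receives global context from the two rectangular strips that pass through it, which is the long-range dependency the module is designed to capture.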


Fig. 8. Comparison of PASCAL VOC 2012 dataset segmentation results.
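The evaluation indicators used in the experiments, MIoU and mean pixel accuracy, can be computed from a per-class confusion matrix. A minimal NumPy sketch follows; the 2 × 2 matrix is made-up illustrative data, and averaging over all rows of the matrix corresponds to the 1/(k+1) factor over the k+1 classes:

```python
import numpy as np

def miou_mpa(conf):
    # conf: (K, K) confusion matrix with conf[i, j] = number of pixels
    # whose true class is i and predicted class is j.
    # MIoU: mean over classes of P_ii / (sum_j P_ij + sum_j P_ji - P_ii).
    # mPA:  mean over classes of P_ii / sum_j P_ij.
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    miou = np.mean(tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp))
    mpa = np.mean(tp / conf.sum(axis=1))
    return miou, mpa

conf = np.array([[8, 2],
                 [1, 9]])  # hypothetical two-class pixel counts
miou, mpa = miou_mpa(conf)
```

For this toy matrix, class 0 has IoU 8/11 and class 1 has IoU 9/12, so the mean of the two gives the MIoU; the per-class pixel accuracies are 0.8 and 0.9.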

3.2. Polarized self-attention mechanism

We are all familiar with the concept of attention (Zeng et al., 2020). People cannot attend to a whole picture at once: the eyes are drawn to the parts of an image they find interesting, and the rest is ignored. The attention mechanism in neural networks exploits the same characteristic, that is, it screens effective information out of complex information (Chen et al., 2017a). For image processing, the target is locked onto one part of the image while other areas are ignored, which improves processing efficiency and avoids unnecessary computation. With the rapid development of attention mechanisms, an increasing number of neural network models have added them (Zhang et al., 2020; Honarbakhsh et al., 2023) to improve model efficiency, with good effect. This paper mainly adds polarized self-attention and channel attention mechanisms to the DeepLabv3 plus network. The two attention mechanisms are added at different locations in the network, and both show good performance.

The polarized self-attention mechanism (Hridoy et al., 2021; Liu et al., 2021) has two main forms, series and parallel. The series form connects the channel self-attention mechanism and the spatial self-attention mechanism sequentially, while the parallel form connects them in parallel; together, the two arrangements constitute the polarized self-attention mechanism. After the polarized self-attention mechanism is inserted after the ASPP module (Yang, 2020; Zhu et al., 2019), the model extracts more of the important information and improves model utilization. PSA_p and PSA_s can maintain high resolution in the channel and spatial dimensions, which is why they are


increasingly widely used in deep learning networks. The model diagrams are shown in Figs. 4 and 5.

The series and parallel forms of the polarized self-attention mechanism are each divided into two branches: a channel branch and a spatial branch.

The channel weight is calculated as follows:

$A^{ch}(X) = F_{SG}\left[W_{z|\theta_1}\left(\sigma_1(W_v(X)) \times F_{SM}\left(\sigma_2(W_q(X))\right)\right)\right]$  (5)

where $\sigma_1$ and $\sigma_2$ represent $1 \times 1$ convolutions, $F_{SM}$ represents the softmax function, $W_{z|\theta_1}$ represents a $1 \times 1$ convolution with layer normalization (LN) that raises the channel dimension from $C/2$ back to $C$, and $F_{SG}$ represents the sigmoid function.

The spatial weight is calculated as follows:

$A^{sp}(X) = F_{SG}\left[\sigma_3\left(F_{SM}\left(\sigma_1(F_{GP}(W_q(X)))\right) \times \sigma_2(W_v(X))\right)\right]$  (6)

where $\sigma_1$, $\sigma_2$, and $\sigma_3$ represent $1 \times 1$ convolutions, $F_{SM}$ represents the softmax function, $F_{GP}$ represents global pooling, and $F_{SG}$ represents the sigmoid function.

The formulas above give the weights of the two branches, and the polarized self-attention mechanism is obtained by fusing these branch weights; the parallel and series forms are simply two ways of combining the branch weights, analogous to addition and multiplication.

3.3. ECA attention mechanism

The advantage of ECA-Net (Liu, 2020) is that it uses global pooling to transform the spatial matrix into a one-dimensional vector (see Fig. 6). The size of the one-dimensional convolution kernel is then determined adaptively from the number of network channels, a convolution with this adaptive kernel is performed, and a weighting for the input feature map is obtained; finally, the input is multiplied by the weights obtained from the convolution to extract the information of interest. Because the backbone network uses pretrained weights, inserting ECA-Net inside MobileNetV2 would damage the backbone's network structure. Therefore, ECA-Net is inserted after the shallow features of MobileNetV2, which improves the segmentation effect without damaging the network.

4. Experiments

4.1. Datasets

The PASCAL VOC2012 dataset is widely used in the field of image processing and is well suited to image semantic segmentation. It contains four main types of objects, indoor furniture, people, vehicles, and common animals, divided into 21 categories. A total of 3200 images are randomly selected and split 9:1:1: 2616 images form the training set, 292 the validation set, and 292 the testing set.

4.2. Experimental equipment and evaluation indicators

The operating system is Ubuntu 20.04, with the open-source PyTorch 1.2.0 deep learning framework and CUDA version 10.0. The programming language is Python 3.6, and the hardware configuration is as follows: the CPU is an i7-9600, and the GPU is an NVIDIA 3060-Ti. The mean intersection over union (MIoU) and the mean pixel accuracy (mPA, reported as mAP in the tables) are used as performance evaluation coefficients for image semantic segmentation. Here $k$ denotes the number of categories, $P_{ij}$ the number of pixels whose true value is $i$ and predicted value is $j$, $P_{ji}$ the number whose true value is $j$ and predicted value is $i$, and $P_{ii}$ the number whose true and predicted values are both $i$. The calculation formulas for MIoU and mPA are:

$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}$  (7)

$\mathrm{mPA} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij}}$  (8)

4.3. Experimental comparison

The algorithm proposed in this paper is based on the original DeepLabv3 plus model (Sun et al., 2019; Badrinarayanan et al., 2017). The ASPP module is redesigned, and an attention mechanism is introduced so that the shallow and deep features of the model pay more attention to important semantic information (He et al., 2016; Chen et al., 2017b, 2018; Sehar and Naseem, 2022). A good fit is achieved by training the algorithm for 100 epochs with the Adam optimizer. Training is divided into two phases, a freezing phase and an unfreezing phase: a learning rate of 0.005 and a batch size of 8 are used in the freezing phase, and a learning rate of 0.0005 and a batch size of 4 in the unfreezing phase. To prevent overfitting, the weight decay rate is set to 0.005. An epoch is one complete pass of all the data through the network for forward computation and backpropagation; the number of epochs is set to 100, with 50 rounds in the freezing phase and 50 rounds in the unfreezing phase. This article adopts the MIoU and mAP evaluation index system and conducts ASPP module optimization, attention mechanism addition, and mutual fusion experiments on PASCAL VOC2012 to verify the performance of the model before and after improvement.

4.3.1. ASPP improvement experiment

The strip pooling module (SP) is introduced into the ASPP module, where Deeplabv3 plus-SP denotes using strip pooling instead of global pooling in the ASPP module. Demonstrating the applicability of strip pooling, the MIoU of the DeepLabv3 plus network improved by 1.09% from before to after the improvement, as shown in Table 1.

4.3.2. Introduction of different attention experiments

Based on the MobileNetV2 backbone network and the ASPP module, different attention mechanisms are introduced: the polarized self-attention mechanism, in series and parallel forms, after the ASPP module, and ECA-Net after the shallow layer of MobileNetV2. MIoU increased by 0.79% after adding ECA-Net, and PSA_s performs better than PSA_p; in particular, MIoU increased by 1.68% after adding PSA_s. The results are shown in Table 2.

4.3.3. Comparative experiments of different models

To demonstrate the effectiveness of the strip pooling module, the polarized self-attention module, and the ECA-Net module, and to verify the accuracy of the improved algorithm, five control experiments were established. ① refers to the original DeepLabv3 plus network. ② refers to changing global average pooling to strip pooling in the ASPP module of DeepLabv3 plus and adding the polarized self-attention mechanism in parallel form after the ASPP module. ③ refers to changing global pooling to strip pooling in the ASPP module of DeepLabv3 plus and adding the polarized self-attention mechanism in series form after the ASPP module. ④ refers to changing global pooling to strip pooling in the ASPP module of DeepLabv3 plus, adding the parallel form of the polarized self-attention mechanism after the ASPP module, and adding the ECA-Net module after the shallow features of MobileNetV2. ⑤ refers to changing global pooling to strip pooling in the ASPP module of DeepLabv3 plus, adding the series form of the polarized self-attention mechanism after the ASPP module, and


adding the ECA-Net module after the shallow features of MobileNetV2.

Table 3 compares ① with ② and ① with ③: by using strip pooling instead of global average pooling and introducing the polarized self-attention mechanism, MIoU improved by 2.51% and 2.89%, respectively. Comparing ① with ④ and ① with ⑤: with strip pooling replacing global average pooling in the ASPP module and both the polarized self-attention mechanism and ECA-Net introduced, MIoU increases by 2.58% and 3.61%, respectively. This analysis verifies that every module plays a role and that the improvements above can greatly increase the accuracy of the algorithm.

4.4. Comparison of segmentation results for different categories

The most important accuracy indicator in semantic segmentation is the mean intersection over union, which can be read from the chart for the 21 categories. The modified model is lower than the original algorithm in only 6 categories, and in those 6 the accuracy does not differ significantly from the original; in the remaining 15 categories it is higher, showing particular advantages for categories such as houses, dogs, cats, trains, and sheep. After the attention mechanisms are added, the accuracy of key categories improves, which raises the accuracy of the original algorithm to some extent. The category segmentation results are shown in Fig. 7.

To see the effects before and after the improvement more clearly, the segmentation prediction maps of the DeepLabv3 plus network and the improved DeepLabv3 plus network were compared, where (a) represents the original image, (b) the image label, (c) the DeepLabv3 plus segmentation, and (d) the improved DeepLabv3 plus segmentation. The results show that the model integrating strip pooling and the attention mechanisms produces relatively smoother and more complete segmentation, while the original DeepLabv3 plus network suffers from misclassification and discontinuous segmentation. The optimized network improves the semantic segmentation effect and resolution, refines the segmentation boundary of the target, and achieves better accuracy. The selected segmentation predictions are shown in Fig. 8.

5. Summary

This article proposes a DeepLabv3 plus network based on attention mechanisms. Changing global pooling to strip pooling in the ASPP module captures global contextual information, while the added polarized self-attention mechanism enhances the utilization of spatial image features. Finally, adding ECA-Net after the low-level features of MobileNetV2 improves the acquisition of shallow features. The experimental results show that embedding the attention modules into the DeepLabv3 plus network improves the accuracy of key categories and effectively improves the segmentation accuracy of objects in images; the objective indicator MIoU improved by approximately 2%. This work improves the performance of image semantic segmentation, providing new ideas for autonomous driving, medical imaging, and other fields, and a direction for the field of computer vision.

Although the improved algorithm achieves good results, shortcomings remain. Since the introduction of the attention mechanisms increases model complexity to some extent, further research is needed on model complexity and parameter quantity. In the future, we will consider model compression methods to optimize the network so that the model can balance high accuracy and light weight.

CRediT authorship contribution statement

Yanyan Liu: Conceptualization, Methodology, Experiments. Xiaotian Bai: Experimental results analysis, Writing – review & editing. Jiafei Wang: Conceptualization, Methodology, Experiments. Guoning Li: Supervision. Jin Li: Supervision, Writing – review & editing. Zengming Lv: Experimental results analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615.
Biao, W., Yali, G., Qingchuan, Z., 2018. Research on image semantic segmentation algorithm based on fully convolutional HED-CRF. In: 2018 Chinese Automation Congress (CAC). IEEE, pp. 3055–3058. https://doi.org/10.1109/CAC.2018.8623459.
Boykov, Y., Veksler, O., Zabih, R., 2001. Fast approximate energy minimization via graph cuts. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 377–384.
Chen, L.C., Papandreou, G., Kokkinos, I., et al., 2017a. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184.
Chen, L.C., Papandreou, G., Schroff, F., et al., 2017b. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. https://doi.org/10.48550/arXiv.1706.05587.
Chen, L.C., Zhu, Y., Papandreou, G., et al., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818.
Dhanachandra, N., Manglem, K., Chanu, Y.J., 2015. Image segmentation using k-means clustering algorithm and subtractive clustering algorithm. Procedia Comput. Sci. 54, 764–771.
Fu, J., Yi, X., Wang, G., et al., 2022. Research on ground object classification method of high resolution remote-sensing images based on improved DeepLabV3+. Sensors 22 (19), 7477. https://doi.org/10.3390/s22197477.
He, K., Zhang, X., Ren, S., et al., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Honarbakhsh, V., Siahkoohi, H.R., Rezghi, M., et al., 2023. SeisDeepNET: an extension of DeepLabv3+ for full waveform inversion problem. Expert Syst. Appl. 213, 118848. https://doi.org/10.1016/j.eswa.2022.118848.
Hou, Q., Zhang, L., Cheng, M.-M., Feng, J., 2020. Strip pooling: rethinking spatial pooling for scene parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4003–4012.
Hridoy, R.H., Habib, T., Jabiullah, I., et al., 2021. Early recognition of betel leaf disease using deep learning with depthwise separable convolutions. In: 2021 IEEE Region 10 Symposium (TENSYMP). IEEE, pp. 1–7. https://doi.org/10.1109/TENSYMP52854.2021.9551009.
Iftikhar, S., Asim, M., Zhang, Z., et al., 2022. Advance generalization technique through 3D CNN to overcome the false positives pedestrian in autonomous vehicles. Telecommun. Syst. 80, 545–557. https://doi.org/10.1007/s11235-022-00930-1.
Iftikhar, S., Asim, M., Zhang, Z., Muthanna, A., Chen, J., El-Affendi, M., Sedik, A., Abd El-Latif, A.A., 2023. Target detection and recognition for traffic congestion in smart cities using deep learning-enabled UAVs: a review and analysis. Appl. Sci. 13 (6), 3995. https://doi.org/10.3390/app13063995.
Kass, M., Witkin, A.P., Terzopoulos, D., 2004. Snakes: active contour models. Int. J. Comput. Vis. 1, 321–331.
Liu, M.z., 2020. Research on image semantic segmentation algorithm based on self-attention mechanism. Dalian Univ. Technol., Dalian, pp. 20–35. https://doi.org/10.26991/d.cnki.gdllu.2020.001777.
Liu, A., Yang, Y., Sun, Q., Xu, Q., 2018. A deep fully convolution neural network for semantic segmentation based on adaptive feature fusion. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), Zhengzhou, China, pp. 16–20. https://doi.org/10.1109/ICISCE.2018.00013.
Liu, H., Liu, F., Fan, X., et al., 2021. Polarized self-attention: toward high-quality pixelwise regression. arXiv preprint arXiv:2107.00782. https://doi.org/10.48550/arXiv.2107.00782.


Minaee, S., Wang, Y., 2017. An ADMM approach to masked signal decomposition using subspace representation. IEEE Trans. Image Process. 28, 3192–3204.
Minaee, S., Boykov, Y., Porikli, F., et al., 2021. Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44 (7), 3523–3542.
Najman, L., Schmitt, M., 1994. Watershed of a continuous function. Signal Process. 38, 99–112.
Nock, R., Nielsen, F., 2004. Statistical region merging. IEEE Trans. Pattern Anal. Mach. Intell. 26 (11), 1452–1458. https://doi.org/10.1109/TPAMI.2004.110.
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9 (1), 62–66. https://doi.org/10.1109/TSMC.1979.4310076.
Plath, N., Toussaint, M., Nakajima, S., 2009. Multiclass image segmentation using conditional random fields and global classification. In: International Conference on Machine Learning.
Sehar, U., Naseem, M.L., 2022. How deep learning is empowering semantic segmentation: traditional and deep learning techniques for semantic segmentation: a comparison. Multimed. Tool. Appl. 81 (21), 30519–30544. https://doi.org/10.1007/s11042-022-12821-3.
Starck, J.-L., Elad, M., Donoho, D.L., 2005. Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Process. 14 (10), 1570–1582. https://doi.org/10.1109/TIP.2005.852206.
Sun, W., Wang, R., 2018. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geosci. Remote Sens. Lett. 15 (3), 474–478. https://doi.org/10.1109/LGRS.2018.2795531.
Sun, Y., Jiang, Q., Hu, J., et al., 2019. Attention mechanism based pedestrian trajectory prediction generation model. J. Comput. Appl. 39 (3), 668. https://doi.org/10.13203/j.whugis20200159.
Yang, X., 2020. An overview of the attention mechanisms in computer vision. J. Phys. Conf. Ser. 1693 (1), 012173. https://doi.org/10.1088/1742-6596/1693/1/012173.
Yang, Z., Peng, X., Yin, Z., 2020. Deeplab_v3_plus-net for image semantic segmentation with channel compression. In: 2020 IEEE 20th International Conference on Communication Technology (ICCT). IEEE, pp. 1320–1324. https://doi.org/10.1109/ICCT50939.2020.9295748.
Zeng, H., Peng, S., Li, D., 2020. Deeplabv3+ semantic segmentation model based on feature cross attention mechanism. J. Phys. Conf. Ser. 1678 (1), 012106. https://doi.org/10.1088/1742-6596/1678/1/012106.
Zhang, Z., Huang, J., Jiang, T., et al., 2020. Semantic segmentation of very high-resolution remote sensing image based on multiple band combinations and patchwise scene analysis. J. Appl. Remote Sens. 14 (1), 016502. https://doi.org/10.1117/1.JRS.14.016502.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890.
Zhu, Z.L., Rao, Y., Wu, Y., et al., 2019. Research progress of attention mechanism in deep learning. J. Chin. Inf. Process. 33 (6), 1–11. https://doi.org/10.13374/j.issn2095-9389.2021.01.30.005.