
Received 9 July 2024, accepted 29 July 2024, date of publication 5 August 2024, date of current version 20 August 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3438112

MixSegNet: A Novel Crack Segmentation Network Combining CNN and Transformer

YANG ZHOU¹, RAZA ALI² (Senior Member, IEEE), NORRIMA MOKHTAR¹, SULAIMAN WADI HARUN¹, AND MASAHIRO IWAHASHI³ (Senior Member, IEEE)
¹Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur 50603, Malaysia
²Department of Electrical Engineering, Faculty of Information and Communication Technology (FICT), Balochistan University of Information Technology, Engineering and Management Sciences (BUITEMS), Quetta 87300, Pakistan
³Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology, Nagaoka 940-2188, Japan

Corresponding author: Norrima Mokhtar ([email protected])


This work was supported in part by JSPS KAKENHI under Grant 24K02975.

ABSTRACT In the domain of road inspection and structural health monitoring, precise crack identification
and segmentation are essential for structural safety and disaster prediction. Traditional image processing
technologies encounter difficulties in detecting cracks due to their morphological diversity and complex
background noise. This results in low detection accuracy and poor generalization. To overcome these
challenges, this paper introduces MixSegNet, a novel deep learning model that enhances crack recognition
and segmentation by integrating multi-scale features and deep feature learning. MixSegNet integrates
convolutional neural networks (CNNs) and transformer architectures to enhance the detection of small cracks
through the extraction and fusion of fine-grained features. Comparative evaluations against mainstream
models, including LRASPP, U-Net, Deeplabv3, Swin-UNet, AttuNet, and FCN, demonstrate that MixSegNet
achieves superior performance on open-source datasets. Specifically, the model achieved a precision of
95.2%, a recall of 88.2%, an F1 score of 91.5%, and a mean intersection over union (mIoU) of 84.8%,
thereby demonstrating its effectiveness and reliability for crack segmentation tasks.

INDEX TERMS Crack segmentation network, crack images, convolutional neural network, transformer
model, image processing, deep learning, self-attention mechanism.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.

I. INTRODUCTION
Crack identification occupies a vital position in the field of structural health monitoring because it is directly related to the safety and reliability of building structures. With the development of technology, crack detection methods have gradually transformed from traditional manual inspection to automatic identification using modern technologies such as advanced image processing, artificial intelligence, and machine learning [1]. These methods not only improve identification accuracy and efficiency, but also enable potential structural problems to be discovered at an early stage, enabling preventive maintenance and extending the life of the building. Cracks may be caused by a variety of factors, including structural aging, environmental erosion, excessive loads, and natural disasters. If these cracks are not discovered and treated in time, they may lead to a decrease in structural performance and even threaten personnel safety. Therefore, developing effective crack detection and identification systems is crucial to ensure structural safety.

Traditional crack detection methods, such as threshold techniques [2], demonstrate limited adaptability. To address this, Yang et al. [3] introduced a novel approach utilizing a fully convolutional network (FCN), enhancing the detection process. This technique employs single-pixel-width skeletons for crack segmentation, allowing for the detailed analysis of crack features, such as topology, length, and width, thus offering critical indicators for practical assessments. However, the scarcity of training data for crack segmentation presents a challenge. In response, König et al. [4] developed a method to streamline the annotation process for semantic segmentation of surface cracks. They utilized a U-Net architecture based on a fully convolutional
network, optimized for small datasets through patch-based training, leading to unprecedented results on various crack datasets. Ren et al. [5] explored the application of deep fully convolutional networks for concrete crack detection in tunnel images, proposing CrackSegNet, an advanced network for comprehensive crack segmentation. This innovation improves feature extraction, aggregation, and resolution reconstruction, significantly boosting segmentation performance. Kang et al. [6] introduced an automated method combining Faster R-CNN and a modified TuFF algorithm for precise crack detection, localization, and quantification, overcoming the limitations posed by varying environmental conditions. Similarly, Lau et al. [7] applied convolutional neural networks to segmenting pavement crack images, marking a significant advancement in the field. Liu et al. [8] proposed a two-step convolutional neural network method for enhanced crack detection and segmentation. Following this, Guan et al. [9] aimed to refine the accuracy and speed of 3D crack segmentation models, pushing the boundaries of current methodologies. Ali et al. [10] proposed an additive attention gate-based network architecture called Crack Segmentation Network-II (CSN-II).

A. RESEARCH GAP
Recent research has led to further improvements in various aspects of crack segmentation. Wang et al. [11] introduced a lightweight crack segmentation network based on knowledge distillation. Liu et al. [12] presented an upgraded CrackFormer network for pavement crack segmentation; this network achieved higher accuracy with fewer floating-point operations (FLOPs) and parameters than previous methods. Wu et al. [13] developed a lightweight MobileNetV2-DeepLabV3 network for enhanced precision in dam crack width measurement. Yao et al. [14] developed a CrackResU-Net model with a pyramid region attention module for pixel-level pavement crack recognition. Lin et al. [15] proposed DeepCrackAT, a framework for crack segmentation based on learning multi-scale crack features. Tang et al. [16] introduced a novel lightweight concrete crack segmentation method based on DeeplabV3+ that reduces the number of model parameters and enhances segmentation accuracy. Chen et al. [17] introduced a dynamic semantic segmentation algorithm with an encoder-crossor-decoder structure for pixel-level building crack segmentation. Li et al. [18] concentrated on crack segmentation in asphalt pavement using an enhanced YOLOv5s model. Moreover, Sohaib et al. [19] proposed an ensemble approach for robust automated crack detection and segmentation in concrete structures, achieving high precision and a high intersection over union score. Collectively, these studies contribute to the advancement of crack segmentation algorithms, addressing various challenges and improving the accuracy and efficiency of crack detection and segmentation. However, the aforementioned models fail to fully leverage the respective strengths of CNN and Transformer. Therefore, we propose the MixSegNet model as a means of enhancing the accuracy of crack segmentation.

II. RELATED WORK
A. SEMANTIC SEGMENTATION
Semantic segmentation is a further refinement of the classification problem: it requires pixel-level classification and therefore places higher demands on architectures and algorithms. Semantic segmentation is now widely used across computer vision, with applications in satellite imagery [20], medicine [21], material science [22], and meteorology [23], which underscores how crucial the technology is. FCN (Fully Convolutional Networks) [24], first proposed by Jonathan Long, Evan Shelhamer, and Trevor Darrell in 2015, aims to classify each pixel in an image into the corresponding category. The core idea of FCN is to replace the fully connected layers of a traditional convolutional neural network with convolutional layers, so that the network can accept input images of any size and output a spatial map of corresponding size. This spatial map can be applied directly to pixel-level prediction tasks.
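To make this idea concrete, the following minimal PyTorch sketch (our illustration, not code from [24]; the layer sizes and the name TinyFCN are arbitrary) shows how a 1×1 convolutional classifier head yields a per-pixel class map whose spatial size follows the input:

```python
import torch
import torch.nn as nn

# Minimal sketch of the core FCN idea: an all-convolutional network has no
# fully connected layer, so any input size yields a class map of matching
# spatial layout. Layer sizes here are illustrative, not from the paper.
class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # The 1x1 convolution plays the role of the per-pixel classifier that
        # replaces the fully connected layer of a classification CNN.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.backbone(x))

# Works for any spatial size: a 3x256x256 image gives a num_classes x 256 x 256 map.
logits = TinyFCN()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 256, 256])
```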
U-Net [25], a deep learning model specifically designed for medical image segmentation, was initially introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015. The architectural design of U-Net is particularly well-suited to tasks that require high-precision localization, such as the segmentation of organs and tissues within medical imagery. The model is named for its unique "U"-shaped structure, which effectively combines shallow (high-resolution) features and deep (high-level semantic) features to improve segmentation accuracy. LRASPP (Lite Reduced Atrous Spatial Pyramid Pooling) [26] is a deep learning architecture optimized for mobile devices and edge computing, especially for semantic segmentation tasks. It is an improved and simplified design based on the original ASPP (Atrous Spatial Pyramid Pooling) and the DeepLab series of models. ASPP captures multi-scale information by using different dilation rates in parallel convolutional layers, thereby improving the model's ability to understand different areas of the image; LRASPP aims to reduce the computational complexity and number of parameters to suit environments with limited computing resources. DeepLabv3 [27] is an advanced deep learning architecture designed specifically for image semantic segmentation tasks. It is the third version of the DeepLab series, developed by Liang-Chieh Chen and others, and aims to further improve segmentation accuracy in complex image scenes. The core contributions of DeepLabv3 include an improved atrous spatial pyramid pooling (ASPP) module and the systematic application of atrous convolution. These features enable the model to effectively capture multi-scale information and handle objects of different sizes in images. AttuNet [28] is a recently proposed semantic segmentation architecture. It is an improved version of U-Net. It better

integrates shallow and deep semantic information through a special attention module.

B. ATTENTION MECHANISM
As Transformers [29], based on self-attention mechanisms, have gained ground in NLP in recent years, more and more researchers have tried to use Transformers in vision models. The ViT (Vision Transformer) [30] model is a deep learning model based on the Transformer architecture, specially designed for image recognition tasks. It was originally proposed in 2020 by Alexey Dosovitskiy and others at Google Research. By partitioning images directly into serially arranged patches and subsequently processing these serialized image patches with Transformers, ViT has demonstrated performance that matches or even surpasses the state-of-the-art on multiple image recognition tasks, providing results comparable to those of convolutional neural network (CNN) models. However, the original ViT cannot be used directly for semantic segmentation; it can only be used for classification tasks. Subsequently, many variants based on the ViT model have been proposed for semantic segmentation.
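The patch-partitioning step ViT performs can be illustrated with a short PyTorch sketch (our illustration under common ViT defaults of 16×16 patches and a 768-dimensional embedding; not code from the original paper):

```python
import torch
import torch.nn as nn

# Illustrative sketch: ViT partitions an image into fixed-size patches and
# linearly embeds each one, turning the image into a token sequence that a
# standard Transformer encoder can process.
class PatchEmbedding(nn.Module):
    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A strided convolution is the usual trick: one kernel application
        # per patch is exactly a linear projection of that patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]) -> 14x14 patches of 16x16 pixels
```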
C. SEGMENTATION BASED ON ATTENTION MECHANISM
The SETR (Semantic Segmentation Transformer) [31] model is a deep learning model specially designed for semantic segmentation tasks. It combines the powerful capabilities of Transformer with the advantages of traditional semantic segmentation methods. SETR was originally proposed by Zheng et al. in 2020. Its core idea is to apply Transformer to the global feature extraction of images and classify directly at the pixel level, an idea borrowed from the NLP field; it marked the first time Transformers were widely applied to semantic segmentation tasks in computer vision. Swin Transformer [32] is a deep learning model designed on the Transformer architecture and optimized for computer vision tasks. It was proposed by Ze Liu et al. of Microsoft Research in 2021. The core innovation of Swin Transformer is the introduction of a hierarchical Transformer structure, which effectively manages global dependencies and computational complexity in images by using variable window sizes, allowing the model to process large-scale image data more efficiently. SegFormer [33] is an advanced deep learning model designed for semantic segmentation tasks, which combines the power of Transformer with the efficiency of convolutional neural networks (CNNs). Introduced by Xie et al. in 2021, SegFormer achieves precise segmentation of objects of varying sizes within images by incorporating a lightweight Transformer encoder and an efficient multi-scale feature fusion strategy, all while preserving the model's efficiency and adaptability. The Swin-UNet [34] model is a deep learning model that combines the characteristics of the Swin Transformer and U-Net architectures. It is specially designed for fine segmentation tasks such as medical image segmentation. It was proposed by Hao Chen et al. and aims to utilize the hierarchical self-attention mechanism of Swin Transformer to capture the details and contextual information of the image, while achieving high-precision pixel-level segmentation through the encoder-decoder structure of U-Net. Through this combination, Swin-UNet aims to improve the model's ability to understand details and structures in complex images such as medical images, thereby improving the accuracy and efficiency of segmentation.

The models described above are either based on convolutional neural networks (CNNs) or transformers, or combine the advantages of both. However, in the context of crack segmentation, where fine segmentation of the scene and the ability to deal with a variety of background noise, environmental impacts, and other factors are paramount, the aforementioned models are not optimal. Consequently, this research paper proposes the MixSegNet model as a solution to address the shortcomings of existing models on more complex segmented images. In summary, the main contributions of this article include: (1) We adopt a structure similar to U-Net, using the innovative UC Block module to obtain more detail while increasing the receptive field, and enhancing it through the proposed multi-scale fusion module for crack segmentation (Fuse Block). Ablation experiments show that all the proposed modules, including the parallel CNN and Transformer architecture, help the model combine multi-scale features more effectively and generate more accurate crack segmentation masks. (2) The developed model demonstrates satisfactory segmentation accuracy on a benchmark dataset (cracks-APCGAN [28]).

III. METHODOLOGY
In our research, the MixSegNet crack segmentation model uses two major deep learning technologies, the convolutional neural network (CNN) and the Transformer, to take full advantage of their respective strengths and achieve highly accurate crack image segmentation. CNN is a powerful deep learning tool specifically designed to process data with a grid structure (such as images). In MixSegNet, we use a CNN to extract local and low-level features from images, taking advantage of its excellent spatial feature extraction capabilities. The advantage of CNN is that it can automatically learn basic features such as edges and textures through convolutional layers, and capture more complex image features through deep network structures, providing a solid foundation for accurate crack segmentation. Transformer technology is based on the self-attention mechanism, can process sequence data, and is particularly good at capturing long-range dependencies. In the MixSegNet model, we introduce the Transformer to complement the limitations of CNN, especially in understanding the global context of images and capturing long-range dependencies. The advantage of the Transformer is that it can dynamically weigh the importance of each part of the image through the self-attention mechanism, thereby better understanding the global structure of the image. This is particularly important for the crack segmentation task, where crack identification requires not only the accurate extraction of local features, but also a comprehensive understanding of their location and morphology in the entire image. By combining the benefits of CNN and Transformer, MixSegNet is able to simultaneously leverage the advantages of CNN in extracting powerful local features and of Transformer in understanding the global context. This combination not only improves the accuracy of crack segmentation, but also enhances the adaptability and generalisation of the model to different crack types and complex backgrounds. Our method enables an efficient exchange of information between CNN and Transformer through a carefully designed network structure, ensuring the model's excellent performance in the crack segmentation task.

FIGURE 1. The proposed MixSegNet framework.

A. MIXSEGNET
As seen in Figure 1, the overall architecture is divided into two parts, an Encoder and a Decoder. The Encoder is itself divided into three parts: the top part is the tandem CNN architecture, the middle part is the Fuse Block, and the bottom part is an architecture that uses part of the Swin Transformer [32] model (for more details, refer to the original paper). The Decoder is the tandem CNN module that accepts the output of the Fuse Block module, recovers the feature map step by step, and extracts the segmentation we need from it. Given a 3 × 256 × 256 image, we first pass it through the upper part of the Encoder (the consecutive blue blocks in the figure, which we call UC Blocks and explain in detail later) and the lower part at the same time, in parallel, so that we obtain hierarchical feature maps. The layered features from the CNN and the Transformer are then fused by the proposed Fuse Block module. The most important use of the Fuse Block is in fusing the features extracted by the Swin Transformer with the features extracted by the CNN branch. With this feature fusion technique, global and local features can be captured better, and segmentation accuracy can be improved. Finally, these layered features are passed into the successive Decoder stages (the consecutive purple blocks in the figure, which we call Cat Blocks and explain in detail later), which in turn recover the mask information of the image step by step.
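The dataflow just described can be summarised in the following schematic PyTorch sketch. Every module here is a stand-in (the names MixSegNetSketch, cnn_branch, swin_branch, fuse, and decoder are ours): the real UC Block, Swin Transformer stages, Fuse Block, and Cat Block decoder are detailed in the paper and the subsections below.

```python
import torch
import torch.nn as nn

# Schematic of the described dataflow with stand-in modules; this is not
# the actual MixSegNet implementation, only the parallel-branch skeleton.
class MixSegNetSketch(nn.Module):
    def __init__(self, ch: int = 64, num_classes: int = 1):
        super().__init__()
        self.cnn_branch = nn.Conv2d(3, ch, 3, padding=1)             # stands in for stacked UC Blocks
        self.swin_branch = nn.Conv2d(3, ch, 7, stride=1, padding=3)  # stands in for Swin Transformer stages
        self.fuse = nn.Conv2d(2 * ch, ch, 1)                         # stands in for the Fuse Block
        self.decoder = nn.Conv2d(ch, num_classes, 1)                 # stands in for the Cat Block decoder

    def forward(self, x):
        local_feats = self.cnn_branch(x)     # local, fine-grained features
        global_feats = self.swin_branch(x)   # global context features
        fused = self.fuse(torch.cat([local_feats, global_feats], dim=1))
        return self.decoder(fused)           # per-pixel crack mask logits

mask = MixSegNetSketch()(torch.randn(1, 3, 256, 256))
print(mask.shape)  # torch.Size([1, 1, 256, 256])
```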


FIGURE 2. UC block.

1) UC BLOCK
As shown in Figure 2, this is our proposed UC Block module. It differs from the previous approach of changing the image width and height through max pooling to obtain features at different levels: we first enlarge the input image to more than twice its size through an upsampling operation, and then obtain a larger receptive field through oversized convolutions (in this research, the size of the convolution is 1.5 times the size of the image input to the UC Block, and the dilation rate is 6). Although this incurs a computational burden, we reduce it by using a dilated depthwise separable convolution. At the same time, we retain the conventional max-pooling path to acquire detailed features. With the UC Block, we therefore acquire feature maps with larger receptive fields and detailed information.

The formula for upsampling using nearest-neighbor interpolation is given in Equation 1:

U(x, y) = F(⌊x/s⌋, ⌊y/s⌋)    (1)

U(x, y) represents the value at coordinates (x, y) in the upsampled feature map, F is the original feature map, s is the scaling factor for upsampling, and ⌊·⌋ denotes the floor function. This formula indicates that for each point (x, y) in the output feature map, we find the value in the original feature map F at the point closest to (x/s, y/s) and use it as the new pixel value.

Dilated depthwise separable convolution consists of two main steps. The dilated convolution is calculated as in Equation 2:

G[i, j] = Σ_{k,l} F[i + r·k, j + r·l] · K[k, l]    (2)

G[i, j] is the output feature map, F is the input feature map, K is the convolutional kernel, r is the dilation rate (in this paper, we take the value 6), and i, j denote positions in the output feature map, while k, l denote positions in the convolutional kernel.

The depthwise separable convolution is calculated as in Equations 3 and 4:

H[i, j, m] = Σ_{k,l} G[i + k, j + l] · D_m[k, l]    (3)

O[i, j, n] = Σ_m H[i, j, m] · P_{mn}    (4)

H is the feature map after the depthwise convolution, D_m is the m-th depthwise convolutional kernel, O is the final output feature map, and P is the pointwise convolutional kernel.
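A hedged PyTorch sketch of these ingredients follows. The kernel size, channel counts, and the name UCBlockSketch are our own illustrative assumptions; the paper specifies only the dilation rate of 6 and a convolution roughly 1.5 times the input size. The depthwise convolution (groups equal to channels) realises Equations 2 and 3, and the 1×1 pointwise convolution realises Equation 4.

```python
import torch
import torch.nn as nn

# Sketch of the UC Block ingredients named above, under stated assumptions:
# nearest-neighbour upsampling (Eq. 1), a dilated depthwise convolution
# (Eqs. 2-3, dilation 6 as in the paper) followed by a 1x1 pointwise
# convolution (Eq. 4), plus the conventional max-pool detail branch.
class UCBlockSketch(nn.Module):
    def __init__(self, in_ch: int = 32, out_ch: int = 64, dilation: int = 6):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=7,
                                   padding=3 * dilation, dilation=dilation,
                                   groups=in_ch)   # one kernel per channel (Eq. 3)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # mixes channels (Eq. 4)
        self.pool = nn.MaxPool2d(kernel_size=2)    # detail branch from the original design
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        wide = self.pointwise(self.depthwise(self.up(x)))  # enlarged receptive field
        fine = self.proj(self.pool(x))                     # pooled detail features
        # Bring both branches to a common resolution before combining.
        wide = nn.functional.adaptive_avg_pool2d(wide, fine.shape[-2:])
        return wide + fine

out = UCBlockSketch()(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```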
FIGURE 3. Cat Block. Fuse Block. Start Block. End Block.

2) CAT BLOCK
The Cat Block is shown in Figure 3. This module accepts inputs from two sources: on the one hand, the features fused by the Fuse Block, and on the other, the features processed by the UC Block. In this fusion module we first concatenate the two and then further fuse the feature maps with convolutional layers. In this way, we obtain feature maps that carry both the strengths of the CNN and the features from the Transformer, which can directly compute the dependency between any two positions in a sequence through the self-attention mechanism and is therefore efficient at capturing long-distance dependency information. The integration of this mechanism enhances the model's effectiveness in handling long-range dependencies, enabling it to capture more complex data patterns. Since crack segmentation is a dense prediction task, the Fuse Block module allows us to obtain feature maps with more detail while maintaining accuracy at large scales.
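As a rough sketch of this role (our illustration; the name CatBlockSketch and the channel sizes are assumptions), one Cat Block stage can be written as:

```python
import torch
import torch.nn as nn

# Hedged sketch of the Cat Block role described above: concatenate the
# Fuse Block output with the UC Block skip features, then let plain
# convolutions merge them while restoring resolution step by step.
class CatBlockSketch(nn.Module):
    def __init__(self, fuse_ch: int = 128, skip_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.merge = nn.Sequential(
            nn.Conv2d(fuse_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, fused, skip):
        fused = self.up(fused)               # step-by-step mask recovery
        x = torch.cat([fused, skip], dim=1)  # "cat" = channel concatenation
        return self.merge(x)

out = CatBlockSketch()(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```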


B. MODEL TRAINING DETAILS
1) DATASETS
In this study, we have opted to utilize the secondary open-source dataset cracks-APCGAN [28], which has recently been supplemented with additional data from the DeepCrack [35] dataset. This choice was made in light of the open-source nature of the DeepCrack dataset, which we have found to be a valuable resource in our research. cracks-APCGAN was built with APCGAN-based data enhancement: the principle is to generate more similar images with a GAN trained on the training set and then further enhance the training dataset by manually annotating the GAN-generated images. The enhanced training dataset greatly benefited the training process in the original paper, so we chose cracks-APCGAN as our benchmark dataset.

2) LOSS FUNCTION
The loss function plays a crucial role in deep learning as a measure of the difference between the predicted and actual values of the model. During training, the main purpose of the loss function is to guide model learning and adjust the model parameters by minimising the loss value, making the model predictions more accurate. The loss function not only affects the efficiency and effectiveness of model training, but also determines whether the model can effectively learn the complex patterns and structures in the data. Therefore, choosing an appropriate loss function is crucial for the performance optimisation of deep learning models. Three common loss functions are listed below and analysed one by one.

BCE_loss = −(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]    (5)

The binary cross-entropy (BCE) loss function is defined in Equation 5. N is the total number of samples, representing all the samples considered when computing the loss. y_i is the actual label of the i-th sample, which can be 0 or 1, representing the two categories in binary classification. p_i is the model's predicted probability that the i-th sample belongs to class 1, with a value between 0 and 1. log denotes the natural logarithm, used to measure the discrepancy between the predicted probabilities and the actual labels. This formula averages the discrepancies between the predicted probabilities and the actual labels across all samples to obtain the overall loss value.

WBCE_loss = −(1/N) Σ_{i=1}^{N} [ w_pos · y_i log(p_i) + w_neg · (1 − y_i) log(1 − p_i) ]    (6)

The weighted binary cross-entropy (WBCE) loss function [25] is defined in Equation 6. N, y_i, p_i, and log are as in Equation 5; w_pos and w_neg are the weights for the positive and negative classes, respectively, and are used to handle class imbalance by adjusting the loss contribution of each class. This formula computes the weighted average of the discrepancies between the predicted probabilities and the actual labels across all samples, thus obtaining the overall loss value.

Focal_loss = −(1/N) Σ_{i=1}^{N} [ α y_i (1 − p_i)^γ log(p_i) + (1 − α)(1 − y_i) p_i^γ log(1 − p_i) ]    (7)

The focal loss function [36] is defined in Equation 7. N, y_i, p_i, and log are as above. α is a weighting factor for the positive class, used to address class imbalance by weighting the importance of positive and negative examples differently; its value can be obtained by counting the proportion of positive and negative classes in the dataset. γ is the focusing parameter, a hyperparameter that adjusts the rate at which easy examples are down-weighted, allowing the model to focus more on hard, misclassified examples. (1 − p_i)^γ and p_i^γ are factors that adjust the contribution of each sample to the loss based on the prediction confidence; they reduce the loss for well-classified examples, thereby focusing the model's learning on hard examples. This formula reduces the loss contribution from easy examples and increases the influence of hard examples, improving model performance on difficult classification tasks by modulating the effect of each sample based on its prediction confidence and actual class.

In the case of crack segmentation, the default and most commonly used choice is the cross-entropy loss, applied pixel by pixel. This loss evaluates the class prediction for each pixel independently and averages over all pixels. However, it can be biased by an unbalanced dataset, causing the majority class to dominate. The weighted cross-entropy loss was introduced to overcome this problem when the dataset is unbalanced. As a further improvement, the focal loss technique was introduced by changing the structure of the cross-entropy loss: when the focal loss is applied to accurately classified samples, the scaling factor weights them down. This ensures that more difficult samples are emphasised and that pronounced imbalances do not bias the overall computation. Therefore, in this paper we have chosen the focal loss as the loss function.
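A direct transcription of Equation 7 into PyTorch looks as follows (a sketch; the α and γ values shown are common defaults, not hyperparameters reported in this paper):

```python
import torch

# Transcription of Equation 7; alpha and gamma here are the usual defaults.
def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """p: predicted probabilities in (0, 1); y: binary labels {0, 1}."""
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)  # numerical safety for the logarithms
    pos = alpha * y * (1.0 - p) ** gamma * torch.log(p)
    neg = (1.0 - alpha) * (1.0 - y) * p ** gamma * torch.log(1.0 - p)
    return -(pos + neg).mean()

probs = torch.sigmoid(torch.randn(4, 1, 256, 256))   # e.g. model outputs
labels = torch.randint(0, 2, (4, 1, 256, 256)).float()
print(focal_loss(probs, labels))
```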

3) OPTIMIZER
In our crack segmentation task, we selected AdamW [37] as the optimisation algorithm to better adjust model weights and mitigate overfitting. The AdamW optimiser is a variant of the Adam optimiser that primarily improves model generalisation by modifying the weight decay strategy. Traditional L2 regularisation methods may not be effective in adaptive learning rate optimisation algorithms, as such algorithms automatically adjust the update step for each parameter, potentially conflicting with the goals of L2 regularisation. In contrast, AdamW decouples weight decay from the optimiser's adaptive learning rate adjustments, allowing weight decay to operate independently of the adaptive learning rate mechanism and thereby implementing regularisation more effectively.

In this study, we used an initial learning rate of 6 × 10⁻⁵, chosen on the basis of experience and the results of several experiments. This learning rate is intended to strike a balance between convergence speed and stability during training, avoiding excessively large update steps early in training that could prevent the model from settling on an optimal solution. The AdamW optimiser allows us to finely control the learning rate for each parameter in an adaptive manner while using weight decay to suppress overfitting, providing strong support for deep learning models in crack segmentation tasks.

Furthermore, AdamW's weight decay mechanism lets us manage model complexity more effectively and prevent overfitting, which is particularly important for tasks such as crack segmentation that require a high level of detail and precision. Considering both training efficiency and model generalisation capability, we are confident that the choice of the AdamW optimiser and its configuration provides optimal training results for our crack segmentation model.
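A minimal sketch of this setup follows; the weight-decay value is our assumption, as the paper reports only the initial learning rate:

```python
import torch

# Optimiser setup described above: AdamW with the stated initial learning
# rate of 6e-5. The weight_decay value is an illustrative assumption.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder for MixSegNet
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)

# AdamW applies decay directly to the weights ("decoupled"), instead of
# folding an L2 penalty into the adaptive gradient as classic Adam + L2 would.
loss = model(torch.randn(1, 3, 64, 64)).mean()
loss.backward()
optimizer.step()
```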

IV. RESULTS AND DISCUSSION
A. EVALUATION METRICS
In Table 1, we first define the variables commonly used in the evaluation metrics. In crack segmentation, the commonly used evaluation indicators are as follows:

• Precision in crack segmentation assesses the ratio of correctly predicted positive areas to all areas predicted as positive, highlighting the accuracy of positive-class predictions. Precision is calculated as in Equation 8.

Precision = Σ T_p / (Σ T_p + Σ F_p)    (8)

• Recall in crack segmentation quantifies the fraction of true positive areas correctly identified, reflecting the model's sensitivity to actual positives. Recall is calculated as in Equation 9.

Recall = Σ T_p / (Σ T_p + Σ F_n)    (9)

• The F1 score in crack segmentation is the harmonic mean of precision and recall, balancing both metrics. It evaluates the model's accuracy and sensitivity, with higher values indicating better overall performance. The F1 score is calculated as in Equation 10.

F1 = (2 · Precision · Recall) / (Precision + Recall)    (10)

• mIoU (mean Intersection over Union) in semantic segmentation calculates the average IoU, i.e., the overlap versus total area of predicted and true regions, across all classes. It assesses the model's segmentation accuracy, with higher mIoU indicating better performance. The mIoU is calculated as in Equation 11.

mIoU = (1/C) Σ_{c=1}^{C} [ Σ T_p / (Σ T_p + Σ F_p + Σ F_n) ]    (11)

TABLE 1. Definitions in crack segmentation.
TABLE 1. Definitions in crack segmentation.
function and an adaptive learning strategy to address the
problem of category imbalance in the dataset. The focal
loss function can reduce the weight of easy to classify
samples, allowing the model to focus more on crack regions
that are difficult to segment. The adaptive learning strategy
B. RESULTS further optimises the training process and ensures the model’s
In this paper, we describe a novel deep learning model for performance in a variety of complex scenarios and conditions.
crack segmentation, MixSegNet, and provide an in-depth As illustrated in Figure 4, this study randomly selected
analysis of its performance. As shown in Table 2, MixSeg- seven images from diverse scenes and employed distinct
Net outperforms current mainstream segmentation models, segmentation models to illustrate the outcomes. The segmen-
including LRASPP, FCN, DeepLabV3, U-Net, AttuNet, tation outcomes depicted in the figure demonstrate that the
and Swin-UNet, in a number of key metrics. Specifically, results produced by the MixSegNet model align with those
MixSegNet attains a precision of 0.952, a recall of 0.882, presented in Table 2. Additionally, the MixSegNet model
an F1 score of 0.915, and a mean intersection over union exhibits consistent and continuous segmentation outcomes
(mIoU) scores of 0.848. These results show that MixSeg- when compared to other models. This consistency can be
Net has a significant leading performance on the crack attributed to the fact that MixSegNet integrates the strengths
segmentation task. When analysing these results in more of CNN and Transformer. A comparison of the details of
detail, we can see that MixSegNet is only slightly higher in the various models reveals that the MixSegNet model also
precision by 0.001 compared to U-Net, but the improvement maintains a leading level of detail processing, which is crucial
in recall is even more significant, being 0.040 higher than for the refined crack segmentation scene.
that of U-Net. This suggests that MixSegNet is able to In summary, MixSegNet’s innovative design and strategy
maintain a high level of detection accuracy while avoiding set a new performance benchmark for the crack segmentation
missing actual cracks. In addition, the F1 score, which is task. Its outstanding performance bodes well for the model’s
a reconciled average of precision and recall, also shows wide application and far-reaching impact on future crack
that MixSegNet outperforms all compared models on this segmentation. We are excited about MixSegNet’s ability to
metric, further demonstrating its superiority in correctly handle complex problems and look forward to seeing its
identifying and segmenting cracks. MixSegNet also performs performance in real-world applications.
well on the mIoU metric, outperforming the second highest
model, AttuNet, by 0.012, demonstrating better consistency C. DISCUSSION
and overall performance in the crack segmentation task. Table 3 shows the results of the ablation experiments
mIoU is an important metric for assessing the quality of performed on the MixSegNet model, where the contribution
a model’s segmentation, and its high value underscores of each part to the overall model performance is verified by
MixSegNet’s reliability and robustness in the crack detection incrementally adding UC Block and Transformer modules.

111542 VOLUME 12, 2024


C. DISCUSSION
TABLE 3. Ablation experiment.

Table 3 shows the results of the ablation experiments performed on the MixSegNet model, where the contribution of each part to the overall model performance is verified by incrementally adding the UC Block and Transformer modules. Ablation experiments are a method of assessing the importance of model components by removing or adding specific sections and observing changes in model performance. The results of these experiments are analysed in detail below. The base model serves as a frame of reference, with precision, recall, F1 score, and mean intersection over union (mIoU) of 0.931, 0.836, 0.881, and 0.813, respectively. This model already performs well on its own, providing a solid foundation for adding new modules. Adding the UC Block module to the base model improves all performance metrics: precision to 0.935, recall to 0.858, F1 score to 0.895, and mIoU to 0.827. This shows that the UC Block module plays a key role in improving crack segmentation performance; the significant increase in recall in particular suggests that the addition of the UC Block helps the model reduce missed crack detections. When the Transformer module is added to the base model, the recall rate improves from 0.836 to 0.863, showing the effectiveness of the Transformer module in capturing the global information of the crack image and understanding the relationship between the crack and the background. However, the precision decreased slightly to 0.928, which may be because the Transformer module's emphasis on global features leads to some local noise being misidentified. Nevertheless, the slight improvement in F1 and mIoU confirms the positive contribution of the Transformer module to the model. By combining the UC Block and Transformer modules into MixSegNet, the model improved significantly on all metrics: precision reached a maximum of 0.952, recall reached a maximum of 0.882, and the F1 score and mIoU reached 0.915 and 0.848, respectively. This all-round performance improvement fully demonstrates the positive impact of combining the UC Block and Transformer modules, especially on recall and mIoU, which reflects the model's efficiency in crack detection and accuracy in segmenting the crack region. The contribution of each component to the MixSegNet model was thus experimentally demonstrated: the UC Block module significantly improved the recall rate, showing that it is effective in avoiding missed crack detection, while the Transformer module enhances the global understanding of the model and also improves recall. When these two modules are combined, they work in synergy to significantly improve the overall performance of the model, especially in terms of precision and recall, enabling MixSegNet to excel in the field of crack segmentation. These results validate the effectiveness of our proposed model design and provide a powerful new tool for crack segmentation tasks.

Although the results indicate that MixSegNet has achieved a leading level of performance, it is important to note that the model combines the CNN and Transformer architectures, which inevitably increases the computational complexity. However, the focus of this study is on high-precision segmentation, and computational complexity is not the primary consideration. In future research, we will optimize the computational efficiency of the model and improve its real-time performance. The current work focuses on improving the accuracy of crack segmentation; subsequent work will apply it in real drone scenarios to realize real-time crack segmentation warnings using a drone and an onboard computer.

V. CONCLUSION
This research proposes an innovative crack segmentation model, MixSegNet, which represents a major breakthrough in the field of crack segmentation. MixSegNet not only improves the perceptual capability of the model, but also strengthens the capture of details and the maintenance of long-range dependencies by combining an innovative UC Block with a parallel CNN and Transformer design. This unique two-pronged approach effectively overcomes the limitations of previous single-architecture designs and achieves significant improvements in key performance metrics: 95.2% precision, 88.2% recall, 91.5% F1 score, and 84.8% mIoU. The performance advantages of MixSegNet are fully demonstrated by comparing it to the existing state-of-the-art models LRASPP, FCN, DeepLabV3, U-Net, AttuNet, and Swin-UNet. The model not only improves on all indices, but also shows better generalisation ability in the experiments, predicting its wide applicability and potential value in practical applications. In future research, we plan to extend the scope of application of MixSegNet, improve its generalisation ability, and verify its robustness by testing it on more diverse and complex datasets. At the same time, we will work on optimising the computational efficiency of the model to meet real-time processing requirements and applications in real industrial scenarios. We will also explore the potential of MixSegNet in cross-domain image segmentation tasks such as medical image analysis and remote sensing image processing. Improving the interpretability of the model and adapting it to small-sample learning environments, to maintain excellent performance in data-constrained situations, will also be a focus of our future work. With these efforts, we expect to open new avenues for research and practical applications of crack segmentation.

REFERENCES
[1] S. Zhou, C. Canchila, and W. Song, ''Deep learning-based crack segmentation for civil infrastructure: Data types, architectures, and benchmarked performance,'' Autom. Construct., vol. 146, Feb. 2023, Art. no. 104678.
[2] A. Akagic, E. Buza, S. Omanovic, and A. Karabegovic, ''Pavement crack detection using OTSU thresholding for image segmentation,'' in Proc. 41st Int. Conv. Inf. Commun. Technol., Electron. Microelectron. (MIPRO), Aug. 2018, pp. 1092–1097.
[3] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, and X. Yang, ''Automatic pixel-level crack detection and measurement using fully convolutional network,'' Comput.-Aided Civil Infrastruct. Eng., vol. 33, no. 12, pp. 1090–1109, Dec. 2018.
[4] J. König, M. David Jenkins, P. Barrie, M. Mannion, and G. Morison, ''A convolutional neural network for pavement surface crack segmentation using residual connections and attention gating,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1460–1464.
[5] Y. Ren, J. Huang, Z. Hong, W. Lu, J. Yin, L. Zou, and X. Shen, ''Image-based concrete crack detection in tunnels using deep fully convolutional networks,'' Construct. Building Mater., vol. 234, Feb. 2020, Art. no. 117367.
[6] D. Kang, S. S. Benipal, D. L. Gopal, and Y.-J. Cha, ''Hybrid pixel-level concrete crack segmentation and quantification across complex backgrounds using deep learning,'' Autom. Construct., vol. 118, Oct. 2020, Art. no. 103291.
[7] S. L. H. Lau, E. K. P. Chong, X. Yang, and X. Wang, ''Automated pavement crack segmentation using U-Net-based convolutional neural network,'' IEEE Access, vol. 8, pp. 114892–114899, 2020.
[8] J. Liu, X. Yang, S. Lau, X. Wang, S. Luo, V. C. Lee, and L. Ding, ''Automated pavement crack detection and segmentation based on two-step convolutional neural network,'' Comput.-Aided Civil Infrastruct. Eng., vol. 35, no. 11, pp. 1291–1305, Nov. 2020.
[9] J. Guan, X. Yang, L. Ding, X. Cheng, V. C. S. Lee, and C. Jin, ''Automated pixel-level pavement distress detection based on stereo vision and deep learning,'' Autom. Construct., vol. 129, Sep. 2021, Art. no. 103788.
[10] R. Ali, J. H. Chuah, M. S. A. Talip, N. Mokhtar, and M. A. Shoaib, ''Crack segmentation network using additive attention gate—CSN-II,'' Eng. Appl. Artif. Intell., vol. 114, Sep. 2022, Art. no. 105130.
[11] W. Wang, C. Su, G. Han, and H. Zhang, ''A lightweight crack segmentation network based on knowledge distillation,'' J. Building Eng., vol. 76, Oct. 2023, Art. no. 107200.
[12] H. Liu, J. Yang, X. Miao, C. Mertz, and H. Kong, ''CrackFormer network for pavement crack segmentation,'' IEEE Trans. Intell. Transp. Syst., vol. 1, no. 1, pp. 1–13, Aug. 2023.
[13] Z. Wu, Y. Tang, B. Hong, B. Liang, and Y. Liu, ''Enhanced precision in dam crack width measurement: Leveraging advanced lightweight network identification for pixel-level accuracy,'' Int. J. Intell. Syst., vol. 2023, pp. 1–16, Sep. 2023.
[14] H. Yao, Y. Liu, H. Lv, J. Huyan, Z. You, and Y. Hou, ''Encoder–decoder with pyramid region attention for pixel-level pavement crack recognition,'' Comput.-Aided Civil Infrastruct. Eng., vol. 39, no. 10, pp. 1490–1506, May 2024.
[15] Q. Lin, W. Li, X. Zheng, H. Fan, and Z. Li, ''DeepCrackAT: An effective crack segmentation framework based on learning multi-scale crack features,'' Eng. Appl. Artif. Intell., vol. 126, Nov. 2023, Art. no. 106876.
[16] C. Tang, S. Jiang, H. Li, D. Huang, X. Huang, and Y. Xiong, ''Lightweight concrete crack segmentation method based on Deeplabv3+,'' in Proc. 3rd Int. Conf. Comput. Vis. Pattern Anal., Aug. 2023, pp. 399–404.
[17] Y. Chen, S. Dong, B. Hu, Q. Liu, and Y. Qu, ''A dynamic semantic segmentation algorithm with encoder-crossor-decoder structure for pixel-level building cracks,'' Meas. Sci. Technol., vol. 35, no. 2, Feb. 2024, Art. no. 025139.
[18] Z. Li, C. Yin, and X. Zhang, ''Crack segmentation extraction and parameter calculation of asphalt pavement based on image processing,'' Sensors, vol. 23, no. 22, p. 9161, Nov. 2023.
[19] M. Sohaib, S. Jamil, and J.-M. Kim, ''An ensemble approach for robust automated crack detection and segmentation in concrete structures,'' Sensors, vol. 24, no. 1, p. 257, Jan. 2024.
[20] Y. A. Lumban-Gaol, A. Rizaldy, and A. Murtiyoso, ''Comparison of deep learning architectures for the semantic segmentation of slum areas from satellite images,'' in Proc. Int. Archives Photogramm., Remote Sens. Spatial Inf. Sci., 2023, pp. 1439–1444.


[21] E. Kot, Z. Krawczyk, K. Siwek, L. Królicki, and P. Czwarnowski, ''Deep learning-based framework for tumour detection and semantic segmentation,'' Bull. Polish Acad. Sci. Tech. Sci., vol. 69, Mar. 2021, Art. no. 136750.
[22] S. Agarwal, A. Sawant, M. Faisal, S. E. Copp, J. Reyes-Zacarias, Y.-R. Lin, and S. J. Zinkle, ''Application of a deep learning semantic segmentation model to helium bubbles and voids in nuclear materials,'' Eng. Appl. Artif. Intell., vol. 126, Nov. 2023, Art. no. 106747.
[23] L. Fan and C. Zhou, ''Cloud-to-ground and intra-cloud nowcasting lightning using a semantic segmentation deep learning network,'' Remote Sens., vol. 15, no. 20, p. 4981, Oct. 2023.
[24] J. Long, E. Shelhamer, and T. Darrell, ''Fully convolutional networks for semantic segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[25] O. Ronneberger, P. Fischer, and T. Brox, ''U-Net: Convolutional networks for biomedical image segmentation,'' in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), 2015, pp. 234–241.
[26] A. Howard, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu, V. Vasudevan, Y. Zhu, R. Pang, H. Adam, and Q. Le, ''Searching for MobileNetV3,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
[27] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, ''Rethinking atrous convolution for semantic image segmentation,'' 2017, arXiv:1706.05587.
[28] T. Zhang, D. Wang, A. Mullins, and Y. Lu, ''Integrated APC-GAN and AttuNet framework for automated pavement crack pixel-level segmentation: A new solution to small training datasets,'' IEEE Trans. Intell. Transp. Syst., vol. 24, no. 4, pp. 4474–4481, Apr. 2023.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–20.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ''An image is worth 16×16 words: Transformers for image recognition at scale,'' 2021, arXiv:2010.11929.
[31] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang, ''Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886.
[32] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, ''Swin Transformer: Hierarchical vision transformer using shifted windows,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[33] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, ''SegFormer: Simple and efficient design for semantic segmentation with transformers,'' in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, Dec. 2021, pp. 12077–12090.
[34] H. Cao and Y. Wang, ''Swin-Unet: Unet-like pure transformer for medical image segmentation,'' in Computer Vision–ECCV Workshops. Cham, Switzerland: Springer, 2022, pp. 205–218.
[35] Y. Liu, J. Yao, X. Lu, R. Xie, and L. Li, ''DeepCrack: A deep hierarchical feature learning architecture for crack segmentation,'' Neurocomputing, vol. 338, pp. 139–153, Apr. 2019.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ''Focal loss for dense object detection,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[37] I. Loshchilov and F. Hutter, ''Decoupled weight decay regularization,'' 2019, arXiv:1711.05101.

YANG ZHOU received the B.Eng. degree in automation from Hubei Engineering University. He is currently pursuing the M.Eng.Sc. degree with the Faculty of Engineering, Universiti Malaya, under the supervision of Dr. Mokhtar from the Department of Electrical Engineering. During his previous projects involving machine learning applications, he gained expertise in various computer vision and natural language processing models and techniques, such as convolutional neural networks (CNNs), transformers, model fine-tuning, and data augmentation. His research interests include computer vision, deep learning, and image segmentation, with a focus on crack segmentation using imbalanced data.

RAZA ALI (Senior Member, IEEE) received the B.S. degree in telecommunication engineering from Balochistan University of Information Technology, Engineering and Management Sciences (BUITEMS), Quetta, Pakistan, the M.S. degree in electrical engineering (communication) from UET, Lahore, and the Ph.D. degree from the University of Malaya, Malaysia, in 2022. During the Ph.D. studies, he was associated with the VIP Laboratory, University of Malaya. He is currently an Assistant Professor with the Faculty of Information and Communication Technology, BUITEMS. His research interests include signal processing, computer vision, machine learning, and deep learning.

NORRIMA MOKHTAR received the B.Eng. degree in electrical engineering from the University of Malaya, in 2000, and the M.Eng. degree from Oita, Japan, in 2006. After working two years with the international telecommunication industry with an attachment at Echo Broadband GmbH, she managed to secure a Panasonic Scholarship, which required intensive screening at the national level, in 2002. To date, she has successfully supervised seven Ph.D. and four M.Eng.Sc. students (by research). She is the author and co-author of more than 50 publications in international journals and proceedings in Sensors, Automation, IEEE TRANSACTIONS ON IMAGE PROCESSING, Human-Computer Interface, Brain-Computer Interface, UAV, and Robotics. She received financial support from a Panasonic Scholarship for her M.Eng. degree. She is active as a reviewer for many reputable journals and several international conferences.

SULAIMAN WADI HARUN received the B.E. degree in electrical and electronics system engineering from Nagaoka University of Technology, Japan, in 1996, and the M.Sc. and Ph.D. degrees in photonics technology from the University of Malaya, in 2001 and 2004, respectively. He was an Adjunct Professor at Airlangga University, Indonesia, and Ton Duc Thang University, Vietnam. He has nearly 20 years of research experience in the development of optical fiber devices, including fiber amplifiers, fiber lasers, and fiber optic sensors. He was also involved in exploiting new nanomaterials, such as graphene, carbon nanotubes, black phosphorus, and topological insulators, for various fiber lasers and sensor applications. He has received about ten research grants of value over RM4M from the Ministry of Education and the Ministry of Science, Technology, and Innovation. He has published more than 700 articles in ISI journals, and his papers have been cited more than 7000 times with an H-index of 37, showing the impact on the community. He is a fellow of the Malaysian Academy of Science and the Founder and Honorary Advisor of the Optical Society of Malaysia. He received the prestigious Malaysian Rising Star Award from the Ministry of Higher Education, in 2016, for his contribution to international collaboration.

MASAHIRO IWAHASHI (Senior Member, IEEE) received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Tokyo Metropolitan University, Tokyo, Japan, in 1988, 1990, and 1996, respectively. In 1990, he joined Nippon Steel Company Ltd. Since 1993, he has been with Nagaoka University of Technology, Nagaoka, Japan, where he is currently a Professor with the Department of Electrical, Electronics, and Information Engineering. His research interests include digital signal processing, multirate systems, and image compression. He is a Senior Member of IEICE and a member of the Asia-Pacific Signal and Information Processing Association and the Institute of Image Information and Television Engineers.
