
Article
CrackScopeNet: A Lightweight Neural Network for Rapid Crack Detection on Resource-Constrained Drone Platforms

Tao Zhang, Liwei Qin, Quan Zou, Liwen Zhang, Rongyi Wang and Heng Zhang *

College of Computer and Information Science, College of Software, Southwest University, Chongqing 400715, China; [email protected] (T.Z.)
* Correspondence: [email protected]

Abstract: Detecting cracks during structural health monitoring is crucial for ensuring infrastructure
safety and longevity. Using drones to obtain crack images and automate processing can improve the
efficiency of crack detection. To address the challenges posed by the limited computing resources
of edge devices in practical applications, we propose CrackScopeNet, a lightweight segmentation
network model that simultaneously considers local and global crack features while being suitable for
deployment on drone platforms with limited computational power and memory. This novel network
features a multi-scale branch to improve sensitivity to cracks of varying sizes without substantial
computational overhead along with a stripe-wise context attention mechanism to enhance the capture
of long-range contextual information while mitigating the interference from complex backgrounds.
Experimental results on the CrackSeg9k dataset demonstrate that our method leads to a significant
improvement in prediction performance, with the highest mean intersection over union (mIoU) score
reaching 82.12%, and maintains a lightweight architecture with only 1.05 M parameters and 1.58 G
floating point operations (FLOPs). In addition, the proposed model excels in inference speed on edge
devices without a GPU thanks to its low FLOPs. CrackScopeNet contributes to the development of
efficient and effective crack segmentation networks suitable for practical structural health monitoring
applications using drone platforms.

Keywords: computer vision; crack detection; drone platforms; semantic segmentation; lightweight neural network

1. Introduction
Cracks serve as early indicators of structural damage in buildings, bridges, and roads, making their detection vital for structural health monitoring. Analyzing the morphological characteristics, positional information, and extent of internal damage in cracks allows for accurate safety assessments of buildings and infrastructure [1,2]. Timely detection and repair of cracks not only reduces maintenance costs but also prevents further structural deterioration, ensuring safety and durability [3,4].
Traditional crack detection methods such as visual inspections and manual evaluations are often costly and inefficient, relying heavily on the expertise of inspectors, which can lead to subjective and inconsistent results [5]. Therefore, the development of objective and efficient automated crack detection methods has become a significant trend in this field. Various sensor-based methods for automatic or semi-automatic crack detection have been proposed, including crack meters, RGB-D sensors, and laser scanners [6–8]. Although these sensors are accurate, they are expensive and challenging to deploy at large scale.
Advancements in computer vision technology have popularized image-based crack detection methods due to their long-distance, non-contact, and cost-effective nature. Traditional visual detection methods such as morphological image processing [9,10], filtering [11,12], and percolation models [13] are simple to implement and computationally light, but suffer from limited generalization performance. Environmental noise such as debris around cracks further complicates detection in practical engineering environments.



Recently, deep learning-based semantic segmentation algorithms have significantly
improved the accuracy and stability of image recognition in noisy environments. These
algorithms excel at locating and labeling crack pixels, providing comprehensive information
on crack distribution, width, length, and shape [14]. Nevertheless, general models for
understanding scenes often fail to capture the unique features of cracks, which are typically
thin, long, and irregularly shaped [15,16]. Cracks typically span entire images while
constituting only a small fraction of the pixel area, necessitating models with the ability
to capture long-range dependencies between pixels. While self-attention mechanisms
do well in aggregating long-distance contextual information [17–19], they come with
high computational costs that limit their detection speed. Additionally, cracks exhibit
uneven distribution along with significant size variations, necessitating multiscale feature
extraction [20–22]. Although methods such as DeepLabV3+ [23] and SegNext [24] are
able to capture multiscale information, they are computationally intensive and costly for
large images.
The use of unmanned aerial vehicles (UAVs) for crack monitoring has become preva-
lent due to their flexibility, cost effectiveness, and ability to efficiently cover both large
and difficult-to-access areas [25]. However, the computational resources available on edge
devices are typically limited and often lack high-power GPUs, making it crucial to de-
ploy lightweight models that can perform processing and analysis while incurring low
latency [26]. Researchers have proposed lightweight networks that reduce computational
costs by minimizing deep downsampling, reducing channel numbers, and optimizing
convolutional design; however, reducing the subsampling stages can lead to models with
insufficient receptive fields for covering large objects, as seen with ENet [27]. Bilateral
backbone models partially address this issue; for instance, BiSeNet [28] adds a context path
with fewer channels, while HrSegNet [29] maintains high-resolution features while adjust-
ing parameters to reduce channels. Unfortunately, these two-branch structures increase
computational overhead, and reducing the channels can hinder the ability of the model to
learn relational features.
Furthermore, several challenges affect the design of lightweight models for surface
crack segmentation: (1) existing methods increase computational complexity by incorporat-
ing large kernel convolutions, multiple parallel branches, and feature pyramid structures
to handle various object sizes and shapes; (2) diverse crack image scenes and complex
backgrounds limit feature extraction by lightweight models, making it difficult to learn
effective information from limited datasets; and (3) the subtle differences between cracked
and normal areas introduce complications during segmentation. While adding multi-
ple skip connections and auxiliary training branches can improve accuracy, this leads to
increased memory overhead.
To address the aforementioned challenges, we propose CrackScopeNet, a lightweight
segmentation network optimized for structural surface cracks. It features an optimized mul-
tiscale branching architecture and a carefully designed stripwise context attention (SWA)
module. Figure 1 presents a comparison on the CrackSeg9k dataset between traditional and
lightweight crack-specific semantic segmentation networks and our segmentation approach
in terms of their mean intersection over union (mIoU), model floating point operations
(FLOPs), and number of parameters. The figure clearly illustrates that our method outper-
forms all shown models while having substantially fewer FLOPs and parameters. This is
because the design considers, from the outset, both the local context information around small
cracks and the long-range context information, allowing the model to identify complete
cracks while mitigating background interference.

Figure 1. Comparison between classical and lightweight semantic segmentation networks and
CrackScopeNet on CrackSeg9k dataset.

Initially, in the local feature extraction stage we divide the channel data and perform
three convolution operations with different convolution kernel sizes to obtain the local
context information of cracks. Subsequently, we utilize a combination of strip pooling and
one-dimensional convolution to capture remote context information without compressing
channel features. Finally, we construct a lightweight multiscale feature fusion module to
aggregate shallow detail and deep semantic information. In these modules, we employ
depthwise separable convolutions, dropout, and residual connection structures to prevent
overfitting and vanishing or exploding gradients, resulting in a lightweight
neural network that is well adapted to crack detection.
In summary, our main contributions are as follows.
(1) We propose a novel crack image segmentation model called CrackScopeNet de-
signed to meet the requirements of structural health monitoring. The model effectively
extracts information at multiple levels during the downsampling stage and fuses key
features during the upsampling stage.
(2) We introduce a lightweight multiscale branch module and stripwise context atten-
tion module designed to align with the morphological characteristics of fractures. These
components effectively capture rich contextual information while minimizing computa-
tional costs. Compared to the previous HrSegNetB48 lightweight crack segmentation
model, our approach reduces memory usage by a factor of approximately 5.2 and improves
inference speed by a factor of 1.7.
(3) CrackScopeNet demonstrates state-of-the-art performance on the CrackSeg9k
dataset, and exhibits excellent transferability to small datasets in specific scenarios; ad-
ditionally, the model has a low inference delay on resource-constrained drone platforms,
making it ideal for outdoor crack detection through computer vision. This ensures that
drone platforms can perform rapid crack detection and analysis, enhancing the efficiency
and effectiveness of structural health monitoring.
To facilitate further research and application, the code for this work is available on
GitHub (at https://github.com/ttkingzz/CrackScopeNet, accessed on 5 July 2024).

2. Related Work
Deep learning-based semantic segmentation has significantly advanced crack detection.
Cutting-edge research in this field primarily explores three key areas: achieving higher recognition
accuracy, increasing inference speed, and developing more efficient attention mechanisms.
This section discusses related work in crack segmentation across these three aspects.

2.1. High-Accuracy Models


Recent advancements in semantic segmentation have been driven by efficient feature
fusion and powerful feature extraction techniques. U-Net [30] has excelled in the medical
domain by using skip connections to integrate feature maps at different scales, while
PSPNet [31] employs a pyramid pooling module to capture multiscale context.
To enhance performance further, researchers have introduced transformer-based models.
ViT [17] adapts the transformer architecture to image segmentation by processing image
patches as sequences, leveraging self-attention to capture the global dependencies crucial for
precise segmentation. SegFormer [19] uses hierarchical transformers for multiscale feature
extraction, integrates features with a lightweight multilayer perceptron (MLP) decoder,
and eliminates positional encoding, enhancing robustness.
Researchers have also explored the benefits of using larger convolutional kernels;
ConvNeXt [32] integrates modern CNN design with 7 × 7 convolutions to enhance feature
representation and efficiently capture long-range dependencies, while RepLKNet [33]
uses 31 × 31 convolutions to capture extensive spatial information and re-parameterizes
kernels for efficient inference, thereby leveraging large receptive fields with minimal
computational cost.
However, these models often require significant parameter counts and computational
resources to capture multiscale features. As a result, most tend to have large sizes and slow
inference speeds, hindering their practical application.

2.2. Lightweight Models


Traditional semantic segmentation models often require substantial computational re-
sources, with some models demanding tens of billions of operations per second, potentially
exceeding the computational capabilities of edge platforms. Consequently, recent research
has focused on designing lightweight neural networks that achieve real-time performance.
ENet [27] employs smaller-sized feature maps with fewer channels in the early layers
of the network, and also reduces downsampling rates; however, this approach prioritizes
speed over accuracy. While dilated convolutions can be used to achieve a larger receptive
field, there is still a loss of spatial detail, especially in boundary areas. To compensate for the
loss of detail, a dual-branch approach can be adopted to capture both low-level detail and
high-level semantic information. BiSeNet [28] utilizes a spatial path to maintain the spatial
dimensions of the original image, capturing rich spatial information, while a context path
quickly performs downsampling to acquire global semantic information. HrSegNet [29]
maintains a backbone branch with an unchanged scale, aiming to preserve spatial details,
and introduces a semantic guidance branch to provide deep semantic information to the
main branch. Although dual-branch network structures address both detail and semantic
information, and can achieve real-time inference with GPU acceleration, the additional
branches result in increased parameter count and computational demand. Therefore,
deploying such models on drone platforms without high-performance GPUs is not feasible.
The TopFormer [34] model combines the CNN and transformer architectures to balance
local feature extraction and global context understanding, achieving reduced computational
complexity through a lightweight design. The SeaFormer [35] model employs simplified
transformer modules and innovative attention mechanisms to prioritize important features,
further enhancing model efficiency. Although these models significantly reduce the compu-
tational load and parameter count in the transformer, they require extensive datasets for
training, which is not applicable to the field of crack detection.

2.3. Attention Mechanism


Attention mechanisms have gained widespread recognition in the field of computer
vision due to their ability to effectively enhance the performance of deep neural networks.
The squeeze-and-excitation (SE) attention mechanism was the first to successfully introduce
channel attention into neural networks; this approach uses 2D global pooling to compress
the spatial dimensions into channel descriptors [36], facilitating enhanced feature learning.
Subsequently, attention mechanisms have made significant progress in two directions:
(1) aggregating only the channel features, and (2) combining channel features with spatial
features. Specifically, a convolutional block attention module (CBAM) utilizes both average
pooling and max pooling to integrate features from the channel and spatial dimensions [37].
Coordinate attention (CA) is an attention mechanism designed to improve the focus on
important regions while retaining precise positional information [38]. Unlike traditional
attention mechanisms, CA factorizes attention into two one-dimensional feature encodings along the height and width directions, embedding positional information into channel attention.
Unfortunately, existing attention mechanisms either lack long-range dependencies
or reduce the number of channels during channel feature extraction in order to minimize
parameters, which may lead to the loss of channel information.

3. Methods
To achieve a high-performance lightweight crack segmentation model, we introduce
CrackScopeNet, which is characterized by two main features: (1) partitioning the feature map
channels for convolutions at different scales to extract local multiscale information without
incurring excessive computational overhead, and (2) reducing the downsampling rate to
1/16 without using additional auxiliary branches, instead incorporating a stripwise attention
mechanism tailored to crack morphology in order to capture long-range dependencies.

3.1. Model Overview


We meticulously design feature extraction and global context attention modules based
on crack morphological information, which we collectively name CrackScopeNet. As
shown at the top of Figure 2, our proposed model adopts a simple encoder–decoder
structure similar to most previous works [19,23]. It comprises a feature encoder and
a feature decoder. In the encoder, the crack image is input into the network, passing
through three feature extraction stages to capture detailed and deep semantic information.
Each stage comprises a downsampling convolution layer and a series of CrackScope (CS)
modules, each containing a multiscale branch and an SWA module. The decoder gradually
restores the crack features to the original resolution, merging different levels of feature
information through the feature fusion module to produce an output with rich spatial and
textural features. The detailed configuration of CrackScopeNet is listed in Table 1.

Table 1. Instance of CrackScopeNet.

Stage | Downsampling: Operation (Cin → Cout) | Upsampling: Operation (Cin → Cout) | Output Size
S0 | Input (3) | Seg Head (96 → 2) | 400 × 400
S1 | Stem (3 → 32) | Concatenate (64 → 96) | 100 × 100
S2 | CS Stage × 3 (32 → 32) | Up-samp. (128 → 64) | 100 × 100
S3 | Down-samp. (32 → 64) | Concatenate (64 → 128) | 50 × 50
S3 | CS Stage × 3 (64 → 64) | Up-samp. (128 → 64) | 50 × 50
S4 | Down-samp. (64 → 128) | – | 25 × 25
S4 | CS Stage × 4 (128 → 128) | – | 25 × 25

Figure 2. A structural overview of CrackScopeNet. (a) CrackScopeNet consists of three downsampling
stages, and each stage contains N CrackScope modules and MLP modules (we refer to these two
combined modules as a CrackScope block). A CrackScope module comprises a multiscale branch (b),
in which the input is divided into three parts along the channel dimension, and an SWA module (c). The
upsampling block (d) upsamples deep features and stacks them with shallow information, using SWA
modules and convolutional layers for feature fusion.

3.2. Feature Extraction


Inspired by previous advanced works [24,32], we adopt a backbone structure similar
to ViT [17] for the block design at each feature extraction stage. As illustrated in Figure 2a,
each CS block consists of two submodules connected via residual connections, namely,
the CS module and the MLP module. To prevent overfitting during
training, we incorporate a dropout layer in each submodule. The CS module comprises a
multiscale branch module and a stripwise context attention (SWA) module. This section
focuses on the multiscale features, while the SWA is introduced in the next section.
Unlike general object segmentation, crack segmentation targets the identification and
localization of crack areas with various shapes. Due to different development times and
external influences, coarse cracks are often accompanied by fine cracks. To address the
challenge of varying crack proportions in different regions, we introduce CS modules to
capture multiscale texture features, using receptive fields with kernel sizes of 3 × 3, 5 × 5,
and 7 × 7, respectively.
When extracting multiscale information with a branching structure, the FLOPs increase
rapidly with the number of branches and the size of the convolution kernels. For a regular
convolution, given output spatial dimensions $W \times H$, input channels $C_{in}$,
output channels $C_{out}$, and a convolution kernel of size $k_m \times k_n$, the FLOPs
are calculated as follows:

$$\mathrm{FLOPs} = W \times H \times C_{in} \times C_{out} \times k_m \times k_n. \tag{1}$$
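To make Equation (1) concrete, the following minimal Python sketch compares the cost of a 3 × 3 kernel with that of a 7 × 7 kernel at the same feature-map size; the spatial and channel dimensions here are illustrative assumptions, not values from the paper:

```python
def conv_flops(w, h, c_in, c_out, k_m, k_n):
    """FLOPs of a regular convolution layer per Equation (1)."""
    return w * h * c_in * c_out * k_m * k_n

# Hypothetical feature map: 100 x 100 spatial size, 32 channels in and out.
small = conv_flops(100, 100, 32, 32, 3, 3)  # 3 x 3 kernel
large = conv_flops(100, 100, 32, 32, 7, 7)  # 7 x 7 kernel
print(f"3x3: {small / 1e6:.1f} MFLOPs, 7x7: {large / 1e6:.1f} MFLOPs, "
      f"ratio: {large / small:.2f}x")  # ratio is (7*7)/(3*3), roughly 5.4x
```

The roughly 5.4× ratio is exactly the "more than five times the computational cost" gap between 3 × 3 and 7 × 7 convolutions discussed below.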

As shown in Figure 2b, to reduce computational overhead, we adjust the input and
output channels as well as the convolution kernel sizes for the multiscale branches. First,
we divide the input features into three parts along the channels, allocating half the channels
to the branch with the smallest kernel (3 × 3) and a quarter of the channels to each of
the two branches with larger kernels. Among these, as in ConvNeXt [32], the largest
convolution kernel that we use is 7 × 7. After the convolution computations are completed
in the three branches, the features are then merged along the channel dimension and a
1 × 1 convolution is used to model the relationships between all channels.
However, compared to 3 × 3 convolutions, these 7 × 7 large-kernel convolutions incur
more than five times the computational cost. In order to further reduce the computational
cost, we employ strip convolutions in our branch design to achieve the same receptive
field while being more computationally lightweight [24]. Because cracks are predomi-
nantly strip-shaped, strip convolutions are particularly effective in capturing these features.
Therefore, (1 × 5, 5 × 1) and (1 × 7, 7 × 1) strip convolutions are used to replace 5 × 5
and 7 × 7 2D convolutions for capturing local contextual information. Then, we design a
remote context information attention module to assist the CrackScope module in obtaining
global contextual information, which is introduced below.
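The following is a minimal PyTorch-style sketch of this channel-split multiscale branch; the module name is hypothetical, and the use of depthwise (grouped) convolutions in each branch is our own illustrative assumption rather than a confirmed detail of the authors' implementation:

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Channel-split multiscale branch: half the channels pass through a
    3 x 3 conv, and a quarter each through (1 x 5, 5 x 1) and (1 x 7, 7 x 1)
    strip convolutions; a final 1 x 1 conv models cross-channel relations."""
    def __init__(self, channels: int):
        super().__init__()
        c2, c4 = channels // 2, channels // 4
        self.branch3 = nn.Conv2d(c2, c2, 3, padding=1, groups=c2)
        self.branch5 = nn.Sequential(
            nn.Conv2d(c4, c4, (1, 5), padding=(0, 2), groups=c4),
            nn.Conv2d(c4, c4, (5, 1), padding=(2, 0), groups=c4))
        self.branch7 = nn.Sequential(
            nn.Conv2d(c4, c4, (1, 7), padding=(0, 3), groups=c4),
            nn.Conv2d(c4, c4, (7, 1), padding=(3, 0), groups=c4))
        self.mix = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        c2, c4 = x.shape[1] // 2, x.shape[1] // 4
        x3, x5, x7 = torch.split(x, [c2, c4, c4], dim=1)
        y = torch.cat([self.branch3(x3), self.branch5(x5), self.branch7(x7)], dim=1)
        return self.mix(y)
```

For example, with 32 input channels the three branches receive 16, 8, and 8 channels, respectively, matching the half/quarter split described above.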

3.3. Stripwise Context Attention


The attention mechanism is an adaptive selection process that makes the network focus
on important parts. As discussed above, the multiscale branch module in the CS module is
used to extract multiscale local contextual information. Inspired by GhostNetv2 [39] and
coordinate attention (CA) [38], we utilize strip pooling to capture long-distance contextual
information in the spatial dimensions. In addition, we design a stripwise context attention
(SWA) module and integrate it into the multiscale module. As illustrated in Figure 2c, given
an input feature $x$ with channel number $c$ and spatial dimensions $H \times W$, we perform max
pooling operations along both the height and width dimensions to obtain the global width and
height features $Z^h$ and $Z^w$:

$$Z^h(h) = \max_{0 \le i < W} x(h, i), \qquad Z^w(w) = \max_{0 \le j < H} x(j, w). \tag{2}$$

Further, to avoid the issue of compressed channel numbers that can arise in CBAM [37]
and CA [38], we apply one-dimensional depthwise separable convolutions to model the
relationships across the different spatial dimensions and channels. The attention
representations in the horizontal and vertical directions are denoted as follows:

$$y^h = \delta(F_2(F_1(F_0(Z^h)))), \qquad y^w = \delta(F_2(F_1(F_0(Z^w)))), \tag{3}$$

where $F_0$ is a one-dimensional depthwise convolution with a kernel size of 5, $F_1$ is a
one-dimensional pointwise convolution used to capture inter-channel information, $F_2$ is
batch normalization, and $\delta$ is the sigmoid activation function.

Finally, the output of the attention module is obtained as shown below:

$$Y = x \cdot y^h \cdot y^w. \tag{4}$$
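A minimal PyTorch sketch of the SWA computation in Equations (2)–(4) follows; sharing one set of $F_0$/$F_1$/$F_2$ weights across the two directions is an assumption made for brevity, as the text does not specify whether the directions share parameters:

```python
import torch
import torch.nn as nn

class StripwiseContextAttention(nn.Module):
    """SWA sketch: strip max pooling along each spatial axis (Eq. (2)),
    a 1-D depthwise conv F0, a 1-D pointwise conv F1, batch norm F2, and a
    sigmoid gate (Eq. (3)); the input is reweighted by both gates (Eq. (4))."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.f0 = nn.Conv1d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.f1 = nn.Conv1d(channels, channels, 1)
        self.f2 = nn.BatchNorm1d(channels)

    def _gate(self, z):                      # z: (B, C, L)
        return torch.sigmoid(self.f2(self.f1(self.f0(z))))

    def forward(self, x):                    # x: (B, C, H, W)
        z_h = x.max(dim=3).values            # (B, C, H): max over width
        z_w = x.max(dim=2).values            # (B, C, W): max over height
        y_h = self._gate(z_h).unsqueeze(3)   # (B, C, H, 1)
        y_w = self._gate(z_w).unsqueeze(2)   # (B, C, 1, W)
        return x * y_h * y_w                 # Eq. (4), via broadcasting
```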

3.4. Feature Fusion


To reduce the loss of key feature information and ensure the accuracy of crack detection,
we integrate the SWA module into the upsampling block for precise feature restoration
and fusion. Notably, the decoder focuses primarily on fine-tuning the details of the feature
map results, allowing for a relatively simple design. As such, we do not use atrous spatial
pyramid pooling [23] to extract multiscale features from high-level semantic information. On
the one hand, using many dilated convolutions adds unnecessary computational overhead
and increases network complexity. On the other hand, our subsequent ablation experiments
demonstrate that further extraction of multiscale information from feature maps does not
enhance performance.
As shown in Figure 2d, the high-level semantic information is adjusted through
pointwise convolution operations, after which bilinear interpolation is used to restore the
feature map size for concatenation with features from the lower stages. Subsequently, in
order to further fuse high-level semantic features with detailed texture features, we employ
an SWA module with a shortcut connection to model feature relationships across the global
space and channels while fully integrating the multiscale feature maps. Next, a small kernel
convolution is used to refine the crack feature information. After multiscale fusion of the
three-stage feature maps, they are fed into the segmentation head, which maps the feature
map to the required segmentation output, completing the entire network computation
process. Notably, to avoid the large computational overhead incurred by the decoder, we
do not use transposed convolutions to learn more parameters; instead, we only select and
fuse the features, resulting in a lightweight multiscale feature fusion module.
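The sketch below illustrates one plausible reading of the Figure 2d block, reusing the StripwiseContextAttention sketch from Section 3.3; the class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFusion(nn.Module):
    """Upsampling block sketch: a pointwise conv adjusts the deep channels,
    bilinear interpolation restores the spatial size, the result is stacked
    with shallow features, and an SWA module with a shortcut plus a small
    3 x 3 conv fuses the two feature levels."""
    def __init__(self, deep_ch: int, shallow_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, out_ch, 1)
        self.swa = StripwiseContextAttention(out_ch + shallow_ch)
        self.refine = nn.Conv2d(out_ch + shallow_ch, out_ch, 3, padding=1)

    def forward(self, deep, shallow):
        deep = self.reduce(deep)
        deep = F.interpolate(deep, size=shallow.shape[2:],
                             mode="bilinear", align_corners=False)
        fused = torch.cat([deep, shallow], dim=1)
        fused = fused + self.swa(fused)      # SWA with shortcut connection
        return self.refine(fused)
```

Note that, consistent with the design goal above, the block contains no transposed convolutions: only interpolation, concatenation, attention, and a small convolution.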

3.5. Auxiliary Branch


Inspired by the auxiliary branches in PSPNet [31], we further enhance the segmentation
performance by adding three auxiliary branches in the CrackScopeNet encoder in order to
improve the performance of the feature extractor during training. The auxiliary branches
are used only during training and ignored during inference; thus, they do not affect the
inference speed of the entire network structure. However, they provide additional gradient
signals, allowing higher gradient propagation to each feature extraction stage, which helps
to mitigate the gradient vanishing or exploding problem and improves the training stability
of the encoder. For each auxiliary branch, we use the same segmentation head as the main
branch and recover the original image size with different upsampling multiples. The total
loss is the weighted sum of the binary cross-entropy losses from each segmentation head,
as follows:

$$L_t = L_p + aL_1 + bL_2 + cL_3, \tag{5}$$

where $L_t$ and $L_p$ represent the total loss and primary loss, respectively, while $L_1$, $L_2$, and $L_3$ are the losses of the three auxiliary branches. In this work, the weights $a$, $b$, and $c$ are all set to 0.5.
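A minimal sketch of Equation (5) as a training-time loss function is given below; standard cross-entropy over the two-class segmentation output stands in here for the binary cross-entropy described above, and the function name is hypothetical:

```python
import torch.nn.functional as F

def total_loss(main_logits, aux_logits_list, target, weights=(0.5, 0.5, 0.5)):
    """Equation (5): L_t = L_p + a*L_1 + b*L_2 + c*L_3 with a = b = c = 0.5.
    main_logits: output of the primary segmentation head, (B, 2, H, W);
    aux_logits_list: outputs of the three auxiliary heads, already upsampled
    to the target resolution; target: (B, H, W) integer class mask."""
    loss = F.cross_entropy(main_logits, target)        # primary loss L_p
    for w, logits in zip(weights, aux_logits_list):    # a*L1 + b*L2 + c*L3
        loss = loss + w * F.cross_entropy(logits, target)
    return loss
```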

4. Experimental Datasets and Setup


4.1. Datasets
In crack segmentation research, the limited size and number of publicly available
datasets poses a challenge for comprehensive algorithm evaluation. CrackSeg9k [40]
addresses this by offering a substantial dataset designed specifically for crack segmentation
tasks. Despite the CrackSeg9k images being captured with cameras and smartphones, their
resolution and imaging angles are similar to UAV-captured images, allowing models to
generalize well to UAV-based image data.
CrackSeg9k consists of 9255 images with a resolution of 400 × 400 pixels, each labeled
for cracks and backgrounds. As shown in Table 2, this dataset was created by merging ten smaller
sub-datasets to enhance its diversity and robustness. The resulting dataset consists of
various crack types, including linear, branched, webbed, and non-crack images, with
examples shown in Figure 3a, ensuring that models trained on it can generalize across
different crack patterns and conditions. Notably, the creators of CrackSeg9k have corrected
for label noise, boosting the dataset’s reliability.

Figure 3. Samples from three crack datasets. The first row shows the original images, while the second
shows the masks overlaid on the original images: samples from the CrackSeg9k dataset (a), the
Ozgenel dataset (b), and the Aerial Track dataset (c).

Table 2. Sub-datasets in CrackSeg9k.

Name | Number | Material
Masonry [41] | 240 | Masonry structures
CFD [42] | 118 | Paths and sidewalks
CrackTree200 [43] | 175 | Pavement
Volker [44] | 427 | Concrete structures
DeepCrack [45] | 443 | Concrete and asphalt surfaces
Ceramic [46] | 100 | Ceramic tiles
SDNET2018 [47] | 1411 | Building facades, bridges, sidewalks
Rissbilder [44] | 2736 | Building surfaces (walls, bridges)
Crack500 [21] | 3126 | Pavement
GAPS384 [48] | 383 | Pavement and concrete surfaces

In addition to CrackSeg9k, we also used two specific-scene crack datasets: the close-
range concrete crack dataset Ozgenel [49] and the low-altitude UAV-captured highway
crack dataset Aerial Track [50], allowing us to further explore the generalization abilities of
CrackScopeNet. Among these, the image scenes in the Ozgenel dataset are similar to the
rock crack scenes in CrackSeg9k, while the Aerial Track dataset includes post-earthquake
highway crack images captured by UAVs, featuring predominantly small cracks amid
significant interference. As examples, two randomly selected images from these two
datasets are displayed in Figure 3b,c.
The Ozgenel dataset originally consists of 458 high-definition images (4032 × 3024 pixels)
with annotated cracks collected from various buildings at Middle East Technical University.
For our experiments, we cropped these images into 448 × 448 pixel blocks while ensuring a
crack area of at least 1% in each block. This process yielded 2600 images, which we divided
into 70% for training, 10% for validation, and 20% for testing.
The Aerial Track dataset comprises 4118 highway crack images (448 × 448 pixels)
captured by UAVs after an earthquake. The dataset is divided into three parts: training,
validation, and testing, with 2620, 598, and 900 images, respectively. We transferred our
models trained on CrackSeg9k to these two specific tasks.

4.2. Parameter Setting


4.2.1. Training and Fine-Tuning Configurations
Our experiments used the PaddleSeg [51] framework and were performed on a desktop
with an NVIDIA Titan V GPU (12 GB) and Ubuntu 20.04 with CUDA 10.1. To prevent
overfitting, we adopted data augmentation methods such as random horizontal flipping,
scaling (0.5 to 2), cropping, resizing, normalization, and random distortion to vary the
brightness, contrast, and saturation with 50% probability.
For the CrackSeg9k dataset, the specific training parameters are shown in Table 3. All
training was conducted from scratch without pretraining on other datasets. To manage
the limited GPU memory, models with high memory usage (e.g., UNet, PSPNet) had their
batch size and initial learning rate halved while keeping the number of epochs unchanged,
ensuring similar training effects.

Table 3. Parameter settings for training on the CrackSeg9k dataset.

Item | Setting
Epochs | 200
Batch size | 16
Optimizer | AdamW
Weight decay | 0.01
Beta1 | 0.9
Beta2 | 0.999
Initial learning rate | 0.005
Learning rate decay type | poly
GPU memory | 12 GB
Image size | 400 × 400

In addition to CrackSeg9k, we transferred the models to the Ozgenel [49] and Aerial
Track [50] datasets. During fine-tuning, we reduced the learning rate to 0.0001, limited the
epochs to 20, and adjusted the batch size to 8. The input images were cropped to 448 × 448
for Ozgenel and 512 × 512 for Aerial Track.
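A minimal PyTorch sketch of the Table 3 optimization setup is given below; the poly-decay power of 0.9 is an assumption (it is a common default, e.g., in PaddleSeg), the helper name is hypothetical, and the scheduler is intended to be stepped once per iteration:

```python
import torch

def make_optimizer_and_scheduler(model, iters_per_epoch, epochs=200):
    """Reproduces the Table 3 schedule: AdamW with a polynomial ("poly")
    decay of the learning rate from 0.005 towards 0 over all iterations."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.005,
                                  betas=(0.9, 0.999), weight_decay=0.01)
    total_iters = iters_per_epoch * epochs
    # poly decay: lr_t = lr_0 * (1 - t / total_iters) ** power
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda t: max(0.0, 1.0 - t / total_iters) ** 0.9)
    return optimizer, scheduler
```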

4.2.2. Drone Platform


This study leveraged a drone-based platform to simulate edge model deployment,
focusing on real-world applications where computational resources are limited. In the
field of crack monitoring, DJI drones such as the M300RTK and Mavic 2 Pro are frequently
employed due to their precision, flexibility, and advanced imaging capabilities [25,26].
However, DJI platforms often restrict the deployment of proprietary algorithms directly
on the edge device, necessitating the transfer of images for postprocessing, which can
delay crack detection and analysis. To overcome these limitations and allow for immediate
crack analysis, we opted for a Navio2-equipped drone integrated with a Raspberry Pi
4B. This choice enables the deployment of crack segmentation models directly on the
onboard computing device, facilitating real-time image processing. The Raspberry Pi 4B,
powered by a Broadcom BCM2711 CPU running at 1.5 GHz, offers a balance between low
power consumption and adequate computational capacity, making it a suitable platform
for environments where high-performance GPUs are impractical. The Navio2 flight control
board, with its built-in inertial measurement unit (IMU), barometer, and GPS module,
provides essential flight control functionalities while allowing for seamless integration with
the Raspberry Pi, enabling real-time analysis as the drone captures images.
One of the key objectives of this study was to evaluate the inference latency of various
models in a resource-constrained environment. To ensure a fair comparison, each model
was tested by loading images from a standardized test set directly onto the Raspberry
Pi. This method allows for a consistent assessment of the performance of each model
under identical conditions, ensuring that the differences in inference speeds are accurately
measured and free from bias. In practical application scenarios, the workflow remained
consistent with our testing phase. The drone captured images through its onboard
camera, then these images were processed in real time using the same crack extraction
methods developed during our experiments, with output results either stored locally or
transmitted to a ground station depending on the mission requirements. This approach
demonstrates how low-cost resource-constrained platforms such as the Raspberry Pi can
be effectively used for timely and efficient structural health monitoring when integrated
with a Navio2-equipped drone, offering a scalable solution for rapid response in diverse
and often remote environments.

4.3. Evaluation Metrics


Based on previous studies [29,52], we used four metrics to comprehensively evaluate
model performance: precision (Pr), recall (Re), F1 score (F1), and mean intersection over
union (mIoU). These indicators are defined as follows:

$$Pr = \frac{TP}{TP + FP} \tag{6}$$

$$Re = \frac{TP}{TP + FN} \tag{7}$$

$$F1 = \frac{2 \times Pr \times Re}{Pr + Re} \tag{8}$$

$$mIoU = \mathrm{mean}\!\left(\frac{TP}{TP + FN + FP}\right) \tag{9}$$

where true positive (TP) represents correctly classified crack pixels, false positive (FP) represents background pixels incorrectly classified as cracks, and false negative (FN) represents crack pixels incorrectly identified as background.
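The following sketch computes these metrics from binary prediction and ground-truth masks; the small epsilon guarding against empty masks is our own addition:

```python
import numpy as np

def crack_metrics(pred, gt, eps=1e-12):
    """Pr, Re, F1 (Equations (6)-(8)) for the crack class, plus mIoU
    (Equation (9)) as the mean of the crack and background IoUs.
    pred and gt are binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    pr = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    f1 = 2 * pr * re / (pr + re + eps)
    miou = 0.5 * (tp / (tp + fn + fp + eps) + tn / (tn + fp + fn + eps))
    return pr, re, f1, miou
```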
In addition, we evaluated the computational cost and complexity of each model using
the number of floating point operations (FLOPs) and the parameter count. We also used the average
inference latency for a single image on the Navio2-based drone to evaluate
the inference speed of the lightweight models. A lightweight model suitable for drone
platforms requires a low parameter count, low FLOPs, and low inference latency.

5. Experiment
In this section, we first conduct a comprehensive quantitative comparison between
CrackScopeNet and the most advanced segmentation models in various metrics, visualize
the results, and comprehensively analyze the detection performance. Subsequently, we
explore the transfer learning capability of our model on crack datasets specific to other
scenarios. Finally, we perform ablation studies to meticulously examine the significance
and impact of each component within CrackScopeNet.

5.1. Comparative Experiments


As our primary objective is to achieve an exceptional balance between the accuracy
of crack region extraction and inference speed, we compare CrackScopeNet with three
types of models: classical general semantic segmentation models, advanced lightweight
semantic segmentation models, and the latest models designed explicitly for crack seg-
mentation, totaling thirteen models. Specifically, U-Net [30], PSPNet [31], SegNet [53],
DeeplabV3+ [23], SegFormer [19], and SegNext [24] were selected as six classical segmenta-
tion models with high accuracy. BiSeNet [28], BiSeNetV2 [54], STDC [55], TopFormer [34],
and SeaFormer [35] were chosen due to their advantages in inference speed as lightweight
semantic segmentation models; notably, SegFormer, TopFormer, and SeaFormer are all
transformer-based methods that have demonstrated outstanding performance on large
datasets such as Cityscapes [56]. Additionally, we included two specialized crack segmen-
tation models, U2Crack [52] and HrSegNet [29], which have been optimized for the crack
detection scenario based on general semantic segmentation models.
It is important to note that in order to ensure that all models could be easily converted
to ONNX format and deployed on edge devices with limited computational resources
and memory, we selected the lightweight MobileNetV2 [57] and ResNet-18 [58] backbones
for the DeepLabV3+ and BiSeNet models, respectively, while for SegFormer and SegNext
we chose the lightweight versions SegFormer-B0 [19] and SegNext_MSCAN_Tiny [24],
which are suited for real-time semantic segmentation as proposed by the authors. For
TopFormer and SeaFormer, we discovered during training that the tiny versions had
difficulty converging; thus, we only utilized their base versions.

Quantitative Results. Table 4 presents the performance of each baseline network and
the proposed CrackScopeNet on the CrackSeg9k dataset. Analyzing the accuracy of the
different types of segmentation networks in the table,
the larger models generally achieve higher mIoU scores than the lightweight models;
specifically, compared to the classical high-accuracy models, our model achieves the best
performance in terms of mIoU, recall, and F1 score, with scores of 82.15%, 89.24%, and
89.29%, respectively. Although our model’s precision (89.34%) is 1.26% lower than U-Net,
U-Net has poor recall performance (−2.24%) and our model’s parameters and FLOPs are
lower by 12 and 48 times, respectively.

Table 4. Performance of different methods and our method on the CrackSeg9k dataset.

Category | Model | mIoU (%) | Pr (%) | Re (%) | F1 (%) | Parameters | FLOPs
Classical | U-Net [30] | 81.36 | 90.60 | 87.00 | 88.76 | 13.40 M | 75.87 G
Classical | PSPNet [31] | 81.69 | 89.19 | 88.72 | 88.95 | 21.06 M | 54.20 G
Classical | SegNet [53] | 80.50 | 89.71 | 86.57 | 88.11 | 29.61 M | 103.91 G
Classical | DeepLabV3+ [23] | 80.96 | 88.56 | 88.29 | 88.42 | 2.76 M | 2.64 G
Classical | SegFormer [19] | 81.63 | 89.82 | 88.05 | 88.92 | 3.72 M | 4.13 G
Classical | SegNext [24] | 81.55 | 89.28 | 88.44 | 88.86 | 4.23 M | 3.72 G
Lightweight | BiSeNet [28] | 81.01 | 89.74 | 87.26 | 88.48 | 12.93 M | 34.57 G
Lightweight | BiSeNetV2 [54] | 80.66 | 89.36 | 87.11 | 88.22 | 2.33 M | 4.93 G
Lightweight | STDC [55] | 80.84 | 88.92 | 87.76 | 88.34 | 8.28 M | 5.22 G
Lightweight | STDC2 [55] | 80.94 | 89.54 | 87.33 | 88.42 | 12.32 M | 7.26 G
Lightweight | TopFormer [34] | 80.96 | 89.28 | 87.60 | 88.43 | 5.06 M | 1.00 G
Lightweight | SeaFormer [35] | 79.13 | 87.29 | 87.19 | 87.20 | 4.01 M | 0.64 G
Specific | U2Crack [52] | 81.45 | 90.13 | 87.52 | 88.80 | 1.19 M | 31.21 G
Specific | HrSegNetB48 [29] | 81.07 | 90.39 | 86.78 | 88.55 | 5.43 M | 5.59 G
Specific | HrSegNetB64 [29] | 81.28 | 90.44 | 87.03 | 88.70 | 9.65 M | 9.91 G
Specific | CrackScopeNet (ours) | 82.15 | 89.34 | 89.24 | 89.29 | 1.05 M | 1.58 G

In terms of network weight, the network proposed in this paper achieves the best
balance between accuracy on the CrackSeg9k dataset and weight, as intuitively illustrated
in Figure 1. Our model achieves the highest mIoU with only 1.05 M parameters and
1.58 G FLOPs, making it incredibly lightweight. Its FLOPs are slightly higher than those
of TopFormer and SeaFormer, but lower than all other small models; notably, due to
the small size of the crack dataset, the learning capability of lightweight segmentation
networks is evidently limited, as mainstream lightweight segmentation models do not
consider the unique characteristics of cracks, resulting in poor performance. The proposed
CrackScopeNet architecture successfully achieves the design goal of a lightweight net-
work structure while maintaining superior segmentation performance, making it easily
deployable on resource-constrained edge devices.
Moreover, compared to the state-of-the-art crack image segmentation algorithms,
the proposed method achieves an mIoU of 82.15% with fewer parameters and FLOPs,
surpassing the highest-accuracy versions of the U2Crack and HrSegNet models. Notably,
the HrSegNet model employs an online hard example mining (OHEM) technique during
training to improve its accuracy. In contrast, we only use the cross-entropy loss function for
model parameter updating without deliberately employing any training tricks to enhance
performance, showcasing the significant benefits of considering crack morphology during
model design.
Qualitative Results. Figures 4–6 display the qualitative results of all compared models.
Our method achieves superior visual performance compared to the other models. From
the first, second, and third rows of Figure 4 it can be observed that CrackScopeNet and the
larger-parameter segmentation algorithms achieve satisfactory results for high-
resolution images with apparent crack features. In the fourth row, where the original image
contains asphalt with color and texture similar to cracks, CrackScopeNet and SegFormer
successfully overcome the background noise interference. This is attributed to their long-
range contextual dependencies, which allow them to effectively capture the relationships
between cracks. In the fifth row, the results show that CrackScopeNet exhibits robust
performance even under uneven illumination conditions. This can be attributed to the
design of the network structure, which considers both the local and global features of cracks
while effectively suppressing noise.

Figure 4. Visualization of the segmentation results of the classical segmentation models and our
model on the CrackSeg9k test set.

Figure 5 clearly shows that the lightweight networks struggle to eliminate back-
ground noise interference, leading to fragmented segmentation results for fine cracks.
This outcome is due to the limited parameters learned by lightweight models. Finally,
Figure 6 presents the visualization results of the most advanced crack segmentation models.
U2Crack [52], based on the ViT [17] architecture, achieves a broader receptive field that
somewhat alleviates background noise, though at the cost of significant computational
overhead. HrSegNet [29] maintains a high-resolution branch to capture rich and detailed
features. As seen in the last two columns of Figure 6, the increased number of channels in
the HrSegNet network allows more detailed information to be extracted; however, this leads
to background information being misclassified as cracks. This explains the high precision
and low recall results of HrSegNet. In summary, CrackScopeNet outperforms the other
segmentation models, demonstrating excellent crack detection performance under various
noise conditions with lower parameters and FLOPs.
Inference on Navio2-based drones. In practical applications, there remains a substantial
gap between real-time semantic segmentation algorithms as designed and validated on
powerful hardware and their deployment on mobile and edge devices, with the latter facing
challenges such as limited memory resources and low computational efficiency. To better
simulate edge devices used for out-
door structural health monitoring, we explored the inference speed of the models without
GPU acceleration. Notably, to ensure that all models could be deployed on the drone
platform without sacrificing accuracy through pruning or compression, we avoided using
storage-intensive and computationally demanding models such as UNet, SegNet, and
PSPNet. We converted the models to ONNX format and tested their inference speeds on
Navio2-based drones equipped with a representative Raspberry Pi 4B, focusing on compar-
ing our proposed model with models with tiny FLOPs and parameter counts: BiSeNetV2,
DeepLabV3+, STDC, HrSegNetB48, SegFormer, TopFormer, and SeaFormer. The test settings
were as follows: input image size of 3 × 400 × 400, batch size of 1, and 2000 inference runs per model.
To ensure a fair comparison, we did not optimize or prune any models during deployment,
meaning that the actual inference delay in practical applications could be further reduced
from these test results.
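A minimal sketch of this latency protocol using ONNX Runtime on the CPU is shown below; the warm-up count is our own assumption, and the random input stands in for images from the test set:

```python
import time
import numpy as np
import onnxruntime as ort

def mean_latency(onnx_path: str, runs: int = 2000, warmup: int = 50) -> float:
    """Average per-image inference latency of an exported model on the CPU,
    mirroring the test protocol: 3 x 400 x 400 input, batch size 1."""
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 400, 400).astype(np.float32)
    for _ in range(warmup):                  # warm-up runs are not timed
        session.run(None, {input_name: x})
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {input_name: x})
    return (time.perf_counter() - start) / runs  # mean seconds per image
```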

Figure 5. Visual segmentation results of the lightweight segmentation models and our model on the
CrackSeg9k test set.

Figure 6. Visual segmentation results of the crack-specific segmentation models on the CrackSeg9k
test set.

As shown in Figure 7, the test results indicate that when running on a highly resource-
constrained drone platform, the proposed CrackScopeNet architecture achieves faster
inference speed compared to other real-time or lightweight semantic segmentation net-
works based on convolutional neural networks, such as BiSeNet, BiSeNetV2, and STDC.
Additionally, TopFormer and SeaFormer, which are designed with deployment on resource-
limited edge devices in mind, both achieve extremely low inference latency; however, these
models perform poorly on the crack datasets due to inadequate data volume. Our proposed
model achieves remarkable crack segmentation accuracy while maintaining rapid inference
speed, establishing its advantages over competing models.

Figure 7. Results of inference speed test on Navio2-based drones.

These results confirm the efficacy of deploying the CrackScopeNet model on outdoor
mobile devices, where high-speed inference and lightweight architecture are crucial for
real-time processing and analysis of infrastructure surface cracks. By outperforming other
state-of-the-art models, CrackScopeNet proves to be a suitable solution for addressing the
challenges associated with outdoor edge computing.

5.2. Scaling Study


To explore the adaptability of our model, we adjusted the number of channels and
stacked different numbers of CrackScope modules to cater to a broader range of application
scenarios. Because CrackSeg9k is composed of multiple crack datasets, we also investigated
the model’s transferability to specific application scenarios.
We adjusted the base number of channels after the stem from 32 to 64. Correspondingly,
the number of channels in the remaining three feature extraction stages was increased from
(32, 64, 128) to (64, 128, 160) in order to capture more features. Meanwhile, the number of
CrackScope modules stacked in each stage was adjusted from (3, 3, 4) to (3, 3, 3); we refer
to the adjusted model as CrackScopeNet_Large. First, we trained CrackScopeNet_Large on
CrackSeg9k using the same parameter settings as the base version and evaluated the model
on the test set. Furthermore, we used the training parameters and weights obtained from
CrackSeg9k for these two models as the basis for transferring the models to downstream
tasks in two specific scenarios. Images from the Ozgenel dataset, which consist of high-
resolution concrete crack images similar to some scenarios in CrackSeg9k, were cropped
to 448 × 448. The Aerial Track Dataset consists of low-altitude drone-captured images of
post-earthquake highway cracks, a type of scene not present in CrackSeg9k; these were
cropped to 512 × 512.
Table 5 presents the mIoU scores, parameter counts, and FLOPs of the base CrackScopeNet
model and the high-accuracy version CrackScopeNet_Large on the CrackSeg9k dataset
and the two specific scenario datasets. In this table, mIoU(F) represents the mIoU score
obtained after pretraining the model on CrackSeg9k and fine-tuning it on the respective
dataset. It is evident that the large version of the model achieves higher segmentation
accuracy across all datasets, though with approximately double the parameters and three
times the FLOPs. Therefore, if computational resources and memory are sufficient and
higher accuracy in crack segmentation is required, the large version or further stacking of
CrackScope modules can be employed.

Table 5. Evaluation results of the two versions of our model on three different datasets; CSNet and
CSNet_L stand for CrackScopeNet and CrackScopeNet_Large, while mIoU(F) indicates the mIoU
score for models pretrained on the CrackSeg9k dataset.

Model | CrackSeg9k mIoU | Param | FLOPs | Ozgenel mIoU | Ozgenel mIoU(F) | FLOPs | Aerial Track mIoU | Aerial Track mIoU(F) | FLOPs
CSNet | 82.15% | 1.05 M | 1.58 G | 90.05% | 92.11% | 1.98 G | 79.12% | 82.63% | 2.59 G
CSNet_L | 82.48% | 2.20 M | 5.09 G | 90.71% | 92.36% | 6.38 G | 81.04% | 83.43% | 8.33 G

For specific scenario training, whether from scratch or fine-tuning, all our models were
trained for only 20 epochs. It can be seen that the models converge quickly even when
training from scratch. We attribute this phenomenon to the initial design of CrackScopeNet,
which considers the morphology of cracks and is able to successfully capture the necessary
contextual information. For training using transfer learning, both versions of the model
achieve remarkable results on the Ozgenel dataset, with mIoU scores of 90.1% and 92.31%,
respectively. Even for the Aerial Track dataset, which includes low-altitude remote sens-
ing images of highway cracks not seen in CrackSeg9k, both of our models still perform
exceptionally well, achieving respective mIoU scores of 83.26% and 84.11%. These results
demonstrate the proposed model’s rapid adaptability to small datasets, aligning well with
real-world tasks.

5.3. Diagnostic Experiments


To gain more insights into CrackScopeNet, a set of ablative studies were conducted
on our proposed model. All the methods mentioned in this section were trained with the
same parameters for efficiency in 200 epochs.
Stripwise Context Attention. First, we examined the role of the critical SWA module
in CrackScopeNet by replacing it with two advanced attention mechanisms, CBAM [37] and
CA [38]. The results shown in Table 6 demonstrate that without any attention mechanism,
merely stacking convolutional neural networks for feature extraction yields poor perfor-
mance due to the limited receptive field. Next, the SWA attention mechanism based on strip
pooling and one-dimensional convolution was adopted, allowing the network structure to
capture long-range contextual information. Under this configuration, the model exhibited
the best performance. Figure 8 shows the class activation maps (CAM) [59] before the seg-
mentation head of CrackScopeNet. It can be observed that the model without SWA is easily
disturbed by shadows, whereas with the SWA module the model can focus on the global
crack areas. Next, we sequentially replaced the SWA module with the channel–spatial
feature-based CBAM attention mechanism and the coordinate attention (CA) mechanism,
which also uses strip pooling. While the model parameters did not change significantly, the
performance declined by 0.2% and 0.17%, respectively.
Furthermore, we explored the benefits of different attention mechanisms for other mod-
els by optimizing the advanced HrSegNetB48 lightweight crack segmentation network [29].
HrSegNetB48 consists of high-resolution and auxiliary branches, merging shallow detail
information with deep semantic information at each stage. Therefore, we added the SWA,
CBAM, and CA attention mechanisms after feature fusion to capture richer features. Table 7
shows the performance of HrSegNetB48 with the different attention mechanisms, clearly
indicating that introducing the SWA attention mechanism to capture long-range contextual
information provides the most significant benefit.

Table 6. Ablation study on the effectiveness of each component in CrackScopeNet.

Multi-Branch | SWA | CA | CBAM | Decoder: Ours | Decoder: ASPP | mIoU (%) | FLOPs (G)
✓ | | | | ✓ | | 81.34 | 1.57
✓ | | ✓ | | ✓ | | 81.98 | 1.58
✓ | | | ✓ | ✓ | | 81.95 | 1.58
| ✓ | | | ✓ | | 81.91 | 1.61
✓ | ✓ | | | | ✓ | 82.14 | 2.89
✓ | ✓ | | | ✓ | | 82.15 | 1.58

Figure 8. Visual explanations of the different components of CrackScopeNet.

Table 7. Results when adding different attention mechanisms to HrSegNet.

Model | mIoU (%) | Pr (%) | Re (%) | F1 (%) | Params | FLOPs
HrSegNetB48 | 81.07 | 90.39 | 86.78 | 88.55 | 5.43 M | 5.59 G
HrSegNetB48+CBAM | 81.16 | 90.40 | 86.90 | 88.61 | 5.44 M | 5.60 G
HrSegNetB48+CA | 81.20 | 90.24 | 87.08 | 88.63 | 5.44 M | 5.60 G
HrSegNetB48+SWA | 81.72 | 89.65 | 88.33 | 88.98 | 5.48 M | 5.60 G

Multiscale Branch. Next, we examined the effect of the multiscale branch in the
CrackScope module. To ensure fairness, we replaced the multiscale branch with a con-
volution of a larger kernel size (5 × 5 instead of 3 × 3). The results with and without
the multiscale branch are shown in Table 6. It is evident that using a 5 × 5 kernel size
convolution instead of the multiscale branch, decreases the mIoU score (−0.16%) despite
having more floating-point computations. This demonstrates that blindly adopting large
kernel convolutions increases computational overhead without significant performance
improvements. The benefits conferred by the multiscale branch were further analyzed
through the CAM. As shown in the third column of Figure 8, when the multiscale branch
is not used, it is obvious that the network misses the feature information of small cracks,
while the model with this branch can perfectly capture the features of cracks with various
shapes and sizes.
Decoder. CrackScopeNet uses a simple decoder to fuse feature information at different
scales, compressing the channel features and merging features from different stages. At
present, the most popular decoders use an atrous spatial pyramid pooling (ASPP) [23]
module to introduce multiscale information. In order to explore whether the introduction
of an ASPP module could benefit our model and to investigate the effectiveness of our
proposed lightweight decoder, we replaced our decoder with the ASPP method adopted by
DeepLabV3+ [23]. The results are shown in the last two rows of Table 6. It can be seen
that the computational overhead is large because of the need to perform parallel dilated
convolution operations on deep semantic information; however, the performance of the
model does not improve. This shows that using multiple sets of dilated convolutions to
capture multiscale features incurs additional computational overhead while not contributing
to a performance improvement in our model.

5.4. Experiment Conclusions


Based on the comparative experiments conducted in previous sections, CrackScopeNet
demonstrates significant advantages over both classical and lightweight semantic segmen-
tation models in terms of performance, parameter count, and FLOPs. On the composite
CrackSeg9k dataset, CrackScopeNet achieves high segmentation accuracy and shows excel-
lent transferability to specific scenarios. Notably, it maintains a low parameter count and
minimal FLOPs, which translates to low-latency inference speeds on resource-constrained
drone platforms without the need for GPU acceleration. This efficiency is achieved by con-
sidering crack morphology characteristics, allowing CrackScopeNet to remain lightweight
and computationally efficient. This makes it particularly suitable for deployment on mobile
devices in outdoor environments. In summary, CrackScopeNet achieves a better balance
between segmentation accuracy and inference speed compared to the other networks exam-
ined in this study, making it a promising solution for timely crack detection and analysis
of infrastructure surfaces using drones.
However, this study has some limitations. The inference speed and performance were not
tested on other hardware platforms, such as the Snapdragon 865, which may offer different
computational capabilities. In addition, we did not explore the potential acceleration
that NPUs (neural processing units) or GPUs could provide; further investigation into how
these processing units can be fully utilized could yield significant improvements in the
efficiency and performance of the model.

6. Discussion
In this paper, we present CrackScopeNet, a lightweight infrastructure surface crack
segmentation network specifically designed to address the challenges posed by varying
crack sizes, irregular contours, and subtle differences between cracks and normal regions
in real-world applications. The proposed network structure captures the local context
information and long-distance dependencies of cracks through a lightweight multiscale
branch and an SWA attention mechanism, respectively, and effectively extracts the low-level
details and high-level semantic information required for accurate crack segmentation.
In this work, we find that using channel-wise partitioning to apply different kernel
sizes effectively captures multiscale features without introducing significant computational
overhead. Additionally, by incorporating an attention mechanism that accounts for long-
range dependencies, it is possible to compensate for the limitations of downsampling
without resorting to additional detail branches, which would otherwise increase compu-
tational demands. Our experimental results demonstrate that CrackScopeNet delivers
robust performance and high accuracy. It outperforms larger models like SegFormer in
terms of efficiency, significantly reducing the number of parameters and computational cost.
Furthermore, our method achieves faster inference speeds than other lightweight models
such as BiSeNet and STDC, even in the absence of GPU acceleration. This performance
makes it highly suitable for deployment on resource-constrained drone platforms, enabling
efficient and low-latency crack detection in structural health monitoring. By making the
model and code publicly available, we aim to advance the application of UAV remote
sensing technology in infrastructure maintenance, providing an efficient and practical tool
for the timely detection and analysis of cracks.
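As a practical note, parameter counts and CPU inference latencies of the kind compared in this work can be estimated with a few lines of PyTorch. The sketch below is illustrative only: the CrackScopeNet constructor and the 400 × 400 input resolution are assumptions, and FLOPs are normally measured with an external profiler (e.g., fvcore or ptflops) rather than by hand.

```python
import time
import torch

def profile(model, input_size=(1, 3, 400, 400), runs=50):
    """Rough CPU profiling sketch: parameter count (millions) and average
    per-image latency (milliseconds). Input resolution is an assumption."""
    params = sum(p.numel() for p in model.parameters())
    model.eval()
    x = torch.randn(*input_size)
    with torch.no_grad():
        for _ in range(5):  # warm-up iterations before timing
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        latency = (time.perf_counter() - start) / runs
    return params / 1e6, latency * 1000

# Example (hypothetical constructor):
# params_m, ms = profile(CrackScopeNet())
```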
Furthermore, utilizing UAVs to monitor crack development in geological disaster
scenarios can greatly aid early-warning efforts. CrackScopeNet, having proven effective in
infrastructure crack detection, has the potential to be adapted to these contexts through
domain adaptation. We have undertaken preliminary investigations by capturing images
of hazardous rock formations with UAVs and using our model to extract crack regions, as
illustrated in Figure 9. These environments present more intricate crack patterns, including
various damage types and complex curved damage. Our approach currently exhibits limitations
in detecting fine cracks, particularly those that blend into the background. Our future work
will focus on enhancing the model's sensitivity and capacity in order to accurately identify
smaller and more complex crack patterns under challenging conditions, especially in
geological disaster monitoring.

Figure 9. Application of CrackScopeNet for crack detection in dangerous rock masses.

Lastly, in this era of large models, our model has been trained and evaluated only on
datasets containing a few thousand images; the need for large-scale data collection and
manual labeling remains a bottleneck. Recent advances in generative AI and self-supervised
learning could bypass the limitations imposed by data acquisition and manual annotation.
Researchers can exploit the inherent structure or attributes of existing data to generate
richer synthetic images and synthetic labels, a promising research avenue that could be
applied to crack detection.

Author Contributions: T.Z. designed the architecture and comparative experiments, and wrote the
manuscript; L.Q. revised the manuscript and assisted T.Z. in conducting the experiments; Q.Z. and
H.Z. made suggestions for the experiments and assisted in revising the manuscript; L.Z. and R.W.
conducted investigation and code testing. All authors have read and agreed to the published version
of the manuscript.
Funding: This research was funded by research on identification and variation analysis methods
for rock fractures, development of a real-time monitoring model for falling rocks based on machine
vision, research project on hazard warning algorithm, and terminal equipment for rock collapse based
on vibration data of Chongqing Institute of Geology and Mineral Resources, grant numbers F2023304,
F2023045, and cstc2022jxjl00006. This work was supported by the 2024 Key Technology Project of
Chongqing Municipal Education Commission, grant number KJZD-K202400204.
Data Availability Statement: The code and data that support the findings of this study are available
on GitHub at https://github.com/ttkingzz/CrackScopeNet, accessed on 5 July 2024.
Acknowledgments: The authors would like to thank the editors and reviewers for their valuable suggestions.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Minh Dang, L.; Wang, H.; Li, Y.; Nguyen, L.Q.; Nguyen, T.N.; Song, H.K.; Moon, H. Deep learning-based masonry crack
segmentation and real-life crack length measurement. Constr. Build. Mater. 2022, 359, 129438. [CrossRef]
2. Zheng, M.; Lei, Z.; Zhang, K. Intelligent detection of building cracks based on deep learning. Image Vis. Comput. 2020, 103, 103987.
[CrossRef]
3. Ha, J.; Kim, D.; Kim, M. Assessing severity of road cracks using deep learning-based segmentation and detection. J. Supercomput.
2022, 78, 17721–17735. [CrossRef]
4. Zhang, J.; Qian, S.; Tan, C. Automated bridge surface crack detection and segmentation using computer vision-based deep
learning model. Eng. Appl. Artif. Intell. 2022, 115, 105225. [CrossRef]
5. Deng, J.; Singh, A.; Zhou, Y.; Lu, Y.; Lee, V.C.S. Review on computer vision-based crack detection and quantification methodologies
for civil structures. Constr. Build. Mater. 2022, 356, 129238. [CrossRef]
6. Gavilán, M.; Balcones, D.; Marcos, O.; Llorca, D.F.; Sotelo, M.A.; Parra, I.; Ocaña, M.; Aliseda, P.; Yarza, P.; Amírola, A. Adaptive
Road Crack Detection System by Pavement Classification. Sensors 2011, 11, 9628–9657. [CrossRef]
7. Jahanshahi, M.R.; Jazizadeh, F.; Masri, S.F.; Becerik-Gerber, B. Unsupervised Approach for Autonomous Pavement-Defect Detection and
Quantification Using an Inexpensive Depth Sensor; American Society of Civil Engineers: Reston, VA, USA, 2012.
8. Zhang, D.; Zou, Q.; Lin, H.; Xu, X.; He, L.; Gui, R.; Li, Q. Automatic pavement defect detection using 3D laser profiling technology.
Autom. Constr. 2018, 96, 350–365. [CrossRef]
9. Iyer, S.; Sinha, S.K. Segmentation of Pipe Images for Crack Detection in Buried Sewers. Comput.-Aided Civ. Infrastruct. Eng. 2006,
21, 395–410. [CrossRef]
10. Sun, B.C.; Qiu, Y.J. Automatic Identification of Pavement Cracks Using Mathematic Morphology. In Proceedings of the First
International Conference on Transportation Engineering, Chengdu, China, 22–24 July 2007.
11. Kamaliardakani, M.; Sun, L.; Ardakani, M.K. Sealed-Crack Detection Algorithm Using Heuristic Thresholding Approach.
J. Comput. Civ. Eng. 2016, 30, 04014110. [CrossRef]
12. Mohan, A.; Poobal, S. Crack detection using image processing: A critical review and analysis. Alex. Eng. J. 2018, 57, 787–798.
[CrossRef]
13. Qu, Z.; Lin, L.D.; Guo, Y.; Wang, N. An improved algorithm for image crack detection based on percolation model. Comput.-Aided
Civ. Infrastruct. Eng. 2015, 10, 214–221. [CrossRef]
14. Cha, Y.J.; Ali, R.; Lewis, J.; Büyüköztürk, O. Deep learning-based structural health monitoring. Autom. Constr. 2024, 161, 105328.
[CrossRef]
15. Liu, Z.; Cao, Y.; Wang, Y.; Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks.
Autom. Constr. 2019, 104, 129–139. [CrossRef]
16. Yang, J.; Wang, W.; Lin, G.; Li, Q.; Sun, Y.; Sun, Y. Infrared Thermal Imaging-Based Crack Detection Using Deep Learning. IEEE
Access 2019, 7, 182060–182077. [CrossRef]
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
19. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic
Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.:
New York, NY, USA, 2021; Volume 34, pp. 12077–12090.
20. Lin, Q.; Li, W.; Zheng, X.; Fan, H.; Li, Z. DeepCrackAT: An effective crack segmentation framework based on learning multi-scale
crack features. Eng. Appl. Artif. Intell. 2023, 126, 106876. [CrossRef]
21. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement
Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [CrossRef]
22. Chu, H.; Wang, W.; Deng, L. Tiny-Crack-Net: A multiscale feature fusion network with attention mechanisms for segmentation
of tiny cracks. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1914–1931. [CrossRef]
23. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image
segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 801–818.
24. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic
Segmentation. arXiv 2022, arXiv:2209.08575.
25. Duan, Z.; Liu, J.; Ling, X.; Zhang, J.; Liu, Z. ERNet: A Rapid Road Crack Detection Method Using Low-Altitude UAV Remote
Sensing Images. Remote Sens. 2024, 16, 1741. [CrossRef]
26. Forcael, E.; Román, O.; Stuardo, H.; Herrera, R.F.; Soto-Muñoz, J. Evaluation of Fissures and Cracks in Bridges by Applying
Digital Image Capture Techniques Using an Unmanned Aerial Vehicle. Drones 2024, 8, 8. [CrossRef]
27. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic
Segmentation. arXiv 2016, arXiv:1606.02147.
28. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation.
In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
29. Li, Y.; Ma, R.; Liu, H.; Gaoli, C. Real-time high-resolution neural network with semantic guidance for crack segmentation. Autom.
Constr. 2023, 156, 105112. [CrossRef]
30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the
Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.,
Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [CrossRef]
31. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
32. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
33. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in cnns. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975.
34. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. TopFormer: Token Pyramid Transformer for Mobile
Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093.
35. Wan, Q.; Huang, Z.; Lu, J.; Yu, G.; Zhang, L. SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation.
In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
38. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
39. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. arXiv 2022,
arXiv:2211.12905.
40. Kulkarni, S.; Singh, S.; Balakrishnan, D.; Sharma, S.; Devunuri, S.; Korlapati, S.C.R. CrackSeg9k: A Collection and Benchmark
for Crack Segmentation Datasets and Frameworks. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Karlinsky, L.,
Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 179–195.
41. Dais, D.; Bal, E.; Smyrou, E.; Sarhosis, V. Automatic crack classification and segmentation on masonry surfaces using convolutional
neural networks and transfer learning. Autom. Constr. 2021, 125, 103606. [CrossRef]
42. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell.
Transp. Syst. 2016, 17, 3434–3445. [CrossRef]
43. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett.
2012, 33, 227–238. [CrossRef]
44. Pak, M.; Kim, S. Crack Detection Using Fully Convolutional Network in Wall-Climbing Robot. In Advances in Computer Science
and Ubiquitous Computing; Park, J.J.; Fong, S.J.; Pan, Y.; Sung, Y., Eds.; Springer: Singapore, 2021; pp. 267–272.
45. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation.
Neurocomputing 2019, 338, 139–153. [CrossRef]
46. Junior, G.S.; Ferreira, J.; Millán-Arias, C.; Daniel, R.; Junior, A.C.; Fernandes, B.J.T. Ceramic Cracks Segmentation with Deep
Learning. Appl. Sci. 2021, 11, 6017. [CrossRef]
47. Dorafshan, S.; Thomas, R.J.; Maguire, M. SDNET2018: An annotated image dataset for non-contact concrete crack detection using
deep convolutional neural networks. Data Brief 2018, 21, 1664–1668. [CrossRef]
48. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to
get pavement distress detection ready for deep learning? A systematic approach. In Proceedings of the 2017 International Joint
Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2039–2047. [CrossRef]
49. Özgenel, F. Concrete Crack Segmentation Dataset. Mendeley Data 2019. [CrossRef]
50. Hong, Z.; Yang, F.; Pan, H.; Zhou, R.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S.; Chen, P.; Tong, X.; et al. Highway Crack Segmentation
From Unmanned Aerial Vehicle Images Using Deep Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [CrossRef]
51. Liu, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Lai, B.; Hao, Y. PaddleSeg: A High-Efficient Development Toolkit for Image
Segmentation. arXiv 2021, arXiv:2101.06175.
52. Shi, P.; Zhu, F.; Xin, Y.; Shao, S. U2CrackNet: A deeper architecture with two-level nested U-structure for pavement crack
detection. Struct. Health Monit. 2023, 22, 2910–2921. [CrossRef]
53. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef] [PubMed]
54. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time
Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [CrossRef]
55. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet for Real-Time Semantic Segmentation. In Proceedings
of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021;
pp. 9716–9725.
56. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset
for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
57. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
58. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
59. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. arXiv 2015,
arXiv:1512.04150.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
