Article
CrackScopeNet: A Lightweight Neural Network for Rapid Crack
Detection on Resource-Constrained Drone Platforms
Tao Zhang, Liwei Qin, Quan Zou, Liwen Zhang, Rongyi Wang and Heng Zhang *
Abstract: Detecting cracks during structural health monitoring is crucial for ensuring infrastructure
safety and longevity. Using drones to obtain crack images and automate processing can improve the
efficiency of crack detection. To address the challenges posed by the limited computing resources
of edge devices in practical applications, we propose CrackScopeNet, a lightweight segmentation
network model that simultaneously considers local and global crack features while being suitable for
deployment on drone platforms with limited computational power and memory. This novel network
features a multi-scale branch to improve sensitivity to cracks of varying sizes without substantial
computational overhead, along with a stripe-wise context attention mechanism to enhance the capture
of long-range contextual information while mitigating the interference from complex backgrounds.
Experimental results on the CrackSeg9k dataset demonstrate that our method leads to a significant
improvement in prediction performance, with the highest mean intersection over union (mIoU)
score reaching 82.15%, while maintaining a lightweight architecture with only 1.05 M parameters and 1.58 G
floating point operations (FLOPs). In addition, the proposed model excels in inference speed on edge
devices without a GPU thanks to its low FLOPs. CrackScopeNet contributes to the development of
efficient and effective crack segmentation networks suitable for practical structural health monitoring
applications using drone platforms.
Keywords: computer vision; crack detection; drone platforms; semantic segmentation; lightweight neural network

1. Introduction

Cracks serve as early indicators of structural damage in buildings, bridges, and roads, making their detection vital for structural health monitoring. Analyzing the morphological characteristics, positional information, and extent of internal damage in cracks allows for accurate safety assessments of buildings and infrastructure [1,2]. Timely detection and repair of cracks not only reduces maintenance costs but also prevents further structural deterioration, ensuring safety and durability [3,4].

Traditional crack detection methods such as visual inspections and manual evaluations are often costly and inefficient, relying heavily on the expertise of inspectors, which can lead to subjective and inconsistent results [5]. Therefore, the development of objective and efficient automated crack detection methods has become a significant trend in this field. Various sensor-based methods for automatic or semi-automatic crack detection have been proposed, including crack meters, RGB-D sensors, and laser scanners [6–8]. Although these sensors are accurate, they are expensive and challenging to deploy on large scales.

Advancements in computer vision technology have popularized image-based crack detection methods due to their long-distance, non-contact, and cost-effective nature. Traditional visual detection methods such as morphological image processing [9,10], filtering [11,12], and percolation models [13] are simple to implement and computationally light, but suffer from limited generalization performance. Environmental noise such as debris around cracks further complicates detection in practical engineering environments.
Figure 1. Comparison between classical and lightweight semantic segmentation networks and
CrackScopeNet on the CrackSeg9k dataset.
Initially, in the local feature extraction stage, we divide the feature channels and perform
three convolution operations with different kernel sizes to obtain the local context
information of cracks. Subsequently, we utilize a combination of strip pooling and
one-dimensional convolution to capture remote context information without compressing
channel features. Finally, we construct a lightweight multiscale feature fusion module to
aggregate shallow detail and deep semantic information. In these modules, we employ
depthwise separable convolution, dropout, and residual connection structures to prevent
overfitting and vanishing or exploding gradients, resulting in a lightweight neural
network that is well suited to crack detection.
In summary, our main contributions are as follows.
(1) We propose a novel crack image segmentation model called CrackScopeNet de-
signed to meet the requirements of structural health monitoring. The model effectively
extracts information at multiple levels during the downsampling stage and fuses key
features during the upsampling stage.
(2) We introduce a lightweight multiscale branch module and stripwise context atten-
tion module designed to align with the morphological characteristics of fractures. These
components effectively capture rich contextual information while minimizing computa-
tional costs. Compared to the previous lightweight crack segmentation model HrSegNetB48,
our approach uses approximately 5.2 times less memory and achieves 1.7 times faster inference.
(3) CrackScopeNet demonstrates state-of-the-art performance on the CrackSeg9k
dataset, and exhibits excellent transferability to small datasets in specific scenarios; ad-
ditionally, the model has low inference latency on resource-constrained drone platforms,
making it ideal for outdoor crack detection through computer vision. This ensures that
drone platforms can perform rapid crack detection and analysis, enhancing the efficiency
and effectiveness of structural health monitoring.
To facilitate further research and application, the code for this work is available on
GitHub (at https://github.com/ttkingzz/CrackScopeNet, accessed on 5 July 2024).
2. Related Work
Deep learning-based semantic segmentation has significantly advanced crack detection.
Cutting-edge research in this field primarily explores three key areas: achieving higher recogni-
tion accuracy, increasing inference speed, and developing more efficient attention mechanisms.
This section discusses related work in crack segmentation across these three aspects.
3. Methods
To achieve a high-performance lightweight crack segmentation model, we introduce
CrackScopeNet, which is characterized by two main features: (1) partitioning the feature map
channels for convolutions at different scales to extract local multiscale information without
incurring excessive computational overhead, and (2) reducing the downsampling rate to
1/16 without using additional auxiliary branches, instead incorporating a stripwise attention
mechanism tailored to crack morphology in order to capture long-range dependencies.
For a convolutional layer with an output feature map of spatial size H × W, input channels $C_{in}$,
output channels $C_{out}$, and a convolution kernel of size $k_m \times k_n$, the FLOPs calculation
formula is expressed as follows:

$$\mathrm{FLOPs} = k_m \cdot k_n \cdot C_{in} \cdot C_{out} \cdot H \cdot W \qquad (1)$$
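Plugging kernel sizes into this formula makes the cost comparisons used in this section concrete; the channel and spatial terms cancel in the ratios:

$$\frac{\mathrm{FLOPs}_{7\times 7}}{\mathrm{FLOPs}_{3\times 3}} = \frac{7 \cdot 7 \cdot C_{in} C_{out} H W}{3 \cdot 3 \cdot C_{in} C_{out} H W} = \frac{49}{9} \approx 5.4, \qquad \frac{\mathrm{FLOPs}_{(1\times 7,\,7\times 1)}}{\mathrm{FLOPs}_{7\times 7}} = \frac{7 + 7}{49} = \frac{2}{7}$$

That is, a 7 × 7 kernel costs just over five times as much as a 3 × 3 kernel of equal width, while a (1 × 7, 7 × 1) strip pair retains the 7 × 7 receptive field at under a third of its cost.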
As shown in Figure 2b, to reduce computational overhead, we adjust the input and
output channels as well as the convolution kernel sizes for the multiscale branches. First,
we divide the input features into three parts along the channels, allocating half the channels
to the branch with the smallest kernel (3 × 3) and a quarter of the channels to each of
the two branches with larger kernels. Among these, as in ConvNext [32], the largest
convolution kernel that we use is 7 × 7. After the convolution computations are completed
in the three branches, the features are then merged along the channel dimension and a
1 × 1 convolution is used to model the relationships between all channels.
However, compared to 3 × 3 convolutions, these 7 × 7 large-kernel convolutions incur
more than five times the computational cost. In order to further reduce the computational
cost, we employ strip convolutions in our branch design to achieve the same receptive
field while being more computationally lightweight [24]. Because cracks are predomi-
nantly strip-shaped, strip convolutions are particularly effective in capturing these features.
Therefore, (1 × 5, 5 × 1) and (1 × 7, 7 × 1) strip convolutions are used to replace 5 × 5
and 7 × 7 2D convolutions for capturing local contextual information. Then, we design a
remote context information attention module to assist the CrackScope module in obtaining
global contextual information, which is introduced below.
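Before moving to that attention module, the following is a minimal PyTorch sketch of the channel-split multiscale branch described above. The layer names, the use of depthwise convolutions in each branch, and the omission of normalization and activation layers are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Sketch: half the channels take a 3x3 conv; each remaining quarter
    takes a (1x5, 5x1) or (1x7, 7x1) strip-conv pair that approximates a
    5x5 or 7x7 receptive field at lower cost. Assumes channels % 4 == 0."""

    def __init__(self, channels: int):
        super().__init__()
        ch, cq = channels // 2, channels // 4
        self.b3 = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.b5 = nn.Sequential(
            nn.Conv2d(cq, cq, (1, 5), padding=(0, 2), groups=cq),
            nn.Conv2d(cq, cq, (5, 1), padding=(2, 0), groups=cq))
        self.b7 = nn.Sequential(
            nn.Conv2d(cq, cq, (1, 7), padding=(0, 3), groups=cq),
            nn.Conv2d(cq, cq, (7, 1), padding=(3, 0), groups=cq))
        self.fuse = nn.Conv2d(channels, channels, 1)  # mixes all channels after the merge

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ch, cq = x.shape[1] // 2, x.shape[1] // 4
        x3, x5, x7 = torch.split(x, [ch, cq, cq], dim=1)
        out = torch.cat([self.b3(x3), self.b5(x5), self.b7(x7)], dim=1)
        return self.fuse(out)
```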
Further, to avoid the issue of compressed channel numbers that can arise in CBAM [37]
and CA [38], we apply one-dimensional depthwise separable convolutions to model the re-
lationships across different spatial dimensions and channels. The attention representations
in the horizontal and vertical directions are denoted as follows:
$$y^h = \delta\left(F_2(F_1(F_0(Z^h)))\right), \qquad y^w = \delta\left(F_2(F_1(F_0(Z^w)))\right) \qquad (3)$$
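A hedged sketch of how Equation (3) can be realized follows: strip pooling averages the feature map along one spatial axis, and one-dimensional depthwise separable convolutions (standing in for F0–F2, whose exact composition and kernel size are our assumptions) produce sigmoid attention maps without compressing the channel dimension.

```python
import torch
import torch.nn as nn

class StripeWiseAttention(nn.Module):
    """Sketch of SWA: strip pooling + 1D depthwise separable convs,
    keeping the full channel count (unlike CBAM/CA reduction)."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels),  # F0: depthwise
                nn.Conv1d(channels, channels, 1),                                   # F1: pointwise
                nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels))  # F2: depthwise
        self.f_h, self.f_w = branch(), branch()
        self.delta = nn.Sigmoid()  # δ in Equation (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = x.mean(dim=3)  # strip pooling over width  -> (B, C, H)
        z_w = x.mean(dim=2)  # strip pooling over height -> (B, C, W)
        y_h = self.delta(self.f_h(z_h)).view(b, c, h, 1)
        y_w = self.delta(self.f_w(z_w)).view(b, c, 1, w)
        return x * y_h * y_w  # reweight along both axes; channels stay intact
```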
The decoder is responsible for feature selection and fusion. Notably, it focuses primarily on
fine-tuning the details of the feature map results, allowing for a relatively simple design. As
such, we do not use atrous spatial pyramid pooling [23] to extract multiscale features of
high-level semantic information. On
the one hand, using many dilated convolutions adds unnecessary computational overhead
and increases network complexity. On the other hand, our subsequent ablation experiments
demonstrate that further extraction of multiscale information from feature maps does not
enhance performance.
As shown in Figure 2d, the high-level semantic information is adjusted through
pointwise convolution operations, after which bilinear interpolation is used to restore the
feature map size for concatenation with features from the lower stages. Subsequently, in
order to further fuse high-level semantic features with detailed texture features, we employ
an SWA module with a shortcut connection to model feature relationships across the global
space and channels while fully integrating the multiscale feature maps. Next, a small kernel
convolution is used to refine the crack feature information. After multiscale fusion of the
three-stage feature maps, they are fed into the segmentation head, which maps the feature
map to the required segmentation output, completing the entire network computation
process. Notably, to avoid the large computational overhead incurred by the decoder, we
do not use transposed convolutions to learn more parameters; instead, we only select and
fuse the features, resulting in a lightweight multiscale feature fusion module.
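The sketch below illustrates this fusion step for one pair of stages; the channel widths are illustrative, and a shortcut-connected sigmoid gate stands in for the SWA module described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightFusionDecoder(nn.Module):
    """Sketch of the lightweight decoder: pointwise convs align channels,
    bilinear upsampling restores resolution, a gated shortcut fuses the
    concatenated maps, and a small-kernel conv refines crack features.
    No transposed convolutions are used."""

    def __init__(self, c_low: int, c_high: int, c_out: int, n_classes: int = 2):
        super().__init__()
        self.reduce_high = nn.Conv2d(c_high, c_out, 1)   # pointwise channel adjustment
        self.reduce_low = nn.Conv2d(c_low, c_out, 1)
        self.gate = nn.Sequential(nn.Conv2d(2 * c_out, 2 * c_out, 1), nn.Sigmoid())
        self.refine = nn.Conv2d(2 * c_out, c_out, 3, padding=1)  # small-kernel refinement
        self.head = nn.Conv2d(c_out, n_classes, 1)               # segmentation head

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        high = F.interpolate(self.reduce_high(high), size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        x = torch.cat([self.reduce_low(low), high], dim=1)
        x = x * self.gate(x) + x   # attention fusion with a shortcut connection
        return self.head(self.refine(x))
```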
Figure 3. Samples from three crack datasets. The first row shows the original images, while the
second shows the masks overlaid on the originals: samples from the CrackSeg9k dataset (a), the
Ozgenel dataset (b), and the Aerial Track dataset (c).
In addition to CrackSeg9k, we also used two specific-scene crack datasets: the close-
range concrete crack dataset Ozgenel [49] and the low-altitude UAV-captured highway
crack dataset Aerial Track [50], allowing us to further explore the generalization abilities of
CrackScopeNet. Among these, the image scenes in the Ozgenel dataset are similar to the
rock crack scenes in CrackSeg9k, while the Aerial Track dataset includes post-earthquake
highway crack images captured by UAVs, featuring predominantly small cracks amid
significant interference. As examples, two randomly selected images from these two
datasets are displayed in Figure 3b,c.
The Ozgenel dataset originally consists of 458 high-definition images (4032 × 3024 pixels)
with annotated cracks collected from various buildings at Middle East Technical University.
For our experiments, we cropped these images into 448 × 448 pixel blocks while ensuring a
crack area of at least 1% in each block. This process yielded 2600 images, which we divided
into 70% for training, 10% for validation, and 20% for testing.
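As an illustration of this preprocessing, the sketch below tiles an image/mask pair into 448 × 448 blocks and keeps blocks with at least 1% crack pixels; non-overlapping tiles and the file naming are assumptions, as the paper does not specify the tiling stride.

```python
import numpy as np
from PIL import Image
from pathlib import Path

def crop_with_min_crack_area(img_path, mask_path, out_dir, size=448, min_ratio=0.01):
    """Tile an image into size x size blocks, keeping blocks whose binary
    mask (nonzero = crack) contains at least min_ratio crack pixels."""
    img = np.array(Image.open(img_path))
    mask = np.array(Image.open(mask_path))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    h, w = mask.shape[:2]
    kept = 0
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            block_mask = mask[top:top + size, left:left + size]
            if (block_mask > 0).mean() >= min_ratio:  # at least 1% crack area
                block = img[top:top + size, left:left + size]
                Image.fromarray(block).save(out / f"{Path(img_path).stem}_{top}_{left}.png")
                kept += 1
    return kept
```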
The Aerial Track dataset comprises 4118 highway crack images (448 × 448 pixels)
captured by UAVs after an earthquake. The dataset is divided into three parts: training,
validation, and testing, with 2620, 598, and 900 images, respectively. We transferred our
models trained on CrackSeg9k to these two specific tasks.
Where necessary, the batch size and initial learning rate were halved while keeping the number of
epochs unchanged, ensuring similar training effects. The main training settings are summarized in
the following table.
Item                       Setting
Epochs                     200
Batch size                 16
Optimizer                  AdamW
Weight decay               0.01
Beta1                      0.9
Beta2                      0.999
Initial learning rate      0.005
Learning rate decay type   poly
GPU memory                 12 GB
Image size                 400 × 400
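For concreteness, the optimizer and "poly" schedule listed above can be set up as in the following sketch; the decay exponent of 0.9 is a common default and is our assumption, as the table does not state it.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 3)  # stand-in module; substitute the real network

# AdamW with the listed settings: lr 0.005, betas (0.9, 0.999), weight decay 0.01
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005,
                              betas=(0.9, 0.999), weight_decay=0.01)

def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """'Poly' decay: the learning rate shrinks toward 0 over max_iter steps."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

max_iter = 1000  # illustrative; in practice, epochs * iterations per epoch
for step in range(max_iter):
    for group in optimizer.param_groups:  # update once per training iteration
        group["lr"] = poly_lr(0.005, step, max_iter)
    # ... forward pass, loss, backward pass, optimizer.step() ...
```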
In addition to CrackSeg9k, we transferred the models to the Ozgenel [49] and Aerial
Track [50] datasets. During fine-tuning, we reduced the learning rate to 0.0001, limited the
epochs to 20, and adjusted the batch size to 8. The input images were cropped to 448 × 448
for Ozgenel and 512 × 512 for Aerial Track.
$$\mathrm{Pr} = \frac{TP}{TP + FP} \qquad (6)$$

$$\mathrm{Re} = \frac{TP}{TP + FN} \qquad (7)$$

$$F1 = \frac{2 \times \mathrm{Pr} \times \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \qquad (8)$$

$$\mathrm{mIoU} = \mathrm{mean}\!\left(\frac{TP}{TP + FN + FP}\right) \qquad (9)$$
where true positive (TP) represents correctly classified crack pixels, false positive (FP)
represents background pixels incorrectly classified as crack categories, and false negative
(FN) represents cracks incorrectly identified as background.
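For reference, these metrics can be computed from binary prediction and ground-truth masks as in the sketch below; for the two-class crack/background task, the mean in Equation (9) is taken over both classes, which is the convention assumed here.

```python
import numpy as np

def crack_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-10):
    """Pr, Re, F1, and mIoU from binary masks (1 = crack, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    pr = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    f1 = 2 * pr * re / (pr + re + eps)
    iou_crack = tp / (tp + fp + fn + eps)
    iou_bg = tn / (tn + fp + fn + eps)   # background IoU enters the mean
    return pr, re, f1, (iou_crack + iou_bg) / 2
```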
In addition, we evaluated the computational cost and complexity of the model using
the number of floating point operations (FLOPs) and the number of parameters. We also used
the average inference latency of a single image deployed on the Navio2-based drone to evaluate
the inference speed of the lightweight models. A lightweight model suitable for drone
platforms requires a low parameter count, low FLOPs, and low inference latency.
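Parameter counts and FLOPs of this kind are typically obtained with a profiler; the sketch below uses the third-party thop package on a stand-in module, which is our tooling choice for illustration rather than necessarily the authors'.

```python
import torch
import torch.nn as nn
from thop import profile  # pip install thop

# Stand-in model; substitute the network under evaluation.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 2, 1))
dummy = torch.randn(1, 3, 400, 400)  # the training resolution used above

macs, params = profile(model, inputs=(dummy,))
# thop counts multiply-accumulate operations (MACs); some papers report
# FLOPs = 2 * MACs, so check the convention before comparing numbers.
print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")
```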
5. Experiment
In this section, we first conduct a comprehensive quantitative comparison between
CrackScopeNet and the most advanced segmentation models in various metrics, visualize
the results, and comprehensively analyze the detection performance. Subsequently, we
explore the transfer learning capability of our model on crack datasets specific to other
scenarios. Finally, we perform ablation studies to meticulously examine the significance
and impact of each component within CrackScopeNet.
Quantitative Results. Table 4 presents the performance of each baseline network and
the proposed CrackScopeNet on the CrackSeg9k dataset, with the best values highlighted
in bold. Analyzing the accuracy of different types of segmentation networks in the table,
the larger models generally achieve higher mIoU scores than the lightweight models;
specifically, compared to the classical high-accuracy models, our model achieves the best
performance in terms of mIoU, recall, and F1 score, with scores of 82.15%, 89.24%, and
89.29%, respectively. Although our model's precision (89.34%) is 1.26% lower than that of U-Net,
U-Net's recall is 2.24% lower, and our model's parameters and FLOPs are 12 and 48 times lower,
respectively.
Table 4. Performance of different methods and our method on the CrackSeg9k dataset.
In terms of model size, the network proposed in this paper achieves the best
balance between accuracy and weight on the CrackSeg9k dataset, as intuitively illustrated
in Figure 1. Our model achieves the highest mIoU with only 1.05 M parameters and
1.58 G FLOPs, making it extremely lightweight. Its FLOPs are slightly higher than those
of TopFormer and SeaFormer, but lower than all other small models; notably, due to
the small size of the crack dataset, the learning capability of lightweight segmentation
networks is evidently limited, as mainstream lightweight segmentation models do not
consider the unique characteristics of cracks, resulting in poor performance. The proposed
CrackScopeNet architecture successfully achieves the design goal of a lightweight net-
work structure while maintaining superior segmentation performance, making it easily
deployable on resource-constrained edge devices.
Moreover, compared to the state-of-the-art crack image segmentation algorithms,
the proposed method achieves an mIoU of 82.15% with fewer parameters and FLOPs,
surpassing the highest-accuracy versions of the U2Crack and HrSegNet models. Notably,
the HrSegNet model employs an online hard example mining (OHEM) technique during
training to improve its accuracy. In contrast, we only use the cross-entropy loss function for
model parameter updating without deliberately employing any training tricks to enhance
performance, showcasing the significant benefits of considering crack morphology during
model design.
Qualitative Results. Figures 4–6 display the qualitative results of all compared models.
Our method achieves superior visual performance compared to the other models. From
the first, second, and third rows of Figure 4, it can be observed that CrackScopeNet and the
segmentation algorithms with larger parameter counts achieve satisfactory results for high-
resolution images with apparent crack features. In the fourth row, where the original image
contains asphalt with color and texture similar to cracks, CrackScopeNet and SegFormer
successfully overcome the background noise interference. This is attributed to their long-
range contextual dependencies, which allow them to effectively capture the relationships
between cracks. In the fifth row, the results show that CrackScopeNet exhibits robust
performance even under uneven illumination conditions. This can be attributed to the
design of the network structure, which considers both the local and global features of cracks
while effectively suppressing noise.
Figure 4. Visualization of the segmentation results of the classical segmentation models and our
model on the CrackSeg9k test set.
Figure 5 clearly shows that the lightweight networks struggle to eliminate back-
ground noise interference, leading to fragmented segmentation results for fine cracks.
This outcome is due to the limited parameters learned by lightweight models. Finally,
Figure 6 presents the visualization results of the most advanced crack segmentation models.
U2Crack [52], based on the ViT [17] architecture, achieves a broader receptive field that
somewhat alleviates background noise, though at the cost of significant computational
overhead. HrSegNet [29] maintains a high-resolution branch to capture rich and detailed
features. As seen in the last two columns of Figure 6, the increased number of channels in
the HrSegNet network allows more detailed information to be extracted; however, this leads
to background information being misclassified as cracks. This explains the high precision
and low recall results of HrSegNet. In summary, CrackScopeNet outperforms the other
segmentation models, demonstrating excellent crack detection performance under various
noise conditions with fewer parameters and lower FLOPs.
Inference on Navio2-based drones. In practical applications, there remains a substantial gap
between real-time semantic segmentation algorithms as designed and validated on workstations
and their actual deployment on mobile and edge devices, which face challenges such as limited
memory resources and low computational efficiency. To better simulate edge devices used for out-
door structural health monitoring, we explored the inference speed of the models without
GPU acceleration. Notably, to ensure that all models could be deployed on the drone
platform without sacrificing accuracy through pruning or compression, we avoided using
storage-intensive and computationally demanding models such as UNet, SegNet, and
PSPNet. We converted the models to ONNX format and tested their inference speeds on
Navio2-based drones equipped with a representative Raspberry Pi 4B, focusing on compar-
ing our proposed model with models having tiny FLOPs and parameter counts: BiSeNet, BiSeNetV2,
STDC, TopFormer, and SeaFormer.
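A minimal sketch of this CPU-only latency measurement with ONNX Runtime follows; the model file name, input resolution, and warm-up/repeat counts are illustrative assumptions.

```python
import time
import numpy as np
import onnxruntime as ort

# Export is done once on the workstation, e.g.:
# torch.onnx.export(model, dummy_input, "crackscopenet.onnx", opset_version=11)

sess = ort.InferenceSession("crackscopenet.onnx",
                            providers=["CPUExecutionProvider"])  # no GPU on the Pi
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 400, 400).astype(np.float32)

for _ in range(10):          # warm-up runs, excluded from timing
    sess.run(None, {input_name: x})

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    sess.run(None, {input_name: x})
latency_ms = (time.perf_counter() - t0) / runs * 1000
print(f"mean single-image latency: {latency_ms:.1f} ms")
```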
Figure 5. Visual segmentation results of the lightweight segmentation models and our model on the
CrackSeg9k test set.
Figure 6. Visual segmentation results of the crack-specific segmentation models on the CrackSeg9k
test set.
As shown in Figure 7, the test results indicate that when running on a highly resource-
constrained drone platform, the proposed CrackScopeNet architecture achieves faster
inference speed compared to other real-time or lightweight semantic segmentation net-
works based on convolutional neural networks, such as BiSeNet, BiSeNetV2, and STDC.
Additionally, TopFormer and SeaFormer, which are designed with deployment on resource-
limited edge devices in mind, both achieve extremely low inference latency; however, these
models perform poorly on the crack datasets due to inadequate data volume. Our proposed
model achieves remarkable crack segmentation accuracy while maintaining rapid inference
speed, establishing its advantages over competing models.
These results confirm the efficacy of deploying the CrackScopeNet model on outdoor
mobile devices, where high-speed inference and lightweight architecture are crucial for
real-time processing and analysis of infrastructure surface cracks. By outperforming other
state-of-the-art models, CrackScopeNet proves to be a suitable solution for addressing the
challenges associated with outdoor edge computing.
Table 5 reports the evaluation results on each dataset. It is evident that the large version of the model achieves higher segmentation
accuracy across all datasets, though with approximately double the parameters and three
times the FLOPs. Therefore, if computational resources and memory are sufficient and
higher accuracy in crack segmentation is required, the large version or further stacking of
CrackScope modules can be employed.
Table 5. Evaluation results of the two versions of our model on three different datasets; CSNet and
CSNet_L stand for CrackScopeNet and CrackScopeNet_Large, while mIoU(F) indicates the mIoU
score for models pretrained on the CrackSeg9k dataset.
For specific-scenario training, whether from scratch or fine-tuning, all our models were
trained for only 20 epochs. It can be seen that the models converge quickly even when
training from scratch. We attribute this phenomenon to the initial design of CrackScopeNet,
which considers the morphology of cracks and is able to successfully capture the necessary
contextual information. For training using transfer learning, both versions of the model
achieve remarkable results on the Ozgenel dataset, with mIoU scores of 90.1% and 92.31%,
respectively. Even for the Aerial Track dataset, which includes low-altitude remote sens-
ing images of highway cracks not seen in CrackSeg9k, both of our models still perform
exceptionally well, achieving respective mIoU scores of 83.26% and 84.11%. These results
demonstrate the proposed model’s rapid adaptability to small datasets, aligning well with
real-world tasks.
Table 6. Ablation results for the multiscale branch, attention modules, and decoder on the CrackSeg9k dataset.

Multi-Branch   SWA   CA   CBAM   Decoder (Ours)   Decoder (ASPP)   mIoU (%)   FLOPs (G)
✓              –     –    –      –                –                81.34      1.57
✓              –     ✓    –      ✓                –                81.98      1.58
✓              –     –    ✓      ✓                –                81.95      1.58
–              ✓     –    –      ✓                –                81.91      1.61
✓              ✓     –    –      –                ✓                82.14      2.89
✓              ✓     –    –      ✓                –                82.15      1.58
Multiscale Branch. Next, we examined the effect of the multiscale branch in the
CrackScope module. To ensure fairness, we replaced the multiscale branch with a con-
volution of a larger kernel size (5 × 5 instead of 3 × 3). The results with and without
the multiscale branch are shown in Table 6. It is evident that using a 5 × 5
convolution instead of the multiscale branch decreases the mIoU score (−0.16%) despite
having more floating-point computations. This demonstrates that blindly adopting large
kernel convolutions increases computational overhead without significant performance
improvements. The benefits conferred by the multiscale branch were further analyzed
through class activation mapping (CAM) [59]. As shown in the third column of Figure 8, when the multiscale branch
is not used, it is obvious that the network misses the feature information of small cracks,
while the model with this branch can perfectly capture the features of cracks with various
shapes and sizes.
Decoder. CrackScopeNet uses a simple decoder that fuses feature information at different
scales, compressing channel features and merging features from different stages. At present,
the most popular decoders use an atrous spatial pyramid pooling (ASPP) [23] module to
introduce multi-scale information. In order to explore whether an ASPP module could
benefit our model and to investigate
the effectiveness of our proposed lightweight decoder, we replaced our decoder with the ASPP
method adopted by DeepLabV3+ [23]. The results are shown in the last two rows of
Table 6. It can be seen that the computational overhead is large because of the need to
perform parallel dilated convolution operations on deep semantic information; however,
the performance of the model does not improve. This shows that using multiple sets
of dilated convolutions to capture multiscale features incurs additional computational
overhead while not contributing to the performance of our model.
6. Discussion
In this paper, we present CrackScopeNet, a lightweight infrastructure surface crack
segmentation network specifically designed to address the challenges posed by varying
crack sizes, irregular contours, and subtle differences between cracks and normal regions
in real-world applications. The proposed network structure captures the local context
information and long-distance dependencies of cracks through a lightweight multiscale
branch and an SWA attention mechanism, respectively, and effectively extracts the low-level
details and high-level semantic information required for accurate crack segmentation.
In this work, we find that using channel-wise partitioning to apply different kernel
sizes effectively captures multiscale features without introducing significant computational
overhead. Additionally, by incorporating an attention mechanism that accounts for long-
range dependencies, it is possible to compensate for the limitations of downsampling
without resorting to additional detail branches, which would otherwise increase compu-
tational demands. Our experimental results demonstrate that CrackScopeNet delivers
robust performance and high accuracy. It outperforms larger models like SegFormer in
terms of efficiency, significantly reducing the number of parameters and computational cost.
Furthermore, our method achieves faster inference speeds than other lightweight models
such as BiSeNet and STDC even in the absence of GPU acceleration. This performance
makes it highly suitable for deployment on resource-constrained drone platforms, enabling
efficient and low-latency crack detection in structural health monitoring. By making the
model and code publicly available, we aim to advance the application of UAV remote
sensing technology in infrastructure maintenance, providing an efficient and practical tool
for the timely detection and analysis of cracks.
Furthermore, utilizing UAVs to monitor crack development in geological disaster
scenarios can greatly aid in warning efforts. CrackScopeNet, having proven effective in
infrastructure crack detection, has the potential to be adapted for these contexts through
domain adaptation. We have undertaken preliminary investigations by capturing images
of hazardous rock formations with UAVs and using our model to extract crack regions, as
illustrated in Figure 9. These environments present more intricate crack patterns, including
various types and complex curved damage. Our approach currently exhibits limitations in
detecting fine cracks, particularly those that blend with the background. Our next work
will focus on enhancing the model's sensitivity and capacity in order to accurately identify
smaller and more complex crack patterns in challenging conditions, especially in geological
disaster monitoring.
Lastly, in this era of large models, our model has only been trained and evaluated on
datasets containing a few thousand images; the need for a large amount of data collection
and manual labeling represents a bottleneck. Recent advances in generative AI and self-
supervised learning can bypass the limitations imposed by the need for data acquisition
and manual annotation. Researchers can use the inherent structure or attributes of existing
data to generate richer “synthetic images” and “synthetic labels”, which is a very interesting
research avenue that could be applied to crack detection.
Author Contributions: T.Z. designed the architecture and comparative experiments, and wrote the
manuscript; L.Q. revised the manuscript and assisted T.Z. in conducting the experiments; Q.Z. and
H.Z. made suggestions for the experiments and assisted in revising the manuscript; L.Z. and R.W.
conducted investigation and code testing. All authors have read and agreed to the published version
of the manuscript.
Funding: This research was funded by research on identification and variation analysis methods
for rock fractures, development of a real-time monitoring model for falling rocks based on machine
vision, research project on hazard warning algorithm, and terminal equipment for rock collapse based
on vibration data of Chongqing Institute of Geology and Mineral Resources, grant numbers F2023304,
F2023045, and cstc2022jxjl00006. This work was supported by the 2024 Key Technology Project of
Chongqing Municipal Education Commission, grant number KJZD-K202400204.
Data Availability Statement: The code and data that support the findings of this study are available
on GitHub at https://github.com/ttkingzz/CrackScopeNet, accessed on 5 July 2024.
Acknowledgments: The authors would like to thank the editors and reviewers for their valuable suggestions.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Minh Dang, L.; Wang, H.; Li, Y.; Nguyen, L.Q.; Nguyen, T.N.; Song, H.K.; Moon, H. Deep learning-based masonry crack
segmentation and real-life crack length measurement. Constr. Build. Mater. 2022, 359, 129438. [CrossRef]
2. Zheng, M.; Lei, Z.; Zhang, K. Intelligent detection of building cracks based on deep learning. Image Vis. Comput. 2020, 103, 103987.
[CrossRef]
3. Ha, J.; Kim, D.; Kim, M. Assessing severity of road cracks using deep learning-based segmentation and detection. J. Supercomput.
2022, 78, 17721–17735. [CrossRef]
4. Zhang, J.; Qian, S.; Tan, C. Automated bridge surface crack detection and segmentation using computer vision-based deep
learning model. Eng. Appl. Artif. Intell. 2022, 115, 105225. [CrossRef]
5. Deng, J.; Singh, A.; Zhou, Y.; Lu, Y.; Lee, V.C.S. Review on computer vision-based crack detection and quantification methodologies
for civil structures. Constr. Build. Mater. 2022, 356, 129238. [CrossRef]
6. Gavilán, M.; Balcones, D.; Marcos, O.; Llorca, D.F.; Sotelo, M.A.; Parra, I.; Ocaña, M.; Aliseda, P.; Yarza, P.; Amírola, A. Adaptive
Road Crack Detection System by Pavement Classification. Sensors 2011, 11, 9628–9657. [CrossRef]
7. Jahanshahi, M.R.; Jazizadeh, F.; Masri, S.F.; Becerik-Gerber, B. Unsupervised Approach for Autonomous Pavement-Defect Detection and
Quantification Using an Inexpensive Depth Sensor; American Society of Civil Engineers: Reston, VA, USA, 2012.
8. Zhang, D.; Zou, Q.; Lin, H.; Xu, X.; He, L.; Gui, R.; Li, Q. Automatic pavement defect detection using 3D laser profiling technology.
Autom. Constr. 2018, 96, 350–365. [CrossRef]
9. Iyer, S.; Sinha, S.K. Segmentation of Pipe Images for Crack Detection in Buried Sewers. Comput.-Aided Civ. Infrastruct. Eng. 2006,
21, 395–410. [CrossRef]
10. Sun, B.C.; Qiu, Y.J. Automatic Identification of Pavement Cracks Using Mathematic Morphology. In Proceedings of the First
International Conference on Transportation Engineering, Chengdu, China, 22–24 July 2007.
11. Kamaliardakani, M.; Sun, L.; Ardakani, M.K. Sealed-Crack Detection Algorithm Using Heuristic Thresholding Approach.
J. Comput. Civ. Eng. 2016, 30, 04014110. [CrossRef]
12. Mohan, A.; Poobal, S. Crack detection using image processing: A critical review and analysis. Alex. Eng. J. 2018, 57, 787–798.
[CrossRef]
13. Qu, Z.; Lin, L.D.; Guo, Y.; Wang, N. An improved algorithm for image crack detection based on percolation model. Comput.-Aided
Civ. Infrastruct. Eng. 2015, 10, 214–221. [CrossRef]
14. Cha, Y.J.; Ali, R.; Lewis, J.; Büyüköztürk, O. Deep learning-based structural health monitoring. Autom. Constr. 2024, 161, 105328.
[CrossRef]
15. Liu, Z.; Cao, Y.; Wang, Y.; Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks.
Autom. Constr. 2019, 104, 129–139. [CrossRef]
16. Yang, J.; Wang, W.; Lin, G.; Li, Q.; Sun, Y.; Sun, Y. Infrared Thermal Imaging-Based Crack Detection Using Deep Learning. IEEE
Access 2019, 7, 182060–182077. [CrossRef]
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
19. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic
Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.:
New York, NY, USA, 2021; Volume 34, pp. 12077–12090.
20. Lin, Q.; Li, W.; Zheng, X.; Fan, H.; Li, Z. DeepCrackAT: An effective crack segmentation framework based on learning multi-scale
crack features. Eng. Appl. Artif. Intell. 2023, 126, 106876. [CrossRef]
21. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement
Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [CrossRef]
22. Chu, H.; Wang, W.; Deng, L. Tiny-Crack-Net: A multiscale feature fusion network with attention mechanisms for segmentation
of tiny cracks. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1914–1931. [CrossRef]
23. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image
segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 801–818.
24. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic
Segmentation. arXiv 2022, arXiv:2209.08575.
25. Duan, Z.; Liu, J.; Ling, X.; Zhang, J.; Liu, Z. ERNet: A Rapid Road Crack Detection Method Using Low-Altitude UAV Remote
Sensing Images. Remote Sens. 2024, 16, 1741. [CrossRef]
26. Forcael, E.; Román, O.; Stuardo, H.; Herrera, R.F.; Soto-Muñoz, J. Evaluation of Fissures and Cracks in Bridges by Applying
Digital Image Capture Techniques Using an Unmanned Aerial Vehicle. Drones 2024, 8, 8. [CrossRef]
27. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic
Segmentation. arXiv 2016, arXiv:1606.02147.
28. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation.
In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
29. Li, Y.; Ma, R.; Liu, H.; Gaoli, C. Real-time high-resolution neural network with semantic guidance for crack segmentation. Autom.
Constr. 2023, 156, 105112. [CrossRef]
30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the
Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.,
Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [CrossRef]
31. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
32. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
33. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in cnns. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975.
34. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. TopFormer: Token Pyramid Transformer for Mobile
Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093.
35. Wan, Q.; Huang, Z.; Lu, J.; Yu, G.; Zhang, L. SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation.
In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
38. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
39. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. arXiv 2022,
arXiv:2211.12905.
40. Kulkarni, S.; Singh, S.; Balakrishnan, D.; Sharma, S.; Devunuri, S.; Korlapati, S.C.R. CrackSeg9k: A Collection and Benchmark
for Crack Segmentation Datasets and Frameworks. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Karlinsky, L.,
Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 179–195.
41. Dais, D.; Bal, E.; Smyrou, E.; Sarhosis, V. Automatic crack classification and segmentation on masonry surfaces using convolutional
neural networks and transfer learning. Autom. Constr. 2021, 125, 103606. [CrossRef]
42. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell.
Transp. Syst. 2016, 17, 3434–3445. [CrossRef]
43. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett.
2012, 33, 227–238. [CrossRef]
44. Pak, M.; Kim, S. Crack Detection Using Fully Convolutional Network in Wall-Climbing Robot. In Advances in Computer Science
and Ubiquitous Computing; Park, J.J.; Fong, S.J.; Pan, Y.; Sung, Y., Eds.; Springer: Singapore, 2021; pp. 267–272.
45. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation.
Neurocomputing 2019, 338, 139–153. [CrossRef]
46. Junior, G.S.; Ferreira, J.; Millán-Arias, C.; Daniel, R.; Junior, A.C.; Fernandes, B.J.T. Ceramic Cracks Segmentation with Deep
Learning. Appl. Sci. 2021, 11, 6017. [CrossRef]
47. Dorafshan, S.; Thomas, R.J.; Maguire, M. SDNET2018: An annotated image dataset for non-contact concrete crack detection using
deep convolutional neural networks. Data Brief 2018, 21, 1664–1668. [CrossRef]
48. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to
get pavement distress detection ready for deep learning? A systematic approach. In Proceedings of the 2017 International Joint
Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2039–2047. [CrossRef]
49. Özgenel, F. Concrete Crack Segmentation Dataset. Mendeley Data 2019. [CrossRef]
50. Hong, Z.; Yang, F.; Pan, H.; Zhou, R.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S.; Chen, P.; Tong, X.; et al. Highway Crack Segmentation
From Unmanned Aerial Vehicle Images Using Deep Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [CrossRef]
51. Liu, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Lai, B.; Hao, Y. PaddleSeg: A High-Efficient Development Toolkit for Image
Segmentation. arXiv 2021, arXiv:2101.06175.
52. Shi, P.; Zhu, F.; Xin, Y.; Shao, S. U2CrackNet: A deeper architecture with two-level nested U-structure for pavement crack
detection. Struct. Health Monit. 2023, 22, 2910–2921. [CrossRef]
53. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef] [PubMed]
54. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time
Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [CrossRef]
55. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet for Real-Time Semantic Segmentation. In Proceedings
of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021;
pp. 9716–9725.
56. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset
for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
57. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
58. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
59. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. arXiv 2015,
arXiv:1512.04150.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.