2024, DCNAM - Automatic Detection of Pixel Level Fine Crack Using A Densely Connected - 'Beyene Et Al' (Structures)
Structures
journal homepage: www.elsevier.com/locate/structures
Keywords: Crack detection; Deep learning; FC-DenseNet; Attention Block; Structural health monitoring; Tiny cracks

Deep-learning-based crack identification has emerged as a prominent research area in structural health monitoring. Although the detection of common cracks has been the predominant focus in previous studies, the identification of tiny cracks has often been neglected. Efficiently managing thin cracks is vital, because they can threaten the overall structural integrity over time if left unaddressed. We address this gap by targeting thin cracks within a broad category of crack types. We introduce a fine-crack-detection algorithm that efficiently detects both common and tiny cracks. Owing to the limited availability of publicly accessible datasets specifically focused on thin cracks, we collect images of fine cracks to train and evaluate our algorithm. To validate the efficiency of our method, we conduct experiments on three publicly available crack datasets and our private dataset. Compared with the baseline neural network, our proposed approach demonstrates superior performance across all evaluation metrics. Furthermore, our model exhibits impressive generalization ability across the datasets, with the F1 score and mean intersection over union improving by 22.42% and 28.07%, respectively. Notably, our observations indicate that the advantages of the proposed method become more pronounced as the dataset size increases.
Abbreviations: CNN, Convolutional Neural Networks; DCNAM, Densely Connected Network with Attention Mode
∗ Corresponding author.
∗∗ Corresponding author at: Department of Global Smart City, Sungkyunkwan University, Suwon 16419, South Korea.
E-mail addresses: [email protected] (M. Park), [email protected] (S. Park).
1 These authors contributed equally to this work.
https://doi.org/10.1016/j.istruc.2024.107073
Received 2 April 2024; Received in revised form 4 July 2024; Accepted 10 August 2024
Available online 22 August 2024
2352-0124/© 2024 Institution of Structural Engineers. Published by Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and
similar technologies.
D.A. Beyene et al. Structures 68 (2024) 107073
unsatisfactory performance in the case of complex backgrounds; thus, their performance and robustness are affected.

Recently, deep learning has demonstrated remarkable performance in computer vision, and its use for detecting concrete cracks has been explored in numerous studies. For example, Zhang et al. [15] proposed a convolutional neural network (CNN)-based method employing a sliding-window technique to classify each subregion of input images as cracked or non-cracked. Similarly, Li et al. [14] proposed a deep bridge crack classification model based on CNNs, incorporating an optimized sliding-window algorithm for detecting bridge cracks. However, these methods typically identify only the approximate location of cracks and cannot extract more detailed crack information.

An alternative approach is pixel-level crack detection. Schmugge et al. [16] proposed a method for remote video crack detection using semantic segmentation networks. To connect hierarchical features, Yang et al. [17] introduced feature fusion operations. These methods can improve network performance to a certain extent, as low-level features provide intricate crack details and high-level features encompass more abstract semantic information. However, as the neural network architecture becomes more complex and the number of layers increases, acquiring crack features becomes more challenging, with an increased risk of vanishing gradients. These factors complicate the detection of tiny cracks. Gao et al. [18] proposed the DenseNet classification network to enhance feature propagation and mitigate the vanishing gradient problem. DenseNet allows each layer direct access to the gradients from the loss function and the original input, and provides implicit deep supervision. It also enhances feature extraction with densely connected feature maps. However, integrating feature maps of various scales and re-weighting them is crucial owing to the semantic feature distribution of cracks and the minimal variability between fine cracks and the background. To address these two challenges, we introduce a fully convolutional network for pixel-level segmentation based on FC-DenseNet [19] and the squeeze-and-excitation (SE) block [20]. The main contributions of this work are as follows:

• The FC-DenseNet is employed to extract feature maps from raw images at various scales. Dense blocks enhance crack extraction effectiveness, and skip connections fuse feature maps across different scales.
• The SE block is utilized to reweight the channel dimensions of feature maps at various scales, thereby enabling dynamic adjustments to channel values.
• A dataset consisting of 51 images captured using a GoPro camera is collected. These images have a resolution of 6000 × 3384 pixels and specifically focus on fine cracks.

2. Related work

In this review, we categorize the existing methodologies into three main approaches: image processing, machine learning, and deep learning-based methods.

2.1. Image processing techniques-based crack detection

In traditional crack identification, several image-processing techniques are utilized to detect cracks. Among these techniques, thresholding, minimal path selection, and edge detection methods are commonly employed for crack detection [2].

Threshold-based methods rely on the characteristic that cracks are darker than the surrounding pixels [5,6,21,22]. By employing a threshold value, these methods distinguish cracks from the background. However, their effectiveness is often compromised by shadows, noise, and uneven illumination, which can lead to fragmented detection outcomes. To address these limitations, minimal path selection techniques have been introduced [10,23,24]. These techniques aim to suppress noise and enhance the continuity of crack detection. However, they are often used in scenarios with complex crack topologies, which can limit their applicability.

Regarding edge detection techniques, Canny [7], Sobel [25], and morphological filters [26] are employed due to the similarity between cracks and edges. These methods identify edges in images that may correspond to crack lines. However, their performance is highly susceptible to noise, and the selection of appropriate hyperparameters is critical for varying environmental conditions.

2.2. Machine learning techniques-based crack detection

Machine learning has significantly enhanced crack detection by shifting from basic feature extraction to sophisticated classification algorithms. This evolution marked a pivotal shift in the identification and analysis of cracks in various images.

Early machine learning endeavors, as exemplified by Moon et al. [27] and Moselhi et al. [28], focused on extracting geometric features from connected regions, such as the lengths of the major and minor axes and their ratios. These features were then classified using a back-propagation (BP) neural network. The effectiveness of these early methods lies in their simplicity and direct approach to feature extraction. Subsequent research has delved into texture-based features, recognizing their potential in capturing intricate details of crack surfaces. Michael et al. [29] and Kapela et al. [30] demonstrated the effectiveness of utilizing pixel intensity values and histograms of oriented gradient features, respectively. Local binary patterns have also been widely adopted for the efficient representation of textural features [31,32].

Classifiers are essential in these methods, with various algorithms such as artificial neural networks [33], support vector machines [29,32], and random structured forests [34] being employed. Each classifier has its own strengths and weaknesses, which influence the overall effectiveness of the crack-detection process. However, machine-learning methods in this domain require a large number of structured labels, and their performance and robustness are still lacking, particularly in complex backgrounds.

2.3. Deep learning techniques-based crack detection

The remarkable feature extraction capabilities of CNNs have driven numerous studies into the application of deep learning for crack detection.

Crack detection, an extension of crack classification, involves determining the precise location of cracks in an image. Deep-learning approaches have shown remarkable robustness and adaptability, often outperforming traditional and machine-learning-based methods in various scenarios. For example, Cha et al. [35] developed a CNN model that uses a 256 × 256 window to scan images. Zhang et al. [15] and Pauly et al. [36] further emphasized the advantages of deep learning, particularly in handling complex crack patterns. To address the challenge of detecting cracks of varying sizes, methodologies like R-CNN and YOLO have been introduced [37,38]. For road damage identification, Tsuchiya et al. [39] employed YOLOv3, illustrating the benefits of data augmentation in improving detection accuracy. However, these methods generally provide only approximate crack locations, lacking the detail needed for thorough analysis. This limitation underscores the need for continued research to develop techniques that offer more precise crack detection and characterization.

An alternative solution is a semantic-segmentation-based technique that segments cracks at the pixel level. For example, the holistically nested edge detection (HED) framework introduced by Xie et al. [40] marked a pivotal shift toward more accurate edge detection, influencing subsequent models in crack detection. Liu et al. [41]
developed the Deepcrack model based on the HED framework, integrating Conditional Random Fields and Guided Filtering techniques to refine segmentation accuracy. Despite these advancements, such models can still produce false positive pixels due to background interference in the encoder's features. The Fully Convolutional Network (FCN) [42] enhances semantic segmentation by introducing fully convolutional layers, significantly improving crack segmentation. This approach allows the model to capture detailed features without the limitation of fully connected layers. The U-Net architecture [43] further improves segmentation accuracy through its use of skip connections and upsampling techniques, resulting in a more detailed and accurate representation of crack features. Extensions to U-Net, as seen in models like Deepcrack [44] and UHDN [45], introduce novel methods for feature fusion at various levels.

Accurate boundary localization remains a challenge in crack detection. To address this, Guo et al. [46] proposed a model that combines the original image edge with the output of a base predictor module and refines the results with a separate refinement module. Although this model effectively enhances boundary precision, it comes at the cost of increased complexity and processing requirements. In response to the need for speed and efficiency, Choi et al. [47] introduced a semantic damage-detection network that employed separable and dilated convolutions to expedite the segmentation process without compromising accuracy. Ensemble methods, such as the one proposed by Fan et al. [48], utilize multiple network outputs to improve detection accuracy. However, most of the aforementioned methods employ concatenation operations for feature fusion or merging, where features are equally weighted. In these operations, the information from tiny cracks may be neglected because of their weak representation. This study addresses the challenge arising from the limited variability between tiny cracks and the background, as well as the semantic distribution of cracks, for the precise segmentation of both coarse and fine cracks.

3. Research methodology

3.1. Proposed method

3.1.1. Overview of the method

We consider crack identification through crack segmentation, where each pixel is classified as either crack or background. The data-processing workflow involves obtaining, labeling, and preprocessing data. After preprocessing, we created a target dataset that met the experimental requirements. The main structure of the proposed network, named Densely Connected Network with Attention Mode (DCNAM), is presented in Fig. 1. Our network architecture utilizes FC-DenseNet [19] with the addition of SE blocks [20]. SE blocks are integrated after each convolutional module and dense block, except for the final convolutional layer.

A dense block was used to capture fracture characteristics and ensure effective gradient propagation. Fig. 2 shows how the modules are tightly connected. The dense connections differ slightly from those in DenseNet [18], as shown in Fig. 3.

During the training phase, crack images were initially fed into the encoder-decoder structure to generate multi-level feature maps. In the encoder, the cracked image passed through a convolution module, three SE blocks, three dense blocks, and two transition-down modules. In the decoder, the feature map was processed through three SE blocks, two transition-up modules, two dense blocks, and one convolution module. An SE block was connected after every dense block and after the initial convolution module. Finally, a sigmoid classifier and class-balance loss were employed to train the network. Fig. 4 provides a comprehensive overview of our proposed method, offering a visual representation that clarifies the overall approach.

3.1.2. FC-DenseNet

Jégou et al. [19] proposed the FC-DenseNet, which is a variant of the DenseNet architecture specifically designed for semantic segmentation tasks. The FC-DenseNet preserves the dense connectivity pattern of the DenseNet while adapting it to enable dense predictions at each spatial location. The key components include dense blocks, transition-down blocks, and skip connections for multi-scale feature fusion. The FC-DenseNet demonstrates excellent performance in pixel-wise segmentation by effectively capturing and propagating features, thereby allowing for accurate and detailed segmentation results. It has been widely adopted in the field of semantic segmentation and serves as a foundation for advancements in this field. Owing to the advantages of the dense connectivity modules and encoder-decoder structure, which enable better attention to fine crack features, we chose the FC-DenseNet as the foundational framework for our model.

3.1.3. SE block

The SE block [20] is a widely used architectural component in deep neural networks, particularly for computer vision tasks. The SE block enhances the representational power of intermediate feature maps within a CNN by learning channel-wise dependencies and adaptively recalibrating feature responses. This allows the network to emphasize informative features and suppress less relevant ones. The SE block achieves this through two key operations: the squeeze and excitation stages. The structure of the SE block is shown in Fig. 5.

In the squeeze stage, a fully connected hidden layer is employed to compress channel-wise feature descriptors into a lower-dimensional space. This mapping operation utilizes a nonlinear activation function such as the rectified linear unit to learn the correlation and importance of each channel. In the excitation stage, another fully connected hidden layer is utilized to map the compressed feature descriptors back to the original channel dimensions. This mapping operation employs a sigmoid function to generate a weight vector ranging from 0 to 1, which indicates the importance of each channel.

The two core stages of the SE attention module, squeeze and excitation, work together to enhance the representational power of the CNN by adaptively adjusting channel weights. This process emphasizes important feature channels, leading to significant performance improvements in various computer vision tasks. By enhancing the representation of fine cracks, the SE block increases the contrast between fine cracks and the background. This reweighting capability improves the network's ability to detect and represent fine cracks.

3.1.4. Loss function

In our experiments, we employed three types of loss functions: binary cross-entropy (BCE) loss, Dice loss, and class-balanced loss. These functions were utilized to train our model and evaluate its performance in crack detection.

(a) BCE loss function. The cross-entropy loss function is commonly used in segmentation tasks to measure the discrepancy between the predicted distribution (P) and true distribution (Q) of the cracks. For each sample, the BCE loss is computed as the negative logarithm of the predicted probability of the true label.

\[
\mathrm{BCE}(P, Q) = -\sum_{i=1}^{HW} \Big( Q_i \log\left(P_i\right) + \left(1 - Q_i\right) \log\left(1 - P_i\right) \Big) \tag{1}
\]

where Q is the true binary label (zero or one), P is the predicted probability of the positive class, H is the height of the feature map, and W is the width of the feature map.
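As a minimal sketch of Eq. (1), the BCE loss can be computed over a flattened prediction map in plain Python. The function name `bce_loss`, the `eps` clamp that guards against log(0), and the toy inputs are illustrative assumptions and are not taken from the authors' implementation.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Sum of per-pixel binary cross-entropy terms, following Eq. (1).

    pred   -- predicted crack probabilities P_i, flattened over H*W pixels
    target -- ground-truth labels Q_i (0 = background, 1 = crack)
    eps    -- assumed clamp that keeps log() finite; not specified in the text
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, q in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp away from exactly 0 and 1
        total += q * math.log(p) + (1 - q) * math.log(1.0 - p)
    return -total

# Toy 2 x 2 map, flattened row by row.
labels = [1, 0, 1, 0]
good = bce_loss([0.9, 0.1, 0.8, 0.2], labels)  # confident and correct
bad = bce_loss([0.1, 0.9, 0.2, 0.8], labels)   # confident and wrong
```

In this toy case the confident, correct prediction yields the smaller loss, which is the behaviour Eq. (1) is meant to reward.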
Fig. 1. DCNAM main structure. SE: squeeze-and-excitation block; TU: transition-up modules; TD: transition-down modules.

Fig. 2. Dense connection in our method.

following equation:

\[
L_p = -\sum_{i=1}^{HW} \Big( \beta\, Q_i \log\left(P_i\right) + (1 - \beta) \left(1 - Q_i\right) \log\left(1 - P_i\right) \Big) \tag{3}
\]

\[
\beta = \frac{N_n}{N_n + N_c} \tag{4}
\]

where $N_c$ is the number of crack pixels, $N_n$ is the number of non-crack pixels, $\beta$ is the class-balancing weight, $Q$ is the ground-truth label, $P$ is the predicted probability map, $H$ is the height of the feature map, and $W$ is the width of the feature map.
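To illustrate Eqs. (3) and (4), the sketch below computes the class-balancing weight β from a label map and applies it to the two log terms. It is a plain-Python illustration under stated assumptions: the names `class_balance_weight` and `class_balanced_loss` and the `eps` clamp are hypothetical, and β is computed per label map here, whereas the text does not say whether it is computed per image, per batch, or over the whole dataset.

```python
import math

def class_balance_weight(target):
    """beta = N_n / (N_n + N_c), Eq. (4): the fraction of non-crack pixels."""
    n_crack = sum(1 for q in target if q == 1)
    n_non = len(target) - n_crack
    return n_non / (n_non + n_crack)

def class_balanced_loss(pred, target, eps=1e-7):
    """Weighted cross-entropy of Eq. (3).

    Crack terms are scaled by beta and background terms by (1 - beta).
    eps is an assumed clamp that keeps log() finite.
    """
    beta = class_balance_weight(target)
    total = 0.0
    for p, q in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += beta * q * math.log(p)
        total += (1.0 - beta) * (1 - q) * math.log(1.0 - p)
    return -total

# One crack pixel among ten, so beta = 0.9.
target = [1] + [0] * 9
beta = class_balance_weight(target)
# Case 1: the single crack pixel is missed (predicted 0.1 everywhere).
miss_crack = class_balanced_loss([0.1] * 10, target)
# Case 2: the crack is found, but one background pixel is a false alarm.
false_alarm = class_balanced_loss([0.9, 0.9] + [0.1] * 8, target)
```

With one crack pixel in ten, a missed crack pixel costs far more than a single false alarm on the background, which is the intended effect for images dominated by non-crack pixels.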
Table 1
Information on the datasets used in this study. Train Num: number of training images; Val Num: number of validation images; Test Num: number of testing images.

Dataset       Crack500    Deepcrack537   CFD         Ours
Train Num     1896        300            72          141
Val Num       348         0              0           63
Test Num      1124        257            46          54
Resolution    360 × 640   554 × 384      320 × 480   600 × 600

from bridges around the natural campus of Sungkyunkwan University, South Korea.

The images were captured at different ISO settings: 25 images at ISO 200, 13 images at ISO 400, and 13 images at ISO 800. The original size of these images is 6000 × 3368 pixels. Representative images, shown in Fig. 6, display varying backgrounds, lighting conditions, and surface textures. Due to the predominance of tiny cracks, the ratio of crack pixels to non-crack pixels is lower than in other datasets. In addition, the contrast between cracked and non-cracked pixels varies across images. Annotation files were prepared by experts using Darwin V7 software to label cracks at the pixel level. After labeling, the final annotations were converted into binary format. To enlarge the dataset and reduce the computational cost, we divided the images into 600 × 600 pixel tiles, following the approach used for the Crack500 dataset. A total of 258 non-overlapping segments were selected and randomly divided into training, validation, and testing sets, consisting of 141, 63, and 54 images, respectively.

3.4. Evaluation metrics

In this study, we utilized pixel accuracy (PA), F1 score, precision, recall, and mean intersection over union (mIoU) as metrics to evaluate and compare the alignment between ground-truth labeling and model predictions. Among these, the mIoU is a commonly used metric in crack detection. Precision and recall are standard performance measures for assessing the capabilities of neural networks. The formulas for these five evaluation metrics are as follows:

(a) Pixel Accuracy (PA) is the ratio of the correctly predicted pixels to the total number of pixels:

\[
PA = \frac{TP + TN}{TP + FP + TN + FN} \tag{6}
\]

(b) Mean Intersection over Union (mIoU) is the average ratio of the intersection to the union between the prediction and the ground truth for each class:

\[
mIoU = \frac{1}{2} \left( \frac{TP}{TP + FP + FN} \right) \tag{7}
\]

(c) Precision (P) is the proportion of correctly predicted positive instances to all instances predicted as positive:

\[
P = \frac{TP}{TP + FP} \tag{8}
\]

(d) Recall (R) is the proportion of correctly predicted positive instances to all positive instances:

\[
R = \frac{TP}{TP + FN} \tag{9}
\]

(e) F-measure (F1) represents the balance between the precision and recall:

\[
F1 = 2\, \frac{P \times R}{P + R} \tag{10}
\]

where FP, FN, TP, and TN represent false positives, false negatives, true positives, and true negatives, respectively.

4. Experimental results

We conducted four different types of experiments: a comparison study, a cross-dataset generalization experiment, an ablation study, and an experiment on special cases in which we analyzed the segmentation results of the proposed model under complex scenarios.

4.1. Comparison result

4.1.1. Results of Crack500 dataset

We compare the detection results of the models using the Crack500 dataset. The experimental results for precision, recall, F1-score, accuracy, and mIoU are listed in Table 2. Our proposed method, DCNAM, demonstrates superior performance across all five evaluation metrics,
with scores of 87.23% for precision, 79.60% for recall, 83.24% for F1-score, 96.83% for accuracy, and 71.11% for mIoU. Specifically, in terms of F1-score, DCNAM achieved improvements of 10.61, 23.84, and 13.21 percentage points over U-Net, FCN, and FPHBN, respectively. For mIoU, DCNAM outperformed U-Net, FCN, and FPHBN by 14.83, 27.90, and 17.05 percentage points, respectively.

Table 2
Evaluation results of models on the Crack500 dataset.

Class Balance loss   Metrics (%)
                     P       R       F1      Acc     mIoU
U-Net [43]           69.82   75.68   72.63   96.48   56.28
FCN [42]             56.36   62.74   59.40   95.96   43.21
FPHBN [44]           66.75   73.65   70.03   96.37   54.06
DCNAM                87.23   79.60   83.24   96.83   71.11

Fig. 7 shows the segmentation results of randomly selected images from the Crack500 dataset. Our approach achieves segmentation results that closely match the ground truth in both crack shape and propagation orientation, which is beneficial for crack quantification. In contrast, the U-Net and FPHBN tended to misidentify some non-crack pixels as cracks, whereas the FCN lost some crack details.

Fig. 7. Segmentation results on the Crack500 dataset. From left to right: original images, ground truth, segmentation results from FCN, U-Net, FPHBN, and our model.

4.1.2. Results of CFD dataset

Table 3 presents the inference results for precision, recall, F1-score, accuracy, and mIoU on the CFD dataset. The proposed method consistently achieved the highest values for precision, recall, F1-score, accuracy, and mIoU, with scores of 67.69, 86.49, 75.94, 98.71, and 54.61%, respectively. The CFD dataset's limited size, with only approximately a hundred images across all training, validation, and testing sets, likely explains the lack of significant performance difference among the four neural networks. Specifically, the proposed method outperformed the U-Net, FCN, and FPHBN in terms of the F1-score by 8.87, 7.18, and 14.89 percentage points, respectively. Similarly, DCNAM achieved mIoU values 4.03, 0.93, and 10.75 percentage points higher than those of the U-Net, FCN, and FPHBN, respectively.

Table 3
Comparative evaluation of models on CFD dataset.

Class Balance loss   Metrics (%)
                     P       R       F1      Acc     mIoU
U-Net [43]           64.48   69.88   67.07   98.58   50.58
FCN [42]             67.03   70.58   68.76   84.06   53.68
FPHBN [44]           53.91   70.38   61.05   98.54   43.86
DCNAM                67.69   86.49   75.94   98.71   54.61

Segmentation results of randomly selected images from the CFD dataset are depicted in Fig. 8. Our method's segmentation output closely aligns with the ground truth in terms of crack width and overall shape. In contrast, FPHBN tends to misclassify non-crack pixels as cracks and segments cracks wider than the actual ground truth. U-Net's results suffered from a loss of detail and distortion in the overall shape
of the cracks. Similarly, FCN's segmentation results also exhibited a loss of crack details.

Fig. 8. Segmentation results of the CFD dataset. From left to right: original images, ground truth, and the segmentation results of FCN, U-Net, FPHBN, and our model.

4.1.3. Results of Deepcrack537 dataset

Table 4 presents the experimental results for precision, recall, F1-score, accuracy, and mIoU on the Deepcrack537 dataset. Our proposed method outperforms the other models on all metrics, achieving precision, recall, F1-score, accuracy, and mIoU values of 84.67, 93.45, 88.84, 98.22, and 74.69%, respectively. Specifically, the proposed method's F1-scores are 5.26, 7.04, and 9.32 percentage points higher than those of the U-Net, FCN, and FPHBN, respectively. Additionally, the proposed method shows substantial improvements in mIoU, surpassing the U-Net, FCN, and FPHBN by 3.44, 5.43, and 6.49 percentage points, respectively.

Table 4
Comparative evaluation of models on the Deepcrack537 dataset.

Class Balance loss   Metrics (%)
                     P       R       F1      Acc     mIoU
U-Net [43]           82.48   82.68   83.58   98.09   71.25
FCN [42]             82.38   81.23   81.80   96.24   69.26
FPHBN [44]           83.22   76.13   79.52   98.02   68.20
DCNAM                84.67   93.45   88.84   98.22   74.69

The segmentation results for three randomly selected images from the Deepcrack537 dataset are shown in Fig. 9. Our proposed method exhibits the least deviation from the ground truth segmentation. In contrast, the segmentation maps produced by the other methods exhibit noticeable detail loss and distortions.

4.1.4. Results of our dataset

Table 5 presents the inference results of the models in terms of the precision, recall, F1-score, accuracy, and mIoU on our dataset. Our proposed model achieved the highest precision, F1-score, accuracy, and mIoU, which are 85.21, 82.16, 98.69, and 74.35%, respectively. In terms of the F1-score, DCNAM outperforms the U-Net, FCN, and FPHBN by 1.04, 3.86, and 4.42 percentage points, respectively. In terms of the mIoU, our proposed method achieved 0.50, 2.79, and 4.76 percentage points higher values than those of the U-Net, FCN, and FPHBN, respectively.

Table 5
Comparison results on our dataset.

Class Balance loss   Metrics (%)
                     P       R       F1      Acc     mIoU
U-Net [43]           83.19   79.16   81.12   98.35   73.85
FCN [42]             77.38   79.22   78.30   98.67   71.56
FPHBN [44]           80.16   75.46   77.74   96.84   69.59
DCNAM                85.21   79.33   82.16   98.69   74.35

The segmentation results for our dataset are presented in Fig. 10. Our model achieved more accurate segmentation of fine cracks compared to other models, especially in terms of crack length and propagation orientations. In contrast, other neural networks exhibit distortions in the segmentation of fine crack lengths and their integrity.

4.2. Generalization on cross data

To assess the generalizability of the proposed model, we conducted cross-dataset validation experiments. Initially, we trained the models on the Deepcrack537 dataset and then tested them on the CFD dataset. Table 6 lists the experimental results of various models for this cross-dataset validation. Our proposed method surpassed the baseline models U-Net, FCN, and FPHBN in all evaluation metrics. Specifically, our model achieved precision, recall, F1-score, accuracy, and mIoU values of 72.35, 90.99, 80.61, 99.14, and 69.11%, respectively. In terms of F1-score, our model outperformed U-Net, FCN, and FPHBN by 22.42, 30.02, and 24.06 percentage points, respectively. Furthermore, our method demonstrated a significant improvement in mIoU, surpassing U-Net, FCN, and FPHBN by 28.07, 35.26, and 29.70 percentage points, respectively.

Table 6
Comparison results of models on cross data (Deepcrack537 to CFD).

Class Balance loss   Metrics (%)
                     P       R       F1      Acc     mIoU
U-Net [43]           58.97   57.43   58.19   98.67   41.04
FCN [42]             46.47   55.50   50.59   98.25   33.85
FPHBN [44]           54.22   59.08   56.55   98.53   39.41
DCNAM                72.35   90.99   80.61   99.14   69.11

Fig. 11 shows the segmentation results of different types of structural surfaces, including asphalt, concrete, and masonry, under various backgrounds. The first, second, and seventh rows show pavement cracks: the first row features a shallow bifurcation crack, the second row shows an alligator crack, and the seventh row depicts a pavement crack against a complex background. Concrete cracks are shown in the third to fifth rows, with shallow and deep bifurcation cracks shown under various backgrounds. The sixth row shows a masonry crack.

In the first row, despite the disconnected prediction on the left side, our model achieved segmentation results that were closest to the ground-truth labeling. The segmentation result generated by FPHBN was better than those of U-Net and FCN; however, it displayed disconnected and missing predictions on the right side of the cracks. FCN
failed to detect cracks on the left side, whereas U-Net failed to produce crack lines on both the left and right sides.

Fig. 9. Segmentation results of the Deepcrack537 dataset. From left to right: original images, ground truth, segmentation results of FCN, U-Net, FPHBN, and our model.

Fig. 10. Segmentation results of our dataset. From left to right: original images, ground truth, segmentation results from FCN, U-Net, FPHBN, and our model.

In the third row, our model showed the best segmentation, which was closest to the ground truth labeling. The segmentation results produced by FPHBN and FCN have similar performances, with FPHBN having disconnected predictions on the right side and FCN having them on both the left and right sides. U-Net is the least accurate compared to the other models. In the fifth row, despite disconnected crack predictions, our model demonstrated the best performance for background surface-like cracks on a concrete surface. The predictions of FCN, U-Net, and FPHBN are disconnected and incomplete towards the edges. In the sixth row, our model predicts segmentation results similar to the ground truth labeling of the masonry wall. FCN misses the predictions at the joints, whereas both U-Net and FPHBN have missing predictions at the bottom edge. In addition, the U-Net exhibited disconnected cracks.

To evaluate the stability of the models against small changes or domain shifts in the input data, we generated images from raw images using different mechanisms. Fig. 12 presents the segmentation predictions for these generated images. In the first row, a wet surface was created on a shadowed pavement surface. The third row features a non-target image inserted into the background images, with an image containing cracks used as the background. In the fourth and fifth rows, the background color was changed to resemble red marble, stone, sandstone walls, and wood surfaces.

In the first row, our model's predicted segmentation results align closely with the ground truth and the segmentation results shown in the same row of Fig. 11. In contrast, segmentations produced by FCN, U-Net, and FPHBN have disconnected cracks and struggle to detect entire cracks. In the third row, all models except ours fail to detect the crack on the right side of the non-target image. Additionally, in the fourth and fifth rows, our model successfully detects cracks despite changes in the background resembling different material surfaces. However, FCN, U-Net, and FPHBN struggle to detect complete cracks, resulting in misleading results. The predictions from FCN and FPHBN show shorter and disconnected segments, respectively. Overall, the segmentation predictions of U-Net were the least accurate compared to the other models. Our model demonstrates strong generalization to small domain shifts or gaps, which can be further enhanced by incorporating a dataset with varied lighting conditions, backgrounds, and crack widths into the training and validation process.

4.3. Ablation study

We conducted an ablation experiment in two parts: (1) comparing different loss functions, including BCE loss, Dice loss, and class-balanced loss, and (2) evaluating model performance with and without an attention block to examine its impact on segmentation performance.

The results of the entire ablation study are presented in Table 7. Incorporating attention with class-balanced loss outperformed models
Fig. 11. Segmentation results of models on structural surfaces under different material types and environmental conditions.

Table 7
Ablation study results for loss functions and the attention block.

DCNAM                Metrics on Crack500 (%)
                     P       R       F1      Acc     mIoU
BCE Loss             72.16   88.75   79.60   96.40   66.49
Dice Loss            84.24   80.08   82.11   97.47   69.01
Class-Balance Loss   87.23   79.60   83.24   96.83   71.11
Without SE           81.61   83.20   82.40   95.17   69.05

loss function enhances results by 4.62 and 0.04 % compared to the BCE and Dice loss functions, respectively. Fig. 14 provides a comparative analysis between results obtained with and without the SE attention module to explore the impact of attention mechanisms. Utilizing the SE attention module enhances performance compared to models without it. In particular, integrating the SE attention module into the model resulted in enhancements of 0.84 and 2.06 % in F1 score and mIoU, respectively.
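The F1 column of Table 7 can be cross-checked against the precision and recall columns using Eq. (10), and the reported SE gains follow from the last two rows. The following is a small sketch of that arithmetic; the helper name `f1_from_pr` and the dictionary layout are illustrative assumptions, and the values are copied from Table 7.

```python
def f1_from_pr(precision, recall):
    """Eq. (10): harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

# Rows copied from Table 7: (P, R, reported F1, reported mIoU), all in percent.
table7 = {
    "BCE Loss": (72.16, 88.75, 79.60, 66.49),
    "Dice Loss": (84.24, 80.08, 82.11, 69.01),
    "Class-Balance Loss": (87.23, 79.60, 83.24, 71.11),
    "Without SE": (81.61, 83.20, 82.40, 69.05),
}

# Recompute F1 from P and R and record the gap to the reported value.
f1_gaps = {
    name: abs(f1_from_pr(p, r) - f1)
    for name, (p, r, f1, _miou) in table7.items()
}

# Gain from the SE block: class-balanced DCNAM versus the "Without SE" row.
with_se = table7["Class-Balance Loss"]
without_se = table7["Without SE"]
se_gain_f1 = round(with_se[2] - without_se[2], 2)    # F1 difference
se_gain_miou = round(with_se[3] - without_se[3], 2)  # mIoU difference
```

Each recomputed F1 agrees with the tabulated value to within rounding, and the two differences reproduce the 0.84 and 2.06 point gains quoted for the SE attention module.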