
Automation in Construction 156 (2023) 105098


Extended efficient convolutional neural network for concrete crack detection with illustrated merits
Ronghua Fu (a), Maosen Cao (b, c, *), Drahomír Novák (d), Xiangdong Qian (a), Nizar Faisal Alkayem (e)

(a) College of Mechanics and Materials, Hohai University, Nanjing, China
(b) Institute of Structural Dynamics and Control, Hohai University, Nanjing, China
(c) Jiangxi Province Key Laboratory of Environmental Geotechnical Engineering and Hazards Control, Jiangxi University of Science and Technology, Ganzhou, China
(d) Institute of Structural Mechanics, Faculty of Civil Engineering, Brno University of Technology, Brno, Czech Republic
(e) College of Civil and Transportation Engineering, Hohai University, Nanjing 210098, China

ARTICLE INFO

Keywords:
EfficientNet
Three tactics
CNN performance improvement
StairNet
Concrete crack detection
Unmanned aerial vehicle
Software

ABSTRACT

An efficient convolutional neural network (CNN), called EfficientNetV2, was recently developed. The early blocks of EfficientNetV2 have structural characteristics that lead to higher training speeds than state-of-the-art CNNs. Inspired by EfficientNetV2, extended research was conducted in this study to determine whether the early, middle, and late blocks of CNNs should have respective structural characteristics to achieve higher efficiency. Based on comprehensive studies, three tactics were proposed, which underpinned a swift CNN called StairNet. StairNet was subsequently embedded into the faster region-based CNN framework, producing Faster R-Stair. The presented StairNet and Faster R-Stair were validated on two datasets: Dataset1, comprising a pair of open-source datasets and a set of images captured in real-world conditions, and Dataset2, derived from Dataset1 and consisting of more complicated object modes to mimic the coexistence of multiple cracks under real conditions. Experimental results showed that StairNet outperforms EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3 in efficiency of crack classification and detection. A Faster R-Stair concrete crack-detection software platform was also developed. The software platform and an unmanned aerial vehicle were used to detect concrete road cracks at a university in Nanjing, China. The developed system has a swift detection process, with high speed and excellent results.

* Corresponding author. E-mail address: [email protected] (M. Cao).
https://doi.org/10.1016/j.autcon.2023.105098
Received 10 May 2023; Received in revised form 9 September 2023; Accepted 15 September 2023; Available online 27 September 2023

1. Introduction

Concrete is one of the most commonly used construction materials owing to its low price, high compressive strength, and good durability. However, concrete will inevitably crack because of loading, human factors, and harsh environmental conditions such as fires, earthquakes, storms, and snow [1,2]. Cracks reduce the bearing capacity of the entire concrete structure [3]. Therefore, regular and long-term monitoring is essential [4,5] to extend the service life and reduce the maintenance cost of concrete structures.

Recently, deep learning has become popular in vision-based monitoring owing to its excellent autonomous learning capacity. Generally, there are three main tasks in deep learning that assist in vision-based monitoring: classification, object detection, and semantic segmentation. The classification task involves classifying concrete cracks. In this task, convolutional neural networks (CNNs) are adopted as classifiers [6] instead of other machine learning techniques because they can automatically extract features of images without manual interpretation and realise automatic classification [7,8]. LeNet, which was proposed in 1998 [9], marked the pioneering work on CNNs. LeNet was a multi-layer network that uses convolutional operations to extract features from images. Nevertheless, it was not until nonlinearity was introduced by AlexNet [10,11] that CNNs entered the stage of high-speed development. In the last decade, several remarkable networks based on LeNet and AlexNet have been proposed, including the VGG series [12], the ResNet series [13], GoogLeNet [14], and MobileNet [15-17]. CNNs are also commonly used as feature extractors in object detection and semantic segmentation tasks because of their high accuracy and efficiency in extracting features from input images.

Object detection algorithms have been the primary focus of research in vision-based monitoring because they enable automatic classification and localisation of concrete cracks simultaneously. These algorithms can improve crack detection efficiency and address the challenge of processing large amounts of data, which cannot be handled quickly using conventional methods such as manual inspection [18] and digital image processing techniques [19-22].


Owing to their automated and efficient characteristics, object detection algorithms have been widely adopted for crack detection in various structures, such as tunnels [23], pavements [24], high-rise buildings [25], and bridges [26], and they are gradually replacing the conventional methods for concrete crack detection. In the early stage of object detection research, the frameworks of these algorithms were constructed using CNNs with the aid of a sliding window technique, and they have been extensively employed for concrete crack detection. Cha et al. [27] proposed a vision-based method using a CNN and a sliding window technique to detect concrete cracks without calculating the defect features. With the rapid technological development in recent years, novel object detection frameworks, primarily consisting of the single shot multibox detector (SSD) [28], the faster region-based CNN (Faster R-CNN) [29,30], and the you only look once (YOLO) series [31-35], have been proposed. These frameworks classify and localise concrete cracks in images using flexible bounding boxes. Cha et al. [36] developed a structural damage detection method based on a Faster R-CNN framework to detect five defects on concrete and metal surfaces, which achieved a mean average precision of 89.7%. Their results proved that object detection for defects based on a Faster R-CNN framework is more efficient, and the size of the bounding box more flexible, than the early-stage techniques of CNNs with a sliding window. Mandal et al. [37] proposed an automatic road-crack detection framework based on YOLOv2 with excellent detection results. Semantic segmentation [38-40] detects concrete cracks at the pixel level, which has greater application potential but requires higher computer specifications.

Scholars have performed various studies on object detection for concrete cracks, which can be divided into two categories based on the optimisation strategy:

(1) Optimising the object detection framework for better accuracy or efficiency. For example, the latest deep learning algorithms can be used, some internal parameters can be fine-tuned, and parts of the network structure can be optimised. Zhang et al. [41] developed an improved YOLOv3 framework by incorporating convolutional block attention modules (CBAM) [42] into YOLOv3 to detect cracks on bridge surfaces; the framework of YOLOv3 was also simplified, and the structures of MobileNet were used to improve the detection efficiency. Tang et al. [43] proposed an ME-Faster R-CNN framework consisting of a Faster R-CNN framework and a feature pyramid network to detect multiple and small defects in a dam. In 2019, Google developed EfficientNet [44] using neural architecture search technology to determine the optimal combination of three parameters: image input resolution, network depth, and channel width. EfficientNet was then used as the backbone in object detection frameworks for concrete crack detection. Zhou et al. [23] introduced a new framework for detecting defects in tunnel linings using YOLOv4 with EfficientNet; this approach was demonstrated to outperform the SSD, YOLOv3, YOLOv4, and Faster R-CNN frameworks in terms of accuracy, efficiency, and anti-interference capability. In 2021, EfficientNetV2 [45] was proposed as an extension of EfficientNet. It incorporates new structural characteristics, such as conventional convolutions (Convs) in early blocks and depthwise convolutions (Dconvs) in other blocks, resulting in higher training speeds. Bai et al. [46] proposed an image and frequency information-based procedure using EfficientNetV2 with a discrete wavelet transform module for reinforced concrete damage recognition; the results indicate that the procedure can rapidly assess overall structural safety.

(2) Expanding functions beyond detection through interdisciplinarity. Chaiyasarn et al. [47] proposed an image-based crack detection system by integrating an image-based three-dimensional (3D) photogrammetry technique into deep learning algorithms, which realised concrete crack detection and 3D mapping. Jang et al. [48] proposed a deep learning-based autonomous concrete crack detection technique with the aid of infrared thermography, achieving automatic identification and visualisation of macro- and microcracks. In addition, image processing techniques can be implemented in object detection of concrete cracks to eliminate background noise [26,49].

Although tremendous efforts have been made in object detection of concrete cracks, the following deficiency still exists. The CNN serves as the feature extractor within the object detection framework and acts as its backbone; a swift CNN with excellent feature extraction capability is a prerequisite for rapid crack detection using an object detection framework. Thus far, the backbones of object detection frameworks have always been general-purpose CNNs, such as VGG, ResNet, MobileNet, and EfficientNet. These were originally designed to classify hundreds or thousands of object types, whereas only tens of typical concrete crack types exist. General-purpose CNNs are consequently significantly oversized, resulting in lower efficiency and increased hardware resource requirements when used as backbones to detect concrete cracks. This can limit their use in concrete crack detection applications, particularly when dealing with large datasets and running on small devices with limited computing capabilities.

To address these deficiencies, an extended EfficientNetV2, called StairNet, with illustrated merits for concrete crack detection, is proposed in this study. The main contributions are as follows:

● The early blocks of EfficientNetV2 have structural characteristics that lead to higher training speeds. Inspired by EfficientNetV2, extended studies on the middle and late blocks of CNNs were conducted for more efficient classification and object detection of concrete cracks, resulting in three tactics for CNN construction: 1) applying Convs in the early blocks and Dconvs in the other blocks of CNNs to save training time; 2) deepening the middle blocks of CNNs to achieve the best trade-off between accuracy and efficiency; and 3) enhancing the late blocks of CNNs by increasing the kernel size and expansion factor to improve the accuracy with less training time.

● The three tactics gave rise to Stair1, Stair2, and Stair3 in the early, middle, and late blocks of CNNs, forming StairNet. StairNet was applied as the backbone of the Faster R-CNN framework for concrete crack detection, producing the Faster R-Stair framework. Various comparisons show that the proposed StairNet has a smaller size, better efficiency, and higher accuracy in both the classification and the object detection of concrete cracks.

● A software platform named the Faster R-Stair Concrete Crack-Detection Platform (FS_CCDP) was built based on the Faster R-Stair framework. The FS_CCDP, together with an unmanned aerial vehicle (UAV), was used to detect cracks on roads at a university in Nanjing, China. The detection process was convenient, with high speed and good detection results.

The remainder of this paper is organised as follows. Section 2 introduces the computing environment and the datasets with data augmentation. Section 3 discusses strategies for developing StairNet. Sections 4 and 5 describe StairNet and the Faster R-Stair framework, respectively. Section 6 discusses the experimental details and results of StairNet and the Faster R-Stair framework for the classification and object detection of concrete cracks. Section 7 presents the results of using the FS_CCDP software assisted by a UAV to detect road cracks at a university in Nanjing, China. Finally, conclusions are provided in Section 8.

Fig. 1. Images in Dataset1.

2. Computational environment and dataset

2.1. Computational environment

Most of the methods used in this study were implemented using PyTorch [50] on Windows 10. Table 1 lists the details of the system and environment.

Table 1
Computer system and environment configuration.

Platform      Parameters
System        Windows 10
CPU           Intel(R) Xeon(R) Gold 5222 CPU @ 3.80 GHz
GPU           NVIDIA Quadro P2200
Memory        64.0 GB
Environment   Anaconda 3
CUDA          10.2
Python        3.6
PyTorch

2.2. Datasets for concrete crack classification and crack object detection

CNNs for classification and object detection rely on the quantity and
quality of images. The datasets of crack images used in this study contain
Dataset1 and Dataset2.

2.2.1. Dataset1
Dataset1 was used to train and validate the CNNs for crack classification. The concrete crack images in Dataset1 came from the following two sources:

(1) Open sources: the dataset of Guo et al. [51], which contains 20,000 crack and 20,000 non-crack images collected from bridges, roads, and pavements, and the dataset of Xu et al. [52], which includes approximately 6000 images of bridge cracks.
(2) Real-world conditions: images captured using a UAV at a university in Nanjing, China.

Different sources of data, including open sources and images captured in real-world conditions, were combined to train the CNNs. This enriches the dataset with diverse features, augmenting the variety of concrete cracks and backgrounds and thereby enhancing the CNN generalisation capability. However, the use of mixed data sources can also place greater demands on the CNNs, possibly leading to difficulties in achieving convergence or increasing the likelihood of crack misclassification.

Dataset1 is a combination of 9000 randomly selected images from the open-source datasets and 1000 images captured under real-world conditions. The images, originally sized at 227 × 227 pixels, are resized to 224 × 224 for faster training speed [15].

Dataset1 is divided into three parts: Train, Val, and Test sets.
Train and Val sets: there are approximately 10,000 images, of which 70% are used as the Train set and the remaining 30% as the Val set. The number of images of each crack type in the Train set is similar. The Val set is used to judge whether a model is well trained by the Train set at the end of each training epoch.
Test set: there are 700 images (100 of each type) that are entirely independent of the CNN training epochs. The Test set is utilised to evaluate the generalisation capability of a well-trained CNN.

In this study, the rules for crack classification were established according to the common fracture morphology, as shown in Fig. 1; the classification details are depicted in Fig. 2. There are six crack types plus a background class in Dataset1:


(1) Background: no cracks are shown in the image.
(2) Hole: a crack that has a large gap or pore-like shape.
(3) IrregularCrack: a crack that expands irregularly, similar to a snake.
(4) MeshCrack: more than one crack crossing each other.
(5) ObliqueCrack: an oblique crack, with the angle between the crack and the vertical central axis exceeding 15°.
(6) TransverseCrack: a transverse crack, with the angle between the crack and the horizontal central axis not exceeding 15°.
(7) VerticalCrack: a vertical crack, with the angle between the crack and the vertical axis not exceeding 15°.

Fig. 2. Crack classification rules for VerticalCrack, ObliqueCrack, and TransverseCrack.

2.2.2. Dataset2
Dataset2 was used to train and validate the object detection framework for concrete crack detection. The images in Dataset2 originated from Dataset1 and were randomly selected and spliced into a nine-square grid of 681 × 681 pixels to simulate the coexistence of multiple cracks under real conditions, as illustrated in Fig. 3. There are 674 images in Dataset2, of which 70% are used as the Train set and the remaining 30% as the Val set. Cracks in the images of Dataset2 were labelled in the PASCAL VOC format for training and validation. The labels contain rectangular box coordinates and type labels for the concrete cracks.

Fig. 3. Images in Dataset2.

2.3. Data augmentation

Image processing techniques were utilised for data augmentation in the Train sets of Dataset1 and Dataset2 with two purposes:

(1) Dealing with the data imbalance caused by an insufficient number of cracks of a certain type. As listed in Table 2, one MeshCrack image is expanded to five images using randomly combined data augmentation operators, so the amount of data for each crack type is almost the same after augmentation.
(2) Increasing the anti-interference of CNNs in classification and object detection of concrete cracks by adding noise. CNNs have poor anti-interference if the datasets contain insufficient noise for the models to learn the differing features of cracks and noise. Noise is typically caused by weather or lighting; furthermore, UAV-captured images are vulnerable to blurring caused by winds or abrupt rolls. A code sketch of such a pipeline follows Table 2.

Table 2
Examples of data augmentation (the original-image and result columns of the printed table are images; the operator combinations are reproduced here).

Operator combination
Additive random pixels + Blend
Colour temperature transformation + Perspective transformation
Random pixel zero + Horizontal flip
Motion blur + Vertical flip
Additive Gaussian noise + Unequal scaling
Fig. 4. Structure of Conv and Dconv.


Fig. 5. Structure of IR blocks with Conv or Dconv.

Fig. 6. Basic model of a CNN composed of IR blocks.

3. Strategies for developing StairNet

3.1. Basic components of StairNet

Convs and Dconvs [16] are commonly used in CNNs to extract features from input images. Fig. 4 displays the contrasting structures of Conv and Dconv: the number of filter channels of Conv is the same as the number of input channels, whereas it decreases to one in Dconv. The use of a single channel in Dconv significantly reduces the model size of a CNN, with a small reduction in accuracy. To compensate for this decrease in accuracy, an inverted residual (IR) block was proposed in MobileNetV2 [17]. As shown in Fig. 5, the input feature maps of the IR block are dimensionally expanded through a Conv layer based on an expansion factor and reduced via another Conv layer before outputting. The middle of the IR block contains either a Dconv or a Conv layer; the IR blocks with Dconv and Conv layers are referred to as IR-D and IR-C blocks, respectively. The expanded input feature maps are extracted by the Dconv layer in IR-D blocks or by the Conv layer in IR-C blocks. BN and AF denote batch normalisation [53] and activation function, respectively. A code sketch of both block types follows.
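The sketch below is not the paper's implementation; it is a minimal PyTorch rendering of the IR-C and IR-D blocks of Fig. 5, assuming the usual MobileNetV2-style layout (1 × 1 expansion Conv, a middle Conv or Dconv, and a 1 × 1 projection Conv, each with BN and an activation).

```python
import torch
import torch.nn as nn

class IRBlock(nn.Module):
    """Inverted residual block; depthwise=True gives IR-D, False gives IR-C."""
    def __init__(self, cin, cout, expansion=4, kernel=3, stride=1, depthwise=True):
        super().__init__()
        mid = cin * expansion
        self.use_res = stride == 1 and cin == cout   # residual only if shapes match
        self.block = nn.Sequential(
            # 1x1 Conv expands channels by the expansion factor
            nn.Conv2d(cin, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # middle layer: depthwise Conv (groups=mid) for IR-D, full Conv for IR-C
            nn.Conv2d(mid, mid, kernel, stride, kernel // 2,
                      groups=mid if depthwise else 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # 1x1 Conv projects back down before outputting
            nn.Conv2d(mid, cout, 1, bias=False),
            nn.BatchNorm2d(cout),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```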
Fig. 6 depicts the structure of a basic CNN model composed of five IR blocks. The CNN feature extraction process involves downsampling the feature maps extracted from the input images; as illustrated in Fig. 6, a feature map of 112² is downsampled to 7² using IR blocks. The CNN blocks can be classified into three types according to depth: early blocks, which are near the input of the CNN (approximately from 2-fold to 4-fold downsampling, feature maps from 112² to 56²); middle blocks (approximately from 4-fold to 16-fold downsampling, feature maps from 56² to 14²); and late blocks, which are near the output of the CNN (approximately from 16-fold to 32-fold downsampling, feature maps from 14² to 7²).

3.2. Three tactics for developing StairNet

With EfficientNetV2, using Convs in early blocks and Dconvs in other blocks results in faster training; it has been demonstrated that EfficientNetV2, with these structural characteristics, significantly reduces the training time compared with state-of-the-art CNNs [45]. To develop a smaller and more efficient CNN for concrete crack classification, extended studies were conducted to determine the respective structural characteristics of the early, middle, and late blocks of CNNs. This research resulted in three tactics, and three tests, namely Test1, Test2, and Test3, were performed to verify them.

3.2.1. Evaluation indicators of the models in the extended studies
Train accuracy/loss and val accuracy/loss were selected as evaluation indicators of the classification capability, whereas model size and training time were chosen as evaluation indicators of efficiency.
Train accuracy/loss represents the CNN classification capability on the concrete cracks in the Train set. During validation, the CNN inherits the weights after training, and the Val set is not related to the Train set; therefore, the val accuracy/loss on the Val set serves as a more reliable indicator of the CNN capacity for generalisation. The accuracy is calculated as

Accuracy = \frac{\sum_{N} \mathrm{eq}(y, p)}{N}    (1)

where y is the true value of the images in the dataset (Train or Val set), p is the predicted value of the CNN, eq verifies whether the true value y is equal to the predicted value p, \sum_{N}(\cdot) counts the matching pairs of true value y and predicted value p in the dataset, and N is the number of all images in the dataset.
The loss is determined as

Loss = \frac{\sum_{steps} Loss(y, p)}{N_{steps}}    (2)

Loss(y, p) = -\sum_{c=1}^{M} y \cdot \log(p)    (3)

N_{steps} = \frac{N}{N_{batch}}    (4)

where Loss(y, p) is the loss between the predicted and true values calculated by cross-entropy; M is the number of types, which is seven (six types of cracks and the background); N is the number of images in the dataset; and N_{batch} is the number of images in a batch, taken as 16 in this study.
training time compared with state-of-the-art CNNs [45].


Fig. 7. Structures of models in Test1: (a) model1, (b) model2, and (c) model3.

Table 3
Structure and main parameters of models 1-3 (k represents the number of types in Dataset1, and stride is the kernel stride of the Conv layer).

Block           Input            Operation             Expansion factor   Output channel   AF    Stride
Input layer     224 × 224 × 3    Conv2d                \                  16               HS    2
Early blocks    112 × 112 × 16   IR, 3 × 3             4                  24               RE6   2
                56 × 56 × 24     IR, 5 × 5             3                  40               RE6   2
Middle blocks   28 × 28 × 40     IR, 3 × 3             6                  80               HS    2
                14 × 14 × 80     IR, 5 × 5             6                  112              HS    1
Late blocks     14 × 14 × 112    IR, 5 × 5             6                  160              HS    2
                7 × 7 × 160      IR, 5 × 5             6                  160              HS    1
                7 × 7 × 160      Conv2d, 1 × 1         \                  960              HS    1
                7 × 7 × 960      Avgpool [54], 7 × 7   \                  \                \     1
Classifier      1 × 1 × 960      Conv2d, 1 × 1         \                  1280             HS    1
                1 × 1 × 1280     Conv2d, 1 × 1         \                  k                \     1

3.2.2. Tactic1 for early blocks of CNNs
Test1 was implemented to verify whether the tactic in EfficientNetV2 is appropriate for concrete crack classification. Fig. 7 indicates that model1 consists entirely of IR-C blocks, model2 is composed solely of IR-D blocks, and model3 uses IR-C in the early blocks and IR-D in the middle and late blocks. Except for the composition of the IR blocks in models 1-3, the other parameters and settings were the same, as listed in Table 3. Two activation functions were used in these models: ReLU6 (RE6) and Hardswish (HS), proposed in MobileNetV3 [15]. These two activation functions have simpler derivative computations and are more quantisation friendly than the ReLU activation function [11].
Fig. 8 displays a comparison of the accuracy, loss, training time, and model size of models 1-3 on Dataset1. As shown in Fig. 8 (b), the training time and model size of model1 are 2410.2 s and 47.2 MB, which are 96% longer and 3800% larger than those of model2 and 145% longer and 3320% larger than those of model3. However, as presented in Fig. 8 (a), the three models exhibit almost equal values of accuracy and loss. Fig. 8 (b) also shows that the training time of model3 is 25% shorter than that of model2, with a mere 0.17 MB increase in model size. The results of Test1 indicate that the use of IR-C in all blocks significantly reduces the crack classification efficiency of the CNN, with little improvement in accuracy; in contrast, the CNN consisting of IR-C in early blocks and IR-D in other blocks exhibits the best performance in terms of accuracy and efficiency. The results of Test1 thus demonstrate that the tactic employed in EfficientNetV2 is effective in classifying concrete cracks. Tactic1 involves applying Convs in early blocks and Dconvs in other blocks of CNNs to save training time.

Fig. 8. Comparison of models 1–3 in Dataset1 in Test1: (a) accuracy and loss; (b) training time and model size.


Table 4
Structure and main parameters of the basic model.

Input            Operation        Expansion factor   Output channel   AF
224 × 224 × 3    Conv2d           \                  24               RE6
112 × 112 × 24   IR-D_1, 3 × 3    1                  24               RE6
56 × 56 × 24     IR-D_2, 3 × 3    1                  24               RE6
28 × 28 × 24     IR-D_3, 3 × 3    1                  24               RE6
14 × 14 × 24     IR-D_4, 3 × 3    1                  24               RE6
7 × 7 × 24       IR-D_5, 3 × 3    1                  24               RE6
7 × 7 × 24       Conv2d, 1 × 1    \                  144              HS
7 × 7 × 144      Avgpool, 7 × 7   \                  \                \
1 × 1 × 144      Conv2d, 1 × 1    \                  1280             HS
1 × 1 × 1280     Conv2d, 1 × 1    \                  k                \

3.2.3. Tactic2 for middle blocks of CNNs
Increasing the depth of a CNN can enhance its capability to extract crack features but may significantly reduce its efficiency. Test2 aims to determine the most effective depth for improving both the accuracy and efficiency in concrete crack classification. The structure of the basic model is illustrated in Fig. 6, and Table 4 lists the detailed parameters of the basic model composed of IR-D blocks (IR-D basic model). Fig. 9 depicts the block-deepened schemes based on the IR-D basic model, resulting in models 1-5: model1, with four IR-D_1 blocks deepened after IR-D_1 of the IR-D basic model; model2, deepened after IR-D_2; model3, deepened after IR-D_3; model4, deepened after IR-D_4; and model5, deepened after IR-D_5. All settings and parameters in each block of models 1-5 are the same, except for the block-deepened scheme.
Fig. 10 displays a comparison of the five models with the IR-D basic model in terms of accuracy, loss, training time, and model size on Dataset1. As shown in Figs. 10 (a) and (b), the accuracy of the IR-D basic model on both the Train and Val sets is lower than that of all the block-deepened models. That is, deepening the blocks of the CNN can improve the accuracy of crack classification but increases the training time and model size.
crack features but may significantly reduce its efficiency. Test2 aims to the accuracy of crack classification but increases the training time and

Fig. 9. Structures of models 1–5 in Test2: (a) model1, (b) model2, (c) model3, (d) model4, and (e) model5.


Fig. 10. Comparison of models 1–5 with the IR-D basic model in Dataset1 in Test2: (a) accuracy and loss; (b) training time and model size.

Fig. 11. Structures of models 1–5 in Test3.

As depicted in Fig. 10 (a), the train accuracy of all models is lower than their val accuracy. This may be because there are more images in the Train set than in the Val set, making classification in the Train set more difficult. The train accuracy increases from 68.3% in model1 to 71.4% in model4 but decreases to 70.9% in model5; the val accuracy increases from 84.6% in model1 to 87.3% in model4 but decreases to 86.7% in model5. Additionally, both the train loss and val loss decrease for models 1-4 but increase for model5. Therefore, the improvement in classification capability is greater when the middle blocks are deepened, compared with deepening the early or late blocks.
Fig. 10 (b) compares the training time and model size of models 1-5 with those of the IR-D basic model. The five block-deepened models have the same 0.844 MB size because they share the same settings and parameters in each block, except for the block-deepened scheme; this is slightly larger than the 0.821 MB size of the IR-D basic model. The training time decreases from 1014.62 s for model1 to 625.15 s for model5, all of which are higher than that of the IR-D basic model. The training time decreases sharply from model1 to model2, indicating that deepening the early blocks is inefficient. The decreasing training time from model1 to model5 is possibly due to the smaller feature map size in the late blocks, resulting in fewer parameter calculations. Considering the best trade-off between accuracy and efficiency, it is more effective for CNNs to deepen the middle blocks for the classification of concrete cracks. Therefore, Tactic2 is deepening the middle blocks of CNNs to achieve the best trade-off between accuracy and efficiency.

Fig. 12. Comparison of models 1–5 with the IR-D basic model in Dataset1 in Test3: (a) accuracy and loss; (b) training time and model size.



Fig. 13. Basic structure of StairNet, which mainly comprises Stairs 1–3, a CBAM attention mechanism, and a classifier with fully connected layers.

3.2.4. Tactic3 for late blocks of CNNs
It is time-consuming to increase the kernel size and expansion factor of the IR-D blocks to achieve higher accuracy. To verify which blocks enhanced by increasing the kernel size and expansion factor of IR-D are more effective in concrete crack classification, Test3 was implemented. As illustrated in Fig. 11, five models, namely models 1-5, were designed based on the IR-D basic model, whose parameters are presented in Table 4. Each model has one block enhanced relative to the IR-D basic model: IR-D_1 was enhanced in model1, IR-D_2 in model2, IR-D_3 in model3, IR-D_4 in model4, and IR-D_5 in model5.
Fig. 12 displays a comparison of the five block-enhanced models with the IR-D basic model in terms of accuracy, loss, training time, and model size on Dataset1. As shown in Figs. 12 (a) and (b), the accuracy and loss on the Train and Val sets of model1 are nearly the same as those of the IR-D basic model, with increased training time; therefore, enhancing the early blocks of CNNs is ineffective in concrete crack classification. The other block-enhanced models have a better classification capability at the expense of efficiency. The train accuracy increases from 65.7% in model1 to 69.4% in model5 (Fig. 12 (a)), and the val accuracy increases from 75.6% in model1 to 85.4% in model5. In contrast, the loss on the Train and Val sets decreases from model1 to model5: the former decreases from 1 to 0.91, whereas the latter decreases from 0.73 to 0.49. As depicted in Fig. 12 (b), the sizes of models 1-5 are the same, and the training time decreases from 1098.52 s for model1 to 702.38 s for model5. The results of Test3 indicate that the closer the enhancement is to the late blocks, the stronger its effect on the capability and efficiency in the classification of concrete cracks. Therefore, Tactic3 is enhancing the late blocks of CNNs by increasing the kernel size and expansion factor to improve the accuracy with less training time.
Tests 1-3 in Section 3.2 have proved that respective structural characteristics at different depths of CNNs are beneficial for improving the accuracy and efficiency of concrete crack classification. In summary, there are three tactics of CNN optimisation in concrete crack classification: 1) applying Convs in early blocks and Dconvs in other blocks of CNNs to save training time; 2) deepening the middle blocks of CNNs to achieve the best trade-off between accuracy and efficiency; and 3) enhancing the late blocks of CNNs by increasing the kernel size and expansion factor to improve the accuracy with less training time. These three tactics gave rise to a swift CNN called StairNet.

4. StairNet: An extended EfficientNetV2

The use of Convs instead of Dconvs in the early blocks of EfficientNetV2 results in higher training speeds. StairNet is an extended EfficientNetV2 with respective structural characteristics in the early, middle, and late blocks: the early, middle, and late blocks in StairNet are Stair1, Stair2, and Stair3, respectively, and the structures of the three stairs are built by considering the three tactics. Fig. 13 illustrates the modular structure of StairNet, which comprises eight components: an input layer consisting of Conv + BN + AF, Stair1, a CBAM that includes channel and spatial attention, Stair2, another CBAM, Stair3, an avgpool layer, and a classifier with fully connected layers.



Fig. 14. Visualisations of a part of feature maps extracted by Stairs 1–3.

Table 5
Structure and main parameters in each layer of StairNet (k represents the number of types in Dataset1).

Feature extraction block          Input (H × W × C)   Operator         Expansion factor   Output channel   AF    Stride
Input layer                       224 × 224 × 3       Conv2d           \                  16               HS    2
Stair1, early blocks (Tactic1)    112 × 112 × 16      Basic block_1    1                  24               RE6   2
                                  56 × 56 × 24        Basic block_1    2                  24               RE6   1
CBAM                              channel attention + spatial attention
Stair2, middle blocks (Tactic2)   56 × 56 × 24        Basic block_2    1                  48               RE6   2
                                  28 × 28 × 48        Basic block_2    1                  48               HS    1
                                  28 × 28 × 48        Basic block_2    1                  96               HS    2
                                  14 × 14 × 96        Basic block_2    1                  96               HS    1
                                  14 × 14 × 96        Basic block_2    1                  96               HS    1
CBAM                              channel attention + spatial attention
Stair3, late blocks (Tactic3)     14 × 14 × 96        Basic block_3    6                  96               HS    2
                                  7 × 7 × 96          Basic block_3    6                  96               HS    1
                                  7 × 7 × 96          Avgpool, 7 × 7   \                  \                \     1
Classifier                        1 × 1 × 96          Conv2d, 1 × 1    \                  512              HS    1
                                  1 × 1 × 512         Conv2d, 1 × 1    \                  k                \     1
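To make Table 5 concrete, the following is a hypothetical assembly of StairNet's stages in PyTorch. It reuses the IRBlock sketch from Section 3.1, simplifies Stair2's multi-branch blocks to plain IR-D (the three-branch variant is sketched in Section 4.2), stands in a Linear classifier for the 1 × 1 Conv classifier, and omits the CBAM/ECA attention modules [42,56]; channel counts and strides follow Table 5, everything else is an assumption.

```python
import torch.nn as nn

def build_stairnet(num_classes):
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),    # input layer
        nn.BatchNorm2d(16), nn.Hardswish(),
        # Stair1 (Tactic1): IR-C blocks with conventional convolutions
        IRBlock(16, 24, expansion=1, stride=2, depthwise=False),
        IRBlock(24, 24, expansion=2, stride=1, depthwise=False),
        # CBAM attention would be inserted here
        # Stair2 (Tactic2): five deepened blocks, simplified here to IR-D
        IRBlock(24, 48, expansion=1, stride=2),
        IRBlock(48, 48, expansion=1, stride=1),
        IRBlock(48, 96, expansion=1, stride=2),
        IRBlock(96, 96, expansion=1, stride=1),
        IRBlock(96, 96, expansion=1, stride=1),
        # CBAM attention would be inserted here
        # Stair3 (Tactic3): 5x5 kernel and expansion factor 6 (ECA omitted)
        IRBlock(96, 96, expansion=6, kernel=5, stride=2),
        IRBlock(96, 96, expansion=6, kernel=5, stride=1),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(96, 512), nn.Hardswish(),
        nn.Linear(512, num_classes),                              # classifier
    )
```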

Visualisations of a portion of the feature maps extracted from the input images by Stairs 1-3 were calculated, as displayed in Fig. 14. The parameters and settings in each block of StairNet, for example the operator, expansion factor, AF, and stride (stride of the kernel), are summarised in Table 5.

4.1. Stair1

Stair1 is proposed as the early block of StairNet based on Tactic1. Fig. 15 (a) shows the basic block_1 structure of Stair1, which uses IR-C, whereas Stair2 and Stair3 predominantly employ IR-D. Furthermore, the basic block_1 structure contains two variants based on two conditions, as depicted in Fig. 15 (a): the input feature maps are expanded dimensionally via a 3 × 3 Conv layer of the IR-C block if the expansion factor is not 1; otherwise, they pass through the 3 × 3 Conv layer directly.

4.2. Stair2

Stair2 is proposed as the middle block of StairNet. According to Tactic2, Stair1 has two basic block_1 structures and Stair3 has two basic block_3 structures, whereas Stair2 has five deepened basic block_2 structures, as presented in Table 5.
Fig. 15 (b) illustrates the basic block_2 structure of Stair2. This structure has two variations, each corresponding to a different kernel stride:
● When the kernel stride is 1, the input feature maps are split into two parts of equal dimensions. One part passes through a branch of IR-D and is then concatenated with the other part, which remains unaltered.
● When the kernel stride is 2, the input feature maps are duplicated three times. One copy passes through a branch of IR-D, where the feature maps are extracted at half the dimensions of the Stair2 output. The other two copies pass through branches of 5 × 5 and 3 × 3 maxpool layers, respectively, and are each reduced to one-fourth of the dimension of the Stair2 output via a 1 × 1 Conv layer. Finally, the three outputs are concatenated, as displayed in Fig. 15 (b).
In addition, the channel shuffle [55] technique is employed to enhance channel connectivity in both variations of the basic block_2 structure within Stair2 before outputting.
As shown in Fig. 15 (b), the basic block_2 structure in Stair2 has three branches when the stride is 2. This is because the feature maps in the Conv or Dconv layer begin to downsample when the stride is set to 2; to avoid missing the diverse range of crack features extracted during this process, Stair2 implements a three-branch structure when using a stride of 2. The three-branch structure comprises three feature extraction methods that operate at different scales, resulting in the extraction of diverse feature maps. Furthermore, Stair2 uses branch concatenation instead of branch addition. Concatenation preserves the feature maps from different branches, while addition combines them at the pixel level; although combining feature maps from branches may require fewer parameters and save training time, it increases the risk of data distortion during processing. To balance efficiency and accuracy, branch concatenation is implemented only in Stair2. A code sketch of the stride-2 variant follows.
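The block below is a hypothetical sketch of Stair2's stride-2 basic block_2 as described above (one IR-D branch at half the output width plus two maxpool branches at a quarter each, concatenated and channel-shuffled); it reuses the IRBlock sketch from Section 3.1, and the group count and channel splits are illustrative assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # interleave channels across groups (ShuffleNet-style shuffle [55])
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(n, c, h, w)

class Stair2DownBlock(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        half, quarter = cout // 2, cout // 4
        self.branch_ir = IRBlock(cin, half, expansion=1, stride=2)  # 1/2 of output
        self.branch_p5 = nn.Sequential(                              # 1/4 of output
            nn.MaxPool2d(5, stride=2, padding=2),
            nn.Conv2d(cin, quarter, 1, bias=False), nn.BatchNorm2d(quarter))
        self.branch_p3 = nn.Sequential(                              # 1/4 of output
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(cin, quarter, 1, bias=False), nn.BatchNorm2d(quarter))

    def forward(self, x):
        y = torch.cat([self.branch_ir(x), self.branch_p5(x), self.branch_p3(x)],
                      dim=1)                      # concatenate the three branches
        return channel_shuffle(y, groups=4)       # enhance channel connectivity
```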



Fig. 15. Basic blocks of Stairs 1–3 in StairNet: (a) basic block_1 structure in Stair1, (b) basic block_2 structure in Stair2, and (c) basic block_3 structure in Stair3.

Fig. 15 (b) also illustrates the output of Stair2 when the stride is 2. The output is composed of three colours, blue, green, and orange, which represent the concatenation of the three output feature maps extracted from the three different branches.

4.3. Stair3

Stair3 is proposed as the late block of StairNet. Fig. 15 (c) depicts the basic block_3 structure in Stair3. Table 5 shows that the basic block_3 structures are enhanced by increasing the kernel size to 5 × 5 and the expansion factor to 6, whereas these are typically 3 × 3 and 1, respectively, in the other Stairs, following Tactic3. In addition, efficient channel attention (ECA) [56] is implemented in basic block_3 to improve the feature extraction capability for concrete cracks.

4.4. Other settings

The Adam algorithm [57] is applied to optimise the internal parameters of StairNet, and batch normalisation is used to normalise the data to avoid gradient disappearance.


Fig. 16. Structure of Faster R-Stair.

The ReLU6 and HS activation functions are used to improve the nonlinear processing of data in StairNet. Additionally, a CBAM attention mechanism is implemented between Stairs 1-3 to improve the capability of StairNet. Furthermore, variations in the hyperparameter values for the expansion factor and stride within the basic block_1 and basic block_2 structures will form new structures for Stair1 and Stair2; other fine-tunings of the expansion factor and stride before training can therefore generate new StairNet structures. The structure of StairNet used in this study corresponds to the expansion factors and strides listed in Table 5.

5. StairNet in object detection framework for concrete crack detection

A Faster R-Stair framework is proposed by incorporating StairNet into the Faster R-CNN framework to enable swift crack detection. The Faster R-Stair framework consists of four main components: a generalised data transform, a backbone, a region proposal network (RPN), and a region of interest (ROI) head. These components are illustrated in Fig. 16 and described in more detail in the following sections.

5.1. Generalised data transform

The generalised data transform is a normalisation operation applied to the original images in the input dataset. To standardise the size of the images in the Train set, the largest height and width are determined by traversing all images, and a template of these dimensions is constructed. Each image is then aligned with the template by placing its upper-left corner at the same position as that of the template, and any insufficient space on the right or lower side is filled with zeros until all images match the template size (a code sketch of this step follows Section 5.2).

5.2. Backbone

The backbone extracts the features after the generalised data transform in the Faster R-CNN framework. The structure from the input to Stair3 of StairNet is utilised as the backbone of the Faster R-Stair framework; the details of StairNet are presented in Section 4. Feature maps are acquired from the backbone. It is worth noting that the interface between Stair3 and the RPN is modified to 16-fold downsampling instead of 8-fold.

5.3. RPN

The RPN consists of two components: an anchor generator and an RPN head.

5.3.1. Anchor generator
First, the anchor generator produces multiple sets of anchor boxes with varying scales based on the feature map output from the StairNet backbone. Next, each pixel in the feature map is mapped onto the original image and assigned a set of anchor boxes. Consequently, in addition to the ground-truth (GT) boxes (manually labelled true-value bounding boxes), anchor boxes are also included on the original image.
An anchor box is judged as positive according to the following rules:
(1) The anchor box has an intersection over union (IoU) overlap higher than 0.7 with any GT box. The IoU between the anchor and GT boxes is calculated as

IoU(A, G) = \frac{area(A) \cap area(G)}{area(A) \cup area(G)}    (5)

where area(A) and area(G) are the areas of the anchor and GT boxes, respectively.
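For reference, Eq. (5) amounts to the following minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(a, g):
    """Eq. (5): intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], g[0]), max(a[1], g[1])
    ix2, iy2 = min(a[2], g[2]), min(a[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    return inter / (area_a + area_g - inter)
```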
(2) When the IoU of all anchor boxes is lower than 0.7, the anchor box that has the highest IoU with the GT box is directly judged as positive.
The negative discrimination rule is as follows: an anchor box is negative if its IoU is lower than 0.3 for all GT boxes.
We randomly selected 256 positive and negative samples to calculate the loss in the RPN head.

5.3.2. RPN head
The feature map output by the StairNet backbone is processed using a 3 × 3 Conv layer, followed by two parallel 1 × 1 Conv layers with the ReLU activation function. This predicts the target score and regression parameters of the anchor boxes on the original image, which correspond to the mapped position of each pixel in the feature map.
The target score and regression parameters are

cls = [\text{Crack probability}]    (6)

t_i = [t_x, t_y, t_w, t_h]    (7)

where cls is the target score representing the crack probability predicted by the RPN head and t_i is the regression parameter of the ith anchor box predicted by the RPN head.
The anchor box is transformed into the proposal box through

x = w_a t_x + x_a, \quad y = h_a t_y + y_a, \quad w = w_a \exp(t_w), \quad h = h_a \exp(t_h)    (8)

where x, y, w and h denote the coordinates of the central position (x, y) and the width and height of the proposal box, respectively, whereas x_a, y_a, w_a and h_a denote the coordinates of the central position (x_a, y_a) and the width and height of the anchor box.
The loss of the RPN head is

Loss(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (9)

L_{cls}(p_i, p_i^*) = -[p_i^* \log(p_i) + (1 - p_i^*) \log(1 - p_i)]    (10)

L_{reg}(t_i, t_i^*) = \sum_i smooth_{L1}(t_i - t_i^*)    (11)

smooth_{L1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}    (12)

where Loss({p_i}, {t_i}) denotes the loss of the RPN head, composed of the classification loss L_{cls}(p_i, p_i^*) and the regression loss L_{reg}(t_i, t_i^*); p_i is the probability that the ith anchor is predicted as the target; p_i^* is 1 when the anchor is positive and 0 when negative; t_i is the regression parameter of the ith anchor box, whereas t_i^* is the regression parameter of the GT box corresponding to the ith anchor; N_{cls} is the number of samples in a mini-batch, which is 96 in this study; N_{reg} is the number of anchor positions; and λ balances the classification loss and the boundary-box regression loss, taken as 10 in this study.
Here, t_i^* is calculated as

t_i^* = [t_x^*, t_y^*, t_w^*, t_h^*]    (13)

t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \ln(w^*/w_a), \quad t_h^* = \ln(h^*/h_a)    (14)

where x^*, y^*, w^* and h^* denote the coordinates of the central position (x^*, y^*) and the width and height of the GT box, respectively.
Filter Proposal is a filter for selecting the proposal boxes. First, the 2000 proposal boxes with the highest crack probability are retained according to the target score of each proposal box in the image, and the others are removed. Second, proposal boxes with very small areas are removed. Finally, non-maximum suppression (NMS) [58] is used to filter the proposal boxes, and the selected proposal boxes are mapped onto the feature map output by the StairNet backbone to obtain the corresponding feature matrices.

5.4. ROI head

Each feature matrix is first downsampled to a 7 × 7 feature map using the ROI-align layer. Then, the fully connected layers (FC1-FC4) predict both the crack scores and the regression parameters for each proposal box. The crack score represents the predicted type of concrete crack, whereas the regression parameters are utilised to adjust the proposal boxes to their final predicted boundary boxes.
The coordinates and size of the final predicted boundary box are calculated as

x_p = w t_x^u + x, \quad y_p = h t_y^u + y, \quad w_p = w \cdot \exp(t_w^u), \quad h_p = h \cdot \exp(t_h^u)    (15)

where x_p, y_p, w_p and h_p denote the coordinates of the central position (x_p, y_p) and the width and height of the predicted boundary box, respectively, whereas t_x^u, t_y^u, t_w^u and t_h^u represent the regression parameters predicted by FC4.
The loss of a fully connected layer in the ROI head is

Loss(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)    (16)

L_{cls}(p, u) = -\log p_u    (17)

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} smooth_{L1}(t_i^u - v_i)    (18)

where Loss(p, u, t^u, v) denotes the loss of a fully connected layer in the ROI head, which comprises the crack classification loss and the regression-parameter loss; p is the softmax probability predicted by the classifier; u is the label of the real target class; t^u is the regression parameter predicted by FC4, namely (t_x^u, t_y^u, t_w^u, t_h^u); v is the regression parameter of the GT box corresponding to the real target, namely (v_x, v_y, v_w, v_h); [u ≥ 1] denotes the Iverson bracket; and the smooth_{L1} function is defined in Eq. (12). The Adam algorithm is used to optimise the internal parameters of the model.
Post-process detection performs the post-processing of the ROI head, including (1) calculating the final boundary-box coordinates according to the proposals and the regression parameters predicted by FC4, (2) using softmax to obtain the prediction results of the crack, (3) removing all background information, (4) removing low-probability and small-size targets, and (5) using NMS to filter the prediction results in the ROI head.
Finally, the predicted results are mapped back to the scale of the original image to complete the object detection of concrete cracks.
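A minimal sketch of the box transforms and robust loss above (Eqs. (8), (12), (14), and (15)) is given below; boxes are (cx, cy, w, h) tensors, and all names are illustrative rather than the paper's code.

```python
import torch

def smooth_l1(x):
    """Eq. (12): 0.5*x^2 where |x| < 1, |x| - 0.5 elsewhere (elementwise)."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x * x, absx - 0.5)

def decode(t, box):
    """Eqs. (8)/(15): apply regression parameters t = (tx, ty, tw, th) to a box."""
    x, y, w, h = box.unbind(-1)
    tx, ty, tw, th = t.unbind(-1)
    return torch.stack([w * tx + x, h * ty + y,
                        w * torch.exp(tw), h * torch.exp(th)], dim=-1)

def encode(gt, anchor):
    """Eq. (14): regression targets t* of a GT box w.r.t. an anchor."""
    xa, ya, wa, ha = anchor.unbind(-1)
    xg, yg, wg, hg = gt.unbind(-1)
    return torch.stack([(xg - xa) / wa, (yg - ya) / ha,
                        torch.log(wg / wa), torch.log(hg / ha)], dim=-1)
```

As a consistency check, decode(encode(gt, anchor), anchor) recovers gt, which is exactly the relationship between Eqs. (8) and (14).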


6. Experimental verification of StairNet in classification and object detection of concrete cracks

The experiment primarily consists of two stages: classification and object detection of concrete cracks. In the first stage, StairNet and several well-known general-purpose CNNs, including EfficientNetV2, GoogLeNet, ResNet34, MobileNetV3, and VGG16_BN, are trained and validated using Dataset1. In the second stage, the Faster R-Stair and Faster R-CNN frameworks with the backbones of the general-purpose CNNs from the first stage are trained and validated using Dataset2. All results are compared and presented in graphs to verify the capability and efficiency of StairNet in the classification and object detection of concrete cracks.

6.1. Training and validation results for classification of concrete cracks

6.1.1. Hyperparameter settings
As listed in Table 6, the training epochs and batch sizes of StairNet and the general-purpose CNNs for the classification of concrete cracks were set to 20 and 16, respectively. The optimiser for all CNNs was Adam, with the same learning rate.

Table 6
Settings of CNNs during training by Dataset1.

Optimiser         Adam
Learning rate     Initial value 0.005, multiplied by 33% every 3 epochs
Batch size        16
Loss function     Cross entropy
Training epochs   20

6.1.2. Evaluation indicators
Train accuracy/loss and val accuracy/loss were chosen as evaluation indicators of the classification capability, whereas model size and training time were selected as evaluation indicators of efficiency; they are defined in Section 3.2.1.
The precision and recall on the Val set were also used to evaluate the classification capability of the CNNs. Precision is the proportion of correctly predicted positive samples among all predicted positive samples; the higher the precision, the lower the probability of false alarms:

Precision = \frac{TP}{TP + FP}    (19)

Recall is the proportion of correctly predicted positive samples among all true positive samples; the higher the recall, the lower the probability of underreporting:

Recall = \frac{TP}{TP + FN}    (20)

where TP, TN, FP, and FN are as listed in Table 7. The second letter, P (positive) or N (negative), indicates the predicted result, and the first letter, T (true) or F (false), judges the predicted result:
TP: the CNN predicts that the sample is positive, and the judgement is correct (the sample is positive).
TN: the CNN predicts that the sample is negative, and the judgement is correct (the sample is negative).
FP: the CNN predicts that the sample is positive, and the judgement is incorrect (the sample is negative).
FN: the CNN predicts that the sample is negative, and the judgement is incorrect (the sample is positive).

Table 7
Meaning of TP, TN, FP, and FN.

                  Predicted positive   Predicted negative
True positive     TP                   FN
True negative     FP                   TN
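For illustration only, Eqs. (19)-(20) can be computed per class from a multi-class confusion matrix laid out as in Fig. 18 (rows = predicted label, columns = true label):

```python
import numpy as np

def per_class_precision_recall(cm):
    """cm[i, j] = count of samples with true class j predicted as class i."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=1) - tp   # predicted as class i but actually another class
    fn = cm.sum(axis=0) - tp   # actually class i but predicted as another class
    return tp / (tp + fp), tp / (tp + fn)   # precision, Eq. (19); recall, Eq. (20)
```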

Fig. 17. (a) Train accuracy, (b) train loss, (c) val accuracy, and (d) val loss curves of StairNet during training by Dataset1.


Table 8 listed in Table 8, the model size and training time of StairNet are 1.41
Classification accuracy, loss, model size, and training time of StairNet and other MB and 972.11 s, which are 91.30% smaller and 33.35% shorter than
CNNs after 20 epochs. those of MobileNetV3. Furthermore, StairNet significantly outperforms
CNN Accuracy Loss Model Training time all other CNNs in terms of efficiency, with the smallest model size and
Train Val Train Val
size (s) shortest training time. The train accuracy, val accuracy, train loss, and val
(MB) loss of StairNet are 82.2%, 95.9%, 0.5, and 0.15, respectively, which are
(%) (%)
almost the same as those of MobileNetV3 and better than those of the
StairNet 82.2 95.9 0.53 0.15 1.41 972.11
EfficientNetV2 80.8 94.8 0.61 0.21 77.8 5062.31 other CNNs.
VGG16_BN 76.9 86.4 0.75 0.61 527 14,534.98 The precision and recall of all compared CNNs using the Val set in
GoogLeNet 81.3 93 0.95 0.27 39.4 1689.68 Dataset1 were calculated. As presented in Table 9, most of the precision
ResNet34 80.9 89.2 0.61 0.32 81.3 4521.46 and recall values of StairNet are higher, for example, the precision and
MobileNetV3 83.2 95.8 0.52 0.16 16.2 1458.53
recall of MeshCrack are 0.90 and 0.94 by StairNet while they are only
0.70 and 0.88 by VGG16_BN.
A confusion matrix was adopted in the Test set to evaluate the
Table 9 generalisation capability of all the compared CNNs, as depicted in
Comparison of precision and recall of StairNet and other CNNs using Val set. Fig. 18. In the confusion matrix, each column represents the true label
StairNet Precision Recall EfficientNetV2 Precision Recall and each row represents the predicted label for the Background, Hole,
Background 1 1 Background 1.0 0.98 IrregularCrack, MeshCrack, ObliqueCrack, TransverseCrack, and Verti­
Hole 0.95 0.88 Hole 0.94 0.81 calCrack in turn. As illustrated in Fig. 18 (a), 88 in the second row and
StairNet           Precision   Recall      EfficientNetV2     Precision   Recall
IrregularCrack     0.95        0.59        IrregularCrack     0.93        0.46
MeshCrack          0.90        0.94        MeshCrack          0.86        0.90
ObliqueCrack       0.81        1.00        ObliqueCrack       0.75        0.99
TransverseCrack    0.84        0.97        TransverseCrack    0.80        0.97
VerticalCrack      0.90        0.92        VerticalCrack      0.84        0.92

MobileNetV3        Precision   Recall      GoogLeNet          Precision   Recall
Background         1.00        1.00        Background         1.00        0.92
Hole               0.95        0.90        Hole               0.42        0.82
IrregularCrack     0.91        0.65        IrregularCrack     0.89        0.49
MeshCrack          0.91        0.92        MeshCrack          0.91        0.88
ObliqueCrack       0.82        0.98        ObliqueCrack       0.72        0.87
TransverseCrack    0.88        0.97        TransverseCrack    0.82        0.60
VerticalCrack      0.88        0.92        VerticalCrack      0.87        0.58

ResNet34           Precision   Recall      VGG16_BN           Precision   Recall
Background         0.99        1.00        Background         1.00        0.25
Hole               0.92        0.81        Hole               0.39        0.91
IrregularCrack     0.96        0.43        IrregularCrack     0.91        0.38
MeshCrack          0.88        0.92        MeshCrack          0.70        0.88
ObliqueCrack       0.74        0.98        ObliqueCrack       0.76        0.87
TransverseCrack    0.78        0.97        TransverseCrack    0.89        0.83
VerticalCrack      0.87        0.90        VerticalCrack      0.85        0.56

Recall = TP / (TP + FN)    (20)

where TP, TN, FP, and FN are as listed in Table 7. The second letter, P (positive) or N (negative), indicates the predicted result; the first letter, T (true) or F (false), indicates whether that prediction is correct. Specifically:

TP: the CNN predicts that the sample is positive, and the judgement is correct (the sample is in fact positive).
TN: the CNN predicts that the sample is negative, and the judgement is correct (the sample is in fact negative).
FP: the CNN predicts that the sample is positive, and the judgement is incorrect (the sample is in fact negative).
FN: the CNN predicts that the sample is negative, and the judgement is incorrect (the sample is in fact positive).

6.1.3. Comparisons of training and validation results
The concrete crack images in Dataset1 were classified using six CNNs: StairNet, EfficientNetV2, GoogLeNet, MobileNetV3, ResNet34, and VGG16. In addition, batch normalisation was applied to VGG16 (denoted VGG16_BN) to mitigate the severe gradient vanishing encountered when training VGG16 on Dataset1. The train accuracy/loss and val accuracy/loss of the six CNNs during the training process were calculated. As shown in Fig. 17, StairNet exhibits the highest convergence speed. As shown in Fig. 18, for example, 88 in the second row and second column means that the number of Hole correctly predicted is 88, and 5 in the second row and fourth column indicates that the sample number of MeshCrack incorrectly predicted as Hole is 5. Fig. 18 shows that StairNet has almost the best comprehensive performance; for example, the number of ObliqueCrack predicted correctly is 100 by StairNet, whereas it is 87, 99, 87, 98, and 98 by VGG16_BN, EfficientNetV2, GoogLeNet, ResNet34, and MobileNetV3, respectively.
The results show that the proposed StairNet has not only an excellent classification capability but also a high efficiency in concrete crack classification.

6.2. Training and validation results for concrete crack detection

6.2.1. Hyperparameter settings
As presented in Table 10, the training epochs and batch sizes of the Faster R-Stair and Faster R-CNN frameworks with general-purpose CNN backbones for concrete crack detection were set to 20 and 4, respectively. The optimiser for all models was Adam, with the same learning rate, and a warm-up [59] was implemented during the iterations of the first training epoch. All backbones of the object detection frameworks selected the layer with 16-fold downsampling as the interface to the RPN head. The anchor sizes of the frameworks were 32, 64, 128, 256, and 512, with aspect ratios of 0.5, 1.0, and 2.0 for each size; in other words, there were 15 anchor shapes during training and detection.

6.2.2. Evaluation indicators
Average precision (AP), mean average precision (mAP), and several COCO evaluators [33] were adopted as indicators of detection capability, whereas the framework size, training time, and detection frames per second (FPS) were selected as indicators of efficiency. AP is calculated as follows:

AP = ∫₀¹ Precision(Recall) dRecall    (21)

that is, AP is the area under the precision–recall (PR) curve, which is drawn with recall as the abscissa and precision as the ordinate. Precision and recall can be calculated using Eqs. (19) and (20), but the definitions of TP, FP, and FN differ from those of the classification task:

TP: number of prediction boxes whose IoU with the corresponding ground-truth (GT) box is higher than 0.5.
FP: number of prediction boxes whose IoU with the corresponding GT box is lower than 0.5.
FN: number of GT boxes not detected.

The mAP is determined as follows:

mAP = ΣAP / n    (22)

where n is the number of detection types. FPS represents the detection speed and is calculated as follows:

FPS = N_image / Time    (23)

where N_image is the total number of detected images, and Time is the total time of image detection. A higher FPS indicates a higher detection speed.
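To make Eqs. (19)–(23) concrete, the following is a minimal NumPy sketch of the evaluation pipeline: per-class precision and recall from a confusion matrix, AP as the area under a PR curve, mAP over classes, the COCO-style average over IoU thresholds, and FPS. It assumes that confusion-matrix rows are true classes and columns are predicted classes, and that PR points arrive sorted by ascending recall; it is an illustration of the definitions above, not the evaluation code used in this study.

```python
import numpy as np

def precision_recall_per_class(cm):
    """Per-class precision and recall from a confusion matrix, where
    cm[i, j] counts samples of true class i predicted as class j
    (assumed layout). Diagonal entries are the TP counts."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # TP / (TP + FP): column sums, Eq. (19)
    recall = tp / cm.sum(axis=1)      # TP / (TP + FN): row sums, Eq. (20)
    return precision, recall

def average_precision(recalls, precisions):
    """Eq. (21): area under the PR curve, after the usual smoothing that
    makes precision non-increasing along the recall axis."""
    precisions = np.maximum.accumulate(precisions[::-1])[::-1]
    return float(np.trapz(precisions, recalls))

def mean_average_precision(aps):
    """Eq. (22): mAP as the mean AP over the n detection types."""
    return float(np.mean(aps))

def coco_style_map(maps_by_iou_threshold):
    """mAP (IoU = 0.50:0.05:0.95): mean of the per-threshold mAP values."""
    return float(np.mean(maps_by_iou_threshold))

def fps(n_images, total_seconds):
    """Eq. (23): number of detected images per second of detection time."""
    return n_images / total_seconds
```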


Fig. 18. Confusion matrices of the Test set in Dataset1 classified by (a) StairNet and the general-purpose CNNs: (b) VGG16_BN, (c) EfficientNetV2, (d) GoogLeNet, (e) ResNet34, and (f) MobileNetV3.


Table 10
Settings of the object detection frameworks during training by Dataset2.

Optimiser          Adam
Learning rate      Initial value 0.005, multiplied by 33% every 3 epochs
Batch size         4
Loss function      Cross entropy
Training epochs    20
Anchor sizes       32, 64, 128, 256, and 512
Aspect ratios      0.5, 1.0, and 2.0

6.2.3. Comparisons of training and validation results
The concrete crack images in Dataset2 were detected using the object detection frameworks, namely Faster R-Stair and the Faster R-CNN frameworks with the EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3 backbones.
Fig. 19 displays the PR curves of the six crack types detected using the Faster R-Stair framework. The AP values of all crack types detected by the Faster R-Stair framework and the other Faster R-CNN frameworks were calculated from the corresponding PR curves. As indicated in Table 11, the AP values for crack detection by the frameworks with the StairNet, MobileNetV3, and EfficientNetV2 backbones are similar to one another and superior to those with the GoogLeNet, VGG16_BN, and ResNet34 backbones. The AP values of IrregularCrack are the lowest of the six crack types for all detection frameworks; the main reason may be that this crack type shares many properties with the other types, which makes it particularly challenging for the frameworks to distinguish.
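The 15 anchor shapes implied by Table 10 arise from pairing each anchor size with each aspect ratio. A short sketch, assuming the conventional Faster R-CNN parameterisation in which each anchor keeps the area size² while its height-to-width ratio equals the aspect ratio:

```python
import math

def anchor_shapes(sizes=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate the 15 (width, height) anchor shapes of Table 10. Each
    anchor keeps the area size**2 while height/width equals the ratio."""
    shapes = []
    for size in sizes:
        for ratio in ratios:
            w = size / math.sqrt(ratio)   # wider when the ratio is < 1
            h = size * math.sqrt(ratio)   # taller when the ratio is > 1
            shapes.append((round(w, 1), round(h, 1)))
    return shapes

print(len(anchor_shapes()))   # -> 15 anchors per spatial location
```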

Fig. 19. PR curves of the six crack types detected by the Faster R-Stair framework.

Table 11
AP of each crack type detected by the Faster R-CNN frameworks with the StairNet, EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3 backbones.

Crack type         AP (%)
                   StairNet   EfficientNetV2   GoogLeNet   VGG16_BN   ResNet34   MobileNetV3
ObliqueCrack       96.74      97.0             95.4        96.63      96.59      97.31
MeshCrack          85.12      85.9             86.0        85.44      87.86      87.62
Hole               97.22      97.3             95.9        96.26      95.98      96.34
TransverseCrack    96.90      96.1             95.9        95.67      96.68      95.97
VerticalCrack      94.79      95.8             94.7        94.90      94.33      94.19
IrregularCrack     79.28      79.5             69.3        79.22      79.52      78.96
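As a quick consistency check, averaging the StairNet column of Table 11 over the six crack types (Eq. (22)) comes out close to the 91.6% mAP (IoU = 0.50) reported for the Faster R-Stair framework in Table 12; the small gap is attributable to rounding of the per-class values:

```python
aps = [96.74, 85.12, 97.22, 96.90, 94.79, 79.28]   # StairNet column, Table 11
print(sum(aps) / len(aps))   # -> 91.675, close to the 91.6% mAP in Table 12
```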


Fig. 20. (a) mAP and (b) loss and learning rate of the Faster R-Stair framework during training.
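The staircase learning-rate profile visible in Fig. 20 (b) corresponds to the step decay in Table 10 combined with the first-epoch warm-up [59] mentioned in Section 6.2.1. A sketch of such a schedule follows; the linear warm-up shape is an assumption, as the paper does not specify it:

```python
def learning_rate(step, steps_per_epoch, base_lr=0.005,
                  decay=0.33, decay_every=3):
    """Step decay from Table 10 with a first-epoch warm-up [59]. The
    linear warm-up shape is assumed, not taken from the paper."""
    if step < steps_per_epoch:                         # epoch 1: warm-up
        return base_lr * (step + 1) / steps_per_epoch
    epoch = step // steps_per_epoch
    return base_lr * decay ** (epoch // decay_every)   # x0.33 every 3 epochs
```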

Table 12
Detection capability and efficiency indicators of the Faster R-CNN frameworks with backbones of StairNet, EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3.

Faster R-CNN     Framework    Training    FPS     IoU = 0.50          IoU = 0.50:0.05:0.95   IoU = 0.75
backbone         size (MB)    time (s)    (f/s)   mAP (%)   mAR (%)   mAP (%)                mAP (%)
StairNet         72           626.9       58      91.6      97.8      65.4                   79.4
EfficientNetV2   88.1         1455.2      34      91.9      97.6      65.7                   79.8
GoogLeNet        599          2470.4      26      89.3      97.5      62.1                   73.3
VGG16_BN         334          5750.0      11      91.1      97.8      63.4                   79.5
ResNet34         260          1747.0      31      91.3      97.4      64.7                   78.5
MobileNetV3      88           1950.0      15      91.0      97.9      65.0                   79.7
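The last three columns of Table 12 hinge on the IoU between a prediction box and its GT box. A minimal sketch of the IoU computation and of the ten thresholds behind the IoU = 0.50:0.05:0.95 column:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# Ten thresholds behind the IoU = 0.50:0.05:0.95 column of Table 12:
thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95
```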

The mAP of the Faster R-Stair framework during training was calculated using AP, as shown in Fig. 20 (a). In the early stage of the training process, the mAP of the framework rises sharply; the rising speed slows down after the tenth epoch, and the curve finally converges. The loss and learning rate of the Faster R-Stair framework are depicted in Fig. 20 (b). The loss of the framework decreases significantly in the first ten epochs, and then the curve converges gradually. The learning rate of this framework decreases as the training epoch increases.
Indicators such as the mAP, FPS, and training time of the Faster R-Stair framework and the other five Faster R-CNN frameworks were calculated, as summarised in Table 12. The first three columns of Table 12 present the comparative efficiency results of the frameworks. The framework size, training time, and FPS of the Faster R-Stair framework are 72 MB, 626.94 s, and 58 f/s, respectively. Compared with the Faster R-CNN frameworks with backbones of EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3, the Faster R-Stair framework is 18.27, 87.98, 78.44, 72.31, and 18.18% smaller in framework size, respectively; 56.92, 74.62, 89.10, 64.11, and 67.85% shorter in training time; and 41.38, 55.17, 81.03, 46.55, and 74.14% faster in detection speed. The results indicate that the Faster R-Stair framework is significantly superior to the other frameworks in terms of efficiency.

Fig. 21. Bubble diagrams based on the mAP, mAR, FPS, framework size, and training time of the Faster R-CNN frameworks with backbones of StairNet, EfficientNetV2, VGG16_BN, ResNet34, GoogLeNet, and MobileNetV3.


Fig. 22. Detection results of concrete cracks using the Faster R-Stair framework: (a) original image; (b) detection results.

The mAP and mean average recall (mAR) at IoU = 0.50 are most commonly used to verify detection capability. The mAR is an evaluation indicator for multi-label models that averages the recall of each label; its calculation is similar to that of mAP [60], with recall defined in Eq. (20). Furthermore, other indicators from the COCO evaluator were selected, as summarised in Table 12: the mAP at IoU = 0.75 and at IoU = 0.50:0.05:0.95, shown in the last two columns of Table 12. The mAP at IoU = 0.75 counts a detection as correct only when the IoU exceeds 0.75, which demands a more stringent detection capability of the frameworks than the mAP at IoU = 0.50. To further verify the robustness of the frameworks, the mAP values at IoU thresholds of 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95 were calculated and then averaged to obtain the mAP at IoU = 0.50:0.05:0.95. As shown in Table 12, the mAP (IoU = 0.50) of the Faster R-Stair framework reaches 91.6%, nearly the same as that of the Faster R-CNN framework with the EfficientNetV2 backbone and better than those of the other frameworks. Meanwhile, the mAR (IoU = 0.50) and the mAP (IoU = 0.50:0.05:0.95 and IoU = 0.75) of the Faster R-Stair framework reach comparably high values. The results indicate that the Faster R-Stair framework performs similarly to the Faster R-CNN framework with the EfficientNetV2 backbone and slightly outperforms the other Faster R-CNN frameworks in terms of detection capability. This may be because (1) all frameworks are well fine-tuned and (2) the features of the concrete cracks and backgrounds in Dataset2 are adequate, so all comparison frameworks converge and achieve a high detection capability after training.
Fig. 21 displays bubble diagrams that compare the detection capability and efficiency of the detection frameworks. The diagrams are based on the mAP (IoU = 0.50), mAR (IoU = 0.50), FPS, framework size, and training time. As illustrated in Fig. 21 (a), the ordinate is mAP (IoU = 0.50), the abscissa is FPS, and the bubble size represents the framework size. Because some frameworks are large, their bubbles would cover each other; the bubble sizes of all frameworks are therefore reduced by a factor of 10. Fig. 21 (a) shows that the black bubble, which represents the Faster R-Stair framework, is the rightmost and smallest of all bubbles; in other words, the Faster R-Stair framework has the highest detection speed and the smallest framework size. As depicted in Fig. 21 (b), the black bubble is the leftmost of all bubbles, representing the shortest training time among the compared frameworks. The mAP (IoU = 0.50) and mAR (IoU = 0.50) of the Faster R-Stair framework are only 0.3% and 0.1% lower than those of the frameworks with the EfficientNetV2 and MobileNetV3 backbones, respectively, and are better than those of the others.
All the above comparisons indicate that the proposed Faster R-Stair framework not only has an impressive crack-detection efficiency but also maintains a high detection capability. Fig. 22 displays the detection results of concrete crack images obtained with the Faster R-Stair framework, exhibiting its effectiveness in accurately detecting concrete cracks.

7. Software with illustrative applications

The FS_CCDP, based on Python 3.6 and the PyQt5 visualisation module, was designed to apply the proposed Faster R-Stair framework under real-world conditions.

7.1. UAV-based image collection


UAVs have become a popular instrument for object detection in many studies owing to their high mobility, convenience, and high efficiency. The DJI UAV (Avata), depicted in Fig. 23, was remotely controlled to collect images of concrete roads at a university in Nanjing, China to further examine the performance of the FS_CCDP under real-world conditions.
The UAV-captured images were too large for fast feature extraction by computers; even CUDA acceleration on an Nvidia Quadro P2200 could not process them effectively. Therefore, these images were reshaped to smaller resolutions using a function in the FS_CCDP.

Fig. 23. DJI UAV for image collection.
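The reshaping step can be as simple as a proportional downscale before inference. A sketch using Pillow, where the 1000-pixel cap and the file paths are illustrative values rather than FS_CCDP's actual settings:

```python
from PIL import Image

def downscale_for_detection(src_path, dst_path, max_side=1000):
    """Proportionally resize a UAV image so its longer side is at most
    max_side pixels; the cap and paths are illustrative values."""
    img = Image.open(src_path)
    scale = max_side / max(img.size)
    if scale < 1.0:
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.BICUBIC)
    img.save(dst_path)
```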
7.2. FS_CCDP software

Fig. 24 displays the structure of the FS_CCDP, which comprises three modules: Training, Validation (Val), and Detection. Fig. 25 illustrates the operation interfaces of the FS_CCDP. Its main functions are as follows:

(1) Log in: To access the main interface, users log in with their username and password, as shown in Fig. 25 (a). Once logged in, they can navigate to the different modules by clicking the Training, Val/Detection, or View Results buttons on the left side of the main interface.
(2) Training: Fig. 25 (b) depicts the Training module interface. To begin training, data are first imported and the essential hyperparameters are fine-tuned. In the upper section of the interface, the path for saving the trained weight of the Faster R-Stair framework is specified. The lower part of the interface presents the values of mAP, learning rate, and loss during training. Once ready to train, the Train button is clicked and the trained weight file is saved to the chosen path.
(3) Val: Fig. 25 (c) displays the interface of the Val/Detection module. To verify whether the weight is well trained, the file path of the trained weight is selected for loading, the validation dataset is chosen, and the necessary hyperparameters are input in the upper part of the interface. Once completed, the Validate button is clicked to view the COCO-evaluator results presented in a text box at the centre of the interface.
(4) Detection: The detection and validation modules share the input information displayed on the upper part of the interface. To initiate detection, the path of the images to be detected is selected and a location to save them after detection is chosen. The image resolution can be resized as required before clicking the Detect button.
(5) View Results: To view the detection results, the View Results button on the left side of the interface is clicked. This opens a new window, as shown in Fig. 25 (d). To select which images to view, the Choose button is clicked and the left or right arrow is used to browse the detected results.

7.3. UAV-based crack detection

First, the FS_CCDP was installed on both a laptop and a desktop PC. Next, the Training and Validation modules of the FS_CCDP were run with Dataset2 on the desktop PC to obtain a well-trained weight for the Faster R-Stair framework. This well-trained weight was then transferred from the desktop PC to the laptop. Consequently, without any further fine-tuning or training operations, the Detection module of the FS_CCDP on the laptop could inherit this well-trained weight and be used for concrete crack detection. With this setup, swift crack detection could easily be achieved under real-world conditions using only a laptop.
Fig. 26 depicts an example of the detection results of UAV-captured road images at a university using the FS_CCDP. The detection results show that the FS_CCDP can realise high-performance detection of concrete cracks in real-world conditions.

Fig. 24. Diagram of FS_CCDP software.


Fig. 25. Interface and function display of FS_CCDP software: (a) Log in, (b) Training module, (c) Validation and Detection module, and (d) View Results
of detection.

Fig. 26. Crack detection of concrete road in a university using the FS_CCDP software.

8. Conclusions and future work

In this paper, an extended EfficientNetV2 with illustrated merits for concrete crack detection, called StairNet, is proposed. StairNet exhibits high efficiency and accuracy in both the classification and the object detection of concrete cracks. The main conclusions of this study are as follows:
1. Inspired by the structural characteristics of the early blocks of EfficientNetV2, extended research was conducted, resulting in three tactics for building a fast CNN for concrete crack classification: 1) apply Convs in the early blocks and Dconvs in the other blocks of CNNs to save training time; 2) deepen the middle blocks of CNNs to achieve the best trade-off between accuracy and efficiency; and 3) enhance the late blocks of CNNs by increasing the kernel size and expansion factor to improve accuracy with less training time.


2. StairNet uses the three tactics, which correspond to Stair1, Stair2, and Stair3 in the network. Furthermore, a Faster R-Stair framework was proposed by applying StairNet as the backbone of a Faster R-CNN framework for concrete crack detection.
3. Comparative experiments were conducted to confirm the improvement achieved by StairNet in the classification and object detection of concrete cracks. The results show that, compared with general-purpose CNNs such as EfficientNetV2, GoogLeNet, VGG16_BN, ResNet34, and MobileNetV3, the proposed StairNet is significantly smaller in size and more efficient while maintaining high accuracy in both the classification and the object detection of concrete cracks.
4. A software platform called FS_CCDP was built on Faster R-Stair with Python and PyQt5. Swift automatic crack detection of concrete roads at a university in Nanjing, China, was conducted using the FS_CCDP and a UAV. The detection process was convenient, with high speed and good detection results.
The structures of the early, middle, and late blocks of a general-purpose CNN are similar. However, the comparison results and the software application in this study suggest that CNNs whose early, middle, and late blocks have their respective structural characteristics achieve higher efficiency. Compared with current state-of-the-art general-purpose CNNs, StairNet classifies concrete cracks equally well while having a smaller model size and a shorter training time. This indicates that StairNet has lower hardware requirements and that software based on StairNet responds faster. Therefore, StairNet has broad application value and commercial potential for the classification, object detection, and semantic segmentation of concrete cracks.
The proposed StairNet may not achieve high accuracy in other classification or object detection tasks, because its depth-stage-dependent structural characteristics were developed for classifying concrete cracks. However, the strategies used to build StairNet may be applicable to other tasks.
Further research is required to investigate the effectiveness of the proposed modular network with different structures in different blocks for semantic segmentation tasks, where the goal is to accurately label each pixel in an image as belonging to a specific class (e.g. crack or non-crack). This may involve testing the model on more complex backgrounds or on images with smaller cracks, as well as exploring its effectiveness in detecting different types of cracks.

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Data availability

Data will be made available on request.

Acknowledgements

This study was supported by the Jiangsu International Joint Research and Development Program under Grant No. BZ2022010, the Nanjing International Joint Research and Development Program under Grant No. 202112003, the Fundamental Research Funds for the Central Universities under Grant No. B220204002, and the National Natural Science Foundation of China for Young International Scientists (No. 52250410359).

References

[1] G. Song, N. Ma, H.N. Li, Applications of shape memory alloys in civil structures, Eng. Struct. 28 (9) (2006) 1266–1274, https://doi.org/10.1016/j.engstruct.2005.12.010.
[2] G.B. Song, H.C. Gu, Y.L. Mo, Smart aggregates: multi-functional sensors for concrete structures – a tutorial and a review, Smart Mater. Struct. 17 (3) (2008), https://doi.org/10.1088/0964-1726/17/3/033001.
[3] M.B. Otieno, M.G. Alexander, H.D. Beushausen, Corrosion in cracked and uncracked concrete – influence of crack width, concrete quality and crack reopening, Mag. Concr. Res. 62 (6) (2010) 393–404, https://doi.org/10.1680/macr.2010.62.6.393.
[4] M. Bayat, I. Pakar, G. Domairry, Recent developments of some asymptotic methods and their applications for nonlinear vibration equations in engineering problems: a review, Latin Am. J. Solids Struct. 9 (2) (2012) 145–234, https://doi.org/10.1590/S1679-78252012000200003.
[5] W. Xu, W.D. Zhu, S.A. Smith, M.S. Cao, Structural damage detection using slopes of longitudinal vibration shapes, J. Vibrat. Acoust. Transact. ASME 138 (3) (2016), https://doi.org/10.1115/1.4031996.
[6] M. Knezevic, M. Cvetkovska, T. Hanak, L. Braganca, A. Soltesz, Artificial neural networks and fuzzy neural networks for solving civil engineering problems, Complexity (2018), https://doi.org/10.1155/2018/8149650.
[7] Z.Q. Zhao, P. Zheng, S.T. Xu, X.D. Wu, Object detection with deep learning: a review, IEEE Transact. Neural Netw. Learn. Syst. 30 (11) (2019) 3212–3232, https://doi.org/10.1109/TNNLS.2018.2876865.
[8] M. Landauskas, M.S. Cao, M. Ragulskis, Permutation entropy-based 2D feature extraction for bearing fault diagnosis, Nonlin. Dynam. 102 (3) (2020) 1717–1731, https://doi.org/10.1007/s11071-020-06014-6.
[9] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324, https://doi.org/10.1109/5.726791.
[10] M.Z. Alom, T.M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M.S. Nasrin, B.C. Van Esesn, A.A.S. Awwal, V.K. Asari, The history began from AlexNet: a comprehensive survey on deep learning approaches, in: arXiv preprint, 2018, https://doi.org/10.48550/arXiv.1803.01164.
[11] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90, https://doi.org/10.1145/3065386.
[12] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR), 2014, pp. 1–14, https://doi.org/10.48550/arXiv.1409.1556.
[13] Z. Wu, C. Shen, A. Van Den Hengel, Wider or deeper: revisiting the ResNet model for visual recognition, Pattern Recogn. 90 (2019) 119–133, https://doi.org/10.1016/j.patcog.2019.01.006.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1–9, https://doi.org/10.1109/CVPR.2015.7298594.
[15] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Searching for MobileNetV3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019, pp. 1314–1324, https://doi.org/10.1109/ICCV.2019.00140.
[16] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications, in: arXiv preprint, 2017, https://doi.org/10.48550/arXiv.1704.04861.
[17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 4510–4520, https://doi.org/10.48550/arXiv.1801.04381.
[18] D. Dais, I.E. Bal, E. Smyrou, V. Sarhosis, Automatic crack classification and segmentation on masonry surfaces using convolutional neural networks and transfer learning, Autom. Constr. 125 (2021), https://doi.org/10.1016/j.autcon.2021.103606.
[19] A. Bali, S.N. Singh, A review on the strategies and techniques of image segmentation, in: 2015 5th International Conference on Advanced Computing & Communication Technologies (ACCT), IEEE Computer Society, Haryana, India, 2015, pp. 113–120, https://doi.org/10.1109/ACCT.2015.63.
[20] H. Oliveira, P.L. Correia, Automatic road crack segmentation using entropy and image dynamic thresholding, in: 2009 17th European Signal Processing Conference, IEEE, Glasgow, UK, 2009, pp. 622–626.
[21] A. Ayenu-Prah, N. Attoh-Okine, Evaluating pavement cracks with bidimensional empirical mode decomposition, EURASIP J. Adv. Signal Process. (2008), https://doi.org/10.1155/2008/861701.
[22] J. Zhou, P.S. Huang, F.P. Chiang, Wavelet-based pavement distress detection and evaluation, Opt. Eng. 45 (2) (2006), https://doi.org/10.1117/1.2172917.
[23] Z. Zhou, J. Zhang, C. Gong, Automatic detection method of tunnel lining multi-defects via an enhanced you only look once network, Comput. Aided Civ. Inf. Eng. 37 (6) (2022) 762–780, https://doi.org/10.1111/mice.12836.
[24] R. Fan, M.J. Bocus, Y.L. Zhu, J.H. Jiao, L. Wang, F.L. Ma, S.S. Cheng, M. Liu, Road crack detection using deep convolutional neural network and adaptive thresholding, in: 2019 30th IEEE Intelligent Vehicles Symposium (IV19), 2019, pp. 474–479, https://doi.org/10.48550/arXiv.1904.08582.
[25] H. Kim, E. Ahn, M. Shin, S.H. Sim, Crack and noncrack classification from concrete surface images using machine learning, Struct. Health Monitor. Int. J. 18 (3) (2019) 725–738, https://doi.org/10.1177/1475921718768747.
[26] R.H. Fu, H. Xu, Z.J. Wang, L. Shen, M.S. Cao, T.W. Liu, D. Novak, Enhanced intelligent identification of concrete cracks using multi-layered image preprocessing-aided convolutional neural networks, Sensors 20 (7) (2020), https://doi.org/10.3390/s20072021.


[27] Y.J. Cha, W. Choi, O. Büyüköztürk, Deep learning-based crack damage detection using convolutional neural networks, Comput. Aided Civ. Inf. Eng. 32 (5) (2017) 361–378, https://doi.org/10.1111/mice.12263.
[28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, 2016, pp. 21–37, https://doi.org/10.48550/arXiv.1512.02325.
[29] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1440–1448, https://doi.org/10.48550/arXiv.1504.08083.
[30] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28, 2015, https://doi.org/10.48550/arXiv.1506.01497.
[31] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 779–788, https://doi.org/10.48550/arXiv.1506.02640.
[32] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 7263–7271, https://doi.org/10.1109/CVPR.2017.690.
[33] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, in: arXiv preprint, 2018, https://doi.org/10.48550/arXiv.1804.02767.
[34] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, in: arXiv preprint, 2020, https://doi.org/10.48550/arXiv.2004.10934.
[35] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, YOLOX: exceeding YOLO series in 2021, in: arXiv preprint, 2021, https://doi.org/10.48550/arXiv.2107.08430.
[36] Y.J. Cha, W. Choi, G. Suh, S. Mahmoudkhani, O. Büyüköztürk, Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types, Comput. Aided Civ. Inf. Eng. 33 (9) (2018) 731–747, https://doi.org/10.1111/mice.12334.
[37] V. Mandal, L. Uong, Y. Adu-Gyamfi, Automated road crack detection using deep convolutional neural networks, in: 2018 IEEE International Conference on Big Data (Big Data), IEEE, Seattle, WA, USA, 2018, pp. 5212–5215, https://doi.org/10.1109/BigData.2018.8622327.
[38] Y. Nirkin, L. Wolf, T. Hassner, HyperSeg: patch-wise hypernetwork for real-time semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070, https://doi.org/10.48550/arXiv.2012.11582.
[39] L. Zhu, D. Ji, S. Zhu, W. Gan, W. Wu, J. Yan, Learning statistical texture for semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021, pp. 12537–12546, https://doi.org/10.1109/CVPR46437.2021.01235.
[40] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3431–3440, https://doi.org/10.1109/TPAMI.2016.2572683.
[41] Y. Zhang, J. Huang, F. Cai, On bridge surface crack detection based on an improved YOLO v3 algorithm, IFAC-PapersOnLine 53 (2) (2020) 8205–8210, https://doi.org/10.1016/j.ifacol.2020.12.1994.
[42] S. Woo, J. Park, J.-Y. Lee, I.S. Kweon, CBAM: convolutional block attention module, in: Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 3–19, https://doi.org/10.48550/arXiv.1807.06521.
[43] J. Tang, Y. Mao, J. Wang, L. Wang, Multi-task enhanced dam crack image detection based on Faster R-CNN, in: 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), IEEE, Xiamen, China, 2019, pp. 336–340, https://doi.org/10.1109/ICIVC47709.2019.8981093.
[44] M. Tan, Q. Le, EfficientNet: rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 2019, pp. 6105–6114, https://doi.org/10.48550/arXiv.1905.11946.
[45] M. Tan, Q. Le, EfficientNetV2: smaller models and faster training, in: International Conference on Machine Learning, PMLR, 2021, pp. 10096–10106, https://doi.org/10.48550/arXiv.2104.00298.
[46] Z. Bai, T. Liu, D. Zou, M. Zhang, A. Zhou, Y. Li, Image-based reinforced concrete component mechanical damage recognition and structural safety rapid assessment using deep learning with frequency information, Autom. Constr. 150 (2023) 104839, https://doi.org/10.1016/j.autcon.2023.104839.
[47] K. Chaiyasarn, A. Buatik, S. Likitlersuang, Concrete crack detection and 3D mapping by integrated convolutional neural networks architecture, Adv. Struct. Eng. 24 (7) (2021) 1480–1494, https://doi.org/10.1177/1369433220975574.
[48] K. Jang, N. Kim, Y.K. An, Deep learning-based autonomous concrete crack evaluation through hybrid image scanning, Struct. Health Monitor. Int. J. 18 (5–6) (2019) 1722–1737, https://doi.org/10.1177/1475921718821719.
[49] Z. Yu, YOLO V5s-based deep learning approach for concrete cracks detection, in: SHS Web of Conferences 144, EDP Sciences, 2022, p. 03015, https://doi.org/10.1051/shsconf/202214403015.
[50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z.M. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J.J. Bai, S. Chintala, PyTorch: an imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NIPS 2019), Vancouver, Canada, 2019, https://doi.org/10.48550/arXiv.1912.01703.
[51] L. Guo, R. Li, B. Jiang, X. Shen, Automatic crack distress classification from concrete surface images using a novel deep-width network architecture, Neurocomputing 397 (2020) 383–392, https://doi.org/10.1016/j.neucom.2019.08.107.
[52] H. Xu, X. Su, Y. Wang, H. Cai, K. Cui, X. Chen, Automatic bridge crack detection using a convolutional neural network, Appl. Sci. 9 (14) (2019) 2867, https://doi.org/10.3390/app9142867.
[53] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, Lille, France, 2015, pp. 448–456, https://doi.org/10.48550/arXiv.1502.03167.
[54] H. Gholamalinezhad, H. Khosravi, Pooling methods in deep neural networks, a review, in: arXiv preprint, 2020, https://doi.org/10.48550/arXiv.2009.07485.
[55] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: an extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6848–6856, https://doi.org/10.48550/arXiv.1707.01083.
[56] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, ECA-Net: efficient channel attention for deep convolutional neural networks, in: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, https://doi.org/10.1109/CVPR42600.2020.01155.
[57] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: 3rd International Conference for Learning Representations, San Diego, 2014, https://doi.org/10.48550/arXiv.1412.6980.
[58] J. Hosang, R. Benenson, B. Schiele, Learning non-maximum suppression, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 4507–4515, https://doi.org/10.48550/arXiv.1705.02950.
[59] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–778, https://doi.org/10.48550/arXiv.1512.03385.
[60] T.-Y. Lin, P. Dollár, R.B. Girshick, K. He, B. Hariharan, S.J. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, https://doi.org/10.1109/CVPR.2017.106.
