Automatic Pixel-Level Multiple Damage Detection of Concrete Structure Using Fully Convolutional Network
Computer vision-based methods for damage detection primarily adopt image processing techniques (IPTs). IPTs can detect some specific types of structural damage, such as concrete cracks (Abdel-Qader, Abudayyeh, & Kelly, 2003; Fujita, Mitani, & Hamamoto, 2006; Nishikawa, Yoshida, Sugiyama, & Fujino, 2012), concrete spalling (German, Brilakis, & DesRoches, 2012), and potholes and cracks in asphalt pavement (Koch & Brilakis, 2011; Wang, Zhang, Wang, Braham, & Qiu, 2018; Zhang et al., 2016a). To identify or represent damages in images, IPTs rely on strong assumptions. For instance, cracks can be found because crack pixels are thinner than other textural patterns and darker than the background (Cheng, Chen, Glazier, & Hu, 1999; Yamaguchi, Nakamura, Saegusa, & Hashimoto, 2008). With respect to morphology, the simplest method for detecting damage is to use a histogram and a threshold (Kirschke & Velinsky, 1992; Oliveira & Correia, 2009). To increase the robustness of damage detection, general global transforms and local edge detectors have been applied to damage contour detection, such as the fast Haar transform (FHT), the fast Fourier transform (FFT), and the Sobel and Canny edge detectors. According to Abdel-Qader et al. (2003), the FHT method is relatively more reliable for crack detection, and its accuracy is potentially improvable. This study was followed by several proposed improvements (Huang & Xu, 2006; Li, Zhao, Du, Ru, & Zhang, 2017; Sinha & Fieguth, 2006; Subirats, Dumoulin, Legeay, & Barba, 2006; Ying & Salari, 2010; Zalama, Gómez-García-Bermejo, Medina, & Llamas, 2014). Although IPTs-based damage detection is effective and fast, it can detect only one damage type, and its robustness leaves much to be desired when noise, primarily from lighting and distortion, substantially affects the results (Koziarski & Cyganek, 2017; Ziou & Tabbone, 1998). One possible route for removing the noise is denoising techniques, such as total variation denoising (Rudin, Osher, & Fatemi, 1992). However, these techniques have little effect because the images are taken under extensively varying real-world conditions.
To improve the adaptability of IPTs-based methods to real-world situations, machine learning (ML)-based approaches have been used to provide feasible solutions for damage detection (Amezquita-Sanchez & Adeli, 2015; Butcher et al., 2014; Jiang & Adeli, 2007). To detect specific damage, these approaches first extract damage features from images and then classify the extracted features using ML algorithms. The artificial neural network (ANN), a typical supervised ML model, was developed to detect concrete cracks and other structural damage (Lee & Lee, 2004; Moon & Kim, 2011; Moselhi & Shehab-Eldeen, 2000). However, owing to hardware limitations on computational capability, the architectures of ANNs were simple, which led to insufficient damage detection. Therefore, researchers resorted to other ML-based methods for better detection results. The Support Vector Machine (SVM) has been used for detecting cracks (Gavilán et al., 2011; O'Byrne, Schoefs, Ghosh, & Pakrashi, 2013; Zou, Cao, Li, Mao, & Wang, 2012), loose bolts (Cha, You, & Choi, 2016), rusting (Chen, Shen, Lei, & Chang, 2012), and spalling (German et al., 2012). In addition, the restricted Boltzmann machine has been applied to structural health monitoring (Rafiei & Adeli, 2017, 2018; Rafiei, Khushefati, Demirboga, & Adeli, 2017). These approaches first extract features from images using IPTs and then evaluate whether the extracted features indicate damage. However, the results of the aforementioned approaches are inevitably affected by flawed hand-picked feature extraction in the IPTs.
Although many types of ML-based methods have been developed and adapted to research and industrial fields, convolutional neural networks (CNNs) stand out in image classification and object recognition (Krizhevsky, Sutskever, & Hinton, 2012; Ren, He, Girshick, & Sun, 2017). Deep learning-based methods have been developed to detect railway defects (Soukup & Huber-Mörk, 2014), road cracks (Zhang et al., 2017; Zhang, Yang, Zhang, & Zhu, 2016b), concrete cracks (Cha, Choi, & Büyüköztürk, 2017a; Zhao & Li, 2017), and other damage (Lin, Nie, & Ma, 2017). Undeniably, these methods achieve excellent performance in realistic situations when only one damage type is detected. However, the CNN method has to adopt a sliding-window technique to localize the detected damage (Cha et al., 2017a; Xue & Li, 2018). Subsequently, Cha et al. proposed a Faster Region-based Convolutional Neural Network (Faster R-CNN) for multiple damage types (Cha, Choi, Suh, Mahmoudkhani, & Büyüköztürk, 2017b; Ren et al., 2017). However, the Faster R-CNN method still detects multiple damages at the grid-cell level, which means that the images have to be cut into small patches and the damage characteristics they contain cannot be detected directly.
To achieve higher detection performance for multiple damage types, a Fully Convolutional Network (FCN)-based method is used here to detect damage at the pixel level. The FCN is an end-to-end, pixel-to-pixel convolutional network for semantic segmentation (Chen, Papandreou, Kokkinos, Murphy, & Yuille, 2015; Long, Shelhamer, & Darrell, 2015). It predicts the class of each pixel by adopting a deconvolution layer to upsample the last convolutional layer. Therefore, the FCN can be treated as an extended CNN in which the prediction has been converted from a class number to a semantic segmentation image. Compared with nonpixel methods, a pixel-level method can provide the class of each pixel of a damage image, and therefore the damage features can be completely separated from the background. Based on the separated results, a series of postprocessing techniques can then be applied to extract specific damage features (e.g., a crack width). At present, the FCN method has been used to detect concrete cracks (Ni, Zhang, & Chen, 2018; Yang, Li, Yu, Luo, & Huang, 2018). In this study, we propose a novel FCN to detect four types of concrete damage.
Concatenation layers appear between Maxpool1 and Conv2 and are used to concatenate features. When concatenating features, the feature maps from previous layers are stacked by channel; that is, the number of output channels of a concatenation layer is the sum of its input channels. Based on this concatenation, the DenseNet-121 alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
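As a minimal sketch of this channel-wise stacking (PyTorch is used purely for illustration; the tensors and channel counts below are invented, not the actual layers of our network):

    import torch

    # Feature maps from two previous layers: (batch, channels, height, width)
    x1 = torch.randn(1, 64, 56, 56)
    x2 = torch.randn(1, 32, 56, 56)

    # A concatenation layer stacks feature maps along the channel axis,
    # so the output channel count is the sum of the input channel counts.
    out = torch.cat([x1, x2], dim=1)
    print(out.shape)  # torch.Size([1, 96, 56, 56]); 64 + 32 = 96 channels

This is the mechanism that lets every layer in a dense block see the features of all preceding layers without summing them away.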
To achieve good deconvolution performance, all the average pooling layers are changed into max pooling layers, and the global average pooling in layer 14 is also replaced by a max pooling layer with a kernel size of 2 × 2 and a stride of 2. The final classifier layer of the DenseNet-121 is discarded, and the fully connected layer is converted to a convolution layer followed by a dropout layer with a dropout rate of 0.5. Convolutions with a 1 × 1 kernel and five output channels (layers 18, 21, 24, 27, and 30) are appended to predict the scores for each of the classes (cracks, spalling, efflorescence, holes, and the background) at each of the previous output locations, followed by a deconvolution layer to upsample the previous outputs to pixel-dense outputs. The FCN fuses predictions from the final layer of DenseNet-121, all the pooling layers, and the first convolution layer. In each deconvolution layer, the previous output is doubled in size by upsampling with a stride of 2. Finally, the output size of the FCN is the same as the input size.
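The following is a minimal sketch of this score-and-fuse pattern (PyTorch for illustration; the channel counts are assumptions, and the element-wise addition follows the standard FCN fusion rather than being confirmed as our exact implementation):

    import torch
    import torch.nn as nn

    num_classes = 5  # cracks, spalling, efflorescence, holes, background

    deep = torch.randn(1, 1024, 16, 16)    # assumed deeper, coarser feature map
    shallow = torch.randn(1, 512, 32, 32)  # assumed shallower, finer feature map

    # 1 x 1 convolutions predict per-class scores at every location.
    score_deep = nn.Conv2d(1024, num_classes, kernel_size=1)
    score_shallow = nn.Conv2d(512, num_classes, kernel_size=1)

    # A stride-2 transposed convolution doubles the spatial size of the scores.
    upsample2x = nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=4, stride=2, padding=1)

    # Upsampled deep scores are fused with shallow scores (FCN-style addition).
    fused = upsample2x(score_deep(deep)) + score_shallow(shallow)
    print(fused.shape)  # torch.Size([1, 5, 32, 32])

Repeating this doubling until the score map matches the input size yields the pixel-dense output described above.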
2.2 Convolutional layer (Conv)
In CNNs, a convolutional layer performs the convolution operation using a set of kernels (filters) with learnable weights, as shown in Figure 4. The depth of the convolutional kernels is always equal to the depth of the convolutional layer inputs, and their height and width, which are identical, are generally smaller than those of the inputs. The convolution is implemented between the inputs and the kernels, where each kernel slides over the input with a specific step size. The sliding step size is defined as the stride, which is usually identical in the height and width directions. The computed convolution values of each kernel are added together with a bias to generate the results of that kernel. These results are amalgamated to produce the spatial output of the convolutional layer. To maintain the output size, zero-padding (Pad) is usually applied to the input. The output size of a convolutional layer depends on the input size, Pad, kernel size, and stride, and can be calculated using the equation shown in Figure 4.
FIGURE 4 Convolutional layer example: Output size = (Input size + 2 × Pad − Kernel size) / Stride + 1; for instance, 3 = (3 + 2 × 1 − 3) / 1 + 1
After the convolutional layer, a nonlinear operation is implemented by applying a nonlinear activation function to the convolution results. In ANNs, typical nonlinear activation functions such as sigmoid, tanh, and arctan are adopted, but their saturating nonlinearities slow computations. The rectified linear unit (ReLU) was introduced (Nair & Hinton, 2010) as a nonlinear activation function; it performs well because the gradients of the ReLU are always zero or one. To facilitate much faster computation during the training process, the ReLU is selected in our study because of its simple derivative function.
To reduce the parameters of subsequent layers and the probability of overfitting, a downsampling operation in the max pooling (Maxpool) layer is applied to the output of the ReLU nonlinear activation function. The output size of the max pooling layer can be calculated from the input size, kernel size, and stride: Output size = (Input size − Kernel size) / Stride + 1.
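As a quick numeric check of these two formulas, consider the small helpers below (plain Python; the function names are ours, not from the paper):

    def conv_output_size(input_size, kernel_size, pad, stride):
        """Convolutional layer: (n + 2p - k) / s + 1."""
        return (input_size + 2 * pad - kernel_size) // stride + 1

    def maxpool_output_size(input_size, kernel_size, stride):
        """Max pooling layer: (n - k) / s + 1."""
        return (input_size - kernel_size) // stride + 1

    # The worked example from Figure 4: 3 = (3 + 2*1 - 3) / 1 + 1
    print(conv_output_size(3, kernel_size=3, pad=1, stride=1))  # 3
    # The 2 x 2, stride-2 max pooling used in the modified DenseNet-121
    print(maxpool_output_size(32, kernel_size=2, stride=2))     # 16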
FIGURE 9 Training and validation losses over iterations: (a) learning rate = 5 × 10⁻¹¹, (b) learning rate = 1 × 10⁻¹⁰, and (c) learning rate = 2 × 10⁻¹⁰ (each panel plots loss against iterations from 0 to 100,000)
Images with multiple damages are scarce, because it is not easy to find a concrete surface that includes multiple damages that can be captured in one image, so all of them are included in the testing set to evaluate the detection ability of the FCN.
3.2 Model initialization
When training the FCN, a strategy of model-based transfer learning (Gao & Mosalam, 2018; Pan & Yang, 2010), rather than training the model from scratch, is adopted to accelerate and optimize the learning efficiency of the model. Following this strategy, the weights and biases of the DenseNet-121 part of the FCN are initialized from a pretrained DenseNet-121. In addition, all the weights of the deconvolutional layers in the FCN are initialized using the "Bilinear" method. The learning rates of the weights and biases are twice the base learning rate and equal to it, respectively, and the learning rate and weight decay of the deconvolutional layers are set to zero to keep the coefficient values of the bilinear interpolation unchanged during training. Moreover, all the bias terms of the deconvolutional layers are fixed at zero and not trained.
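A minimal sketch of the "Bilinear" initialization (NumPy; this is the standard bilinear-kernel construction used in FCN reference implementations, which we assume matches the method named above):

    import numpy as np

    def bilinear_kernel(size):
        """2D bilinear interpolation kernel of shape (size, size).

        Weights decay linearly with distance to the kernel center, so a
        stride-2 deconvolution initialized this way (and kept frozen)
        performs plain bilinear upsampling.
        """
        factor = (size + 1) // 2
        center = factor - 1 if size % 2 == 1 else factor - 0.5
        rows, cols = np.ogrid[:size, :size]
        return ((1 - abs(rows - center) / factor) *
                (1 - abs(cols - center) / factor))

    # A 4 x 4 kernel for a stride-2 (x2) deconvolution layer
    print(bilinear_kernel(4))

Freezing the learning rate and weight decay of these layers at zero, as described above, keeps the deconvolutions acting as fixed bilinear interpolation throughout training.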
3.3 Evaluation metrics of accuracy
Many evaluation criteria have been proposed and are frequently used to assess the accuracy of any type of technique for semantic segmentation. The most popular metrics for semantic segmentation, currently used to measure how per-pixel labeling methods perform on a task, are the pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency weighted intersection over union (FWIoU) (Garcia-Garcia, Orts-Escolano, Oprea, Villena-Martinez, & Garcia-Rodriguez, 2017). For explanatory purposes, the following notation is used: for the k + 1 classes, p_uv is the number of pixels of class u predicted to belong to class v. That is, p_uu represents the number of true positives, whereas p_uv and p_vu are false positives and false negatives, respectively. With this notation, the PA, MPA, MIoU, and FWIoU, respectively, can be formulated as follows:

PA = \frac{\sum_{u=0}^{k} p_{uu}}{\sum_{u=0}^{k} \sum_{v=0}^{k} p_{uv}}    (1)
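A minimal sketch of these metrics computed from a confusion matrix (NumPy; the helper is ours, with MPA, MIoU, and FWIoU following their usual definitions from the survey cited above, since only Equation (1) is reproduced here):

    import numpy as np

    def segmentation_metrics(p):
        """PA, MPA, MIoU, FWIoU from a (k+1) x (k+1) confusion matrix p,
        where p[u, v] counts pixels of class u predicted as class v."""
        tp = np.diag(p)                      # true positives per class, p_uu
        gt = p.sum(axis=1)                   # ground-truth pixels per class
        pred = p.sum(axis=0)                 # predicted pixels per class
        iou = tp / (gt + pred - tp)          # per-class intersection over union

        pa = tp.sum() / p.sum()              # Equation (1)
        mpa = (tp / gt).mean()               # mean per-class pixel accuracy
        miou = iou.mean()                    # mean IoU
        fwiou = (gt / p.sum() * iou).sum()   # frequency-weighted IoU
        return pa, mpa, miou, fwiou

    # Toy three-class example (rows: ground truth, columns: prediction)
    p = np.array([[50., 2., 3.],
                  [4., 40., 1.],
                  [2., 2., 46.]])
    print(segmentation_metrics(p))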
FIGURE 10 (a) PAs, (b) MPAs, (c) MIoUs, and (d) FWIoUs over iterations
Figure 10 shows that the FCN eventually yields lower accuracy when adopting either the smaller or the larger learning rate. As a result, the intermediate learning rate of 1 × 10⁻¹⁰ is chosen as the best learning rate. In our training, we also attempted to use larger learning rates, but this led to losses surging unavoidably, and the trainings were difficult to converge.
After the training process, the trained FCN model with the 1 × 10⁻¹⁰ learning rate achieves the highest PA of 98.61%, MPA of 91.59%, MIoU of 84.53%, and FWIoU of 97.34%. Therefore, the trained model with the 1 × 10⁻¹⁰ learning rate is used in the testing process and for extracting damages from images, which is detailed in Section 5.
5 CALIBRATING EXPERIMENT AND TESTING THE TRAINED AND VALIDATED FCN
To compute the areas of detected damages, a ratio between the number of damage pixels (the pixel area) and the true area of the damage is necessary. Considering that this ratio changes with the distance of the smartphone camera from the surface of the detected objects, the relation between the ratio and the distance is calibrated under laboratory conditions. To examine the performance of the trained and validated FCN from the previous section, extensive tests are conducted, and the damages in the tested images are extracted according to the predicted testing results and the calibrated equation.
5.1 Calibrating the relation between the ratio and the distance
As shown in Figure 11a, to calibrate the relation between the ratio (R) of the pixel area to the true area and the distance (D) from the smartphone to the surface of the detected target, experiments are performed in a quasi-static sense. During the experiments, a target (a black solid circle with a diameter of 50 mm) is set, and the smartphone, fixed on a linear guide, is moved from its initial position (100 mm) to its maximum position (550 mm) in steps of 10 mm and then back to complete one cycle, while the distance between the target and the smartphone is measured using a laser range finder. The experiment is repeated five times, and the experimental results are curve-fitted in Figure 11b, where the unit of R is 1 w pixels/mm² (10,000 pixels/mm²).
FIGURE 11 Calibration experiment for the relation between the ratio and the distance: (a) apparatus, (b) fitted curve R = 2,144.7 × D^(−2.123) with a coefficient of determination of 0.9998 (R in w pixels/mm², D in mm)
In Figure 11b, the fitted equation serves as the calibration equation relating the ratio R to the actual distance D from the target to the smartphone camera. Based on this equation, the true area, rather than the number of pixels, can be computed accordingly.
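A minimal sketch of how the calibration is applied in practice (plain Python; the function names are ours, and the constants come from the fitted curve in Figure 11b):

    def pixel_ratio(distance_mm):
        """Calibrated ratio R in w pixels/mm^2 (1 w = 10,000 pixels)."""
        return 2144.7 * distance_mm ** -2.123

    def true_area_mm2(damage_pixels, distance_mm):
        """Convert a predicted damage pixel count into a physical area.

        Because R is expressed in units of 10,000 pixels/mm^2, the pixel
        count is divided by 10,000 * R to obtain mm^2.
        """
        return damage_pixels / (pixel_ratio(distance_mm) * 10_000)

    # Example: 120,000 damage pixels predicted at a shooting distance of 300 mm
    print(true_area_mm2(120_000, 300.0))  # roughly 1.0e3 mm^2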
5.2 Testing the FCN and extracting damages from images
The remaining 550 images, which are not used in the training and validation processes, are used to test the trained FCN.
FIGURE 12 True areas and prediction areas of testing images using the trained FCN: (a) cracks, (b) spalling, (c) efflorescence, and (d) holes (each panel plots the prediction area against the true area in mm², with the line y = x for reference)
FIGURE 13 Examples of predicted results and extracted cracks: (a) normal surface, (b) rough surface, (c) two cracks, and (d) shadowed image
FIGURE 14 Examples of predicted results and extracted spalling: (a) normal surface, (b) rough surface, (c) surface with crack-like disturbance, and (d) surface with two spalling damages
FIGURE 15 Examples of predicted results and extracted efflorescence: (a) normal surface, (b) rough surface, (c) decentralized efflorescence, and (d) unclear efflorescence
FIGURE 16 Examples of predicted results and extracted holes: (a) normal surface, (b) rough surface, (c) surface with stain, and (d) shadowed image
The testing duration is recorded as 0.05 s for each image. Figure 12 presents the true areas (damage areas in the ground truths) and the prediction areas (damage areas in the prediction results) for the 550 testing images, where the areas are computed according to the recorded distance from the smartphone to the concrete surface and the calibrated equation in Figure 11b (where the ratio of the resized images is also considered). It is observed that the prediction areas PA_FCN essentially agree with the true areas TA in Figures 12a, 12b, and 12d, but, in Figure 12c, the accuracy of detecting efflorescence damages is not as high as expected. One possible reason for this phenomenon is that efflorescence damages are very similar to undamaged concrete surfaces. Especially when the light intensity is very strong, it is difficult to distinguish efflorescence damage from a concrete surface with reflected light. Some of the tested images, which include the four types of concrete damage and are taken under various conditions, are chosen and presented in Figures 13–18, where the prediction results are the testing outputs of the trained and validated FCN.
Figure 13 shows images of concrete cracks under various conditions. Figures 13a, 13b, and 13c show suitable prediction results on images from normal and rough surfaces with one crack and from a surface with two cracks. However, in Figure 13d, a black stain is treated as a crack because it is linked to the crack and its color is similar to that of the crack, although the crack itself is also correctly detected.
Images with concrete spalling are shown in Figure 14. Because of the large area of the concrete spalling damage, the spalling is successfully detected on normal and rough surfaces. Images from surfaces with a crack-like disturbance and with two spalling damages are also used for testing in Figures 14c and 14d, and all the spalling damages are recognized correctly.
Figure 15 presents images with concrete efflorescence, where the efflorescence damage is satisfactorily detected using the trained FCN. In Figures 15a and 15b, the edges of the efflorescence damage are wrongly identified as background, and some efflorescence with very small areas is not detected in Figures 15c and 15d.
Images containing concrete holes under real-world conditions are shown in Figure 16, where the concrete holes are satisfactorily extracted. As mentioned, holes smaller than 1,000 pixels are neglected. As a result, some tiny holes are not detected, because the FCN did not learn the features of tiny holes.
FIGURE 19 True areas and prediction areas of testing images using the trained SegNet: (a) cracks, (b) spalling, (c) efflorescence, and (d) holes (each panel plots the prediction area against the true area in mm², with the line y = x for reference)
Figure 17 shows the predicted results for images with multiple damages; similar to Figure 16, all the damages are satisfactorily detected in Figures 17b and 17c. However, the FCN fails to detect the hazy part of the efflorescence in Figure 17a, and the thin part of a crack is not detected in Figure 17d, where the width of the crack at the thin part is approximately 2 pixels (about 0.2 mm).
Typical examples of incorrectly predicted results and extracted damages are shown in Figure 18. In Figure 18a, a small number of gray-white background pixels are classified as efflorescence damage because the color of the background pixels is very similar to efflorescence. An image with a thin crack is presented in Figure 18b, where the FCN fails to detect the thin crack pixels; the width of the broken parts of the thin crack is approximately 2 pixels. In Figure 18c, some spalling pixels are wrongly classified as a hole. In Figure 18d, the background pixels at the top edge are incorrectly detected as crack pixels; Figure 18d also shows a truly poor prediction result, where the FCN incorrectly predicts small holes as background and a continuous crack is disconnected. These errors can be minimized by developing a bigger training database. Besides, compared with the ground truth, the FCN provides a smoother boundary of the efflorescence damage, which leads to incorrect detection of damage edges.
Despite these minor errors, the results demonstrate the robust performance of our FCN-based method for the detection of multiple damages of concrete structures. The minor errors may be caused by the small training database. Therefore, a larger database with more images of concrete damages under various conditions will be generated to improve the method's capacity and generalization in our future studies.
To apply this trained model in practice, the shooting distance D between the smartphone and the concrete surface must be measured when taking damage images; the measuring process can be easily completed using a tape measure or a laser range finder. Then the ratio R of the pixel area to the true area can be calculated according to the calibrated equation in Figure 11b. After that, the obtained damage images can be input to the trained FCN model to predict the class and location of the damages in the input images. Finally, the damages in the images can be extracted according to the prediction results, and the damage areas are computed using the prediction and the ratio R.
FIGURE 20 Comparison of prediction results for the proposed FCN and the SegNet
6 COMPARATIVE STUDY
To compare the performance of the proposed FCN-based approach with a state-of-the-art semantic segmentation method, the built database, including concrete cracks, spalling, efflorescence, and holes, is used for training the SegNet model (Badrinarayanan, Kendall, & Cipolla, 2017). The SegNet is a deep convolutional encoder–decoder architecture for image segmentation that performs well on road-scene and indoor-scene segmentation tasks. To adapt to the input size of the SegNet, all the images and their labels in the training, validation, and testing sets are resized to a 480 × 360 pixel resolution to generate a new database for SegNet training.
The SegNet is trained with a momentum of 0.9 and a weight decay of 0.0005 for 100,000 iterations. Because of the smaller input size, the batch size is set to four images to accelerate the convergence speed and add to the robustness of the trained model.
To choose the best initial learning rate for the SegNet, three fixed learning rates (0.01, 0.001, and 0.0001) are set. The trained SegNet models are validated and saved every 1,000 iterations. With a GPU boosting the training processes, the recorded training time for the SegNet is approximately 11 hr. In the training process, the highest validation PA, MPA, MIoU, and FWIoU are 98.62%, 98.16%, 83.82%, and 96.84%, recorded with the 0.001 learning rate. A comparison of the best results in the validation processes of the FCN and the SegNet is listed in Table 4. It shows that although the PA and MPA of the FCN are lower than those of the SegNet, the FCN achieves higher MIoU and FWIoU than the SegNet. Therefore, our FCN presents good performance in the concrete damage detection problem.
TABLE 4 Comparison results of the FCN and SegNet

Method    PA (%)    MPA (%)    MIoU (%)    FWIoU (%)
FCN       98.61     91.59      84.53       97.34
SegNet    98.62     98.16      83.82       96.84

Note: Boldface values present the higher PA, MPA, MIoU, and FWIoU of the two methods.
Similar to the FCN testing, the trained SegNet model is tested using the testing set in the new database. The recorded testing duration is 0.05 s for each image, which is shorter than that of the FCN because of the smaller input size. Figure 19 shows the true areas and prediction areas for the 550 testing images using the trained SegNet, where the areas are computed according to the recorded distance from the smartphone to the concrete surface and the calibrated equation in Figure 11b (where the ratio of the resized images is also considered). It can be found that the SegNet provides good predictions for spalling damages, but the prediction areas for crack, efflorescence, and hole damages are generally larger than their true areas.
Figure 20 presents a comparison of the prediction results of the proposed FCN and the SegNet, where PA_SegNet is the prediction area of the SegNet. Figure 20a shows images with crack damage, where the crack is separated into two fragments by the FCN because the thin part of the crack is not detected. The SegNet recognizes the whole crack, but the prediction area is larger than its true area. In Figure 20b, the FCN provides a satisfactory prediction for a crack, but the SegNet incorrectly detects background as spalling and a hole. For the detection of the spalling and efflorescence damages in Figures 20c and 20d, both the FCN and the SegNet provide good predictions. In Figure 20e, a stain on the background is wrongly recognized as spalling damage by the SegNet, but the FCN successfully detects the holes in the image.
From the comparison of the FCN and the SegNet, it can be concluded that the FCN has better performance in detecting concrete damages, because the SegNet predicts larger damage areas than the true areas for crack, efflorescence, and hole damages. In addition, a crucial advantage of the FCN is that it applies DenseNet-121 to extract damage features, which significantly reduces the number of parameters and the size of the trained model. As a result, the small FCN model can be integrated into a smartphone, and a smartphone equipped with our FCN model can be used to detect concrete damages conveniently. This advantage will significantly add to the applicability of the FCN-based method.
7 CONCLUSIONS
A damage detection method based on the FCN is proposed to detect four types of concrete damage: cracks, spalling, efflorescence, and holes. A smartphone was used to collect 1,375 raw images with a 4,032 × 3,016 pixel resolution. For the collected images, the pixel locations of all four damage types and their corresponding labels were specified. The raw images and labeled images were resized to a 504 × 376 pixel resolution to reduce the computation of the training processes. Then, horizontal flipping was performed on the database for data augmentation. After data augmentation, the numbers of images used for training, validation, and testing were 2,000, 200, and 550, respectively, among the labeled images. The database is open source and can be downloaded from https://drive.google.com/open?id=1Odq4jzZyj-urfxC25bvgtLLO1yqbC-89. A strategy of model-based transfer learning was used to initialize the weight and bias parameters of the FCN. To find the best training model of the FCN, the best learning rate was selected via trial and error. Based on the best training, the FCN achieved the highest PA of 98.61%, MPA of 91.59%, MIoU of 84.53%, and FWIoU of 97.34%. The robustness of the trained FCN was tested using the 550 testing images not used for training and validation. In addition, the performance of the trained FCN was compared with that of a SegNet-based method. The comparative study showed that the proposed FCN-based method can provide good detection results for damages and has the significant advantage that the number of parameters of the FCN is smaller than that of the SegNet, which leads to a smaller trained model for the FCN.
The proposed FCN was strong at detecting the concrete damages (cracks, spalling, efflorescence, and holes) and showed low levels of noise. It is a significant advantage that the FCN can learn damage features from a large amount of training data. However, it also means that an FCN-based method requires a large number of images to train a robust FCN model. One common shortcoming of almost all vision-based approaches, including the methods of IPTs, CNNs, and FCNs, is the inability to detect the depth of damages due to the nature of flattened photographic images.
In future studies, more images with more types of concrete damage under various conditions will be provided and added to the existing database to increase the robustness of the proposed method, and comparative studies will also be performed.
ACKNOWLEDGMENT
This work was supported by the National Key Research and Development Programs of China during the Thirteenth Five-Year Plan Period (Grant 2016YFC0802002-03 and Grant 2016YFE0202400).
REFERENCES
Abdel-Qader, I., Abudayyeh, O., & Kelly, M. (2003). Analysis of edge-detection techniques for crack identification in bridges. Journal of Computing in Civil Engineering, 17(4), 255–263.
Amezquita-Sanchez, J., & Adeli, H. (2015). Synchrosqueezed wavelet transform-fractality model for locating, detecting, and quantifying damage in smart highrise building structures. Smart Materials and Structures, 24(6), 065034.
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(12), 2481–2495.
Butcher, J., Day, C., Austin, J., Haycock, P., Verstraeten, D., & Schrauwen, B. (2014). Defect detection in reinforced concrete using random neural architectures. Computer-Aided Civil and Infrastructure Engineering, 29(3), 191–207.
Cha, Y.-J., Choi, W., & Büyüköztürk, O. (2017a). Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering, 32(5), 361–378.
Cha, Y.-J., Choi, W., Suh, G., Mahmoudkhani, S., & Büyüköztürk, O. (2017b). Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Computer-Aided Civil and Infrastructure Engineering, 33(9), 731–747.
Cha, Y.-J., You, K., & Choi, W. (2016). Vision-based detection of loosened bolts using the Hough transform and support vector machines. Automation in Construction, 71, 181–188.
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2015). Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
Chen, P., Shen, H., Lei, C., & Chang, L. (2012). Support-vector-machine-based method for automated steel bridge rust assessment. Automation in Construction, 23(5), 9–19.
Cheng, H., Chen, J., Glazier, C., & Hu, Y. (1999). Novel approach to pavement cracking detection based on fuzzy set theory. Journal of Computing in Civil Engineering, 13(4), 270–280.
Fu, J., Liu, J., Wang, Y., & Lu, H. (2017). Stacked deconvolutional network for semantic segmentation. arXiv preprint arXiv:1708.04943.
Fujita, Y., Mitani, Y., & Hamamoto, Y. (2006). A method for crack detection on a concrete structure. Proceedings of the International Conference on Pattern Recognition, pp. 901–904.
Gao, Y., & Mosalam, K. (2018). Deep transfer learning for image-based structural damage recognition. Computer-Aided Civil and Infrastructure Engineering, 33(9), 748–768.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
Gavilán, M., Balcones, D., Marcos, O., Llorca, D. F., Sotelo, M. A., Parra, I., … Amírola, A. (2011). Adaptive road crack detection system by pavement classification. Sensors, 11(10), 9628–9657.
German, S., Brilakis, I., & DesRoches, R. (2012). Rapid entropy-based detection and properties measurement of concrete spalling with machine vision for post-earthquake safety assessments. Advanced Engineering Informatics, 26(4), 846–858.
Graybeal, B. A., Phares, B. M., Rolander, D. D., Moore, M., & Washer, G. (2002). Visual inspection of highway bridges. Journal of Nondestructive Evaluation, 21(3), 67–83.
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, September 6–12, pp. 346–361.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Huang, Y., & Xu, B. (2006). Automatic inspection of pavement cracking distress. Journal of Electronic Imaging, 15(1), 013017.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Jiang, X., & Adeli, H. (2007). Pseudospectra, MUSIC, and dynamic wavelet neural network for damage detection of highrise buildings. International Journal for Numerical Methods in Engineering, 71(5), 606–629.
Kirschke, K., & Velinsky, S. (1992). Histogram-based approach for automated pavement-crack sensing. Journal of Transportation Engineering, 118(5), 700–710.
Koch, C., & Brilakis, I. (2011). Pothole detection in asphalt pavement images. Advanced Engineering Informatics, 25(3), 507–515.
Koziarski, M., & Cyganek, B. (2017). Image recognition with deep neural networks in presence of noise—Dealing with and taking advantage of distortions. Integrated Computer-Aided Engineering, 24(4), 337–350.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Proceedings of the Neural Information Processing Systems Conference, Stateline, NV, December 3–8.
Lee, B., & Lee, H. (2004). Position-invariant neural network for digital pavement crack analysis. Computer-Aided Civil and Infrastructure Engineering, 19(2), 105–118.
Li, G., Zhao, X., Du, K., Ru, F., & Zhang, Y. (2017). Recognition and evaluation of bridge cracks with modified active contour model and greedy search-based support vector machine. Automation in Construction, 78, Supplement C, 51–61.
Lin, Y., Nie, Z., & Ma, H. (2017). Structural damage detection with automatic feature-extraction through deep learning. Computer-Aided Civil and Infrastructure Engineering, 32(12), 1025–1046.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Moon, H., & Kim, J. (2011). Intelligent crack detecting algorithm on the concrete crack image using neural network. Proceedings of the 28th ISARC, pp. 1461–1467.
Moselhi, O., & Shehab-Eldeen, T. (2000). Classification of defects in sewer pipes using neural networks. Journal of Infrastructure Systems, 6(3), 97–104.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, June 21–24, pp. 807–814.
Ni, F., Zhang, J., & Chen, Z. (2018). Pixel-level crack delineation in images with convolutional feature fusion. Structural Control and Health Monitoring, e2286, 1–18.
Nishikawa, T., Yoshida, J., Sugiyama, T., & Fujino, Y. (2012). Concrete crack detection by multiple sequential image filtering. Computer-Aided Civil and Infrastructure Engineering, 27(1), 29–47.
O'Byrne, M., Schoefs, F., Ghosh, B., & Pakrashi, V. (2013). Texture analysis based damage detection of ageing infrastructural elements. Computer-Aided Civil and Infrastructure Engineering, 28(3), 162–177.
Oliveira, H., & Correia, P. (2009). Automatic road crack segmentation using entropy and image dynamic thresholding. Proceedings of the IEEE Signal Processing Conference, pp. 622–626.
Pan, S., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Rafiei, M., & Adeli, H. (2017). A novel machine learning-based algorithm to detect damage in high-rise building structures. Structural Design of Tall and Special Buildings, 26(18), e1400.
Rafiei, M., & Adeli, H. (2018). A novel unsupervised deep learning model for global and local health condition assessment of structures. Engineering Structures, 156, 598–607.
Rafiei, M., Khushefati, W., Demirboga, R., & Adeli, H. (2017). Supervised deep restricted Boltzmann machine for estimation of concrete compressive strength. ACI Materials Journal, 114(2), 237–244.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(6), 1137–1149.
Rudin, L., Osher, S., & Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4), 259–268.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, May 7–9.
Sinha, S., & Fieguth, P. (2006). Automated detection of cracks in buried concrete pipe images. Automation in Construction, 15(1), 58–72.
Soukup, D., & Huber-Mörk, R. (2014). Convolutional neural networks for steel surface defect detection from photometric stereo images. Proceedings of the International Symposium on Visual Computing, pp. 668–677.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Subirats, P., Dumoulin, J., Legeay, V., & Barba, D. (2006). Automation of pavement surface crack detection using the continuous wavelet transform. Proceedings of the IEEE Image Processing Conference, pp. 3037–3040.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Wang, W., Zhang, A., Wang, K., Braham, A., & Qiu, S. (2018). Pavement crack width measurement based on Laplace's equation for continuity and unambiguity. Computer-Aided Civil and Infrastructure Engineering, 33(12), 110–123.
Wilson, D., & Martinez, T. (2001). The need for small learning rates on large problems. Proceedings of the International Joint Conference on Neural Networks, pp. 115–119.
Xue, Y., & Li, Y. (2018). A fast detection method via region-based fully convolutional neural networks for shield tunnel lining defects. Computer-Aided Civil and Infrastructure Engineering, 33(8), 638–654.
Yamaguchi, T., Nakamura, S., Saegusa, R., & Hashimoto, S. (2008). Image-based crack detection for real concrete surfaces. IEEJ Transactions on Electrical and Electronic Engineering, 3(1), 128–135.
Yang, X., Li, H., Yu, Y., Luo, X., & Huang, T. (2018). Automatic pixel-level crack detection and measurement using fully convolutional network. Computer-Aided Civil and Infrastructure Engineering, 33(12), 1090–1109.
Ying, L., & Salari, E. (2010). Beamlet transform-based technique for pavement crack detection and classification. Computer-Aided Civil and Infrastructure Engineering, 25(8), 572–580.
Zalama, E., Gómez-García-Bermejo, J., Medina, R., & Llamas, J. (2014). Road crack detection using visual features extracted by Gabor filters. Computer-Aided Civil and Infrastructure Engineering, 29(5), 342–358.
Zhang, A., Wang, K., Li, B., Yang, E., Dai, X., Peng, Y., … Chen, C. (2017). Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Computer-Aided Civil and Infrastructure Engineering, 32(10), 805–819.
Zhang, D., Li, Q., Chen, Y., Cao, M., He, L., & Zhang, B. (2016a). An efficient and reliable coarse-to-fine approach for asphalt pavement crack detection. Image and Vision Computing, 57, 130–146.
Zhang, L., Yang, F., Zhang, Y., & Zhu, Y. (2016b). Road crack detection using deep convolutional neural network. Proceedings of the IEEE International Conference on Image Processing, pp. 3708–3712.
Zhao, X., & Li, S. (2017). A method of crack detection based on convolutional neural networks. Proceedings of the 11th International Workshop on Structural Health Monitoring, pp. 978–984.
Ziou, D., & Tabbone, S. (1998). Edge detection techniques—An overview. International Journal of Pattern Recognition and Image Analysis, 8(4), 537–559.
Zou, Q., Cao, Y., Li, Q., Mao, Q., & Wang, S. (2012). CrackTree: Automatic crack detection from pavement images. Pattern Recognition Letters, 33(3), 227–238.

How to cite this article: Li S, Zhao X, Zhou G. Automatic pixel-level multiple damage detection of concrete structure using fully convolutional network. Comput Aided Civ Inf. 2019;34:616–634. https://doi.org/10.1111/mice.12433