Deep Learning For Crack-Like Object Detection (Kaige Zhang, Heng-Da Cheng) 2023
Kaige Zhang
Postdoctoral Research Associate
Department of Computer Science
University of Minnesota – Twin Cities, St Paul, MN, USA
Heng-Da Cheng
Full Professor, Department of Computer Science
Adjunct Full Professor, Department of Electrical Engineering
Utah State University, Logan, UT, USA
A SCIENCE PUBLISHERS BOOK
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Reasonable efforts have been made to publish reliable data and information,
but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish
in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future
reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be
reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work,
access www.copyright.com or contact the Copyright Clearance Center,
Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For
works that are not available on CCC please contact
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered
trademarks and are used only for identification and explanation without
intent to infringe.
Typeset in Palatino
by Radiant Productions
Contents
Preface
1. Introduction
2. Crack Detection with Deep Classification Network
3. Crack Detection with Fully Convolutional Network
4. Crack Detection with Generative Adversarial Learning
5. Self-Supervised Structure Learning for Crack Detection
6. Deep Edge Computing
7. Conclusion and Discussion
References
Index
1
Introduction
2
Crack Detection with Deep Classification Network
2.1 Background
Before deep learning, crack detection was performed using image segmentation methods; however, it is difficult to find a reliable threshold. The segmentation result either contains a lot of noise or misses many cracks. Moreover, noise removal often removes true cracks as well, and crack defragmentation may link noise points together and create undesirable false positives (fake cracks), as illustrated in Fig. 2.1.
Researchers have tried to solve the problem by designing various hand-crafted feature extractors and applying machine learning techniques to train a crack/non-crack classification model. Along this line, Shi et al. (2016) introduced the random structured forest (Dollár and Zitnick, 2013) as the classification model and designed an integral channel feature with three colors, two magnitudes, and eight orientations as the feature extractor. However, such a model can only collect information from a limited scope, and some important global information is missed, even when the statistics at different locations are combined into an integrated feature vector, as in HOG (Dalal and Triggs, 2005). Thus, these methods cannot represent the structural information that is important for discriminating cracks from noisy textures.
Different from traditional pattern classification methods, a DCNN (Deep Convolutional Neural Network) performs feature learning automatically from the data.
Fig. 2.4: Transfer of generic knowledge based on ImageNet data (C1, C2, C3, C4,
and C5 = five convolution layers; F6, F7, and F8 = three fully connected layers;
C = crack, BG = background).
(Deng et al., 2009). As shown in Fig. 2.4, the lock between the two C1 layers indicates the transfer of generic features from convolution kernels, which are fixed during training. The locks between C2, C3, C4, and C5 indicate fine-tuning, and the × between F6, F7, and F8 indicates that no knowledge is transferred. The settings are based on the following considerations:
• The crack patterns are relatively simple and can be represented by the generic knowledge;
• The crack patterns have little similarity with natural objects (e.g., dog, cat); therefore, middle-level and high-level knowledge is less useful for crack detection; and
• Transferring the generic features from the low-level layer makes the network training easier.
In Fig. 2.4, the source network is pre-trained with Caffe (Jia et al., 2014). The model takes an image block as input and produces a probability distribution over 1,000 object classes. The network has five convolutional layers (C1, C2, C3, C4 and C5) followed by three fully connected layers (F6, F7 and F8). The convolutional layers serve as the feature extractor. The fully connected layers serve as a classifier that produces class labels by computing Y6 = σ(W6·Y5 + B6), Y7 = σ(W7·Y6 + B7), and Y8 = φ(W8·Y7 + B8), where Yk denotes the output of the k-th layer, Wk and Bk are the trainable parameters of the k-th layer, σ(X)[i] = max(0, X[i]) is the ReLU activation, and φ(X)[i] = e^X[i] / Σ_j e^X[j] is the softmax function.
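To make the forward computation concrete, the following is a minimal NumPy sketch of this fully connected classifier head; the layer sizes are illustrative placeholders, not the actual dimensions of the source network:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(x):
        e = np.exp(x - x.max())  # subtract the max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    y5 = rng.standard_normal(256)                       # flattened conv features
    W6, B6 = rng.standard_normal((128, 256)), np.zeros(128)
    W7, B7 = rng.standard_normal((64, 128)), np.zeros(64)
    W8, B8 = rng.standard_normal((2, 64)), np.zeros(2)  # two classes: crack / BG

    Y6 = relu(W6 @ y5 + B6)     # Y6 = sigma(W6 Y5 + B6)
    Y7 = relu(W7 @ Y6 + B7)     # Y7 = sigma(W7 Y6 + B7)
    Y8 = softmax(W8 @ Y7 + B8)  # Y8 = phi(W8 Y7 + B8); entries sum to 1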
Fig. 2.5: Feature maps of a sample image. Left: image block of 227×227 pixels. Right: feature maps after the first convolutional layer.
Fig. 2.6: Classification results for different image blocks; squares indicate mask
blocks: (a – d) background; (e – h) crack.
The mask block size is set to wc = w/2 (Fig. 2.6), and it is used to locate the crack region in the mask image. The mask image is defined as a binary image of the same size as the original image, using 1s to represent crack and 0s to represent background. Considering efficiency and accuracy, a two-step pre-selection is executed. In the first step, 400×400-pixel image blocks are sampled sequentially from the original image and input into the DCNN. The step size ds is 100 pixels, with overlap between adjacent samples. If an image block is classified as crack, the related mask block area in the mask image is set to 1. In the second step, 200×200-pixel blocks are sampled with a step size of 50 pixels, focusing only on the crack regions, to obtain more accurate areas with mask sizes of 100×100 pixels. In this way, the mask images are generated and most of the noisy background regions are detected and discarded. The cracks are separated and located in 100×100-pixel mask block areas to assist crack segmentation and extraction. Figure 2.7 shows the results for some example images.
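The two-step pre-selection can be sketched as follows; classify() is a hypothetical helper standing in for the trained DCNN, and marking the central 100×100 mask block in the second step is our reading of the text:

    import numpy as np

    def preselect(image, classify):
        """Two-step pre-selection producing a binary mask image."""
        h, w = image.shape[:2]
        mask = np.zeros((h, w), dtype=np.uint8)
        # Step 1: 400x400 blocks at step size 100; mark coarse crack regions.
        for i in range(0, h - 400 + 1, 100):
            for j in range(0, w - 400 + 1, 100):
                if classify(image[i:i + 400, j:j + 400]):
                    mask[i:i + 400, j:j + 400] = 1
        # Step 2: 200x200 blocks at step size 50, only inside coarse regions;
        # mark the central 100x100 mask block for tighter localization.
        fine = np.zeros_like(mask)
        for i in range(0, h - 200 + 1, 50):
            for j in range(0, w - 200 + 1, 50):
                block = image[i:i + 200, j:j + 200]
                if mask[i:i + 200, j:j + 200].any() and classify(block):
                    fine[i + 50:i + 150, j + 50:j + 150] = 1
        return fine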
Fig. 2.7: Step-by-step results of the crack detection method: (a) original images; (b) results after the first step of pre-selection; (c) mask images after the second step of pre-selection; (d) results of block-wise segmentation; (e) results of curve detection; (f) final results of detected crack curves.
Fig. 2.8: Illustration of the difference between ridge regression and LASSO.
Fig. 2.9: Best lambda selected by cross-validated Mean Square Error (MSE) based on the minimum-plus-one-standard-error rule (left solid line = lowest MSE value; right solid line = minimum plus one standard error, which is also the lambda chosen by LASSO).
Fig. 2.11: Comparison of results using actual industry images: (a) original
pavement images; (b) manually marked GTs; (c) Canny; (d) CrackIT;
(e) CrackForest; (f) proposed method.
2.5 Summary
In this chapter, a deep classification network is introduced to perform pre-selection, which removes most background areas before crack segmentation based on a window-sliding strategy. We have shown that the generic knowledge learned from ImageNet contains rich structural information and can be used to extract crack patterns for crack/non-crack classification. The pre-classification using a deep convolutional network outperformed the hand-crafted feature extraction methods, which also indicates that global information is important for discriminating cracks from complicated background textures. The approach is our work from 2016, and computational efficiency remained an issue at that time.
3
Crack Detection with Fully Convolutional Network
3.1 Background
In Chapter 2, we introduced a deep classification network to perform a pre-selection that removed most noisy background regions to assist crack detection, and it achieved very good results. It is a region-based object detection method in which window sliding is needed when processing large input images. However, window-sliding-based processing is very inefficient when dealing with large images. This problem is a serious issue for deep learning-based object detection, but it was rarely mentioned in the early days; for example, the time cost of processing an industrial pavement image of 2048×4096 pixels is 20 seconds with window sliding. In addition, the method employed complex post-processing to extract a crack curve, which is time-consuming as well. In this chapter, we discuss the reasons behind this inefficiency and propose a one-stage crack detection approach based on the Fully Convolutional Network (FCN).
Fig. 3.1: Dilated convolution with kernels of 3×3: (a) without dilation; (b) 2-dilated
convolution; (c) 3-dilated convolution (Yu and Koltun, 2016).
Fig. 3.2: Overview of the proposed approach: the network at the top is the source net trained with ImageNet; the network in the middle is the crack block classification net trained with transfer learning; and the network at the bottom is the proposed dilation net for crack detection (C = crack; BG = background).
Fig. 3.3: A fully connected layer is a special case of the convolutional layer.
Configuration of the dilation network (columns correspond to successive layers):

Stride             | 4   | 1   | 1 | 1 | 1 | 1 | 1 | 1 | 1
Pool size & stride | 3&2 | 3&1 | - | - | - | - | - | - | -
Dilation           | -   | -   | 2 | 2 | 2 | 2 | 4 | - | -
Fig. 3.4: Sample images and the outputs produced by Naive Detection Net. Left
column: original image; right column: outputs of Naive Detection Net.
Examining the image blocks, it is found that the network does a very good job on the blocks with a crack passing through the center and on the blocks without cracks. However, it becomes ambiguous for the blocks with cracks near the border (see Fig. 3.5). We call this 'localization uncertainty'; if the localization uncertainty can be reduced, the detection performance improves.
In the early days of deep learning, researchers devoted substantial effort to making networks converge, using techniques such as max pooling (Krizhevsky et al., 2012), dropout (Wan et al., 2013), and increased convolution/pooling strides (Krizhevsky et al., 2012). Among them, large strides significantly reduce the memory cost by cutting down the output dimensions; however, they also discard a significant amount of information. Here, dilated convolution is embedded into the classification net to reduce the resolution loss without damaging the forward computing logic of the original classification network. As in Fig. 3.6, the network has two convolutional layers: the first layer convolves the input image with a 3×3 kernel at stride 2, and 'A', 'B', 'C' and 'D' denote the outputs obtained by convolutions centered at 'a', 'b', 'c' and 'd', respectively. Intuitively, with stride 2, the convolution is conducted at locations centered on 'a', 'b', 'c' or 'd', depending on the starting point, and the output is shown in Fig. 3.6. The direct way to eliminate the coverage loss is to reduce the stride to 1; however, this introduces additional outputs that break the output order (see Fig. 3.6). Obviously, we cannot continue the forward propagation with the original kernels and obtain the desired output. However, the interleaved outputs still retain all the information; the additional values are simply the convolution results computed from the other starting points. In the next layer, if we interpolate 0s between each pair of elements of the original kernel (i.e., dilate the kernel) and conduct the convolution at stride 1, the computation remains equivalent to that of the original network, as the sketch below demonstrates numerically.
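The following PyTorch sketch, with arbitrary toy sizes, checks the equivalence: reducing a stride-2 layer to stride 1 and 2-dilating the next kernel reproduces every value of the original coarse output inside the dense output:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    x = torch.randn(1, 1, 16, 16)  # toy input "image"
    w1 = torch.randn(1, 1, 3, 3)   # first-layer kernel
    w2 = torch.randn(1, 1, 3, 3)   # second-layer kernel

    # Original net: stride-2 first layer -> coarse, order-preserving output.
    y1 = F.conv2d(x, w1, stride=2)
    y = F.conv2d(y1, w2)

    # Dense equivalent: reduce the first stride to 1, which interleaves the
    # outputs from the other starting points, and compensate by 2-dilating
    # the second kernel (conceptually, inserting 0s between its elements).
    z1 = F.conv2d(x, w1, stride=1)
    z = F.conv2d(z1, w2, dilation=2)

    # Every value of the coarse output appears in the dense output.
    print(torch.allclose(y, z[:, :, ::2, ::2]))  # True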
(C1, C2, and MP1 are left unchanged because they have relatively small receptive fields, which makes little difference). Finally, it produces an output of 250×500 dimensions. The dilated network will not radically improve the performance of the network, since it is only an equivalent implementation of the classification net. However, the experiments demonstrate that the increased output resolution is important for network convergence. Thus, we conduct an end-to-end refinement under a larger field of view to reduce the localization uncertainty and improve the detection results. It is a fine-tuning strategy that does not retrain the parameters of C1-C5, which have already acquired rich knowledge in classification mode.
First, images of 400×400 pixels are used as the training data, a size determined experimentally. As mentioned, training an end-to-end detection net with a 1-pixel-width crack curve as GT is problematic due to the high risk of mismatch. Instead, we use the k-dilated GT image (a 'disk' structuring element is used for the dilation; see Fig. 3.7). As long as a detected crack pixel falls in the dilated range of the 1-pixel GT, it is treated as a true positive, and the dilation size also indicates the uncertainty level.
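A minimal sketch of producing such dilated GTs with OpenCV, assuming binary masks; the synthetic crack curve and the dilation size k are illustrative:

    import numpy as np
    import cv2

    # Build a synthetic 1-pixel-wide crack GT (illustrative).
    gt = np.zeros((400, 400), dtype=np.uint8)
    cv2.line(gt, (50, 380), (350, 40), color=1, thickness=1)

    # k-dilate it with a disk ('ellipse') structuring element; the band
    # width reflects the uncertainty tolerated around the 1-pixel GT.
    k = 12  # illustrative dilation size
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    dilated_gt = cv2.dilate(gt, kernel)

    # Under this weak supervision, a detected pixel counts as a true
    # positive whenever dilated_gt is 1 at that location.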
Based on these settings, we implement FC2 as a 3×3×1024 convolutional layer and add another layer, FC3, with kernels of size 1×1×3. Fine-tuning is then performed with 400×400-pixel images and their GTs (note that the GTs are resized to 50×50 because there is an 8-times resolution loss from the first two layers). Moreover, only crack and sealed-crack images are involved, because the background samples are taken into account automatically under the FCN (Long et al., 2015). Figure 3.8 shows the detection results for the images in Fig. 3.5. We see that the refining reduces the localization uncertainty.
Fig. 3.7: Dilated GT. (a) An image block of 400×400 pixels; (b) dilated GT.
Fig. 3.8: Detection results of the images in Fig. 3.5. First row: outputs of the refined
network; second row: final results after removing small noisy points.
3.5 Experiments
The data are obtained with a line-scan camera mounted on top of a vehicle running at 100 km/h, scanning a 4.096-meter-wide road and producing an image of 2048×4096 pixels for every 2048 line-scans, in 0.2048 seconds. Four hundred images from industry are used for the experiments. The training-test ratio is 2:1.
Different from most object detection tasks (Everingham et al., 2011), Intersection over Union (IoU) is not suitable for evaluating crack detection algorithms. Cracks occupy only a very small area, the image consists mainly of background pixels (Tsai et al., 2010; Tsai and Chatterjee, 2017), and precise pixel-level GTs are hard to obtain, so it is infeasible to compute an accurate intersection area. As in Fig. 3.9, it is obvious that the detection result is very good; however, the IoU values for (b) and (c) are very low, at 0.13 and 0.2, respectively. Following Tsai and Chatterjee (2017), we use the Hausdorff distance for the overall evaluation of crack localization accuracy. The metric is insensitive to the foreground-background imbalance existing in most long-narrow object detection tasks. We also introduce a region-based precision rate
Fig. 3.9: Pixel-level mismatching: (a) a crack image; (b) detection result overlapped
with the 3-width GT image; and (c) detection result overlapped with 12-width GT
image. The transparent areas represent the n-dilated GT with different n.
(p-rate) and recall rate (r-rate) for the evaluation. As in Fig. 3.10, a crack image of 400×400 pixels is divided into small image patches (50×50 pixels). If a crack is detected in a patch, it is marked as a crack patch ('1'). In the same way, for GT images, if there is a marked crack curve in a patch, it is a crack patch. Then TP_region, FP_region, and FN_region can be obtained by counting the square regions, and further used to calculate the p-rate and r-rate:

P_region = TP_region / (TP_region + FP_region),
R_region = TP_region / (TP_region + FN_region).

Then the region-based F1 score (F1_region) can be computed as:

F1_region = (2 · P_region · R_region) / (P_region + R_region).
Fig. 3.10: Region-based evaluation. (a) Original crack image; (b) an illustration of counting crack and non-crack regions. The squares with label '1' are crack regions, and those with label '0' are background regions.
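A minimal sketch of this region-based evaluation, assuming pred and gt are binary maps of the same size:

    import numpy as np

    def region_scores(pred, gt, patch=50):
        """Region-based p-rate, r-rate, and F1 over patch x patch squares.
        A square counts as a crack region if it contains any crack pixel."""
        h, w = gt.shape
        pred_r, gt_r = [], []
        for i in range(0, h - patch + 1, patch):
            for j in range(0, w - patch + 1, patch):
                pred_r.append(pred[i:i + patch, j:j + patch].any())
                gt_r.append(gt[i:i + patch, j:j + patch].any())
        pred_r, gt_r = np.array(pred_r), np.array(gt_r)
        tp = np.sum(pred_r & gt_r)
        fp = np.sum(pred_r & ~gt_r)
        fn = np.sum(~pred_r & gt_r)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1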
3.6 Summary
In this chapter, we discussed crack detection with an FCN and dilated convolution. The method is much more efficient than window-sliding-based detection. The analysis also provided a brief study of an open problem: what is the difference between a DCNN-based classification network and a detection/segmentation network? It indicates that a detection or segmentation network is a classification network working on a larger field of view, with the receptive field as the input size (not considering the up-sampling/decoding layers).
4
Crack Detection with Generative Adversarial Learning
4.1 Background
In Chapter 3, we discussed an end-to-end crack detection network and developed an equivalent dense dilation layer for transfer learning, which improved the crack localization accuracy. In practice, we found the training unstable, and the image size and dilation scale of the GTs used for training were determined by trial and error. It was found that the network often 'converges' to a state that treats all pixels as background (BG), which can still achieve a very good loss. We named this situation the 'All Black' phenomenon. In this chapter, we propose a solution by introducing an asymmetric U-shape network and Crack-Patch-Only (CPO) supervised generative adversarial learning.
Fig. 4.1: The generative networks are trained to minimize the distribution difference between the real data and that from the generator (figure from Goodfellow, 2014).
Fig. 4.2: Illustration of deep domain adaptation (figure from Ganin, 2014).
The solution has three components: (1) CPO supervision; (2) weak supervision with dilated GTs; and (3) an asymmetric U-Net design. The CPO supervision and the weak supervision with dilated GTs provide the final objective function: Lfinal = Ladv + Lpixel, where Ladv is the loss generated by the CPO adversarial learning and Lpixel is the pixel-level loss obtained using the dilated GT images. Ladv makes the network generate crack images, and Lpixel directs the network to generate cracks at the expected locations. The CPO supervision relies on the asymmetric U-Net.
As discussed in Chapter 3, directly training an FCN (Long, 2015), including U-Net (Ronneberger, 2015), for end-to-end per-pixel crack detection cannot achieve the expected results. The reason is the data imbalance and the unavailability of precise GTs. When conducting end-to-end training with an FCN, the network will simply classify all the pixels as background and achieve quite 'good' accuracy (because background pixels dominate the whole image). We named this phenomenon the 'All Black' problem. Figure 4.4 shows that during training the loss decreased rapidly and reached a very low value; however, the detection result for a crack image was all black, as in Fig. 4.5. In order to solve this problem, we added a generative adversarial loss to the training objective. The new loss forces the network to always generate crack-like detection results, which overcomes the 'All Black' problem. DC-GAN (Radford, 2016) is employed. It is well known that DC-GAN can generate realistic images from random noise by conducting the training as a min-max two-player game. Instead of directly generating images from random noise, the DC-GAN is pre-trained with augmented GT images based on CPO supervision (Fig. 4.6).
Fig. 4.4: The loss and accuracy curves under the regular U-Net supervised by
labor-light (n-width) GTs.
Fig. 4.5: 'All Black' problem encountered when using an FCN-based method for pixel-level crack detection on low-resolution images: (a) low-resolution image captured in an industry setting; (b) n-width ground truth image; and (c) detection result of the 'well-trained' U-Net (refer to Fig. 4.4).
Fig. 4.6: Pre-train a rich pattern DC-GAN with augmented GT images based on
CPO-supervision: the real GT dataset is augmented with manually marked ‘crack’
curve images to ensure the diversity of the crack patterns.
Fig. 4.7: Asymmetric U-Net under larger receptive field with CPO-supervision.
including both crack and non-crack samples. Indeed, in a multi-layer fully convolutional network, each neuron has a receptive field of some specific size. Since a convolutional layer is insensitive to the input size, operating the network under a larger receptive field realizes a multi-spot image sampling, with the sample size equal to the receptive field of the neuron. Thus, when performing an image translation using a deep convolutional neural network with a larger input image, the process is equivalent to translating multiple smaller image samples at the same time. Based on this analysis, a crack image larger than the input size of the discriminator is input into the asymmetric U-Net. It is then passed through the network to produce a down-sampled output.
Fig. 4.8: With a larger input image, the CNN realizes multi-spot sampling with the same receptive field. On the right, the first three neurons represent three crack samples, while the last two are background neurons.
Fig. 4.9: Detection results of the final asymmetric U-Net. First column: low-
resolution pavement image blocks; second column: the outputs of the network;
third column: the final results after removing the isolated noise areas.
By combining the adversarial loss and the pixel loss with dilated GTs, the network can generate cracks at the expected locations. Specifically, we used 12-pixel-width crack GTs to provide the weak supervision. The dilated GTs were obtained by dilating 1-pixel-width crack GTs (a 'disk' structuring element was used for the dilation), and the weakly supervised pixel-level loss is L_pixel = −E_{x∈I, y}[||y − G(x)||₁], where x is the input crack image patch, y is the dilated GT image, I is the dataset of larger-size crack patches, G is the generator, and D is the discriminator. Overall, the final objective is L = Ladv + λLpixel. The pixel loss is normalized during training, and λ can be determined via grid search. Once the training is finished, the discriminator is no longer needed, and the generator itself can be used as the detection network.
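A sketch of the combined generator objective under these assumptions; generator, discriminator, and the weight lam are hypothetical names, the discriminator is assumed to output logits, and the losses are written in minimization form:

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()
    l1 = nn.L1Loss()

    def generator_loss(generator, discriminator, x, dilated_gt, lam=100.0):
        """x: larger-size crack patch; dilated_gt: its dilated (weak) GT."""
        fake = generator(x)  # GT-like output image
        # Adversarial term: push the discriminator to rate the output as
        # real crack (CPO supervision keeps 'real' = crack patches only).
        d_out = discriminator(fake)
        l_adv = bce(d_out, torch.ones_like(d_out))
        # Weakly supervised pixel term against the dilated GT.
        l_pixel = l1(fake, dilated_gt)
        return l_adv + lam * l_pixel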
Notice that the network is trained with small image blocks. However, in industry settings the image size is much larger, e.g., pavement images of 2048×4096 pixels. A common solution for processing a large image is to sample smaller image blocks from the full-size image and process them patch by patch (Cha, 2017), which is inefficient (Zhang, 2018). Since the asymmetric U-Net is designed as a fully convolutional network and the crack patterns are scale insensitive, it can work on arbitrarily large images.
Figure 4.7 illustrates the architecture of the asymmetric U-Net. The convolutional kernel of the first layer is 7×7 with stride 2, followed by a rectified linear unit (ReLU) (Nair, 2010); then there is a 3×3 convolutional layer with stride 2, also followed by a ReLU layer. These layers form the asymmetric part of the U-Net, which realizes a four-times down-sampling and reduces the input image to the same size as the output of the asymmetric U-Net. The remaining layers of the encoding and decoding parts are similar to the regular U-Net (Ronneberger, 2015). After the last de-convolutional layer, another regular convolutional layer followed by a Tanh activation layer (LeCun, 2015) is used to translate the 64-channel feature maps into a 1-channel image, which is compared with the dilated GT image to compute the L1 loss. In summary, the encoding part of the network architecture is as follows (C_c_k_s denotes a convolutional layer with c output channels, a k×k kernel, and stride s):
C_64_7_2 - ReLU - C_64_3_2 - ReLU - C_128_3_1 - ReLU -
C_128_3_2 - ReLU - C_256_3_1 - ReLU - C_256_3_2 - ReLU -
C_512_3_1 - ReLU - C_512_3_2 - ReLU - C_512_3_1 - ReLU -
C_512_3_2 - ReLU
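Under the C_c_k_s reading above, the encoder can be sketched in PyTorch as follows; the single-channel input and the 'same' padding (k // 2) are assumptions not stated in the text:

    import torch.nn as nn

    def conv_block(in_c, out_c, k, s):
        return nn.Sequential(
            nn.Conv2d(in_c, out_c, kernel_size=k, stride=s, padding=k // 2),
            nn.ReLU(inplace=True),
        )

    # The first two blocks are the asymmetric part (4x down-sampling);
    # the decoder and skip connections of the U-Net are omitted here.
    encoder = nn.Sequential(
        conv_block(1, 64, 7, 2),     # C_64_7_2
        conv_block(64, 64, 3, 2),    # C_64_3_2
        conv_block(64, 128, 3, 1),   # C_128_3_1
        conv_block(128, 128, 3, 2),  # C_128_3_2
        conv_block(128, 256, 3, 1),  # C_256_3_1
        conv_block(256, 256, 3, 2),  # C_256_3_2
        conv_block(256, 512, 3, 1),  # C_512_3_1
        conv_block(512, 512, 3, 2),  # C_512_3_2
        conv_block(512, 512, 3, 1),  # C_512_3_1
        conv_block(512, 512, 3, 2),  # C_512_3_2
    )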
Fig. 4.10: Weakly supervised learning is able to learn crack pattern information. Left: image blocks sent to the classification network; middle: feature maps after the first convolutional layer; right: feature maps with crack patterns similar to those of the original image blocks.
4.4 Experiments
We make comparisons with six crack detection methods on two crack datasets: CFD (Shi, 2016), captured with a cellphone, and a dataset collected from industry using a line-scan camera. CFD is a public dataset with hand-labeled GTs. It has 118 pavement crack images of 320×480 pixels. The images were captured by people standing on the road and holding an iPhone near the road surface, and the pixel-level GTs are manually marked. As mentioned in the last chapter, the industrial crack images often contain thin cracks whose GTs are difficult to mark precisely at the pixel level; their GTs are marked using 1-pixel curves. Such GTs may not match the true crack locations at the pixel level, and processing such images is more challenging. As before, region-based p-rate, r-rate, and HD-score are used as the evaluation metrics. CrackIT-v1 (Oliveira, 2014), MFCD (Li, 2018),
Fig. 4.13: Detection results on concrete wall and concrete pavement images.
First row: original concrete wall images. Second row: corresponding results of
CrackGAN. Third row: original concrete pavement images. Fourth row: the
corresponding results of CrackGAN.
4.5 Summary
In this chapter, we discussed a deep generative adversarial network for crack detection. The method solves an important issue, the 'All Black' problem, existing in deep learning-based pixel-level crack detection. An asymmetric U-Net architecture is proposed, and a CPO-supervised generative adversarial learning strategy is used to generate the expected crack patterns while implicitly enabling both crack and background image translation. The work solved an important problem: data imbalance in object detection.
5
Self-Supervised Structure Learning for Crack Detection
5.1 Background
In Chapters 1 through 4, we discussed crack detection with supervised deep learning. These methods, as discussed, rely on a reasonable amount of paired training data obtained by manually marking non-trivial GTs, which is expensive, especially for pixel-level annotations. Since such a model cannot handle crack detection with different backgrounds, the actual workload is quite heavy: we need to re-mark the GTs and re-train the model for each new detection task. Thus, it is worthwhile to rethink automatic crack detection by taking the labor cost into account. As a preliminary exploration of future artificial intelligence object detection systems, in this chapter we discuss a self-supervised structure learning network that can be trained without paired data. The target is achieved by simultaneously training an additional reverse network to translate the output back to the input. First, a labor-free structure library is prepared and set as the target domain for structure learning. Then a dual network is built with two GANs: one is trained to translate a crack image patch X to a structural patch Y, and the other is trained to translate Y back to X, simultaneously. The experiments demonstrate that, with these settings, the network can be trained to translate a crack image into a GT-like image with similar structure patterns for use in crack detection. The proposed approach is validated on four crack datasets and achieves performance comparable to that of state-of-the-art supervised approaches.
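A minimal sketch of one direction of this dual-GAN objective; G_xy, G_yx, and D_y are hypothetical module names, and the reverse direction (Y to X to Y) is built symmetrically with a discriminator D_x:

    import torch
    import torch.nn as nn

    l1 = nn.L1Loss()
    bce = nn.BCEWithLogitsLoss()

    def forward_cycle_loss(G_xy, G_yx, D_y, x, lam_cyc=10.0):
        """x: a real crack image patch."""
        fake_y = G_xy(x)      # crack patch -> GT-like structure patch
        rec_x = G_yx(fake_y)  # translate back to the crack domain
        d_out = D_y(fake_y)
        adv = bce(d_out, torch.ones_like(d_out))  # fool the discriminator
        cyc = l1(rec_x, x)                        # cycle-consistency term
        return adv + lam_cyc * cyc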
Fig. 5.1: Architectures of different GANs: (a) Original GAN; (b) DC-GAN;
(c) image-to-image translation GAN.
Fig. 5.6: Knowledge learned from a classification task has strong transferability and can be used for parameter initialization to ease the training on different tasks.
It is a common belief that the generative loss will tend to make the generated images have the same distribution as the structure library. That is not true; it gives the generator false information for updating the parameters and increases the possibility of failure. This could be the reason for the failure.
After initialization, the parameters of the generators and the discriminator in the reverse GAN are updated alternately. The Adam optimizer (Kingma, 2014) is used with a learning rate of 0.0002 and a momentum parameter of 0.9. Since the two cycles are trained simultaneously and the cycle-consistency restriction is applied to both the forward and reverse GANs, the training samples include both real crack patches and generated crack patches, as well as the structure patches from the library and the generated structure patches. Thus, an additional buffer is needed to store the generated images during training; 50 previously generated images are stored (a sketch of such a buffer is given below). The batch size is 6 due to memory limitations. A total of 100 epochs are run, and the model parameters are saved every five epochs. In this way, the networks can be trained to generate similar structure patterns according to the input cracks. The forward GAN is trained to translate the original crack patches to structured patches, and the reverse GAN is trained to translate the structured patches back to crack images. Refer to Fig. 5.7 for the results of this translation on the test set. In the experiment, the validation results are saved every twenty iterations to monitor the training. Finally, all the saved models are tested to select the best one as the detection network.
Fig. 5.7: Results from the image-to-image translation network. Left column: the test images from references (Yang, 2018; Zou, 2012); right column: the corresponding translated, GT-like images.
After the training, the reverse GAN and the discriminator of the forward GAN are no longer used; the generator of the forward GAN alone serves as the detection network, translating a pavement image into a GT-like image for crack detection.
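The 50-image history buffer mentioned above can be sketched as follows; the 50/50 swap policy follows common CycleGAN practice and is an assumption:

    import random

    class ImageBuffer:
        """Stores up to `capacity` previously generated images so the
        discriminators also see older fakes, stabilizing the training."""
        def __init__(self, capacity=50):
            self.capacity = capacity
            self.images = []

        def query(self, image):
            # Until the buffer is full, store the new image and return it.
            if len(self.images) < self.capacity:
                self.images.append(image)
                return image
            # Otherwise, half the time swap it for a stored (older) image.
            if random.random() < 0.5:
                idx = random.randrange(self.capacity)
                old, self.images[idx] = self.images[idx], image
                return old
            return image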
5.4 Experiments
The experiments are conducted on an HP workstation with an Intel i7 CPU and 32 GB memory; an Nvidia 1080Ti GPU was configured, and PyTorch was used for training and testing. The proposed method is compared with seven crack detection methods on two representative datasets: the public dataset CFD (Shi, 2016), captured with a cellphone, and the data from industry.
Fig. 5.8: Comparison of detection results on CFD data. From top to bottom:
original image, GTs, and the detection results of CrackIT, MFCD, CrackForest,
CrackGAN and the proposed method, respectively.
Fig. 5.9: Comparison of detection results on industrial data. From top to bottom:
original image, GTs, and the detection results of CrackIT, MFCD, CrackForest,
CrackGAN and the proposed method, respectively.
the pixel-level GTs are given. As shown in Fig. 5.9, CrackIT shows results similar to those on the CFD data, and most cracks are missed. MFCD detects most cracks but also introduces a lot of noise due to the complicated pavement texture. CrackGAN achieved the best results because of its supervised end-to-end training. The proposed method, trained with the assistance of the structure library, achieved results comparable to CrackGAN without labor-expensive GTs.
Fig. 5.10: Testing results on the training set with and without the cycle-consistency loss. Top row: generated images with the proposed setting; bottom row: generated images without the cycle-consistency loss.
Fig. 5.11: Experiments with and without a deeper discriminator. Top: results with the proposed setting; bottom: results with the original cycle-GAN.
5.6 Summary
In this chapter, we discussed a self-supervised structure learning approach for crack detection that does not rely on manually marked GTs. The method has the potential to realize truly automatic crack detection, with no need to manually mark GTs for training. We formulate crack detection as an unsupervised structure learning problem by introducing a labor-free structure library to assist the training of a cycle-GAN. A discriminator with a larger field of view is used; since it treats only crack patches as real, the data imbalance problem is overcome. Moreover, we combine domain adaptation and generative adversarial networks to handle object detection, a general approach that can be applied to other computer vision problems.
6
Deep Edge Computing
6.1 Background
We have discussed algorithms that solve the crack detection problem with deep neural networks. For the training and deployment of deep neural networks, GPUs and TPUs have been the primary hardware because of their strong computational ability. However, most applications require deploying the system on the edge with limited computational resources, e.g., on mobile phones and embedded systems. In this chapter, we discuss software technologies of deep edge computing that facilitate the deployment of deep learning models, including parameter pruning, knowledge distillation, and model quantization for constructing lightweight networks.
Fig. 6.2: Schematic diagram of pruning channel 1. B represents the feature map that needs to be pruned; A represents the feature map preceding B; W is the filter that needs to be pruned.
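Since the pruning discussion here centers on removing filters/channels (Fig. 6.2), the following is a minimal sketch of L1-norm filter pruning in the spirit of Li et al. (2016); keep_ratio is a hypothetical parameter, and note that pruning a layer's output channels also requires shrinking the next layer's input channels, which is omitted here:

    import torch
    import torch.nn as nn

    def prune_conv_filters(conv, keep_ratio=0.5):
        """Return a thinner Conv2d keeping the filters with largest L1 norm."""
        n_keep = max(1, int(conv.out_channels * keep_ratio))
        l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # L1 norm per filter
        keep = torch.topk(l1, n_keep).indices.sort().values
        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           conv.stride, conv.padding,
                           bias=conv.bias is not None)
        pruned.weight.data = conv.weight.data[keep].clone()
        if conv.bias is not None:
            pruned.bias.data = conv.bias.data[keep].clone()
        return pruned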
L_HT(W_G, W_r) = (1/2) ||u_h(x; W_H) − r(v_g(x; W_G); W_r)||²   (6.1)

where u_h refers to the activation value when the input of the teacher network is x and its parameters are W_H, and v_g refers to the activation value of the hidden layer when the input of the student network is x and its parameters are W_G. Since the sizes of the guided layer and the hint layer are not necessarily the same, it is necessary to temporarily add a fully connected or convolutional layer after the guided layer to make their sizes consistent; r is the output of this added layer with parameters W_r, as shown in Fig. 6.3.
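A sketch of this hint-training loss; teacher_feat and student_feat stand for the intermediate feature maps, and the 1×1 convolution plays the role of r(.; W_r) with hypothetical channel sizes:

    import torch
    import torch.nn as nn

    # 64 and 128 are hypothetical channel counts for student and teacher.
    regressor = nn.Conv2d(64, 128, kernel_size=1)  # r(.; W_r)

    def hint_loss(teacher_feat, student_feat):
        # L_HT = 1/2 * || u_h(x; W_H) - r(v_g(x; W_G); W_r) ||^2
        return 0.5 * (teacher_feat - regressor(student_feat)).pow(2).sum()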
model is G_i^s (i = 1, 2, ..., n), then the loss function based on the relationship is:

L_FSP(W_s, W_t) = (1/N) Σ_x Σ_{i=1}^{n} λ_i × ||G_i^s(x; W_s) − G_i^t(x; W_t)||₂²   (6.3)
where λ_i is a user-defined weight. The training process is to train the teacher model, calculate the FSP matrices of the student and teacher models according to formula (6.3), optimize the student network according to formula (6.4), and finally fine-tune the parameters of the student network as a whole; a sketch of the FSP loss follows.
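A sketch of the FSP matrices and the loss of formula (6.3), assuming each pair consists of front and back feature maps with matching spatial sizes:

    import torch

    def fsp_matrix(front, back):
        """FSP matrix of two feature maps with equal spatial size:
        front (B, C1, H, W), back (B, C2, H, W) -> (B, C1, C2)."""
        b, c1, h, w = front.shape
        c2 = back.shape[1]
        f = front.reshape(b, c1, h * w)
        g = back.reshape(b, c2, h * w)
        return torch.bmm(f, g.transpose(1, 2)) / (h * w)

    def fsp_loss(student_pairs, teacher_pairs, lambdas):
        """Weighted squared distance between student and teacher FSP
        matrices, averaged over the batch (the 1/N term)."""
        batch = student_pairs[0][0].shape[0]
        total = 0.0
        for (sf, sb), (tf, tb), lam in zip(student_pairs, teacher_pairs,
                                           lambdas):
            diff = fsp_matrix(sf, sb) - fsp_matrix(tf, tb)
            total = total + lam * diff.pow(2).sum()
        return total / batch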
In addition to using FSP to represent the relationship between feature layers, Liu et al. (2017) used an instance relationship graph to represent instance features, the relationships between instance features, and the transformation of the feature space between layers. Chen et al. (2020) proposed a knowledge distillation method based on manifold learning. Passalis and Tefas (2018) used probability distributions to represent the relationships between different layers.
6.5 Experiments
This section analyzes the model size, computational efficiency, and performance of deep neural networks, and then tests and analyzes methods that improve computational efficiency through model design and optimization. First, the performance of different pruning methods is tested on the image classification datasets MNIST, CIFAR, and ImageNet. VGG-16 is used as the initial network for pruning, and the average precision, speed, and parameter reduction of the different methods are reported.
As shown in Table 6.1, each model pruning method achieves a significant parameter reduction, and the parameter amount of
Method                        | Teacher network | Precision, # of parameters (%, million) | Student network | Precision, # of parameters (%, million)
FT (Kim et al., 2018)         | ResNet56        | 93.61 (0.85)                            | ResNet20        | 92.22 (0.27)
Rocket-KD (Zhou et al., 2018) | WRN-40-1        | 93.42 (0.6)                             | WRN-16-1        | 91.23 (0.18)
DML (Zhang et al., 2019)      | WRN-28-10       | 95.01 (36.5)                            | ResNet32        | 92.47 (0.46)
SP (Tung et al., 2019)        | MobileNetV2     | 64.65 (4.4)                             | MobileNetV2     | 63.38 (2.2)
6.6 Summary
Intelligent edge computing builds lightweight neural network models through model compression. Knowledge distillation is an effective method for training such lightweight models. Parameter quantization can further increase computational speed and reduce inference time. The model compression technologies discussed in this chapter are based
References
Cheng, H.D., and M. Miyojim. (1998). Novel system for automatic pavement
distress detection. J. Comput. Civil Eng., 12(3): 145–152.
Cheng, H.D., J. Chen, C. Glazier, and Y.G. Hu. (1999a). Novel approach to pavement
crack detection based on fuzzy set theory. J. Comput. Civil Eng., 13(4): 270–280.
Cheng, H.D., J. Wang, Y. Hu et al. (1999b). Novel approach to pavement cracking
detection based on neural network. Transp. Res. Rec., 1764: 119–127.
Cheng, H.D., J. Wang, Y. Hu, C. Glazier, X. Shi and X. Chen. (2001). Novel approach
to pavement cracking detection based on neural network. Transp. Res. Rec.,
1764: 119–127.
Chmiel, B., R. Banner, and G. Shomron et al. (2020). Robust quantization: one
model to rule them all. The Proceedings of Advances in Neural Information
Processing Systems, 33: 5308–5317.
Cho, J.H., and B. Hariharan. (2019). On the efficacy of knowledge distillation. The
Proceedings of the IEEE/CVF International Conference on Computer Vision.
Dai, J., Y. Li, and K. He et al. (2016). R-FCN: Object detection via region-based fully
convolutional networks. arXiv preprint arXiv:1605.06409.
Dalal, N., and B. Triggs. (2005). Histograms of oriented gradients for human
detection. The Proceedings of IEEE CVPR.
Deng, J., W. Dong, R. Socher et al. (2009). Imagenet: A large-scale hierarchical
image database. The Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition.
Doersch, C., A. Gupta, and A.A. Efros. (2015). Unsupervised visual representation
learning by context prediction. The Proceedings of IEEE ICCV.
Doersch, C., and A. Zisserman. (2017). Multi-task self-supervised visual learning.
arXiv preprint arXiv:1708.07860.
Dollár, P., and C.L. Zitnick. (2013). Structured forests for fast edge detection. The
Proceedings of IEEE ICCV.
Everingham, M., L.V. Gool, C.K.I. William et al. (2011). The PASCAL Visual Object
Classes Challenge 2011 Results. http://www.pascalnetwork.org/challenges/VOC/
voc2011/workshop/index.html.
F.H.A. (2006). Pavement distress identification manual, NPS Road Inventory
Program.
Fan, A., P. Stock, B. Graham et al. (2020). Training with quantization noise for
extreme model compression. arXiv preprint arXiv:2004.07320.
Friedman, J., T. Hastie, and R. Tibshirani. (2010). Regularization paths for
generalized linear models via coordinate descent. J. Stat. Software, 33(1): 1–22.
Fukui, H., T. Hirakawa, T. Yamashita et al. (2019). Attention branch network:
Learning of attention mechanism for visual explanation. The Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition.
Ganin, Y., and V. Lempitsky. (2014). Unsupervised domain adaptation by
backpropagation. arXiv preprint arXiv:1409.7495.
Gavilan, M., D. Balcones, D.F. Llorca et al. (2011). Adaptive road crack detection
system by pavement classification. Sensors, 11(10): 9628–9657.
Girshick, R. (2015). Fast R-CNN. The proceedings of IEEE ICCV.
Girshick, R., J. Donahue, T. Darrell et al. (2014). Rich feature hierarchies for accurate
object detection and semantic segmentation. The Proceedings of IEEE CVPR.
Gonzalez, R.C., R.E. Woods, and S.L. Steven. (2009). Digital Image Processing Using
Matlab, Addison-Wesley.
Huang, Y., and B. Xu. (2006). Automatic inspection of pavement cracking distress.
Journal of Electronic Imaging, 15(1): 17–27.
Isola, P., J.Y. Zhu, T. Zhou et al. (2017). Image-to-image translation with conditional
adversarial networks. The Proceedings of IEEE CVPR.
Jacob, B., S. Kligys, B. Chen et al. (2018). Quantization and training of neural
networks for efficient integer-arithmetic-only inference. The Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018: 2704–2713.
Jarvis, R.A. (1983). A perspective on range finding techniques for computer vision.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2): 122–139.
Jia, Y., E. Shelhamer, J. Donahue, S. Karayev et al. (2014). Caffe: Convolutional
architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Jin, X., B. Peng, Y. Wu et al. (2019). Knowledge distillation via route constrained
optimization. The Proceedings of the IEEE CVPR, 2019: 1345–1354.
Joachims, T. (1998). Text categorization with support vector machines: Learning
with many relevant features. The Proceedings of European Conference on Machine
Learning.
Kaelbling, L.P., M.L. Littman, and A.W. Moore. (1996). Reinforcement learning: A
survey. Journal of Artificial Intelligence Research, 4: 237–285.
Kaul, V., A. Yezzi, and Y.C. Tai. (2010). Quantitative performance evaluation
algorithms for pavement distress segmentation. Transp. Res. Rec., 2153: 106–
113.
Kim, J., S.U. Park and N. Kwak. (2018). Paraphrasing complex network: Network
compression via factor transfer. The Proceedings of NIPS.
Kingma, D.P., and J. Ba. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
Kirschke, K.R., and S.A. Velinsky. (1992). Histogram-based approach for automated
pavement crack sensing. Journal of Transportation Engineering, 118(5): 700–710.
Kontschieder, P., S.R. Bulo, and M. Pelillo. (2011). Structured class-labels in random
forest for semantic image labeling. The Proceedings of IEEE ICCV, 2190–2197.
Koutsopoulos, H.N., I.E. Sanhouri, and A.B. Downey. (1993). Analysis of
segmentation algorithms for pavement distress images. Journal of
Transportation Engineering, 119(6): 868–888.
Krizhevsky, A., I. Sutskever, and G.E. Hinton. (2012). ImageNet classification with
deep convolutional neural network. The Proceedings of NIPS.
Lad, P., and M. Pawar. (2016). Evaluation of railway track crack detection system.
The Proceedings of the IEEE ROMA.
Laurent, C., C. Ballas, T. George et al. (2020). Revisiting loss modelling for
unstructured pruning. arXiv preprint arXiv:2006.12279.
Lecun, Y., B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard and L.D.
Jackel. (1989). Backpropagation applied to handwritten zip code recognition.
Neural Computation, 1(4): 541–551.
LeCun, Y., Y. Bengio, and G. Hinton (2015). Deep learning. Nature, 521: 436–444.
Li, H., D. Song, Y. Liu, and B. Li. (2018). Automatic pavement crack detection
by multi-scale image fusion, IEEE Trans. Intell. Transp. Syst. Doi: 10.1109/
TITS.2018.2856928.
Li, H., A. Kadav, I. Durdanovic et al. (2016). Pruning filters for efficient convnets.
arXiv preprint arXiv:1608.08710.
Oliveira, H., and P.L. Correia. (2009). Automatic road crack segmentation using
entropy and dynamic thresholding. The Proceedings of 17th European Signal
Processing Conf., 2009: 622–626.
Oliveira, H., and P.L. Correia. (2014). CrackIT—An image processing toolbox for
crack detection and characterization. The Proceedings of IEEE ICIP.
Oquab, M., L. Bottou, I. Laptev et al. (2014). Learning and transferring mid-level
image representations. The Proceedings of IEEE CVPR.
Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern., 9(1): 62–66.
Pan, S.J., and Q. Yang. (2010). A survey on transfer learning. IEEE Trans. on Knowl.
Data Eng., 22(10): 1345–1359.
Passalis, N., and A. Tefas. (2018). Learning deep representations with probabilistic
knowledge transfer. The Proceedings of the European Conference on Computer
Vision, 2018: 268–284.
Petrou, M., and J. Kittler. (1996). Automatic surface crack detection on textured
materials. J. Mater. Process. Technol., 56(4): 158–167.
Polino, A., Pascanu, R., and Alistarh, D. (2018). Model compression via distillation
and quantization. arXiv preprint arXiv:1802.05668.
Power, D. (2011). Evaluation: from precision, recall and F-measure to ROC,
Informedness, markedness and correlation. Journal of Machine Learning
Technologies, 2(1): 37–63.
Radford, A., L. Metz, and S. Chintala. (2016). Unsupervised representation learning
with deep convolutional generative adversarial networks. The Proceedings of
ICLR.
Rastegari, M., Ordonez, V., Redmon, J. et al. (2016). Xnor-net: Imagenet
classification using binary convolutional neural networks. The Proceedings of
ECCV, Springer, Cham, 2016: 525–542.
Redmon, J., S. Divvala, R. Girshick et al. (2015). You only look once: Unified, real-
time object detection. arXiv preprint arXiv:1506.02640.
Redmon, J., and A. Farhadi. (2017). YOLO9000: Better, faster, stronger. The
Proceedings of IEEE CVPR.
Ren, S., K. He, R. Girshick et al. (2015). Faster R-CNN: Towards real-time object
detection with region proposal networks. arXiv preprint arXiv:1506.01497.
Romero, A., N. Ballas, S.E. Kahou et al. (2015). Fitnets: Hints for thin deep nets. The
Proceedings of ICLR.
Ronneberger, O., P. Fischer, and T. Brox. (2015). U-Net: Convolutional networks
for biomedical image segmentation. The Proceedings of in Proc. Med. Image
Comput. Comput.-assist. Intervention.
Rosenfield, A., and R.C. Smith. (1979). Thresholding using relaxation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 3(5): 598–606.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. (1986). Learning representations
by back-propagating errors, Nature, 323(6088): 533–536.
Schraudolph, N.N. (2002). Fast curvature matrix-vector products for second-order
gradient descent. Neural Computation, 14(7): 1723–1738.
Wang, K., Li Q., and W. Gong. (2000). Wavelet-based pavement distress image
edge detection with a trous algorithm. Transp. Res. Rec., 6(2024): 24–32.
Wang, W. (2015). Protocol based pavement cracking measurement with 1 mm 3D
pavement surface model, Ph.D. thesis, Oklahoma State Univ., Stillwater.
Wu, Z., C. Shen, and A. Van Den Hengel. (2019). Wider or deeper: Revisiting the
resnet model for visual recognition. Pattern Recognition, 90: 119–133.
Xie, S., and Z. Tu. (2015). Holistically-nested edge detection. The Proceedings of
ICCV.
Yang, X., H. Li, Y. Yu et al. (2018). Automatic pixel-level crack detection and
measurement using fully convolutional network. Comput.-Aided Civ.
Infrastructure. Eng., 33(12): 1090–1109.
Yim, J., D. Joo, J. Bae et al. (2017). A gift from knowledge distillation: Fast
optimization, network minimization and transfer learning. The Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4133–
4141.
Yosinski, J., J. Clune, Y. Bengio et al. (2014). How transferable are features in deep
neural networks? The Proceedings of NIPS.
Yu, F., and V. Koltun. (2016). Multi-scale context aggregation by dilated
convolutions. The Proceedings of ICLR.
Yu, R., A. Li, C.F. Chen et al. (2018). NIPS: Pruning networks using neuron
importance score propagation. The Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018: 9194–9203.
Zagoruyko, S., and N. Komodakis. (2017). Paying more attention to attention:
Improving the performance of convolutional neural networks via attention
transfer. The Proceedings of ICLR.
Zalama, E., J. Bermejo, R. Medina, and J. Llamas. (2014). Road crack detection using visual features extracted by Gabor filters. Computer-Aided Civil and Infrastructure Engineering, 29(5): 342–358.
Zeng, W., and Urtasun, R. (2018). MLPrune: Multi-layer pruning for automated
neural network compression. The Proceedings of International Conference on
Learning Representation.
Zhang, A., C.P. Kelvin, B. Wang et al. (2017). Automated pixel-level pavement
crack detection on 3D asphalt surfaces using a deep-learning network,
Comput.-aided Civ. Infrastruct. Eng., 32(10): 805–819.
Zhang, K.G., and H.D. Cheng. (2017). A novel pavement crack detection approach
using pre-selection based on transfer learning. The Proceedings of the 9th
International Conference on Image and Graphics, Shanghai, China.
Zhang, K.G., H.D. Cheng, and B. Zhang. (2018). Unified approach to pavement
crack and sealed crack detection using pre-classification based on transfer
learning. J. Comput. Civil Eng., 32(2): 04018001.
Zhang, K.G., H.D. Cheng, and S. Gai. (2018). Efficient dense-dilation network for
pavement crack detection with large input image size. The Proceedings of the
IEEE ITSC, Hawaii, USA.
Zhang, K.G., Y.T. Zhang, and H.D. Cheng. (2020). Self-supervised structure
learning for crack detection based on cycle-consistent generative adversarial
networks. J. Comput. Civil Eng., 34(3): 04020004.
Zhang, K.G., Y.T. Zhang, and H.D. Cheng. (2021). CrackGAN: Pavement crack detection using partially accurate ground truths based on generative adversarial learning. IEEE Trans. Intell. Transp. Syst.
Index

A
activation 5, 9, 14, 47, 48, 76–79

B
backpropagation 64

C
Caffe 13
civil engineering 1
computer vision 2, 4, 7, 57, 72, 86
convolutional neural networks 4, 8–11, 24, 45
crack detection 1–4, 6–9, 11, 13, 16, 24, 26, 27, 35, 37, 38, 40–43, 46, 49, 51, 54–58, 66, 67, 70, 72, 73, 86–88
cycle-GAN 58–60, 64, 71, 72

D
deep learning 1, 2, 4–8, 10–12, 24, 32, 38, 40, 50, 54, 55, 73, 85–88
dilated convolution 25, 26, 32, 33, 37
discriminator 39, 40, 42–48, 58, 60–65, 71, 72
domain adaptation 40, 57, 72

E
edge computing 2, 73, 84, 85

F
fully convolutional neural networks 9, 11

G
generative adversarial learning 2, 7, 38, 40, 41, 46, 54, 57, 87, 88
generator 39, 47, 48, 56, 60–66, 71, 87
gradient descent 14, 77, 78

I
image classification 9, 64
image processing 2, 9, 28, 50
image segmentation 2, 3, 8, 9, 11, 86
image translation 2, 45, 54, 56, 58, 61, 62, 66, 87

K
knowledge distillation 73, 76–81, 84, 88

M
machine learning 8, 17, 50, 67
model compression 75, 81, 84, 85

N
Nvidia 21, 50, 66

P
parameter quantization 80–84, 88
pattern classification 3, 5, 8
pavement crack detection 7, 51, 67, 86, 88
PyTorch 50, 66

R
random forest 21, 57
road management system 2

S
self-supervised learning 2, 57
semantic segmentation 25
structure learning 7, 55, 57, 72, 87

T
transfer learning 2, 10, 11, 14, 26, 27, 38, 48, 86

U
U-Net 41–48, 50–54, 61