
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

CrackGAN: Pavement Crack Detection Using Partially Accurate Ground Truths Based on Generative Adversarial Learning

Kaige Zhang, Member, IEEE, Yingtao Zhang, Member, IEEE, and Heng-Da Cheng, Life Senior Member, IEEE

Manuscript received December 26, 2018; revised August 28, 2019, December 16, 2019, and February 16, 2020; accepted April 9, 2020. The Associate Editor for this article was L. M. Bergasa. (Corresponding author: Heng-Da Cheng.)

Kaige Zhang and Heng-Da Cheng are with the Department of Computer Science, Utah State University, Logan, UT 84322 USA (e-mail: [email protected]; [email protected]).

Yingtao Zhang is with the Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TITS.2020.2990703

1524-9050 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Abstract— A fully convolutional network is a powerful tool for per-pixel semantic segmentation/detection. However, it is problematic when coping with crack detection using partially accurate ground truths (GTs). Because accurate GTs are unavailable and the data are imbalanced, the network may easily converge to a state that treats all pixels as background (BG) while still achieving a very low loss. This is named the “All Black” phenomenon. To tackle this problem, we propose crack-patch-only (CPO) supervised generative adversarial learning for end-to-end training. It forces the network to always produce crack-GT images while preserving both crack and BG-image translation abilities by feeding a larger crack image into an asymmetric U-shape generator, which overcomes the “All Black” issue. The proposed approach is validated on four crack datasets. It achieves state-of-the-art efficiency and accuracy compared with recently published works.

Index Terms— Pavement crack detection, fully convolutional networks, generative adversarial learning, partially accurate ground truths.

I. INTRODUCTION

Automatic pavement crack detection is a challenging task in intelligent pavement surface inspection systems [1], and it has been a research topic for more than three decades. However, industry-level pavement crack detection is still not well solved. Many published works have reported good results on specific crack datasets [2], [3], but the methods failed on industrial pavement images in which the cracks were thin and precise pixel-level ground truths (GTs) were difficult to obtain [4], [5]. Recently, the fully convolutional network (FCN) [6], trained end-to-end for pixel-level object segmentation/detection, was applied to pavement crack detection [5], [7]. However, it suffered from the “All Black” issue when processing industrial images: the network converged to a state that treated all pixels as background (BG) [5]. A similar issue was reported in [7], where the FCN failed to detect thin cracks.

It is known that deep learning is a data-driven approach that relies heavily on training data with accurate GTs. Deep networks are domain sensitive: the performance of a “well-trained” network may decrease on datasets obtained from different road sections and/or during different periods. It is therefore necessary to manually mark GTs to re-train the models for new pavement crack detection tasks. In industry, pavement images are captured using a camera mounted on top of a vehicle running on the road. Under such a setting, most cracks are very thin and crack boundaries are vague, which makes the annotation of pixel-level GTs very difficult. Marking the cracks as 1-pixel curves is more feasible and preferable in practice than labor-intensive per-pixel annotation because of its simplicity and low labor cost. Such GTs are named labor-light GTs. However, they may not match the cracks accurately at the pixel level; i.e., they are partially accurate GTs, which makes the loss computation inaccurate.

Moreover, as a long, narrow target, a crack occupies only a very small area in a full image. Patch-wise training is equivalent to loss sampling in FCN [6], so directly training an FCN for pixel-level crack detection makes the training set heavily imbalanced. This problem cannot be handled by simply rebalancing the data via the loss function, because the GTs are not accurate. In practice, the network simply converges to a state that treats the entire crack image as BG (labeled with zero) and still achieves good detection accuracy, since BG samples dominate the accuracy calculation. This is named the “All Black” problem, and it is common in industrial pixel-level pavement crack detection.

In general, existing computer vision-based crack detection approaches can be grouped into two categories: rule-based and machine learning-based methods. Rule-based methods try to extract pre-defined features to identify the cracks. Cheng et al. [8] proposed a fuzzy logic-based intensity thresholding method for crack segmentation, assuming that crack pixels were darker than BG pixels; however, the method failed on crack images with low foreground-background contrast. Wang et al. [9] introduced a wavelet-based edge detection algorithm; its drawback was that it could not handle cracks with high curvature or low continuity well. Oliveira and Correia [10] proposed a dynamic thresholding method for crack detection based on information entropy, which was sensitive to noise. Zou et al. [11] designed


an intensity-difference measuring function to find an optimal threshold for crack segmentation. However, its robustness was poor, and the method failed easily on different datasets. Many works introduced crack-linking methods to enhance crack continuity [12]–[17]. These methods did not solve the problem well and usually produced intolerable false positives by linking noise together. In addition, Tsai et al. [18] performed a comprehensive study of the performance of six low-level image segmentation algorithms, and Abdel-Qader et al. [19] discussed different edge detectors, including Sobel, Canny, and the fast Haar transformation [20]. Rule-based approaches are easy to implement; however, they are sensitive to noise, which results in poor generalizability.

Machine learning-based methods have attracted increasing attention during the past two decades. These methods perform crack detection in two steps: feature extraction and pattern classification. Cheng et al. [21] and Oliveira and Correia [22] used the mean and variance of an image block as features to train classifiers for pavement crack detection; however, their good performance relied heavily on complex post-processing. Hu et al. [23] and Gavilan et al. [24] used textural information to build the feature vectors and employed a support vector machine (SVM) for classification, but they could not handle images with complex pavement textures well. Zalama et al. [25] employed Gabor filters for feature extraction and AdaBoost [25] for crack identification. Shi et al. [3] combined multi-channel information to build the feature vector and employed a random structured forest [26] for crack-token mapping.

These methods tried to solve the problem by extracting hand-crafted features and training a classifier to discriminate cracks from the noisy BG. They did not address the issue well because hand-crafted feature descriptors usually compute statistics locally and lack a good global view, even when statistics from different locations are combined. Thus, they could not represent the global structural pattern well, which is important for discriminating cracks from noisy textures.

As one of the most important branches of machine learning, deep learning has achieved great success during the past ten years. It is the most promising way to solve challenging object detection problems, including pavement crack detection. Initially, deep learning-based object detection methods relied on window-sliding or region proposals and tried to find a bounding box for each possible object in an image. R-CNN (region-based convolutional neural networks) [27] was an early work that used selective search [28] to generate candidate regions and then sent the regions into a CNN for classification. Based on R-CNN, Cha et al. [29] designed a convolutional network for pavement crack detection that worked in window-sliding mode. Zhang et al. [4] employed a CNN for pre-classification, removing most of the noise areas before performing crack and sealed-crack detection.

These methods had two problems. (1) The window-sliding strategy was impractical due to its huge time complexity, especially for large images [5]. (2) Traditional region-proposal methods [28] were unable to select good candidate regions from noisy pavement images. They were also inefficient, because a great number of candidate regions had to be processed for a full-size image. Zhang et al. [30] employed parallel processing to improve the computational efficiency of region-based methods, but the computational resource costs were high. Zhang et al. [5] addressed the issue by generalizing a classification network into an end-to-end detection network with FCN. FCN is a one-stage pixel-level semantic segmentation method without window-sliding. Recently, Yang et al. [7] employed FCN for pixel-level crack detection. It achieved good results on concrete-wall images and on pavement images with clear cracks, but it failed to detect thin cracks. Moreover, the method relied on accurate pixel-level GTs, which are labor-intensive and often infeasible in industrial settings.

In addition, deep learning-based crack detection articles have continued to appear. Chen and Jahanshahi [31] and Park et al. [32] proposed NB-CNN and patch-CNN for crack detection, respectively. Tong et al. [33] used a deep convolutional neural network (DCNN) for crack length estimation. Hoang et al. [34] employed a CNN and an edge detector for crack recognition. Gopalakrishnan et al. [35] performed pavement distress detection with a DCNN. Zou et al. [36] introduced a DCNN for crack detection with hierarchical feature learning. These methods were based either on traditional classification networks with fully connected layers, which can only handle fixed-size input images, or on the FCN architecture, which relies on accurate, labor-intensive GTs.

In this paper, we propose CrackGAN for pavement crack detection. Its contributions are as follows:
(1) It solves a practical and essential problem of deep learning-based pixel-level crack detection methods, the “All Black” issue.
(2) It proposes crack-patch-only (CPO) supervised adversarial learning and the asymmetric U-Net architecture for end-to-end training.
(3) The network can be trained with partially accurate GTs generated by the labor-light method, which significantly reduces the workload of preparing GTs.
(4) It also solves the data imbalance problem, as a byproduct of the proposed approach.
Moreover, even though the network is trained with small image patches and partially accurate GTs, it can process full-size images and achieve strong performance.

The rest of the paper is organized as follows. Section II discusses the related works. Section III introduces the proposed method. Section IV describes the evaluation metrics and the experimental results. The final section provides the conclusion.

II. RELATED WORKS

This section discusses the techniques related to the proposed method.

A. Generative Adversarial Networks

Goodfellow et al. [37] proposed the generative adversarial network (GAN), which can be trained to generate real-like images by playing a max-min two-player game. Based on GAN, Mirza and Osindero [38] proposed the conditional GAN,


which introduced additional information (the condition) to the generator for producing specific outputs according to the input condition. Since GANs are difficult to train, Radford et al. [39] proposed the deep convolutional generative adversarial network (DC-GAN), which configured the generator with convolutional layers and made training easier and more stable. Based on the conditional GAN, Isola et al. [40] and Zhu et al. [41] built the generator as an encoding-decoding network, which turned the GAN into an image-to-image translation network. Inspired by these works, we formulate crack detection as an image-to-image translation problem. We introduce a generative adversarial loss that regularizes the objective function and overcomes the “All Black” issue when training with partially accurate GTs generated by the labor-light method.

Fig. 1. Overview of the proposed method.
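The idea of using an adversarial term to regularize a pixel-level objective can be made concrete with a small, framework-free sketch. The sketch rests on several assumptions:
- Discriminator outputs are probabilities in (0, 1].
- Generated and GT patches are 2-D lists with values in [0, 1].
- The adversarial term takes the non-saturating form −log D(G(x)), and the pixel term is the mean L1 distance, as later formalized in Eqs. (1) and (4).
- The weight λ = 0.30 follows the value reported for the final objective.
The function names and the discriminator scores in the example are hypothetical illustrations, not the authors' implementation.

```python
import math


def adversarial_loss(d_scores):
    """Non-saturating generator loss: -mean(log D(G(x))).

    d_scores: hypothetical discriminator outputs D(G(x)) in (0, 1], one per patch.
    """
    eps = 1e-12  # avoid log(0) when the discriminator is fully confident "fake"
    return -sum(math.log(max(s, eps)) for s in d_scores) / len(d_scores)


def l1_loss(pred, target):
    """Mean absolute difference between a generated patch and its (dilated) GT patch."""
    diffs = [abs(p - t)
             for row_p, row_t in zip(pred, target)
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)


def total_loss(d_scores, pred, target, lam=0.30):
    """Combined objective L_adv + lam * L_pixel (lam = 0.30 as reported in the text)."""
    return adversarial_loss(d_scores) + lam * l1_loss(pred, target)


if __name__ == "__main__":
    # Hypothetical 3x3 dilated GT with a vertical crack (1 = crack, 0 = BG).
    gt = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
    all_black = [[0, 0, 0] for _ in range(3)]
    # Hypothetical one-class discriminator scores: low for an all-black patch,
    # high for a crack-like patch.
    print(total_loss([0.01], all_black, gt))  # large: adversarial term dominates
    print(total_loss([0.90], gt, gt))         # small: crack-like and well placed
```

In this toy example, the pixel term alone charges the all-black output only λ · 3/9 = 0.1. If the crack covered a smaller fraction of the patch, that penalty would shrink further. The adversarial term, by contrast, stays large whenever a discriminator trained only on crack-GT patches rejects the output.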

B. Transfer Learning in DCNN

Transfer learning has been widely used for training deep convolutional neural networks. It aims to transfer knowledge learned in previous tasks to make training easier [42]. Different transfer learning strategies exist, depending on “what knowledge to transfer” and “how to transfer the knowledge”. Yosinski et al. [43] discussed the knowledge transferability of different layers in deep neural networks. Oquab et al. [44] transferred mid-level knowledge for natural image processing. Zhang et al. [4] transferred generic knowledge learned from ImageNet [45] to ease the training of a crack detection network. Zhang et al. [5] also transferred mid-level knowledge by introducing a dense-dilation layer into FCN to improve crack localization accuracy. The proposed approach employs transfer learning to train the prototype of the encoding network. It also transfers knowledge from a pre-trained DC-GAN to provide the generative adversarial loss for end-to-end training.

C. Fully Convolutional Network

Regular DCNNs usually employ convolutional layers for feature extraction and fully connected layers for classification [46]. A fully connected layer can be considered a special case of a convolutional layer whose kernel size equals the input size [5]. Long et al. [6] proposed the fully convolutional network (FCN) for per-pixel semantic segmentation. Based on FCN, Chen et al. proposed the DeepLab model [47] for multi-scale semantic segmentation, and Ronneberger et al. [48] proposed the U-Net architecture for medical image segmentation. Xie and Tu [49] employed FCN for contour detection, and Yu et al. [50] proposed a dilated convolutional design for multi-scale context aggregation. To improve computational efficiency, Zhang et al. [5] used FCN to generalize a patch-based classification network into a detection network for crack detection. The proposed approach uses FCN to build an asymmetric U-shape generator that can translate both crack and BG images. The FCN design also enables the patch-based CrackGAN, trained with small image patches, to work on full-size images seamlessly.

III. PROPOSED METHOD

Fig. 1 gives an overview of the proposed method. D is a pre-trained discriminator taken directly from a DC-GAN that was pre-trained using crack-GT patches only. This pre-trained discriminator forces the network to always generate crack-GT images, which is the most important factor in overcoming the “All Black” issue. A pixel-level loss ensures that the generated crack patterns match those of the input patch by minimizing the L1 distance to the dilated GTs. Since the network is trained with crack patches only, the asymmetric U-shape architecture is introduced to enable translation of both crack and non-crack images. After training, the generator itself serves as the crack detection network. In addition, the network is designed as a fully convolutional network, so it can process full-size images after patch-based training. The overall objective function is:

$$L_{final} = L_{adv} + \lambda L_{pixel} \tag{1}$$

where $L_{adv}$ is the adversarial loss provided by the pre-trained discriminator and $L_{pixel}$ is the pixel-level loss computed with the L1 distance.

A. “All Black” Issue

This work grew out of a practical engineering issue that the authors encountered in industry. In early attempts, we trained FCNs [6] for pixel-level pavement crack detection using the data and GTs from [5]. The results were not satisfactory: the networks easily converged to BG even when cracks were present. There were two likely reasons.

(1) Most cracks in the industrial pavement images were thin and had vague boundaries, which made per-pixel GT annotation very difficult. In practice, the engineers simply marked the cracks with 1-pixel curves, and these curves were used as GTs that were only partially accurate. Such GTs could not match the actual cracks well at the pixel level, so the loss computation was inaccurate and training failed.

(2) A crack, as a long, narrow object, occupies only a very small area in a full image. Since patch-wise training is equivalent to loss sampling in FCN [6], training an FCN end-to-end with pavement crack images actually works on an extremely imbalanced dataset. In addition, the 1-pixel GTs


used in this work were originally designed for crack patch sampling [4] and are highly biased, which makes the problem more serious. The network simply classified all pixels as BG and still achieved quite a “good” accuracy, since BG pixels dominate the whole image; this is the “All Black” issue. As shown in Figs. 2 and 3, the loss decreases rapidly during training and approaches a very low value, yet the detection results are all black (i.e., all BG). Other FCN architectures encounter the same problem; U-Net is used here as an example. Since the GTs are only partially accurate, existing approaches for handling data imbalance do not work here.

Fig. 2. “All Black” issue encountered when using an FCN-based method for pixel-level crack detection: (a) the industrial pavement crack image; (b) the dilated GT image used in training; and (c) the detection result of the “well-trained” U-Net (see Fig. 3).

Fig. 3. The loss and accuracy curves when training a regular U-Net using industrial pavement images with partially accurate GTs.

B. CPO-Supervision and One-Class Discriminator

Regular FCN-based methods may produce only all-black images as detection results [5]. To address this problem, a new constraint, the generative adversarial loss, is added to regularize the objective function so that the network always generates a crack-GT detection result. Accordingly, the training data are prepared with crack patches only (i.e., CPO-supervision), without any non-crack patches or “all black” patches [51]. As shown in Fig. 1, the adversarial loss is provided by a one-class discriminator obtained by pre-training a DC-GAN [39] only with crack-GT-like patches. DC-GAN can generate real-like images from random noise by training through a max-min two-player game, in which a generator produces real-like images and a discriminator distinguishes between real and fake images. As verified in [37], it is better for G to maximize $\log D(G(z))$ than to minimize $\log(1 - D(G(z)))$. Therefore, the actual optimization strategy alternates between the following two objectives [52]:

$$\max_D V(D, G) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_d(z)}[\log(1 - D(G(z)))] \tag{2}$$

$$\max_G V(D, G) = \mathbb{E}_{z \sim p_d(z)}[\log D(G(z))] \tag{3}$$

where $x$ is an image from the real data (crack-GT-like patches) with distribution $p_d(x)$; $z$ is a noise vector drawn from a Gaussian distribution $p_d(z)$; and the discriminator D and the generator G are built with convolutional and deconvolutional kernels, respectively. Whether a sample counts as real or fake depends on the data setting. Under CPO-supervision, the real image set used to train the DC-GAN contains only crack-GT patches. The discriminator therefore recognizes only crack-GT patches as real and treats all-black patches as fake. This prevents the network from generating an all-black (fake) image as the detection result, which overcomes the “All Black” issue. Such a discriminator is named a one-class discriminator. In the implementation, the crack-GT data are further augmented by manually marking a number of “crack” curves and sampling patches from them, as indicated in Fig. 4.

Fig. 4. Pre-training a one-class DC-GAN with augmented GTs based on CPO-supervision. The real crack-GT data are augmented with manually marked “crack” curves.

As shown in Fig. 5, after pre-training, the discriminator of the trained DC-GAN is attached to the end of the asymmetric U-Net generator to provide the adversarial loss for end-to-end training. Since the output of the generator serves as a fake image, the adversarial loss is:

$$L_{adv} = -\mathbb{E}_{x \in I}[\log D(G(x))] \tag{4}$$

Unlike in the DC-GAN pre-training of Eq. (2), here $x$ is a crack patch and $I$ is the training set containing crack patches only. G is the asymmetric U-Net illustrated in Fig. 5, and D is the pre-trained one-class discriminator.

C. Asymmetric U-Net Generator

Subsection III-B introduced CPO-supervision and generative adversarial learning to force the network to always


generate crack-GT patches and prevent the “All Black” phenomenon. However, a crack detection system must be able to process both crack and non-crack/BG pavement images. Normally, the discriminator would treat an all-black patch as real to represent the BG-image translation result, as happens when the original pix2pix GAN [40] is applied directly. Unfortunately, treating all-black patches as real encourages the network to output all-black images as detection results, which works against solving the “All Black” issue. To include BG-image translation under CPO-supervision, the regular U-Net generator of the original pix2pix GAN is replaced with the asymmetric U-shape generator. This generator takes a larger crack patch (256 × 256) as input and outputs a smaller crack-GT image (128 × 128) for end-to-end training. Under CPO-supervision, the larger input image must be a crack image, so that the correct output is always a crack-GT patch that the discriminator recognizes as real. With this setting, the network can translate both crack images and BG images after training.

Fig. 5. Asymmetric U-Net with larger input image (under larger field of view) and CPO-supervision.

Receptive Field Analysis Under Larger Field of View: To explain how the asymmetric U-Net gains BG-image translation ability when trained on crack samples only, we first analyze the receptive field under a larger field of view. In Fig. 6, a DCNN classification network takes an m × m image patch as input and outputs a single neuron representing the class label of that patch. When the same DCNN is fed a larger input image, it outputs multiple neurons. Each neuron represents the class label of the corresponding m × m patch “sampled” from the larger input image. For example, when the network's input is an m × 3m image, the output has five neurons; the exact number depends on the down-sampling rate of the DCNN. These five neurons represent the class labels of five m × m patches that include both crack and non-crack samples: from left to right, the first three neurons represent crack samples and the last two represent BG samples. Under the multi-layer convolutional mode, each neuron has a receptive field of a specific size. Because convolutional layers are insensitive to input size, running the network on a larger input realizes multi-spot image sampling, with each sample equal in size to a neuron's receptive field. Thus, translating a larger input image with a deep convolutional neural network is equivalent to translating multiple smaller image samples at the same time, each the size of the receptive field of the original image translation network.

According to this analysis, suppose a crack image larger than the discriminator's input size passes through the asymmetric U-Net. The network then produces a downsampled image patch that exactly matches the input size of the discriminator. The one-class discriminator treats this output as a single image during generative adversarial learning, so the working mechanism of CPO-supervision is preserved. However, since the network is trained to translate a larger crack image into a downsampled crack-GT image, it inherently includes the translation of both crack and non-crack image samples. In this way, the network learns to process both crack and BG images, as illustrated in Fig. 6.

Fig. 6. Receptive field analysis under larger field of view: with a larger input image, the CNN realizes multi-spot sampling with the same receptive field. At the right, the asymmetric network architecture includes the image translation of both crack and BG samples, although the training data contain crack patches only.

D. L1 Loss With Dilated GTs

CPO-supervised generative adversarial learning and the asymmetric U-Net prevent the “All Black” phenomenon. However, they provide only image-level supervision, which does not specify the exact location of the cracks in the generated image. As analyzed before, one cause of the “All Black” issue is pixel-level mismatch due to inaccurate GTs. Therefore, dilated GTs are introduced to mark a relatively larger crack area that covers the actual crack locations; a detected crack pixel that falls inside the dilated area is treated as a true positive. The experiments demonstrate that, by combining the CPO-supervised adversarial loss with this loosely supervised L1 loss, the network can be trained to generate cracks in the expected locations. Following [4], the cracks are marked with 1-pixel-wide curves. Crack patches and partially accurate crack-GT patches are then cropped from the original pavement images and from the images with partially accurate GTs, respectively. The partially accurate 1-pixel-wide GTs are dilated three times using a disk structuring element of radius 3 to produce the dilated GTs. These provide the loosely supervised pixel-level loss:

$$L_{pixel} = \mathbb{E}_{x \in I, y}\left[\lVert y - G(x) \rVert_1\right] \tag{5}$$


where $x$ is the input crack patch; $y$ is the dilated GT; $I$ is the dataset of larger crack patches (256 × 256, versus the 128 × 128 output) used for end-to-end training; G is the asymmetric U-Net; and D is the discriminator. Overall, the final objective function is:

$$L_{final} = L_{adv} + \lambda L_{pixel} \tag{6}$$

The pixel-level loss is normalized during training, and λ = 0.30 is determined via grid search with a step size of 0.05. Fig. 7 shows the detection result for a sample image. Once training is finished, the asymmetric U-Net generator itself serves as the detection network, translating the original pavement image into the result image.

Fig. 7. Detection results of CrackGAN: (a) industrial pavement image that suffered from the “All Black” issue; (b) detection result of CrackGAN; (c) the final result image after removing isolated noise.

E. Working on Full-Size Images

The network is trained with small image patches, but industrial images are much larger (2048 × 4096 pixels). A traditional way to process a large input image is to divide it into smaller patches and process them one by one, which is known as the window-sliding strategy [4], [29]. However, this is very inefficient [5]. In our approach, the asymmetric U-Net is designed as a fully convolutional network, so it can work on images of arbitrary size seamlessly. This fully convolutional processing is also efficient because it involves no redundant convolutions [5].

F. Implementation Details

Network architecture: Fig. 5 presents the architecture of the asymmetric U-Net. The first layer uses 7 × 7 convolutional kernels with stride 2 and is followed by a rectified linear unit (ReLU) [53]. This layer is the asymmetric part of the U-Net generator: it downsamples the larger input image by a factor of 2 and outputs feature maps of the same size as the final output of the asymmetric U-Net. The remaining layers of the encoding and decoding parts follow the regular U-Net architecture [48]. The encoding part consists of four repeated convolutional layers with 3 × 3 kernels and stride 2, each followed by a ReLU layer. After each of the first three convolutional layers, the number of channels is doubled. The decoding part consists of four 3 × 3 de-convolutional layers that up-sample the feature maps; the […] the encoding part, then followed by a regular convolutional layer. After the last de-convolutional layer, another regular convolutional layer with Tanh activation [46] translates the 64-channel feature map into a 1-channel image. This image is compared with the dilated GT to compute the L1 loss of Eq. (5). In summary, the network architecture is as follows.

The encoding part:
C_64_7_2 - ReLU - C_128_3_1 - ReLU - C_128_3_2 - ReLU - C_256_3_1 - ReLU - C_256_3_2 - ReLU - C_512_3_1 - ReLU - C_512_3_2 - ReLU - C_512_3_1 - ReLU - C_512_3_2 - ReLU

The decoding part:
DC_512_3_2 - ReLU - C_512_3_1 - ReLU - DC_256_3_2 - ReLU - C_256_3_1 - ReLU - DC_128_3_2 - ReLU - C_128_3_1 - ReLU - DC_64_3_1 - ReLU - C_64_3_1 - ReLU - C_1_3_1 - Tanh

The naming rule follows the format “layer type_channel number_kernel size_stride”. “C” denotes convolution, “DC” denotes de-convolution, and Tanh is the Tanh activation. For instance, “C_64_7_2” means that the first layer is a convolutional layer with 64 channels, kernel size 7, and stride 2.

Fig. 8. Weakly supervised learning is able to learn rich crack pattern information: (a) the image patches input into the classification network; (b) the feature maps after the first convolutional layer.

a) Network training: Training follows a two-stage strategy that employs transfer learning in two places: the one-class discriminator and the encoding part of the generator. In the first stage, the DC-GAN is trained with 128 × 128-pixel crack-GT patches, as described in subsection III-B. The goal is a discriminator with strong crack-pattern recognition ability that can provide the adversarial loss for end-to-end training in the second stage. A total of 60,000 dilated crack-GT patches with various crack patterns are used. The other training settings follow [39]: the Adam optimizer [54], a learning rate of 0.0002, momentum parameters of 0.9, a batch size of 128, and a 128-dimensional input “noise” vector. The final model is obtained after 100 epochs, where each epoch consists of (total images)/(batch size) = 60000/128 iterations. The trained discriminator is then attached to the end of the asymmetric U-Net to provide the adversarial loss in the second stage (see Fig. 5 and Eq. (4)).

Inspired by [4], the encoding part of the generator is also pre-trained under the classification setting. Zhang et al. [4] showed that by performing an image-block classification task, the network was able to extract the relevant crack patterns;
input of each de-convolutional layer is the output of the last and the learned knowledge could be transferred to ease the
layer concatenated with the corresponding feature map from training of an end-to-end detection network [5]. Fig. 8 is the

low-level feature maps of a classification network trained with crack and non-crack patches [4]. The classification network is configured by adding a fully connected layer at the end of the encoding part (bottleneck) of the asymmetric U-Net. Its output dimension is 2, representing crack and non-crack with labels 0 and 1. The training samples are crack and non-crack patches of 256 × 256 pixels. The figure shows that the network extracts the same crack pattern as the original image. In other words, the network can learn useful information from weakly supervised information, namely crack/non-crack image labels only. The well-trained parameters are then used to initialize the encoding part of the generator for the end-to-end training. The other settings are the same as for the DC-GAN, except that the generator is replaced with the asymmetric U-Net and the objective function is changed according to Eq. (6).

b) Parameter selection: The parameter λ in Eq. (6) and the dilation scale are determined by grid search. The grid search board in Fig. 9 gives the ranges for an effective detection:
- The dilation scale, i.e., the number of dilations with the disk structure, should be between 3 and 7.
- λ should be between 0.15 and 0.4.

A dilation scale smaller than 3 causes the “All Black” issue due to pixel-level mismatching and data imbalance. Too large a dilation (>8) cannot provide meaningful crack location information and produces useless output. In addition, a large λ tends to make the task fail, whereas a small λ (0.15 < λ < 0.4) works well. This indicates that the adversarial loss is essential for successful training under the industrial setting. The grid board also shows that the HD-score is either good (>80) or very small (0 or 1), which represents the two model statuses: well-trained or failed. As indicated in Fig. 9, the failures fall into three main cases:
- over-dilation;
- the “All Black” problem;
- a useless output pattern that overlooks the pixel-level loss when λ is too small.

Fig. 9. Grid search board with HD-scores used to determine the optimal parameters λ and dilation scale. The optimal parameters are determined according to the best HD-score. Testing results of the three main failure cases are also presented in the figure: the “All Black” result, the over-dilated result, and the nonsense patterns that overlook the pixel-level loss.

IV. EXPERIMENTS

A. Dataset and Metrics

CFD [3], the CrackGAN dataset (CGD) collected by the authors, and dataset [17] are used for evaluation. CFD contains 118 pavement crack images (480 × 320 pixels each) taken with an iPhone by people standing on the road. The ground truths are carefully marked at the pixel level, which is labor-intensive. The image quality is high, and the background is smooth and clean.

CGD contains 400 pavement crack images (2048 × 4096 pixels each). The authors collected them with a line-scan industrial camera mounted on top of a vehicle running at 100 km/h. The camera scans a 4.096-meter-wide road surface and produces a 2048 × 4096-pixel pavement image for every 2048 line scans, so 1 pixel represents a 1 × 1 mm² area. Most of the cracks are thin and sometimes even hard for humans to recognize. Furthermore, it is infeasible to obtain accurate GTs at the pixel level. The cracks are therefore represented by 1-pixel curves roughly marked by engineers in a labor-light way; we call these partially accurate GTs. Such GTs may not match the true crack locations accurately, which makes processing them much more challenging. The proposed algorithm achieves the best results on both the “accurate” and the “partially accurate” datasets, which demonstrates its robustness.

For CFD and CGD, the data are augmented following [4] to facilitate the training, and the training-test ratio is 2:1. Dataset [17] has industrial images from five different capture systems. The number of annotated images per system is:
- Aigle-RN: 38
- ESAR: 15
- LCMS: 5
- LRIS: 3
- Tempest: 7

The GTs are marked at the pixel level. To the best of our knowledge, it is the only public pavement crack dataset from industry. Because it contains relatively few images, it is used for testing only.

Fig. 10. Illustration of pixel-level mismatching: (a) a crack image; (b) detection result overlapped with the GT image dilated once using a disk structure with radius 3; and (c) detection result overlapped with the GT image dilated four times. The transparent areas represent the dilated GT cracks at different dilation scales.

Unlike in most object detection tasks [55], the intersection over union (IOU) is not suitable for evaluating crack detection algorithms [56]. As shown in Fig. 10, a crack is a long, narrow target that occupies only a very small area, and the image consists mainly of BG pixels. Because precise pixel-level GTs are difficult to obtain, the accurate intersection area cannot be obtained either. In Fig. 10(b) and Fig. 10(c), the detection results are clearly very good, yet the IOU values are very low: 0.13 and 0.2, respectively. Following [56], we employ the Hausdorff distance to evaluate the crack localization accuracy. For two sets of points A and B, the Hausdorff distance is calculated with:

$$H(A, B) = \max\left[h(A, B),\, h(B, A)\right] \tag{7}$$

where

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert \tag{8}$$
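Eqs. (7)–(8) map almost directly onto array operations. The following is a minimal NumPy sketch for illustration only, not the evaluation code used in this paper. It assumes that the detected and GT cracks have already been converted into arrays of (row, col) pixel coordinates. The function names and the toy curves are hypothetical.

```python
import numpy as np


def directed_hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """h(A, B) of Eq. (8): the largest distance from a point of A to its nearest point of B.

    Assumption: A and B are (N, 2) and (M, 2) arrays of (row, col) coordinates.
    """
    # Pairwise Euclidean distances, shape (N, M); memory grows as N * M.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # For each a in A keep its nearest b in B, then take the worst case over A.
    return float(d.min(axis=1).max())


def hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """H(A, B) of Eq. (7): the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical example: a 1-pixel GT curve along row 10 (columns 0-19) and a
# detection that runs 2 pixels below it but stops 3 columns short.
gt = np.array([[10, c] for c in range(20)], dtype=float)
det = np.array([[12, c] for c in range(17)], dtype=float)

print(directed_hausdorff(det, gt))  # → 2.0 (every detected pixel is 2 px from the GT)
print(hausdorff(gt, det))           # sqrt(13) ≈ 3.606, set by the undetected GT tail
```

For binary masks, `np.argwhere(mask)` yields such coordinate arrays. The pairwise matrix grows with |A|·|B|, so for full 2048 × 4096 images a routine such as `scipy.spatial.distance.directed_hausdorff` or a distance transform (`scipy.ndimage.distance_transform_edt`) is likely more practical. The toy example also shows that a single missed tail dominates H(A, B). The HD-score used in this paper instead replaces the outer maximum with an average of saturated distances.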

The penalty is defined as:

$$h_p(A, B) = \frac{1}{|A|} \sum_{a \in A} \operatorname{sat}_u\!\left(\min_{b \in B} \lVert a - b \rVert\right) \tag{9}$$

Here, u is the upper limit of the saturation function sat, i.e., $\operatorname{sat}_u(x) = \min(x, u)$. The saturation directly discards false positives that are far away from the GTs. Instead of setting u to 1/5 of the image width [56], we set it to 50 pixels. This emphasizes the localization accuracy by eliminating the influence of possible noise from the large BG areas. With A as the detected crack set and B as the GT set, the overall score is:

$$\text{score}_{BH}(A, B) = 100 - \frac{BH(A, B)}{u} \times 100 \tag{10}$$

where

$$BH(A, B) = \max\left[h_p(A, B),\, h_p(B, A)\right] \tag{11}$$

The Hausdorff distance score (HD-score) reflects the overall crack localization accuracy. It is insensitive to the foreground-background imbalance inherent in detecting long, narrow objects.

In addition, the region-based precision rate (p-rate) and recall rate (r-rate) are used for evaluation. They measure the false-detection severity and the missed-detection severity, respectively. In Fig. 11, a 400 × 400-pixel pavement image is divided into small image patches of 50 × 50 pixels. If a crack is detected in a patch, the patch is marked as “1” and counted as positive. In the same way, a patch of a GT image that contains a marked curve is a crack patch. The region-based true positives (TP), false positives (FP), and false negatives (FN) are then obtained by counting the corresponding squares. They are used to calculate the region-based precision and recall rates:

$$P_{region} = \frac{TP_{region}}{TP_{region} + FP_{region}} \tag{12}$$

$$R_{region} = \frac{TP_{region}}{TP_{region} + FN_{region}} \tag{13}$$

Then the region-based F1 score can be computed as:

$$F1_{region} = \frac{2 \cdot P_{region} \cdot R_{region}}{P_{region} + R_{region}} \tag{14}$$

Fig. 11. Region-based evaluation: (a) the original crack image; (b) illustration of counting the crack and non-crack regions. The squares labeled “1” are crack regions, and those labeled “0” are BG regions.

TABLE I
QUANTITATIVE EVALUATIONS ON CFD

B. Overall Performance

The comparisons are performed on CFD [3], CGD, and dataset [17] to demonstrate the state-of-the-art performance. Some of the papers provided only their final detection results, so PR curves are plotted only for the methods whose source code is available.

1) CFD: The proposed method is compared on CFD with CrackIT-v1 [57], MFCD [58], CrackForest [3], [29], FCN-VGG [7], Pix2pix GAN (with U-Net as the generator) [40], and DeepCrack [36]. The related results are shown in Fig. 12, Fig. 13(a), and Table I.

CrackIT used the traditional mean and standard deviation (STD) for crack patch selection, followed by post-processing for pixel-based crack detection. However, mean and STD features cannot select the crack patches well, especially when the cracks are thin. Its false negative rate is therefore high, and it cannot detect any cracks at all in the second and third sample images of Fig. 12.

MFCD developed a complex path verification algorithm that links candidate crack seeds for the detection; however, it may also connect false positives and generate fake cracks. As shown in Fig. 12, it produces a lot of noise in the third image, which has a non-smooth background.

CrackForest used integral channel features with 3 colors, 2 magnitudes, and 8 orientations, applied a random forest for crack token mapping, and removed noise using the histogram difference between crack and non-crack regions. As shown in Fig. 12, it achieves very good results on images with smooth, clean backgrounds. However, its performance deteriorates on the industrial images, as shown in Figs. 14 and 15.

The method in [29] was a patch-level crack detection method that trained a deep classification network to classify crack and non-crack patches. It could not provide accurate crack locations, as shown in Fig. 12.

FCN-VGG was a pixel-level crack detection method that needed accurate pixel-level GTs to train the FCN-based network end-to-end. As in the results reported in the original papers, it failed to detect thin cracks.

DeepCrack achieves very good results on CFD thanks to its multi-scale hierarchical fusion. However, its training relied on accurate GTs, and the method fails easily when the GTs are biased.

Pix2pix GAN [40] was an image-to-image translation network with U-Net as the generator, which originally introduced generative adversarial learning for image style translation. However, as discussed in section III-C, its discriminator treats both crack and non-crack as real, which can immediately weaken the
crack-patch generation ability and makes the network similar to the regular U-Net. It therefore achieves results similar to those of the FCN-based methods and also encounters the “All Black” problem, as shown in Figs. 14 and 15.

CrackGAN introduces CPO-supervision and the asymmetric U-Net architecture to build the one-class discriminator for generative adversarial learning. Treating the all-black patch as a fake image enhances the discriminator's crack-patch discrimination ability and avoids the data imbalance problem inherent in crack-like object detection. This in turn improves the crack detection ability, especially for thin and tiny cracks. As shown in Fig. 12 and Table I, CrackGAN achieves the best results.

Fig. 12. Comparison of the detection results on CFD using different methods. From top to bottom: original images, GT images, and the results of CrackIT, MFCD, CrackForest, [29], FCN-VGG, DeepCrack, Pix2pix GAN, and CrackGAN.

Note that in Tables I, II, and III, some p-rates and r-rates of CrackGAN are not the maximum values; this does not contradict its state-of-the-art performance. For example, in Table I, CrackIT achieved the best p-rate (88.05%) even though it missed quite a lot of cracks. The precision is calculated as TP/(TP+FP), so if FP is small, the p-rate can still be large even when FN is very large. Similarly, [29] achieved a very good r-rate (98.21%) even though its patch-level detection causes a lot of false positives. The reason is that recall is calculated as TP/(TP+FN), which does not take FP into account. Therefore, neither the p-rate nor the r-rate alone can represent the performance of a crack detection algorithm.

Also note that when TP and FP are both zero, the precision is mathematically undefined due to the zero denominator. In that situation the method has missed all the crack pixels, so we define the precision as zero. This leaves the PR curve with an open left end, as shown in Fig. 13. Refer to Table I and Fig. 13(a) for the quantitative results.

Fig. 13. Precision-recall curves. (a), (b), and (c) are the PR curves on CFD, CGD, and dataset [17], respectively.

2) CGD: The related results on CGD are shown in Fig. 14, Fig. 13(b), and Table II. As on CFD, CrackIT misses most cracks because they are thin, and MFCD introduces a lot of noise because of the textured background.

CrackForest introduces a lot of noise, much of it connected to the true crack regions. The method does not consider removing noise connected to the true positives [4], so it achieves a low p-rate of 31.01%. As on CFD, [29] could not give accurate crack locations and achieves a low p-rate of 69.20%. Suffering from the “All Black” issue, FCN-VGG recognizes all crack and non-crack patches as background.

For a fair evaluation, we compare with DeepCrack under two settings. DeepCrack-1 exactly follows the original paper and is trained with CrackTree data [11]; DeepCrack-2 is re-trained on CGD. As presented in Table II and Fig. 14, DeepCrack-1 introduces unacceptable noise because its performance degrades across domains, while DeepCrack-2 encounters the “All Black” problem.

As discussed in section III-C, with the default settings the discriminator in the original Pix2pix GAN recognizes both crack-GT and all-black-GT as real. This damages its crack-GT generation ability and makes it behave like a regular U-Net, so it also produces all-black results.

By introducing CPO-supervision and adversarial learning with the asymmetric U-Net generator, our model can be trained to generate crack-like results without losing the BG translation ability, and it overcomes the “All Black” issue. As shown in Fig. 14, it can detect thin cracks in pavement images obtained in industrial settings. Refer to Table II and Fig. 13(b).

TABLE II
QUANTITATIVE EVALUATION ON CGD

3) Dataset [17]: Fig. 15, Fig. 13(c), and Table III present the related results on dataset [17]. As on CGD, CrackIT missed most cracks because of drawbacks in its feature extraction and post-processing. CrackForest could not remove the noise connected to the true crack regions and achieves a very low p-rate.

MPS [17] is a traditional image processing method based on minimal path selection. It performs the detection in three steps: endpoint selection, minimal path estimation, and minimal path selection. It achieved good results, as shown in Fig. 15 and Table III. However, it relies on quite a few tunable parameters and post-processing procedures that require extra manual work.

As discussed before, FCN-VGG and Pix2pix GAN fail due to the “All Black” issue. In contrast, CrackGAN properly handles the “All Black” problem and achieves the best performance.

TABLE III
QUANTITATIVE EVALUATION ON DATASET [17]

In addition to pavement crack detection, the proposed method can also handle other crack detection tasks. Fig. 16 provides crack detection results on concrete pavement images and concrete wall images, using the model trained with dataset [7] and the labor-light GTs.

C. Computational Efficiency

In addition to the detection accuracy, we also compared the computational efficiency of the methods that have public testing code. Table IV presents the average time for processing a full-size 2048 × 4096-pixel image. CrackIT-v1 [57] takes 6.1 seconds on average. CrackForest [3] takes relatively less time (4.0 seconds) because it uses parallel computing to implement the random forest for image patch classification. Both methods are implemented in Matlab-2016b and run on the CPU of an HP 620 workstation with 32G
memory and twelve i7 cores. The deep learning methods are run on the same computer but on an Nvidia 1080Ti GPU with PyTorch. The method in [29] takes 10.2 seconds because it is based on window sliding. FCN [7], DeepCrack, Pix2pix GAN, and CrackGAN take much less time owing to their FCN architectures. CrackGAN takes even less time (1.6 seconds) because the asymmetric U-Net design cuts off the last de-convolutional layer.

Fig. 14. Comparison of the detection results on CGD. From top to bottom: original images, GT images, and the results of CrackIT, CrackForest, [29], FCN-VGG, DeepCrack-1, DeepCrack-2, Pix2pix GAN, and CrackGAN.

Fig. 15. Comparison of the detection results on dataset [17]. From top to bottom: original images, GT images, and the results of CrackIT, CrackForest, FCN-VGG, Pix2pix GAN, MPS [17], and CrackGAN.

TABLE IV
COMPARISONS OF COMPUTATIONAL EFFICIENCY

Fig. 16. Crack detection on concrete pavement images and concrete wall images: (a) and (b) are concrete wall images; (c) and (d) are concrete pavement images; (e)-(h) are the detection results by CrackGAN.

V. CONCLUSION

In this work, we propose a novel deep generative adversarial network, named CrackGAN, for pavement crack detection. The method solves a practical and essential problem: the “All Black” issue that arises in FCN-based pixel-level crack detection when partially accurate GTs are used. More importantly, the network can solve crack detection tasks in a labor-light way. It can significantly reduce the workload of preparing GTs, and create
a new idea for object detection/segmentation using partially accurate GTs. Moreover, as a byproduct of the proposed approach, the method also solves the data imbalance problem. In addition, the network is trained with small image patches but can process images of any size. The experiments demonstrate the effectiveness and superiority of the proposed method, which achieves state-of-the-art performance compared with recently published works.

A theoretical analysis of a neuron’s properties with respect to its receptive field can explain many phenomena in deep learning. Examples include the boundary vagueness in semantic segmentation [6] and the blurriness of images generated by GANs [40], [41]. We believe that the per-neuron analysis discussed in this paper could become a routine step in designing effective neural networks in the future.

REFERENCES

[1] N. F. Hawks and T. P. Teng, “Distress identification manual for the long-term pavement performance project,” Nat. Acad. Sci., Washington, DC, USA, Tech. Rep. SHRP-P-338, 2014.
[2] L. Zhang, F. Yang, Y. Daniel Zhang, and Y. J. Zhu, “Road crack detection using deep convolutional neural network,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 3708–3712.
[3] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. S. Chen, “Automatic road crack detection using random structured forests,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 12, pp. 3434–3445, Dec. 2016.
[4] K. Zhang, H. D. Cheng, and B. Zhang, “Unified approach to pavement crack and sealed crack detection using preclassification based on transfer learning,” J. Comput. Civil Eng., vol. 32, no. 2, Mar. 2018, Art. no. 04018001.
[5] K. Zhang, H.-D. Cheng, and S. Gai, “Efficient dense-dilation network for pavement cracks detection with large input image size,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Maui, HI, USA, Nov. 2018, pp. 884–889.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 3431–3440.
[7] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, and X. Yang, “Automatic pixel-level crack detection and measurement using fully convolutional network,” Comput.-Aided Civil Infrastruct. Eng., vol. 33, no. 12, pp. 1090–1109, Dec. 2018.
[8] H. D. Cheng, J. Chen, C. Glazier, and Y. Hu, “Novel approach to pavement cracking detection based on fuzzy set theory,” J. Comput. Civil Eng., vol. 13, no. 4, pp. 270–280, 1999.
[9] K. Wang, Q. Li, and W. Gong, “Wavelet-based pavement distress image edge detection with a trous algorithm,” Transp. Res. Rec., vol. 2024, no. 1, pp. 24–32, 2000.
[10] H. Oliveira and P. L. Correia, “Automatic road crack segmentation using entropy and image dynamic thresholding,” in Proc. 17th Eur. Signal Process. Conf., Glasgow, U.K., Aug. 2009, pp. 622–626.
[11] Q. Zou, Y. Cao, Q. Li, Q. Mao, and S. Wang, “CrackTree: Automatic crack detection from pavement images,” Pattern Recognit. Lett., vol. 33, no. 3, pp. 227–238, Feb. 2012.
[12] Y. Huang, “Automatic inspection of pavement cracking distress,” J. Electron. Imag., vol. 15, no. 1, Jan. 2006, Art. no. 013017.
[13] G. Li, S. He, Y. Ju, and K. Du, “Long-distance precision inspection method for bridge cracks with image processing,” Automat. Construct., vol. 41, pp. 83–95, May 2014.
[14] J. Chen, M. Su, R. Cao, S. Hu, and C. Lu, “A self organizing map optimization based image recognition and processing model for bridge crack inspection,” Automat. Construct., vol. 73, pp. 58–66, Jan. 2007.
[15] L. Wu, S. Mokhtari, A. Nazef, B. Nam, and H.-B. Yun, “Improvement of crack-detection accuracy using a novel crack defragmentation technique in image-based road assessment,” J. Comput. Civil Eng., vol. 30, no. 1, Jan. 2016, Art. no. 04014118.
[16] T. S. Nguyen, S. Begot, F. Duculty, and M. Avila, “Free-form anisotropy: A new method for crack detection on pavement surface images,” in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 1069–1072.
[17] R. Amhaz, S. Chambon, J. Idier, and V. Baltazart, “Automatic crack detection on two-dimensional pavement images: An algorithm based on minimal path selection,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 10, pp. 2718–2729, Oct. 2016.
[18] Y.-C. Tsai, V. Kaul, and R. M. Mersereau, “Critical assessment of pavement distress segmentation methods,” J. Transp. Eng., vol. 136, no. 1, pp. 11–19, Jan. 2010.
[19] I. Abdel-Qader, O. Abudayyeh, and M. E. Kelly, “Analysis of edge-detection techniques for crack identification in bridges,” J. Comput. Civil Eng., vol. 17, no. 4, pp. 255–263, Oct. 2003.
[20] R. C. Gonzalez, R. E. Woods, and S. L. Steven, Digital Image Processing Using MATLAB, Boston, MA, USA: Addison-Wesley, 2009.
[21] H. D. Cheng, J. Wang, Y. G. Hu, C. Glazier, X. J. Shi, and X. W. Chen, “Novel approach to pavement cracking detection based on neural network,” Transp. Res. Rec., J. Transp. Res. Board, vol. 1764, no. 1, pp. 119–127, Jan. 2001.
[22] H. Oliveira and P. L. Correia, “Automatic road crack detection and characterization,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 1, pp. 155–168, Mar. 2013.
[23] Y. Hu, C.-X. Zhao, and H.-N. Wang, “Automatic pavement crack detection using texture and shape descriptors,” IETE Tech. Rev., vol. 27, no. 5, pp. 398–405, 2010.
[24] M. Gavilán et al., “Adaptive road crack detection system by pavement classification,” Sensors, vol. 11, no. 10, pp. 9628–9657, 2011.
[25] E. Zalama, J. Gómez-García-Bermejo, R. Medina, and J. Llamas, “Road crack detection using visual features extracted by Gabor filters,” Comput.-Aided Civil Infrastruct. Eng., vol. 29, no. 5, pp. 342–358, May 2014.
[26] P. Dollar and C. L. Zitnick, “Structured forests for fast edge detection,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1841–1848.
[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[28] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013.
[29] Y.-J. Cha, W. Choi, and O. Büyüköztürk, “Deep learning-based crack damage detection using convolutional neural networks,” Comput.-Aided Civil Infrastruct. Eng., vol. 32, no. 5, pp. 361–378, May 2017.
[30] A. Zhang et al., “Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network,” Comput.-Aided Civil Infrastruct. Eng., vol. 32, no. 10, pp. 805–819, 2017.
[31] F. C. Chen and M. R. Jahanshahi, “NB-CNN: Deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion,” IEEE Trans. Ind. Electron., vol. 65, no. 5, pp. 4392–4400, May 2018.
[32] S. Park, S. Bang, H. Kim, and H. Kim, “Patch-based crack detection in black box images using convolutional neural networks,” J. Comput. Civil Eng., vol. 33, no. 3, May 2019, Art. no. 04019017.
[33] Z. Tong, J. Gao, Z. Han, and Z. Wang, “Recognition of asphalt pavement crack length using deep convolutional neural networks,” Road Mater. Pavement Des., vol. 19, no. 6, pp. 1334–1349, Aug. 2018.
[34] H. Nhat-Duc, Q.-L. Nguyen, and V.-D. Tran, “Automatic recognition of asphalt pavement cracks using Metaheuristic optimized edge detection algorithms and convolution neural network,” Automat. Construct., vol. 94, pp. 203–213, Oct. 2018.
[35] K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal, “Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection,” Construct. Building Mater., vol. 157, pp. 322–330, Dec. 2017.
[36] Q. Zou, Z. Zhang, Q. Li, X. Qi, Q. Wang, and S. Wang, “DeepCrack: Learning hierarchical convolutional features for crack detection,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1498–1512, Mar. 2019.
[37] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
[38] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” 2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.1784
[39] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” 2015, arXiv:1511.06434. [Online]. Available: https://arxiv.org/abs/1511.06434
[40] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[41] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[42] S. Jialin Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Proc. NIPS, Montreal, QC, Canada, 2014, pp. 3320–3328.
[44] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proc. CVPR, New York, NY, USA, Jun. 2014, pp. 1717–1724.
[45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, Jun. 2009, pp. 248–255.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[47] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” 2014, arXiv:1412.7062. [Online]. Available: https://arxiv.org/abs/1412.7062
[48] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2015, pp. 234–241.
[49] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. ICCV, Dec. 2015, pp. 1395–1403.
[50] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proc. ICLR, 2016, pp. 1–13.
[51] K. Zhang, Y. Zhang, and H. D. Cheng, “Self-supervised structure learning for crack detection based on cycle-consistent generative adversarial networks,” J. Comput. Civil Eng., vol. 34, no. 3, May 2020, Art. no. 04020004.
[52] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1415–1424.
[53] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 1–8.
[54] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[55] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 Results. Accessed: 2018. [Online]. Available: http://www.pascalnetwork.org/challenges/VOC/voc2011
[56] Y.-C. Tsai and A. Chatterjee, “Comprehensive, quantitative crack detection algorithm performance evaluation system,” J. Comput. Civil Eng., vol. 31, no. 5, Sep. 2017, Art. no. 04017047.
[57] H. Oliveira and P. L. Correia, “CrackIT—An image processing toolbox for crack detection and characterization,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Paris, France, Oct. 2014, pp. 798–802.
[58] H. Li, D. Song, Y. Liu, and B. Li, “Automatic pavement crack detection by multi-scale image fusion,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 6, pp. 2025–2036, Jun. 2019, doi: 10.1109/TITS.2018.2856928.

Kaige Zhang (Member, IEEE) received the B.S.

Yingtao Zhang (Member, IEEE) received the M.S. degree in computer science and the Ph.D. degree in pattern recognition and intelligence system from the Harbin Institute of Technology, Harbin, China, in 2004 and 2010, respectively. She is currently an Associate Professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include pattern recognition, computer vision, and medical image processing.

Heng-Da Cheng (Life Senior Member, IEEE) received the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1985, under the supervision of Prof. K. S. Fu. He is currently a Full Professor with the Department of Computer Science, and an Adjunct Full Professor with the Department of Electrical Engineering, Utah State University, Logan, UT. He is also an Adjunct Professor and a Ph.D. Supervisor with the Harbin Institute of Technology. He is also a Guest Professor with the Institute of Remote Sensing Application, Chinese Academy of Sciences, Wuhan University, and Shantou University, and a Visiting Professor with Northern Jiaotong University, Huazhong Science and Technology University, and Huanan Normal University. He has authored over 350 technical articles. He is also the Co-Editor of the book entitled Pattern Recognition: Algorithms, Architectures, and Applications (World Scientific Publishing Company, 1991). His research interests include image processing, pattern recognition, computer vision, artificial intelligence, medical information processing, fuzzy logic, genetic
degree in electronic engineering from the Harbin algorithms, neural networks, parallel processing, parallel algorithms, and VLSI
Institute of Technology, China, in 2011, the M.S. architectures.
degree in signal and information processing from Dr. Cheng was the General Chair of the Eighth JCIS in 2005, the Ninth
Harbin Engineering University, China, in 2014, and JCIS in 2006, the tenth JCIS in 2007, and the 11th Joint Conference on
the Ph.D. degree in computer science from Utah Information Sciences (JCIS) in 2008. He has served as a Program Committee
State University, USA, in 2019. His research inter- Member and the Session Chair for many conferences, and as a Reviewer for
ests include computer vision, machine learning, many scientific journals and conferences. He has been listed in Who’s Who in
and the applications on intelligent transportation the World, Who’s Who in America, and Who’s Who in Communications and
systems, precision agriculture, and biomedical data Media. He is also an Associate Editor of Pattern Recognition, Information
analytics. Sciences, and New Mathematics and Natural Computation.