Automatic Pixel Level Detection and Measurement of Corrosion Related Damages in Dim Steel Box Girders Using Fusion Attention U Net

Journal of Civil Structural Health Monitoring (2023) 13:199–217
https://doi.org/10.1007/s13349-022-00631-y
ORIGINAL PAPER
Automatic pixel‑level detection and measurement of corrosion‑related

damages in dim steel box girders using Fusion‑Attention‑U‑net
Fei Jiang1 · Youliang Ding1 · Yongsheng Song2 · Fangfang Geng3 · Zhiwen Wang4
Received: 26 May 2022 / Accepted: 8 September 2022 / Published online: 25 September 2022
© Springer-Verlag GmbH Germany, part of Springer Nature 2022
Abstract
To detect corrosion-related damages inside dim steel box girders, an improved U-net, Fusion-Attention-U-net (FAU-net),
is proposed in this paper. A fusion module and a bottleneck-attention module are embedded in FAU-net for aggregating
multi-level features and learning representative information, respectively. To realize this, a database of 300 damage images
is built after data augmentation. Then, the proposed FAU-net is modified, trained, and validated. Based on the selected best
training, the network achieves 98.61% pixel accuracy (PA), 92.73% mean pixel accuracy (MPA), 77.57% mean intersection
over union (MIoU), and 97.52% frequency weighted intersection over union (FWIoU) on the validation set. Subsequently,
the robustness and adaptability of the trained FAU-net are tested and compared with state-of-the-art networks. For a deep
understanding, an ablation study is conducted to learn the contribution of main components in FAU-net. To establish the
relationship between the detected damage pixel area and its actual physical area, photography experiments, and theoretical
analyses are conducted to study the effect of three critical shooting variables: shooting distance, focal length, and shooting
angle. Finally, a theoretical equation linking the pixel and physical areas is derived and further validated using the field-taken
damage images under different shooting cases. The results show that the proposed method substantiates excellent performance
to detect damage at the pixel level and measure damage areas accurately for the current samples.
Keywords Corrosion · Detection and measurement · U-net network · Steel box girder
1 Introduction mildew, and corrosion (referred to as corrosion-related

damages in this paper) are still found inside the steel box
Since the 1990s, the steel box girder has been of great use girder. Unlike other steel structures, the enclosed steel box
in large cable-stayed or suspension bridges in China. As the girder has a dim interior environment, and the uneven illu-
main component, it is directly subjected to vehicle loads mination makes photographed damage features suffer from
and plays a critical role in the bridge operation. For such heavy background noise. These damage photos can be seen
thin-walled steel structures, corrosion has become one of in the following sections. Human-based visual inspection
the main problems affecting their safety and durability. currently supports the decision-making on maintenance
Although dehumidification equipment is installed, ponding, of these damages, which can directly obtain the location
and size of the damaged areas. However, its consistency in
quantitative evaluation and accessibility cannot be guaran-
* Youliang Ding
[email protected] teed considering environmental (complex internal structure
and varied light intensity, etc.) and human (lack of special-
1
Key Laboratory of Concrete and Prestressed Concrete ized knowledge and discrepancies in visual acuity and color
Structures of Ministry of Education, Southeast University, vision, etc.) factors [1, 2]. Furthermore, human-based visual
Nanjing 210096, China
inspection is relatively time-consuming. Therefore, auto-
2
School of Architecture Engineering, Jinling Institute matic inspections based on computers have been recently
of Technology, Nanjing 211169, China
proposed as the development of smarter camera carriers such
3
School of Architecture Engineering, Nanjing Institute as unmanned aerial vehicles (UAVs).
of Technology, Nanjing 211167, China
Most computer-based visual inspections adopt image
4
Shenzhen Express Engineering Consulting Co. Ltd, processing techniques (IPTs) to detect specific structural
Shenzhen 518000, China
123
Vol.:(0123456789)
200 Journal of Civil Structural Health Monitoring (2023) 13:199–217
damage such as steel cracks [1], concrete cracks [3, 4], To address the drawbacks of IPTs, many deep-learning-
concrete spalling [5], road pavement potholes [6], and also based damage detection methods have been developed over
steel corrosion [7, 8]. A significant advantage of IPTs is that the past decade. Among them, convolutional neural networks
almost all surface defects may be identified [2]. A review (CNNs) have been highlighted in image classification, which
on IPT-based corrosion detection of bridge structures was can extract features without the need for manual construction
conducted by Jahanshahi et al. [9]. They summarized a gen- of features or prior knowledge about them. Atha et al. [17]
eralized three-step detection procedure: image segmentation, adopted CNNs with sliding windows to classify corroded or
feature extraction, and classification, before which methods non-corroded areas of steel surfaces. They studied the effect
of image grayscale, median filtering, histogram equaliza- of CNN architectures, sliding window sizes, different color
tion, morphological operations, etc. are usually used for pre- spaces, and grayscale. Ma et al. [18] also used a fine-tuned
processing of the corrosion images. During segmentation, CNN network and the sliding window technique to local-
methods of thresholding, region growing, edge detection, ize the corrosion defect in ship structures. Du et al. [19]
and pattern matching are utilized to isolate the potential developed an improved CNN model with a parallel CNN
corrosion patterns. Thereafter the features of texture, color, architecture to classify the corrosion degree of grounding
shape, intensity, and organization are extracted for the seg- grids. Feng et al. [20] proposed an injurious or noninjuri-
mentation using filters such as Haar wavelets, Canny edge ous defect identification method based on CNN trained by
detection, Histogram of Oriented Gradients. etc. Finally, magnetic flux leakage images instead of the measurement
classification of the segmented patterns is conducted using features. They showed that the proposed network outper-
methods of nearest neighbor, neuro-fuzzy systems, support formed the conventional IPT-based methods. Although the
vector machine, and principle component analysis [10]. sliding window is commonly adopted in the CNN-based
There has been a considerable amount of studies in the area methods to localize damages, its optimal size is hard to be
of IPT-based corrosion detection. Furuta et al. [11] devel- determined since the size of the damage features varies in
oped a decision support system for the assessment of struc- the testing images [21, 22]. To classify and localize multiple
tural corrosion, in which a neural network with hue, satura- objects, Girshick et al. [23] invented the region-based CNN
tion, and value (HSV) color space features was used. Choi (R-CNN), which uses a pre-trained CNN to extract features
and Kim [12] classified five types of corrosion damages in region proposals from selective searches, and a linear
using the hue, saturation, and intensity (HSI) color space regression model and multiple support vector machines
and varimax approaches. By performing statistical analysis (SVMs) to localize and classify the objects, respectively.
on red, green, and blue (RGB) color space, Lee et al. [13] R-CNN outperforms CNN in terms of accuracy. However,
were able to detect tiny rusts on steel bridge coating with it is computationally slower because R-CNN is not trained
a contrasting background. Ghanta et al. [14] used the con- end-to-end, and three different training processes for CNNs,
cepts of pattern recognition and Haar wavelet transform for regressor, and SVMs are required. Therefore, Fast-RCNN
detection of rust in RGB subimages of steel coating bridge [24] and Faster-RCNN [25] are further proposed to speed up
surfaces. Jahanshahi and Masri [15] evaluated the effect of the training and testing efficiency by performing CNN for-
different parameters on the performance of color wavelet- ward propagation on the entire image rather than the region
based texture analysis algorithms for detecting corrosion. proposals and replacing the selective search with a jointly
Their results indicated the wavelet-based features obtained trained region proposal network. Although the above object
from CbCr color channels resulted in better detection per- detection methods can localize damage, they do not have
formance with respect to other color spaces such as RGB sufficient accuracy for the post-measurement of corrosion
and HSI. Bonnin-Pascual and Ortiz [16] proposed two novel damage features.
corrosion detection algorithms, a weak classifier color-based With the advancement of deep learning and computing
corrosion detector (WCCD) and an AdaBoost based corro- resources, as well as the development of databases of labeled
sion detector (ABCD). Both combined weak classifiers to data, image segmentation has reached a new height in terms
achieve better performance. Doubtless the above methods of accuracy [26]. By removing the fully connected layer in
have promoted the development of computer vision-based traditional CNN, Fully Convolutional Network (FCN) [27]
corrosion detection. However, the process of feature extrac- retains the spatial information of an input image, and since
tion and classification of IPT-based methods is tedious and then an FCN-based segmentation trend has emerged, and
laborious due to the limitation of computational power and many improved end-to-end image segmentation models such
feature extraction methods. For detection of the multiple as ParseNet [28], Pyramid Attention Network (PAN) [29],
corrosion-related damages in steel box girders, the general- feature pyramid network (FPN) [30], pyramid scene parsing
ity of IPT-based methods is not guaranteed. Likewise, the network (PSPNet) [31], and DeepLab [32] have been pro-
accuracy and applicability of these methods for corrosion posed. Several studies have been carried out on pixel-level
detection under noisy weak-light condition are doubted. corrosion detection using the above segmentation models
123
Journal of Civil Structural Health Monitoring (2023) 13:199–217 201
[33–35]. However, the accuracy of these deep networks spalling, and efflorescence [40–42]. However, to the best of
commonly relies on the amount and quality of the labeled the author’s knowledge, no comprehensive calibration exper-
corrosion data. To study the open question regarding the iment has been proposed in the field of corrosion detection.
benefit of producing a large, but poorly labeled, dataset ver- Besides, previous calibration studies mainly focused on
sus a small, expertly segmented dataset for semantic seg- the impact of shooting distance and focal length, while the
mentation, Nash et al. [34] trained an FCN model for corro- impact of shooting angle is rarely been reported. Recently, Li
sion segmentation using a large dataset of 250 images with et al. [43] developed novel algorithms for the correction of
segmentation labels by undergraduates and a small dataset distorted crack images caused by different shooting angles.
of just 10 images with segmentation labels by subject mat- However, the correction and measurement processes are not
ter experts, respectively. They claimed that training with a straightforward, and there is still much room for improve-
larger, imperfectly segmented dataset outperforms a very ment in terms of detection speed. Therefore, for practical
small, expertly segmented dataset. However, publicly avail- application, it is of great significance to conduct calibration
able datasets of well-labeled corrosion images, especially experiments for corrosion-related damages in steel box gird-
with corrosion-related damages in steel box girders, have not ers, considering the influence of key photographic variables
been published. On the other hand, it is essential to reduce in a comprehensive manner. It is worth mentioning that pho-
the burden of labeling enough corrosion segmentations for tographic variables such as shooting distance and shooting
training. Therefore, a deep learning method with a small angle can be measured using technologies like laser meas-
training and computational burden and the ability to pre- urement and electronic angle sensors during a real robotic
cisely detect multiple damages at the pixel level is needed. inspection.
Among the deep learning models, an improved FCN, In this paper, a modified U-net, called Fusion-Attention-
called U-net, was developed by Ronneberger et al. [36] U-net (FAU-net) is proposed for the detection of corrosion-
for biomedical image segmentation and achieved a good related damages inside dim steel box girders. For practical
performance when little training data were available. The application, a comprehensive photography experiment is
U-net network has also been utilized to perform semantic conducted to study the relationship between the pixel and
segmentation for corrosion damages. Katsamenis et al. [37] physical areas. The main contributions can be summarized
trained the U-net model for corrosion detection of metal con- as (1) establishment of a novel network for accurate segmen-
structions on a small training set with 116 images. Nguyen tation of corrosion-related damages under noisy weak-light
et al. [38] proposed a U-net-based alternative for automated conditions using a small training dataset and (2) deriva-
corrosion detection using a micro aerial vehicle. The U-net tion of a theoretical equation for the physical damage area
was trained by less than 40 images and did inference at a with respect to the pixel area, taking into account the key
speed of 12 fps using a single GPU. Shi et al. [39] studied photographic variables comprehensively. The remaining
the effect of two dataset building methods, namely squash- of this paper is described as follows. Section 2 introduces
ing segmentation and cropping segmentation, on the per- the overall architecture of the proposed FAU-net and the
formance of a VGG-U-net for corrosion segmentation in detailed operations in it. Section 3 presents the establishment
steel bridges. U-net has been successful with fewer train- processes of the database. Section 4 describes the training,
ing images and better performance than traditional CNNs. validation, and testing processes of FAU-net. Meanwhile, the
However, the architecture of U-net still has great space for performance of the trained FAU-net is compared with that of
improvement especially for the damage detection inside dim state-of-the-art networks, U-net and FCN. Furthermore, An
steel box girders, because weak and uneven illumination ablation study is conducted to investigate the effectiveness
tends to cause a significant amount of noise appearing on of the main components in FAU-net. In Sect. 5, a theoretical
the images of complex surface textures. equation describing the relationship between the pixel and
In the task of damage detection and measurement, a well- physical damage areas is derived, fitted, and validated. Sec-
trained network only fulfills the aim of damage localization tion 6 concludes this paper.
and segmentation. However, the segmented images cannot
be used directly to characterize the corrosion range because
their corresponding physical areas are closely related to pho- 2 Fusion‑Attention‑U‑net
tographic variables, such as shooting distance, focal length,
and shooting angle. To further quantify the extent of corro- 2.1 FAU‑net architecture overview
sion using segmented digital images, it is essential to learn
the relationship between the physical damage area and the FAU-net’s architecture follows an encoder (left side) and
segmented pixel numbers. Several studies have performed decoder structure (right side), as illustrated in Fig. 1. The
calibration experiments to investigate the actual size repre- encoder can be thought of as a traditional classification net-
sented by individual pixels for measuring concrete crack, work, in which lies a stack of common convolution and max
123
Fig. 1 Architecture of FAU-net. Each cuboid corresponds to a multi-channel feature map. Conv, Bn, Trans-conv, and Up are abbreviations for
convolution, batch normalization, transposed convolution, and bilinear interpolation upsampling
pooling operation blocks. These operations are intended to input variables to zero mean and unit variance, which is a
capture and compress the context in an input image into mul- way of accelerating network training as well as improving
tiple-dimensional feature representations. Then the decoder network stability. Some apply the normalization before the
decodes this learned information back to the original image nonlinearity and some after. Herein we follow the former
dimension to enable precise localization using transposed option.
convolution and concatenation operations. There are two Activation functions make a neural network can do non-
main contributions of the FAU-net. One is the fusion mod- linear prediction. As a commonly adopted activation func-
ule between the encoder and decoder sides, and the other is tion, Rectified Linear Unit (ReLU) is expressed as follows:
the Convolutional Block Attention Module (CBAM) at the
central bottleneck. The former aims to fuse the multilevel
ReLU(x) = max(0, x). (1)
and multiscale features to make the network more robust The simplicity of Eq. (1) makes ReLU faster than the
for damage classification and localization under different Sigmoid and Tangens hyperbolicus (Tanh) activation
conditions, while the latter tries to use trained spatial-wise functions which are computationally more expensive. In
and channel-wise weights to suppress the misdetection of addition, there is no issue of Vanishing Gradient problem
background noise. The detailed operations applied in the when using ReLU so the network prediction accuracy and
network architecture are outlined as follows. efficiency are at their peak [46].
Pooling layers are usually added after convolution lay-
2.2 Basic operations ers (specifically after the ReLU functions) to perform
down-sampling that reduces the spatial dimensions of the
Convolutions are designed to extract features from the input parameter space as well as the computation performed.
image using a kernel that is a matrix of learnable weight Moreover, pooling helps control model overfitting and
factors. This kernel slides over the image matrix at a given makes the model more robust to variations in the position
stride, performing a dot product with the covered matrix of the features. In this study, max pooling [47] is adopted,
data as a single output pixel. Compared with the fully con- which draws the maximum from each patch of the feature
nected layer, convolution operation can make spatial output map covered by pooling filters with predefined sizes.
maps, which makes it a natural choice for dense problems Transposed convolution does the opposite of the nor-
(per-pixel predictions in an image) like semantic segmenta- mal convolution. It up-samples a small feature map into
tion [44]. a higher resolution one with learnable parameters. In this
To avoid significant variation in the dynamic ranges of study, both filter size and stride are set to 2, which doubles
the inputs, batch normalization [45] is adopted to scale the the dimensions of the input feature map.
123
Softmax function is implemented in the U-net network while the channel-wise features indicate the semantic infor-
just before the output layer. It converts inputs into multi- mation about the damage object. To make the bottleneck
class probabilities that sum to one and is defined as: learn and focus on more representative information, an atten-
tion module, CBAM proposed by Woo et al. [50], is placed
ezi
Softmax(zi ) = ∑K , (2) after the convolution operations in the bottleneck part.
j=1 ezj CBAM is a sequential connection of the channel attention
module and spatial attention module, which provide insight
where the variable K is the dimension of the input vector z, into what and where is meaningful for damage segmentation,
indicating there are K classes; zi is the element in the input respectively. To compute the channel attention module, both
vector. average-pooled and max-pooled features are used. Given the
feature map F ∈ R1024×16×16 as input, the 1D channel atten-
2.3 Fusion module tion map Mc(F) ∈ R1024×1×1 can be inferred as:
For pixel-level detection of corrosion damages, the semantic Mc (F) = 𝜎(MLP(AvgPool(F)) + MLP(MaxPool(F)))
( ( ) ( ))
information for damage classification and the localization = 𝜎 W1 W0 (AvgPool(F)) + W1 W0 (AvgPool(F)) ,
information for damage localization are both essential. The (3)
low-level encoder features (feature maps closer to the input where σ denotes the sigmoid function; MLP represents a
layer at the encoder side) have strong localization informa- multi-layer perceptron with one hidden layer; W0 ∈ R128×1024
tion, but they still lack semantic information. Conversely, the and W1 ∈ R1024×128 are MLP weights.
high-level decoder features have rich semantic information After obtaining the channel attention map, the channel
but weak localization information. This contradiction could attention process can be described as:
have a far-reaching effect on the segmentation results. A
natural way to overcome this problem is to incorporate more F � =Mc (F) ⊗ F, (4)
semantic information into the low-level features and locali-
where ⊗ denotes element-wise multiplication. During mul-
zation information into the high-level features. Although the
tiplication, the attention values are broadcasted (copied)
skip-connection operation adopted in U-net can refine the
accordingly: channel attention values are broadcasted along
segmentation results to a certain extent, the degree of feature
the spatial dimension, and vice versa.
aggregation is low [48].
The spatial attention module, following the channel
In FAU-net, a fusion model is proposed to improve the
module, aggregate the channel information of the feature
aggregation of multilevel features. Inspired by FPN, we
map using max-pooling and average-pooling. The pooled
upsample the high-level feature map in the bottleneck by
features are then concatenated and convolved by a standard
a factor of two, using bilinear interpolation for simplicity.
convolutional layer. Accordingly, the 2D spatial attention
The upsampled map is then merged with the low-level fea-
map Ms(F′) ∈ R1×16×16 can be inferred as:
ture map from the encoder side by element-wise addition.
( ) ( ([ ( ) ( )]))
Before the addition, the encoder low-level map is processed Ms F � = 𝜎 f 7×7 AvgPool F � ;MaxPool F � , (5)
using a 1 × 1 convolutional layer to adjust the channel dimen-
sion. This process is iterated to aggregate the multi-level where f7×7 represents a convolutional operation with the fil-
features. Finally, a 3 × 3 convolutional layer is appended to ter size of 7 × 7.
each merged map to generate the final maps, which aims to Therefore, the final spatial attention process can be
reduce the aliasing effect of up-sampling. described as:
( )
2.4 Attention module
F �� = Ms F � ⊗ F � . (6)
The central bottleneck between the encoder and decoder

provides the most powerful and discriminative semantic
features [49]. In FAU-net, the bottleneck consists of two 3 Establishment of a database
convolutional blocks that collect high-level representative for corrosion‑related damages
semantic features from the encoder side and pass them to
the decoder side. Therefore, the bottleneck of FAU-net has To establish a database containing corrosion, mildew, and
a profound impact on the final damage segmentation results. ponding images in steel box girders, 77 raw images are col-
The feature map in the bottleneck generally has spatial lected from the routine inspections of three long-span river
and channel dimensions. The spatial-wise features represent or sea crossing bridges in China. The lighting condition in
the localization information related to the damage object, the steel girder is poor and nonuniform. Therefore, each
123
image has different photographic lighting conditions. Due to it is still much smaller than that in previous studies on deep-
differences in camera equipment and post-storage strategies, learning-based detection methods, such as [41].
these images are also with different pixel resolutions, rang- To save computational resources and time, the database
ing from 200 × 195 to 3072 × 4608. In addition, the images images and their corresponding ground truths are resized
vary in their background colors because of the color devia- to a 256 × 256 pixel resolution. It has to be noted that the
tion in those anti-corrosion coatings. area out of the original image is simply masked with black
In the annotation process, the ground truths of the col- during the resizing process. The reader may try some other
lected images are manually labeled using an open annotation procedures like reflection, which will not be discussed due to
tool, LabelMe [51]. Based on LabelMe’s polygons tool, we limited space. Figure 2 shows examples of the built images
extract different corrosion-related damages by applying con- and their ground truths with colors. Before being pumped
trol points along the damage’s boundary. Errors in the place- into the training, these ground truths should be converted to
ment of boundary points are considered inevitable because a one-hot encoded tensor of dimension 4 × 256 × 256, where
the transition from the damage to the background is gradual 4 indicates there are four categories. A total of 300 images
rather than a cliff change. However, such errors are thought are manually divided into three subsets: the training set, the
of as being minor in engineering practice. After specifying validation set, and the testing set using a split ratio of 6: 2:
the damage categories, these labeled images are saved in 2. While division, scrutiny is conducted to ensure the dam-
PNG format with three channels. age samples from different sets do not overlap each other to
77 images still seem to be limited, although FAU-net is avoid overestimation of the network prediction performance.
designed to work with very few annotated images. Data aug- The detailed numbers of images in the three sets are listed
mentation is a data-space solution to the problem of limited in Table 1. To clarify, mildew will eventually develop into
data, which teaches the network the desired invariance and corrosion damage over time, and thus sporadic corrosion
robustness properties [36, 52]. Therefore, in addition to tech-
niques like batch normalization, data augmentation is imple-
mented herein to make full use of the collected images and Table 1 Number of images in the training, validation, and testing sets
approach the problem of overfitting. Image augmentation Damage type Training Validation Testing
algorithms applied include flipping, random rotation, grid
Corrosion 59 19 20
shuffle, and padding. After the augmentation procedure, the
Mildew 27 17 16
total number of images increases to 300. It is worth mention-
Ponding 63 18 19
ing that although the number of images is enhanced herein,
Corrosion + Mildew 31 6 5
Fig. 2 Examples of damage

images and their ground truths
123
often appears in the mildew aggregation area (Fig. 2), which (MPA), mean intersection over union (MIoU), and frequency
is denoted as multiple damage images as “corrosion + mil- weighted intersection over union (FWIoU) [41, 56]. These
dew” in Table 1. metrics are recorded every epoch both for training and vali-
dation. In another word, the metric values are average metric
values of batches in that epoch.
4 Training and validation of FAU‑net model Figure 3 shows the recorded training and validation
losses over epochs under different initial learning rates,
4.1 Model optimization and learning rate and the effect of learning rate decay is illustrated in each
subplot. Besides, other metric values of PA, MPA, MIoU,
It is widely known that the learning rate is the single most and FWIoU under different learning rate configurations are
important hyperparameter to configure our model [53]. plotted in Fig. 4. Note that LRD in these figures represents
Unfortunately, it is impossible to obtain the optimal learn- the short for the learning rate decay. It can be observed that
ing rate based on a given model dataset [54]. This difficulty both the training and validation losses decrease quickly at
in choosing better learning rates makes adaptive learning the beginning and then converge around 450 epochs. There-
rate methods popular. Adam optimizer is one of the widely fore, the stable metric values on the validation set at epoch
adopted adaptive learning rate methods. It combines the 450 are listed in Table 2 for quantitative assessment of the
benefits of the two other extensions of stochastic gradient trained model.
descent, namely AdaGrad [55] and RMSProp. Therefore, Based on the above results, the choice for the better
Adam is suitable for non-stationary objectives and problems learning rate configuration is indicated as follows. When
with very noisy and/or sparse gradients. lr = 1 × 10–6, the model loss with learning rate decay is
In this study, the Adam algorithm is used to optimize the significantly larger than that without the decay schedule.
FAU-net weights. A grid search is conducted to study the The metrics in Fig. 4 and Table 2 all show that the model
relationship between learning rates and model performance, without learning rate decay outperforms that with the decay
which adopts four initial learning rates for the Adam algo- schedule. These results suggest that too small learning rates
rithm, namely lr = 1 × 10–6, 1 × 10–5, 1 × 10–4 and 1 × 10–3. incurred by the learning rate decay seem to get the training
Besides, the exponential decay rates of the first and second stuck on a suboptimal solution. In addition, for the cases of
moment estimates are β1 = 0.9 and β2 = 0.999, respectively. lr = 1 × 10–6, the convergence speed of the model is slower
Though the Adam algorithm can adapt its learning rate over than that when a larger learning rate is adopted. However,
the training process, it has been pointed out that learning rate a larger learning rate (lr = 1 × 10–3) incurs larger weight
schedules may substantially improve Adam's performance. updates, which leads to the result that the model perfor-
Therefore, a time-based decay schedule is adopted or not mance oscillates over the training epoch, as illustrated in
for the initial learning rate to study the difference. Herein, Fig. 3d. When lr = 1 × 10–3, the evaluation metrics meet a
exponential decay is implemented as follows: sharp mutation near epoch 150, which may be owing to the
adaption of the learning rate by the Adam algorithm. When
lrnew = lr × 𝛾 epo , (7) lr = 1 × 10–4, the model performance is closer to that with
lr = 1 × 10–5 but higher metrics are revealed both in Fig. 4
where lrnew is the new learning rate; lr is the initial learning and Table 2. Based on the above analysis, the learning rate
rate; γ is the multiplication factor used for updating and is of 1 × 10–4 is concluded as the better learning rate, and the
assigned to 0.99; epo is the current epoch number. decay schedule shows little improvement on the metrics
listed in Table 2, but indeed increases the stability as shown
4.2 Training and validation results in Fig. 4.
The model is trained to optimize the model weight and bias 4.3 Testing the trained FAU‑net
parameters, and then the performance of the trained model
is validated on the validation set. All the tasks described are In this study, the trained model using the optimal learn-
performed on a workstation (CPU: double Intel® Xeon® ing rate of 1 × 10–4 without the decay schedule is studied
CPU E5-2680 v4 @ 2.40 GHz, RAM: 64 GB, GPU: ASUS to examine its performance. The remaining 60 images that
GeForce RTX 2060 D6 12G). are not used for training and validation processes are used
Given the training configuration of the original U-net to test.
[36], FAU-net is trained with a batch size of one image. Figures 5, 6, and 7 show typical examples of the predicted
During training and validation processes, several metrics damage by the trained FAU-net. In the figures, the objects in
can be used to evaluate the model performance. They are each row are original input images, their ground truths, pre-
cross entropy loss, pixel accuracy (PA), mean pixel accuracy dicted damage pixels, and corresponding evaluation metrics.
123
Fig. 3 Training and valida- 1.4 1.4

Training loss (without LRD) Training loss (without LRD)
tion losses over epochs: a
1.3 Validation loss (without LRD) 1.3 Validation loss (without LRD)
lr = 1 × 10–6, b lr = 1 × 10–5, c Training loss (with LRD) Training loss (with LRD)
lr = 1 × 10–4, and d lr = 1 × 10–3 1.2
Validation loss (with LRD)
1.2
Validation loss (with LRD)
1.1 1.1
Loss
Loss
1.0 1.0
0.9 0.9
0.8 0.8
0.7 0.7
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(a) (b)
1.2 0.95
Training loss (without LRD) Training loss (without LRD)
Validation loss (without LRD) Validation loss (without LRD)
1.1 Training loss (with LRD) Training loss (with LRD)
0.90
Validation loss (with LRD) Validation loss (with LRD)
1.0
0.85
Loss
Loss
0.9
0.80
0.8
0.75
0.7
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(c) (d)
Among the three damage categories, the prediction result color to mildew are correctly identified as the background.
of the ponding has the highest metric values. This is on the In addition, the corrosion damage in a weak light condition
one hand because the simple geometric and spatial pond- is successfully detected in Fig. 5d.
ing features promote the model learning, and on the other Some typical images related to mildew damage under
hand because the ponding area is vast enough that a small various conditions are presented in Fig. 6. As illustrated
classification error does not significantly reduce the metrics. in Fig. 6a, a crowd of black mildew spots is successfully
Conversely, the metrics of the mildew prediction results are detected. From the predicted result, we can conclude that
relatively the lowest. This is mostly due to the fact that the the trained FAU-net can recognize tiny mildew features
numerous numbers and complicated features make human with an equivalent size of three pixels. For the detection
labeling mistakes unavoidable throughout the annotation of mildew along with black color markers, FAU-net also
process. That is to say, inevitable initial errors exist in the essentially detects the mildew damages though the MIoU
ground truths, thereby reducing the model’s performance in value is not very high. To evaluate the model performance
predicting mildew damages. in detecting light-colored mildew, a cluster of light-colored
Figure 5 illustrates typical images with corrosion dam- mildews is generally detected in Fig. 6c, and the needle-like
age under different conditions. In Fig. 5a, a corrosion image edges seem to be better revealed in the predicted results.
under the common condition is used to test the trained In Fig. 6d, mildews along with corrosions are satisfactorily
model, and FAU-net provides a good prediction result for recognized, which demonstrates the model’s robustness to
it. Figure 5b is used to study the model's effectiveness with multiple-damage detection.
local reflected light. Most of the corrosion is satisfacto- Typical images with ponding are shown in Fig. 7. The
rily detected, even though a small number of pixels in the ponding pixels under normal conditions are effectively
light irradiation area fail to be recognized. To evaluate our extracted from the background, as shown in Fig. 7a. The
model’s robustness to surface interference, the image with influence of water reflection and underwater sediment
corrosion on a contaminated surface is fed into the trained is studied by testing the trained model using Fig. 7b, c,
FAU-net, as shown in Fig. 5c. Thanks to the fusion module, respectively. Furthermore, the ponding under weak light
some stains on the right side of the image with a similar conditions is also detected in Fig. 7d. The results generally
123
Fig. 4 Metrics for model perfor- 100% 100%
mance over epochs: a PA, b PA

with LRD, c MPA, d MPA with 90% 90%
LRD, e MIoU, f MIoU with
LRD, g FWIoU, and h FWIoU 80% 80%
with LRD
PA
PA
70% 70%
lr = 1e-6 lr = 1e-6
60% lr = 1e-5 60% lr = 1e-5
lr = 1e-4 lr = 1e-4
lr = 1e-3 lr = 1e-3
50% 50%
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(a) (b)
100% 100%
90% 90%
80% 80%
MPA
MPA
70% 70%
60% 60%
lr = 1e-6 lr = 1e-6
lr = 1e-5 lr = 1e-5
50% lr = 1e-4 50% lr = 1e-4
lr = 1e-3 lr = 1e-3
40% 40%
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(c) (d)
90% 90%
80% 80%
70% 70%
60% 60%
MIoU
MIoU
50% 50%
40% 40%
lr = 1e-6 lr = 1e-6
30% 30%
lr = 1e-5 lr = 1e-5
20% lr = 1e-4 20% lr = 1e-4
lr = 1e-3 lr = 1e-3
10% 10%
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(e) (f)
100% 100%
90% 90%
FWIoU
FWIoU
80% 80%
lr = 1e-6 lr = 1e-6
70% 70%
lr = 1e-5 lr = 1e-5
lr = 1e-4 lr = 1e-4
lr = 1e-3 lr = 1e-3
60% 60%
0 100 200 300 400 500 0 100 200 300 400 500
Epoch Epoch
(g) (h)
123
Table 2 Validation results for Learning rate lr lr decay Stable PA (%) Stable MPA (%) Stable MIoU (%) Stable
FAU-net FWIoU
(%)
1e−6 No 97.07 87.04 46.86 95.48

1e−5 No 98.48 92.57 77.45 97.37
1e−4 No 98.61 92.73 77.57 97.52
1e−3 No 98.34 91.00 75.35 97.11
1e−6 Yes 94.47 84.49 39.26 92.64
1e−5 Yes 97.54 87.88 62.24 96.36
1e−4 Yes 98.48 92.49 77.16 97.34
1e−3 Yes 97.76 88.40 72.45 96.41
demonstrate the model's effectiveness, though minor errors 4.4 Comparison with state‑of‑art networks
occur when recognizing the ponding edges.
Overall, despite minor errors, the trained FAU-net To compare the proposed FAU-net model with state-of-the-
model performs well in detecting multiple corrosion- art segmentation networks, the built training set is used to
related damages in steel box girders. The minor errors may train the conventional U-net and FCN-8 s models. FCN is
be mainly owing to the small training database. Therefore, the pioneering work of the most successful deep learning
further efforts can be dedicated to improving and general- technology for semantic segmentation. The core architec-
izing the proposed model by enlarging the database with tural ideas and implementations of FCN and U-net are very
more images under various field conditions. similar. The minor difference between the two is that U-net
has a finer up-sampling process. For one, U-net is having
Fig. 5 Examples of predicted

corrosion pixels by the trained
FAU-net: a normal condition,
b local reflection, c contami-
nated surface, and d weak light
condition
123

mildew pixels by the trained
FAU-net: a normal condition,
b interference of same color
marks, c light-colored mildew,
and d multiple damages
more feature channels in the up-sampling part. For another, mildew and ponding, the segmentation results by FAU-net
rather than being simply added as in FCN, the same-level and U-net are very close, as shown in Fig. 8c, d. Neverthe-
down-sampled feature map and up-sampled feature map are less, the performance of FAU-net is a little better in terms of
concatenated in U-net. segmentation details and the metrics. This may be attributed
The U-net here has the same backbone as the proposed to the aggregation of semantic and localization information
FAU-net. For FCN, ResNet-101 is used to extract the dam- by the fusion module.
age features. U-net and FCN are trained using the same Although the example images in Fig. 8 show better
learning rate schedules as FAU-net. Besides, the same batch performance of FAU-net, the comparison seems isolated,
size and optimizer are used. After the training process, the static, and one-sided. To evaluate the overall performance
optimal learning schedule is obtained as 1 × 10–4 without the of the three networks, typical accuracy metrics on the test
decay, which is the same as FAU-net. set are calculated and summarized in Table 3, along with
A comparison of segmentation results for representative the average testing speed, (frames per second (fps)). In addi-
damage images is shown in Fig. 8. For the listed subplots, tion to the four metrics used in the validation phase, preci-
the metrics of the proposed FAU-net outperform that of sion, recall, and F1-score are further evaluated given the
U-net and FCN to a certain extent. FCN can roughly rec- limited dataset used. It can be observed the accuracy of the
ognize the damage pixels, but the segmentation results proposed FAU-net outperforms U-net and FCN on the test
are not fine enough due to the lack of enough localization set. However, FAU-net obtains the slowest testing speed for
information. Figure 8a, b show prediction results for corro- its relatively complex architectures. Therefore, it is like a
sion damages with complex background noise. Obviously, double-edged sword. Given the small training set and cur-
FAU-net equipped with the attention module provides the rent limitations of GPU-based computation, for practical
best inference while U-net and FCN make some mistakes usage, further validations should be performed on FAU-net
in the detection of marks and stains. In the prediction of using a more comprehensive training database. Despite the
123

ponding pixels by the trained
FAU-net: a normal condition, b
water reflection, c underwater
sediment interference, and d
weak light condition
above-mentioned limitations, the proposed FAU-net per- 5 Measurement of physical damage areas
forms well on the current database. using the trained FAU‑net
5.1 Photography experiment introduction

4.5 Ablation study
To compute the physical areas of the detected damages, pho-
To investigate the contribution of main components in FAU- tography experiments are conducted to study the relation-
net, we increase each essential component gradually on the ship between the pixel area (number of detected pixels) and
basis of the conventional U-net (referred as baseline). We the physical damage area. Three photography parameters
compare the performance of different components with affecting the relationship are selected as the critical vari-
the baseline in Table 4. As shown in Table 4, the proposed ables, including the shooting distance d, the focal length f,
fusion module and attention module improve the MIoU of and the shooting angle θ. In this study, θ is defined as the
the baseline by 2.86% and 3.69%, respectively. The whole angle between the target normal and the axis of the lens.
FAU-net architecture achieves the best performance of Besides, the sign of θ is prescribed to be positive when
80.11%. This ablation study demonstrates the effectiveness shooting upwards, and negative when shooting downwards.
of the proposed fusion and attention modules in FAU-net. A photography platform is established based on a Nikon
However, considering both prediction accuracy and speed, D3200 camera with an 18–55 mm lens, as shown in Fig. 9. A
the baseline with the attention module is more promising. designed target is placed on the wall with a predefined target
height. Based on the trigonometric relationship, different
combinations of d, f, and θ can be obtained by adjusting the
vertical distance and the lens height, so as to get target photo
samples under different experimental cases. As presented in
123
Fig. 8 Comparison of prediction results by FAU-net and the state-of-art networks
Fig. 10a, the designed target has five regions colored differ- and physical areas based on the theoretical derivation and
ently. Each region is a square with a side length of 80 mm. parameter fitting.
The purpose of having five regions is to get more samples
at each shot, which is expected to accelerate the experiment 5.2 Derivation of relation equation
process. In addition, these regions are designed to be scat-
tered in different spatial locations in order to improve the To perform theoretical derivation of the relation equation,
generality of the established relationship. During the experi- the effect of the shooting distance d, the focal length f, and
ment, the photographed targets are resized to the 256 × 256 the shooting angle θ should be elaborated. In this subsec-
pixel resolution. Figure 10b shows the photographed tar- tion, the influence of d and f on the physical size of a single
get when d = 1,600 mm, f = 40 mm, and θ = 24°. A python
code is developed to process the photographed target and
obtain the pixel total number and coordinates in different Table 4 Results of the ablation study
color regions. The processed target and the pixel coordinate
U-net Fusion module Attention MIoU (%) Speed (fps)
system are detailed in Fig. 10c. Note that the origin of the
module
pixel coordinate system is set to coincide with the image
center. Therefore, through obtaining pixel samples under dif- ✓ 73.30 44
ferent combinations of photography variables, it is expected ✓ ✓ 76.16 19
to establish a generalized relation equation between the pixel ✓ ✓ 76.99 43
✓ ✓ ✓ 80.11 18
Table 3 A comparison of the test metrics of FAU-net and the state-of-art networks
Model Precision (%) Recall (%) F1-score (%) PA (%) MPA (%) MIoU (%) FWIoU (%) Speed (fps)
FAU-net 90.81 86.31 91.89 97.85 90.81 80.11 96.10 18

U-net 90.58 79.02 91.67 97.74 90.58 73.30 96.02 44
FCN 82.32 83.40 84.90 96.44 82.32 72.64 93.50 24
123
(unit: mm), where n is the number of pixels located at the

side of the yellow region. Figure 11a, b show the change of
L against d and f, respectively. According to their trends, the
scatters in Fig. 11a, b are fitted using functions L = k1d + b1
and L = k2f −λ, respectively. The fitting results are shown
by solid lines in Fig. 11. It can be observed that the two
functions provide good fits for the data points. Based on the
above analysis, it can be concluded that the physical size L
increases linearly with the shooting distance d and decreases
exponentially with the focal length f.
The shooting angle θ between the target normal and
the lens axis causes photographed targets to be stretched
or shrunk. This subsection is devoted to the impact of the
shooting angle θ and the derivation of a theoretical relation
equation. Figure 12 presents a schematic of the target imag-
Fig. 9 An illustration of the photography experiment ing with a shooting angle θ. In the photographed target, the
pixel is studied. After setting θ = 0, 25 photographed targets red-colored pixel with a side length of one pixel has a cor-
are obtained by changing the values of d and f. There is no responding physical size L, which is marked in orange in
difference between the photographed five regions since the the inclined target side view. Due to the existence of θ, the
shooting angle is zero here. Therefore, only the pixel sam- shooting distance to each point in the target plane is not the
ples from the yellow region are adopted. In this study, the same and is related to the shooting angle θ and the distance
physical size L of a single pixel can be calculated as 80/n H that is the projected distance from the point to the neutral
Fig. 10 Illustrations of the

target: a designed target, b pho-
tographed target (d = 1,600 mm,
f = 40 mm, θ = 24°), and c
processed target
Fig. 11 Influence of d and f on the physical size of a single pixel: a shooting distance d and b focal length f
123
Fig. 12 Schematic of target

imaging with a shooting angle θ
�
axis j. In addition, in the process of photography, what the ⎡ (kd + b)2 + 2k(f )−λ (kd + b)2 tan 𝜃 hij ⎤
sensor acquires is actually the projection of the target as � ⎢ ij∈D
⎥
A= Aij = (f )−2λ ⎢ � ⎥∕ cos 𝜃,
shown by the purple projection line in Fig. 12. Therefore, the ⎢ +k2 (f )−2λ (kd + b)2 (tan 𝜃)2 h 2 ⎥
ij∈D
⎢ ij ⎥
red pixel is directly linked with the projected physical size ⎣ ij∈D ⎦
Lp and a shooting distance of d + H⋅tanθ (θ is negative herein (13)
for shooting downwards). According to the studied effect of where D denotes the set of detected damage pixel coordi-
the shooting distance d and the focal length f, the projected nates. From Eqs. (10) and (13), the shooting angle θ makes
physical length Lp can be calculated as: the pixel position a critical variable to the physical length
calculation, which is a big difference from the effect of the
Lp = (f )−λ (k(d + H tan 𝜃) + b), (8) shooting distance d and the focal length f. The two summa-
tions Σhij and Σhij2 in Eq. (13) are defined as position vari-
where λ, k, and b are parameters to be fitted. The projected
ables in the measurement of corrosion-related damages in
distance H can be approximated as:
steel box girders. Overall, Eq. (13) describes the relationship
H = (f )−λ (kd + b)h, (9) between the pixel and physical areas.
where h is the vertical coordinate of the red pixel of interest 5.3 Fitting of the equation parameters
in the ij coordinate system of the photographed target. By
substituting Eq. 9 into Eq. 8, the projected physical length Model fitting is conducted to estimate the parameters k, b,
Lp can be rewritten as: and λ of Eq. (13). Samples used for fitting can be obtained
( ( ) ) through the designed photographed experiment. As shown
Lp = (f )−λ k d + (f )−λ (kd + b)h tan 𝜃 + b . (10) by Eq. (13), the shooting distance d, the focal length f, and
the shooting angle θ are selected as variables. The domain
Based on Eq. (10), the vertical (i direction) and trans- of each variable herein is determined to be the shooting dis-
verse (j direction) physical sizes of the single pixel with tance of 1,200 − 2,000 mm, the focal length of 35 − 55 mm,
coordinates (i, j) can be obtained as: and the shooting angle of − 30° − 30°. The Latin-Hypercube
sampling (LHS) [57] is adopted to generate random and
{ ( ( ) )
Lv,ij = (f )−λ k d + (f )−λ (kd + b)h tan 𝜃 + b ∕ cos 𝜃 evenly sample points in the above domains for the variables
( ( ) ) , of interest. Finally, a total of 50 image samples of the target
Lh,ij = (f )−λ k d + (f )−λ (kd + b)h tan 𝜃 + b
shown in Fig. 10 are collected from the experiment.
(11)
Since there are five regions in each image sample, 250
where Lv and Lh are the vertical and transverse physical samples (available at https://g ithub.c om/F
eiJia ng199 5/F
AU-
sizes, respectively. Consequently, the physical area of the net/tree/main/Data%20for%20fitting) are actually used to fit
single red pixel can be derived as: the equation parameters. The least-squares method is used
[ to find the best fit for the set of sample points by minimizing
Aij = Lv,ij ⋅ Lh,ij = (f )−2λ (kd + b)2 + 2k(f )−λ
( )2 ] (12) the sum of the offsets of samples from their predicted val-
(kd + b)2 hij tan 𝜃 + k2 (f )−2λ (kd + b)2 hij tan 𝜃 ∕ cos 𝜃. ues. After fitting analysis, the suitable values for the param-
eters k, b, and λ are obtained as 0.05176562076156766,
By summing the detected damage pixels, the physical − 3.5063862267014057, and 0.9631068924383132,
damage area A can be calculated as: respectively.
123
5.4 Measurement method and its validation The damage areas in these images are calculated according
to the flowchart in Fig. 13. It has to be clarified that the large
In the proposed measurement method, the trained FAU-net is number of mildew spots makes the manual measurement
used to provide a pixel-level detection for the taken damage of all range laborious, and in addition, because of the wide
image. Therefore, the two critical variables Σhij and Σhij2 distribution of ponding, there is no sufficient shooting dis-
can be obtained after setting the ij coordinate system for tance in the steel box girder to frame all the ponding areas.
the detected damage image. Afterwards, the physical dam- For these reasons, in this validation, only a certain range of
age area can be calculated by substituting the shooting and the three damages is selected for measurement, and those
position variables into Eq. (13). The detailed flowchart for beyond the range are masked using black. Examples of the
detecting and measuring the corrosion-related damages in calculated damage areas are illustrated in Fig. 14, and the
the steel box girder is summarized in Fig. 13. To validate the details of the shooting cases and their detection results are
proposed measurement method, images are taken with dif- available at https://github.com/FeiJiang1995/FAU-net/tree/
ferent shooting variables at three sites in a steel box girder. main/Measurement%20results, where Ag is the manually
Fig. 13 Flowchart for detec-

tion and measurement of the
corrosion-related damages in
the steel box girder

damage pixels and measured
damage areas
123
measured actual damage area. It can be observed that FAU- the measurement error for small objects is relatively higher,
net provides satisfactory detection results of the damage and a finer image resolution is proposed to overcome it.
pixels. In addition, the highest measurement error is found Although the test results by FAU-net are encouraging,
for mildew, followed by corrosion, and the lowest for pond- given the small database used, further validation should be
ing. The main reason for this trend may be that the deri- conducted to study the effectiveness of FAU-net for such
vation from Eq. (12) to Eq. (13) assumes that the number detection tasks inside dim steel box girders. Furthermore,
of damage pixels is large enough for the cumulative sum optimization studies are needed to reduce network redun-
to be approximated as the actual damage area. However, dancy and increase detection speed to realize the real-time
for small target detection, the single pixel here is relatively processing capability of the network.
large, meaning that the total number of damage pixels may
be insufficient. Therefore, for the detection of small targets, Acknowledgements The authors appreciate the support of the Dis-
tinguished Young Scientists of Jiangsu Province [Grant Number
an increase in picture resolution should be considered to BK20190013], the National Natural Science Foundation of China
reduce the measurement error. [Grant Number 51978154], and the Jiangsu Natural Science Founda-
tion [Grant Number BK20211003].
6 Conclusions Declarations
This paper proposed an improved U-net, called FAU-net, Conflict of interest The authors declare that they have no known com-
peting financial interests or personal relationships that could have ap-
for detecting three types of corrosion-related damages peared to influence the work reported in this paper.
inside dim steel box girders. 77 raw images are collected
from the routine inspections of three long-span river or
sea crossing bridges in China. The date augmentation is
implemented to approach the problem of overfitting and References
increase the number of images to 300. These images are
then resized to a 256 × 256 pixel resolution to save compu- 1. Yeum CM, Dyke SJ (2015) Vision-based automated crack detec-
tion for bridge inspection. Comput Civ Infrastruct Eng 30:759–
tational resources and time. The 300 images are manually
770. https://doi.org/10.1111/mice.12141
divided into three sets: the training set (180 images), the 2. Cha Y-J, Choi W, Büyüköztürk O (2017) Deep learning-based
validation set (60 images), and the test set (60 images). crack damage detection using convolutional neural networks.
Adam algorithm is used to optimize FAU-net weights. Comput Civ Infrastruct Eng 32:361–378. https://doi.org/10.1111/
mice.12263
To find the best training model, different learning rate
3. Abdel-Qader I, Abudayyeh O, Kelly ME (2003) Analysis of
schedules are studied, and the results indicate that the edge-detection techniques for crack identification in bridges. J
optimal learning rate is 1 × 10–4 and the decay schedule Comput Civ Eng 17:255–263. https://doi.org/10.1061/(ASCE)
shows little improvement in the model performance under 0887-3801(2003)17:4(255)
4. Nishikawa T, Yoshida J, Sugiyama T, Fujino Y (2012) Concrete
the current network configuration. Based on the best train-
crack detection by multiple sequential image filtering. Comput
ing, FAU-net achieves the stable PA of 98.61%, MPA of Civ Infrastruct Eng 27:29–47. https://doi.org/10.1111/j.1467-
92.73%, MIoU of 77.57%, and FWIoU of 97.52%. 8667.2011.00716.x
The trained FAU-net is tested by the test set. The results 5. German S, Brilakis I, DesRoches R (2012) Rapid entropy-based
detection and properties measurement of concrete spalling with
show that FAU-net is strong at detecting corrosion-related
machine vision for post-earthquake safety assessments. Adv
damages under various conditions. In the comparison with Eng Inform 26:846–858. https://doi.org/10.1016/j.aei.2012.06.
the state-of-art networks, FAU-net shows better performance 005
in damage detection with noisy background and achieves 6. Koch C, Brilakis I (2011) Pothole detection in asphalt pavement
images. Adv Eng Inform 25:507–515. https://doi.org/10.1016/j.
the highest metric values. The effectiveness of the proposed
aei.2011.01.002
fusion and attention modules is validated through an abla- 7. Chen P-H, Yang Y-C, Chang L-M (2010) Box-and-ellipse-based
tion study. It is worth mentioning that, due to the complex ANFIS for bridge coating assessment. J Comput Civ Eng 24:389–
architecture, FAU-net has a relatively slow test speed. 398. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000041
8. Chen P-H, Chang L-M (2006) Effectiveness of neuro-fuzzy rec-
Using the trained FAU-net, damage images are accurately
ognition approach in evaluating steel bridge paint conditions. Can
segmented. To measure actual areas associated with these J Civ Eng 33:103–108. https://doi.org/10.1139/l05-077
segmented regions, the relationship between the pixel and 9. Jahanshahi MR, Kelly JS, Masri SF, Sukhatme GS (2009) A sur-
physical areas is thoroughly investigated by photography vey and evaluation of promising approaches for automatic image-
based defect detection of bridge structures. Struct Infrastruct Eng
experiments and theoretical analysis. The effects of the
5:455–486. https://doi.org/10.1080/15732470801945930
shooting distance, the focal length, and the shooting angle 10. Vorobel R, Ivasenko I, Berehulyak O, Mandzii T (2021) Segmen-
are fully considered. Finally, a theoretical relation equation tation of rust defects on painted steel surfaces by intelligent image
is established and validated. The validation results indicate
123
analysis. Autom Constr 123:103515. https://d oi.o rg/1 0.1 016/j.a ut- 28. Liu W, Rabinovich A, Berg AC (2015) Parsenet: looking wider to
con.2020.103515 see better. 1–11
11. Furuta H, Deguchi T, Kushida M (1995) Neural network analy- 29. Li H, Xiong P, An J, Wang L (2018) Pyramid Attention Network for
sis of structural damage due to corrosion. In: Proceedings of 3rd Semantic Segmentation. 1–13
international symposium on uncertainty modeling and analysis 30. Li X, Lai T, Wang S, et al (2019) Weighted feature pyramid net-
and annual conference of the North American fuzzy information works for object detection. In: 2019 IEEE Intl Conf on Parallel &
processing society. IEEE Comput. Soc. Press, pp 109–114 Distributed Processing with Applications, Big Data & Cloud Com-
12. Choi KY, Kim SS (2005) Morphological analysis and classi- puting, Sustainable Computing & Communications, Social Comput-
fication of types of surface corrosion damage by digital image ing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE,
processing. Corros Sci 47:1–15. https://doi.org/10.1016/j.corsci. pp 1500–1504
2004.05.007 31. Zhao H, Shi J, Qi X, et al (2017) Pyramid scene parsing network. In:
13. Lee S, Chang LM, Skibniewski M (2006) Automated recognition 2017 IEEE conference on computer vision and pattern recognition
of surface defects using digital color image processing. Autom (CVPR). IEEE, pp 6230–6239
Constr 15:540–549. https://doi.org/10.1 016/j.a utcon.2005.08.001 32. Chen L, Papandreou G, Kokkinos I et al (2018) Deeplab: Semantic
14. Ghanta S, Karp T, Lee S (2011) Wavelet domain detection of rust image segmentation with deep convolutional nets, atrous convolu-
in steel bridge images. In: 2011 IEEE international conference tion, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell
on acoustics, speech and signal processing (ICASSP). IEEE, pp 40:834–848. https://doi.org/10.1109/TPAMI.2017.2699184
1033–1036 33. Hoskere V, Narazaki Y, Hoang T, Spencer B (2018) Vision-based
15. Jahanshahi MR, Masri SF (2013) Parametric performance evalu- structural inspection using multiscale deep convolutional neural
ation of wavelet-based corrosion detection algorithms for condi- networks
tion assessment of civil infrastructure systems. J Comput Civ Eng 34. Nash W, Drummond T, Birbilis N (2018) Quantity beats quality for
27:345–357. https://d oi.o rg/1 0.1 061/( ASCE)C
P.1 943-5 487.0 0002 semantic segmentation of corrosion in images. 1–10
25 35. Tong T, Lin J, Hua J et al (2021) Crack identification for bridge con-
16. Bonnin-Pascual F, Ortiz A (2014) Corrosion detection for auto- dition monitoring using deep convolutional networks trained with a
mated visual inspection. In: Aliofkhazraei M (ed) Developments feedback-update strategy. Maintenance Reliab Cond Monit 1:37–51.
in corrosion protection. InTech, London, pp 619–632 https://doi.org/10.21595/mrcm.2021.22032
17. Atha DJ, Jahanshahi MR (2018) Evaluation of deep learning 36. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional net-
approaches based on convolutional neural networks for corrosion works for biomedical image segmentation. In: Navab N, Hornegger
detection. Struct Heal Monit 17:1110–1128. https://doi.org/10. J, Wells W, Frangi A (eds) Medical image computing and computer-
1177/1475921717737051 assisted intervention – MICCAI 2015. Springer, Cham, pp 234–241
18. Ma Y, Yao Y, Zhao X, et al (2018) Image-based corrosion rec- 37. Katsamenis I, Protopapadakis E, Doulamis A et al (2020) Pixel-level
ognition for ship steel structures. In: Meyendorf NG (ed) Smart corrosion detection on metal constructions by fusion of deep learn-
structures and NDE for Industry 4.0. SPIE, London, p 102134 ing semantic and contour segmentation. In: Bebis G, Yin Z, Kim
19. Du J, Yan L, Wang H, Huang Q (2018) Research on grounding E et al (eds) Advances in Visual Computing. Springer, Cham, pp
grid corrosion classification method based on convolutional neural 160–169
network. MATEC Web Conf 160:01008. https://doi.org/10.1051/ 38. Nguyen T, Ozaslan T, Miller ID, et al (2018) U-net for mav-based
matecconf/201816001008 penstock inspection: an investigation of focal loss in multi-class seg-
20. Feng J, Li F, Lu S et al (2017) Injurious or noninjurious defect mentation for corrosion identification
identification from MFL images in pipeline inspection using con- 39. Shi J, Dang J, Cui M et al (2021) Improvement of damage segmen-
volutional neural network. IEEE Trans Instrum Meas 66:1883– tation based on pixel-level data balance using vgg-unet. Appl Sci
1892. https://doi.org/10.1109/TIM.2017.2673024 11:518. https://doi.org/10.3390/app11020518
21. Kang DH (2021) Autonomous unmanned aerial vehicles and deep 40. Yang K, Ding Y, Sun P et al (2021) Computer vision-based crack
learning-based damage detection. Dissertation, The University of width identification using F-CNN model and pixel nonlinear calibra-
Manitoba tion. Struct Infrastruct Eng. https://doi.org/10.1080/15732479.2021.
22. Cha Y-J, Choi W, Suh G et al (2018) Autonomous structural visual 1994617
inspection using region-based deep learning for detecting multiple 41. Li S, Zhao X, Zhou G (2019) Automatic pixel-level multiple damage
damage types. Comput Civ Infrastruct Eng 33:731–747. https://doi. detection of concrete structure using fully convolutional network.
org/10.1111/mice.12334 Comput Civ Infrastruct Eng 34:616–634. https://doi.org/10.1111/
23. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich Feature Hier- mice.12433
archies for Accurate Object Detection and Semantic Segmentation. 42. Li S, Zhao X (2021) Pixel-level detection and measurement of con-
In: 2014 IEEE conference on computer vision and pattern recogni- crete crack using faster region-based convolutional neural network
tion. IEEE, pp 580–587 and morphological feature extraction. Meas Sci Technol 32:065010.
24. Girshick R (2015) Fast R-CNN. In: 2015 IEEE international confer- https://doi.org/10.1088/1361-6501/abb274
ence on computer vision (ICCV). IEEE, pp 1440–1448 43. Li S, Zhao X (2020) Automatic crack detection and measurement
25. Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real- of concrete structure using convolutional encoder-decoder network.
time object detection with region proposal networks. IEEE Trans IEEE Access 8:134602–134618. https://doi.org/10.1109/ACCESS.
Pattern Anal Mach Intell 39:1137–1149. https://doi.org/10.1109/ 2020.3011106
TPAMI.2016.2577031 44. Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks
26. Rahman A, Wu ZY, Kalfarisi R (2021) Semantic deep learning inte- for semantic segmentation. IEEE Trans Pattern Anal Mach Intell
grated with RGB feature-based rule optimization for facility surface 39:640–651. https://doi.org/10.1109/TPAMI.2016.2572683
corrosion detection and evaluation. J Comput Civ Eng 35:04021018. 45. Ioffe S, Szegedy C (2015) Batch normalization: accelerating
https://doi.org/10.1061/(ASCE)CP.1943-5487.0000982 deep network training by reducing internal covariate shift. J Pract
27. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks 10:730–743
for semantic segmentation. In: 2015 IEEE conference on computer 46. Szandała T (2021) Review and comparison of commonly used acti-
vision and pattern recognition (CVPR). IEEE, pp 3431–3440 vation functions for deep neural networks. Bio-inspired neurocom-
puting. Springer, Singapore, pp 203–224
123
47. Nagi J, Ducatelle F, Di Caro GA, et al (2011) Max-pooling convolu- K (eds) Neural networks: tricks of the trade. Springer, Berlin, pp
tional neural networks for vision-based hand gesture recognition. In: 437–478
2011 IEEE International Conference on Signal and Image Process- 54. Reed R, Marks RJ (1999) Neural Smithing: Supervised learning in
ing Applications (ICSIPA). IEEE, pp 342–347 feedforward artificial neural networks. The MIT Press
48. Qiu X (2021) A new multilevel feature fusion network for medical 55. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods
image segmentation. Sens Imaging 22:23. https://doi.org/10.1007/ for online learning and stochastic optimization. J Mach Learn Res
s11220-021-00346-2 12:2121–2159
49. Wang C, Wang Y, Liu Y et al (2020) ScleraSegNet: an improved 56. Garcia-Garcia A, Orts-Escolano S, Oprea S, et al (2017) A review
U-net model with attention for accurate sclera segmentation. IEEE on deep learning techniques applied to semantic segmentation. 1–23
Trans Biometrics Behav Identity Sci 2:40–54. https://doi.org/10. 57. McKay MD, Beckman RJ, Conover WJ (2000) A comparison of
1109/TBIOM.2019.2962190 three methods for selecting values of input variables in the analysis
50. Woo S, Park J, Lee J-Y, Kweon IS (2018) CBAM: Convolutional of output from a computer code. Technometrics 42:55–61. https://
block attention module. In: Ferrari V, Hebert M, Sminchisescu C, doi.org/10.1080/00401706.2000.10485979
Weiss Y (eds) Computer Vision – ECCV 2018. Springer, Cham, pp
3–19 Publisher's Note Springer Nature remains neutral with regard to
51. Russell BC, Torralba A, Murphy KP, Freeman WT (2008) LabelMe: jurisdictional claims in published maps and institutional affiliations.
a database and web-based Tool for image annotation. Int J Comput
Vis 77:157–173. https://doi.org/10.1007/s11263-007-0090-8 Springer Nature or its licensor holds exclusive rights to this article under
52. Shorten C, Khoshgoftaar TM (2019) A survey on image data aug- a publishing agreement with the author(s) or other rightsholder(s);
mentation for deep learning. J Big Data 6:60. https://doi.org/10. author self-archiving of the accepted manuscript version of this article
1186/s40537-019-0197-0 is solely governed by the terms of such publishing agreement and
53. Bengio Y (2012) Practical recommendations for gradient-based applicable law.
training of deep architectures. In: Montavon G, Orr GB, Müller
123

Automatic Pixel Level Detection and Measurement of Corrosion Related Damages in Dim Steel Box Girders Using Fusion Attention U Net

Uploaded by

Copyright:

Available Formats

Automatic Pixel Level Detection and Measurement of Corrosion Related Damages in Dim Steel Box Girders Using Fusion Attention U Net

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Automatic Pixel Level Detection and Measurement of Corrosion Related Damages in Dim Steel Box Girders Using Fusion Attention U Net

Uploaded by

Copyright:

Available Formats

Journal of Civil Structural Health Monitoring (2023) 13:199–217

Automatic pixel‑level detection and measurement of corrosion‑related

1 Introduction mildew, and corrosion (referred to as corrosion-related

The central bottleneck between the encoder and decoder

Fig. 2 Examples of damage

Fig. 3 Training and valida- 1.4 1.4

Fig. 4 Metrics for model perfor- 100% 100%

mance over epochs: a PA, b PA

1e−6 No 97.07 87.04 46.86 95.48

Fig. 5 Examples of predicted

Fig. 6 Examples of predicted

Fig. 7 Examples of predicted

5.1 Photography experiment introduction

Fig. 8 Comparison of prediction results by FAU-net and the state-of-art networks

FAU-net 90.81 86.31 91.89 97.85 90.81 80.11 96.10 18

(unit: mm), where n is the number of pixels located at the

Fig. 10 Illustrations of the

Fig. 12 Schematic of target

Fig. 13 Flowchart for detec-

Fig. 14 Examples of predicted

You might also like

5.1 Photography experiment introduction