
Hindawi

Complexity
Volume 2022, Article ID 4637939, 17 pages
https://doi.org/10.1155/2022/4637939

Research Article
Real-Time Explainable Multiclass Object Detection for Quality
Assessment in 2-Dimensional Radiography Images

Sadra Naddaf-Sh,1 M-Mahdi Naddaf-Sh,1 Hassan Zargarzadeh,1 Maxim Dalton,2
Soodabeh Ramezani,2 Gabriel Elpers,2 Vinay S. Baburao,2 and Amir R. Kashani2

1 Phillip M. Drayer Electrical Engineering Department, Lamar University, Beaumont, TX, USA
2 Artificial Intelligence Lab, Stanley Oil & Gas, Stanley Black & Decker, New Britain, CT, USA

Correspondence should be addressed to Hassan Zargarzadeh; [email protected]

Received 29 May 2021; Revised 22 August 2021; Accepted 30 August 2021; Published 8 August 2022

Academic Editor: Long Wang

Copyright © 2022 Sadra Naddaf-Sh et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

Quality inspection and defect detection play a critical role in infrastructure safety and integrity, especially when it comes to aging infrastructure mostly owned by governments around the world. One of the prevalent inspections performed in the industry is nondestructive testing (NDT) using radiography imaging. Growing demand, a shortage of experts, the diversity of required skills, and specific regional standards with time-limited requirements on inspection results make automated inspection an urgent need. Therefore, utilizing artificial intelligence- (AI-) based tools as an assistive technology has become a trend for industrial applications, automating repeated tasks and providing increased confidence before and during the inspection operation. Most of the work in quality assessment focuses on the classification of a few categories of defects and is mostly performed on public or noncomprehensive research datasets. In this work, a scalable, efficient, and real-time deep learning family of models for detection and classification of 10 categories of weld characteristics on a real-world industrial dataset is presented. The models are evaluated and compared against each other, various critical hyperparameters and components are optimized, and the local explainability of the models is discussed. Additionally, AutoAugment for object detection and various other techniques are utilized and investigated. The best performance for object detection and classification with 10-class models reaches a mean average precision of 72.4% and a top-1 accuracy of 90.2%, respectively. Also, the fastest object detection model is able to evaluate a full 15360 × 1024 pixels weld image in 0.39 seconds. Finally, the proposed models are deployable on edge devices to perform as assistants to NDT experts or auditing professionals.

1. Introduction

Inspection and assessment of welded joints are critical in many industries such as marine, aerospace, and chemical, and specifically in the oil and gas industry [1]. Welded joints are among the vulnerable parts of any industrial infrastructure, including pipelines. Hence, preliminary weld inspection during construction has a crucial role in longevity, as a small discontinuity can grow into an utter failure over time [2, 3]. Moreover, pipeline failures can damage life on a large scale and are a threat to the environment [4]. Furthermore, it is very costly to maintain continuous inspection to track the growth of initial imperfections over time, or to restore the surrounding environment or the pipeline when defects are larger than a certain threshold [5, 6]. Thus, weld inspection is the most economical preventive approach, specifically at the early stages of construction. Among the different nondestructive testing (NDT) technologies at the point of construction, radiographic testing (RT), ultrasonic testing (UT), and magnetic testing (MT) are of great importance. Currently RT, in which X-ray imaging of the welded part is done, is preferred due to the universal training and accuracy of its technology [7]. Nonetheless, analysis of X-ray images is time-consuming and tedious, and in the end different experts might have different opinions, and hence auditing is essential [8]. Thus, automation of these systems is of interest in the industry to certify reliability and safety of the product in the various stages of construction, approval, audit, and risk assessment.
In recent decades, much research has been conducted on the automation of tasks employing robots [9–12], including robotic automation of welding operations to accelerate the process and reduce human error. As an instance, Figure 1 shows a robotic digital X-ray photographer by Stanley Oil and Gas. The robot autonomously conducts the X-ray imaging, which significantly minimizes human intervention to prevent the operators from exposure [13]. After a robotic imaging process is done, human experts use the generated images to inspect the welds. However, recent rapid improvement in machine learning, computer vision, and pattern recognition has opened new roads to provide novel solutions to address the challenges regarding ultimate defect diagnosis and complete traceability of discontinuities over the pipeline's life cycle [2, 3, 8, 14, 15]. In the following, a review of related research on weld and defect diagnosis is provided.

Figure 1: Digital X-ray detector and source on a robotic platform, Stanley Oil and Gas.

Previous research works focusing on defect analysis are mainly divided into two smaller subgroups. Before the prevalence of deep learning and convolutional neural network (CNN) approaches in the early 2010s, procedures focused on traditional image processing methods for image preprocessing and classification utilizing classical machine learning methods (e.g., support vector machines (SVMs)) and training artificial neural networks (ANNs) based on hand-crafted features extracted from image patches (cropped rectangular pieces of a larger image). These works mainly focused on classifying defect and nondefect images and assigning a single label to an image patch, with or without segmentation of the defect area. Mery and Berti [16] used texture features to train ANNs, and the best result reached an 8% false alarm rate. In [17], gray level co-occurrence matrix (GLCM) texture features were used for multiclass ANNs with 86.1% accuracy, optimized to reach 87.3% by applying the Levenberg–Marquardt optimizing function in [18]. A similar approach to classify defects with a combination of statistical and geometric features, utilizing top-hat filtering, thresholding, and morphological smoothing as preprocessing, presented in [19], resulted in 91% accuracy in detecting defects and nondefects and 96% in classifying a hundred test images containing low-contrast images. In [20], the Wiener filter is considered the best enhancement as it leads to a lower root mean square error (RMSE) in comparison with median filtering and contrast enhancement, and defective segments are obtained from the segmented image using an automatic threshold. Finally, for feature extraction, the lexicographically ordered one-dimensional signal of the image is generated, and mel-frequency cepstral coefficients (MFCCs) and polynomial coefficients are extracted from the power density spectra (PDSs) of the image and passed into an ANN, which reduced the false positive rate to 7%. Lim et al. [21] employed a multilayer perceptron (MLP) network trained on a simulated dataset of weld radiographic images for classification of the patches.

Zapata et al. in [22] used an adaptive network-based fuzzy inference system (ANFIS) and an ANN, in which geometrical and texture features were selected with respect to minimizing computational complexity, and reached 82.6% accuracy. Valavanis and Kosmopoulos [23] applied certain classifiers for distinguishing between six types of defects annotated based on British Standards or labeled as nondefect. Preprocessing steps of their research include utilizing local thresholding and graph-based segmentation, and then geometric and texture features are used as input for classifiers like ANN, K-nearest neighbor (KNN), and SVM. In [24], a comprehensive review of similar methods is provided. It can be concluded that classical approaches require major preprocessing steps before feature extraction, and preprocessing enhancements have a direct impact on final accuracy.

On the other hand, a few studies focused on image segmentation to provide a general understanding of defect localization. Carrasco and Mery [25] presented a method for segmenting defects. The method consists of a few steps: median filtering, bottom-hat filtering, binary thresholding, and the watershed transform. The results suggested an area under the curve (AUC) of 93.58% for ten images. In [26], a sliding window approach is used for weld object detection based on a large set of features. In [27], Ben Gharsallah and Ben Braiek proposed a method to address the nonrobustness of defect segmentation caused by uneven illumination, based on a level set active contour guided with an off-center saliency map, in which an energy function gets minimized to achieve segmentation. Despite faster convergence and higher accuracy than local image filtering and contrast enhancement, the method requires further investigation to minimize human intervention in finding the region of interest (ROI). In [28], the defect segmentation problem is addressed using Gabor filtering and a Canny edge detector. As more recent research, which is also evaluated on an aerospace weld dataset, a novel pixelwise segmentation defect detection system is presented in [8]. Dong et al. [8] described a system to detect weld defects by using a random forest instead of Softmax as the classifier of a U-net [29]. The approach is pixelwise labeling of highly similar circular defects, which are prevalent in aerospace industries.

Since the prevalence of deep convolutional neural networks (DCNNs), many works have focused on using these models for feature extraction/selection instead of traditional hand-crafted feature extraction and nonrobust methods. Primarily, two general tasks are performed using DCNNs (i.e., classification and object detection). Furthermore, weld defect datasets have a class-imbalance issue, since the number of weld defects might not be distributed equally among different classes. Hoe et al. [30] focused on extending three types of datasets using autoencoders to address the imbalance problem.
Next, a few models, including DCNNs and other models based on extracted features, are trained to classify four different types of defects and reach an accuracy of 97.2%. Ajmi et al. [31] explored two-class (porosity and lack of penetration) classification of weld defects. Data augmentation through horizontal mirroring, translations, and RGB channel modification is applied to boost model performance, and 85.2% accuracy is reported with transfer learning utilizing AlexNet [32] and the addition of a few dropout layers as well as a modified final layer on GDXray [33]. In [34], a real-time and two-stage method based on images from a 3D laser scanner is proposed. The method performs four-class classification of narrow lap welds. Also, a comparison of classical and deep classification methods is performed, with an average accuracy of 80% for classical approaches, while for the deep methods VGG-16 [35] and ResNet50 [36], 97.1% and 97.8% accuracy are reported, respectively. Wang et al. [37] presented a tutorial for weld defect detection based on DCNNs with an implementation provided in PyTorch [38]. The paper provides a step-by-step approach for data collection, preprocessing, and model designing, training, and testing.

Further investigation is performed for accurate localization of weld characteristics using deep methods. Hou et al. [14] designed a deep learning-based system for weld quality assessment. They used a sparse autoencoder (SAE) to extract and use intrinsic features for classifying 32 × 32 pixels weld patches and finally a sliding window to classify image pixels as defect or nondefect. The process reaches an accuracy of 91% on GDXray [33], even though the work is binary defect classification and the process is time-consuming because of the nature of the sliding window approach and the size of full weld images. In [39], extensive experiments with 24 various computer vision-based weld object detection methods (including deep learning methods based on a sliding window) are performed and reported. In [40], two-stage detectors (i.e., Faster RCNN [41]) are used for object detection of weld defects in shipbuilding, which accounts for 60% of the building process, where radiography testing is used to inspect welded joints. The proposed object detector is trained to detect two general types of porosity and lack of fusion/slag defects. Moreover, the best result is acquired by data augmentation, which reached 53.2 mean average precision (mAP) on Faster RCNN [41] with a ResNet50 [36] backbone.

Gau et al. [42] developed a contrast enhancement conditional generative adversarial network (GAN) to address the contrast and class-imbalance issues. There are two separate target networks in their work. The first network accepts a 71 × 71 pixels patch from the weld seam to classify the patch as defect/nondefect. For determining the defect type, defective patches are passed into a second classification network. At the end, the sliding window approach is used for localizing defects. Thus, with respect to the two-stage design of the system and the sliding window, the entire system will not perform in real time for high-resolution images. In [43], a defect localization method based on U-net and augmentation using a conditional GAN (cGAN) [44] is presented, and the method is evaluated on the GDXray dataset [33]. Although the method shows an AUC of 88.4% for defect segmentation, the lack of defect classification is discernible. Gantala and Balasubramaniam [45] presented an automatic defect recognition model trained on a total focusing method (TFM) imaging dataset and a finite element simulated dataset, with addition of noise and further expansion of the dataset utilizing a deep convolutional GAN (DCGAN). Their two-class defect detection model was evaluated with YOLOv4 [46] and reached 85 average precision (AP) on the noisy dataset.

Although the above research papers are mostly related to employing deep CNN methods to automate preliminary inspection in construction and welding, studies using deep CNN methods for NDT and defect diagnosis are not limited to radiography images and weld construction. Yan et al. [47] developed deep models for enhanced feature extraction and ultrasonic pattern recognition for inspection of gas pipelines. The method uses a contact-less dual-mode bulk wave electromagnetic acoustic transducer (EMAT) and interpretation of A-scan signals to detect defects. It leverages the continuous wavelet transform (CWT) to extract frequency-time domain features, then a deep CNN model is applied to perform high-end feature extraction, and finally, a pretrained SVM is used for defect/nondefect classification of signals. The method's feature extraction ability is verified by comparison with other methods, including the discrete wavelet transform (DWT) and statistical features, all of which are outperformed by the CNN model, which achieves 93.75% accuracy on a dataset of pipe with artificially manufactured defects. The work is performed for defect/nondefect classification, and the possibility of defect type classification is to be investigated.

In addition to ultrasonic pattern recognition, deep CNNs are also utilized for thermography crack detection. In [48], Hu et al. explored supervised thermography video sequence metal crack detection and localization. The work uses eddy current pulsed thermography (ECPT), a multi-physics coupling method, to detect turbulence in conductive materials by analyzing thermal patterns. Initially, principal component analysis (PCA) is used to extract thermal sequence components from the original data. Then, Faster RCNN [41] is used to perform object detection on the images accurately. Finally, the method is compared to traditional detection methods, and it demonstrates a 0.97 probability of detection, which outperforms the most accurate prior method by 26%. The proposed methods are validated experimentally and have shown significant improvement in their own type of NDT and data acquisition, demonstrating the advantages of using CNNs for feature extraction in NDT. While UT and thermography methods (e.g., ECPT) are commonly used for in-line inspection and maintenance purposes and not for weld construction inspections, these methods have their limitations, such as low sensitivity to small defects or internal crack detection [13].

Studies mentioned above are all experimentally evaluated on either (1) a set of images from a private dataset (i.e., usually created for experimental purposes) or (2) GDXray [33] or similar public and noncomprehensive sets. As shown in Figures 2 and 3, there are noticeable differences between images from welded joints at Stanley and the GDXray dataset. First, GDXray has a limited number of samples.
Second, class diversity is limited, and annotations and weld characteristics are based on a different standard [33]. Third, the visibility of defects is limited compared to defects at Stanley. Also, in some cases, a single patch contains more than one type of defect, which does not permit the experts to designate a single label for the entire image patch, all of which make classification-only or defect/nondefect localization incompatible with real-world industrial requirements and standards. In other words, the detection of non-hand-picked and diverse real-world samples is more of a challenge. On the other hand, since these systems will work as assistants to NDT experts, and there are limitations in deployment hardware as well as time-constrained processing requirements, scalability is required for efficient and optimized utilization. Considering the mentioned reasons, these methods either fail to reach the required specifications or do not meet the required performance based on industry measures.

Figure 2: Two samples of images in the GDXray database (top) and the SBD dataset (bottom). The bottom image is cropped to be comparable. Defects are more visible in the GDXray database than in the SBD dataset.

Figure 3: An image sample from the SBD database in which two surfaces with different thicknesses are welded.

This paper aims to address the accuracy and inference time trade-off by presenting an efficient and scalable set of deep models. Moreover, instead of assigning a single label to each patch, an accurate location and label for each discontinuity will be determined. The contributions in this work are as follows: (1) describing an efficient and scalable system for object detection or classification of weld characteristics on long, high-resolution radiography weld images, which is deployable as a real-time assistant for NDT experts, (2) demonstration and analysis of transferring augmentation strategies during training, which can improve the performance of the system on detection of rare small discontinuities that are easier to miss during manual inspection and harder to detect with deep learning methods, (3) analyzing and experimenting with different components of the deep model, such as activation functions and feature extraction backbones, and (4) comparative analysis of the presented models against baseline models.

The rest of this article is organized as follows. In Sections 2 and 3, an overview of dataset preparation and the proposed methods is provided, respectively, as well as a description of the system architecture. In Section 4, the methods are tested, various models are described, and the augmentation approach and results are evaluated. Finally, conclusions are proposed in Section 5.

2. Dataset

The dataset contains thousands of X-ray images taken for the purpose of NDT of weld construction in its preliminary stages. There is little to no material variation in weld construction, which helps in developing a model focusing on accuracy and robustness. The majority of the structures are plain carbon steel. The diameter of the pipes ranges from 24 to 56 inches; however, pipes of either 36 inches or 42 inches are most common. Moreover, the pipe wall thickness is at least 0.5 inches with a grade of X65 or greater. Finally, all pipes are consistent with API 5L [49] in terms of types, dimensions, material, and grade.

Welded-joint images have various resolutions depending on the exterior diameter of the structure. In this dataset, the resolution of the images is roughly 15360 × 1024 pixels, with the occurrence of weld discontinuities. As the welded area only covers one-fifth of each weld image's center area, images are cropped into 224 × 224 patches with 20% overlap. This overlap benefits in two ways. First, it assists in retaining defects lying in between two patches in one patch. Second, as smaller defects shift in two consecutive images, it can be interpreted as data augmentation. Next, experts annotated the images based on API 1104 [50] standards. Most of the defect-free patches are removed from the dataset to prevent overwhelming the network with nondefect images. Finally, Figure 4 shows samples of the dataset, and Table 1 shows the distribution of images for each set. As the dataset reveals, about 75% is used as the train set (i.e., 17872 images), and 10% and 15% are used as the dev/validation set and test set, respectively. Note that the dataset is collected from the welding of various structures and different welding devices. Thus, results obtained from this dataset can demonstrate the generalizability and robustness of the proposed solutions for extensive use as an assistant to NDT experts. Figure 5 summarizes the preprocessing steps on the dataset. The steps are described in detail in Section 3.1.

3. Method

Addressing robustness, accuracy, and time performance is required for employing a deep convolutional model in production for the task of weld defect object detection. Over recent years, scaling up image resolution, the depth and width of the network, and using a larger backbone have been widely used to boost the performance of models [51–54]. However, this costs, in a larger model, higher computation and inference time [51, 52] as well as longer training time. Thus, a robust and efficient design is required to address accuracy versus time performance. In order to address the trade-off between accuracy and time and achieve efficiency, a family of one-stage and scalable models called EfficientDet [52] is exploited.
Figure 4: Samples of classes in the dataset: Elongated Slag Inclusion (ESI), Isolated Slag Inclusion (ISI), Inadequate Penetration (IP), Inadequate Penetration Due to High-Low (IPD), Gas Pocket (GP), Porosity (P), Hollow Bead (HB), Scattered Porosity (SP), Internal Concavity (IC), and Cover Pass Undercutting (EU).

Table 1: Distribution of images and labels.

            Total   ESI   ISI    IP   IPD    GP     P    HB    SP    IC    EU
Train       17872  4424  2948   908   702  1840   598  4062   498   597  1295
Validation   2203   569   338   118    97   280    71   435    51    87   157
Test         3394   891   490   190   139   408   116   707    83   130   240
Total       23469  5884  3776  1216   938  2528   785  5204   632   814  1692

Employing a single compound coefficient, one can scale the architecture to address the trade-off between model size and accuracy, resulting in a model deployable on various end devices ranging from mobile devices to high-performance GPU clusters.

A two-stage object detection model generally starts with a search for regions of interest (ROI) using selective search or, in more recent designs, region proposal networks (RPNs); then, by passing the image to the second stage, feature extraction, classification of the boxes, and refinement of the bounding boxes are performed [41, 55]. Although two-stage methods might lead to higher accuracy, the inference time is significant because of the burden of the additional first stage (RPN). In contrast, one-stage detectors apply a feature extractor called the backbone and then fuse multilevel extracted features. In the end, class/box networks extract class labels and regress bounding boxes. Since the image passes only once through the network, a one-stage detector performs significantly faster than other methods [54]. By utilizing pretrained backbones, the power from classification tasks transfers to these object detectors, as employed in [56]. In this section, the preprocessing steps, the EfficientDet architecture design, and the augmentation strategies for object detection, as well as the system architecture to achieve an accurate model with low latency, are discussed, respectively.

3.1. System Architecture. Figure 5 depicts the required preprocessing steps to generate the dataset, which start with downloading image patches and quality validation. Although the images on the cloud storage are prevalidated for quality, validation can also be done through a wire IQI tag, which is discernible on the image in Figure 6. As this step is optional and can be done upon uploading the images to the cloud storage, its time burden is disregarded from the total system time performance. As the final two steps, brightness correction and contrast leveling as well as slicing of the original 15360 × 1024 pixels image with 20% overlap are done.

As the training part of Figure 5 depicts, training starts on a scaled model, which depends on a single coefficient for the determination of the depth and width of the network. In addition, AutoAugment is performed during training. The procedures of network design, scaling, and augmentation are elaborated in Sections 3.2–3.4. As the next step, based on the type of the trained network in the model, it predicts either the label and accurate location of the defects or assigns a single label to the whole patch with explanations of the decision provided. Finally, the visualization part of Figure 5 indicates stitching as the first step of visualization. Since exact slicing points are saved during slicing, the predicted defect locations relative to the whole image are calculable. Finally, the full DICONDE image can be visualized through the Stanley web app or mobile app or saved as DICONDE metadata.
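To make the slicing and stitching steps concrete, the following is a minimal sketch (not the authors' production pipeline) of cropping a long radiograph into 224 × 224 patches with 20% overlap and mapping patch-level detections back to full-image coordinates; the `detect` callback and array shapes are illustrative assumptions.

```python
import numpy as np

PATCH = 224
STRIDE = int(PATCH * 0.8)  # 20% overlap between consecutive patches

def slice_image(image: np.ndarray):
    """Yield (x0, y0, patch) crops covering a full weld radiograph."""
    h, w = image.shape[:2]
    ys = list(range(0, max(h - PATCH, 0) + 1, STRIDE))
    xs = list(range(0, max(w - PATCH, 0) + 1, STRIDE))
    if ys[-1] != h - PATCH:          # make sure the image edge is covered
        ys.append(max(h - PATCH, 0))
    if xs[-1] != w - PATCH:
        xs.append(max(w - PATCH, 0))
    for y0 in ys:
        for x0 in xs:
            yield x0, y0, image[y0:y0 + PATCH, x0:x0 + PATCH]

def detect_full_image(image, detect):
    """Run a patch-level detector and stitch boxes back to image coordinates.

    `detect(patch)` is a hypothetical callback returning a list of
    (x1, y1, x2, y2, label, score) boxes in patch coordinates.
    """
    results = []
    for x0, y0, patch in slice_image(image):
        for x1, y1, x2, y2, label, score in detect(patch):
            # The saved slicing offsets make the global location recoverable.
            results.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, label, score))
    return results
```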
Figure 5: An overview of the system architecture (preprocessing, training, inference, and visualization stages): the box in blue accent is optional and can be done once images are captured before the start of the detection process. The arrows in gray show transitions to the next stage. In the inference stage, either the object detection or the classification task is performed based on the in-use trained model. In visualization, depending on the need, stitching, visualization, storing, or all of them at once can be done.

3.2. Network. As described in Section 2, the final dataset contains 23469 image patches of size 224 × 224 pixels. An image patch passes through the backbone for feature extraction. In this work, EfficientNet [53] is used as the backbone of the object detection models and for feature extraction in the classification models. However, for weld quality assessment, different backbone performances are evaluated, and class activation maps are reported. Next, multiscale features from levels P3 to P7 pass through a successor of feature pyramid networks (FPNs) [57]. Pi denotes the resolution of the input activation map, which is 1/2^i of the original input image. In conventional FPN, it is assumed that features from various scales contribute equally to the final detection. A few works have investigated the optimization of feature fusion; e.g., NAS-FPN [51] is an effort to find an optimum architecture for a cross-scale fusing network through search. However, it takes thousands of GPU hours to find an optimal design, and the resulting model is oversized. To address the equal contribution of different scales in fusing features, EfficientDet uses a bidirectional FPN (BiFPN). In BiFPN, similar to FPN, a top-down pass is used, and similar to PA-Net [58], a bottom-up pass is added. Nonetheless, the bottom-up pass adds a lot of costly additional weights to the network. Thus, nodes with single connections (highest and lowest levels) are removed in view of their lesser contribution to feature fusion, to optimize the structure. In addition, a few edges from input to output (similar to skip connections in ResNet [36]) are added, which boost both training and accuracy. Finally, fused features pass through two similar class and box networks used to determine the class label and the bounding box location of detected discontinuities. Similar to the backbone and BiFPN, the depth of the class/box nets is scaled with a single coefficient.
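The paper does not spell out how BiFPN weights the fused scales, so the sketch below assumes the "fast normalized fusion" described in the EfficientDet paper [52]: each incoming feature map gets a learnable nonnegative weight that is normalized before summation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse same-resolution feature maps with learnable nonnegative weights.

    A sketch of the fast normalized fusion used by BiFPN in [52]; the
    original implementation may differ in normalization details.
    """
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        # ReLU keeps the weights nonnegative; dividing by their sum
        # normalizes them so the fused map stays on the same scale.
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(w[i] * f for i, f in enumerate(features))
```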
3.3. Scalability. In this part, the single compound scaling coefficient of the overall architecture is reviewed. The EfficientDet family starts from the smallest model, D0, and ends with the deepest and largest model, D7, where the number stands for the single compound coefficient ϕ, used to scale the input image resolution and the overall depth and width of the architecture. For the backbone, if EfficientNet is used, one of the pretrained networks is applied based on ϕ. Figure 6 shows the architecture, which is similar for all networks. The final input image resolution is determined using the following equation:

R_input = 512 + ϕ · 128. (1)
Figure 6: EfficientDet family architecture (input patches → backbone → n BiFPN layers → class/box prediction nets): images pass through the backbone, and feature scales P3 to P7 are fed into the BiFPN network. The input image resolution is calculated from equation (1). The number of repeated BiFPN blocks is extracted using equation (2). The depth of the box/class prediction nets is determined using equation (3).

Figure 7: Samples of a training batch with different augmentations: (a) a batch with no augmentation, (b) an augmented batch with a collection of random augmentations referred to as train-time augmentation, and (c) an augmented batch based on policyV3.

Equation (2) shows how the number of channels/layers of the BiFPN is scaled, where each layer is one of the repeated BiFPN blocks, starting from 3 for D0 as shown in Figure 6. Finally, the number of layers of the class/box nets is determined through equation (3):

W_bifpn = 64 · (1.35^ϕ), D_bifpn = 3 + ϕ, (2)

D_box = D_class = 3 + ⌊ϕ/3⌋. (3)
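A small helper evaluating equations (1)–(3) for a given compound coefficient ϕ makes the scaling concrete. This is an illustrative sketch; the published EfficientDet configurations additionally round W_bifpn to hardware-friendly channel counts.

```python
def efficientdet_scaling(phi: int):
    """Evaluate equations (1)-(3) for compound coefficient phi (D0..D7)."""
    r_input = 512 + phi * 128                 # equation (1): input resolution
    w_bifpn = int(round(64 * (1.35 ** phi)))  # equation (2): BiFPN channels
    d_bifpn = 3 + phi                         # equation (2): BiFPN layers
    d_head = 3 + phi // 3                     # equation (3): class/box net depth
    return r_input, w_bifpn, d_bifpn, d_head

# D0 gives (512, 64, 3, 3); larger models grow in resolution, width, and depth.
```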
3.4. Data Augmentation. Many object detection as well as weld quality assessment deep learning approaches employ data augmentation in order to improve both the performance of the network and generalization [31, 40, 43]. The effectiveness of augmentation has been shown and evaluated in the literature [59]. Nonetheless, there are countless strategies, such as rotation, affine transforms, zoom in/out, flipping, etc., various magnitudes, and also different possible combinations of strategies to be used for augmenting the dataset. One solution is to search through all possible solutions to find the optimal ones. The authors in [60] investigated and searched through a space of 10^10 different combinations for the
classification task. Similarly, [61] investigated the effectiveness of AutoAugmentation for object detection and extracted a few sets of policies that enhance detection performance the most for the object detection task, named policies V0–3. As searching for optimal augmentation strategies is a time-consuming task, the extracted policies are applied and investigated in this work. For this purpose, a base model (D0 with an EfficientNet-B0 backbone) is trained utilizing each of the policies to find the best policy. Then, the best policy is used for training larger models and investigating other effective parameters of the model.
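The way such a policy can be applied on the fly during training is sketched below; the sub-policy contents and the apply_op argument are placeholders for illustration, while the actual policyV0–V3 sub-policies are listed in [61].

```python
import random

# Each sub-policy is a list of (operation_name, probability, magnitude) triples.
# The operations below are placeholders; see [61] for the real sub-policies.
EXAMPLE_POLICY = [
    [("Rotate_BBoxes", 0.6, 4), ("Equalize", 0.8, 10)],
    [("TranslateX_BBoxes", 0.6, 4), ("ShearY_BBoxes", 0.2, 5)],
]

def augment(image, boxes, apply_op, policy=EXAMPLE_POLICY, p_apply=2 / 3):
    """Apply one randomly chosen sub-policy with probability p_apply (66%).

    apply_op(name, magnitude, image, boxes) is a user-supplied function that
    transforms the image and, for geometric operations, moves the bounding
    boxes consistently.
    """
    if random.random() > p_apply:
        return image, boxes          # one-third of samples stay unaugmented
    sub_policy = random.choice(policy)
    for op_name, prob, magnitude in sub_policy:
        if random.random() < prob:
            image, boxes = apply_op(op_name, magnitude, image, boxes)
    return image, boxes
```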
3.5. Evaluating Metrics. Evaluation of results is performed through average precision (AP) metrics. Models output a bounding box, a corresponding class label, and a confidence for each detection. A detection is considered correct when the area of the intersection of the ground truth bounding box and the detected box is at least 0.5 of the area of the union of the two boxes, which is called Intersection over Union (IoU); also, the class labels of both bounding boxes should be the same. That is,

IoU = area(BB_p ∩ BB_gt) / area(BB_p ∪ BB_gt). (4)

With IoU less than 0.5, the detection is counted as a false positive (fp). The number of nondetected ground truth bounding boxes is counted as false negatives (fn). Therefore, precision and recall are calculated through the following:

precision = tp / (tp + fp); recall = tp / (tp + fn). (5)

As recall and precision of a robust object detector do not alter much with varying confidence, it is required to consider multiple confidence thresholds to evaluate the performance of the object detector [62]. Defining an all-point interpolation of the area under the precision-recall curve obtains accurate results by pruning the zig-zag behavior of the curve and utilizing, at each recall level, the maximum precision P_interp(r) over recall levels greater than or equal to r_{n+1}, instead of the precision at that point. The mathematical presentation of this is as follows:

AP = Σ_n (r_{n+1} − r_n) · P_interp(r_{n+1}), (6)

where

P_interp(r_{n+1}) = max_{r̃ : r̃ ≥ r_{n+1}} P(r̃). (7)

AP has become a standard for comparing model performance in different object detection challenges [63] as well as in the literature [41, 52, 64, 65].

In Section 4, models are evaluated using mAP (mean AP, which is equal to the mean of AP with the IoU threshold ranging from 0.50 to 0.95 with a step of 0.05), AP50, AP75 (which is equal to AP@IoU=0.75), APs (s stands for small: objects with area < 32²), APm (m stands for medium: objects with area between 32² and 96²), and APl (objects with area > 96²).
formance in different object detection challenges [63] as well each of these policies, a random set of strategies from the
as literature [41, 52, 64, 65]. selected policy with a probability of 66% is selected, and the
In Section 4, models are evaluated using mAP (mean AP) input image is augmented based on it (the probability of not
(which is equal to mean of AP with IOU threshold ranging performing any of the strategies is one-third). Moreover,
from 0.50 to 0.95 and step of 0.05), AP50, AP75 (which is similar augmentation is performed on bounding boxes if any
equal to ap@iou�0.75), APs (s stands for small and objects is affected. Based on Table 3, augmentation policies dra-
with area < 322), APm (m is medium and area of the objects matically boost the performance of the network by 3.8 to 6.9
is between 322 and 962), and APl (objects with area > 962). AP. Most policies assist the network detect smaller defects

Table 2: Effect of optimizing the aspect ratio of anchor boxes based on bounding box annotations in the train dataset. For this experiment, the EfficientDet-D0 model is used and the optimized aspect ratios are (1.2, 2.14, 3.8).

                 Default                        K-means optimized (% of improvement)
mAP   AP50   APs   APm   APl     mAP          AP50         APs          APm          APl
30.7  59.3   24.0  33.9  68.3    34.8 (↑4.1)  65.9 (↑6.6)  25.4 (↑1.4)  35.0 (↑1.1)  69.2 (↑0.9)

Most policies assist the network in detecting smaller defects (i.e., APs, which are of more importance since they are easier to miss during manual inspection and are harder to detect with deep learning methods). For further investigations, Train-timeAug and policyV3 are applied to the images, as they performed best in these experiments. Figure 7 depicts a sample training batch with the mentioned augmentations applied.

Table 3: Policies used to train the D0 model on the entire train images. Results are reported on the validation set (base value and % of improvement).

Policy name     mAP           APs           APm            AP50
NoAugment       24.2          21.5          27.2           46.0
Train-timeAug   30.4 (↑6.2)   24.4 (↑2.9)   39.0 (↑11.8)   62.5 (↑16.5)
policyV0        28.0 (↑3.8)   23.2 (↑1.7)   33.0 (↑5.8)    56.0 (↑10.0)
policyV1        30.3 (↑6.1)   22.2 (↑0.7)   35.1 (↑7.9)    61.1 (↑15.1)
policyV2        30.2 (↑6.0)   22.1 (↑0.6)   36.3 (↑9.1)    59.7 (↑13.7)
policyV3        31.1 (↑6.9)   24.9 (↑3.4)   37.2 (↑10.0)   60.9 (↑14.9)

4.3. Training and Hyperparameter Tuning. The size of the models and the resolution of input images increase gradually from D0 to D7 using equations (1)–(3). It is not possible to train all the models on GPUs with 16 GB RAM with a suitable batch size relative to model size. Models with smaller ϕ coefficients (i.e., D0 to D2) are trained on 3 NVIDIA V100 16 GB RAM GPUs with the maximum possible batch size, though for fitting these models in such GPU
NVIDIA V100 16 GB RAM GPUs with maximum possible
batch size though for fitting these models in such GPU training ended up with higher AP. Furthermore, a few
memory, a few actions are performed. First, mixed-precision optimizers are evaluated and results show in this task
training2 is applied using the Apex package which assists in Fusedadam [73] optimizer converges faster and reaches a
decreasing memory usage and training time by utilizing half- higher accuracy (0.6 mAP). In the following, the impact of
precision weights and operations if possible. Second, as different activation functions is discussed.
providing accurate statistics for batch normalization is
crucial for the stabilized learning process and high-speed
4.4. Effect of Different Activation Functions. The performance
convergence, in distributed training, synchronized batch
of the base model (i.e., EfficientDet-D0) is analyzed by
normalization is used to provide cross-device batch-norm
testing over different activation functions, namely, Leaky
statistics. Nonetheless, these would not help to fit D5 to D7
Rectified Linear Unit (Leaky ReLU), Gaussian Error Linear
models in GPU. Thus, results related to those models are not
Units (GeLU) [74], Swish [75], Mish [76], and hard Swish
reported. Finally, for comparison, several original Yolo and
[66] in which sigmoid is replaced with relu6(x + 3)/6, which
RetinaNet models are trained. For RetinaNet models, images
is more memory efficient, and hard Mish [77]. Figure 9(a)
are resized to 800 × 800 pixels, and ResNet with 50 or 101
visualizes mentioned activation functions. Note that the
layers are used as the backbone, and for Yolo models, images
specified activation function is used for BiFPN layers and
are resized to 640 × 640 pixels. Implementation for yolo
class/box prediction nets. As shown in Figure 9(b), both
models can be found in [68], and RetinaNet can be found in
hard Swish and Swish outperforms other activation func-
detectron2 framework [69].
tions based on AP50. The same improvement applies for
During training, normalization using precomputed
other AP metrics. However, this did not happen for deeper
mean and standard deviation values per channel on the
models of the EfficientDet family. Thus, default Swish is used
entire dataset is performed. Also, each image is first ran-
to maximize model performance, though in view of the
domly flipped horizontally and/or resized for all experi-
memory efficiency of hard Swish, it is a preferable choice for
ments (i.e., Train-timeAug explained in Section 3.4). For
activation function if the model is planned to get deployed
weight initialization, the weights originally were trained on
on hardware-constraint end-device. For non-EfficientDet
MS COCO dataset in [52] and are converted to PyTorch in
family models, the default activation function of the model is
[70]. Thus, all weights of the network are trained to reach
used.
maximum performance similar to our previous work [56].
Identical to [52], cosine learning [71] is used. At the
beginning of the training process, for the first few epochs 4.5. Defect Object Detection without considering Class Labels.
(epoch numbers 0 to 5) learning rate increases gradually to In this experiment, all discontinuities are considered with a
the desired point, and from epoch 5 to the end of the training single defect label. Table 4 shows network performance
process the learning rate decreases gradually in cosine form. considering a single class for all discontinuities. Although
In addition, learning rate noises applied to 30% and 90% of these models only perform localization of the defects and no
the training process. Moreover, in a few experiments, ex- class label is available, higher accuracy in localization is
ponential moving average (EMA) [72] with a weight decay of reached. In addition to EfficientDet Family, several other
0.9998 was applied; however, it was removed as non-EMA models are also added for sake of comparison. All models are
[Figure panels: (a) activation function curves; (b) AP50 versus training epochs for hard Swish, Swish, Mish, GeLU, Leaky ReLU, and hard Mish.]
Figure 8: Activation functions used for comparing the performance of the models.

All models are trained using the base parameters from their original work, except for Yolov3, in which focal loss is preferred to address class imbalance. In Section 4.6, the models in Table 4 are discussed.

4.6. Evaluating Results. Tables 4–6 demonstrate the evaluation of models on the 10-class object detection task, 10-class classification, and the defect/nondefect object detection task, respectively. All results for object detection models are obtained from the test set. For all experiments, the common train-time augmentations and policyV3 (described in Section 3.4) are applied. Although hard Swish improved accuracy for shallower models, the same did not happen for deeper ones (D3 and deeper). Thus, the default activation functions of the models are used (i.e., Swish for EfficientDet models, leaky ReLU for CspResdet50 and darkdet53, and ReLU for Resdet50). As shown in the tables, deeper models generally perform more accurately. However, the little improvement or minor deterioration in larger models is a result of having to use a smaller batch size to fit the model into the GPUs (i.e., a batch size of 20 per GPU is used to train the D0 versus batch sizes of 3 and 1 for the D3 and D4 models, respectively), which accounts for inaccurate estimation of batch normalization statistics and deteriorates the training process. Since, for the task of weld quality assessment and indexing of the weld as well as rejecting or accepting it, predicting 50% of the discontinuity is acceptable, AP50 is used for further model comparison and analysis. APl is not reported because defects in welds with area greater than 96² pixels are undersampled and uncommon; therefore, it would not be a reliable measure to evaluate the performance of the models.

For inference time analysis, a GPU similar to the one used for training, an NVIDIA V100 16 GB, is exploited. The inference times in Tables 4 and 6 suggest that the models are able to perform in real time based on the definition of real time for object detection models [78]. As a result, the fastest and the most accurate models can infer up to 224 and 150 image patches per second with a batch size of 16, respectively. Considering that in the worst case each complete weld image has a length of 15360 pixels and is cropped with 20% overlap, a full image will have about 86 patches, meaning the models can infer an image in 385 to 465 ms. Thus, the models are able to process weld images in real time with consideration of the required preprocessing. Figure 9 summarizes the models' latency and floating-point operation (FLOP) counts. Yolo models have a higher number of operations, and the resulting models are larger. In contrast, the EfficientDet family models and models with BiFPN layers enhanced with AutoAugmentation are both smaller and more accurate. Although EfficientDet models have a smaller number of parameters, they perform slower on GPU because of the slower execution of separable convolutions. Finally, a fusion of ResNet50 with the EfficientDet object detection architecture results in the best accuracy versus latency for this task.
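The patch count and per-image time quoted above follow from the slicing parameters; one way to arrive at roughly 86 patches and about 0.39 s for the fastest model (throughput taken from Table 6) is:

```python
length = 15360                      # worst-case weld image length in pixels
patch, overlap = 224, 0.2
stride = int(patch * (1 - overlap))                 # 179 pixels
patches = (length - patch) // stride + 1            # 85 aligned patches
patches += 1 if (length - patch) % stride else 0    # plus one edge patch
print(patches)                                      # about 86 patches per image

throughput = 224                    # patches/s, fastest model, batch size 16
print(patches / throughput)         # ≈ 0.38-0.39 s for a full weld image
```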
In Tables 4 and 6, the models reported above the double line are trained for the sake of comparison. In the Yolo models, in addition to Train-timeAug, mosaic augmentation is applied. Although this improved results by 0.5 AP, EfficientDet
Figure 9: (a) AP50 accuracy versus giga floating-point operations (GFLOPs) for 10-class object detection models. (b) AP50 versus GPU latency (milliseconds/224 × 224 image patch) for 10-class object detection. (c) F1-score versus GPU latency (ms/224 × 224 image patch) for 10-class classification models.

Table 4: Model defect detection performance on the test set based on defect/nondefect labels.

                                    Test                              Inference time (images/sec)
Model Name            # of params   AP    AP50  AP75  APs   APm       b=1   b=8   b=16
Yolo v3               61.4 M        34.3  81.0  —     —     —         54    65    75
Yolo v4               64.3 M        27.4  61.6  —     —     —         48    130   185
Yolo v5x              87 M          34.8  81.9  —     —     —         35    50    56
RetinaNet-50          37.9 M        29.6  73.6  14.5  29.5  30.5      22    —     —
RetinaNet-101         56.9 M        30.9  75.2  14.2  29.9  31.5      20    —     —
EfficientDet-D0       3.8 M         37.1  87.5  18.5  37.5  37.3      35    130   185
EfficientDet-D3       11.9 M        37.7  88.1  19.7  36.7  38.8      14    56    65
Efficientdarkdet53    11.9 M        36.6  85.7  18.9  38.4  35.5      15    34    37
EfficientCspResDet50  23.6 M        37.6  86.2  20.2  38.8  36.8      15    36    39
EfficientResDet50     26.9 M        38.2  89.0  21.8  37.8  40.9      27    143   168

Inference times are reported based on tests on a V100 GPU card. For EfficientDet models, the number D# stands for the ϕ coefficient of the model, and the depth and input image resolution can be acquired using equations (1)–(3). Also, the number stands for the backbone number in [53]. For other models, the same structure as EfficientDet-D1 is used except for the backbone, whose name is given in Model Name. For models without "efficient" in their name, implementations from [69] or [68] with common augmentation are employed.

Table 5: 10-class classification backbone performances.

                                                                                      Inference time (images/sec)
Model Name       Image Size  # of params  Top-1 accuracy  Precision  Recall  F1-score   b=1   b=16
Resnet50         224         23.5 M       86.83           85.68      86.26   85.97      30    207
CspResnet50      224         20.6 M       85.3            83.89      83.71   83.8       31    266
MobileNetV3      224         4.2 M        79.91           79.41      78.67   79.04      37    267
EfficientNet-b0  224         3.8 M        82.15           81.4       81.36   81.38      38    355
EfficientNet-b1  240         6.5 M        85.4            83.8       84.4    84.1       29    200
EfficientNet-b6  528         40.7 M       85.71           84.5       84.1    84.3       13    26
EfficientNet-b8  672         89.6 M       90.2            89.5       88.72   89.11      6     10

models are still more accurate. In contrast to the large number of parameters in Yolo, those models still perform faster than EfficientDet models; the reason is that depth-wise separable convolutions are employed for feature fusion in EfficientDet, and they run slower on GPU. However, thanks to the lower number of parameters, these models will perform better on
Table 6: 10-class defect detection model performance on the test set and inference time with multiple batch sizes.

                                     Test set                          Inference time
Model Name            # of params    AP    AP50  AP75  APs   APm       b=1   b=8   b=16
YOLO V3               61.5 M         30.9  58.4  —     —     —         0     150   150
YOLO V4               64.3 M         24.1  46.0  —     —     —         0     120   125
YOLO V5x              87.2 M         30.1  58.1  —     —     —         5     50    56
RetinaNet50           37.9 M         26.7  51.0  18.4  20.   27.6      21    —     —
RetinaNet101          56.9 M         27.7  54.7  18.2  21.   30.7      20    —     —
EfficientDet-D0       3.8 M          34.8  65.9  22.2  25.4  35.0      35    144   224
EfficientDet-D1       6.5 M          34.9  68.0  23.7  27.3  34.4      17    111   148
EfficientDet-D2       8 M            35.2  68.0  23.0  26.0  36.5      16    88    106
EfficientDet-D3       11.9 M         34.7  69.3  25.1  25.9  34.9      13    58    66
Efficientdarkdet53    11.9 M         34.4  67.1  25.4  26.4  33.9      14    33    37
EfficientCspResdet50  23.7 M         34.1  68.1  22.5  25.7  33.6      13    35    39
EfficientResDet50     27 M           36.1  72.4  25.0  26.0  37.7      24    132   150

For EfficientDet models, the number D# stands for the ϕ coefficient of the model, and the depth and input image resolution can be acquired using equations (1)–(3). Also, the number stands for the backbone number in [53]. For other models, the same structure as EfficientDet-D1 is used except for the backbone. Note that all models are optimized for maximum AP. b denotes the batch size for inference.

Figure 10: Examples of network predictions: true positives (a, b) (green boxes denote ground truth and the others are network predictions), a false negative and a false positive as a result of IoU < 0.5 (c), an erroneous class prediction because of the similarity of the classes IP and IPD (d), and hard samples which resulted in incorrect predictions as a consequence of not meeting minimum standards (e, f).

CPUs compared with competitors. Results from Table 4, in which a single label is used for all discontinuities, suggest that a portion of false-positive detections are related to detecting correct labels. In the following, the erroneous and missed detections of the best model are elaborated upon. Also, the performance of the feature extraction backbones, both numerically and visually, is discussed.

Error analysis: based on Table 6 and the inferred images of the test set, erroneous detections of the network with the highest AP50 belong to one of these subcategories. (1) Errors as a consequence of inadequate IoU between detections and ground truth: 0.7% of the error cases of the test set belong to this category, and they are from the 3 classes IP, ESI, and GP, where 76% of the cases are ESI. Figure 10(c) shows a sample from this category. (2) False positives were mostly related to instances where a nondefect bounding box is detected that is closely similar to one of the defect classes, such that a nonexpert observer might consider it a defect; however, it does not meet minimum requirements such as length for slag inclusion, size for porosity, and other criteria to be counted as a discontinuity. Out of the 4.5% of instances lying in this category, slag inclusions and HBs had the largest normalized percentage of errors. Mostly, the sides of the weld root and also the weld toe were falsely predicted as slag inclusions. Figure 10(e) is a sample of this category. As a workaround to reduce the error rate of this type, adding a large number of similar image patches to the train set is suggested. (3) False negatives are where the network does not detect defects. With more than 12% false negatives, this group contains the largest erroneous behaviour of the model, with HBs and ESIs forming more than 55% of the normalized number of false negative detections. Figure 10(f) is a sample of a false negative. A suggested workaround is to perform online or offline hard example mining for training. Note that by lowering the minimum confidence threshold, most of these are detected correctly by the network. (4) Misclassified samples are when the network detects the object with acceptable IoU, though the class label is incorrect. A sample is shown in Figure 10(d). Finally, Figure 11(a) shows the distribution of misclassified samples from the 10-class object detection model. It shows that the HB class has the most misclassified detections and is mostly mistaken for class IC; the similarity is that both IC and HB create a hollow area in the weld root.

Comparison of backbones: although it is common for multiple discontinuity types to appear in a single patch, a part of the dataset (which includes around 80% of each set) that holds image patches with a single defect type is separated and used to evaluate feature extraction and backbone performance, and also to train a classification model. Table 5 shows the performance of the various backbones. The most accurate backbone is EfficientNet-B8, with 90.2% accuracy on the validation set. A similar training environment and optimizer with the object
Figure 11: Misclassified detected defects (% of erroneous detections per true label): (a) misclassification distribution for the 10-class object detection model and (b) the number of misclassified patches in the classification backbone (EfficientNet-b8).

[Figure columns: original image, ResNet50, CSPResNet50, EfficientNet b0, b1, b6, b8; rows: ESI, GP.]
Figure 12: Grad-CAM visualization of different backbones from their three last blocks for the classes GP and ESI.
14 Complexity

detection models are used for this purpose. Transfer learning is Future works contain test-time augmentation, model
applied and weights were originally trained on ImageNet [77]. ensemble without sacrificing real-time capability of the
Figure 12(b) shows the distribution of erroneous behavior of system, searching for optimal auto augmentation policies
the classifier. Based on Figure 12(b), most of the misclassified utilizing reinforcement learning since policies were initially
samples are related to HB. Also, erroneous detections are extracted from the COCO dataset and the nature of the weld
mostly misclassified as ISI, as shown in Figure 12(b). images is not consistent with nature of COCO dataset
Explainability using Grad-CAM: gradient-weighted class activation mapping (Grad-CAM) [79] identifies the parts of the input image that play a deterministic role in the final decision-making of the model. In Grad-CAM, instead of applying global average pooling as the ending layers [80], which requires model modification and affects network performance, back-propagation is utilized to extract feature contributions; therefore, the class activation maps are extracted precisely. In Figure 12, Grad-CAMs of various backbones with different depths are visualized, which provides local explainability for input images. The bottom image of each cell shows the final-layer Grad-CAM, and the upper images are the outputs of the second-to-last and third-from-last blocks. They show how the network gradually attends to the discriminative features of each image.
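A compact sketch of this gradient-based procedure is given below; it assumes a generic PyTorch convolutional model and a chosen target layer and is meant only to illustrate the Grad-CAM computation [79], not to reproduce the exact visualization pipeline behind Figure 12.

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    # Returns a Grad-CAM heatmap at the target layer's spatial resolution.
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output        # feature maps of the target layer

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]  # gradients w.r.t. those feature maps

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    scores = model(image.unsqueeze(0))       # image: (C, H, W) tensor
    model.zero_grad()
    scores[0, class_idx].backward()          # back-propagate the target class score

    h1.remove()
    h2.remove()

    # Weight each feature map by its spatially averaged gradient, combine, and
    # keep only the positive contributions, as in Grad-CAM [79].
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1)).squeeze(0)
    return cam / (cam.max() + 1e-8)          # normalized to [0, 1]

# Example call with a hypothetical layer name: grad_cam(model, model.blocks[-1], patch, 3)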
5. Conclusions

In this paper, a scalable and efficient family of deep models for 10-class weld quality assessment using object detection is presented. A comparative analysis of various models is performed; several critical elements of the networks, such as activation functions and hyperparameters, are explored and tuned to achieve state-of-the-art results on the dataset. Moreover, the effects of transferring object detection AutoAugment policies are surveyed. Furthermore, scenarios such as treating the problem as a classification-only task and as a defect/nondefect task are analyzed, and the models are compared with mainstream object detection models in real-time applications. Finally, model visual explainability is analyzed by employing Grad-CAM and visualizing gradient information for the target class. The results demonstrate that the models are able to infer a complete welded joint (a 15360 × 1024 resolution X-ray image) in 385 milliseconds. Although the classification task outperforms the object detection models, localization of the defect (whether it lies on the root pass, fill pass, or cover pass) is necessary for further indexing of the weld, pass or rejection decisions, and optimization of the welding operation.
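Processing a full 15360 × 1024 radiograph within this time budget relies on splitting the strip into detector-sized patches and mapping the resulting boxes back to weld coordinates; the sketch below illustrates such a tiling step, with the patch width and stride chosen only for illustration.

import numpy as np

PATCH_W, STRIDE = 1024, 1024  # illustrative tiling of the 15360 x 1024 strip

def tile_weld_image(image: np.ndarray):
    # Split a full weld radiograph (H x W array) into fixed-width patches.
    h, w = image.shape[:2]
    for x in range(0, w, STRIDE):
        yield x, image[:, x:x + PATCH_W]  # keep the x offset to shift boxes back

# Detections from each patch would be offset by x and merged (e.g., with
# non-maximum suppression) to index defects along the whole weld.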
Traditional computer vision techniques for weld defect detection require several critical preprocessing steps, resulting in a nonrobust outcome, or human intervention is needed. In contrast, automatic feature extraction approaches and deep learning-based methods require minimal human intervention or preprocessing to achieve state-of-the-art results. The models presented here can be used as assistive defect-recognition systems to facilitate robust defect localization and classification and to reduce both human workload and error. Finally, as experts may have conflicting and personal performance in particular defect detection, the provided deep models may be trained on specific samples and predict defects with a consolidated standard, which can also be helpful in training experts.

Future work includes test-time augmentation, model ensembling without sacrificing the real-time capability of the system, and searching for optimal AutoAugment policies using reinforcement learning, since the current policies were originally extracted from the COCO dataset and the nature of weld images is not consistent with that of COCO images. In addition, over time, more samples will be gathered from various sites in different parts of the world, and the dataset will expand in both the number of classes and the number of instances per class.

Data Availability

All open-source implementations used in this paper are referenced in the main body of the article. However, the remaining implementations and the dataset are part of ongoing research and proprietary to Stanley Black & Decker, USA.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

This work was done primarily with the help and support from Michael George, Jeremy Guretzki, Matthew Nelson, Jason Miller, William Aston, Jake Smith, Haresh Ghansyam, Pete Morris, Prashanth Tirumalaseti, Adam Wynne Hughes, and Shengnan Wang, as well as great support from Dr. Mark Maybury and Dr. Manish Mehta from the office of the CTO at Stanley Black and Decker. This research was conducted at Lamar University and was fully funded by the Artificial Intelligence Lab, Stanley Oil & Gas, Stanley Black & Decker, USA.

References

[1] R. Vilara, "An automatic system of classification of weld defects in radiographic images," NDT & E International, vol. 42, no. 5, pp. 467–476, 2009.
[2] M.-M. Naddaf-Sh, S. Naddaf-Sh, H. Zargarzadeh et al., "Next-generation of weld quality assessment using deep learning and digital radiography," in Proceedings of the AAAI Spring Symposium Series, Palo Alto, CA, USA, March 2020.
[3] M.-M. Naddaf-Sh, S. Naddaf-Sh, H. Zargarzadeh et al., "Defect detection and classification in welding using deep learning and digital radiography," in Fault Diagnosis and Prognosis Techniques for Complex Engineering Systems, pp. 327–352, 2021.
[4] A. Shahriar, R. Sadiq, and S. Tesfamariam, "Risk analysis for oil & gas pipelines: a sustainability assessment approach using fuzzy based bow-tie analysis," Journal of Loss Prevention in the Process Industries, vol. 25, no. 3, pp. 505–523, 2012.
[5] Z. Rui, G. Han, H. Zhang, S. Wang, H. Pu, and K. Ling, "A new model to evaluate two leak points in a gas pipeline," Journal of Natural Gas Science and Engineering, vol. 46, pp. 491–497, 2017.
[6] L. T. Popoola, A. S. Grema, G. K. Latinwo, B. Gutti, and A. S. Balogun, "Corrosion problems during oil and gas production and its mitigation," International Journal of Integrated Care, vol. 4, no. 1, pp. 1–15, 2013.

[7] S. M. Anouncia and R. Saravanan, "Non-destructive testing using radiographic images: a survey," Insight-Non-Destructive Testing and Condition Monitoring, vol. 48, no. 10, pp. 592–597, 2006.
[8] X. Dong, C. J. Taylor, and T. F. Cootes, "Small defect detection using convolutional neural network features and random forests," Lecture Notes in Computer Science, vol. 11132, pp. 398–412, 2019.
[9] A. Shukla and H. Karki, "Application of robotics in offshore oil and gas industry-a review part II," Robotics and Autonomous Systems, vol. 75, pp. 508–524, 2016.
[10] L. Yang, Y. Liu, and J. Peng, "Advances techniques of the structured light sensing in intelligent welding robots: a review," International Journal of Advanced Manufacturing Technology, vol. 110, pp. 1–20, 2020.
[11] S. Habibian, M. Dadvar, B. Peykari et al., "Design and implementation of a maxi-sized mobile robot (karo) for rescue missions," ROBOMECH Journal, vol. 8, 2021.
[12] M. Dadvar and S. Habibian, "Contemporary research trends in response robotics," 2021, https://arxiv.org/abs/2105.07812.
[13] Q. Ma, G. Tian, Y. Zeng et al., "Pipeline in-line inspection method, instrumentation and data management," Sensors, vol. 21, no. 11, 2021.
[14] W. Hou, Y. Wei, J. Guo, Y. Jin, and C. Zhu, "Automatic detection of welding defects using deep neural network," Journal of Physics: Conference Series, vol. 933, no. 1, 2018.
[15] X. Dong, C. J. Taylor, and T. F. Cootes, "A random forest-based automatic inspection system for aerospace welds in x-ray images," IEEE Transactions on Automation Science and Engineering, pp. 1–14, 2020.
[16] D. Mery and M. A. Berti, "Automatic detection of welding defects using texture features," Insight-Non-Destructive Testing and Condition Monitoring, vol. 45, no. 10, pp. 676–681, 2003.
[17] J. Kumar, R. Anand, and S. Srivastava, "Multi-class welding flaws classification using texture feature for radiographic images," in Proceedings of the International Conference on Advances in Electrical Engineering (ICAEE), pp. 1–4, Vellore, India, 2014.
[18] J. Kumar, R. S. Anand, and S. P. Srivastava, "Flaws classification using ann for radiographic weld images," in Proceedings of the 2014 International Conference on Signal Processing and Integrated Networks (SPIN), pp. 145–150, Noida, India, February 2014.
[19] J. Hassan, A. M. Awan, and A. Jalil, "Welding defect detection and classification using geometric features," in Proceedings of the 2012 10th International Conference on Frontiers of Information Technology, pp. 139–144, Islamabad, Pakistan, December 2012.
[20] O. Zahran, H. Kasban, M. El-Kordy, and F. E. A. El-Samie, "Automatic weld defect identification from radiographic images," NDT & E International, vol. 57, pp. 26–35, 2013.
[21] T. Y. Lim, M. M. Ratnam, and M. A. Khalid, "Automatic classification of weld defects using simulated data and an mlp neural network," Insight-Non-Destructive Testing and Condition Monitoring, vol. 49, no. 3, pp. 154–159, 2007.
[22] J. Zapata, R. Vilar, and R. Ruiz, "Performance evaluation of an automatic inspection system of weld defects in radiographic images based on neuro-classifiers," Expert Systems with Applications, vol. 38, no. 7, pp. 8812–8824, 2011.
[23] I. Valavanis and D. Kosmopoulos, "Multiclass defect detection and classification in weld radiographic images using geometric and texture features," Expert Systems with Applications, vol. 37, no. 12, pp. 7606–7614, 2010.
[24] W. Hou, D. Zhang, Y. Wei, J. Guo, and X. Zhang, "Review on computer aided weld defect detection from radiography images," Applied Sciences, vol. 10, no. 5, p. 1878, 2020.
[25] M. Carrasco and D. Mery, "Segmentation of welding defects using a robust algorithm," Materials Evaluation, vol. 62, no. 11, pp. 1142–1147, 2004.
[26] D. Mery, "Automated detection of welding discontinuities without segmentation," Materials Evaluation, vol. 69, no. 6, pp. 656–663, 2011.
[27] M. Ben Gharsallah and E. Ben Braiek, "Weld inspection based on radiography image segmentation with level set active contour guided off-center saliency map," Advances in Materials Science and Engineering, vol. 2015, Article ID 871602, 10 pages, 2015.
[28] C. Ajmi, S. E. Ferchichi, and K. Laabidi, "New procedure for weld defect detection based-gabor filter," in Proceedings of the 2018 International Conference on Advanced Systems and Electric Technologies (ICASET), pp. 11–16, Hammamet, Tunisia, March 2018.
[29] O. Ronneberger, P. Fischer, and T. Brox, "U-net: convolutional networks for biomedical image segmentation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Munich, Germany, October 2015.
[30] W. Hou, Y. Wei, Y. Jin, and C. Zhu, "Deep features based on a dcnn model for classifying imbalanced weld flaw types," Measurement, vol. 131, pp. 482–489, 2019.
[31] C. Ajmi, J. Zapata, S. Elferchichi, A. Zaafouri, and K. Laabidi, "Deep learning technology for weld defects classification based on transfer learning and activation features," Advances in Materials Science and Engineering, vol. 2020, Article ID 1574350, 16 pages, 2020.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[33] D. Mery, V. Riffo, U. Zscherpel et al., "GDXray: the database of X-ray images for nondestructive testing," Journal of Nondestructive Evaluation, vol. 34, no. 4, pp. 1–12, 2015.
[34] R. Miao, Z. Jiang, Q. Zhou et al., "Online inspection of narrow overlap weld quality using two-stage convolution neural network image recognition," Machine Vision and Applications, vol. 32, no. 1, pp. 1–14, 2021.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015, https://arxiv.org/abs/1409.1556.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015, https://arxiv.org/abs/1512.03385.
[37] Q. Wang, W. Jiao, P. Wang, and Y. Zhang, "A tutorial on deep learning-based data analytics in manufacturing through a welding case study," Journal of Manufacturing Processes, vol. 63, pp. 2–13, 2021.
[38] A. Paszke, S. Gross, F. Massa et al., "Pytorch: an imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., Red Hook, NJ, USA, 2019.
[39] D. Mery and C. Arteta, "Automatic defect recognition in x-ray testing using computer vision," in Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1026–1035, Santa Rosa, CA, USA, March 2017.

[40] S. J. Oh, M. J. Jung, C. Lim, and S. C. Shin, "Automatic detection of welding defects using faster R-CNN," Applied Sciences, vol. 10, no. 23, pp. 1–10, 2020.
[41] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, pp. 91–99, 2015.
[42] R. Guo, H. Liu, G. Xie, and Y. Zhang, "Weld defect detection from imbalanced radiographic images based on contrast enhancement conditional generative adversarial network and transfer learning," IEEE Sensors Journal, 2021.
[43] L. Yang, H. Wang, B. Huo, F. Li, and Y. Liu, "An automatic welding defect location algorithm based on deep learning," NDT & E International, vol. 120, Article ID 102435, 2021.
[44] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, https://arxiv.org/abs/1411.1784.
[45] T. Gantala and K. Balasubramaniam, "Automated defect recognition for welds using simulation assisted tfm imaging with artificial intelligence," Journal of Nondestructive Evaluation, vol. 40, no. 1, pp. 1–24, 2021.
[46] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: optimal speed and accuracy of object detection," 2020, https://arxiv.org/abs/2004.10934.
[47] Y. Yan, D. Liu, B. Gao, G. Y. Tian, and Z. C. Cai, "A deep learning-based ultrasonic pattern recognition method for inspecting girth weld cracking of gas pipeline," IEEE Sensors Journal, vol. 20, no. 14, pp. 7997–8006, 2020.
[48] J. Hu, W. Xu, B. Gao et al., "Pattern deep region learning for crack detection in thermography diagnosis system," Metals, vol. 8, no. 8, 2018.
[49] American Petroleum Institute, API 5L: Specification for Line Pipe, American Petroleum Institute, Washington, NJ, USA, 2004.
[50] American Petroleum Institute, API 1104: Standard for Welding of Pipelines and Related Facilities, American Petroleum Institute, Washington, NJ, USA, 2001.
[51] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "Nas-fpn: learning scalable feature pyramid architecture for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045, Long Beach, CA, USA, June 2019.
[52] M. Tan, R. Pang, and Q. V. Le, "Efficientdet: scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790, Seattle, WA, USA, June 2020.
[53] M. Tan and Q. Le, "Efficientnet: rethinking model scaling for convolutional neural networks," in Proceedings of the International Conference on Machine Learning, pp. 6105–6114, Long Beach, CA, USA, June 2019.
[54] J. Redmon and A. Farhadi, "Yolov3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[55] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[56] S. Naddaf-Sh, M.-M. Naddaf-Sh, A. R. Kashani, and H. Zargarzadeh, "An efficient and scalable deep learning approach for road damage detection," in Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), pp. 5602–5608, Atlanta, GA, USA, December 2020.
[57] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, Honolulu, HI, USA, July 2017.
[58] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768, Salt Lake City, UT, USA, June 2018.
[59] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," 2017, https://arxiv.org/abs/1712.04621.
[60] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "Autoaugment: learning augmentation strategies from data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123, Long Beach, CA, USA, June 2019.
[61] B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, "Learning data augmentation strategies for object detection," in Proceedings of the European Conference on Computer Vision, pp. 566–583, Glasgow, UK, August 2020.
[62] R. Padilla, S. L. Netto, and E. A. B. da Silva, "A survey on performance metrics for object-detection algorithms," in Proceedings of the 2020 International Conference on Systems, Signals and Image Processing, pp. 237–242, 2020.
[63] "Coco detection challenge (bounding box)," 2021, https://cocodataset.org/#detection-eval.
[64] J. Redmon and A. Farhadi, "Yolo9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, Honolulu, HI, USA, July 2017.
[65] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, Venice, Italy, October 2017.
[66] A. Howard, M. Sandler, G. Chu et al., "Searching for Mobilenetv3," 2019, https://arxiv.org/abs/1905.02244.
[67] C.-Y. Wang, H.-Y. M. Liao, I.-H. Yeh, Y.-H. Wu, P.-Y. Chen, and J.-W. Hsieh, "Cspnet: a new backbone that can enhance learning capability of Cnn," 2019, https://arxiv.org/abs/1911.11929.
[68] G. Jocher, A. Stoken, J. Borovec et al., "Ultralytics/yolov5: v5.0-YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations," 2021, https://zenodo.org/record/4679653#.YTmqdrAzbIU.
[69] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," 2019, https://github.com/facebookresearch/detectron2.
[70] EfficientDet (a PyTorch implementation of EfficientDet), 2020.
[71] I. Loshchilov and F. Hutter, "Sgdr: stochastic gradient descent with warm restarts," 2016, https://arxiv.org/abs/1608.03983.
[72] "Exponential moving average," 2021, https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage.
[73] "NVIDIA Apex optimizers," 2021, https://nvidia.github.io/apex/optimizers.html.
[74] D. Hendrycks and K. Gimpel, "Gaussian error linear units (gelus)," 2020, https://arxiv.org/abs/1606.08415.
[75] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for activation functions," 2017, https://arxiv.org/abs/1710.05941.
[76] D. Misra, "Mish: a self regularized non-monotonic neural activation function," 2019, https://arxiv.org/abs/1908.08681.
[77] R. Wightman, "Pytorch image models," 2019, https://github.com/rwightman/pytorch-image-models.
[78] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," 2016, https://arxiv.org/abs/1506.02640.

[79] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, Venice, Italy, October 2017.
[80] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, Las Vegas, NV, USA, June 2016.
