Attention-Guided Multitask Learning For Surface Defect Identification

The document presents a novel framework called Defect-Aux-Net for surface defect identification in industrial quality control, leveraging multitask learning and attention mechanisms to enhance performance despite limited data. The proposed architecture integrates classification, segmentation, and detection tasks, achieving high accuracy and efficiency in recognizing defects at various scales. Experimental results demonstrate significant improvements over existing models, making it suitable for real-time applications in automated visual inspection systems.

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 19, NO. 9, SEPTEMBER 2023 9713

Attention-Guided Multitask Learning for Surface Defect Identification

Vignesh Sampath, Iñaki Maurtua, Juan José Aguilar Martín, Andoni Rivera, Jorge Molina, and Aitor Gutierrez

Abstract—Surface defect identification is an essential task in the industrial quality control process, in which visual checks are conducted on a manufactured product to ensure that it meets quality standards. The convolutional neural network (CNN)-based surface defect identification method has proven to outperform traditional image processing techniques. However, the real-world surface defect datasets are limited in size due to the expensive data generation process and the rare occurrence of defects. To address this issue, this article presents a method for exploiting auxiliary information beyond the primary labels to improve the generalization ability of surface defect identification tasks. Considering the correlation between pixel-level segmentation masks, object-level bounding boxes, and global image-level classification labels, we argue that jointly learning features of the related tasks can improve the performance of surface defect identification tasks. This article proposes a framework named Defect-Aux-Net, based on multitask learning with attention mechanisms that exploit the rich additional information from related tasks with the goal of simultaneously improving robustness and accuracy of the CNN-based surface defect identification. We conducted a series of experiments with the proposed framework. The experimental results showed that the proposed method can significantly improve the performance of state-of-the-art models while achieving an overall accuracy of 97.1%, Dice score of 0.926, and mean average precision of 0.762 on defect classification, segmentation, and detection tasks.

Index Terms—Deep learning, defect classification, defect detection, defect segmentation, machine vision, multitask learning (MTL), quality control, surface defect detection.

Manuscript received 3 January 2022; revised 20 June 2022 and 25 September 2022; accepted 22 December 2022. Date of publication 4 January 2023; date of current version 24 July 2023. This work was supported in part by DIGIMAN4.0 Project (“Digital Manufacturing Technologies for Zero-defect”, https://www.digiman4-0.mek.dtu.dk/) which is a European Training Network supported by Horizon 2020, the EU Framework Programme for Research and Innovation under Project 814225, and in part by the ELKARTEK Project KK-2020/00049 3KIA of the Basque Government. Paper no. TII-21-5940. (Corresponding author: Vignesh Sampath.)
Vignesh Sampath, Iñaki Maurtua, Andoni Rivera, Jorge Molina, and Aitor Gutierrez are with the Autonomous and Intelligent Systems Unit, Tekniker, 20600 Gipuzkoa, Spain (e-mail: [email protected]; [email protected]; [email protected]; jorge.molina@tekniker.es; [email protected]).
Juan José Aguilar Martín is with the Department of Design and Manufacturing Engineering, University of Zaragoza, 50009 Zaragoza, Spain (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TII.2023.3234030.
Digital Object Identifier 10.1109/TII.2023.3234030
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

Automated visual inspection plays an important role in industrial-informatics-based decision-making systems in various industries, including steel manufacturing companies, automotive industries, electronic manufacturing, and pharmaceutical companies. The correct, consistent, and early detection of surface defects can make it possible to detect defective products early in the manufacturing process, which leads to time and cost savings. Inspection procedures for detecting such defects are usually performed using nondestructive testing (NDT) methods. NDT procedure is a combination of various inspection steps used to identify discontinuities or defects in a product without causing damage to its usability. The most frequently used industrial NDT methods are visual optic testing, radiography, X-ray vision, ultrasonic imaging, dye penetrant testing, magnetic particle testing, and infrared thermal imaging. The testing procedure for each of these methods involves several steps, all of which can be easily automated. However, the final step of visual inspection is more complex in terms of automation and remains primarily a manual process performed by operators.
The traditional machine-vision system relies on hand-crafted features, such as color, contrast, texture, edges, foreground–background statistics, etc., followed by machine learning classifiers, such as support vector machines, decision tree, or K-nearest neighbors. Consequently, hand-crafted feature extraction plays an important role in classical approaches. However, these features are not robust and suited for different tasks, which leads to long development cycles. Deep learning methods, on the other hand, learn the relevant features directly from the raw data, without the need for handcrafted feature representations. In recent years, convolutional neural network (CNN) has achieved and even surpassed human-level performance on computer vision tasks such as image classification. The key difference between CNN and traditional machine-vision algorithms is that CNN automatically detects significant features without any human supervision, which made it the most widely used. A fascinating feature of CNN is its ability to take advantage of the spatial or temporal correlation of image data. There are three main problem categories for image recognition tasks using CNN: 1) classification, 2) segmentation, and 3) object detection. The classification task aims to classify an image into a certain category. Starting with the ImageNet Large Scale Visual Recognition Challenge winning architecture of AlexNet [1], a series of increasingly complex architectures including ResNet [2],
Inception [3], DenseNet [4], and EfficientNet [5] have been proposed in the literature for the classification task. Object detection is a task that localizes an object using a bounding box. Notable object detection algorithms include Fast R-CNN [6], Faster R-CNN, Mask R-CNN [7], single shot detection (SSD) [8], and You Only Look Once (YOLO) [9]. Segmentation is the task of performing pixel-by-pixel classification. Several segmentation algorithms have been proposed in the literature, including fully convolutional networks, encoder–decoder-based approaches [10], and multiscale and pyramid architectures [11].
However, industrial visual inspection systems have barely utilized the potential of those complex architectures, for several reasons [12]. One of the main reasons is that the continuous improvement in industrial processes has left only a very limited number of defective samples [13]. This problem of learning from a limited number of samples is usually referred to as the small sample problem, which can easily lead to poor generalization ability of the trained model [14]. In addition, the target surface defects have different scales, which makes it even more challenging for deep learning models to identify small defects. On the one hand, the visual appearance of real-world surface defects varies with the type of material, imaging conditions, and camera position. On the other hand, it is challenging to distinguish tiny defects from the noise or non-defect components within an image (as shown in Fig. 1). Hence, the appearance of false positives in a defect-free image is an inevitable circumstance. Furthermore, real-time applications of complex CNN models are extremely limited due to the long inference time and the resulting higher computational resource and power consumption.
Fig. 1. Magnetic particle inspection on threaded fasteners of different surface finish (TekErreka dataset). Surface defects are marked by red circles and noise due to magnetic particle depositions are marked in yellow.

To address these limitations, we present a novel universal architecture that integrates classification, segmentation, and detection of surface defects in a single network. Our architecture, Defect-Aux-Net, is primarily motivated by a multitask learning (MTL) scheme that exploits useful information from related learning tasks to help mitigate the problem of data scarcity. The proposed architecture is based on FPN-semantic-segmentation [11] with the additional tasks of defect classification and detection to improve the generalization ability by utilizing the image-level information as an inductive bias. Specifically, we developed a new MTL network based on FPN, where the classification task is carried out in the bottom-up pathway of the network and segmentation is performed in the top-down pathway of the network. To create a bounding box, we employ two subnetworks in the top-down pathway, where one subnet determines the class associated with the bounding box and the other performs the regression to adjust the bounding box position.
The FPN-based feature extractor in the proposed network allows surface defects to be recognized at vastly different scales by efficiently sharing features between image regions. We further introduce the positional and the channel attention mechanisms that focus on learning the features of small surface defects to improve the robustness of detecting small defects surrounded by a complex background.
We evaluate our model on the TekErreka and Severstal [15] surface defect datasets, with defect classification, segmentation, and detection tasks. Experimental results demonstrate that jointly learning features of related tasks can improve the performance of all tasks.
Overall, the contributions of our work are as follows.
1) First, we propose a Defect-Aux-Net model architecture, which can perform classification, segmentation, and detection of surface defects in a single network. Compared with the existing state-of-the-art CNN models, this architecture is lightweight and compact in terms of model parameters. From the model training point of view employing fewer parameters in the architecture enables the model to efficiently learn potential surface defects from a smaller number of labeled examples.
2) In contrast to existing single-task learning, our proposed MTL in surface defect detection facilitates the model to learn useful representations of the data by exploiting shared information from related tasks.
3) Considering surface defect detection with a complex background, the positional and the channel attention mechanisms are incorporated to amplify target features and to reduce the influence of background noise.
SAMPATH et al.: ATTENTION-GUIDED MULTITASK LEARNING FOR SURFACE DEFECT IDENTIFICATION 9715

4) The proposed model is compact and efficient, with state-of-the-art performance, and meets the computational resource requirements of real-time inference.

II. RELATED WORK

A large and growing body of literature has explored the use of CNN for surface defect identification. Kim et al. [16] adopted a few-shot learning technique with a Siamese neural network using CNN, which aims to classify surface defects with a limited number of training images. Lin et al. [17] employed a class activation mapping technique in CNN to simultaneously achieve defect classification and localization tasks in the LED chip defect inspection process. Tao et al. [18] designed a cascaded autoencoder (CASAE) architecture to segment and localize defect regions. The proposed architecture transforms the input image into a mask prediction, and then, the defect regions of the segmented mask are classified into their specific classes. Jing et al. [19] combined an autoencoder with a fully connected network to distinguish keyboard light leakage defects from mere dust. Jian et al. [20] leveraged a generative adversarial network to exaggerate the tiny defects within the images to improve the accuracy of different classifiers. Zheng et al. [21] proposed a three-stage model for rail surface and fastener defect detection. In the first stage, the YOLOv5 framework is employed to localize the rail and fasteners. Then, an object detection model based on Mask R-CNN is used to detect the surface defects of the rail surface. At the final stage, the ResNet architecture is utilized to classify defects of the fasteners. To detect defects at different scales, Xu et al. [22] used a pretrained ResNet model to extract the multiscale features and fuse them using a multilevel feature fusion network. In [23], U-Net and residual U-Net architectures were used for the fine-grained segmentation of surface defects on a steel sheet. The main drawback of these methods is that the model needs a large amount of annotated data and hence the localization of defects is very coarse in the real-time scenario.

III. PROPOSED METHOD

A. Network Architecture
Our proposed network is inspired by two deep learning architectures that are widely used: 1) feature pyramid network (FPN) and 2) ResNet-50. Recognizing surface defects at vastly different scales is a fundamental challenge in the industrial machine vision system. For this reason, we use FPN, which uses a pyramidal hierarchy of convolutional filters to extract feature pyramids at different scales. FPN consists of two pathways: 1) bottom-up and 2) top-down. The bottom-up pathway, also known as the encoder, is the typical CNN, which can be any image classifier for feature extraction. As we go up, the encoder gradually decreases the spatial resolution while building high-level feature maps. The top-down pathway is connected to the bottom-up pathway through lateral connections for efficient multiscale feature fusion. It is designed to enhance the feature maps from the bottom-up pathway and build semantically strong feature maps at multiple scales by double upscaling. As a result, the feature pyramid has rich semantics at all levels because the lower semantic features are interconnected to the higher semantics.
1) Bottom-Up Pathway: We tested several standard image classification architectures to select the core model and finally chose ResNet-50 as the backbone. ResNet-50 has shown great performance for surface defect classification, segmentation, and detection tasks. The ResNet-50 architecture has the advantage of using a stride of two for each scale reduction, which makes it easier to incorporate ResNet-50 into FPNs when we need to upscale feature maps in a top-down pathway. Furthermore, ResNet-50 is a relatively small network by modern standards; therefore, it is suitable for our limited labeled data problem. However, existing ResNet-50 feature pyramids have two problems in the way they apply convolution operations to the input features. First, the receptive field of the encoder has information only about the local region, so the global information is lost. Second, the feature maps constructed from the learned weights are given an equal magnitude of importance, but some feature maps are more important for the next layers than others. For instance, a feature map that contains edge information of the defects might be more important than another feature map that has background texture information (as shown in Fig. 2). Thus, to incorporate channel attention, we adopt the squeeze-and-excitation (SE) module [24] in the encoder. The SE module consists of three components: 1) squeeze, 2) excite, and 3) scale.

Fig. 2. Sample features in different channels of top-down pathway at stage 3.

The main goal of the squeeze component is to extract global information from each of the channels $c$ in a feature block $U$. The global information is acquired by applying a global average pooling operation across the spatial dimensions ($H \times W$) of each channel $U_c$ of $U$ to obtain global statistics ($1 \times 1 \times C$). Mathematically, the squeeze operation can be represented as

$$z_c = F_{\text{squeeze}}(U_c) = \frac{1}{H \times W} \sum_{m=1}^{H} \sum_{n=1}^{W} U_c(m, n). \tag{1}$$

After obtaining global information from the squeeze component, the excite component generates a set of weights for each channel. It uses a fully connected multilayer perceptron (MLP) bottleneck structure to dynamically calibrate the weights. This
MLP bottleneck has two fully connected layers with sigmoid activation as the output layer. The output of the excitation component can formally be represented by the following equation:

$$s = F_{\text{excite}}(z, W) = \sigma(g(z, W)) = \sigma(W_2\, \rho(W_1, z)) \tag{2}$$

where $\sigma$ is the sigmoid operation, $\rho$ is the ReLU operation, $z$ is the output from the squeeze component, and $W_1$ and $W_2$ refer to the weights of the two fully connected layers. Subsequently, each channel in the feature map is scaled by a simple elementwise multiplication of the input feature map and the weights obtained from the excite component (as shown in Fig. 3).

Fig. 3. Structure of squeeze and excite module.

Surface defects only appear in some parts of the image but not the whole image. Unlike the conventional ResNet-50 architecture, which gives equal importance to each region in an image, the spatial attention reduces background interference by assigning a weight to each pixel in the feature map.
The spatial attention focuses on the most relevant parts of the feature maps in the spatial dimension. The working principle of our spatial attention mechanism is as follows.
Given feature block $U$, we use average and max-pooling operations along the channel axis and concatenate them to generate an efficient feature map summary $M$. A convolutional layer followed by a sigmoid operation is then applied to $M$ to produce a spatial attention map (as shown in Fig. 4).

Fig. 4. Structure of spatial attention module.

ResNet uses four modules consisting of residual blocks, each of which uses two kinds of blocks, 1) identity (ID) blocks and 2) convolution blocks, depending on whether the input and output dimensions are the same or different. We arrange the SE and spatial attention (SA) modules in series and integrate them into a residual block (as shown in Fig. 5).

Fig. 5. FPN bottom-up structure with attention module.

2) Top-Down Pathway: Deep features from the bottom-up pathway are upsampled by convolutions and bilinear upsampling operations until all the feature maps reach one-fourth scale. Attention module outputs from the bottom-up pathway $\{C_2, C_3, C_4, C_5\}$ are fused into the top-down pathway through lateral connections for efficient multiscale feature fusion. First, a $1 \times 1$ convolutional filter is applied to the feature maps $\{C_2, C_3, C_4, C_5\}$ to get a fixed number of channels, and the result is then merged with the corresponding top-down feature map by elementwise addition. Finally, the outputs are summed and then transformed into a pixelwise output (as shown in Fig. 6).
3) Segmentation Branch: The segmentation branch from the top-down pathway aims at classifying pixels into a set of predefined classes. The pixels corresponding to background are far more numerous than the pixels of surface defects in the real-world dataset, which causes the model to be biased toward the background element. To address the pixelwise class imbalance, we employ Dice loss, which uses the Dice coefficient to calculate the overlap of the pixels of the predicted mask with the ground truth label. Mathematically, the Dice loss function is defined as

$$L_{\text{seg}} = 1 - \frac{2 \sum_i y_i \hat{y}_i + 1}{\sum_i y_i + \sum_i \hat{y}_i + 1} \tag{3}$$

where $y_i$ is the ground truth label and $\hat{y}_i$ is the predicted label of the $i$th pixel. The value of the Dice coefficient ranges from 0 to 1, where 1 indicates the perfect and complete overlap of pixels.
4) Classification Branch: The output of the bottom-up pathway encodes the rich abstract feature representations of the input image. Hence, we utilize the spatial average of the feature maps from the bottom-up pathway via a global average pooling layer, and then, the resulting feature vector is fed into the sigmoid or softmax layer depending on the classification type. We employ binary cross-entropy (BCE) as the classification loss function. Mathematically, our classification loss is defined as

$$L_{\text{class}} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{CE}(y_i, \hat{y}_i) \tag{4}$$

where $y_i$ is the ground truth label and $\hat{y}_i$ is the predicted label of the $i$th sample, and $k$ is the total number of samples. CE is the binary cross-entropy function.
5) Object Detection Branch: We extract bounding boxes and their associated classes by employing box regression and classification subnets at each level of the top-down pathway. The classification subnet predicts the probability of defect presence at each spatial location of an input image. The box regression subnet is attached to the top-down pathway in parallel to the classification subnet for the purpose of regressing the offset from each anchor box to the ground truth bounding boxes. To handle class imbalance problems, we adopt focal loss [25], an improved version of cross entropy that focuses learning on hard negative examples. It is defined as

$$L_{\text{detection}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{5}$$

where $\alpha_t$ is the weight parameter per class and $\gamma$ is the hyperparameter that focuses learning on hard negative samples. We choose $\alpha_t = 0.25$ and $\gamma = 4$ as suggested in [26].
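As an illustration of the channel and spatial attention steps described in the bottom-up pathway, the following NumPy sketch applies the squeeze, excite, and scale operations of (1) and (2) and then a spatial attention map to a single feature block. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the reduction ratio `r`, the 7 × 7 kernel size, the omission of bias terms, and the random weights are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(U, W1, W2):
    """Channel attention on a feature block U of shape (H, W, C)."""
    z = U.mean(axis=(0, 1))                      # squeeze: global average pool -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # excite: FC -> ReLU -> FC -> sigmoid
    return U * s                                 # scale: one weight per channel

def spatial_attention(U, kernel):
    """Spatial attention on U of shape (H, W, C); kernel has shape (k, k, 2), k odd."""
    # Summary M: channel-wise average and max, concatenated -> (H, W, 2)
    M = np.stack([U.mean(axis=2), U.max(axis=2)], axis=2)
    k = kernel.shape[0]
    pad = k // 2
    Mp = np.pad(M, ((pad, pad), (pad, pad), (0, 0)))  # zero padding keeps H x W
    H, W, _ = U.shape
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Plain (unoptimized) convolution over the two summary channels
            logits[i, j] = np.sum(Mp[i:i + k, j:j + k, :] * kernel)
    A = sigmoid(logits)                          # attention map, one weight per pixel
    return U * A[:, :, None]

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
U = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C // r, C)) * 0.1      # bottleneck: C -> C / r
W2 = rng.standard_normal((C, C // r)) * 0.1      # expansion: C / r -> C
kernel = rng.standard_normal((7, 7, 2)) * 0.1

# SE followed by spatial attention, applied in series
out = spatial_attention(squeeze_excite(U, W1, W2), kernel)
print(out.shape)
```

Because both attention weights lie strictly between 0 and 1, each output value has a magnitude no larger than the corresponding input value; the modules only rescale features and never change the shape of the block.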
Fig. 6. Overview of the proposed Defect-Aux-Net architecture. It is mainly composed of classification, segmentation, and detection modules that incorporate a multitask loss function.

B. Loss Function
Our proposed method combines three loss functions from the classification, segmentation, and detection tasks, which provide mutual sources of inductive bias for each task. Specifically, the segmentation and detection loss functions signal back to the entire model (bottom-up and top-down pathway) while the classification loss signals back only to bottom-up pathway. We combine and weight the three losses into a multitask loss $L_M$ to leverage the heterogeneous annotations and jointly optimize multiple tasks as follows:

$$L_M = \beta L_{\text{class}} + \beta_1 L_{\text{seg}} + \beta_2 L_{\text{detection}} \tag{6}$$

where $\beta$, $\beta_1$, and $\beta_2$ are weight parameters. We tested with different combinations of weight parameters and found that $\beta = \beta_1 = \beta_2 = 1$ yields the best result for all the tasks.
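The weighted combination of the three task losses can be sketched in NumPy as follows. This is an illustrative sketch only, not the authors' TensorFlow code: the function names, the clipping constant `EPS`, the mean reduction over elements, the use of $1 - \alpha$ as the weight for the negative class in the focal loss, and the toy arrays are assumptions introduced for the example. The default `gamma=4.0` and `alpha=0.25` mirror the values stated in the text.

```python
import numpy as np

EPS = 1e-7  # hypothetical clipping constant to keep log() finite

def bce(y, p):
    """Binary cross-entropy averaged over samples (classification loss)."""
    p = np.clip(p, EPS, 1 - EPS)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def dice_loss(y, p, smooth=1.0):
    """Dice loss with a +1 smoothing term, summed over all pixels."""
    inter = np.sum(y * p)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(y) + np.sum(p) + smooth))

def focal_loss(y, p, alpha=0.25, gamma=4.0):
    """Focal loss; the mean reduction is an assumption of this sketch."""
    p = np.clip(p, EPS, 1 - EPS)
    p_t = np.where(y == 1, p, 1 - p)                 # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)     # per-class weight (assumed form)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

def multitask_loss(l_class, l_seg, l_det, beta=1.0, beta1=1.0, beta2=1.0):
    """Weighted sum of the three task losses."""
    return beta * l_class + beta1 * l_seg + beta2 * l_det

# Toy labels and predictions, for illustration only
y_cls = np.array([1.0, 0.0, 1.0, 0.0])
p_cls = np.array([0.9, 0.2, 0.7, 0.1])
mask_true = np.array([[0.0, 1.0], [1.0, 1.0]])
mask_pred = np.array([[0.1, 0.8], [0.7, 0.9]])
y_det = np.array([1, 0, 0, 0])
p_det = np.array([0.6, 0.3, 0.1, 0.2])

total = multitask_loss(bce(y_cls, p_cls),
                       dice_loss(mask_true, mask_pred),
                       focal_loss(y_det, p_det))
print(total)
```

With all three weights set to 1, the multitask loss reduces to a plain sum, so no task is favored by construction; any imbalance between tasks then comes only from the scale of the individual losses.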

IV. EXPERIMENTS
A. Datasets
In this article, we evaluate our framework on real-world surface defect identification problems. We use two challenging datasets with increasing resolutions and complexities, 1) Severstal steel sheet [15] and 2) TekErreka steel fastener defect datasets. Severstal, the largest steel and steel-related mining company, has recently published the largest industrial steel sheet surface defect dataset, which contains pixelwise masks annotated by their technical experts. The dataset contains 12 568 grayscale images of size 1600×256. Each image in the dataset has the possibility of having either no defects, a single defect, or multiple defects divided into four classes. Fig. 7 shows the example of steel defect images on Severstal datasets. We randomly select 10% and 20% of the 12 568 original images as the validation and test data. The main challenge with this dataset is that the interclass similarities between defective and defect-free examples are very high.

Fig. 7. Sample images of Severstal steel with four classes of defect.

The TekErreka dataset is a self-collected steel fastener surface defect dataset based on a magnetic particle inspection procedure. The magnetic particle inspection is an excellent method to investigate near-surface defects in steel fasteners. The basic principle is to magnetize a steel fastener parallel to its surface. If the fastener is free from defects the magnetic field lines run within the fastener and parallel to its surface. In case of magnetic inhomogeneity, for instance, near cracks, the magnetic field lines
will locally leave the surface and a leakage field occurs. When a suspension of ferromagnetic particles is applied to the test piece surface, the magnetic particles will run off at defect-free areas. In the places of leakage fields, the magnetic particles are attracted and cluster together, thus indicating the location of the defect. The surface defects become visible under ultraviolet light. We acquired the TekErreka dataset from a magnetic particle inspection apparatus located at Erreka Fastening Solutions. The defects in the TekErreka dataset differ in their size, shape, location, and material type and thus cover several scenarios in real-time defect detection. The difficulty in this dataset lies in the similarity of defects and noise due to magnetic particle deposition on the defect-free surface of the fasteners. There are many factors responsible for the noise component, which include magnetic particle size, the amount of magnetic particles used, ultraviolet light present, etc. The original examples are directly stored in a database as RGB images of size 2464 × 2056. The dataset has 450 positive and 1200 negative examples. We split the TekErreka dataset into training and testing sets: 80% for training and 20% for evaluation of the model performance.

B. Preprocessing
We resized the images of the Severstal dataset to 128×800 and the TekErreka dataset to 600×600. To keep the pixel values in the same scale, we normalized the images using min–max normalization, which rescales raw pixel values to the range from 0 to 1. This helps the optimizer not get stuck taking steps that are too large in one dimension, or too small in another.

C. Data Augmentation
To improve the diversity of the training set, we apply random but realistic data augmentation such as rotation, vertical/horizontal flips, zoom, shear, and channel shifts.

D. Training Details
The Defect-Aux-Net is implemented using the TensorFlow framework. All the experiments are run on Google Cloud TPU v2 infrastructure, which contains 8 cores with 64 GB memory. The network is optimized with the Adam optimizer and trained with a batch size of 128 for 50 epochs. We adopt the one-cycle policy [27] to find an optimal learning rate.

E. Evaluation Metrics
The classification results are evaluated using precision, recall, F1-score, and binary accuracy

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{F1-score} = \frac{2 \cdot (\text{Precision} \cdot \text{Recall})}{\text{Precision} + \text{Recall}} \tag{9}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{10}$$

where TP, TN, FP, and FN denote true positive (correctly identified surface defects), true negative (correctly identified non-defect images), false positive (images erroneously classified as surface defect), and false negative (images erroneously classified as non-defect). Precision is the fraction of images classified as defective that actually contain surface defects, while recall is the ratio of correctly classified images with surface defects to all images with surface defects. The F1-score is the harmonic mean of precision and recall. The overall performance of the classification task is measured by its accuracy.
The segmentation results are evaluated using Dice score and Intersection-over-Union (IoU), which quantify the overlap between the predicted and target binary masks. To evaluate defect detection results, we used the mean average precision (mAP), which compares the detected bounding box to the ground truth bounding box and returns a score.

F. Experiments on Defect Segmentation
We performed a series of experiments on the TekErreka dataset to test the effectiveness of different loss functions. First, we trained Defect-Aux-Net using BCE alone and Dice loss alone as the segmentation loss. Then, it was trained using a combination of loss functions. The results are shown in Table I.

TABLE I
PERFORMANCE OF THE PROPOSED APPROACH ON LOSS VARIANTS FOR THE DEFECT SEGMENTATION TASK

Using Dice loss alone yielded more accurate results than using a combination of losses. Additionally, the Dice loss function helped our model converge faster. We use the Dice loss function throughout the rest of the experiments.
To verify the effectiveness of the segmentation task using the MTL strategy, we compared the proposed MTL network (Defect-Aux-Net) against the following networks with the same bottom-up backbone (ResNet-50 + SE + SA attention module).
1) FPN [11]: This is the original FPN architecture without the MTL strategy and serves as our baseline.
2) UNet [10]: This network uses an encoder for multilevel feature extraction and a decoder that scales them up and combines multilevel features through stacking.
3) LinkNet [28]: This is similar to UNet with the difference of replacing the stacking operation with addition in skip connections.
4) PSPNet [28]: Pyramid scene parsing network uses a pyramid pooling module for multiscale feature extraction.
Based on the experimental results, we observed that the proposed multitask learning strategy achieves better segmentation performance as compared to the state-of-the-art segmentation models. The Dice and IoU scores of the various segmentation models on the Severstal dataset are depicted in Figs. 8 and 9.
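For reference, the classification metrics in (7)-(10) and the Dice and IoU overlap scores used for segmentation can be computed from binary labels and masks as in the sketch below. It is a hedged illustration rather than the evaluation code used in the experiments: the function names, the toy labels, and the conventions of returning 0 for an undefined precision, recall, or F1-score and 1 for two empty masks are assumptions of this example.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, F1-score, and accuracy from binary labels (1 = defect)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    # Undefined ratios fall back to 0.0 (an assumption of this sketch)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

def dice_and_iou(mask_true, mask_pred):
    """Dice score and IoU between two binary masks of the same shape."""
    t = np.asarray(mask_true).astype(bool)
    p = np.asarray(mask_pred).astype(bool)
    inter = int(np.sum(t & p))
    union = int(np.sum(t | p))
    total = int(t.sum() + p.sum())
    # Two empty masks are treated as a perfect match (assumed convention)
    dice = 2 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

# Toy example: 5 images, 3 of them defective
m = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(m)

# Toy 2 x 2 masks
print(dice_and_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

In the toy classification example there are two true positives, one true negative, one false positive, and one false negative, so precision, recall, and F1-score all equal 2/3 and accuracy equals 3/5.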
Fig. 8. IoU comparison between the state-of-the-art segmentation methods and the proposed approach on each type of defect classification.

Fig. 9. Dice score comparison between the state-of-the-art segmentation methods and the proposed approach on each type of defect classification.

TABLE II
PERFORMANCE OF THE COMPETING MODELS ON THE TEKERREKA DATASET

We observe that Defect-Aux-Net is able to achieve higher scores for all classes as compared to the other segmentation models. Table II shows the performance of the various networks on the TekErreka dataset. Experimental results from Table II show that the proposed multitask learning can improve the performance of its corresponding single-task model. Taking advantage of the classification-guidance module, Defect-Aux-Net avoids the oversegmentation of defects in a complex background.

TABLE III
COMPARISON OF PERFORMANCE OF DEFECT-AUX-NET AND STATE-OF-THE-ART CLASSIFICATION MODELS

G. Experiments on Defect Classification
We evaluated and compared the classification task performance of the proposed approach with the state-of-the-art deep learning architectures. While evaluating the classification task, the other two modules, segmentation and detection, are removed from the network. The results of the experiments are summarized in Table III. It can be noted that most errors are due to false positives. The visual similarity between defects and surface noise leads to false positive errors. Notably, Defect-Aux-Net obtains overall accuracy of at least 92.9% and at most 99.4% across all defect types on the Severstal dataset. Based on the experimental results, we observe that the proposed MTL approach outperforms the other models. Also, it is evident that incorporating the segmentation task improves the performance of the classification task and vice versa.
To assess the effectiveness of the proposed approach against the limited data problem, we removed part of the training data and conducted a series of experiments retaining 90%, 75%, and 50% of the training data. The effect of training data size on its accuracy is shown in Fig. 10. The proposed Defect-Aux-Net showed a consistent performance even when only 50% of the original training data is used in training. As seen, the proposed multitask loss function greatly improves the performance of the classification task by taking image, pixel, and map level optimization into consideration.
To verify the importance of the attention mechanisms in Defect-Aux-Net, we compared the accuracy of the network with

and without spatial and channel attention mechanisms (squeeze and excite) on the TekErreka dataset, as shown in Table IV. Furthermore, we experimented with inserting a combination of both spatial and channel attention mechanisms.

Fig. 10. Training data size versus classification accuracy of the Severstal dataset.

Fig. 11. mAP comparison between the state-of-the-art detection models and the proposed model.

TABLE IV
EFFECT OF USING ATTENTION MECHANISMS ON TEKERREKA DATASET

TABLE V
SYSTEM SPECIFICATION

TABLE VI
COMPARISON OF THE INFERENCE TIME OF DEFECT-AUX-NET AND BASELINE MODEL
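The channel attention (squeeze and excite) mechanism evaluated in Table IV can be illustrated with the minimal sketch below. It follows the general squeeze-and-excitation formulation of [24] in plain Python; the function name, the omission of biases, and the toy weights are assumptions made for illustration, and this section does not confirm that Defect-Aux-Net uses exactly this configuration. A real model would implement it with a deep learning framework and learned weights.

```python
import math


def squeeze_excite(feature_map, w1, w2):
    """Apply squeeze-and-excitation channel attention (hypothetical helper).

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weight matrix of the reduction layer.
    w2: C x (C // r) weight matrix of the expansion layer.
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excite, step 1: bottleneck fully connected layer with ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excite, step 2: expansion layer with a sigmoid gate in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every channel by its gate.
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]


# Toy input: two 2 x 2 channels, reduction ratio r = 2 (illustrative weights).
features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
reweighted = squeeze_excite(features, w1=[[1.0, 1.0]], w2=[[0.0], [1.0]])
```

With these toy weights the first channel is halved (its gate is sigmoid(0) = 0.5), while the second channel is left almost unchanged because its gate is close to 1.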

H. Experiments on Defect Detection


The proposed model is compared with other object detection algorithms on the TekErreka dataset. The comparative models include SSD [8], RetinaNet [25], and Cascade R-CNN [30]. Fig. 11 shows the mAP scores of the various detection models for the TekErreka dataset. We observe that Defect-Aux-Net achieves a higher mAP score than the alternative networks. The mAP of the proposed algorithm is 17.95%, 43.77%, and 26.03% higher than that of RetinaNet, SSD, and Cascade R-CNN, respectively.

I. Inference Time

In addition to the model performance, we evaluate the effect of the MTL framework on the inference time. We compared the inference time of the proposed approach with that of a conventional single-task network, where each task requires a separate pass through the network during inference. All inference times were measured on a computer with an Intel Core processor. The CPU specification is summarized in Table V. From Table VI, we can see that our proposed framework allows for a 57.1% reduction in the model size by solving different tasks jointly rather than independently. Compared to the single-task network, the inference time of our proposed network is reduced by 45.5%.

V. DISCUSSION

By incorporating the MTL strategy, our proposed Defect-Aux-Net improves the performance of defect classification, segmentation, and detection tasks. Intuitively, the multitask deep learning system can provide regularization effects to the

multiscale feature learning and thus improve the performance as opposed to the single-task algorithms. Also, the MTL framework can save computational inference time as only a single network needs to be evaluated for three different tasks. The experimental results show that our proposed algorithm greatly improves the performance of the surface defect identification tasks compared to other state-of-the-art deep learning algorithms.

VI. CONCLUSION

In this article, we described an attention-guided MTL scheme, which combines classification, segmentation, and detection for automated surface defect detection. Specifically, we proposed an extended FPN architecture with ResNet-50 incorporated as the encoder section of the model. A hybrid loss function is introduced to enhance the performance of the model. An overall accuracy of 97.1%, Dice score of 0.926, and mAP of 0.762 on classification, segmentation, and detection tasks of the TekErreka dataset were achieved with Defect-Aux-Net.

ACKNOWLEDGMENT

This work was undertaken in the context of DIGIMAN4.0 project (“Digital Manufacturing Technologies for Zero-Defect,” https://www.digiman4-0.mek.dtu.dk/). DIGIMAN4.0 is a European Training Network supported by Horizon 2020, the EU Framework Programme for Research and Innovation under Project 814225.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[3] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243.
[5] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105–6114.
[6] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448, doi: 10.1109/ICCV.2015.169.
[7] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, Feb. 2020, doi: 10.1109/TPAMI.2018.2844175.
[8] W. Liu et al., “SSD: Single shot multiBox detector,” in Proc. Eur. Conf. Comput. Vis. (Lecture Notes in Computer Science Series), B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 21–37.
[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.
[10] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervention, 2015, pp. 234–241.
[11] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, “Feature pyramid network for multi-class land segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 272–2723, doi: 10.1109/CVPRW.2018.00051.
[12] X. Ni, Z. Ma, J. Liu, B. Shi, and H. Liu, “Attention network for rail surface defect detection via consistency of intersection-over-union (IoU)-guided center-point estimation,” IEEE Trans. Ind. Inform., vol. 18, no. 3, pp. 1694–1705, Mar. 2022, doi: 10.1109/TII.2021.3085848.
[13] D. Zhang, K. Song, Q. Wang, Y. He, X. Wen, and Y. Yan, “Two deep learning networks for rail surface defect inspection of limited samples with line-level label,” IEEE Trans. Ind. Inform., vol. 17, no. 10, pp. 6731–6741, Oct. 2021, doi: 10.1109/TII.2020.3045196.
[14] L. Wen, Y. Wang, and X. Li, “A new cycle-consistent adversarial networks with attention mechanism for surface defect classification with small samples,” IEEE Trans. Ind. Inform., vol. 18, no. 12, pp. 8988–8998, Dec. 2022, doi: 10.1109/TII.2022.3168432.
[15] Kaggle, “Severstal: Steel defect detection. Can you detect and classify defects in steel?,” 2019.
[16] M. S. Kim, T. Park, and P. Park, “Classification of steel surface defect using convolutional neural network with few images,” in Proc. IEEE 12th Asian Control Conf., 2019, pp. 1398–1401.
[17] H. Lin, B. Li, X. Wang, Y. Shu, and S. Niu, “Automated defect inspection of LED chip using deep convolutional neural network,” J. Intell. Manuf., vol. 30, no. 6, pp. 2525–2534, Aug. 2019, doi: 10.1007/s10845-018-1415-x.
[18] X. Tao, D. Zhang, W. Ma, X. Liu, and D. Xu, “Automatic metallic surface defect detection and recognition with convolutional neural networks,” Appl. Sci., vol. 8, no. 9, 2018, Art. no. 1575, doi: 10.3390/app8091575.
[19] J. Ren and X. Huang, “Defect detection using combined deep autoencoder and classifier for small sample size,” in Proc. IEEE 6th Int. Conf. Control Sci. Syst. Eng., 2020, pp. 32–35, doi: 10.1109/ICCSSE50399.2020.9171953.
[20] J. Lian et al., “Deep-learning-based small surface defect detection via an exaggerated local variation-based generative adversarial network,” IEEE Trans. Ind. Inform., vol. 16, no. 2, pp. 1343–1351, Feb. 2020, doi: 10.1109/TII.2019.2945403.
[21] D. Zheng et al., “A defect detection method for rail surface and fasteners based on deep convolutional neural network,” Comput. Intell. Neurosci., vol. 2021, Jul. 2021, Art. no. 2565500, doi: 10.1155/2021/2565500.
[22] P. Xu, Z. Guo, L. Liang, and X. Xu, “MSF-Net: Multi-scale feature learning network for classification of surface defects of multifarious sizes,” Sensors, vol. 21, no. 15, Jul. 2021, Art. no. 5125, doi: 10.3390/s21155125.
[23] D. Amin and S. Akhter, “Deep learning-based defect detection system in steel sheet surfaces,” in Proc. IEEE Region 10 Symp., 2020, pp. 444–448, doi: 10.1109/TENSYMP50017.2020.9230863.
[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141, doi: 10.1109/CVPR.2018.00745.
[25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, Feb. 2020, doi: 10.1109/TPAMI.2018.2858826.
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2999–3007, doi: 10.1109/ICCV.2017.324.
[27] L. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay,” 2018, arXiv:1803.09820.
[28] A. Chaurasia and E. Culurciello, “LinkNet: Exploiting encoder representations for efficient semantic segmentation,” in Proc. IEEE Vis. Commun. Image Process., 2017, pp. 1–4, doi: 10.1109/VCIP.2017.8305148.
[29] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6230–6239, doi: 10.1109/CVPR.2017.660.
[30] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6154–6162, doi: 10.1109/CVPR.2018.00644.
