
Article

Development of an Algorithm for Detecting Real-Time Defects in Steel

Jiabo Yu 1, Cheng Wang 1,*, Teli Xi 2, Haijuan Ju 1, Yi Qu 1, Yakang Kong 1 and Xiancong Chen 1

1 Fundamentals Department, Air Force Engineering University, Xi'an 710051, China; [email protected] (J.Y.); [email protected] (H.J.); [email protected] (Y.Q.); [email protected] (Y.K.); [email protected] (X.C.)
2 School of Optoelectronic Engineering, Xidian University, Xi'an 710126, China; [email protected]
* Correspondence: [email protected]

Abstract: The integration of artificial intelligence with steel manufacturing operations holds great potential for enhancing factory efficiency. Object detection algorithms, as a category within the field of artificial intelligence, have been widely adopted for steel defect detection purposes. However, mainstream object detection algorithms often exhibit a low detection accuracy and high false-negative rates when it comes to detecting small and subtle defects in steel materials. In order to enhance the production efficiency of steel factories, one approach could be the development of a novel object detection algorithm to improve the accuracy and speed of defect detection in these facilities. This paper proposes an improved algorithm based on the YOLOv5s-7.0 version, called YOLOv5s-7.0-FCC. YOLOv5s-7.0-FCC integrates the basic operator C3-Faster (C3F) into the C3 module. Its special T-shaped structure reduces the redundant calculation of channel features, increases the attention weight on the central content, and improves the algorithm's computational speed and feature extraction capability. Furthermore, the spatial pyramid pooling-fast (SPPF) structure is replaced by the Content Augmentation Module (CAM), which enriches the image feature content with different convolution rates to simulate the way humans observe things, resulting in enhanced feature information transfer during the process. Lastly, the upsampling operator Content-Aware ReAssembly of Features (CARAFE) replaces the "nearest" method, transforming the receptive field size based on the difference in feature information. The three modules that act on feature information are distributed reasonably in YOLOv5s-7.0, reducing the loss of feature information during the convolution process. The results show that compared to the original YOLOv5 model, YOLOv5s-7.0-FCC increases the mean average precision (mAP) from 73.1% to 79.5%, achieving a 6.4% improvement. The detection speed also increased from 101.1 f/s to 109.4 f/s, an improvement of 8.3 f/s, further meeting the accuracy requirements for steel defect detection.

Keywords: receptive field; YOLOv5s-7.0; feature extraction; surface defect detection; attention weight

Citation: Yu, J.; Wang, C.; Xi, T.; Ju, H.; Qu, Y.; Kong, Y.; Chen, X. Development of an Algorithm for Detecting Real-Time Defects in Steel. Electronics 2023, 12, 4422. https://doi.org/10.3390/electronics12214422

Academic Editor: Spyridon Nikolaidis
Received: 13 September 2023; Revised: 15 October 2023; Accepted: 23 October 2023; Published: 27 October 2023

1. Introduction
Steel serves as a critical material in the manufacturing industry, infrastructure sector, and other fields. The quality of steel directly impacts both the quality of the finished products and the construction of infrastructure. Therefore, ensuring strict quality control for steel is exceptionally important, as it constitutes the initial guarantee for product qualification.
In the production and processing of steel products, defects such as pitting, scratches, and patches often occur on the surface of the steel. These defects can lead to a decrease in the quality and performance of steel products, thereby reducing the reliability and service life of the steel. Therefore, measures must be taken during steel production and usage to accurately and quickly detect and eliminate these non-compliant steel materials. The methods for detecting surface defects in steel can be categorized as manual inspection and

machine inspection. Manual inspection is prone to randomness, and the accuracy is heavily
influenced by the experience and attentiveness of the inspectors. Moreover, small defects
in steel may not be easily noticed by human workers [1]. On the other hand, machine
inspection offers the advantages of low cost, high efficiency, and good stability.
Object detection algorithms serve as the core of machine-based detection. However,
there are still two challenges for mainstream object detection algorithms in recognizing
surface defects on steel. Firstly, there is a high similarity between different types of defects on the steel surface, while defects of the same type can exhibit significant variations [2].
Secondly, the multitude of defect types on the steel surface leads to imprecise classification
results [3]. These two challenges result in decreased precision and a slower detection speed for object detection algorithms, and mainstream object detection algorithms can no longer meet the strict defect detection requirements of factories [4]. Therefore, there is an
urgent need for more advanced algorithms with improved performance in order to meet
the production demands of factories.
As a solution to this problem, the architecture of the YOLOv5 model was introduced.
However, it should be noted that in the YOLOv5 model architecture, the propagation of
convolution calculations leads to the loss of feature information. Therefore, it is crucial
to focus on the rational distribution of structures and module replacements that are more
suitable for machine computations. Zhang et al. [5] proposed an improved algorithm based
on YOLOv5 by incorporating deformable modules to adaptively adjust the perception
field scale. They also introduced the ECA-Net attention mechanism to enhance feature
extraction capabilities. The improved algorithm achieved a 7.85% increase in the mean
average precision (mAP) compared to the original algorithm. Li et al. [6] put forward
a modified algorithm based on YOLOv5, where they integrated the Efficient Channel
Attention for Deep Convolutional Neural Networks (ECA-Net)—an attention mechanism
to emphasize feature extraction in defective regions. They replaced the PANNet module
with the Bidirectional Feature Pyramid Network (BiFPN) module to integrate feature maps
of different sizes. The results showed that compared to the original YOLOv5 model, the
mAP increased by 1% while the computation time decreased by 10.3%. Guizhong Fu et al. [7]
proposed a compact Convolutional Neural Network (CNN) model that focused on training
low-level features to achieve the accurate and fast classification of steel surface defects. This
model demonstrated high precision performance with a small training dataset, even under
various modes of interference such as non-uniform illumination, motion blur, and camera
noise. Yu He et al. [3] proposed a fusion of multi-level feature maps approach, enabling the
detection of multiple defects on a single image. They utilized a Region Proposal Network
(RPN) to generate regions of interest (ROI), and the final conclusions were produced by the
detector. The results showed an accuracy of 82.3% on the NEU-DET dataset.
Real-Time Object Detection is also an important requirement for the industrialization
of steel defect detection. Wang et al. [8], focusing on the elongated nature of road
cracks, proposed an improved algorithm based on YOLOv3. They fused high-level and
low-level feature maps to enhance feature representation and achieve real-time detection
of road surfaces. To achieve faster object detection, Jiang et al. [9] introduced YOLOv4-tiny. This model replaced two CSPBlock modules with two ResnetBlock-D modules
to improve computation speed. Furthermore, residual network blocks were utilized to
extract more feature information from images, thus improving the detection accuracy.
The results showed that the improved algorithm achieved a faster detection rate without
sacrificing accuracy.
There are two main categories of deep learning-based object detection methods: one-
stage and two-stage. The one-stage approach directly utilizes convolutional neural net-
works to extract image features, and perform object localization and classification. Classic
algorithms in this category include the YOLO [10–13] series and SSD [14,15] series. In
contrast, the two-stage approach generates candidate regions before performing the afore-
mentioned processes. Popular algorithms in this category include the RCNN [16,17] series,
SPPNet [18], and R-FCN [19]. Considering its practicality within factories, the YOLOv5

model from the one-stage category is commonly used. It offers a faster detection speed but
suffers from a lower accuracy. To address this problem, the authors made improvements to
certain structures in YOLOv5 specifically for steel training datasets. These modifications
made the algorithm’s structure more suitable for machine feature extraction, resulting in
an improved detection speed and an increased average accuracy.
This article begins by introducing the basic architectural features of the YOLOv5-7.0
algorithm. Afterwards, the authors address several issues affecting the measurement accuracy in the original algorithm and propose three improved modules to replace the
problematic ones. The structure and distinguishing characteristics of these replacement
modules are emphasized. Subsequently, details regarding the experimental setup, dataset,
evaluation metrics, and other relevant information are provided.
The article then presents comparative experimental results, including comparisons
of six different module replacements for the C3 module, three different forms of CAM for
replacing the SPPF module, and eight different forms of Carafe for replacing the nearest
module. Additionally, a comparative experiment is conducted using the three selected
optimal modules in combination. Furthermore, the improved algorithm is compared with
mainstream detection algorithms.
Finally, the article concludes by presenting comparative visual results of the detection
performance between the improved algorithm and the original algorithm.

2. Materials and Methods


2.1. YOLOv5-7.0 Algorithm
The YOLOv5 algorithm is one of the typical one-stage algorithms that utilizes a series
of convolutional computations to extract hierarchical features from images within the
backbone network. By fusing high-level semantic information with low-level details, it
facilitates effective classification and localization, ultimately performing object detection in
the “detect” stage.
Version 7.0 of YOLOv5 represents the latest release in the YOLOv5 series and brings
significant improvements to instance segmentation performance. YOLOv5-7.0 offers five
different models, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, arranged in increasing order of model size. These models exhibit varying speeds and
accuracy levels.
The network structure of YOLOv5-7.0 consists of three components: the backbone,
neck, and head. The backbone is responsible for extracting high-level semantic features
from input images. The default backbone network in YOLOv5-7.0 is CSPDarknet, which
employs stacked convolutional and pooling layers to reduce the image resolution while
capturing essential features. After the processing by the backbone, the neck module in
YOLOv5-7.0 performs feature fusion and processing, combining features from different
levels to extract more comprehensive information. A Feature Pyramid Network (FPN)
is commonly used as the fusion method in YOLOv5-7.0, enabling the extraction of scale
information from multiple feature maps. The processed information from the neck module
is then fed into the head module, which employs non-maximum suppression to generate
three prediction results for predicting object locations and categories.
The algorithm proposed in this paper aims to enhance the YOLOv5s architecture
of version 7.0, with the goal of improving detection efficiency and reducing error rates
in practical applications. This enhancement seeks to achieve both rapid detection and
increased accuracy, catering to the requirements of factories.

2.2. YOLOv5-7.0 Improvement

2.2.1. C3F Operator
C3 is derived from the Cross Stage Partial Networks (CSPNet) architecture. C3 has two variations: one in the backbone of YOLOv5-7.0, as shown in Figure 1a, and another in the head of YOLOv5-7.0, as shown in Figure 1b. The difference between BottleNeck1 and BottleNeck2 lies in their input processing. In BottleNeck1, the result is obtained by adding the output of Conv applied twice to the initial input.

Figure 1. Two forms of the C3 module: (a) the structure of BottleNeck1, (b) the structure of BottleNeck2.
Many studies have shown that the differences in feature maps across different channels of the same image are minimal [20,21]. While most algorithms aim to reduce computational complexity and improve accuracy, they have not effectively addressed the issue of computing redundant features across different channels. The C3 structure, which follows traditional methods for processing feature maps in each channel, inevitably results in redundant computations between similar feature maps.
An improved version of the C3 module in YOLOv5s-7.0, known as C3-Faster (C3F), has been introduced to effectively address the aforementioned issues. Its design concept is derived from the PConv module used by Jierun Chen [22] in FasterNet. In C3F, the unprocessed data are concatenated with the output of the PConv module for further computation. This approach significantly reduces the computational workload while enhancing the accuracy. The structure of the PConv module can be seen in Figure 2a, while the structure of C3F is depicted in Figure 2b.

Figure 2. Overview diagram of the C3F model: (a) the structure of PConv, (b) the structure of C3F.
C3F is a fundamental operator that can be embedded into various neural networks to address the issue of redundant convolutions that often occur in neural network computations. By reducing memory access, C3F performs conventional Conv convolutions on only a portion of the input data, typically treating either the first or last channel as the representation of the entire image. For an input of spatial size a × b with d channels and a c × c convolution kernel, the floating point operations (FLOPs) for conventional Conv are

a × b × c² × d²    (1)

The corresponding amount of memory access is

a × b × 2d + c² × d²    (2)

In a contrasting manner, after replacing the C3 module with a C3F module, which convolves only d_f of the d channels, the FLOPs are reduced to

a × b × c² × d_f²    (3)

The corresponding amount of memory access is

a × b × 2d_f + c² × d_f²    (4)

where d is the total number of channels, d_f is the number of channels actually convolved, and d/d_f ≥ 2.


This means that the FLOPs for Conv are at least four times greater than the FLOPs for
C3F, and the memory access for Conv is more than double that of C3F.
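To make the partial-convolution idea concrete, below is a minimal PyTorch sketch of a PConv-style layer in the spirit of FasterNet [22]; the class name, the split ratio of 4, and the forward logic are illustrative assumptions rather than the authors' released C3F code.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """PConv-style layer (sketch): apply a regular conv to only the first
    d_f of d channels and pass the remaining d - d_f channels through."""

    def __init__(self, channels: int, ratio: int = 4, kernel_size: int = 3):
        super().__init__()
        self.df = channels // ratio  # d_f: channels actually convolved
        self.conv = nn.Conv2d(self.df, self.df, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.df, x.size(1) - self.df], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# FLOPs now scale with d_f^2 instead of d^2: with ratio 4, (d/d_f)^2 = 16x fewer.
x = torch.randn(1, 64, 40, 40)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```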
In order to fully utilize the unused channels (d − d_f) in the feature maps after the
aforementioned operations, these channels are transformed into pointwise convolutions
(Conv1 × 1) and added to the center position of the PConv module. This results in a
convolutional layer with efficient computational capabilities. Batch normalization (BN) is
then applied to further improve the convergence speed. To avoid the issues of gradient
vanishing or exploding during computation, a Rectified Linear Unit (ReLU) is used as the
activation function to enhance the non-linear fitting ability of the upper and lower layer
function values. Subsequently, pointwise convolutions, average global pooling, and fully
connected layers are employed to merge and output the final results.
This approach of processing digital images allows for a reduction in computational
workload, thereby enhancing the speed of computation without sacrificing accuracy. By
intelligently combining convolutional layers (attaching Conv1 × 1 layers to the center
position of the PConv module, forming a T-shaped structure), greater attention is given
to this central position, which has the highest Frobenius norm. This arrangement aligns
with the pattern of feature extraction in images and can even reduce the computational
workload while improving the precision.

2.2.2. Information Feature Enhancement Module—CAM


To address the issue of the varying candidate box sizes after the first stage of detection
in RCNN, He, Kaiming et al. [23] proposed the spatial pyramid pooling (SPP) structure to
fix the detection box size. The SPP structure incorporates parallel operations of MaxPool2d
with 5 × 5, 9 × 9, and 13 × 13 modules, as shown in Figure 3a, to ensure fixed-size outputs.
Subsequently, He, Kaiming further improved the SPP structure by introducing the spatial
pyramid pooling-fast (SPPF) structure. The innovation in the network architecture is due to
the replacement of the MaxPool2d operation in SPP with three consecutive 5 × 5 modules,
as depicted in Figure 3b. This modification leads to outputs of the same size while reducing
the computational workload and improving the detection speed. However, it is important
to note that the SPPF structure inherently involves the loss of partial information during
the pooling process. If the convolutional operations prior to pooling fail to learn sufficient
features, it can have a significant impact on the detection results.
Figure 3. Structure of pooling layer in YOLOv5: (a) the structure of SPP, (b) the structure of SPPF.

The utilization of the context augmentation module (CAM) has demonstrated remarkable effectiveness in handling low-resolution targets, while also providing a robust solution to the aforementioned issues. The conceptual architecture of CAM was a solution to address the imbalanced training dataset and limitations of the network. In this study, the CAM module is incorporated into YOLOv5s-7.0, replacing the SPPF structure and further enhancing the computational precision.
The CAM module uses convolution with varying dilation rates to process images and enrich the contextual information from both the upper and lower regions of the image. By combining the feature information from multiple images with different dilation rates, the expression of the features becomes more evident. The structure of the CAM module is illustrated in Figure 4.

Figure 4. The structure of CAM.

In the above figure, 3 × 3 convolution kernels are applied with rates of 1, 3, and 5. This approach draws inspiration from the way humans recognize objects, where using a rate of 1 is akin to observing details up close, such as when observing a panda and noticing its creamy white torso, sharp black claws, and black ears. However, these details may not be sufficient to determine the object's category. By contrast, performing convolution calculations with rates of 3, 5, or even larger rates is akin to viewing an object in its entirety, comparing it to the surrounding environment. Applying this visual approach to machine learning has demonstrated comparable results. By simulating this method of human observation, machine learning adjusts the rate to obtain different receptive fields and then fuses them for improved accuracy. This learning technique works particularly well for smaller targets at various resolutions.
The CAM module includes three types of weight forms: Weight, Adaptive, and Concatenation, as illustrated in Figure 5.


Figure 5. The three modes of CAM: (a) Weight, (b) Concatenation, (c) Adaptive.
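As a concrete illustration of the three-rate design and the adaptive fusion described next, here is a minimal PyTorch sketch of a CAM-style block; the class name, channel counts, and softmax weighting are our illustrative assumptions, not the module's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAugmentation(nn.Module):
    """CAM-style block (sketch): three 3x3 convs with dilation rates 1, 3, 5,
    fused with spatially adaptive weights (the 'Adaptive' mode)."""

    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (1, 3, 5)  # small-to-large receptive fields
        ])
        # Predict one weight map per branch: output shape [bs, 3, h, w].
        self.weight = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]  # 3 x [bs, c, h, w]
        w = F.softmax(self.weight(x), dim=1)             # [bs, 3, h, w]
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

x = torch.randn(1, 256, 20, 20)
print(ContextAugmentation(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```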

The Weight mode involves adding the information after it undergoes Conv1 × 1 processing three times. The Adaptive mode adopts an adaptive approach to match the weights, where [bs, 3, h, w] in the diagram represents spatially adaptive weight values. The Concatenation mode combines the information after it undergoes Conv1 × 1 processing three times through weighted fusion. There is no phenomenon of one mode being better than the others among these three modes. The performance of the CAM module is influenced by factors such as the dimensions of different datasets' images, their characteristic features, and the connections between different modules.
2.2.3. Variable Receptive Field Module—CARAFE
Upsampling is widely used in deep neural networks to enlarge high-level features. In version 7.0 of YOLOv5, the nearest module, which means nearest neighbor interpolation, is employed for upsampling operations. When using the upsampling module in
image processing, each point in the output image can be mapped to the input image. The
mapped point takes on the value of the nearest point among the adjacent four points in
the input image, assumed as a, b, c, and d. The “nearest” module does not add computa-
tional complexity as it only requires passing values and processing data blocks. However,
this module also has several drawbacks that significantly affect the results of neural net-
work computations. The data transformation approach of the “nearest” module reduces
the gradual correlation between adjacent pixels in the original image. Additionally, its
1 × 1 perceptual field is very small, and this uniform and simplistic sampling method does
not effectively utilize the semantic information of the feature map.
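For reference, this is what the "nearest" operator does in PyTorch: each output pixel simply copies its nearest input pixel, so the 1 × 1 receptive field ignores the surrounding content entirely.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])  # [1, 1, 2, 2] feature map

# Each output pixel copies the nearest input pixel (1x1 receptive field):
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(y[0, 0])
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```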
The Content-Aware ReAssembly of FEatures (CARAFE) module, initially proposed by CL Chen et al. [24], can address the limitations of nearest neighbor methods. CARAFE provides a larger receptive field for images by generating corresponding kernels based on the semantic information of the feature map. This allows for a more focused consideration of the content surrounding the image, thereby improving the detection accuracy without significantly increasing the computational requirements. The execution process of the CARAFE module consists of two steps: "Creation of the Upsampling Kernel" and "New Kernel Convolution Calculation". The creation of the upsampling kernel is illustrated in Figure 6, while the process of new kernel convolution calculation is depicted in Figure 7.

Figure 6. Creation of the upsampling kernel.

The image above depicts the process of creating an upsampling kernel. To meet the
requirement of reducing the computational complexity, the input image of size h × w × l
undergoes calculations using a 1 × 1 convolutional operation, compressing the channels to
l1. Next, the content is re-encoded and passed through a k2 × k2 convolutional operation, dividing the l1 channels into m² groups of k1² channels. These channels are then rearranged and combined in a mh × mw × k1² structure after unfolding the spatial dimensions. Finally, the rearranged mh × mw × k1² structure is normalized, ensuring that the weights of the newly created upsampling kernel sum up to 1.
At each position in the output feature map h × w × l, there is a mapping to the
input feature map. For example, in the rectangular shape shown in the image, the yellow
region corresponds to a 6 × 6 area in the input feature map. Then, the upsampling kernel
mh × mw × k1² is rearranged into a k1 × k1 structure within the red region, and the
dot product between this rearranged kernel and the 6 × 6 input area yields an output

value. The yellow region of the rectangular shape determines the corresponding positional coordinates of the red region in the upsampling kernel, and a single upsampling kernel can be shared among all channels at that corresponding position.

Figure 7. New kernel convolution calculation.
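Putting the two steps together, a compact PyTorch sketch of CARAFE-style ×2 upsampling could look as follows; the class name, the channel-compression ratio, and the defaults k1 = 5, k2 = 3 are illustrative assumptions, not the reference implementation of [24].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarafeUpsample(nn.Module):
    """CARAFE-style x2 upsampling (sketch): predict a k1 x k1 reassembly
    kernel per output pixel from the input content, then apply it."""

    def __init__(self, channels: int, k1: int = 5, k2: int = 3, scale: int = 2):
        super().__init__()
        self.k1, self.scale = k1, scale
        mid = max(channels // 4, 16)                  # assumed compression ratio
        self.compress = nn.Conv2d(channels, mid, 1)
        # One k1*k1 kernel per output location (scale^2 per input pixel).
        self.encode = nn.Conv2d(mid, scale * scale * k1 * k1, k2, padding=k2 // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s, k = self.scale, self.k1
        # Step 1: create the upsampling kernels and normalize with softmax.
        kernels = self.encode(self.compress(x))            # [b, s*s*k*k, h, w]
        kernels = F.pixel_shuffle(kernels, s)              # [b, k*k, s*h, s*w]
        kernels = F.softmax(kernels, dim=1)                # weights sum to 1
        # Step 2: reassemble via a dot product with each k x k neighborhood.
        patches = F.unfold(x, k, padding=k // 2)           # [b, c*k*k, h*w]
        patches = patches.view(b, c, k * k, h, w)
        patches = F.interpolate(                           # map to output grid
            patches.view(b, c * k * k, h, w), scale_factor=s, mode="nearest"
        ).view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2) # [b, c, s*h, s*w]

x = torch.randn(1, 64, 20, 20)
print(CarafeUpsample(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```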

The sizes k1 × k1 and k2 × k2 represent the reorganized upsampling kernel and the receptive field, respectively. A larger receptive field means a larger range of content perception, requiring a larger reorganized upsampling kernel to match the increased receptive field. CL Chen introduced and explained the permutation patterns of k1 and k2 in the Carafe module, which include [1, 3], [1, 5], [3, 3], [3, 5], [3, 7], [5, 5], [5, 7], and [7, 7]. Additionally, it was noted that the parameter size increases as the value of k increases.
2.2.4. The Architecture of the Improved Model Based on YOLOv5s-7.0
newly created upsampling kernel sum up to 1.
The proposed
At each position improved algorithm
in the output featureismap
based h ×on
wYOLOv5s-7.0. The purpose
× l, there is a mapping to theof the
input
improvement
feature map. For is toexample,
improvein thethedetection performance
rectangular shape shown of small target
in the objects,
image, such asregion
the yellow those
with low pixel clarity. The structure of the improved algorithm is depicted in Figure 8. In
this improvement, all seven C3 modules in the original algorithm’s backbone structure and
four C3 modules in the neck structure are replaced with C3F modules. The C3F modules
reduce redundant computations and their T-shaped pattern allows the receptive field to
focus more on the center position with the maximum Frobenius norm.
Additionally, the SPPF in the original algorithm’s backbone is replaced with a CAM.
The CAM obtains three different receptive fields with rates of 1, 3, and 5 when processing
an image. This allows the algorithm to focus more on contextual information, reducing the
impact of low pixel clarity features. Moreover, three fusion methods (weight, adaptive, and
concatenation) are proposed for combining the obtained receptive fields.
Furthermore, the two nearest neighbor upsampling operators—the “nearest” modules
in the original algorithm’s neck—are replaced with a feature recombination module called
CARAFE, which focuses on semantic information. Based on the values of k1 and k2 ([1, 3],
[1, 5], [3, 3], [3, 5], [3, 7], [5, 5], [5, 7], and [7, 7]), a total of 16 combinations can be obtained
from the two “nearest” modules. This approach of increasing the receptive field differs
from directly expanding it with CAM, as it reconstructs the upsampling kernel based on
the feature information to enhance the receptive field.
Figure 8. The improved architecture based on version 7.0 of YOLOv5s.

By combining these three improvement methods without significantly increasing the computational complexity, the algorithm achieves a significant improvement in its accuracy. A comparative analysis will be presented in the following sections.

3. Experiments and Analysis
3.1. Experimental Environment
The experimental platform used in this study is the AutoDL cloud service platform provided by SeetaTech Technology Ltd. The basic configuration is shown in Table 1. During the training process using the improved algorithm based on YOLOv5s-7.0, no pre-trained weights were used. The basic parameters were set as follows: epochs = 300, batch-size = 32, workers = 22, initial learning rate = 0.001, and confidence threshold = 0.5.
workers = 22, initial learning rate = 0.001, and confidence threshold = 0.5.

Table 1. Setup and parameters of the experimental platform.

Name Type of Configuration


Mirror image environment ubuntu20.04
Deep learning framework PyTorch 1.11.0 + Cuda 11.3
Program Python 3.8
GPU RTX 4090 (24 GB)
CPU 22 vCPU AMD EPYC 7T83 64-Core Processor
Memory capacity 90 GB

3.2. Dataset and Evaluation Indicators


3.2.1. Dataset
This article utilizes a publicly available dataset called “NEU-DET” [25] provided by
the Northeastern University of China for training and testing purposes. A total of six
types of damage patterns shown in Figure 9 were selected, including scratches, crazing,
inclusions, patches, pitting, and rolled-in scales, which are common defects in steel. The
dataset consists of 1800 images, each with a resolution of 200 × 200 pixels, and contains
300 images for each defect category. To facilitate effective training, validation, and testing,
the dataset was divided in an 8:1:1 ratio. This means that 1440 images were used for
training, while the remaining 360 images were evenly distributed between the validation
and testing sets.
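An 8:1:1 split like the one described here can be scripted in a few lines; the directory layout and file names below are assumptions for illustration, not the authors' preprocessing code.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("NEU-DET/images").glob("*.jpg"))  # assumed layout
random.shuffle(images)

n = len(images)                      # 1800 images in NEU-DET
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],                  # 1440 images
    "val":   images[n_train:n_train + n_val],   # 180 images
    "test":  images[n_train + n_val:],          # 180 images
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```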

Figure 9. Six kinds of defect pictures: (a) scratches, (b) crazing, (c) inclusions, (d) patches, (e) pitting, (f) rolled-in scales.

3.2.2. Evaluation Indicators
In the object detection experiments, the algorithm's performance is evaluated using precision (P), recall (R), average precision (AP), mean average precision (mAP) and frames per second (FPS). The corresponding expressions are as follows:

P = TP / (TP + FP)    (5)

R = TP / (TP + FN)    (6)

AP = ∫₀¹ P(R) dR    (7)

mAP = (Σᵢ₌₁ⁿ AP(i)) / n    (8)

In these expressions, TP is expressed as the number of pictures with positive samples for both labeling and detection; FP represents the number of images where the ground truth is negative but the detection is positive; and FN represents the number of images where the ground truth is positive but the detection is negative. In Equation (8), the sum runs over the n defect categories. FPS is an indicator of detection speed, representing the number of images that can be detected per second.
3.3. Module Replacement Comparative Experiment
In order to analyze and demonstrate the impact of different improvement methods on the performance of defect detection algorithms on the "Steel Strip NEU-DET", we will conduct individual or combined comparative analyses of three modules replaced in the
improved algorithm. The following sections will experimentally evaluate the comparative
effects of replacing several common C3 modules of the same type, compare the performance
of three weight options in the CAM module, and assess the comparative effects of varying
k1 k2 parameter combinations in the Carafe module.

3.3.1. Comparative Experiment of C3 Module Replacement


In addition to the C3F module being able to replace C3, in recent years, many excellent
papers have proposed convolutional kernels that can improve the algorithm’s performance.
For example, Yolov8 introduces a module called cross stage partial feature fusion (C2f),
which merges high-level and low-level features to enhance accuracy. The deformable
convolution V2 (DCN) [26] is a variable convolution that adapts the kernel size according
to the network layer size. The switchable atrous convolution (SAC) [27] by Google’s team is
used for tiny object detection. The distribution shifting convolution (DSConv) [28] quantizes
the original convolutional parameters and shifts the distribution to reduce the memory
usage and accelerate the computation speed. The diverse branch block (DBB) [29], proposed
in CVPR2021, combines multiple branches with rich features into a main branch through
reparameterization. These convolutional kernels will be incorporated into YOLOv5s-7.0,
and their performance will be compared and evaluated, as shown in Table 2.

Table 2. Comparison of parameters of replacing the C3 module.

Model Params (×10⁶) FPS (f/s) mAP (%)


YOLOv5s-7.0(C3) 7.03 101.1 73.1
YOLOv5s-7.0(C3F) 5.8 111.2 76.3
YOLOv5s-7.0(C2f) 9.29 119 76.2
YOLOv5s-7.0(SAC) 8.29 85.7 76
YOLOv5s-7.0(DBB) 7.03 56.8 74.1
YOLOv5s-7.0(DCN) 7.09 121.6 73.9
YOLOv5s-7.0(DSConv) 7.03 129.4 73.8

The improved algorithms that replaced the convolutional kernels with C3F, C2f, SAC,
DBB, DCN, and DSConv have increased the mAP compared to the original algorithm
by 3.4%, 3.3%, 3.1%, 1.2%, 1%, and 0.9%, respectively. All six improved algorithms of
YOLOv5s-7.0 can enhance the detection accuracy. However, only the algorithms that incorporate C3F, C2f, DCN, and DSConv improve both the detection accuracy and speed. The algorithms that incorporate DCN and DSConv have a weaker ability to improve
the detection accuracy compared to the other two convolutional kernels. Additionally,
considering that C2f has approximately 3.49 × 10⁶ more computational parameters than
C3F while achieving a similar detection accuracy, in the subsequent ablation experiments,
only the case of integrating C3F will be considered.

3.3.2. Comparative Experiment of CAM Module Replacement


The three fusion modes of the CAM module have been described in detail in Section 2.2.2. When the original algorithm was fused with the CAM module, it
was discovered that the Adaptive mode, which allows for the importance analysis of the results
through attention mechanisms, performed the best on this dataset. It is more suitable for this
dataset, as shown in Table 3.
Electronics 2023, 12, 4422 13 of 17

Table 3. Comparison of the performance of CAM in three modes.

CAM Params (×10⁶) mAP (%)


Weight 14.25 75.4
Adaptive 14.25 75.7
Concatenation 14.51 75.2

The three fusion modes of the CAM module, when compared to the original algorithm,
have all shown an improvement in the detection accuracy of at least 2.3%. Among them,
after multiple experiments on the “NEU-DET” dataset, the Adaptive mode of the CAM
module achieved the highest mAP of 75.7%. Therefore, in the subsequent ablation experi-
ments, only the Adaptive mode of the CAM module will be considered. Although the CAM
module can enhance the detection accuracy, it has a higher computational cost compared to
the SPPF in the original algorithm. To further reduce the overall computational cost of the
algorithm, future improvements can consider integrating the CAM module in the higher
levels of the original algorithm.

3.3.3. Comparative Experiment of Carafe Module Replacement


In version 7.0 of YOLOv5s, this paper replaces both the “nearest” upsampling opera-
tors with Carafe. The variation in the k values for the higher-level Carafe has a relatively
small impact on the overall parameter size of the algorithm. Even when replacing the high-
level upsampling operator with Carafe [7, 7], the overall parameter size of the algorithm
only increases from 7.03 × 10⁶ to 7.67 × 10⁶. This is due to the geometric growth in the pa-
rameter size resulting from increasing the k values of both the higher-level and lower-level
Carafe. In this study, the lower-level upsampling operator is set as Carafe [1, 3], and the
different combinations of the k values for the higher-level Carafe are explored to find the
optimal integration of Carafe into YOLOv5s-7.0. The comparison of the parameter size and
detection accuracy of the eight combinations of Carafe-enhanced algorithm integration is
shown in Table 4.

Table 4. Comparison of the performance of Carafe in eight modes.

[k1, k2] Params (×10⁶) mAP (%)


[1, 3] 7.06 74.6
[1, 5] 7.06 74.2
[3, 3] 7.07 74.1
[3, 5] 7.11 75
[3, 7] 7.17 74
[5, 5] 7.21 74.3
[5, 7] 7.37 75.2
[7, 7] 7.67 73.2

All the fusion combinations incorporating the Carafe module lead to an improved
detection accuracy. Through a comparison of the mAP results, it was found that the optimal
detection accuracy is achieved when the recombination sampling kernel k1 is set to 1, 3, or
5, and the receptive field k2 is set to 3, 5, or 7. Among the combination modes [1, 3], [3, 5],
and [5, 7], the detection accuracy increases with the increase in the k values, with the [5, 7]
combination achieving the highest mAP of 75.2%. Moreover, the increase in the parameter
size is not substantial. Therefore, in the subsequent ablation experiments, only the Carafe
mode [5, 7] needs to be considered.

3.4. Ablation Experiment and Analysis


To investigate the impact of the three aforementioned improvement combinations
on the detection accuracy of surface defects in steel strips, ablation experiments were
conducted as shown in Table 5.
Electronics 2023, 12, 4422 14 of 17

Table 5. Ablation experiment (Params in ×10⁶).

Group 1: YOLOv5s-7.0 baseline; Params 7.03; mAP 73.1%
Group 2: + C3F; Params 5.8; mAP 76.3%
Group 3: + CAM (Adaptive); Params 14.25; mAP 75.7%
Group 4: + Carafe [5, 7]; Params 7.37; mAP 75.2%
Group 5: + C3F + CAM; Params 13.0; mAP 77.9%
Group 6: + C3F + Carafe; Params 6.14; mAP 77.1%
Group 7: + CAM + Carafe; Params 14.6; mAP 76.7%
Group 8: + C3F + CAM + Carafe; Params 13.35; mAP 79.5%

In the table, each group lists the modules fused into the baseline. Comparing Group 2 with Group 1, the addition of the C3F module resulted in a reduction of 1.23 × 10⁶ parameters. However, the mAP increased from 73.1% to 76.3%, showing an improvement of 3.2%. When
comparing Group 3 with Group 1, the inclusion of the CAM module led to an almost
doubling of the parameter count, but the detection accuracy also improved by 2.6%. In the
case of Group 4 compared to Group 1, the integration of the Carafe module only increased
the parameter count by 0.34 × 10⁶, yet the detection accuracy improved by 2.1%. Groups
5, 6, and 7 represent paired combinations of C3F, CAM, and Carafe, respectively. When
compared to their individual integration with YOLOv5s-7.0, all these combination modes
exhibited further improvements in mAP, suggesting that there is no adverse reaction when
combining C3F, CAM, and Carafe. Group 8 fused C3F, CAM, and Carafe into YOLOv5s-7.0,
and created a new model named “YOLOv5s-7.0-FCC”. Furthermore, the code for the
YOLOv5s-7.0-FCC model is publicly available and readers are free to use it. The download
location can be found in the Supplementary Materials Data File S1. Lastly, Group 8, when
compared to Group 1, saw an increase in the parameter count from 7.03 × 10⁶ to 13.35 × 10⁶,
while the detection accuracy soared from 73.1% to 79.5%. Sacrificing some parameters to
achieve significant improvements in the detection accuracy proves to be meaningful.

4. Results
4.1. Superior Features of YOLOv5s-7.0-FCC in Mainstream Object Detection Algorithms
Currently, in engineering applications, other mainstream object detection algorithms
such as Faster-RCNN, SSD, YOLOv3, and YOLOx are widely used. They are compre-
hensively evaluated and compared with YOLOv5s-7.0-FCC, and the specific results are
presented in Table 6.

Table 6. Comparison of performance for Mainstream Object Detection Algorithms.

Algorithm Params (×10⁶) FPS (f/s) mAP (%)


Faster-RCNN 56.81 56.2 61.5
SSD 48.83 216 72
YOLOv3 61.55 83.1 70.6
YOLOx 18.01 98 71.9
YOLOv5s-7.0 7.03 101.1 73.1
YOLOv5s-7.0-FCC 13.35 109.4 79.5

The table above shows the results obtained on the "NEU-DET" dataset by testing various algorithms. YOLOv5s-7.0-FCC has fewer parameters than Faster-RCNN, SSD, YOLOv3, and YOLOx, and it achieves the highest detection accuracy. In terms of the detection speed, except for being slower than SSD, it is faster than the other four algorithms. With a parameter count of 13.35 × 10⁶, YOLOv5s-7.0-FCC improves the detection accuracy to 79.5%. Compared to the original algorithm, although the parameter count of YOLOv5s-7.0-FCC almost doubles, the detection speed improves from 101.1 f/s to 109.4 f/s,
and the mAP increases by 6.4%. Overall, YOLOv5s-7.0-FCC enables real-time detection
functionality in engineering applications and exhibits significantly superior performance
compared to other mainstream object detection algorithms.
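For context on the f/s numbers reported above, detection speed is typically measured by timing repeated forward passes; a rough sketch (our illustration, not the authors' benchmark script) is given below.

```python
import time
import torch

def measure_fps(model: torch.nn.Module, imgsz: int = 640, n: int = 100) -> float:
    """Estimate frames per second by averaging n single-image forward passes."""
    model.eval()
    x = torch.randn(1, 3, imgsz, imgsz)
    with torch.no_grad():
        for _ in range(10):               # warm-up passes
            model(x)
        start = time.perf_counter()
        for _ in range(n):
            model(x)                      # add torch.cuda.synchronize() on GPU
    return n / (time.perf_counter() - start)
```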

4.2. Comparison of YOLOv5s-7.0-FCC and YOLOv5s-7.0 Detection Results


The comparison between the detection results of YOLOv5s-7.0 and YOLOv5s-7.0-FCC is shown in Figure 10. The figure includes six types of defects and their corresponding confidence scores. From the figure, it is evident that YOLOv5s-7.0-FCC yields more detection boxes, allowing for the detection of targets with less distinct features. Additionally, the confidence scores are generally higher compared to the original algorithm.

Figure 10. Comparison of YOLOv5s-7.0-FCC and YOLOv5s-7.0 detection results.
5. Conclusions
During the surface defect detection process in steel strips, it is often challenging to detect subtle features and small defect targets, resulting in a low detection accuracy. To address these issues effectively, this paper proposes an improved algorithm called YOLOv5s-7.0-FCC to meet the demands of engineering practices. YOLOv5s-7.0-FCC is an algorithm that enhances semantics and optimizes the feature extraction. It introduces a lightweight convolutional kernel operator, C3F, into the original algorithm to reduce the redundant computations. C3F focuses more on the center position, which is better suited for feature extraction. The algorithm enriches the contextual information by incorporating a CAM module to enhance feature representation. Additionally, it integrates two CARAFE upsampling operators, experimenting with various upsampling kernels and receptive field combinations to increase the feature extraction capabilities. The experimental results demonstrate that YOLOv5s-7.0-FCC outperforms YOLOv5s-7.0 in terms of the detection speed, exhibiting a more reasonable structural distribution that is better suited for machine learning. Moreover, YOLOv5s-7.0-FCC achieves a 6.4% increase in the detection accuracy, effectively reducing false negatives and false positives in steel strip defect detection, resulting in a significant overall performance improvement.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics12214422/s1, Data File S1: Code for the YOLOv5s-7.0-FCC model.
Author Contributions: Conceptualization, C.W.; methodology, T.X.; software, H.J.; validation, Y.Q.;
formal analysis, J.Y.; investigation, Y.K.; resources, X.C.; data curation, J.Y.; writing—original draft
preparation, J.Y.; writing—review and editing, J.Y.; visualization, T.X.; supervision, Y.Q.; project
administration, H.J.; funding acquisition, C.W. All authors have read and agreed to the published
version of the manuscript.
Funding: The applicant for the fund is Haijuan Ju. This research was funded by Natural Science
Basic Research Program of Shaanxi, China, the fund number 2023-JC-QN-0696.
Data Availability Statement: The data that support the findings of this research are openly available at https://download.csdn.net/download/qq_41264055/85490311 (accessed on 15 September 2023).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhao, Z. Review of non-destructive testing methods for defect detection of ceramics. Ceram. Int. 2021, 47, 4389–4397. [CrossRef]
2. Jain, S.; Seth, G.; Paruthi, A.; Soni, U.; Kumar, G. Synthetic data augmentation for surface defect detection and classification using
deep learning. J. Intell. Manuf. 2022, 33, 1007–1020. [CrossRef]
3. He, Y.; Song, K.; Meng, Q.; Yan, Y. An End-to-end Steel Surface Defect Detection Approach via Fusing Multiple Hierarchical
Features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504. [CrossRef]
4. He, D.; Xu, K.; Zhou, P. Defect detection of hot rolled steels with a new object detection framework called classification priority
network. Comput. Ind. Eng. 2019, 128, 290–297. [CrossRef]
5. Zhang, M.; Yin, L. Solar Cell Surface Defect Detection Based on Improved YOLO v5. IEEE Access 2022, 10, 80804–80815. [CrossRef]
6. Li, X.; Wang, C.; Ju, H.; Li, Z. Surface defect detection model for aero-engine components based on improved YOLOv5. Appl. Sci.
2022, 12, 7235. [CrossRef]
7. Fu, G.Z.; Sun, P.Z.; Zhu, W.B.; Yang, J.X.; Cao, Y.L.; Yang, M.Y.; Cao, Y.P. A deep-learning-based approach for fast and robust steel
surface defects classification. Opt. Laser Eng. 2019, 121, 397–405. [CrossRef]
8. Wang, Q.; Mao, J.; Zhai, X.; Gui, J.; Shen, W.; Liu, Y. Improvements of YoloV3 for road damage detection. J. Phys. Conf. Ser. 2021,
1903, 012008. [CrossRef]
9. Jiang, Z.; Zhao, L.; Li, S.; Jia, Y. Real-time object detection method based on improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244.
10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
11. Cai, X.; Zhou, S.; Cheng, P.; Feng, D.; Sun, H.; Ji, J. A Social Distance Monitoring Method Based on Improved YOLOv4 for
Surveillance Videos. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2354007. [CrossRef]
12. Qiu, M.; Huang, L.; Tang, B.H. ASFF-YOLOv5: Multielement Detection Method for Road Traffic in UAV Images Based on
Multiscale Feature Fusion. Remote Sens. 2022, 14, 3498. [CrossRef]
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
14. Pinto, L.G.; Martins, W.M.; Ramos, A.C.; Pimenta, T.C. Analysis and Deployment of an OCR-SSD Deep Learning Technique for
Real-Time Active Car Tracking and Positioning on a Quadrotor. In Data Science: Theory, Algorithms, and Applications; Springer:
Singapore, 2021.
15. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer
Vision, Venice, Italy, 22–29 October 2017.
18. Zheng, Z.; Hu, Y.; Zhang, Y.; Yang, H.; Qiao, Y.; Qu, Z.; Huang, Y. CASPPNet: A chained atrous spatial pyramid pooling network
for steel defect detection. Meas. Sci. Technol. 2022, 33, 085403. [CrossRef]

19. Xue, N.; Niu, L.; Li, Z. Pedestrian Detection with modified R-FCN. In Proceedings of the UAE Graduate Students Research
Conference 2021 (UAEGSRC’2021), Abu Dhabi, United Arab Emirates.
20. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C. GhostNet: More Features from Cheap Operations. arXiv 2019, arXiv:1911.11907.
21. Zhang, Q.; Jiang, Z.; Lu, Q.; Han, J.N.; Zeng, Z.; Gao, S.H.; Men, A. Split to Be Slim: An Overlooked Redundancy in Vanilla
Convolution. arXiv 2020, arXiv:2006.12085.
22. Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster
Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC,
Canada, 18–22 June 2023; pp. 12021–12031.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef] [PubMed]
24. Loy, C.C.; Lin, D.; Wang, J.; Chen, K.; Xu, R.; Liu, Z. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
25. Cui, L.; Jiang, X.; Xu, M.; Li, W.; Lv, P.; Zhou, B. SDDNet: A fast and accurate network for surface defect detection. IEEE Trans.
Instrum. Meas. 2021, 70, 1–13. [CrossRef]
26. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168.
27. Qiao, S.; Chen, L.-C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 10213–10224.
28. Gennari, M.; Fawcett, R.; Prisacariu, V.A. DSConv: Efficient Convolution Operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
29. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-like Unit. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
