Article
Development of an Algorithm for Detecting Real-Time Defects
in Steel
Jiabo Yu 1, Cheng Wang 1,*, Teli Xi 2, Haijuan Ju 1, Yi Qu 1, Yakang Kong 1 and Xiancong Chen 1
Abstract: The integration of artificial intelligence with steel manufacturing operations holds great
potential for enhancing factory efficiency. Object detection algorithms, as a category within the field
of artificial intelligence, have been widely adopted for steel defect detection purposes. However,
mainstream object detection algorithms often exhibit a low detection accuracy and high false-negative
rates when it comes to detecting small and subtle defects in steel materials. In order to enhance
the production efficiency of steel factories, one approach could be the development of a novel
object detection algorithm to improve the accuracy and speed of defect detection in these facilities.
This paper proposes an improved algorithm based on the YOLOv5s-7.0 version, called YOLOv5s-
7.0-FCC. YOLOv5s-7.0-FCC integrates the basic operator C3-Faster (C3F) into the C3 module. Its
special T-shaped structure reduces the redundant calculation of channel features, increases the
attention weight on the central content, and improves the algorithm’s computational speed and
feature extraction capability. Furthermore, the spatial pyramid pooling-fast (SPPF) structure is
replaced by the Content Augmentation Module (CAM), which enriches the image feature content
with different convolution rates to simulate the way humans observe things, resulting in enhanced
feature information transfer during the process. Lastly, the upsampling operator Content-Aware
ReAssembly of Features (CARAFE) replaces the “nearest” method, transforming the receptive field
size based on the difference in feature information. The three modules that act on feature information
are distributed reasonably in YOLOv5s-7.0, reducing the loss of feature information during the convolution process. The results show that compared to the original YOLOv5 model, YOLOv5s-7.0-FCC increases the mean average precision (mAP) from 73.1% to 79.5%, achieving a 6.4% improvement. The detection speed also increased from 101.1 f/s to 109.4 f/s, an improvement of 8.3 f/s, further meeting the accuracy requirements for steel defect detection.
Keywords: receptive field; YOLOv5s-7.0; feature extraction; surface defect detection; attention weight
Citation: Yu, J.; Wang, C.; Xi, T.; Ju, H.; Qu, Y.; Kong, Y.; Chen, X. Development of an Algorithm for Detecting Real-Time Defects in Steel. Electronics 2023, 12, 4422. https://doi.org/10.3390/electronics12214422
Academic Editor: Spyridon Nikolaidis
1. Introduction
Steel surface defects can be detected through either manual inspection or machine inspection. Manual inspection is prone to randomness, and the accuracy is heavily
influenced by the experience and attentiveness of the inspectors. Moreover, small defects
in steel may not be easily noticed by human workers [1]. On the other hand, machine
inspection offers the advantages of low cost, high efficiency, and good stability.
Object detection algorithms serve as the core of machine-based detection. However,
there are still two challenges for mainstream object detection algorithms in recognizing
surface defects on steel. Firstly, different types of defects on the steel surface can appear highly similar, while defects of the same type can exhibit significant variations [2].
Secondly, the multitude of defect types on the steel surface leads to imprecise classification
results [3]. These two challenges result in decreased precision and a slower detection speed
of the object detection algorithms, and mainstream object detection algorithms can no
longer meet the strict defect detection requirements of factories [4]. Therefore, there is an
urgent need for more advanced algorithms with improved performance in order to meet
the production demands of factories.
As a solution to this problem, the architecture of the YOLOv5 model was introduced.
However, it should be noted that in the YOLOv5 model architecture, the propagation of
convolution calculations leads to the loss of feature information. Therefore, it is crucial
to focus on the rational distribution of structures and module replacements that are more
suitable for machine computations. Zhang et al. [5] proposed an improved algorithm based
on YOLOv5 by incorporating deformable modules to adaptively adjust the perception
field scale. They also introduced the ECA-Net attention mechanism to enhance feature
extraction capabilities. The improved algorithm achieved a 7.85% increase in the mean
average precision (mAP) compared to the original algorithm. Li et al. [6] put forward
a modified algorithm based on YOLOv5, where they integrated the Efficient Channel
Attention for Deep Convolutional Neural Networks (ECA-Net)—an attention mechanism
to emphasize feature extraction in defective regions. They replaced the PANet module
with the Bidirectional Feature Pyramid Network (BiFPN) module to integrate feature maps
of different sizes. The results showed that compared to the original YOLOv5 model, the
mAP increased by 1% while the computation time decreased by 10.3%. Fu et al. [7]
proposed a compact Convolutional Neural Network (CNN) model that focused on training
low-level features to achieve the accurate and fast classification of steel surface defects. This
model demonstrated high precision performance with a small training dataset, even under
various modes of interference such as non-uniform illumination, motion blur, and camera
noise. He et al. [3] proposed an approach that fuses multi-level feature maps, enabling the detection of multiple defects in a single image. They utilized a Region Proposal Network (RPN) to generate regions of interest (ROIs), and the final predictions were produced by the detector. The results showed an accuracy of 82.3% on the NEU-DET dataset.
Real-Time Object Detection is also an important requirement for the industrialization
of steel defect detection. Wang et al. [8], focusing on the elongated nature of road
cracks, proposed an improved algorithm based on YOLOv3. They fused high-level and
low-level feature maps to enhance feature representation and achieve real-time detection
of road surfaces. To achieve faster object detection, Jiang et al. [9] introduced an improved YOLOv4-tiny. This model replaced two CSPBlock modules with two ResnetBlock-D modules
to improve computation speed. Furthermore, residual network blocks were utilized to
extract more feature information from images, thus improving the detection accuracy.
The results showed that the improved algorithm achieved a faster detection rate without
sacrificing accuracy.
There are two main categories of deep learning-based object detection methods: one-
stage and two-stage. The one-stage approach directly utilizes convolutional neural net-
works to extract image features, and perform object localization and classification. Classic
algorithms in this category include the YOLO [10–13] series and SSD [14,15] series. In
contrast, the two-stage approach generates candidate regions before performing the afore-
mentioned processes. Popular algorithms in this category include the RCNN [16,17] series,
SPPNet [18], and R-FCN [19]. Considering its practicality within factories, the YOLOv5
model from the one-stage category is commonly used. It offers a faster detection speed but
suffers from a lower accuracy. To address this problem, the authors made improvements to
certain structures in YOLOv5 specifically for steel training datasets. These modifications
made the algorithm’s structure more suitable for machine feature extraction, resulting in
an improved detection speed and an increased average accuracy.
This article begins by introducing the basic architectural features of the YOLOv5-7.0 algorithm. The authors then address several issues affecting the detection accuracy of the original algorithm and propose three improved modules to replace the
problematic ones. The structure and distinguishing characteristics of these replacement
modules are emphasized. Subsequently, details regarding the experimental setup, dataset,
evaluation metrics, and other relevant information are provided.
The article then presents comparative experimental results, including comparisons
of six different module replacements for the C3 module, three different forms of CAM for replacing the SPPF module, and eight different forms of Carafe for replacing the "nearest"
module. Additionally, a comparative experiment is conducted using the three selected
optimal modules in combination. Furthermore, the improved algorithm is compared with
mainstream detection algorithms.
Finally, the article concludes by presenting comparative visual results of the detection
performance between the improved algorithm and the original algorithm.
Figure 1. Two forms of the C3 module: (a) the structure of BottleNeck1, (b) the structure of BottleNeck2.
Many studies have shown that the differences in feature maps across different channels of the same image are minimal [20,21]. While most algorithms aim to reduce computational complexity and improve accuracy, they have not effectively addressed the issue of computing redundant features across different channels. The C3 structure, which follows traditional methods for processing feature maps in each channel, inevitably results in redundant computations between similar feature maps.
An improved version of the C3 module in YOLOv5s-7.0, known as C3-Faster (C3F), has been introduced to effectively address the aforementioned issues. Its design concept is derived from the PConv module used by Jierun Chen [22] in FasterNet. In C3F, the unprocessed data are concatenated with the PConv module for further computation. This approach significantly reduces the computational workload while enhancing the accuracy. The structure of the PConv module can be seen in Figure 2a, while the structure of C3F is depicted in Figure 2b.
Figure 2. Overview diagram of the C3F model: (a) the structure of PConv, (b) the structure of C3F.
C3F is a fundamental operator that can be embedded into various neural networks to address the issue of redundant convolutions that often occur in neural network computations. By reducing memory access, C3F performs conventional Conv convolutions on only a portion of the input data, typically treating either the first or last channel as the representation of the entire image. The floating point operations (FLOPs) for conventional Conv are
a × b × c² × d² (1)
The corresponding amount of memory access is
a × b × 2d + c² × d² (2)
In a contrasting manner, after replacing the C3 module with a C3F module, the FLOPs are reduced to
a × b × c² × d_f² (3)
The corresponding amount of memory access is
a × b × 2d_f + c² × d_f² (4)
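The partial-convolution idea underlying C3F can be illustrated with a short PyTorch sketch. This is an illustrative approximation rather than the authors' released code: the names PConv, dim, and n_div are assumptions, and in YOLOv5s-7.0-FCC this operator is wrapped inside the C3 block to form C3F.

import torch
import torch.nn as nn

class PConv(nn.Module):
    # Partial convolution sketch: a regular 3x3 convolution is applied to only a
    # fraction (1/n_div) of the input channels; the remaining channels pass through
    # untouched and are concatenated back, which cuts FLOPs and memory access.
    def __init__(self, dim: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.dim_conv = dim // n_div
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

if __name__ == "__main__":
    print(PConv(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])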
Figure 3. Structure of pooling layer in YOLOv5: (a) the structure of SPP, (b) the structure of SPPF.
Figure 4. The structure of CAM.
In the above figure, 3 × 3 convolution kernels are applied with rates of 1, 3, and 5. This approach draws inspiration from the way humans recognize objects, where using a rate of 1 is akin to observing details up close, such as when observing a panda and noticing its creamy white torso, sharp black claws, and black ears. However, these details may not be sufficient to determine the object's category. By contrast, performing convolution calculations with rates of 3, 5, or even larger rates is akin to viewing an object in its entirety, comparing it to the surrounding environment. Applying this visual approach to machine learning has demonstrated comparable results. By simulating this method of human observation, machine learning adjusts the rate to obtain different receptive fields and then fuses them for improved accuracy. This learning technique works particularly well for smaller targets at various resolutions.
The CAM module includes three types of weight forms: Weight, Adaptive, and Concatenation, as illustrated in Figure 5.
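The multi-rate idea, together with one possible way of fusing the resulting receptive fields, can be sketched in PyTorch as follows. This is an illustrative approximation rather than the authors' CAM code: the class name CAMSketch is hypothetical, and the per-pixel weighting used here corresponds to the Adaptive form described with Figure 5 below.

import torch
import torch.nn as nn

class CAMSketch(nn.Module):
    # Three parallel 3x3 convolutions with dilation rates 1, 3 and 5: rate 1 captures
    # local detail, while rates 3 and 5 see progressively wider context. A 1x1
    # convolution then predicts per-pixel weights ([bs, 3, h, w]) to blend the branches.
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r, bias=False)
            for r in (1, 3, 5)
        ])
        self.weight_pred = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]                       # three receptive fields
        w = torch.softmax(self.weight_pred(torch.cat(feats, dim=1)), dim=1)   # [bs, 3, h, w]
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

if __name__ == "__main__":
    print(CAMSketch(256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])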
Figure 5. The three modes of CAM: (a) Weight, (b) Concatenation, (c) Adaptive.
The Weight mode involves adding the information after it undergoes Conv1 × 1 processing three times. The Adaptive mode adopts an adaptive approach to match the weights, where [bs, 3, h, w] in the diagram represents spatially adaptive weight values. The Concatenation mode combines the information after it undergoes Conv1 × 1 processing three times through weighted fusion. No single mode is consistently better than the others; the performance of the CAM module is influenced by factors such as the dimensions of different datasets' images, their characteristic features, and the connections between different modules.
Figure 6. Creation of the upsampling kernel.
The image above depicts the process of creating an upsampling kernel. To meet the requirement of reducing the computational complexity, the input image of size h × w × l undergoes a 1 × 1 convolutional operation, compressing the channels to l1. Next, the content is re-encoded and passed through a k2 × k2 convolutional operation, dividing the l1 channels into m² groups of k1² channels. These channels are then rearranged and combined into an mh × mw × k1² structure after unfolding the spatial dimensions. Finally, the rearranged mh × mw × k1² structure is normalized, ensuring that the weights of the newly created upsampling kernel sum to 1.
At each position in the output feature map h × w × l, there is a mapping to the input feature map. For example, in the rectangular shape shown in the image, the yellow region corresponds to a 6 × 6 area in the input feature map. Then, the upsampling kernel mh × mw × k1² is rearranged into a k1 × k1 structure within the red region, and the dot product between this rearranged kernel and the 6 × 6 input area yields an output value. The yellow region of the rectangular shape determines the corresponding positional coordinates of the red region in the upsampling kernel, and a single upsampling kernel can be shared among all channels at that corresponding position.
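The kernel-prediction and reassembly steps described above can be condensed into the following simplified PyTorch sketch. It is an illustrative re-implementation of the CARAFE idea rather than the official operator; k_up and k_enc stand in for k1 and k2, and c_mid for the compressed channel count l1.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCARAFE(nn.Module):
    def __init__(self, c: int, scale: int = 2, k_up: int = 5, k_enc: int = 3, c_mid: int = 64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, kernel_size=1)           # channel compressor (l -> l1)
        self.encode = nn.Conv2d(c_mid, (scale * k_up) ** 2,          # content encoder (k2 x k2 conv)
                                kernel_size=k_enc, padding=k_enc // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Predict one k_up x k_up reassembly kernel per output position, normalized to sum to 1.
        kernels = F.pixel_shuffle(self.encode(self.compress(x)), self.scale)   # [b, k_up^2, sh, sw]
        kernels = torch.softmax(kernels, dim=1)
        # Gather each input position's k_up x k_up neighbourhood and map it to the output grid.
        neigh = F.unfold(x, self.k_up, padding=self.k_up // 2).view(b, c, self.k_up ** 2, h, w)
        neigh = neigh.repeat_interleave(self.scale, dim=3).repeat_interleave(self.scale, dim=4)
        # Content-aware reassembly: dot product of kernel and neighbourhood at every output pixel.
        return (kernels.unsqueeze(1) * neigh).sum(dim=2)                        # [b, c, sh, sw]

if __name__ == "__main__":
    print(SimpleCARAFE(64)(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 40, 40])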
Figure 8. The improved architecture based on version 7.0 of YOLOv5s.
Additionally, the SPPF in the original algorithm's backbone is replaced with a CAM. The CAM obtains three different receptive fields with rates of 1, 3, and 5 when processing an image. This allows the algorithm to focus more on contextual information, reducing the impact of low pixel clarity features. Moreover, three fusion methods (weight, adaptive, and concatenation) are proposed for combining the obtained receptive fields.
Furthermore, the two nearest neighbor upsampling operators—the "nearest" modules in the original algorithm's neck—are replaced with a feature recombination module called CARAFE, which focuses on semantic information. Based on the values of k1 and k2 ([1, 3], [1, 5], [3, 3], [3, 5], [3, 7], [5, 5], [5, 7], and [7, 7]), a total of 16 combinations can be obtained from the two "nearest" modules. This approach of increasing the receptive field
By combining these three improvement methods without significantly increasing the computational complexity, the algorithm achieves a significant improvement in its accuracy. A comparative analysis will be presented in the following sections.
3. Experiments and Analysis
3.1. Experimental Environment
The experimental platform used in this study is the AutoDL cloud service platform provided by SeetaTech Technology Ltd. The basic configuration is shown in Table 1. During the training process using the improved algorithm based on YOLOv5s-7.0, no pre-trained weights were used. The basic parameters were set as follows: epochs = 300, batch-size = 32, workers = 22, initial learning rate = 0.001, and confidence threshold = 0.5.
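For reference, the settings above could be passed to the YOLOv5 repository's train.py roughly as in the hedged sketch below; the dataset and hyperparameter YAML names (NEU-DET.yaml, hyp.neu-det.yaml) are placeholders rather than files provided with the paper, the initial learning rate (lr0) is set inside the hyperparameter YAML, and the 0.5 confidence threshold applies at validation and detection time rather than during training.

# Hedged sketch: launching YOLOv5 (v7.0) training with the stated parameters.
# Run from a checkout of the ultralytics/yolov5 repository so that train.py is importable.
import train

train.run(
    data="NEU-DET.yaml",        # placeholder dataset config for the NEU-DET images
    cfg="models/yolov5s.yaml",  # YOLOv5s model definition
    weights="",                 # no pre-trained weights, as stated above
    epochs=300,
    batch_size=32,
    workers=22,
    hyp="hyp.neu-det.yaml",     # placeholder hyp file containing lr0: 0.001
)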
P = TP / (TP + FP) (5)
R = TP / (TP + FN) (6)
AP = ∫₀¹ P(R) dR (7)
mAP = (∑ᵢ₌₀ⁿ AP(i)) / n (8)
In these expressions, TP is expressed as the number of pictures with positive samples for both labeling and detection; FP represents the number of images where the ground truth is negative but the detection is positive; and FN represents the number of images where the ground truth is positive but the detection is negative. FPS is an indicator of detection speed, representing the number of images that can be detected per second.
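As a quick numerical illustration of Equations (5)–(8), AP can be approximated as the area under a sampled precision-recall curve; the snippet below is a generic sketch with toy values, not the authors' evaluation script.

import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    # Numerically integrate the precision-recall curve (Equation (7)),
    # after sorting the operating points by increasing recall.
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

# Toy operating points for one defect class.
p = np.array([1.0, 0.8, 0.6])
r = np.array([0.2, 0.5, 0.9])
print(average_precision(p, r))   # area under the sampled PR curve
# mAP (Equation (8)) is then the mean of the per-class AP values.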
The improved algorithms that replaced the convolutional kernels with C3F, C2f, SAC,
DBB, DCN, and DSConv have increased the mAP compared to the original algorithm
by 3.4%, 3.3%, 3.1%, 1.2%, 1%, and 0.9%, respectively. All six improved algorithms of
YOLOv5s-7.0 can enhance the detection accuracy. However, only the algorithms that incorporate C3F, C2f, DCN, and DSConv improve both the detection accuracy and speed. The algorithms that incorporate DCN and DSConv have a weaker ability to improve the detection accuracy compared to the other two convolutional kernels. Additionally,
considering that C2f has approximately 3.49 × 10⁶ more computational parameters than
C3F while achieving a similar detection accuracy, in the subsequent ablation experiments,
only the case of integrating C3F will be considered.
The three fusion modes of the CAM module, when compared to the original algorithm,
have all shown an improvement in the detection accuracy of at least 2.3%. Among them,
after multiple experiments on the “NEU-DET” dataset, the Adaptive mode of the CAM
module achieved the highest mAP of 75.7%. Therefore, in the subsequent ablation experi-
ments, only the Adaptive mode of the CAM module will be considered. Although the CAM
module can enhance the detection accuracy, it has a higher computational cost compared to
the SPPF in the original algorithm. To further reduce the overall computational cost of the
algorithm, future improvements can consider integrating the CAM module in the higher
levels of the original algorithm.
All the fusion combinations incorporating the Carafe module lead to an improved
detection accuracy. Through a comparison of the mAP results, it was found that the optimal
detection accuracy is achieved when the recombination sampling kernel k1 is set to 1, 3, or
5, and the receptive field k2 is set to 3, 5, or 7. Among the combination modes [1, 3], [3, 5],
and [5, 7], the detection accuracy increases with the increase in the k values, with the [5, 7]
combination achieving the highest mAP of 75.2%. Moreover, the increase in the parameter
size is not substantial. Therefore, in the subsequent ablation experiments, only the Carafe
mode [5, 7] needs to be considered.
In the table, '✓' indicates the presence of the module. Comparing Group 2 with Group 1, the addition of the C3F module resulted in a reduction of 1.23 × 10⁶ in the parameter count. However, the mAP increased from 73.1% to 76.3%, showing an improvement of 3.2%. When comparing Group 3 with Group 1, the inclusion of the CAM module led to an almost doubling of the parameter count, but the detection accuracy also improved by 2.6%. In the case of Group 4 compared to Group 1, the integration of the Carafe module only increased the parameter count by 0.34 × 10⁶, yet the detection accuracy improved by 2.1%. Groups
5, 6, and 7 represent paired combinations of C3F, CAM, and Carafe, respectively. When
compared to their individual integration with YOLOv5s-7.0, all these combination modes
exhibited further improvements in mAP, suggesting that there is no adverse reaction when
combining C3F, CAM, and Carafe. Group 8 fused C3F, CAM, and Carafe into YOLOv5s-7.0,
and created a new model named “YOLOv5s-7.0-FCC”. Furthermore, the code for the
YOLOv5s-7.0-FCC model is publicly available and readers are free to use it. The download
location can be found in the Supplementary Materials Data File S1. Lastly, Group 8, when
compared to Group 1, saw an increase in the parameter count from 7.03 × 10⁶ to 13.35 × 10⁶,
while the detection accuracy soared from 73.1% to 79.5%. Sacrificing some parameters to
achieve significant improvements in the detection accuracy proves to be meaningful.
4. Results
4.1. Superior Features of YOLOv5s-7.0-FCC Compared with Mainstream Object Detection Algorithms
Currently, in engineering applications, other mainstream object detection algorithms
such as Faster-RCNN, SSD, YOLOv3, and YOLOv4 are widely used. They are compre-
hensively evaluated and compared with YOLOv5s-7.0-FCC, and the specific results are
presented in Table 6.
The table above shows the results obtained on the “NEU-DET” dataset by testing
various algorithms. YOLOv5s-7.0-FCC has fewer parameters than Faster-RCNN, SSD,
YOLOv3, and YOLOv4, and it achieves the highest detection accuracy. In terms of the
detection speed, except for being slower than SSD, it is faster than the other four algorithms.
With a parameter count of 13.35 × 10⁶, YOLOv5s-7.0-FCC improves the detection accuracy
to 79.5%. Compared to the original algorithm, although the parameter count of YOLOv5s-7.0-FCC nearly doubles, the detection speed improves from 101.1 f/s to 109.4 f/s,
and the mAP increases by 6.4%. Overall, YOLOv5s-7.0-FCC enables real-time detection
functionality in engineering applications and exhibits significantly superior performance
compared to other mainstream object detection algorithms.
Supplementary Materials: The following supporting information can be downloaded at: https://www.
mdpi.com/article/10.3390/electronics12214422/s1, Data File S1: Code for the YOLOv5s-7.0-FCC model.
Author Contributions: Conceptualization, C.W.; methodology, T.X.; software, H.J.; validation, Y.Q.;
formal analysis, J.Y.; investigation, Y.K.; resources, X.C.; data curation, J.Y.; writing—original draft
preparation, J.Y.; writing—review and editing, J.Y.; visualization, T.X.; supervision, Y.Q.; project
administration, H.J.; funding acquisition, C.W. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the Natural Science Basic Research Program of Shaanxi, China, grant number 2023-JC-QN-0696. The applicant for the fund is Haijuan Ju.
Data Availability Statement: The data that support the findings of this research are openly available
at https://download.csdn.net/download/qq_41264055/85490311 (accessed on 15 September 2023).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zhao, Z. Review of non-destructive testing methods for defect detection of ceramics. Ceram. Int. 2021, 47, 4389–4397. [CrossRef]
2. Jain, S.; Seth, G.; Paruthi, A.; Soni, U.; Kumar, G. Synthetic data augmentation for surface defect detection and classification using
deep learning. J. Intell. Manuf. 2022, 33, 1007–1020. [CrossRef]
3. He, Y.; Song, K.; Meng, Q.; Yan, Y. An End-to-end Steel Surface Defect Detection Approach via Fusing Multiple Hierarchical
Features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504. [CrossRef]
4. He, D.; Xu, K.; Zhou, P. Defect detection of hot rolled steels with a new object detection framework called classification priority
network. Comput. Ind. Eng. 2019, 128, 290–297. [CrossRef]
5. Zhang, M.; Yin, L. Solar Cell Surface Defect Detection Based on Improved YOLO v5. IEEE Access 2022, 10, 80804–80815. [CrossRef]
6. Li, X.; Wang, C.; Ju, H.; Li, Z. Surface defect detection model for aero-engine components based on improved YOLOv5. Appl. Sci.
2022, 12, 7235. [CrossRef]
7. Fu, G.Z.; Sun, P.Z.; Zhu, W.B.; Yang, J.X.; Cao, Y.L.; Yang, M.Y.; Cao, Y.P. A deep-learning-based approach for fast and robust steel
surface defects classification. Opt. Laser Eng. 2019, 121, 397–405. [CrossRef]
8. Wang, Q.; Mao, J.; Zhai, X.; Gui, J.; Shen, W.; Liu, Y. Improvements of YoloV3 for road damage detection. J. Phys. Conf. Ser. 2021,
1903, 012008. [CrossRef]
9. Jiang, Z.; Zhao, L.; Li, S.; Jia, Y. Real-time object detection method based on improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244.
10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
11. Cai, X.; Zhou, S.; Cheng, P.; Feng, D.; Sun, H.; Ji, J. A Social Distance Monitoring Method Based on Improved YOLOv4 for
Surveillance Videos. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2354007. [CrossRef]
12. Qiu, M.; Huang, L.; Tang, B.H. ASFF-YOLOv5: Multielement Detection Method for Road Traffic in UAV Images Based on
Multiscale Feature Fusion. Remote Sens. 2022, 14, 3498. [CrossRef]
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
14. Pinto, L.G.; Martins, W.M.; Ramos, A.C.; Pimenta, T.C. Analysis and Deployment of an OCR-SSD Deep Learning Technique for
Real-Time Active Car Tracking and Positioning on a Quadrotor. In Data Science: Theory, Algorithms, and Applications; Springer:
Singapore, 2021.
15. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer
Vision, Venice, Italy, 22–29 October 2017.
18. Zheng, Z.; Hu, Y.; Zhang, Y.; Yang, H.; Qiao, Y.; Qu, Z.; Huang, Y. CASPPNet: A chained atrous spatial pyramid pooling network
for steel defect detection. Meas. Sci. Technol. 2022, 33, 085403. [CrossRef]
19. Xue, N.; Niu, L.; Li, Z. Pedestrian Detection with modified R-FCN. In Proceedings of the UAE Graduate Students Research
Conference 2021 (UAEGSRC’2021), Abu Dhabi, United Arab Emirates.
20. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C. GhostNet: More Features from Cheap Operations. arXiv 2019, arXiv:1911.11907.
21. Zhang, Q.; Jiang, Z.; Lu, Q.; Han, J.N.; Zeng, Z.; Gao, S.H.; Men, A. Split to Be Slim: An Overlooked Redundancy in Vanilla
Convolution. arXiv 2020, arXiv:2006.12085.
22. Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster
Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC,
Canada, 18–22 June 2023; pp. 12021–12031.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef] [PubMed]
24. Loy, C.C.; Lin, D.; Wang, J.; Chen, K.; Xu, R.; Liu, Z. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
25. Cui, L.; Jiang, X.; Xu, M.; Li, W.; Lv, P.; Zhou, B. SDDNet: A fast and accurate network for surface defect detection. IEEE Trans.
Instrum. Meas. 2021, 70, 1–13. [CrossRef]
26. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168.
27. Qiao, S.; Chen, L.-C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 10213–10224.
28. Gennari, M.; Fawcett, R.; Prisacariu, V.A. DSConv: Efficient Convolution Operator. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
29. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-like Unit. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.