Article
An Image Object Detection Model Based on Mixed Attention
Mechanism Optimized YOLOv5
Guangming Sun 1,2 , Shuo Wang 1 and Jiangjian Xie 3, *
1 Department of Electrical and Information Engineering, Hebei Jiaotong Vocational and Technical College,
Shijiazhuang 050051, China
2 Road Traffic Perception and Intelligent Application Technology R&D Center of Universities in Hebei Province,
Shijiazhuang 050051, China
3 School of Technology, Beijing Forestry University, Beijing 100083, China
* Correspondence: [email protected]
Abstract: Object detection in complex environments is one of the more difficult problems in the field of computer vision, and it draws on other key technologies such as pattern recognition, artificial intelligence, and digital image processing. However, because the environment can be complex, changeable, highly varied, and easily confused with the target, the target is easily affected by factors such as insufficient light, partial occlusion, and background interference, making the detection of multiple targets extremely difficult and the robustness of the algorithm low.
How to make full use of the rich spatial information and deep texture information in an image to
accurately identify the target type and location is an urgent problem to be solved. The emergence of
deep neural networks provides an effective way for image feature extraction and full utilization. By
aiming at the above problems, this paper proposes an object detection model based on the mixed
attention mechanism optimization of YOLOv5 (MAO-YOLOv5). The proposed method fuses the
local features and global features in an image so as to better enrich the expression ability of the feature
map and more effectively detect objects with large differences in size within the image. Then, the
attention mechanism is added to the feature map to weigh each channel, enhance the key features,
remove the redundant features, and improve the recognition ability of the feature network towards
the target object and background. The results show that the proposed network model has higher
Citation: Sun, G.; Wang, S.; Xie, J. An
precision and a faster running speed and can perform better in object-detection tasks.
Image Object Detection Model Based
on Mixed Attention Mechanism Keywords: deep neural network; object detection; YOLOv5; context information; attention
Optimized YOLOv5. Electronics 2023, mechanism
12, 1515. https://doi.org/10.3390/
electronics12071515
relationship between objects, that is, context information, are beneficial to target detection.
Since object detection algorithms based on manual feature extraction can only extract shallow feature information, it is difficult for them to capture deeper semantic features. In order to improve detection accuracy, it is necessary to build a more complex feature extraction model [3]. Although object detection technology has gone through many years of theoretical research and technical application alongside the continuous development of society, the existing technology still needs to be further explored to meet new requirements in different fields; on the basis of continuously enhanced performance, the design also needs to become simpler, more scientific, and more robust.
The collected images are mainly affected by various internal and external factors.
Internal factors include window size selection and variable target shape. The selection of
the target window size used in some algorithms is a difficult problem because the window
size is usually predefined, and the local difference is invariant [4,5]. However, if the window size is too large, slight changes in texture features may not be detected, and more computation is consumed. If the window size is too small, some interfering pixels may be detected as target pixels, such as noise or pixels affected by lighting changes or other factors [6]. The target may also be in high-speed motion when shooting, so the imaging will show deformation and scale changes, and high robustness is therefore required of the target detection algorithm. External factors include background noise and various detection equipment. In
a complex background, because the target strength is weak, it is partially blocked by the
interfering object or buried in the clutter and noise [7]. In these cases, it is difficult to
separate the target from the complex and noisy background. The farther the target is from
the acquisition point, the smaller the image area of the target is. At the same time, the
worse the image quality is and the more difficult it is to detect the target. For the target
collection device, different collection devices will be set according to different scenes. In
order to estimate the relative position and absolute position of the target, it is necessary to
obtain the relative distance between the target and the acquisition equipment by means of
ranging [8].
This paper analyzes the object detection algorithm based on deep learning, and the
research contents mainly include:
1. In this paper, the YOLOv5 feature extraction network is improved, and the feature
map extraction operation is advanced. If it is a small object, the deeper the network
level is, the less semantic information is retained by the small object. By extracting
the feature map in advance, more abundant location information can be obtained
to improve the problem of the feature loss of object information as the network
level deepens;
2. The network model proposed in this paper adds a self-attention mechanism. The
attention mechanism can improve the object feature weight, reduce the background
feature weight, let the model obtain the object area that needs to be focused on, and
reconstruct the features;
3. The network model proposed in this paper uses a context feature-fusion structure. Deep feature information with rich semantics but unclear location information is fused with shallow feature information that has clear location information but weaker semantics, alleviating the detection difficulty caused by complex multi-object scenes;
4. The remainder of this paper is organized as follows: Section 2 discusses work related
to image object detection, followed, in Section 3, by the object detection algorithms
based on deep learning. In Section 4, the proposed method is addressed along with
the considerations for the analysis of this paper. Section 5 presents the experimental
results and analysis. Finally, in Section 6, a conclusion is drawn.
2. Related Work
As a basic task in computer vision, object detection has been a hot topic in academic
research in recent years. From traditional image processing methods to deep learning-
based methods, from single-stage and two-stage object detection frameworks based on
anchor frames to anchor-free object detection frameworks, researchers have
explored better object detection methods from various dimensions. In recent years, the
object detection algorithm has been constantly updated iteratively. It has been constantly
challenged between solving new problems and finding new problems. It has weighed
between bigger and stronger algorithms and lighter and faster algorithms. With the joint
efforts of scholars, the overall research framework of the algorithm has become increasingly
mature [9]. As the downstream task of image classification, the object detection algorithm
needs to complete two tasks: one is to generate the object detection frame to be recognized;
the second is to accurately judge the types of objects in the detection frame. Traditional
object detection methods use artificial features to represent complex features, which are
gradually reduced due to their weak applicability and low detection performance; the
object detection method based on deep learning has been developed with the proposed
convolutional neural network, which eliminates the disadvantages of traditional object
detection, and it is widely used [10]. At present, in the field of deep learning, object
detection algorithms can be divided into two categories: two-stage object detection and
one-stage object detection. The former gradually realizes detection according to the idea
of “from big to small”, such as region-based convolutional neural networks (R-CNN)
series, spatial pyramid pooling network (SPPNet), etc.; the latter performs region category
judgment while generating prediction boxes, with low model complexity and fast detection
speed, including the YOLO series, single shot multibox detector (SSD), and RetinaNet [11].
The squeeze-and-excitation network (SENet) represents an attention mechanism proposed
by Momenta. In the attention mechanism, SENet is not an independent model design but
is only an optimization of the model. Generally, SENet is used in combination with other
factors [12].
In the actual object detection task, objects of different sizes are mixed together and
interfere with each other, which affects the performance of small object detection. Therefore,
Hu et al. [13] proposed a pixel-level balance (PLB) method, which improved the accuracy of
small object detection. Afsharirad and Seyedin [14] introduced the salient object detection
method using task simulation (SOD-TS) based on saliency detection algorithms. SOD-TS
can detect the salient object, which is the best response to the current task. This method
has a wide range of applications. Du et al. [15] proposed a correlation complement (CC)
module that combines the class representation vector with the relationships between the
categories in the dataset. Yu et al. [16] proposed a multiobject subspace projection sample-
weighted CEM (MSPSW-CEM) algorithm to solve the problem of spectral variability, which
causes very serious false detection and missed detection. MSPSW-CEM showed a better
detection performance than other object detection methods. With respect to adaptively
obtaining optimal object anchor scales for the object detection of high spatial resolution
remote sensing images (HSRIs), Dong et al. [17] proposed a novel object detection method
for HSRIs based on CNNs with optimal object anchor scales. Zhan et al. [18] proposed a
novel task for visual relationship detection and significance detection as a complement
to object detection and predicate detection. Meanwhile, they proposed a novel multitask
compositional network (MCN) that simultaneously performed object detection, predicate
detection, and significance detection.
Wang et al. [19] proposed a multiscale block fusion object detection method for large-
scale HSRIs, aiming at how to achieve optimal block object detection for large-scale HSRIs.
This method is superior to other single-scale image block detection methods. With respect
to the multiangle features of object orientation in HSRIs object detection, Dong et al. [20]
presented a novel HSRI object detection method based on a CNN with adaptive object
orientation features; the proposed method can more accurately detect objects with large
aspect ratios and densely distributed objects. Hou et al. [21] proposed a Kullback–Leibler
single shot multibox detection (KSSD) object detection algorithm to detect small- and
medium-sized objects. This algorithm has higher accuracy and stability than existing
detection algorithms. Xi et al. [22] proposed an infrared small-object detection method to
improve the capacity for detecting thermal objects in complex scenarios. When compared
with the current advanced models, this framework has better performance in infrared small object detection under complex background conditions. Objects in aerial images have the characteristics of small volume, dense distribution, and uncertain directions, which increase the difficulty of detection. In order to solve this problem, Koyun et al. [23] proposed a two-stage object detection framework called “Focus-and-Detect”. Kim et al. [24] proposed a novel object detection framework for obtaining robust object detection under occlusion. The framework is composed of an object detection framework and a plug-in bounding box (BB) estimator. Using the Pascal VOC, MS COCO, and KITTI datasets, it was proved that this framework improves the performance of object detection.
3. Object Detection Algorithms Based on Deep Learning
3.1. Classification of Mainstream Algorithms in Object Detection
In the field of object detection, the methods of object detection using deep learning technology have been gradually accepted by the majority of researchers. At present, object detection is mainly divided into two-stage methods and one-stage methods. These two methods have their own advantages and effects for different detection problems. In general, the two-stage method has better detection accuracy, while the one-stage method has a faster detection speed [25,26].
Two-stage object detection algorithms pay more attention to the extraction of high-quality features from object images to obtain a higher detection accuracy. Their process is shown in Figure 1. Classic two-stage object detection algorithms include R-CNN, Fast R-CNN, etc. First, the candidate boxes are extracted, and then they are classified and regressed [27].
Figure 1. Flow chart of two-stage object detection algorithms.
One-stage object detection algorithms only need one regression operation on the input image to predict the category and location information of the object image, so they have a fast detection speed. The detection flow chart of this type of algorithm is shown in Figure 2. Two-stage object detection algorithms, such as R-CNN, have shortcomings, such as large training parameters and long training times, which are not suitable for tasks with high real-time requirements; classical one-stage object detection algorithms, such as SSD, EfficientNet, and the YOLO series, only use convolutional neural networks to process the whole picture once, greatly reducing training parameters and processing times [28].
Figure 2. Flow chart of one-stage object detection algorithms.

3.2. SSD Object Detection Algorithm
3.2.1. SSD Algorithm Model Structure

The SSD algorithm extracts feature information from images through the VGG-16 network, extracting object features through convolution layers at multiple scales and completing co-ordinate determination and object classification for the feature information through a prediction module. Its structure is shown in Figure 3. The SSD algorithm makes a series of improvements to the VGG-16 network. First, the FC6, FC7, and FC8 fully connected layers of VGG-16 are deleted, and 3 × 3 Conv6 and 1 × 1 Conv7 convolution layers are designed to replace them so as to ensure that the receptive field of the feature map increases while keeping the feature scale the same, with the feature information extracted from Conv7 used for prediction processing; then, the pooling kernel scale of the fifth pooling layer is changed from 2 × 2 to 3 × 3 and the step size is changed from 2 to 1 to ensure that the feature layer extracted from the convolution layer after the fifth pooling layer has a high resolution, meaning that the network model can detect small objects with higher accuracy; finally, through the convolution layer Conv4_3, the feature information for multiscale feature prediction is extracted, adding additional convolution layers to the model and selecting down-sampling processing with different step sizes to improve the ability of the model to extract image features [29].

The SSD algorithm innovatively designed multiscale feature layer prediction processing on the basis of the original pyramid network structure of the CNN. Through these feature layers, features at all levels of the detection object can be extracted. Moreover, the object feature details contained in the lower feature layers are comprehensive, which can effectively detect small objects in the image. High-level features have high dimensions and strong semantic information, which can effectively detect large objects.

Figure 3. SSD network structure.
3.2.2. Prior Frame

Because the receptive fields of the feature maps of the different layers are different on the corresponding original images, prior frames with the same aspect ratios but different scales will be generated on the different feature layers. If the feature map scale is m × n and each grid contains k prior frames, then the feature map generates m × n × k prior frames. In addition, each prior box needs to complete the prediction of category confidence and border positioning co-ordinates (x, y, w, and h). If c categories are detected, (c + 4) prediction values of the corresponding prior boxes will be generated. At this time, the number of output prior boxes is (c + 4) × m × n × k, as shown in Figure 4.
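To make the prior-frame bookkeeping above concrete, the sketch below counts default boxes per feature map and computes the layer scales with the scaling formula given in Equation (1) below. The six feature-map sizes, the number of boxes per grid cell, and the 21-class setting are commonly cited SSD300-style values and are assumptions here, not figures stated in the text.

```python
# Sketch of SSD prior-frame bookkeeping (assumed SSD300-style settings).
feature_maps = [38, 19, 10, 5, 3, 1]      # m x n grid sizes of the six prediction layers
boxes_per_cell = [4, 6, 6, 6, 4, 4]       # k prior frames per grid cell (assumed)
num_classes = 21                          # c (20 VOC classes + background, assumed)

# Equation (1): S_k = S_min + (S_max - S_min) / (m - 1) * (k - 1)
s_min, s_max, m = 0.2, 0.9, len(feature_maps)
scales = [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

total_boxes = sum(f * f * k for f, k in zip(feature_maps, boxes_per_cell))
total_outputs = sum((num_classes + 4) * f * f * k
                    for f, k in zip(feature_maps, boxes_per_cell))

print(scales)         # layer scales from Equation (1)
print(total_boxes)    # 8732 prior frames in total
print(total_outputs)  # (c + 4) x m x n x k prediction values
```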
The scaling formula of the prior frame in the corresponding feature map is

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in \{1, 2, \ldots, m\} \tag{1}$$

S_k is the scale of the kth feature map, S_max is the scale of the highest-level feature map, and 0.9 is taken as the set value under normal conditions; S_min is the scale of the lowest-layer feature map, and 0.2 is normally taken as the set value; m refers to the number of feature map layers used for prediction, and k refers to the kth feature map layer.

If the number of network layers is low, it corresponds to a large-scale feature map with low receptive field feature information, so it is necessary to preset a smaller-scale prior frame to detect small objects. On the contrary, if the number of network layers is high, it corresponds to a small-scale feature map with large receptive field feature information, so it is necessary to preset a larger-scale prior frame to detect large objects. A total of 8732 prior frames are generated by the SSD model [30].

3.2.3. Border Regression

As shown in Figure 5, the red box P refers to the region proposal generated by training, the green box G refers to the ground truth of the object, and the blue box Ĝ refers to the regression window that is close to the ground truth after the region proposal is fine-tuned, that is, the prediction box; the central co-ordinate value and width-height scale of each prediction box are defined as (x, y, w, and h). The IOU value is improved by optimizing the gap between the prior frame and the real window to achieve accurate object positioning and detection.

Figure 5. Schematic diagram of border regression.

Border regression therefore seeks a mapping f such that

$$f(P_x, P_y, P_w, P_h) = (\hat{G}_x, \hat{G}_y, \hat{G}_w, \hat{G}_h) \approx (G_x, G_y, G_w, G_h) \tag{2}$$

Frame regression learning is mainly completed through coding and decoding. The coding formula is as follows:

$$l_x = (G_x - P_x)/P_w, \quad l_y = (G_y - P_y)/P_h, \quad l_w = \log(G_w/P_w), \quad l_h = \log(G_h/P_h) \tag{3}$$
Among them, l_x, l_y, l_w, and l_h are the conversion values of the real box in the format of a prior box to facilitate the calculation of the loss function.

Decoding is the reverse process of encoding. It mainly restores a prior frame to the image in the form of translation transformation and scaling, that is, reversely deducing the real frame positioning information (G_x, G_y, G_w, G_h) through the output values (l_x, l_y, l_w, l_h). In order to improve the proximity between the prior frame and the real frame, the overshoot parameters (v_x, v_y, v_w, v_h) are used for fine tuning during decoding. The decoding formula is

$$G_x = P_w(v_x \cdot l_x) + P_x, \quad G_y = P_h(v_y \cdot l_y) + P_y, \quad G_w = P_w \cdot \exp(v_w \cdot l_w), \quad G_h = P_h \cdot \exp(v_h \cdot l_h) \tag{4}$$
The coding part normalizes the error between the prior frame and the real frame,
which is conducive to the loss value calculation in the SSD network, while the decoding
part deduces the position information of the real frame through the location information of
the prior frame and the offset, which is the main step of frame regression learning.
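A minimal NumPy sketch of the coding and decoding steps in Equations (3) and (4). The boxes are assumed to be center-format (x, y, w, h) arrays, and the fine-tuning parameters (v_x, v_y, v_w, v_h) are set to 1 purely for illustration.

```python
import numpy as np

def encode(gt, prior):
    """Equation (3): turn a ground-truth box into offsets w.r.t. a prior box."""
    gx, gy, gw, gh = gt
    px, py, pw, ph = prior
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def decode(offsets, prior, v=(1.0, 1.0, 1.0, 1.0)):
    """Equation (4): recover box co-ordinates from predicted offsets."""
    lx, ly, lw, lh = offsets
    px, py, pw, ph = prior
    vx, vy, vw, vh = v
    return np.array([pw * (vx * lx) + px,
                     ph * (vy * ly) + py,
                     pw * np.exp(vw * lw),
                     ph * np.exp(vh * lh)])

prior = np.array([0.5, 0.5, 0.2, 0.3])
gt = np.array([0.52, 0.48, 0.25, 0.28])
assert np.allclose(decode(encode(gt, prior), prior), gt)  # decoding inverts encoding
```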
$$\hat{c}_i^p = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)} \tag{6}$$

where c_i^p refers to the detection probability when the corresponding prior box of the pth category object is the ith; ĉ_i^p refers to the c_i^p detection probability after regression; ĉ_i^0 refers to the detection probability when the category object is the background and its corresponding prior frame is the ith; d_ij^p refers to the true or false value of the ith prior box matching the pth category object of the jth real box, d_ij^p ∈ {0, 1}; if d_ij^p = 1, this means that the object in the prior box is the category of the real box; Pose refers to the number of positive samples, and Nega refers to the number of negative samples.
The location loss function of the SSD model uses the Smooth L1 function to calculate the loss of the location offset information of the prior frame relative to the center co-ordinates and the width and height of the real frame. The expression is as follows:

$$L_{loc}(d, l, g) = \sum_{i \in Pose}^{N} \sum_{m \in \{x,y,w,h\}} d_{ij}^{k}\, \text{Smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right), \qquad \text{Smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{7}$$

where l_i^m refers to the location information of the ith prior box; ĝ_j^m refers to the position information of the jth real frame after it passes the coding phase.
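A short sketch of the Smooth L1 term in Equation (7), applied element-wise to the offset error between a predicted location and the encoded ground truth; the shapes and values are illustrative only.

```python
import numpy as np

def smooth_l1(x):
    """Equation (7): 0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

# Offset error between a predicted location l_i and the encoded real box g_j.
err = np.array([0.3, -1.7, 0.05, 2.0])
loc_loss = smooth_l1(err).sum()   # summed over the (x, y, w, h) offsets
print(loc_loss)
```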
stitute the basic unit of the Darknet-53 network structure, where n represents the number
of basic units used by this layer. The residual structure enables the upper layer of the
network to directly skip two or more layers to connect to the subsequent network, which can alleviate the gradient problem caused by network deepening. When assuming that the network layer behind the shallow network is an identity mapping layer, directly fitting the potential identity mapping function H(x) of a layer is difficult, so the residual structure does not directly learn the object mapping but learns a residual F(x) = H(x) − x, making the original mapping H(x) = F(x) + x, as shown in Figure 6; when F(x) = 0, identity mapping H(x) = x can be realized.

Figure 6. Residual structure.
As shown in Figure 6, the two convolutional layers use a 1 × 1 convolution kernel and a 3 × 3 convolution kernel, respectively, of which the 1 × 1 convolution kernel is mainly used for channel expansion and reduction. The residual network first uses the 1 × 1 convolution to shrink the channels and then uses the 3 × 3 convolution to restore them. Its essence is the idea of matrix decomposition, which is used to reduce the number of parameters. This helps the convolution network reduce the amount of computation to a certain extent and makes the convolution network run faster and more efficiently.
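The following PyTorch sketch mirrors the residual unit described above: a 1 × 1 convolution shrinks the channels, a 3 × 3 convolution restores them, and the input is added back so that the block learns F(x) = H(x) − x. Layer choices such as BatchNorm and LeakyReLU follow common Darknet-53 implementations and are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """1x1 channel reduction -> 3x3 restoration, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        hidden = channels // 2                      # channel shrink via the 1x1 conv
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.conv2 = nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                    # H(x) = F(x) + x

x = torch.randn(1, 64, 52, 52)
print(ResidualUnit(64)(x).shape)                    # torch.Size([1, 64, 52, 52])
```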
3.3.3. Loss Function of YOLOv3
For the YOLO series algorithms, the loss function is the core of the algorithm and plays a key role in optimization. There are 6 prediction parameters for YOLOv3 object detection: (x, y, w, h) are the co-ordinates of the upper left vertex of the object box and its width and height, class is the category of the detected object, and confidence is the confidence level of the detected object; the formula of the loss function is
$$\begin{aligned}
loss(object) = \; & \lambda_{coord} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)\left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
& + \lambda_{coord} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)\left[(w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2\right] \\
& - \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{c}_i \log(c_i) + (1 - \hat{c}_i)\log(1 - c_i)\right] \\
& - \lambda_{noobj} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_i \log(C_i) + (1 - \hat{C}_i)\log(1 - C_i)\right] \\
& - \sum_{i=0}^{K \times K} I_{ij}^{obj} \sum_{c \in classes}\left[\hat{p}_i(c)\log(p_i(c)) + (1 - \hat{p}_i(c))\log(1 - p_i(c))\right]
\end{aligned} \tag{8}$$
The loss function consists of three parts: the first part is the error of the upper-left vertex co-ordinates of the object frame and the frame width and height. The BCE (binary cross entropy) loss function is used for the upper-left vertex co-ordinate error, and the MSE (mean square error) loss function is used for the frame width and height error; confidence error and category error are represented by the obj part and the class part of the formula, and the BCE error function is used. The loss function is an important evaluation index for network training results, and the model detection ability can be improved through the loss function optimization algorithm.
Figure 7. Structure of the SENet module.
In the compression section, the dimension of the input element feature map is H × W × C, where H, W, and C represent height, width, and number of channels, respectively. The function of the compression part is to compress the dimension from H × W × C to 1 × 1 × C; namely, H × W is compressed to 1 × 1. In the activation part, the 1 × 1 × C vector is fed into fully connected layers, which predict the importance of each channel and then excite the corresponding channel of the preceding feature map. A simple gating mechanism and the Sigmoid activation function are adopted.
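A compact PyTorch sketch of the squeeze-and-excitation operations just described: global average pooling compresses H × W × C to 1 × 1 × C, two fully connected layers with a Sigmoid gate predict per-channel importance, and the input feature map is rescaled channel-wise. The reduction ratio of 16 is a common default and an assumption here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)
        w = self.excite(w).view(b, c, 1, 1)
        return x * w                                    # reweight each channel

x = torch.randn(2, 256, 20, 20)
print(SEBlock(256)(x).shape)                            # torch.Size([2, 256, 20, 20])
```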
4.1.2. Spatial Attention Mechanism
The representative model of the spatial attention mechanism is the spatial transformer
network (STN), which can transform various deformation data in space and automatically
capture important regional features. It can ensure that the image can still obtain the same
results as the original image after clipping, translation, or rotation. The STN network in-
cludes a local network, parametric network sampling (network generator), and differential
image sampling.
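As a concrete illustration of the differentiable sampling used by STN-style modules, the PyTorch sketch below warps a feature map with an affine transformation; the specific 2 × 3 matrix is arbitrary and only for demonstration.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)                       # input feature map
theta = torch.tensor([[[1.0, 0.0, 0.1],             # 2 x 3 affine matrix: small translation
                       [0.0, 1.0, -0.1]]])
grid = F.affine_grid(theta, x.shape, align_corners=False)   # sampling grid from theta
warped = F.grid_sample(x, grid, align_corners=False)        # differentiable image sampling
print(warped.shape)                                  # torch.Size([1, 3, 32, 32])
```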
Figure 8. Model structure of CBAM.
4.1.4. Channel Attention Module CAM

The input of CAM is a feature map, and the dimension is set as H × W × C, where H is the height of the feature map, W is the width, and C is the number of channels. The processing is as follows: first, global max pooling and global average pooling are applied to the input feature map (pooling the spatial dimensions compresses them and facilitates the later learning of channel characteristics). Then, the two pooled results are sent to a shared multilayer perceptron (MLP) for learning (based on the channel dimension, the MLP learns the importance of each channel). Finally, the two MLP outputs are added element-wise, and the final channel attention value is obtained through the mapping of the Sigmoid function. Figure 9 shows the channel attention module.
Figure 9. Channel attention module.
The calculation formula of channel attention is as follows:

$$M_c(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) = \sigma\big(W_1(W_0(F_{avg}^c)) + W_1(W_0(F_{max}^c))\big) \tag{9}$$

In the above formula, M_c is the channel attention, MLP is the shared multilayer perceptron, F is the input feature, AvgPool is average pooling, and MaxPool is maximum pooling.
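A minimal PyTorch sketch of Equation (9): max-pooled and average-pooled channel descriptors pass through a shared MLP (W_1, W_0), the two outputs are added, and a Sigmoid produces the channel attention map M_c. The reduction ratio is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared_mlp = nn.Sequential(               # W0 then W1, shared by both branches
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.shared_mlp(x.mean(dim=(2, 3)))      # MLP(AvgPool(F))
        mx = self.shared_mlp(x.amax(dim=(2, 3)))       # MLP(MaxPool(F))
        mc = self.sigmoid(avg + mx).view(b, c, 1, 1)   # Equation (9)
        return x * mc

x = torch.randn(2, 128, 40, 40)
print(ChannelAttention(128)(x).shape)                  # torch.Size([2, 128, 40, 40])
```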
The calculation formula of spatial attention is as follows:

$$M_s(F) = \sigma\big(f^{7\times7}([AvgPool(F); MaxPool(F)])\big) = \sigma\big(f^{7\times7}([F_{avg}^s; F_{max}^s])\big) \tag{10}$$

In the above formula, M_s is the spatial attention, F is the input feature, AvgPool is average pooling, MaxPool is maximum pooling, and f^{7×7} denotes a convolutional layer with a 7 × 7 kernel.

4.2. The Construction of MAO-YOLOv5 Model Feature Extraction Structure

YOLOv5 is an improvement on the YOLO series algorithms, and the detection principle of the YOLO series algorithms is similar. First, take the whole input image as the input of the network and divide it into multiple N × N grids of the same size; each grid can predict B bounding boxes, with a total of N × N × B candidates, where each of the boxes contains five variables (pc, bx, by, bh, bw). The original YOLO candidate box has
serious defects, and its width and height are completely unrestricted, which easily leads to
gradients that are out of control and unstable. YOLOv5 fixes this error and ensures that
4.2. Construction of MAO-YOLOv5 Model Feature Extraction Structure
the center point remains unchanged. Therefore, the current equation for YOLOv5 limits
the YOLOv5
multiplesisofan improvement
the anchor pointtofrom the YOLO seriesof
a minimum algorithms,
0 to a maximumwhere the of 4,detection prin-
and the anchor
ciple
frame object matching is also updated based on the width and height multiples. Setthe
of the YOLO series algorithms is similar. First, take the whole input image as the
input of the network
corresponding and divide
confidence it into
threshold multiple
through N × N gridssuppression
non-maximum of the same(NMS) size; each
and grid
select
can
thepredict
anchorBbox bounding BOXs, withconfidence
of the maximum a total of Nto×obtain
N × B candidates,
the prediction where box.each of the boxes
contains five variables (pc, bx, by, bh, bw). The original
YOLOv5 has two CSP architectures. One is with X residual component YOLO candidate box has (Resunit)
serious
defects, and its width and height are completely unrestricted, which
modules, and the other is to replace Resunit with two CBL modules. Resunit is composed easily leads to gradi-of
ents that are out of control and unstable. YOLOv5 fixes this error
two CBL convolution modules+residual networks, which are mainly used in the backbone and ensures that the
center pointThe
network. remains
backbone unchanged.
networkTherefore, the currentofequation
is mainly composed for YOLOv5
CSPdarknet+SPP. limits the
Backbone is a
multiples of the anchor
deeper network system.point from aadding
Therefore, minimumResunitof 0can
to aimprove
maximum the of 4, andvalue
gradient the anchor
during
frame object
reverse matching between
transmission is also updated
the layers,based
thusonpreventing
the width and height multiples.
the gradient generated Set
bythe
the
corresponding
increase fromconfidence threshold through
gradually disappearing so asnon-maximum
to obtain moresuppression
fine-grained(NMS) and se-
characteristics
lect the anchor
without box of
worrying the maximum
about confidence to obtain the prediction box.
network fading.
YOLOv5
The networkhas two CSP architectures.
structure of MAO-YOLOv5 One isisdivided
with X into
residual component
four parts: input,(Resunit)
backbone,
neck module,
modules, and theand head;
other it mainly
is to replaceadvances
Resunit withthe feature
two CBL extraction
modules.operation
Resunit isand needs to
composed
ofadjust the step
two CBL size of the
convolution convolution core networks,
modules+residual of the last convolution
which are mainly structureusedin in
thethe
Backbone
back-
structure
bone to 1 so
network. Theasbackbone
to achievenetwork
the fusion operation
is mainly of the features
composed with the same Backbone
of CSPdarknet+SPP. scale as the
isneck layer.network
a deeper In the original
system.YOLOv5
Therefore,structure,
addingthe convolution
Resunit structure,
can improve the encapsulated
gradient valueby
2D convolution, the BN layer, and the SiLU activation function,
during reverse transmission between the layers, thus preventing the gradient is used for down-sampling
generated
operations, and the sliding operation step of its convolution core
by the increase from gradually disappearing so as to obtain more fine-grained character- is 2.
After MAO-YOLOv5
istics without worrying about conducts × 1 convolution and 3 × 3 convolution operations,
network1 fading.
a new SE module is added. The SE module first pools the input feature map globally,
and then, through a two-layer full connection structure, the correlation between complex
channels can be established. Through weight normalization and channel weighting, the
channels with high-weight ratios will receive more attention so as to achieve the goal of
improving channel attention. The feature map output from the bottle-neck layer is further
input into the neck of the network. The neck is constructed as a pyramid network structure.
The function is to divide the detector head into three different sizes, namely, large, medium,
and small while ensuring that the underlying information is not lost so that the network can
have good detection results for targets of different sizes. The neck contains multiple CSP
bottleneck layers, and each CSP bottleneck layer contains several bottleneck layers added
with SE modules. Therefore, the CSP bottleneck layer enriches the gradient combination
of the architecture and improves the speed and accuracy of reasoning while reducing the
amount of network computation and computing costs.
Due to the sliding window mechanism of the MAO-YOLOv5 network, a target may
generate multiple detection frames. In order to make the detection results more accurate,
non-maximum suppression (NMS) can be applied to find the detection frame with the
maximum probability, and then a judgment can be made whether the intersection ratio
of other detection frames and the detection frame is greater than the set threshold. If it is
greater than the threshold, remove the detection box. If it is less than the threshold value,
the detection frame will be retained, and it will be merged with the original detection frame,
and finally, the rectangle processing will be performed.
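A plain NumPy sketch of the non-maximum suppression step described above: keep the highest-confidence box, drop any remaining box whose intersection ratio with it exceeds the threshold, and repeat. The 0.5 threshold and the corner-format boxes are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against many; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        overlaps = iou(boxes[i], boxes[rest])
        order = rest[overlaps <= thresh]      # discard boxes overlapping too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                     # keeps the first and third boxes
```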
MAO-YOLOv5 advances the feature extraction operation and the corresponding
feature scale extracted is twice the feature scale extracted from the YOLOv5 Backbone
structure. It cannot be fused with the corresponding layer of the FPN feature extraction
of the neck layer. The convolution kernel step of the last convolution structure in the
Backbone structure needs to be modified to 1 to successfully achieve the operation of
feature extraction in advance. The modules of the MAO-YOLOv5 network structure are
shown in Figure 11 below.
As can be seen from Figure 11, MAO-YOLOv5 mainly changes the feature extraction
structure of the Backbone part (cf. Figure 11a) of the YOLOv5 benchmark network. The
original network feature extraction operation is carried out in advance. The object features
are extracted from the first C3 module of the Backbone structure and horizontally integrated
into the feature layer of the same scale at the neck layer (cf. Figure 11b); After the SPPF
structure (cf. Figure 11c), the SENet attention mechanism is introduced to reconstruct the
feature weight of the detected object and background information; we added a context
feature fusion structure at the head end of YOLOv5 network. This structure fuses the three
feature maps used to predict the object at the head end, uses the transposed convolution to
transform the width and height scales of the deepest and subdeep features, and sets the
channel scale of both to half of the shallow feature channel scale as the context information
and shallow feature splicing for feature fusion.
where o_ij ∈ [0, 1], indicating whether the j-type object exists in prediction bounding box i, c_ij is the predicted value, ĉ_ij is the prediction confidence of c_ij obtained by the Sigmoid function, and N_pos is the number of positive samples.
Binary cross entropy is used for the confidence loss:

$$L_{conf}(o, c) = -\frac{\sum_i \big(o_i \ln(\hat{c}_i) + (1 - o_i)\ln(1 - \hat{c}_i)\big)}{N}, \qquad \hat{c}_i = \text{sigmoid}(c_i) \tag{12}$$
where o_i ∈ [0, 1], indicating the IoU of the predicted object bounding box and the real object bounding box, c_i is the predicted value, ĉ_i is the prediction confidence obtained by the Sigmoid function, and N is the number of positive and negative samples; the category loss function also uses binary cross entropy loss.
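A small sketch of the confidence loss in Equation (12), assuming o_i holds the IoU-based targets and c_i the raw confidence logits; names and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_loss(o, c):
    """Equation (12): averaged binary cross entropy over positive and negative samples."""
    c_hat = sigmoid(c)
    return -np.mean(o * np.log(c_hat) + (1 - o) * np.log(1 - c_hat))

o = np.array([0.9, 0.0, 0.7, 0.0])   # IoU of predicted vs. real box (0 for negatives)
c = np.array([2.1, -1.5, 0.3, -2.0]) # predicted confidence logits
print(confidence_loss(o, c))
```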
Figure 11. MAO-YOLOv5 network structure. (a) Backbone module; (b) Neck module; (c) Focus and SPP module.
As can be seen
5. Experimental from and
Process Figure 11, MAO-YOLOv5
Result Analysis mainly changes the feature extraction
structure
Two of the Backbone
datasets, VOC2007 part (cf.VOC2012,
and Figure 11a) of the
were YOLOv5
utilized benchmark
to compare network. The
the advantages of
original
MAO-YOLOv5 networkcompared
feature extraction
to YOLOv3, operation
YOLOv5, is carried
and SSD. outFurthermore,
in advance. Thesomeobject
of thefeatures
typical
are extracted
pictures of thefrom the detection
object first C3 module
results of
arethe Backbone
listed structure
to visualize theand
highhorizontally
performance inte-
of
grated into
MAO-YOLOv5. the feature layer of the same scale at the neck layer (cf. Figure 11b); After the
SPPF structure (cf. Figure 11c), the SENet attention mechanism is introduced to recon-
5.1. Experiment
struct the feature Configuration
weight of the anddetected
Dataset object and background information; we added a
context
Thefeature
hardwarefusion structure
equipment of at
thethe head endwas
experiment of YOLOv5
an AMD network. This structure
Ryzen 7 5800H [email protected] fuses
the three GeForce
NVIDIA feature maps
GTX1650used GPU,
to predict
with the
4G object at the head
GPU memory. end, uses
PASCAL VOC theistransposed
a popular
convolution to transform
universal detection datasetthe(http://host.robots.ox.ac.uk/pascal/VOC/)
width and height scales of the deepest and subdeep (accessedfeatures,
on 21
and sets the channel scale of both to half of the shallow feature channel
October 2022), so this chapter conducts experimental training using the VOC dataset. The scale as the context
information
VOC datasetand shallow
includes feature splicing
detection, for feature
segmentation, human fusion.
body layout, action classification
(Object Classification, Object Detection, Object Segmentation, Human Layout, Action
4.3. Modified Loss
Classification), Function
etc. VOC2007 contains 9963 labeled images, which are composed of three
parts:Intrain/val/test, and 24,640
the object detection task ofobjects
YOLOv5, werein marked. VOC the
order to make 2012predicted
containsvalue
20 types of
of the
objects, 11,530 images in train and val, 27,450 target detection tags, and 6929
model closer to the real value, even if the prediction box is closer to the real box, three losssegmentation
tags. During
functions training, VOC2007
are introduced and VOC2012
for optimization. One isare often put together
classification loss, thefor joint
other training to
is confidence
increase
loss, andthethenumber
final oneofissamples
regressionso that
loss,the
thatmodel can learnbox
is, boundary more features. loss. GIou loss
positioning
is used as the loss of the bounding box. The probability of this class and the loss to the
5.2. Evaluating Indicators
object value can be calculated by using binary cross entropy and the logits loss function.
In this subsection,
The category thecross
loss is binary experiment
entropy uses
loss. precision,
The formulamAP
is as(mean average precision),
follows:
frames per second (FPS), and P-R (precision recall) curves to evaluate the performance
of the four object detection algorithms. Suppose that TP (True Positive) means that the
positive samples are correctly classified into positive samples, FP (False Positive) means
that negative samples are wrongly classified into negative samples, FN (False Negative)
means that positive samples are wrongly classified into negative samples, and TN (True
Electronics 2023, 12, 1515 16 of 21
Negative) means that negative samples are correctly classified into negative samples. The
indicators are calculated as follows:
(1) P-R curve
The P-R curve is a curve made with precision as the ordinate and the recall as the
abscissa. The precision and recall are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{13}$$
The P-R curve can judge the performance of the model, and a classifier with good performance can ensure that the Precision value remains high as the Recall value increases; however, a classifier with poor performance may lose more Precision in order to improve the Recall value. In addition, the P-R curve of a classifier with good performance has a larger area under the curve.
(2) mAP
mAP stands for the Average Precision (AP) of all categories. The AP value is obtained by calculating the area under the P-R curve. Assuming that AP_i represents the average recognition accuracy of the ith category, the calculation of mAP is as follows:

$$mAP = \frac{\sum_{i=1}^{n} AP_i}{n} \tag{14}$$
n represents the total number of categories tested.
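A short sketch of how precision-recall points can be turned into per-category AP and then mAP as in Equations (13) and (14); the trapezoidal integration of the P-R curve is one common choice and an assumption here, as are the example values.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the P-R curve for one category (trapezoidal rule, assumed)."""
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    order = np.argsort(r)
    p, r = p[order], r[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(per_class_pr):
    """Equation (14): mAP is the mean of the per-category AP values."""
    aps = [average_precision(p, r) for p, r in per_class_pr]
    return sum(aps) / len(aps)

# Illustrative P-R points for two categories.
per_class_pr = [
    ([1.0, 0.9, 0.75, 0.6], [0.1, 0.4, 0.7, 1.0]),
    ([1.0, 0.8, 0.5], [0.2, 0.5, 0.9]),
]
print(mean_average_precision(per_class_pr))
```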
Table 1. mAP and FPS of different models on the PASCAL VOC dataset.
Figure 12. P-R curves of MAO-YOLOv5 in different categories.
MAO-YOLOv5's improvements compared with other object detection algorithms mainly include the following: picture data are enhanced using Mosaic technology; adaptive image compression technology can scale images of different scales to a fixed scale, which is convenient for network training; the Backbone network adds a focus structure and a feature extraction network (CSP) structure; the neck adds an FPN+PAN module for network feature fusion; and GIOU is used in the output (head), with GIOU_Loss as the loss function of the bounding box. GIOU is an improvement of IOU. IOU represents the intersection and union ratio between the real box A and the prediction box B. The expression is

$$IOU = \frac{A \cap B}{A \cup B} \tag{15}$$
However, there are two problems with IOU. If the loss is 0, the model can be updated
by training, and the parameters can be optimized. The degree of overlap between the two
cannot be accurately reflected. In order to solve the problem of gradient disappearance
Electronics 2023, 12, 1515 18 of 21
Electronics 2023, 12, x FOR PEER REVIEW 19 of 22
However, there are two problems with IOU. If the two boxes do not overlap, the loss is 0, so the model cannot be updated by training and the parameters cannot be optimized; in addition, the degree of overlap between the two boxes cannot be accurately reflected. In order to solve the problem of gradient disappearance without overlap, GIOU adds a penalty term on the basis of IOU, which can better reflect the closeness and coincidence of two boxes than IOU. The GIOU expression is as follows:

GIOU = IOU − |C \ (A ∪ B)| / |C|    (16)

where C denotes the minimum enclosing box of A and B. DIOU uses the ratio of the square of the distance between the center points of the real box and the prediction box to the square of the diagonal length of the minimum enclosing box as part of the measurement standard. The calculation method and loss of DIOU are as follows:

DIOU(B, B^gt) = IOU(B, B^gt) − ρ²(B, B^gt) / C²
L_DIOU(B, B^gt) = 1 − DIOU(B, B^gt)    (17)
DIOU solves the problem that IOU cannot accurately reflect the coincidence between two frames, making the center point of the prediction frame close to the center point of the real frame. At the same time, DIOU can converge faster than GIOU.
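To make Equations (16) and (17) concrete, the sketch below evaluates GIOU and DIOU for a pair of axis-aligned boxes under the standard definitions (C is the minimum enclosing box of A and B, ρ the distance between the box centers); it is an illustrative example, not the training code of MAO-YOLOv5:

def giou_diou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Minimum enclosing box C of A and B
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # Equation (16): GIOU = IOU - |C \ (A ∪ B)| / |C|
    giou = iou - (c_area - union) / c_area

    # Equation (17): DIOU = IOU - ρ²(B, B^gt) / C², C² = squared diagonal of the enclosing box
    center_a = ((box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2)
    center_b = ((box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2)
    rho2 = (center_a[0] - center_b[0]) ** 2 + (center_a[1] - center_b[1]) ** 2
    diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - rho2 / diag2
    return giou, diou

giou, diou = giou_diou((0, 0, 2, 2), (3, 3, 5, 5))   # non-overlapping boxes
print(giou, 1 - diou)   # GIOU = -0.68, DIOU loss L = 1 - DIOU = 1.36

In the non-overlapping case, IOU is 0 and gives no useful signal, whereas GIOU and DIOU still vary with the boxes' relative positions, which is the behavior the text above describes.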
Figure 13 shows the detection effect of the three object detection algorithms in the same picture.
Figure 14. Detection effect of MAO-YOLOv5 in different environments.
The results show that MAO-YOLOv5 can detect multiscale objects and is less affected by the background, correctly matching the object with its corresponding category, which can meet the effective object detection requirements. MAO-YOLOv5 can effectively improve the detection accuracy of complex multiobjects and small objects. This structure carries out the initial feature extraction operation of YOLOv5 in advance to obtain more accurate location information of complex multiobjects and small objects.
As expected, the FPS of MAO-YOLOv5 (23.07) is lower than that of YOLOv5 (34.69) owing to the integration of the attention mechanism module, but it is still higher than that of the other methods and has certain benefits. More research will be undertaken to understand how to increase the FPS in the future.
6. Conclusions and Future Work
The rise of deep learning has promoted the rapid development of computer vision. Although current object detection algorithms based on deep learning have solved many practical problems, their accuracy and speed can still be improved by optimizing the current models. This paper first analyzes the object detection algorithms SSD, YOLOv3, and YOLOv5 based on deep neural networks and focuses on the network structure, loss function, and anchor frame of the YOLOv5 model with good performance. On the basis of the above research, an MAO-YOLOv5 model based on the attention mechanism and context feature fusion is proposed. This model adds the SENet attention mechanism to the Backbone that optimizes the YOLOv5 structure and, at the same time, adds a context feature-fusion structure. The deep semantic information is fused as the background information of shallow object features to solve the problem of the insufficient extraction of object semantic information. In the comparative experiment, this paper combines the PASCAL VOC 2007 and PASCAL VOC 2012 datasets as the entire dataset of the experiment, using them to train different deep neural network models. The experimental results show that the recognition accuracy of the proposed MAO-YOLOv5 model is better than that of the original YOLOv5 model, and its recognition accuracy is also better than that of other mainstream object detection algorithms.
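As a minimal sketch of the SENet-style channel attention mentioned above (a generic squeeze-and-excitation block with an assumed reduction ratio of 16, not the exact module integrated into the MAO-YOLOv5 Backbone), the mechanism can be written in PyTorch as follows:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                        # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # scale each channel by its learned weight

# Example: apply SE to a feature map with 64 channels
feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])

Such a block would reweight the channels of Backbone feature maps so that informative channels are enhanced and redundant ones are suppressed, which is the channel-attention behavior described for MAO-YOLOv5.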
Image object detection should also consider the influence of object scale, differences in light brightness, multiobject overlap, and other factors, so the performance of deep neural networks should be further improved in follow-up work. In addition, the dataset used in this paper has shortcomings, such as an imbalance in the number of objects across categories, which may affect detection accuracy. We will also conduct further research into newer YOLO versions, as the YOLOv6, YOLOv7, and YOLOv8 models are now available. These are the areas that need to be further improved in future research work.
Author Contributions: Conceptualization, G.S. and J.X.; writing—original draft preparation, G.S.; writing—review and editing, S.W. and J.X.; project administration, J.X. All authors have read and agreed to the published version of the manuscript.
Funding: This paper was funded by the High-level Talents Funding Project of Hebei Province (Grant No. A202105006) and the Hebei Provincial Higher Education Science and Technology Research Key Project (Grant No. ZD2021317).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data (PASCAL VOC 2007 and 2012) supporting this paper are available at http://host.robots.ox.ac.uk/pascal/VOC/ (accessed on 21 October 2022).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wu, Y.; Zhang, H.; Li, Y.; Yang, Y.; Yuan, D. Video Object Detection Guided by Object Blur Evaluation. IEEE Access 2020, 8,
208554–208565. [CrossRef]
2. Zhang, Q.; Wan, C.; Han, W.; Bian, S. Towards a fast and accurate road object detection algorithm based on convolutional neural
networks. J. Electron. Imaging 2018, 27, 053005. [CrossRef]
3. Kaur, J.; Singh, W. Tools, techniques, datasets and application areas for object detection in an image: A review. Multimed. Tools
Appl. 2022, 81, 38297–38351. [CrossRef]
4. Zhang, Z.; Lu, X.; Liu, F. ViT-YOLO: Transformer-based YOLO for object detection. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2799–2808.
5. Silva, L.P.E.; Batista, J.C.; Bellon, O.R.P.; Silva, L. YOLO-FD: YOLO for face detection. In Proceedings of the 24th Iberoamerican Congress on Pattern Recognition (CIARP), Havana, Cuba, 28–31 October 2019; Volume 11896, pp. 209–218.
6. Yan, B.; Li, J.; Yang, Z.; Zhang, X.; Hao, X. AIE-YOLO: Auxiliary Information Enhanced YOLO for Small Object Detection. Sensors
2022, 22, 8221. [CrossRef] [PubMed]
7. Ye, J.; Yuan, Z.; Qian, C.; Li, X. CAA-YOLO: Combined-Attention-Augmented YOLO for Infrared Ocean Ships Detection. Sensors
2022, 22, 3782. [CrossRef]
8. Wang, K.; Liu, M. YOLO-Anti: YOLO-based counterattack model for unseen congested object detection. Pattern Recognit. 2022,
131, 108814. [CrossRef]
9. Xu, P. Progress of Object detection: Methods and future directions. In Proceedings of the 2nd IYSF Academic Symposium on
Artificial Intelligence and Computer Engineering, Xi’an, China, 8–10 October 2021; Volume 12079.
10. Murthy, C.B.; Hashmi, M.F.; Bokde, N.D.; Geem, Z.W. Investigations of Object Detection in Images/Videos Using Various Deep
Learning Techniques and Embedded Platforms—A Comprehensive Review. Appl. Sci. 2020, 10, 3280. [CrossRef]
11. Ma, D.W.; Wu, X.J.; Yang, H. Efficient Small Object Detection with an Improved Region Proposal Networks. In Proceedings of the
5th International Conference on Electrical Engineering, Control and Robotics (EECR), Guangzhou, China, 12–14 January 2019;
Volume 533, p. 012062. [CrossRef]
12. Fang, F.; Li, L.; Zhu, H.; Lim, J.-H. Combining Faster R-CNN and Model-Driven Clustering for Elongated Object Detection. IEEE
Trans. Image Process. 2019, 29, 2052–2065. [CrossRef]
13. Hu, B.; Liu, Y.; Chu, P.; Tong, M.; Kong, Q. Small Object Detection via Pixel Level Balancing With Applications to Blood Cell
Detection. Front. Physiol. 2022, 13, 911297. [CrossRef]
14. Afsharirad, H.; Seyedin, S.A. Salient object detection using the phase information and object model. Multimed. Tools Appl. 2019,
78, 19061–19080. [CrossRef]
15. Du, L.; Sun, X.; Dong, J. One-Stage Object Detection with Graph Convolutional Networks. In Proceedings of the 12th International
Conference on Graphics and Image Processing (ICGIP), Xi’an, China, 13–15 November 2020; Volume 11720.
16. Yu, L.; Lan, J.; Zeng, Y.; Zou, J.; Niu, B. One hyperspectral object detection algorithm for solving spectral variability problems of
the same object in different conditions. J. Appl. Remote Sens. 2019, 13, 026514. [CrossRef]
17. Dong, Z.; Liu, Y.; Feng, Y.; Wang, Y.; Xu, W.; Chen, Y.; Tang, Q. Object Detection Method for High Resolution Remote Sensing
Imagery Based on Convolutional Neural Networks with Optimal Object Anchor Scales. Int. J. Remote Sens. 2022, 43, 2677–2698.
[CrossRef]
18. Zhan, Y.; Yu, J.; Yu, T.; Tao, D. Multi-task Compositional Network for Visual Relationship Detection. Int. J. Comput. Vis. 2020, 128,
2146–2165. [CrossRef]
19. Wang, Y.; Dong, Z.; Zhu, Y. Multiscale Block Fusion Object Detection Method for Large-Scale High-Resolution Remote Sensing
Imagery. IEEE Access 2019, 7, 99530–99539. [CrossRef]
20. Dong, Z.; Wang, M.; Wang, Y.; Liu, Y.; Feng, Y.; Xu, W. Multi-Oriented Object Detection in High-Resolution Remote Sensing
Imagery Based on Convolutional Neural Networks with Adaptive Object Orientation Features. Remote Sens. 2022, 14, 950.
[CrossRef]
21. Hou, Q.; Xing, J. KSSD: Single-stage multi-object detection algorithm with higher accuracy. IET Image Process. 2020, 14, 3651–3661.
[CrossRef]
22. Xi, X.; Wang, J.; Li, F.; Li, D. IRSDet: Infrared Small-Object Detection Network Based on Sparse-Skip Connection and Guide Maps.
Electronics 2022, 11, 2154. [CrossRef]
23. Koyun, O.C.; Keser, R.K.; Akkaya, I.B.; Töreyin, B.U. Focus-and-Detect: A small object detection framework for aerial images.
Signal Process. Image Commun. 2022, 104, 116675. [CrossRef]
24. Kim, J.U.; Kwon, J.; Kim, H.G.; Ro, Y.M. BBC Net: Bounding-Box Critic Network for Occlusion-Robust Object Detection. IEEE
Trans. Circuits Syst. Video Technol. 2019, 30, 1037–1050. [CrossRef]
25. Lee, D.-H. CNN-based single object detection and tracking in videos and its application to drone detection. Multimed. Tools Appl.
2020, 80, 34237–34248. [CrossRef]
26. Wu, T.; Liu, Z.; Zhou, X.; Li, K. Spatiotemporal salient object detection by integrating with objectness. Multimed. Tools Appl. 2017,
77, 19481–19498. [CrossRef]
27. Wang, C.; Yu, C.; Song, M.; Wang, Y. Salient Object Detection Method Based on Multiple Semantic Features. In Proceedings of the
9th International Conference on Graphic and Image Processing (ICGIP), Ocean Univ China, Acad Exchange Ctr, Qingdao, China,
14–16 October 2017; Volume 10615.
28. Kang, S. Research on Intelligent Video Detection of Small Objects Based on Deep Learning Intelligent Algorithm. Comput. Intell.
Neurosci. 2022, 2022, 3843155. [CrossRef] [PubMed]
29. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020,
97, 103910. [CrossRef]
30. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.