Article
An Image Object Detection Model Based on Mixed Attention
Mechanism Optimized YOLOv5
Guangming Sun 1,2 , Shuo Wang 1 and Jiangjian Xie 3, *
1 Department of Electrical and Information Engineering, Hebei Jiaotong Vocational and Technical College,
Shijiazhuang 050051, China
2 Road Traffic Perception and Intelligent Application Technology R&D Center of Universities in Hebei Province,
Shijiazhuang 050051, China
3 School of Technology, Beijing Forestry University, Beijing 100083, China
* Correspondence: [email protected]
Abstract: Object detection in complex environments is one of the more difficult problems in the field of computer vision, and it draws on other key technologies such as pattern recognition, artificial intelligence, and digital image processing. However, because the environment can be complex, changeable, highly varied, and easily confused with the target, the target is easily affected by factors such as insufficient light, partial occlusion, and background interference, making the detection of multiple targets extremely difficult and the robustness of the algorithm low.
How to make full use of the rich spatial information and deep texture information in an image to
accurately identify the target type and location is an urgent problem to be solved. The emergence of
deep neural networks provides an effective way for image feature extraction and full utilization. By
aiming at the above problems, this paper proposes an object detection model based on the mixed
attention mechanism optimization of YOLOv5 (MAO-YOLOv5). The proposed method fuses the
local features and global features in an image so as to better enrich the expression ability of the feature
map and more effectively detect objects with large differences in size within the image. Then, the
attention mechanism is added to the feature map to weigh each channel, enhance the key features,
remove the redundant features, and improve the recognition ability of the feature network towards
the target object and background. The results show that the proposed network model has higher
Citation: Sun, G.; Wang, S.; Xie, J. An
precision and a faster running speed and can perform better in object-detection tasks.
Image Object Detection Model Based
on Mixed Attention Mechanism Keywords: deep neural network; object detection; YOLOv5; context information; attention
Optimized YOLOv5. Electronics 2023, mechanism
12, 1515. https://doi.org/10.3390/
electronics12071515
relationship between objects, that is, context information, are beneficial to target detection.
Since object detection algorithms based on manual feature extraction can only extract shallow feature information, it is difficult for them to capture deeper semantic features. In order to improve detection accuracy, it is necessary to build a more complex feature extraction model [3]. Although object detection technology has gone through many years of theoretical research and technical application alongside the continuous development of society, the existing technology still needs to be further explored to meet new requirements in different fields; on the basis of continuously enhanced performance, the design also needs to become simpler, more scientific, and more robust.
The collected images are mainly affected by various internal and external factors.
Internal factors include window size selection and variable target shape. The selection of
the target window size used in some algorithms is a difficult problem because the window
size is usually predefined, and the local difference is invariant [4,5]. However, if the window size is too large, slight changes in texture features may not be detected, and more computation is consumed. If the window size is too small, some interfering pixels may be detected as target pixels, such as noise or pixels affected by lighting changes or other factors [6]. The target may also be in high-speed motion when shooting, so the imaging will show deformation and scale changes, and high robustness is therefore required of the target detection algorithm. External factors include background noise and various detection equipment. In
a complex background, because the target strength is weak, it is partially blocked by the
interfering object or buried in the clutter and noise [7]. In these cases, it is difficult to
separate the target from the complex and noisy background. The farther the target is from
the acquisition point, the smaller the image area of the target is. At the same time, the
worse the image quality is and the more difficult it is to detect the target. For the target
collection device, different collection devices will be set according to different scenes. In
order to estimate the relative position and absolute position of the target, it is necessary to
obtain the relative distance between the target and the acquisition equipment by means of
ranging [8].
This paper analyzes the object detection algorithm based on deep learning, and the
research contents mainly include:
1. In this paper, the YOLOv5 feature extraction network is improved, and the feature
map extraction operation is advanced. If it is a small object, the deeper the network
level is, the less semantic information is retained by the small object. By extracting
the feature map in advance, more abundant location information can be obtained
to improve the problem of the feature loss of object information as the network
level deepens;
2. The network model proposed in this paper adds a self-attention mechanism. The
attention mechanism can improve the object feature weight, reduce the background
feature weight, let the model obtain the object area that needs to be focused on, and
reconstruct the features;
3. The network model proposed in this paper uses a context feature-fusion structure. Deep feature information with rich semantics but unclear location information is fused with shallow feature information that has clear location information but weaker semantics, alleviating the detection difficulty caused by complex multi-object scenes;
4. The remainder of this paper is organized as follows: Section 2 discusses work related
to image object detection, followed, in Section 3, by the object detection algorithms
based on deep learning. In Section 4, the proposed method is addressed along with
the considerations for the analysis of this paper. Section 5 presents the experimental
results and analysis. Finally, in Section 6, a conclusion is drawn.
2. Related Work
As a basic task in computer vision, object detection has been a hot topic in academic
research in recent years. From traditional image processing methods to deep learning-
based methods, from single-stage and two-stage object detection frameworks based on
anchor frames to anchor-free object detection frameworks, researchers have
explored better object detection methods from various dimensions. In recent years, the
object detection algorithm has been constantly updated iteratively. It has been constantly
challenged between solving new problems and finding new problems. It has weighed
between bigger and stronger algorithms and lighter and faster algorithms. With the joint
efforts of scholars, the overall research framework of the algorithm has become increasingly
mature [9]. As the downstream task of image classification, the object detection algorithm
needs to complete two tasks: one is to generate the object detection frame to be recognized;
the second is to accurately judge the types of objects in the detection frame. Traditional
object detection methods use artificial features to represent complex features, which are
gradually reduced due to their weak applicability and low detection performance; the
object detection method based on deep learning has been developed with the proposed
convolutional neural network, which eliminates the disadvantages of traditional object
detection, and it is widely used [10]. At present, in the field of deep learning, object
detection algorithms can be divided into two categories: two-stage object detection and
one-stage object detection. The former gradually realizes detection according to the idea
of “from big to small”, such as region-based convolutional neural networks (R-CNN)
series, spatial pyramid pooling network (SPPNet), etc.; the latter performs region category
judgment while generating prediction boxes, with low model complexity and fast detection
speed, including the YOLO series, single shot multibox detector (SSD), and RetinaNet [11].
The squeeze-and-excitation network (SENet) represents an attention mechanism proposed
by Momenta. In the attention mechanism, SENet is not an independent model design but
is only an optimization of the model. Generally, SENet is used in combination with other
factors [12].
In the actual object detection task, objects of different sizes are mixed together and
interfere with each other, which affects the performance of small object detection. Therefore,
Hu et al. [13] proposed a pixel-level balance (PLB) method, which improved the accuracy of
small object detection. Afsharirad and Seyedin [14] introduced the salient object detection
method using task simulation (SOD-TS) based on saliency detection algorithms. SOD-TS
can detect the salient object, which is the best response to the current task. This method
has a wide range of applications. Du et al. [15] proposed a correlation complement (CC)
module that combines the class representation vector with the relationships between the
categories in the dataset. Yu et al. [16] proposed a multiobject subspace projection sample-
weighted CEM (MSPSW-CEM) algorithm to solve the problem of spectral variability, which
causes very serious false detection and missed detection. MSPSW-CEM showed a better
detection performance than other object detection methods. With respect to adaptively
obtaining optimal object anchor scales for the object detection of high spatial resolution
remote sensing images (HSRIs), Dong et al. [17] proposed a novel object detection method
for HSRIs based on CNNs with optimal object anchor scales. Zhan et al. [18] proposed a
novel task for visual relationship detection and significance detection as a complement
to object detection and predicate detection. Meanwhile, they proposed a novel multitask
compositional network (MCN) that simultaneously performed object detection, predicate
detection, and significance detection.
Wang et al. [19] proposed a multiscale block fusion object detection method for large-
scale HSRIs, aiming at how to achieve optimal block object detection for large-scale HSRIs.
This method is superior to other single-scale image block detection methods. With respect
to the multiangle features of object orientation in HSRIs object detection, Dong et al. [20]
presented a novel HSRI object detection method based on a CNN with adaptive object
orientation features; the proposed method can more accurately detect objects with large
aspect ratios and densely distributed objects. Hou et al. [21] proposed a Kullback–Leibler
single shot multibox detection (KSSD) object detection algorithm to detect small- and
medium-sized objects. This algorithm has higher accuracy and stability than existing
detection algorithms. Xi et al. [22] proposed an infrared small-object detection method to
improve the capacity for detecting thermal objects in complex scenarios. When compared
with the current advanced models, this framework has better performance in infrared small object detection under complex background conditions. Objects in aerial images have the characteristics of small volume, dense distribution, and uncertain directions, which increase the difficulty of detection. In order to solve this problem, Koyun et al. [23] proposed a two-stage object detection framework called “Focus-and-Detect”. Kim et al. [24] proposed a novel object detection framework for obtaining robust object detection under occlusion. The framework is composed of an object detection framework and a plug-in bounding box (BB) estimator. Using the Pascal VOC, MS COCO, and KITTI datasets, it was proved that this framework improves the performance of object detection.
3. Object Detection Algorithms Based on Deep Learning
3.1. Classification of Mainstream Algorithms in Object Detection
In the field of object detection, the methods of object detection using deep learning technology have been gradually accepted by the majority of researchers. At present, object detection is mainly divided into two-stage methods and one-stage methods. These two methods have their own advantages and effects for different detection problems. In general, the two-stage method has better detection accuracy, while the one-stage method has a faster detection speed [25,26].
Two-stage object detection algorithms pay more attention to the extraction of high-quality features from object images to obtain a higher detection accuracy. Their process is shown in Figure 1. Classic two-stage object detection algorithms include R-CNN, Fast R-CNN, etc. First, the candidate boxes are extracted, and then they are classified and regressed [27].
Figure 1. Flow chart of two-stage object detection algorithms.
One-stage object detection algorithms only need one regression operation on the input image to predict the category and location information of the object image, so they have a fast detection speed. The detection flow chart of this type of algorithm is shown in Figure 2. Two-stage object detection algorithms, such as R-CNN, have shortcomings, such as large training parameters and long training times, which are not suitable for tasks with high real-time requirements; classical one-stage object detection algorithms, such as SSD, EfficientNet, and the YOLO series, only use convolutional neural networks to process the whole picture once, greatly reducing training parameters and processing times [28].
Figure 2. Flow chart of one-stage object detection algorithms.

3.2. SSD Object Detection Algorithm
3.2.1. SSD Algorithm Model Structure

The SSD algorithm extracts feature information from images through the VGG-16 network, extracting object features through convolution layers at multiple scales and completing co-ordinate determination and object classification for the feature information through a prediction module. Its structure is shown in Figure 3. The SSD algorithm makes a series of improvements to the VGG-16 network. First, the FC6, FC7, and FC8 fully connected layers of VGG-16 are deleted, and 3 × 3 Conv6 and 1 × 1 Conv7 convolution layers are designed to replace them so as to ensure that the receptive field of the feature map increases while keeping the feature scale the same, with the feature information extracted from Conv7 used for prediction processing; then, the pooling kernel scale of the fifth pooling layer is changed from 2 × 2 to 3 × 3 and the step size is changed from 2 to 1 to ensure that the feature layer extracted from the convolution layer after the fifth pooling layer has a high resolution, meaning that the network model can detect small objects with higher accuracy; finally, through the convolution layer Conv4_3, the feature information for multiscale feature prediction is extracted, adding additional convolution layers to the model and selecting down-sampling processing with different step sizes to improve the ability of the model to extract image features [29].

The SSD algorithm innovatively designed multiscale feature layer prediction processing on the basis of the original pyramid network structure of the CNN. Through these feature layers, features at all levels of the detection object can be extracted. Moreover, the object feature details contained in the lower feature layers are comprehensive, which can effectively detect small objects in the image. High-level features have high dimensions and strong semantic information, which can effectively detect large objects.

Figure 3. SSD network structure.
3.2.2. Prior Frame

Because the receptive fields of the feature maps of the different layers are different on the corresponding original images, prior frames with the same aspect ratios but different scales will be generated on the different feature layers. If the feature map scale is m × n and each grid contains k prior frames, then the feature map generates m × n × k prior frames. In addition, each prior box needs to complete the prediction of category confidence and border positioning co-ordinates (x, y, w, and h). If c categories are detected, (c + 4) prediction values of the corresponding prior boxes will be generated. At this time, the number of output prior boxes is (c + 4) × m × n × k, as shown in Figure 4.
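To make the prior-frame bookkeeping above concrete, the sketch below counts default boxes per feature map and computes the layer scales with the scaling formula given in Equation (1) below. The six feature-map sizes, the number of boxes per grid cell, and the 21-class setting are commonly cited SSD300-style values and are assumptions here, not figures stated in the text.

```python
# Sketch of SSD prior-frame bookkeeping (assumed SSD300-style settings).
feature_maps = [38, 19, 10, 5, 3, 1]      # m x n grid sizes of the six prediction layers
boxes_per_cell = [4, 6, 6, 6, 4, 4]       # k prior frames per grid cell (assumed)
num_classes = 21                          # c (20 VOC classes + background, assumed)

# Equation (1): S_k = S_min + (S_max - S_min) / (m - 1) * (k - 1)
s_min, s_max, m = 0.2, 0.9, len(feature_maps)
scales = [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

total_boxes = sum(f * f * k for f, k in zip(feature_maps, boxes_per_cell))
total_outputs = sum((num_classes + 4) * f * f * k
                    for f, k in zip(feature_maps, boxes_per_cell))

print(scales)         # layer scales from Equation (1)
print(total_boxes)    # 8732 prior frames in total
print(total_outputs)  # (c + 4) x m x n x k prediction values
```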
The scaling formula of the prior frame in the corresponding feature map is

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in \{1, 2, \ldots, m\} \tag{1}$$

S_k is the scale of the kth feature map, S_max is the scale of the highest-level feature map, and 0.9 is taken as the set value under normal conditions; S_min is the scale of the lowest-layer feature map, and 0.2 is normally taken as the set value; m refers to the number of feature map layers used for prediction, and k refers to the kth feature map layer.

If the number of network layers is low, it corresponds to a large-scale feature map with low receptive field feature information, so it is necessary to preset a smaller-scale prior frame to detect small objects. On the contrary, if the number of network layers is high, it corresponds to a small-scale feature map with large receptive field feature information, so it is necessary to preset a larger-scale prior frame to detect large objects. A total of 8732 prior frames are generated by the SSD model [30].

3.2.3. Border Regression

As shown in Figure 5, the red box P refers to the region proposal generated by training, the green box G refers to the ground truth of the object, and the blue box Ĝ refers to the regression window that is close to the ground truth after the region proposal is fine-tuned, that is, the prediction box; the central co-ordinate value and width-height scale of each prediction box are defined as (x, y, w, and h). The IOU value is improved by optimizing the gap between the prior frame and the real window to achieve accurate object positioning and detection.

Figure 5. Schematic diagram of border regression.

Border regression therefore seeks a mapping f such that

$$f(P_x, P_y, P_w, P_h) = (\hat{G}_x, \hat{G}_y, \hat{G}_w, \hat{G}_h) \approx (G_x, G_y, G_w, G_h) \tag{2}$$

Frame regression learning is mainly completed through coding and decoding. The coding formula is as follows:

$$l_x = (G_x - P_x)/P_w, \quad l_y = (G_y - P_y)/P_h, \quad l_w = \log(G_w/P_w), \quad l_h = \log(G_h/P_h) \tag{3}$$
Among them, l_x, l_y, l_w, and l_h are the conversion values of the real box in the format of a prior box to facilitate the calculation of the loss function.

Decoding is the reverse process of encoding. It mainly restores a prior frame to the image in the form of translation transformation and scaling, that is, reversely deducing the real frame positioning information (G_x, G_y, G_w, G_h) through the output values (l_x, l_y, l_w, l_h). In order to improve the proximity between the prior frame and the real frame, the overshoot parameters (v_x, v_y, v_w, v_h) are used for fine tuning during decoding. The decoding formula is

$$G_x = P_w(v_x \cdot l_x) + P_x, \quad G_y = P_h(v_y \cdot l_y) + P_y, \quad G_w = P_w \cdot \exp(v_w \cdot l_w), \quad G_h = P_h \cdot \exp(v_h \cdot l_h) \tag{4}$$
The coding part normalizes the error between the prior frame and the real frame,
which is conducive to the loss value calculation in the SSD network, while the decoding
part deduces the position information of the real frame through the location information of
the prior frame and the offset, which is the main step of frame regression learning.
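A minimal NumPy sketch of the coding and decoding steps in Equations (3) and (4). The boxes are assumed to be center-format (x, y, w, h) arrays, and the fine-tuning parameters (v_x, v_y, v_w, v_h) are set to 1 purely for illustration.

```python
import numpy as np

def encode(gt, prior):
    """Equation (3): turn a ground-truth box into offsets w.r.t. a prior box."""
    gx, gy, gw, gh = gt
    px, py, pw, ph = prior
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def decode(offsets, prior, v=(1.0, 1.0, 1.0, 1.0)):
    """Equation (4): recover box co-ordinates from predicted offsets."""
    lx, ly, lw, lh = offsets
    px, py, pw, ph = prior
    vx, vy, vw, vh = v
    return np.array([pw * (vx * lx) + px,
                     ph * (vy * ly) + py,
                     pw * np.exp(vw * lw),
                     ph * np.exp(vh * lh)])

prior = np.array([0.5, 0.5, 0.2, 0.3])
gt = np.array([0.52, 0.48, 0.25, 0.28])
assert np.allclose(decode(encode(gt, prior), prior), gt)  # decoding inverts encoding
```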
$$\hat{c}_i^p = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)} \tag{6}$$

where c_i^p refers to the detection probability when the corresponding prior box of the pth category object is the ith; ĉ_i^p refers to the c_i^p detection probability after regression; ĉ_i^0 refers to the detection probability when the category object is the background and its corresponding prior frame is the ith; d_ij^p refers to the true or false value of the ith prior box matching the pth category object of the jth real box, d_ij^p ∈ {0, 1}; if d_ij^p = 1, this means that the object in the prior box is the category of the real box; Pose refers to the number of positive samples, and Nega refers to the number of negative samples.
The location loss function of the SSD model uses the Smooth L1 function to calculate the loss of the location offset information of the prior frame relative to the center co-ordinates and the width and height of the real frame. The expression is as follows:

$$L_{loc}(d, l, g) = \sum_{i \in Pose}^{N} \sum_{m \in \{x,y,w,h\}} d_{ij}^{k}\, \text{Smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right), \qquad \text{Smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{7}$$

where l_i^m refers to the location information of the ith prior box; ĝ_j^m refers to the position information of the jth real frame after it passes the coding phase.
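A short sketch of the Smooth L1 term in Equation (7), applied element-wise to the offset error between a predicted location and the encoded ground truth; the shapes and values are illustrative only.

```python
import numpy as np

def smooth_l1(x):
    """Equation (7): 0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

# Offset error between a predicted location l_i and the encoded real box g_j.
err = np.array([0.3, -1.7, 0.05, 2.0])
loc_loss = smooth_l1(err).sum()   # summed over the (x, y, w, h) offsets
print(loc_loss)
```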
stitute the basic unit of the Darknet-53 network structure, where n represents the number
of basic units used by this layer. The residual structure enables the upper layer of the
network to directly skip two or more layers to connect to the subsequent network, which can alleviate the gradient problem caused by network deepening. When assuming that the network layer behind the shallow network is an identity mapping layer, directly fitting the potential identity mapping function H(x) of a layer is difficult, so the residual structure does not directly learn the object mapping but learns a residual F(x) = H(x) − x, making the original mapping H(x) = F(x) + x, as shown in Figure 6; when F(x) = 0, identity mapping H(x) = x can be realized.

Figure 6. Residual structure.
As shown in Figure 6, the two convolutional layers use a 1 × 1 convolution kernel and a 3 × 3 convolution kernel, respectively, of which the 1 × 1 convolution kernel is mainly used for channel expansion and reduction. The residual network first uses the 1 × 1 convolution to shrink the channels and then uses the 3 × 3 convolution to restore them. Its essence is the idea of matrix decomposition, which is used to reduce the number of parameters. This helps the convolution network reduce the amount of computation to a certain extent and makes the convolution network run faster and more efficiently.
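The following PyTorch sketch mirrors the residual unit described above: a 1 × 1 convolution shrinks the channels, a 3 × 3 convolution restores them, and the input is added back so that the block learns F(x) = H(x) − x. Layer choices such as BatchNorm and LeakyReLU follow common Darknet-53 implementations and are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """1x1 channel reduction -> 3x3 restoration, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        hidden = channels // 2                      # channel shrink via the 1x1 conv
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.conv2 = nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                    # H(x) = F(x) + x

x = torch.randn(1, 64, 52, 52)
print(ResidualUnit(64)(x).shape)                    # torch.Size([1, 64, 52, 52])
```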
3.3.3. Loss Function of YOLOv3
For the YOLO series algorithms, the loss function is the core of the algorithm and plays a key role in optimization. There are 6 prediction parameters for YOLOv3 object detection: (x, y, w, h) are the co-ordinates of the upper left vertex of the object box and its width and height, class is the category of the detected object, and confidence is the confidence level of the detected object; the formula of the loss function is
$$\begin{aligned}
loss(object) = \; & \lambda_{coord} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)\left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
& + \lambda_{coord} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)\left[(w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2\right] \\
& - \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{c}_i \log(c_i) + (1 - \hat{c}_i)\log(1 - c_i)\right] \\
& - \lambda_{noobj} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_i \log(C_i) + (1 - \hat{C}_i)\log(1 - C_i)\right] \\
& - \sum_{i=0}^{K \times K} I_{ij}^{obj} \sum_{c \in classes}\left[\hat{p}_i(c)\log(p_i(c)) + (1 - \hat{p}_i(c))\log(1 - p_i(c))\right]
\end{aligned} \tag{8}$$
The loss function consists of three parts: the first part is the error of the upper-left vertex co-ordinates of the object frame and the frame width and height. The BCE (binary cross entropy) loss function is used for the upper-left vertex co-ordinate error, and the MSE (mean square error) loss function is used for the frame width and height error; confidence error and category error are represented by the obj part and the class part of the formula, and the BCE error function is used. The loss function is an important evaluation index for network training results, and the model detection ability can be improved through the loss function optimization algorithm.
Figure 7. Structure of the SENet module.
In the compression section, the dimension of the input element feature map is H × W × C, where H, W, and C represent height, width, and number of channels, respectively. The function of the compression part is to compress the dimension from H × W × C to 1 × 1 × C; namely, H × W is compressed to 1 × 1. In the activation part, the 1 × 1 × C vector is fed into fully connected layers, which predict the importance of each channel and then excite the corresponding channel of the preceding feature map. A simple gating mechanism and the Sigmoid activation function are adopted.
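A compact PyTorch sketch of the squeeze-and-excitation operations just described: global average pooling compresses H × W × C to 1 × 1 × C, two fully connected layers with a Sigmoid gate predict per-channel importance, and the input feature map is rescaled channel-wise. The reduction ratio of 16 is a common default and an assumption here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)
        w = self.excite(w).view(b, c, 1, 1)
        return x * w                                    # reweight each channel

x = torch.randn(2, 256, 20, 20)
print(SEBlock(256)(x).shape)                            # torch.Size([2, 256, 20, 20])
```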
4.1.2. Spatial Attention Mechanism
The representative model of the spatial attention mechanism is the spatial transformer
network (STN), which can transform various deformation data in space and automatically
capture important regional features. It can ensure that the image can still obtain the same
results as the original image after clipping, translation, or rotation. The STN network in-
cludes a local network, parametric network sampling (network generator), and differential
image sampling.
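As a concrete illustration of the differentiable sampling used by STN-style modules, the PyTorch sketch below warps a feature map with an affine transformation; the specific 2 × 3 matrix is arbitrary and only for demonstration.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)                       # input feature map
theta = torch.tensor([[[1.0, 0.0, 0.1],             # 2 x 3 affine matrix: small translation
                       [0.0, 1.0, -0.1]]])
grid = F.affine_grid(theta, x.shape, align_corners=False)   # sampling grid from theta
warped = F.grid_sample(x, grid, align_corners=False)        # differentiable image sampling
print(warped.shape)                                  # torch.Size([1, 3, 32, 32])
```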
Figure 8. Model structure of CBAM.
4.1.4. Channel Attention Module CAM

The input of CAM is a feature map, and the dimension is set as H × W × C, where H is the height of the feature map, W is the width, and C is the number of channels. The processing is as follows: first, global max pooling and global average pooling are applied to the input feature map (pooling the spatial dimensions compresses them and facilitates the later learning of channel characteristics). Then, the two pooled results are sent to a shared multilayer perceptron (MLP) for learning (based on the channel dimension, the MLP learns the importance of each channel). Finally, the two MLP outputs are added element-wise, and the final channel attention value is obtained through the mapping of the Sigmoid function. Figure 9 shows the channel attention module.
Figure 9. Channel attention module.
The calculation formula of channel attention is as follows:

$$M_c(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) = \sigma\big(W_1(W_0(F_{avg}^c)) + W_1(W_0(F_{max}^c))\big) \tag{9}$$

In the above formula, M_c is the channel attention, MLP is the shared multilayer perceptron, F is the input feature, AvgPool is average pooling, and MaxPool is maximum pooling.
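A minimal PyTorch sketch of Equation (9): max-pooled and average-pooled channel descriptors pass through a shared MLP (W_1, W_0), the two outputs are added, and a Sigmoid produces the channel attention map M_c. The reduction ratio is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared_mlp = nn.Sequential(               # W0 then W1, shared by both branches
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.shared_mlp(x.mean(dim=(2, 3)))      # MLP(AvgPool(F))
        mx = self.shared_mlp(x.amax(dim=(2, 3)))       # MLP(MaxPool(F))
        mc = self.sigmoid(avg + mx).view(b, c, 1, 1)   # Equation (9)
        return x * mc

x = torch.randn(2, 128, 40, 40)
print(ChannelAttention(128)(x).shape)                  # torch.Size([2, 128, 40, 40])
```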
The calculation formula of spatial attention is as follows:

$$M_s(F) = \sigma\big(f^{7\times7}([AvgPool(F); MaxPool(F)])\big) = \sigma\big(f^{7\times7}([F_{avg}^s; F_{max}^s])\big) \tag{10}$$

In the above formula, M_s is the spatial attention, F is the input feature, AvgPool is average pooling, MaxPool is maximum pooling, and f^{7×7} denotes a convolutional layer with a 7 × 7 kernel.

4.2. The Construction of MAO-YOLOv5 Model Feature Extraction Structure

YOLOv5 is an improvement on the YOLO series algorithms, and the detection principle of the YOLO series algorithms is similar. First, take the whole input image as the input of the network and divide it into multiple N × N grids of the same size; each grid can predict B bounding boxes, with a total of N × N × B candidates, where each of the boxes contains five variables (pc, bx, by, bh, bw). The original YOLO candidate box has
serious defects, and its width and height are completely unrestricted, which easily leads to
gradients that are out of control and unstable. YOLOv5 fixes this error and ensures that
4.2. Construction of MAO-YOLOv5 Model Feature Extraction Structure
the center point remains unchanged. Therefore, the current equation for YOLOv5 limits
the YOLOv5
multiplesisofan improvement
the anchor pointtofrom the YOLO seriesof
a minimum algorithms,
0 to a maximumwhere the of 4,detection prin-
and the anchor
ciple
frame object matching is also updated based on the width and height multiples. Setthe
of the YOLO series algorithms is similar. First, take the whole input image as the
input of the network
corresponding and divide
confidence it into
threshold multiple
through N × N gridssuppression
non-maximum of the same(NMS) size; each
and grid
select
can
thepredict
anchorBbox bounding BOXs, withconfidence
of the maximum a total of Nto×obtain
N × B candidates,
the prediction where box.each of the boxes
contains five variables (pc, bx, by, bh, bw). The original
YOLOv5 has two CSP architectures. One is with X residual component YOLO candidate box has (Resunit)
serious
defects, and its width and height are completely unrestricted, which
modules, and the other is to replace Resunit with two CBL modules. Resunit is composed easily leads to gradi-of
ents that are out of control and unstable. YOLOv5 fixes this error
two CBL convolution modules+residual networks, which are mainly used in the backbone and ensures that the
center pointThe
network. remains
backbone unchanged.
networkTherefore, the currentofequation
is mainly composed for YOLOv5
CSPdarknet+SPP. limits the
Backbone is a
multiples of the anchor
deeper network system.point from aadding
Therefore, minimumResunitof 0can
to aimprove
maximum the of 4, andvalue
gradient the anchor
during
frame object
reverse matching between
transmission is also updated
the layers,based
thusonpreventing
the width and height multiples.
the gradient generated Set
bythe
the
corresponding
increase fromconfidence threshold through
gradually disappearing so asnon-maximum
to obtain moresuppression
fine-grained(NMS) and se-
characteristics
lect the anchor
without box of
worrying the maximum
about confidence to obtain the prediction box.
network fading.
YOLOv5
The networkhas two CSP architectures.
structure of MAO-YOLOv5 One isisdivided
with X into
residual component
four parts: input,(Resunit)
backbone,
neck module,
modules, and theand head;
other it mainly
is to replaceadvances
Resunit withthe feature
two CBL extraction
modules.operation
Resunit isand needs to
composed
ofadjust the step
two CBL size of the
convolution convolution core networks,
modules+residual of the last convolution
which are mainly structureusedin in
thethe
Backbone
back-
structure
bone to 1 so
network. Theasbackbone
to achievenetwork
the fusion operation
is mainly of the features
composed with the same Backbone
of CSPdarknet+SPP. scale as the
isneck layer.network
a deeper In the original
system.YOLOv5
Therefore,structure,
addingthe convolution
Resunit structure,
can improve the encapsulated
gradient valueby
2D convolution, the BN layer, and the SiLU activation function,
during reverse transmission between the layers, thus preventing the gradient is used for down-sampling
generated
operations, and the sliding operation step of its convolution core
by the increase from gradually disappearing so as to obtain more fine-grained character- is 2.
After MAO-YOLOv5
istics without worrying about conducts × 1 convolution and 3 × 3 convolution operations,
network1 fading.
a new SE module is added. The SE module first pools the input feature map globally,
and then, through a two-layer full connection structure, the correlation between complex
channels can be established. Through weight normalization and channel weighting, the
channels with high-weight ratios will receive more attention so as to achieve the goal of
improving channel attention. The feature map output from the bottle-neck layer is further
input into the neck of the network. The neck is constructed as a pyramid network structure.
The function is to divide the detector head into three different sizes, namely, large, medium,
and small while ensuring that the underlying information is not lost so that the network can
have good detection results for targets of different sizes. The neck contains multiple CSP
bottleneck layers, and each CSP bottleneck layer contains several bottleneck layers added
with SE modules. Therefore, the CSP bottleneck layer enriches the gradient combination
of the architecture and improves the speed and accuracy of reasoning while reducing the
amount of network computation and computing costs.
Due to the sliding window mechanism of the MAO-YOLOv5 network, a target may
generate multiple detection frames. In order to make the detection results more accurate,
non-maximum suppression (NMS) can be applied to find the detection frame with the
maximum probability, and then a judgment can be made whether the intersection ratio
of other detection frames and the detection frame is greater than the set threshold. If it is
greater than the threshold, remove the detection box. If it is less than the threshold value,
the detection frame will be retained, and it will be merged with the original detection frame,
and finally, the rectangle processing will be performed.
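A plain NumPy sketch of the non-maximum suppression step described above: keep the highest-confidence box, drop any remaining box whose intersection ratio with it exceeds the threshold, and repeat. The 0.5 threshold and the corner-format boxes are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against many; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        overlaps = iou(boxes[i], boxes[rest])
        order = rest[overlaps <= thresh]      # discard boxes overlapping too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                     # keeps the first and third boxes
```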
MAO-YOLOv5 advances the feature extraction operation and the corresponding
feature scale extracted is twice the feature scale extracted from the YOLOv5 Backbone
structure. It cannot be fused with the corresponding layer of the FPN feature extraction
of the neck layer. The convolution kernel step of the last convolution structure in the
Backbone structure needs to be modified to 1 to successfully achieve the operation of
feature extraction in advance. The modules of the MAO-YOLOv5 network structure are
shown in Figure 11 below.
As can be seen from Figure 11, MAO-YOLOv5 mainly changes the feature extraction
structure of the Backbone part (cf. Figure 11a) of the YOLOv5 benchmark network. The
original network feature extraction operation is carried out in advance. The object features
are extracted from the first C3 module of the Backbone structure and horizontally integrated
into the feature layer of the same scale at the neck layer (cf. Figure 11b); After the SPPF
structure (cf. Figure 11c), the SENet attention mechanism is introduced to reconstruct the
feature weight of the detected object and background information; we added a context
feature fusion structure at the head end of YOLOv5 network. This structure fuses the three
feature maps used to predict the object at the head end, uses the transposed convolution to
transform the width and height scales of the deepest and subdeep features, and sets the
channel scale of both to half of the shallow feature channel scale as the context information
and shallow feature splicing for feature fusion.
where o_ij ∈ [0, 1], indicating whether the j-type object exists in prediction bounding box i, c_ij is the predicted value, ĉ_ij is the prediction confidence of c_ij obtained by the Sigmoid function, and N_pos is the number of positive samples.
Binary cross entropy is used for the confidence loss:

$$L_{conf}(o, c) = -\frac{\sum_i \big(o_i \ln(\hat{c}_i) + (1 - o_i)\ln(1 - \hat{c}_i)\big)}{N}, \qquad \hat{c}_i = \text{sigmoid}(c_i) \tag{12}$$
where o_i ∈ [0, 1], indicating the IoU of the predicted object bounding box and the real object bounding box, c_i is the predicted value, ĉ_i is the prediction confidence obtained by the Sigmoid function, and N is the number of positive and negative samples; the category loss function also uses binary cross entropy loss.
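A small sketch of the confidence loss in Equation (12), assuming o_i holds the IoU-based targets and c_i the raw confidence logits; names and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_loss(o, c):
    """Equation (12): averaged binary cross entropy over positive and negative samples."""
    c_hat = sigmoid(c)
    return -np.mean(o * np.log(c_hat) + (1 - o) * np.log(1 - c_hat))

o = np.array([0.9, 0.0, 0.7, 0.0])   # IoU of predicted vs. real box (0 for negatives)
c = np.array([2.1, -1.5, 0.3, -2.0]) # predicted confidence logits
print(confidence_loss(o, c))
```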
Figure 11. MAO-YOLOv5 network structure. (a) Backbone module; (b) Neck module; (c) Focus and SPP module.
As can be seen
5. Experimental from and
Process Figure 11, MAO-YOLOv5
Result Analysis mainly changes the feature extraction
structure
Two of the Backbone
datasets, VOC2007 part (cf.VOC2012,
and Figure 11a) of the
were YOLOv5
utilized benchmark
to compare network. The
the advantages of
original
MAO-YOLOv5 networkcompared
feature extraction
to YOLOv3, operation
YOLOv5, is carried
and SSD. outFurthermore,
in advance. Thesomeobject
of thefeatures
typical
are extracted
pictures of thefrom the detection
object first C3 module
results of
arethe Backbone
listed structure
to visualize theand
highhorizontally
performance inte-
of
grated into
MAO-YOLOv5. the feature layer of the same scale at the neck layer (cf. Figure 11b); After the
SPPF structure (cf. Figure 11c), the SENet attention mechanism is introduced to recon-
5.1. Experiment
struct the feature Configuration
weight of the anddetected
Dataset object and background information; we added a
context
Thefeature
hardwarefusion structure
equipment of at
thethe head endwas
experiment of YOLOv5
an AMD network. This structure
Ryzen 7 5800H [email protected] fuses
the three GeForce
NVIDIA feature maps
GTX1650used GPU,
to predict
with the
4G object at the head
GPU memory. end, uses
PASCAL VOC theistransposed
a popular
convolution to transform
universal detection datasetthe(http://host.robots.ox.ac.uk/pascal/VOC/)
width and height scales of the deepest and subdeep (accessedfeatures,
on 21
and sets the channel scale of both to half of the shallow feature channel
October 2022), so this chapter conducts experimental training using the VOC dataset. The scale as the context
information
VOC datasetand shallow
includes feature splicing
detection, for feature
segmentation, human fusion.
body layout, action classification
(Object Classification, Object Detection, Object Segmentation, Human Layout, Action
4.3. Modified Loss
Classification), Function
etc. VOC2007 contains 9963 labeled images, which are composed of three
parts:Intrain/val/test, and 24,640
the object detection task ofobjects
YOLOv5, werein marked. VOC the
order to make 2012predicted
containsvalue
20 types of
of the
objects, 11,530 images in train and val, 27,450 target detection tags, and 6929
model closer to the real value, even if the prediction box is closer to the real box, three losssegmentation
tags. During
functions training, VOC2007
are introduced and VOC2012
for optimization. One isare often put together
classification loss, thefor joint
other training to
is confidence
increase
loss, andthethenumber
final oneofissamples
regressionso that
loss,the
thatmodel can learnbox
is, boundary more features. loss. GIou loss
positioning
is used as the loss of the bounding box. The probability of this class and the loss to the
5.2. Evaluating Indicators
object value can be calculated by using binary cross entropy and the logits loss function.
In this subsection,
The category thecross
loss is binary experiment
entropy uses
loss. precision,
The formulamAP
is as(mean average precision),
follows:
frames per second (FPS), and P-R (precision recall) curves to evaluate the performance
of the four object detection algorithms. Suppose that TP (True Positive) means that the
positive samples are correctly classified into positive samples, FP (False Positive) means
that negative samples are wrongly classified into negative samples, FN (False Negative)
means that positive samples are wrongly classified into negative samples, and TN (True
Electronics 2023, 12, 1515 16 of 21
Negative) means that negative samples are correctly classified into negative samples. The
indicators are calculated as follows:
(1) P-R curve
The P-R curve is a curve made with precision as the ordinate and the recall as the
abscissa. The precision and recall are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{13}$$
The P-R curve can judge the performance of the model, and a classifier with good performance can ensure that the Precision value remains high as the Recall value increases; however, a classifier with poor performance may lose more Precision in order to improve the Recall value. In addition, the P-R curve of a classifier with good performance has a larger area under the curve.
(2) mAP
mAP stands for the Average Precision (AP) of all categories. The AP value is obtained by calculating the area under the P-R curve. Assuming that AP_i represents the average recognition accuracy of the ith category, the calculation of mAP is as follows:

$$mAP = \frac{\sum_{i=1}^{n} AP_i}{n} \tag{14}$$
n represents the total number of categories tested.
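A short sketch of how precision-recall points can be turned into per-category AP and then mAP as in Equations (13) and (14); the trapezoidal integration of the P-R curve is one common choice and an assumption here, as are the example values.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the P-R curve for one category (trapezoidal rule, assumed)."""
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    order = np.argsort(r)
    p, r = p[order], r[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(per_class_pr):
    """Equation (14): mAP is the mean of the per-category AP values."""
    aps = [average_precision(p, r) for p, r in per_class_pr]
    return sum(aps) / len(aps)

# Illustrative P-R points for two categories.
per_class_pr = [
    ([1.0, 0.9, 0.75, 0.6], [0.1, 0.4, 0.7, 1.0]),
    ([1.0, 0.8, 0.5], [0.2, 0.5, 0.9]),
]
print(mean_average_precision(per_class_pr))
```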
Table 1. mAP and FPS of different models on the PASCAL VOC dataset.
Figure 12. P-R curves of MAO-YOLOv5 in different categories.
MAO-YOLOv5's improvements compared with other object detection algorithms mainly include the following: picture data are enhanced using Mosaic technology; adaptive image compression technology can scale images of different scales to a fixed scale, which is convenient for network training; the Backbone network adds a focus structure and a feature extraction network (CSP) structure; the neck adds an FPN+PAN module for network feature fusion; and GIOU is used in the output (head), with GIOU_Loss as the loss function of the bounding box. GIOU is an improvement of IOU. IOU represents the intersection and union ratio between the real box A and the prediction box B. The expression is

$$IOU = \frac{A \cap B}{A \cup B} \tag{15}$$
However, there are two problems with IOU. If the loss is 0, the model can be updated
by training, and the parameters can be optimized. The degree of overlap between the two
cannot be accurately reflected. In order to solve the problem of gradient disappearance
Electronics 2023, 12, 1515 18 of 21
Electronics 2023, 12, x FOR PEER REVIEW 19 of 22
However, there are two problems with IOU. If the two boxes do not overlap, the loss is 0, so the model cannot be updated by training and the parameters cannot be optimized; in addition, the degree of overlap between the two boxes cannot be accurately reflected. In order to solve the problem of gradient disappearance without overlap, GIOU adds a penalty term on the basis of IOU, which can better reflect the closeness and coincidence of two boxes than IOU. The GIOU expression is as follows:

GIOU = IOU − |C \ (A ∪ B)| / |C|    (16)

where C denotes the minimum enclosing box of A and B. DIOU uses the ratio of the square of the distance between the center points of the real box and the prediction box to the square of the diagonal length of the minimum enclosing box as part of the measurement standard. The calculation method and loss of DIOU are as follows:

DIOU(B, B^gt) = IOU(B, B^gt) − ρ²(B, B^gt) / C²
L_DIOU(B, B^gt) = 1 − DIOU(B, B^gt)    (17)
DIOU solves the problem that IOU cannot accurately reflect the coincidence between two frames, making the center point of the prediction frame close to the center point of the real frame. At the same time, DIOU can converge faster than GIOU.
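To make Equations (16) and (17) concrete, the sketch below evaluates GIOU and DIOU for a pair of axis-aligned boxes under the standard definitions (C is the minimum enclosing box of A and B, ρ the distance between the box centers); it is an illustrative example, not the training code of MAO-YOLOv5:

def giou_diou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Minimum enclosing box C of A and B
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # Equation (16): GIOU = IOU - |C \ (A ∪ B)| / |C|
    giou = iou - (c_area - union) / c_area

    # Equation (17): DIOU = IOU - ρ²(B, B^gt) / C², C² = squared diagonal of the enclosing box
    center_a = ((box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2)
    center_b = ((box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2)
    rho2 = (center_a[0] - center_b[0]) ** 2 + (center_a[1] - center_b[1]) ** 2
    diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - rho2 / diag2
    return giou, diou

giou, diou = giou_diou((0, 0, 2, 2), (3, 3, 5, 5))   # non-overlapping boxes
print(giou, 1 - diou)   # GIOU = -0.68, DIOU loss L = 1 - DIOU = 1.36

In the non-overlapping case, IOU is 0 and gives no useful signal, whereas GIOU and DIOU still vary with the boxes' relative positions, which is the behavior the text above describes.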
Figure 13 shows the detection effect of the three object detection algorithms in the same picture.
Figure 14. Detection effect of MAO-YOLOv5 in different environments.
The results show that MAO-YOLOv5 can detect multiscale objects and is less affected by the background, correctly matching the object with its corresponding category, which can meet the effective object detection requirements. MAO-YOLOv5 can effectively improve the detection accuracy of complex multiobjects and small objects. This structure carries out the initial feature extraction operation of YOLOv5 in advance to obtain more accurate location information of complex multiobjects and small objects.
As expected, the FPS of MAO-YOLOv5 (23.07) is lower than that of YOLOv5 (34.69) owing to the integration of the attention mechanism module, but it is still higher than that of the other methods and has certain benefits. More research will be undertaken to understand how to increase the FPS in the future.
6. Conclusions and Future Work
The rise of deep learning has promoted the rapid development of computer vision. Although current object detection algorithms based on deep learning have solved many practical problems, their accuracy and speed can still be improved by optimizing the current models. This paper first analyzes the object detection algorithms SSD, YOLOv3, and YOLOv5 based on deep neural networks and focuses on the network structure, loss function, and anchor frame of the YOLOv5 model with good performance. On the basis of the above research, an MAO-YOLOv5 model based on the attention mechanism and context feature fusion is proposed. This model adds the SENet attention mechanism to the Backbone that optimizes the YOLOv5 structure and, at the same time, adds a context feature-fusion structure. The deep semantic information is fused as the background information of shallow object features to solve the problem of the insufficient extraction of object semantic information. In the comparative experiment, this paper combines the PASCAL VOC 2007 and PASCAL VOC 2012 datasets as the entire dataset of the experiment, using them to train different deep neural network models. The experimental results show that the recognition accuracy of the proposed MAO-YOLOv5 model is better than that of the original YOLOv5 model, and its recognition accuracy is also better than that of other mainstream object detection algorithms.
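As a minimal sketch of the SENet-style channel attention mentioned above (a generic squeeze-and-excitation block with an assumed reduction ratio of 16, not the exact module integrated into the MAO-YOLOv5 Backbone), the mechanism can be written in PyTorch as follows:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                        # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # scale each channel by its learned weight

# Example: apply SE to a feature map with 64 channels
feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])

Such a block would reweight the channels of Backbone feature maps so that informative channels are enhanced and redundant ones are suppressed, which is the channel-attention behavior described for MAO-YOLOv5.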
Image object detection should also consider the influence of object scale, differences in light brightness, multiobject overlap, and other factors, so the performance of deep neural networks should be further improved in follow-up work. In addition, the dataset used in this paper has shortcomings, such as an imbalance in the number of objects across categories, which may affect detection accuracy. We will also conduct further research into newer YOLO versions, as the YOLOv6, YOLOv7, and YOLOv8 models are now available. These are the areas that need to be further improved in future research work.
Author Contributions: Conceptualization, G.S. and J.X.; writing—original draft preparation, G.S.; writing—review and editing, S.W. and J.X.; project administration, J.X. All authors have read and agreed to the published version of the manuscript.
Funding: This paper was funded by the High-level Talents Funding Project of Hebei Province (Grant No. A202105006) and the Hebei Provincial Higher Education Science and Technology Research Key Project (Grant No. ZD2021317).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data (PASCAL VOC 2007 and 2012) supporting this paper are available at http://host.robots.ox.ac.uk/pascal/VOC/ (accessed on 21 October 2022).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wu, Y.; Zhang, H.; Li, Y.; Yang, Y.; Yuan, D. Video Object Detection Guided by Object Blur Evaluation. IEEE Access 2020, 8,
208554–208565. [CrossRef]
2. Zhang, Q.; Wan, C.; Han, W.; Bian, S. Towards a fast and accurate road object detection algorithm based on convolutional neural
networks. J. Electron. Imaging 2018, 27, 053005. [CrossRef]
3. Kaur, J.; Singh, W. Tools, techniques, datasets and application areas for object detection in an image: A review. Multimed. Tools
Appl. 2022, 81, 38297–38351. [CrossRef]
4. Zhang, Z.; Lu, X.; Liu, F. ViT-YOLO: Transformer-based YOLO for object detection. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2799–2808.
5. Silva, L.P.E.; Batista, J.C.; Bellon, O.R.P.; Silva, L. YOLO-FD: YOLO for face detection. In Proceedings of the 24th Iberoamerican Congress on Pattern Recognition (CIARP), Havana, Cuba, 28–31 October 2019; Volume 11896, pp. 209–218.
6. Yan, B.; Li, J.; Yang, Z.; Zhang, X.; Hao, X. AIE-YOLO: Auxiliary Information Enhanced YOLO for Small Object Detection. Sensors
2022, 22, 8221. [CrossRef] [PubMed]
7. Ye, J.; Yuan, Z.; Qian, C.; Li, X. CAA-YOLO: Combined-Attention-Augmented YOLO for Infrared Ocean Ships Detection. Sensors
2022, 22, 3782. [CrossRef]
8. Wang, K.; Liu, M. YOLO-Anti: YOLO-based counterattack model for unseen congested object detection. Pattern Recognit. 2022,
131, 108814. [CrossRef]
9. Xu, P. Progress of Object detection: Methods and future directions. In Proceedings of the 2nd IYSF Academic Symposium on
Artificial Intelligence and Computer Engineering, Xi’an, China, 8–10 October 2021; Volume 12079.
10. Murthy, C.B.; Hashmi, M.F.; Bokde, N.D.; Geem, Z.W. Investigations of Object Detection in Images/Videos Using Various Deep
Learning Techniques and Embedded Platforms—A Comprehensive Review. Appl. Sci. 2020, 10, 3280. [CrossRef]
11. Ma, D.W.; Wu, X.J.; Yang, H. Efficient Small Object Detection with an Improved Region Proposal Networks. In Proceedings of the
5th International Conference on Electrical Engineering, Control and Robotics (EECR), Guangzhou, China, 12–14 January 2019;
Volume 533, p. 012062. [CrossRef]
12. Fang, F.; Li, L.; Zhu, H.; Lim, J.-H. Combining Faster R-CNN and Model-Driven Clustering for Elongated Object Detection. IEEE
Trans. Image Process. 2019, 29, 2052–2065. [CrossRef]
13. Hu, B.; Liu, Y.; Chu, P.; Tong, M.; Kong, Q. Small Object Detection via Pixel Level Balancing With Applications to Blood Cell
Detection. Front. Physiol. 2022, 13, 911297. [CrossRef]
14. Afsharirad, H.; Seyedin, S.A. Salient object detection using the phase information and object model. Multimed. Tools Appl. 2019,
78, 19061–19080. [CrossRef]
15. Du, L.; Sun, X.; Dong, J. One-Stage Object Detection with Graph Convolutional Networks. In Proceedings of the 12th International
Conference on Graphics and Image Processing (ICGIP), Xi’an, China, 13–15 November 2020; Volume 11720.
16. Yu, L.; Lan, J.; Zeng, Y.; Zou, J.; Niu, B. One hyperspectral object detection algorithm for solving spectral variability problems of
the same object in different conditions. J. Appl. Remote Sens. 2019, 13, 026514. [CrossRef]
17. Dong, Z.; Liu, Y.; Feng, Y.; Wang, Y.; Xu, W.; Chen, Y.; Tang, Q. Object Detection Method for High Resolution Remote Sensing
Imagery Based on Convolutional Neural Networks with Optimal Object Anchor Scales. Int. J. Remote Sens. 2022, 43, 2677–2698.
[CrossRef]
18. Zhan, Y.; Yu, J.; Yu, T.; Tao, D. Multi-task Compositional Network for Visual Relationship Detection. Int. J. Comput. Vis. 2020, 128,
2146–2165. [CrossRef]
19. Wang, Y.; Dong, Z.; Zhu, Y. Multiscale Block Fusion Object Detection Method for Large-Scale High-Resolution Remote Sensing
Imagery. IEEE Access 2019, 7, 99530–99539. [CrossRef]
20. Dong, Z.; Wang, M.; Wang, Y.; Liu, Y.; Feng, Y.; Xu, W. Multi-Oriented Object Detection in High-Resolution Remote Sensing
Imagery Based on Convolutional Neural Networks with Adaptive Object Orientation Features. Remote Sens. 2022, 14, 950.
[CrossRef]
21. Hou, Q.; Xing, J. KSSD: Single-stage multi-object detection algorithm with higher accuracy. IET Image Process. 2020, 14, 3651–3661.
[CrossRef]
22. Xi, X.; Wang, J.; Li, F.; Li, D. IRSDet: Infrared Small-Object Detection Network Based on Sparse-Skip Connection and Guide Maps.
Electronics 2022, 11, 2154. [CrossRef]
23. Koyun, O.C.; Keser, R.K.; Akkaya, I.B.; Töreyin, B.U. Focus-and-Detect: A small object detection framework for aerial images.
Signal Process. Image Commun. 2022, 104, 116675. [CrossRef]
24. Kim, J.U.; Kwon, J.; Kim, H.G.; Ro, Y.M. BBC Net: Bounding-Box Critic Network for Occlusion-Robust Object Detection. IEEE
Trans. Circuits Syst. Video Technol. 2019, 30, 1037–1050. [CrossRef]
25. Lee, D.-H. CNN-based single object detection and tracking in videos and its application to drone detection. Multimed. Tools Appl.
2020, 80, 34237–34248. [CrossRef]
26. Wu, T.; Liu, Z.; Zhou, X.; Li, K. Spatiotemporal salient object detection by integrating with objectness. Multimed. Tools Appl. 2017,
77, 19481–19498. [CrossRef]
27. Wang, C.; Yu, C.; Song, M.; Wang, Y. Salient Object Detection Method Based on Multiple Semantic Features. In Proceedings of the
9th International Conference on Graphic and Image Processing (ICGIP), Ocean Univ China, Acad Exchange Ctr, Qingdao, China,
14–16 October 2017; Volume 10615.
28. Kang, S. Research on Intelligent Video Detection of Small Objects Based on Deep Learning Intelligent Algorithm. Comput. Intell.
Neurosci. 2022, 2022, 3843155. [CrossRef] [PubMed]
29. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020,
97, 103910. [CrossRef]
30. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.