Article
Edge-YOLO: Lightweight Infrared Object Detection Method
Deployed on Edge Devices
Junqing Li and Jiongyao Ye *
School of Information Science and Engineering, East China University of Science and Technology,
Shanghai 200237, China; [email protected]
* Correspondence: [email protected]
Abstract: Existing target detection algorithms for infrared road scenes are often computationally
intensive and require large models, which makes them unsuitable for deployment on edge devices.
In this paper, we propose a lightweight infrared target detection method, called Edge-YOLO, to
address these challenges. Our approach replaces the backbone network of the YOLOv5m model
with a lightweight ShuffleBlock and a strip depthwise convolutional attention module. We also
applied CAU-Lite as the up-sampling operator and EX-IoU as the bounding box loss function. Our
experiments demonstrate that, compared with YOLOv5m, Edge-YOLO is 70.3% less computationally
intensive, 71.6% smaller in model size, and 44.4% faster in detection speed, while maintaining the
same level of detection accuracy. As a result, our method is better suited for deployment on embedded
platforms, making effective infrared target detection in real-world scenarios possible.
Keywords: infrared object detection; lightweight network; convolutional attention; YOLOv5; RK3588
1. Introduction
Visible light images are commonly used in target detection due to their high resolution,
definition, and detailed visual information that is easily interpreted by the human eye.
However, these images are sensitive to external factors such as weather and lighting con-
ditions, which can reduce image quality and negatively impact target detection accuracy.
This is where infrared imaging technology plays a crucial role. By overcoming these limitations, infrared imaging allows for image acquisition under diverse lighting and weather conditions, including foggy days and nighttime scenarios. As a result, this technology has been widely adopted in various fields, such as autonomous driving, security monitoring, and remote sensing, and offers a broader range of use cases than visible images. Therefore, the importance of infrared imaging technology in enabling efficient and reliable target detection cannot be overstated.
Traditional infrared target detection techniques [1–3] are mainly model-based methods, such as template matching, threshold segmentation, and the Hausdorff metric. However, with the development of deep learning, target detection techniques based on convolutional neural networks have emerged in recent years. These methods are primarily divided into two-stage algorithms (e.g., Faster-RCNN [4]) and single-stage algorithms (e.g., SSD [5] and YOLO [6]). The single-stage algorithm is designed to achieve a balance between detection speed and accuracy, resulting in a significant improvement in detection speed while maintaining accuracy compared with the two-stage algorithm. Consequently, the YOLO series has become a widely used representative of the single-stage algorithm. Among the mainstream YOLO algorithms, YOLOv5 (version 6.2) has achieved significant improvement in both detection accuracy and speed compared with its predecessor by using Mosaic data enhancement, C3 modules, and an improved SPPF module. Although YOLOv5 performs well on visible images, it encounters several challenges when applied to infrared detection. The first major issue is that infrared images suffer from poor contrast, high noise, and blurred imaging, leading to the loss of crucial target features during deep convolutional network
processing. Moreover, since infrared images lack color information, and the difference
between target and background features is minimal, it becomes challenging for deep convo-
lutional neural networks to distinguish useful information from irrelevant data, reducing
detection accuracy. Another significant challenge is that embedded devices commonly
used in autonomous driving and security monitoring fields have limited computing power,
storage space, and power consumption, which makes deploying large target detection
models such as YOLOv5 difficult. Additionally, these fields require real-time detection,
and collecting data on edge devices and sending them to the server for detection and
analysis can lead to network latency and communication congestion problems in widely
distributed areas.
The challenges mentioned above necessitate lightweight infrared target detection on
edge devices. In summary, this paper proposes a solution to these issues by introducing
Edge-YOLO, an infrared target detection algorithm that utilizes lightweight networks and
attention mechanisms specifically designed for edge devices. The primary enhancements
of this algorithm can be summarized as follows:
(1) The bounding box loss function was redesigned: a loss function with a power hyperparameter α was introduced to accelerate the convergence of the loss function and resolve the uncertainty of the aspect ratio in CIoU;
(2) A lightweight content-aware up-sampling operator was adopted, which obtains a larger perceptual field than the original nearest-neighbor up-sampling method while introducing only a small number of parameters and little computational cost;
(3) The feature extraction network was reconstructed based on the improved Shuf-
fleNetv2, which enhances the extraction ability of strip features in IR scenes and
the perception ability of salient features in IR images by embedding a newly designed
strip depthwise convolutional attention module in ShuffleBlock, while significantly reducing the computational cost of the network.
The remaining sections of this paper are organized as follows: Section 2 provides an
overview of related works on target detection with neural networks, including YOLOv5,
and other algorithms in infrared target detection. Section 3 provides a detailed introduction
to the Edge-YOLO algorithm proposed in this paper. Section 4 presents the results of various
experiments conducted to evaluate the performance of Edge-YOLO. Finally, Section 5
summarizes the main contributions of this paper.
2. Related Works
The YOLO family of algorithms, known for their efficiency and simplicity, was first
introduced by Redmon et al. in 2015. In the years that followed, Redmon et al. released
YOLOv2 and YOLOv3 algorithms, which further reduced network complexity and im-
proved detection speed compared with two-stage algorithms [7]. After Redmon withdrew
from the field of computer vision, Glenn Jocher released YOLOv5 in 2020, which has since
been updated to version 6.2. YOLOv5 is composed of a backbone feature extraction module,
a neck feature fusion module, and a head detection module, as illustrated in Figure 1. The
algorithm incorporates five different scales of n, s, m, l, and x, with larger scales delivering
higher detection accuracy but slower real-time performance. However, the network struc-
ture of the models of different scales remains consistent, differing only in the number of
partial layers, and is represented uniformly as “×n” in the figure. YOLOv5 uses CSPDark-
net53 as its backbone network, which includes the Cross Stage Partial (CSP) structure [8].
The CSP structure integrates gradient changes in the feature map, reducing the problem
of repeating gradient information in the backbone network. Moreover, YOLOv5 utilizes a
bottleneck structure with residual connections in the backbone network to prevent network
degradation due to gradient disappearance, and a bottleneck structure without residual
connections in the feature fusion layer to reduce computational effort. Additionally, Jocher
employs a modified Spatial Pyramid Pooling-Fast (SPPF) structure in place of the original SPP [9]. The modified SPPF achieves the same computational results as the original parallel
MaxPool layers of three different sizes by serializing multiple MaxPool layers of the same size, significantly reducing computational time.
Figure 1. The network structure of the YOLOv5 series (including n, s, m, l, and x, depending on the number of duplicates of module C3).
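For illustration, the serialization trick can be sketched in PyTorch as follows. This is a minimal rendering of the pooling path only; the actual SPPF module also wraps the pooling in 1 × 1 convolutions, so treat it as a sketch of the idea rather than the YOLOv5 source:

```python
import torch
import torch.nn as nn

class SPPFPooling(nn.Module):
    """Serialized max-pooling as used by SPPF: three chained 5x5 pools
    reproduce the 5x5 / 9x9 / 13x13 receptive fields of the parallel SPP
    pools while reusing intermediate results."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.pool(x)    # equivalent to one 5x5 max pool
        y2 = self.pool(y1)   # equivalent to one 9x9 max pool
        y3 = self.pool(y2)   # equivalent to one 13x13 max pool
        return torch.cat([x, y1, y2, y3], dim=1)
```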
In addition to YOLOv5, various target detection methods have been proposed for infrared scenes by researchers. Li et al. [10] proposed the YOLO-FIRI model, an infrared image area-free target detector based on YOLOv5. They achieved good infrared target detection performance by improving the CSP structure and introducing multiple detection heads. Fan et al. [11] improved the feature extraction capability by using dense connection blocks based on YOLOv5, and improved the detection accuracy by adding a channel focus mechanism and modifying the loss function. Dai et al. [12] proposed TIRNet, which adopted VGG as the feature extractor and used a continuous information fusion strategy to obtain more accurate and smoother detection results. Li et al. [13] designed a dense nested interactive module to achieve progressive interaction among high-level and low-level features. You et al. [14] utilized multiscale mosaic data augmentation to enhance the diversity of objects and proposed a parameter-free attention mechanism to enhance features. Although these methods can be applied to IR target detection, they have some drawbacks. For instance, striped targets in IR road scenes require a more reasonable combination of striped convolution and traditional convolution to extract features. The bounding box loss function of the algorithm needs to be more accurately adapted to the boundary regression of targets in IR images, and the model needs to be more lightweight to be suitable for practical edge devices. Therefore, the method in this paper focuses on improving the above shortcomings.
3. Methods
The structure of our Edge-YOLO model is shown in Figure 2 below.
Firstly, the backbone of the model uses the improved ShuffleBlock to replace the C3 module in YOLOv5 to enhance the feature extraction capability for the characteristics of IR images of road scenes while reducing the complexity of the model.
Secondly, in the feature up-sampling structure, the original nearest-neighbor up-sampling operator is replaced by the improved CAU-Lite module.
Thirdly, although not shown in the figure, we utilize the recently proposed EX-IoU instead of CIoU as the bounding box loss function of our model. This new loss function provides better and more accurate convergence during training, thus leading to improved detection performance.
The CIoU loss [15] used in YOLOv5 is defined as:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (1)$$

where α is the weight coefficient, $\rho(b, b^{gt})$ is the distance between the center points of the predicted box and the ground truth box, c is the diagonal length of the minimum outer rectangle, and v indicates the difference between the aspect ratios of the predicted box and the ground truth box (v is 0 if they have the same aspect ratio).
The CIoU metric used in the current version of the YOLOv5 algorithm is designed to integrate three aspects: the intersection over union (IoU) between the predicted box and the ground truth box, the center-point distance, and the aspect ratio. However, the aspect ratio used in CIoU is a relative value, which
can introduce uncertainty during calculation and potentially hinder the optimization of the
model. To address this issue, we propose the use of Efficient-IoU (EIoU) as the bounding
box loss function, as proposed by Zhang et al. [16]. EIoU splits the aspect ratio based
on CIoU and replaces the original aspect ratio difference between the predicted box and
ground truth box with the ratio of the width difference between the predicted box and the
ground truth box to the width of the minimum circumscribed rectangle, and the ratio of
the height difference to the height of the minimum outer rectangle. This approach leads to
a more accurate bounding box loss function and facilitates better model optimization. The
formula is as follows:
$$L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2} \quad (2)$$
where Cw and Ch denote the width and height of the minimum outer rectangle, respectively.
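For concreteness, Equation (2) can be implemented roughly as follows in PyTorch; the function below assumes boxes in (x1, y1, x2, y2) format and is our own illustration of the formula, not code from the paper:

```python
import torch

def eiou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """EIoU loss (Equation (2)) for boxes given as (..., 4) tensors in xyxy format."""
    # Widths/heights of the predicted and ground truth boxes
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    # Intersection area and IoU
    iw = (torch.min(pred[..., 2], gt[..., 2]) - torch.max(pred[..., 0], gt[..., 0])).clamp(min=0)
    ih = (torch.min(pred[..., 3], gt[..., 3]) - torch.max(pred[..., 1], gt[..., 1])).clamp(min=0)
    inter = iw * ih
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # Minimum enclosing rectangle: width Cw, height Ch, squared diagonal c^2
    cw = torch.max(pred[..., 2], gt[..., 2]) - torch.min(pred[..., 0], gt[..., 0])
    ch = torch.max(pred[..., 3], gt[..., 3]) - torch.min(pred[..., 1], gt[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared centre distance rho^2(b, b_gt)
    dx = (pred[..., 0] + pred[..., 2] - gt[..., 0] - gt[..., 2]) / 2
    dy = (pred[..., 1] + pred[..., 3] - gt[..., 1] - gt[..., 3]) / 2
    rho2 = dx ** 2 + dy ** 2
    # Width/height penalties rho^2(w, w_gt)/Cw^2 and rho^2(h, h_gt)/Ch^2
    wh = (wp - wg) ** 2 / (cw ** 2 + eps) + (hp - hg) ** 2 / (ch ** 2 + eps)
    return 1 - iou + rho2 / c2 + wh
```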
The literature [17] proposes to use the hyperparameter α as a power on each term in the IoU loss; in its simplest form,

$$L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha} \quad (3)$$
The parameter α is crucial in emphasizing the importance of the loss and gradient
of objects with high IoU, thereby enhancing the accuracy of the bounding box regression.
To improve the bounding box loss function, this paper incorporates the power α into the
EIoU equation, resulting in a new function known as EX-IoU. This function exponentially
magnifies the importance of the IoU value, centroid distance, width difference, or height
difference between any predicted box and the ground truth box, leading to an exponential
reduction effect on losses and an improvement in the accuracy of the bounding box regres-
sion. The optimal value of α is determined through experiments discussed in Section 4.
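The excerpt does not reproduce the final EX-IoU expression. Applying the α-IoU power generalization [17] term by term to Equation (2) suggests a form along the lines of

$$L_{EX\text{-}IoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + \frac{\rho^{2\alpha}(w, w^{gt})}{C_w^{2\alpha}} + \frac{\rho^{2\alpha}(h, h^{gt})}{C_h^{2\alpha}}$$

which should be read as a hedged reconstruction rather than the authors' exact definition; α = 3 is the value selected experimentally in Section 4.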
of the output feature map is then mapped back to the input feature map in the feature reassembly module. The $k_{up} \times k_{up}$ region centered at the mapped point is taken out and dotted with the predicted up-sampling kernel at that point to obtain the result. Each channel at the same spatial location on the feature map uses the same up-sampling kernel.
The analysis reveals that CARAFE uses a zero-padding strategy at the edge positions of the feature map in the feature reorganization stage, which leads to imperfect edge information in the up-sampled images and makes it difficult to correctly upsample target features at the edge of the feature map. Based on this, this paper proposes an improved Content-Aware Up-Sampling-Lite (CAU-Lite) method to replace the nearest-neighbor up-sampling method in YOLOv5. Before finding the $k_{up} \times k_{up}$ neighborhood, nearest-neighbor interpolation is used to upsample the input feature map so that its spatial dimension matches that of the up-sampling kernel map. Then, at each spatial position in the feature map, the element of size $1 \times 1 \times k_{up}^2 \times C$ is taken out and reshaped to $k_{up} \times k_{up} \times C$. At the same time, the up-sampling kernel of size $1 \times 1 \times k_{up}^2$ at the corresponding position in the up-sampling kernel map is reshaped to $k_{up} \times k_{up} \times 1$. Each channel of the feature map is dotted with the up-sampling kernel to obtain a result of size $1 \times 1 \times C$, and the results over all channels form the corresponding position in the output feature map. The improved CAU-Lite structure and calculation process are shown in Figure 3.
Figure 3. Structure of CAU-Lite.
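To make the reassembly step concrete, the following PyTorch sketch mirrors the flow described above: nearest-neighbor pre-upsampling, extraction of the $k_{up} \times k_{up}$ neighborhood, and a per-position dot product with a kernel shared across channels. The kernel map is assumed to be predicted elsewhere (e.g., by a small convolutional branch, as in CARAFE [19]) and softmax-normalized; border handling here simply relies on unfold's zero padding, so this is an illustration rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def cau_lite_reassemble(x: torch.Tensor, kernels: torch.Tensor,
                        scale: int = 2, k_up: int = 5) -> torch.Tensor:
    """Feature reassembly sketch for CAU-Lite.

    x:       input features, shape (N, C, H, W)
    kernels: predicted up-sampling kernels, shape (N, k_up*k_up, scale*H, scale*W),
             assumed softmax-normalized over the k_up*k_up dimension
    """
    n, c, h, w = x.shape
    # Nearest-neighbour pre-upsampling so the feature map matches the kernel map
    x_up = F.interpolate(x, scale_factor=scale, mode="nearest")
    # k_up x k_up neighbourhood at every position (borders zero-padded by unfold)
    patches = F.unfold(x_up, kernel_size=k_up, padding=k_up // 2)
    patches = patches.view(n, c, k_up * k_up, scale * h, scale * w)
    # Per-position dot product; the kernel is shared across all channels
    return (patches * kernels.unsqueeze(1)).sum(dim=2)
```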
3.3. YOLOv5 Network Model Improvement
The YOLOv5 algorithm is mainly designed for the visible light domain and is better suited for deployment on GPUs due to its high number of model parameters, large computational requirements, and large model size. However, the objective of this paper is to create lightweight target detection networks for edge-embedded devices, making it reasonable to employ a lightweight network structure instead of the heavy CSPDarknet backbone network of YOLOv5. One promising candidate for such a lightweight network model is ShuffleNetv2, proposed by Ma et al. of Megvii's team [20], which achieves a good balance between model accuracy and running speed. By using lightweight structures such as grouped convolution and depthwise convolution, ShuffleNetv2 is optimized for computational complexity, storage access cost, and parallelism, resulting in a noticeable improvement in actual running speed. In light of this, we chose to use an improved version of ShuffleNetv2 as the backbone network structure of the Edge-YOLO algorithm.
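For reference, the channel shuffle operation at the core of ShuffleNetv2 [20] can be written in a few lines of PyTorch; this is a generic rendering of the published operation, not the authors' code:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Channel shuffle from ShuffleNetv2: interleave channels across groups so
    that information mixes between the two branches of a ShuffleBlock."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)
```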
The original ShuffleBlock applies a 1 × 1 convolution after the depthwise convolution in its right branch, which generates redundancy. To reduce the number of parameters and computational demand, this paper removes the 1 × 1 convolution layer after the depthwise convolution.
Figure 5. Improved ShuffleBlock. (a): the base unit with SDCA applied; (b): the base unit for down-sampling (2×).
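The exact definition of the strip depthwise convolutional attention (SDCA) module is only partially preserved in this excerpt, so the following PyTorch sketch is a hypothetical reconstruction consistent with the description: depthwise strip convolutions to capture strip-shaped features, combined with a local depthwise convolution and a 1 × 1 channel-mixing convolution whose output re-weights the input. The kernel sizes and composition are our assumptions:

```python
import torch
import torch.nn as nn

class SDCA(nn.Module):
    """Hypothetical sketch of a strip depthwise convolutional attention module.

    A square depthwise convolution gathers local context, horizontal and
    vertical strip depthwise convolutions capture elongated (strip-shaped)
    structures, and a 1x1 convolution mixes channels; the result re-weights
    the input features as an attention map.
    """

    def __init__(self, channels: int, strip_k: int = 11):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_h = nn.Conv2d(channels, channels, (1, strip_k),
                              padding=(0, strip_k // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (strip_k, 1),
                              padding=(strip_k // 2, 0), groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dw(x)
        attn = attn + self.dw_h(attn) + self.dw_v(attn)
        return x * self.pw(attn)  # attention re-weighting of the input
```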
4. Experiments
4.1. Experimental Environment and Dataset
The experiments in this paper were conducted using an Intel Xeon Platinum 8255C CPU and an NVIDIA RTX 3090 GPU with CUDA version 11.7. To evaluate the detection performance of Edge-YOLO, we used the publicly available FLIR dataset, an infrared dataset released by FLIR in 2018. The dataset consists of more than 10,000 images classified into four categories: Person, Bicycle, Car, and Dog. However, since there are only a few Dog images in the dataset, this paper only evaluated the detection performance of Edge-YOLO for the remaining three categories.
4.2. Bounding Box Hyperparameter Study
In the improved EX-IoU bounding box loss function of this paper, there is a hyperparameter α that affects the model's accuracy performance. To determine the optimal value of α for the Edge-YOLO algorithm, we conducted multiple training and testing experiments using different values of α. The accuracy results obtained are shown in Figure 6. From the results, it can be observed that the highest mAP value of 78.8% is achieved when the value of α is set to 3, while the mAP value of the model decreases to 76.9% when the value of α is set to 8. This indicates that the model's detection accuracy improves by 2.47% when using the optimal value of α, and the model achieves its best detection performance. As a result, this paper selects 3 as the power of each term in EX-IoU to obtain the best accuracy performance.
Figure 6. Variation in accuracy for different values of α on the algorithm.
4.3. Model Lightweighting Experiment
By replacing the backbone feature extraction network of YOLOv5 with the improved ShuffleBlock in this paper, i.e., Edge-YOLO shown in Figure 2, the overall number of parameters, computation, and model size of the algorithm model can be effectively reduced, and Table 1 below shows the comparison of each parameter after the model is lightened and improved.
Table 1. Comparison of lightweight improvement effects.
Model | Params/M | Flops/G | Size/MB
YOLOv5m | 20.9 | 47.9 | 42.2
Edge-YOLO | 5.8 | 14.2 | 12.0
The table above shows that by replacing the backbone network with the improved ShuffleBlock, Edge-YOLO reduces the number of network parameters by 72.2%, the amount of computation by 70.3%, and the model size by 71.6% compared with YOLOv5m. This demonstrates the significant lightweight effect of the proposed method, which helps to reduce the storage and computation resources required by the model and is more suitable for deployment on edge-embedded devices.
4.4. Ablation Experiments
In this part, the original ShuffleNetv2 is first used as the backbone network of YOLOv5m. On this basis, ablation experiments on the several improvement strategies proposed in this paper are conducted to better understand the effects of the different improvement strategies on the detection performance of Edge-YOLO; the results are shown in Table 2 below.
Table 2. Ablation experiments.
As can be seen from Table 2, compared with the first group of experiments using only the basic model, the second group of experiments with the addition of EX-IoU solves the problem of uncertainty in the aspect ratio of CIoU by improving the loss function of the
bounding box and accelerating the convergence of the loss function, and the detection
accuracy is improved by 1.2% from the results, while the remaining parameters remain
unchanged. The third group of experiments replaces the original nearest neighbor interpo-
lation up-sampling with the CAU-Lite up-sampling operator proposed in this paper, which
senses and aggregates contextual information within a larger reception field, dynamically
generates adaptive up-sampling kernels, and performs feature reorganization based on the
generated up-sampling kernels. It can be seen that with CAU-Lite, the detection accuracy
of the model is improved by 1.6%, but the FPS is also slightly reduced. The fourth group
of experiments applies the strip depthwise convolutional attention module proposed in
this paper, which replaces the original ShuffleNetv2 network structure with an improved
ShuffleBlock, enhancing the feature extraction capability for strip-shaped targets and the
perception of the saliency of infrared targets. As seen in the table, the detection performance
of the model is significantly improved by 3.1% compared with the original model, but the
number of parameters, computation, and size of the model increased due to the addition
of the new module. The final, fifth set of experiments uses a combination of the three improvement points proposed in this paper; from the results, a large performance improvement is obtained at the cost of only a small amount of additional computational and storage resources compared with the original model.
Figure 7 below shows the P–R curves of each class of different improvement strategies
applied to the base model and the complete Edge-YOLO. The figure shows that compared
with the base model, the APs of all three target categories are improved with different
improvement strategies, and the AP of the bicycle category is improved most significantly.
Figure 7. P–R curves for each class. (a): P–R curves of the class 'Person'; (b): P–R curves of the class 'Bicycle'; (c): P–R curves of the class 'Car'; (d): P–R curves of all classes.
4.5. Comparison Experiments
To further verify the detection performance of the Edge-YOLO algorithm, this section compares Edge-YOLO with Faster R-CNN, SSD, YOLOv5m, YOLOv7 [22], and other mainstream target detection algorithms, and the results are shown in Table 3 below.
Table 3. Comparison of mainstream target detection algorithms.

Model | [email protected] Person/% | [email protected] Bicycle/% | [email protected] Car/% | [email protected]/% | FPS | Params/M | Flops/G | Size/MB
Faster R-CNN | 76.6 | 50.4 | 85.7 | 70.9 | 17.6 | 66.1 | 152.7 | 133.5
SSD | 63.5 | 43.5 | 79.3 | 62.1 | 31.9 | 46.9 | 117.6 | 96.1
YOLOv5m | 83.8 | 64.1 | 90.6 | 79.5 | 55.9 | 20.9 | 47.9 | 42.2
YOLOv7 | 86.4 | 67.3 | 91.4 | 81.7 | 27.9 | 37.2 | 105.1 | 74.8
YOLO-FIRI | 79.5 | 52.4 | 88.6 | 73.5 | 71.4 | 7.2 | 20.4 | 15.0
Edge-YOLO | 83.2 | 63.0 | 90.2 | 78.8 | 80.7 | 5.8 | 14.2 | 12.0
From the table, we can first see that Faster R-CNN, as a two-stage algorithm, lags far behind the single-stage algorithms in detection speed, and its detection accuracy currently holds no advantage over them either; the SSD algorithm speeds up detection compared with Faster R-CNN, but its detection accuracy is reduced accordingly. Neither algorithm is comparable to the current YOLO series. Second, the detection accuracy of the algorithm in this paper is basically the same as that of the YOLOv5m algorithm, but it has obvious advantages in detection speed and in the consumption of computational and storage resources. Again, compared with the latest YOLOv7 algorithm, the detection accuracy of the Edge-YOLO algorithm is slightly behind, but the resources consumed by the YOLOv7 algorithm and its detection speed are completely inferior to those of this paper. Finally, compared with the YOLO-FIRI target detection algorithm, which is likewise lightweight, Edge-YOLO achieves higher detection accuracy and speed while requiring fewer parameters, less computation, and less storage (Table 3).
4.6. Comparison of Test Results
Figure 8 below shows the detection results of some images in the dataset under YOLO-FIRI, YOLOv5m, YOLOv7, and the Edge-YOLO of this paper. Since the Faster R-CNN and SSD in the previous subsection lag in detection accuracy and detection speed, only YOLO-FIRI, YOLOv5m, and YOLOv7 are used in the visualization effect comparison with the algorithm in this paper.
From the figure, it can be seen that compared with YOLO-FIRI, the target detection
algorithm for infrared road scenes, the algorithm in this paper has a certain lead in accuracy and a higher confidence level in targets. In addition, observing the fourth figure, we can see that the YOLO-FIRI algorithm misdetects some pedestrian legs as bicycles, which has some defects. After comparing this algorithm with the YOLOv5m algorithm
and YOLOv7 algorithm, we can see that the three algorithms basically maintain the same
detection results, and they can detect cars, pedestrians, and a small number of bicycles in
road scenes well. Because the algorithm in this paper is a lightweight network model, it is better than the other two algorithms in terms of the number of parameters, computation, and model size, so it has greater practical application value.
4.7. Actual Edge Device Deployment Testing
This paper uses the RK3588 embedded development board of Rockchip as the verification platform, as shown in Figure 9 below. The RK3588 platform is equipped with quad-core A76 + quad-core A55 (an octa-core CPU) and an NPU with 6 TOPS of computing power. Its high-computing-power NPU supports INT4, INT8, INT16, and FP16 mixed computing, which can accelerate the inference of network models. The photo of RK3588 is shown below.
The algorithm model in this paper and the comparison algorithm models are first exported to the compatible ONNX format, and then converted to the RKNN model supported by the NPU of the RK3588 platform using the RKNN-Toolkit2 and rknpu2 tools, with inference acceleration such as asymmetric hybrid quantization. These models are used to infer the test set images, and the resulting performance comparison is shown in the following table. In addition to inference using the NPU, the performance of CPU-only inference is also tested in this paper and is shown together in Table 4 below.
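The conversion flow can be sketched with RKNN-Toolkit2 roughly as follows. The file names, normalization values, and calibration list are placeholders, and the asymmetric hybrid quantization mentioned above uses the toolkit's separate hybrid-quantization interface rather than the plain quantized build shown here:

```python
from rknn.api import RKNN

rknn = RKNN()
# Target the RK3588 NPU; mean/std normalization values are placeholders
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
            target_platform='rk3588')
rknn.load_onnx(model='edge_yolo.onnx')                    # hypothetical ONNX export
rknn.build(do_quantization=True, dataset='./calib.txt')   # calibration image list
rknn.export_rknn('edge_yolo.rknn')                        # model consumed by rknpu2
rknn.release()
```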
Table 4. Testing results on the RK3588 platform.
As can be seen from the table, the accuracy of all four models on the RK3588 platform decreases slightly due to model quantization. In addition, if only the ARM CPU is used for inference, the FPS of algorithms such as YOLO-FIRI is less than 1, i.e., fewer than one image can be inferred per second, and the algorithm in this paper reaches an FPS of only 1.1, which cannot be deployed in practical application scenarios. After using the NPU for acceleration, the inference speed of each algorithm improves by tens of times. However, the FPS of YOLOv5m and YOLOv7 are only 14.5 and 8.8, respectively, which produce noticeable lags in real-world applications, while the algorithm in this paper achieves 31.9 FPS, which can meet the performance requirements of practical scenarios.
5. Conclusions
The proposed method in this paper, Edge-YOLO, is a lightweight IR target detection approach that aims to ensure good performance in road scenes and is suitable for edge-embedded devices. The algorithm utilizes an optimized bounding box loss function, the improved EX-IoU, to enhance the regression accuracy of the bounding box. Moreover, to improve the up-sampling effect, the algorithm adopts the improved CAU-Lite up-sampling operator, which perceives the contextual content. Lastly, the lightweight ShuffleBlock replaces the backbone feature extraction part of the network, and the strip depthwise convolutional attention module is used to enhance the extraction capability for strip-shaped targets and other salient features present in the IR feature map, thus further enhancing the detection accuracy of the model. The experimental results on the FLIR dataset demonstrate that Edge-YOLO is essentially equivalent to YOLOv5m in terms of accuracy, while reducing the number of network parameters, computation, and model size by 72.2%, 70.3%, and 71.6%, respectively. Additionally, the detection speed is increased by 44.4%, making the algorithm more suitable for embedded device applications.
Author Contributions: Conceptualization, J.Y.; methodology, J.L.; software, J.L.; validation, J.L.;
writing—original draft preparation, J.L.; writing—review and editing, J.L. and J.Y.; supervision, J.Y.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote
Sens. 2013, 52, 574–581. [CrossRef]
2. Liu, R.; Lu, Y.; Gong, C.; Liu, Y. Infrared point target detection with improved template matching. Infrared Phys. Technol. 2012, 55,
380–387. [CrossRef]
3. Teutsch, M.; Muller, T.; Huber, M.; Beyerer, J. Low resolution person detection with a moving thermal infrared camera by hot spot
classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH,
USA, 23–28 June 2014; pp. 209–216.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings
of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings
of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham,
Switzerland, 2016; pp. 21–37.
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
7. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
8. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance the learning
capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle,
WA, USA, 14–19 June 2020; pp. 390–391.
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef]
10. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access 2021, 9,
141861–141875. [CrossRef]
11. Fan, Y.; Qiu, Q.; Hou, S.; Li, Y.; Xie, J.; Qin, M.; Chu, F. Application of Improved YOLOv5 in Aerial Photographing Infrared
Vehicle Detection. Electronics 2022, 11, 2344. [CrossRef]
12. Dai, X.; Yuan, X.; Wei, X. TIRNet: Object detection in thermal infrared images for autonomous driving. Appl. Intell. 2021, 51,
1244–1261. [CrossRef]
13. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target
detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [CrossRef] [PubMed]
14. You, S.; Ji, Y.; Liu, S.; Mei, C.; Yao, X.; Feng, Y. A thermal infrared pedestrian-detection method for edge computing devices.
Sensors 2022, 22, 6710. [CrossRef] [PubMed]
15. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc.
AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [CrossRef]
16. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression.
Neurocomputing 2022, 506, 146–157. [CrossRef]
17. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. Alpha-IoU: A Family of Power Intersection over Union Losses for Bounding
Box Regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242.
18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
19. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
20. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of
the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.