
Hindawi

Mathematical Problems in Engineering


Volume 2021, Article ID 6555513, 11 pages
https://doi.org/10.1155/2021/6555513

Research Article
Embedded YOLO: A Real-Time Object Detector for Small
Intelligent Trajectory Cars

WenYu Feng,¹ YuanFan Zhu,² JunTai Zheng,¹ and Han Wang¹

¹School of Transportation and Civil Engineering, Nantong University, 9 SeYuan Road, Nantong, China
²School of Management Science and Engineering, Chongqing Technology and Business University, Chongqing, China

Correspondence should be addressed to Han Wang; [email protected]

Received 12 August 2021; Accepted 23 August 2021; Published 2 September 2021

Academic Editor: Xiao Chen

Copyright © 2021 WenYu Feng et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
YOLO-Tiny is a lightweight version of the "You only look once" (YOLO) object detection model, with a simplified network structure and fewer parameters, which makes it suitable for real-time applications. Although the YOLO-
Tiny series, which includes YOLOv3-Tiny and YOLOv4-Tiny, can achieve real-time performance on a powerful GPU, it remains
challenging to leverage this approach for real-time object detection on embedded computing devices, such as those in small
intelligent trajectory cars. To obtain real-time and high-accuracy performance on these embedded devices, a novel object detection
lightweight network called embedded YOLO is proposed in this paper. First, a new backbone network structure, ASU-SPP
network, is proposed to enhance the effectiveness of low-level features. Then, we designed a simplified version of the neck network
module PANet-Tiny that reduces computation complexity. Finally, in the detection head module, we use depthwise separable
convolution to reduce the number of convolution stacks. In addition, the number of channels is reduced to 96 dimensions so that
the module can attain the parallel acceleration of most inference frameworks. With its lightweight design, the proposed embedded
YOLO model has only 3.53M parameters, and its average processing speed reaches 155.1 frames per second, as verified in Baidu smart car target detection. At the same time, compared with YOLOv3-Tiny and YOLOv4-Tiny, the detection accuracy is
6% higher.

1. Introduction

Intelligent autonomous driving cars, also known as smart cars, have quickly become a research hotspot because of their small size, flexibility, and low energy consumption [1]. Smart cars can perform automatic tracking, obstacle avoidance, positioning and parking, and remote image transmission [2]. They have broad application prospects not only in transportation but also in the military, medicine, and aerospace fields. With the COVID-19 pandemic, demand for unmanned vehicles is growing.

Automatic obstacle avoidance and path planning are two hot research topics related to smart cars, and the analysis and processing of the data stream from onboard terminal equipment still face many challenges. Rapid detection of a target is the basic problem of autonomous driving and obstacle avoidance. Visual object detection based on a convolutional neural network (CNN) can automatically identify landmarks, pedestrians [3], and other vehicles [4] and can also measure distance and speed [5] through relative positions. Because of its strong generalization and high practicability, it has received extensive attention from researchers [6, 7].

Traditional object detection methods, which are based on a two-stage framework consisting of region selection followed by classification, include RCNN [8] and fast-RCNN [9]. In 2016, YOLO [10] was proposed as a new one-stage framework for target detection. The main idea was to use the entire image as the input of the network and directly regress the position of the bounding box and the category to which the bounding box belongs in the output layer. To enhance the accuracy of the original YOLO, Redmon proposed a new joint training method in YOLOv2 [11] that trained the detector simultaneously on the COCO detection dataset [12] and the ImageNet classification dataset [13]. As a result, YOLOv2 achieved a 19.7 mean average precision (mAP) score on ImageNet.

To improve the processing speed, the feature extraction backbone network Darknet53 was proposed in YOLOv3 [14]. Experimental results show that Darknet53 is similar in accuracy to the ResNet network but has a faster processing speed. To further balance the relationship between detection accuracy and processing speed, Bochkovskiy sorted out the model structure of the YOLO series in YOLOv4 [15], that is, the feature extraction module "backbone network," the feature enhancement module "neck network," and the detection module "head network." YOLOv4 [15] enlarged the receptive field through the CSPDarknet53 CNN and used the PANet [16] feature pyramid structure for multiscale feature fusion. The experimental results on the COCO dataset [12] proved that the detection accuracy of the YOLOv4 model is improved on the basis of a guaranteed processing speed of 30 frames per second (fps).

Although the performance of the optimized YOLOv4 model has been improved, applying it in the mobile terminal environment of smart cars still faces two contradictory problems. (1) The YOLO series models have high requirements for hardware configuration; in the smart car terminal environment, the detection speed still cannot meet real-time requirements. (2) Excessive simplification of the network structure reduces the effectiveness of the features, causing a significant decrease in detection accuracy while increasing the calculation speed, which weakens the model's target detection capability.

In order to balance the above contradictions, this paper formulates the following optimization strategies for the YOLO model. The backbone network and neck network in the YOLO model determine the detection accuracy of the model, and their parameters account for more than 60% of the total model. The backbone network is responsible for the extraction of underlying features; to maintain the detection accuracy, its complexity must be preserved or even increased to ensure the effectiveness of the extracted features. The neck network is responsible for feature enhancement, so its structure and calculation process can be simplified to reduce the parameter volume and increase the calculation speed while preserving its basic function.

On the basis of the above optimization strategy, this paper proposes an ultralightweight target detection network model, called embedded YOLO, for the smart car mobile terminal environment.

First, the feature extraction module ASU-SPP network is proposed. In the three-scale feature extraction branch, the attention mechanism is used to adjust the feature channel weights of the lightweight network ShuffleNet v2, and then SPP is used to enhance the features with multiscale pooling. Second, we design the PANet-Tiny network, a lightweight feature fusion module. By excluding all large-scale convolution operations in PANet, only 1 × 1 convolution is retained for feature channel dimension alignment. Both up-sampling and down-sampling are implemented using linear interpolation. The multiscale fusion method uses an element addition operation, which significantly reduces the amount of calculation and the number of parameters in the entire feature fusion module. Finally, a lightweight head structure is adopted, and depthwise separable convolution is used to reduce the number of convolution stacks. At the same time, the number of channels is reduced to 96 dimensions so that the model can obtain the parallel acceleration of most inference frameworks.

To verify the effectiveness of the proposed model, we selected a smart car based on an edge processor board as the verification environment. After online real-world testing in the mobile terminal environment, embedded YOLO achieved high accuracy and real-time capability.

2. Proposed Lightweight Network

Figure 1 shows the embedded YOLO network model architecture. It can be seen from Figure 1 that the model follows the three-module tandem frame design of YOLOv4 and is composed of the underlying feature extraction module ASU-SPP network, the multiscale feature fusion module PANet-Tiny, and the detection head module Head-Tiny.

2.1. ASU-SPP Backbone Network. The traditional YOLOv4 low-level feature extraction module CSPDarknet53 has so many parameters and such high computational complexity that its network operation speed is only 30 fps, which makes it difficult to meet the real-time requirements of the smart car mobile terminal equipment environment. Conversely, the simplistic network structure of the underlying feature extraction module CSPDarknet53-Tiny in YOLOv4-Tiny [17] results in weak feature extraction capability. In addition, the environment in the front-view image of the smart car is complex, with multiple target objects of very different scales often appearing, as well as mutual occlusion. Therefore, it is difficult for a one-stage detector to accurately capture all targets at the same time.

In order to compress the network volume, enhance the effectiveness of features, and achieve multiscale bottom-level feature extraction, this paper proposes the multiscale low-level feature extraction module ASU-SPP network. As shown in Figure 1, the model uses the lightweight ShuffleNet v2 network as the backbone structure for feature extraction, adjusts the weights of its important feature channels by introducing the attention mechanism squeeze-excitation (SE) unit, and uses SPP to obtain multiscale pooling branch features.

Figure 2 shows the attentive ShuffleNet unit (ASU) network structure. After the input feature map is channel-split, two branches are formed in the channel dimension. The lower branch is an identity mapping, and the upper branch contains three consecutive convolutional layers, including two 1 × 1 nongroup convolutions. After the BN layer, the squeeze-excitation (SE) attention mechanism module is employed to adjust and sort the channel feature values. To reduce the time loss caused by the introduction of the SE structure, the number of channels in the expansion layer is reduced to 1/8 of the original, improving the detection accuracy without increasing the time loss. Finally, the feature maps extracted from the two branches are fused by the concatenate layer and sent to the channel shuffle layer.

GFL: Quality Focal Loss (QFL)


Input (416,416,3) Supervision Supervision
Positives Negatives
0 0.9 0 0 0 0 0 0 0 0
Conv Soft one-hot label (iou label)
(208,208,24) General distribution
1.0
P(x)
MaxPool
y Supervision
(104,104,24)
Distribution Focal Loss (DFL)
box
Attentive ShuffleNet Unit ×3 conf
(52,52,116) SPP
class

Attentive ShuffleNet Unit


×7 SPP
(26,26,232)

SPP

MaxPool
Attentive ShuffleNet Unit 5×5
×3
(13,13,464)

Concate
MaxPool
9×9
① ASU-SPP Network MaxPool ② PANet-Tiny ③ Head-Tiny
(Backbone) 13×13 (Neck) (Head)

Figure 1: Architecture of the embedded YOLO lightweight network model.

Figure 2: Structure of the attentive ShuffleNet unit (ASU).

In this paper, the upper branch is selected to introduce the SE mechanism, which gives the ASU module two advantages: the concatenate and channel shuffle at the end of each ASU unit are combined with the channel split of the next unit to form a single element-level operation, avoiding additional element-level operations by design, and the SE model introduced in the upper branch keeps the extra amount of calculation small.

In order to enlarge the receptive field, the SPP structure is introduced at the output end of each ASU stage to perform multiscale pooling fusion, as shown in Figure 1. The output features are pooled at four scales (1 × 1, 5 × 5, 9 × 9, and 13 × 13) and then concatenated and fused so that the ASU-SPP backbone network realizes multiscale feature fusion for each branch output. Experiments show that the multiscale-window pooling design not only improves the accuracy of target detection but also significantly speeds up network convergence during training.
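A minimal sketch of this pooling scheme is given below, assuming stride-1 max pooling with padding so that every branch keeps the input resolution; the trailing 1 × 1 fusion convolution and all names are our own additions for illustration.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Spatial pyramid pooling over fixed windows (1x1 identity, 5x5, 9x9, 13x13),
    # each with stride 1 and "same" padding so the feature map size is unchanged.
    def __init__(self, in_channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes])
        # Fuse the concatenated branches back to the original channel count.
        self.fuse = nn.Conv2d(in_channels * (len(pool_sizes) + 1), in_channels, 1, bias=False)

    def forward(self, x):
        branches = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(branches, dim=1))
```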
Table 1 shows the structural characteristics of this paper's backbone network and of traditional YOLO series models. Although the backbone network proposed in this paper introduces the attention module SE and the SPP structure, through its lightweight design the parameter volume of the network module is greatly reduced compared with the traditional YOLO models, so the model can improve the calculation speed while ensuring no loss in detection accuracy.

Table 1: Comparison of the structural characteristics of different backbone networks.

Model name             Backbone network    Parameters (M)
YOLOv3-Tiny [18]       DarkNet53           86
YOLOv4-Tiny [17]       CSPDarknet53        6.06
Our Embedded YOLO      Our ASU-SPP         2.6

2.2. PANet-Tiny Neck Network. One type of feature map pyramid network (FPN) [19] used for target detection is BiFPN [20]. Although the BiFPN network has high performance, its large number of stacked feature fusion operations causes a decrease in operating speed.

The PANet network [16] is used as the feature fusion module in YOLOv4 because of its simple structure. However, there is still an excessive amount of calculation in the PANet structure, because stride = 2 convolutions are used to repeatedly rescale feature maps of different scales and because feature fusion based on the concatenate operation multiplies the number of channels. To solve these problems, this paper proposes a simplified multiscale feature scaling and fusion feature map pyramid network structure, called PANet-Tiny, shown in Figure 3. The network removes all large-scale convolution operations in the original PANet and only retains 1 × 1 convolution for feature channel dimension alignment. Both up-sampling and down-sampling are implemented using linear interpolation, and the multiscale fusion method adopts the element addition operation. The amount of calculation and the number of parameters of PANet-Tiny are significantly reduced compared with the original PANet.
Table 2 shows a comparison of the structural characteristics of different neck networks. It can be seen that the PANet-Tiny neck network proposed in this paper has the fewest parameters.

Table 2: Comparison of the structural characteristics of different neck networks.

Model name             Neck network       Parameters (M)
YOLOv3-Tiny [18]       FPN [19]           14.02
YOLOv4-Tiny [17]       PANet [16]         0.17
Our Embedded YOLO      Our PANet-Tiny     0.11
2.3. Head-Tiny Network. The traditional FCOS [21] detection head uses four 256-channel convolutions as a branch, so there are eight 256-channel convolutions in total in the two branches of border regression and classification. This creates a huge calculation overhead.

In response to this problem, this paper uses a single set of convolutional structures for each layer of features, replaces ordinary convolution with depthwise separable convolution, and reduces the number of convolution stacks from four groups to two. The 3 × 3 convolution can be replaced by a depthwise (DW) convolution with a 5 × 5 convolution kernel plus a 1 × 1 convolution, and they have the same receptive field. The number of parameters of the DW convolution and 1 × 1 convolution is only one-ninth that of the 3 × 3 convolution, which at the same time significantly reduces the computational overhead. The channel dimension is then compressed, and the feature dimension is reduced to 96, which allows the parallel acceleration of most inference frameworks. Finally, referencing the YOLO series models, the bounding box regression and classification are calculated using the same set of convolutions and then split into three branches. This makes the optimized lightweight detection head structure very small, as shown in Figure 4.
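The following sketch illustrates the substitution described above, pairing a 5 × 5 depthwise convolution with a 1 × 1 pointwise convolution on 96-dimensional features; the class name and the BN/ReLU placement are our own reading of Figure 4, not the released implementation.

```python
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    # 5x5 depthwise + 1x1 pointwise convolution on 96-channel features.
    # Weight count here: 96*5*5 + 96*96 = 11,616, versus 96*96*3*3 = 82,944
    # for a plain 3x3 convolution with the same channel width.
    def __init__(self, channels=96, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```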
2.4. Loss Function. The three basic elements of object detection are quality prediction, classification, and positioning. If the inconsistency of quality prediction between training and inference is not solved, then in complex scenes the positioning of objects tends to be uncertain and random. To ensure the consistency of training and prediction while also taking both the classification score and the quality prediction score into account, all positive and negative samples can be trained. We combine the representations of the two, retain the classification vector, and use the loss function

QFL(σ) = −|y − σ|^β [(1 − y) log(1 − σ) + y log(σ)],   (1)

where y is the soft quality label in [0, 1] and σ is the prediction. The global minimum solution of QFL is σ = y. In this way, the cross-entropy component becomes the complete cross-entropy, and the modulating factor becomes a power function of the absolute value of the distance between label and prediction. Considering that the real distribution is usually not far from the annotated position, another loss is added:

DFL(S_i, S_{i+1}) = −[(y_{i+1} − y) log(S_i) + (y − y_i) log(S_{i+1})].   (2)

The above formula optimizes, in the form of cross-entropy, the probabilities of the two discrete positions y_i and y_{i+1} closest to the label y on its left and right, so that the network can quickly focus on the distribution of the area adjacent to the target position. Finally, QFL and DFL are uniformly expressed as the generalized focal loss (GFL), whose specific form is as follows:

GFL(p_{y_l}, p_{y_r}) = −|y − (y_l p_{y_l} + y_r p_{y_r})|^β [(y_r − y) log(p_{y_l}) + (y − y_l) log(p_{y_r})].   (3)
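For concreteness, a minimal PyTorch sketch of the two loss terms follows; it is our reading of equations (1) and (2) rather than the authors' implementation, and the tensor shapes, function names, and sum reduction are assumptions (the GFL in equation (3) combines the two terms in the same pattern).

```python
import torch.nn.functional as F

def quality_focal_loss(pred_logits, quality_targets, beta=2.0):
    # Equation (1): soft IoU label y in [0, 1], sigmoid prediction sigma,
    # binary cross-entropy modulated by |y - sigma|^beta.
    sigma = pred_logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(pred_logits, quality_targets, reduction="none")
    return (ce * (quality_targets - sigma).abs().pow(beta)).sum()

def distribution_focal_loss(pred_dist_logits, target):
    # Equation (2): the continuous target y lies between discrete bins y_i and y_{i+1};
    # cross-entropy pushes probability mass onto the two neighbouring bins,
    # weighted by how close y is to each of them.
    n_bins = pred_dist_logits.size(-1)
    left = target.floor().long().clamp(min=0, max=n_bins - 2)   # index of y_i
    right = left + 1                                            # index of y_{i+1}
    w_left = right.float() - target                             # y_{i+1} - y
    w_right = target - left.float()                             # y - y_i
    log_prob = F.log_softmax(pred_dist_logits, dim=-1)
    loss = -(w_left * log_prob.gather(-1, left.unsqueeze(-1)).squeeze(-1)
             + w_right * log_prob.gather(-1, right.unsqueeze(-1)).squeeze(-1))
    return loss.sum()
```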
3. Analysis of Experimental Process and Results

3.1. Experimental Environment and Evaluation Indicators. In the experimental environment of this paper, the training environment uses an Nvidia GeForce RTX 3090 24 GB GPU, and the test environment uses the edgeboard, a computing card based on an FPGA (the Xilinx Zynq ZU3) with 2 GB of DDR4 memory, a 64-bit OS, and 1.2 TOPS of computing power. The software consists of the Ubuntu v20.04 operating system, CUDA v11.1, Python v3.6.12, and the PyTorch v1.7.0 deep learning framework. Mean average precision, or mAP (0.5:0.95), and frames per second (fps) are used as the evaluation indicators of model detection accuracy and speed, respectively.

3.2. Backbone Network Ablation Experiment. To verify the effectiveness of the proposed backbone network, an ablation experiment on the ASU-SPP network structure was carried out. Our collected dataset includes basic traffic signs, such as traffic lights, zebra (pedestrian) crossings, speed limits, and parking, captured from a smart car, as shown in Figure 5. Table 3 shows the detection results based on our collected dataset.

Table 3: Parameters of different backbone networks and detection results.

Backbone model        Parameters (M)    Top-1 Acc (%)    Top-5 Acc (%)
ShuffleNet v2 [22]    3.07              69.4             88.9
Our ASU-SPP           2.6               72.4             90.3

From Table 3, it can be seen that compared with the traditional ShuffleNet v2 [22], the ASU-SPP network reduces the number of parameters to 2.6M while achieving an improvement of 2.2% in the Top-1 accuracy (i.e., the rate at which the first predicted category is consistent with the actual result). The experimental results show that the introduction of the attention mechanism SE and the multiscale feature fusion structure SPP enhances the characterization ability of the underlying features.

Figure 3: Network structure of PANet-Tiny.

3.3. Neck Network Ablation Experiment. To verify the effectiveness of the neck network PANet-Tiny, Mish [23] was selected as the neck network activation function, PANet-Tiny was combined with different backbone networks, and its parameter size and detection accuracy were compared with those of the traditional PANet network. The experimental results are shown in Table 4.

Table 4: Feature fusion network comparison experiment (PANet vs. PANet-Tiny).

Backbone network       Neck network       Parameters (M)    Accuracy (%)
ShuffleNet v2 [22]     PANet [16]         14.33             21.7
ShuffleNet v2 [22]     Our PANet-Tiny     3.71              21.7
Our ASU-SPP            PANet [16]         12.62             23.3
Our ASU-SPP            Our PANet-Tiny     3.25              23.3
MobileNetV3 [24]       PANet [16]         11.8              22.9
MobileNetV3 [24]       Our PANet-Tiny     4.84              22.9
ThunderNet [25]        PANet [16]         10.96             23.2
ThunderNet [25]        Our PANet-Tiny     4.52              23.2

From Table 4, it can be seen that on the COCO dataset, the parameter counts of PANet-Tiny combined with each of the four feature extraction networks are all lower than those of the traditional PANet network [16], and the detection accuracy is not compromised. In addition, under the same feature fusion module conditions, the ASU-SPP backbone network proposed in this paper has the best detection effect, with an accuracy of 23.3%, which is 0.1% higher than ThunderNet [25] and 0.4% and 1.7% higher than MobileNet v3 [24] and ShuffleNet v2 [22], respectively. This also verifies that the accuracy of the ASU-SPP network is significantly enhanced compared with that of traditional feature extraction networks.
3.4. Comparison of Lightweight Detection Networks. To compare the performance of lightweight target detection models, this paper selected the MS COCO dataset [12] as the experimental data to train and test the accuracy and real-time performance of the target detection models. The experiment set the input image size to 416 × 416, the initial learning rate to 0.14, and the attenuation coefficient to 0.0001. The MultiStepLR learning rate schedule was used, with milestones set to [240, 260, 275] and gamma set to 0.1. The SGD optimizer with momentum was used, with momentum of 0.9, weight decay of 0.0001, a batch size of 32, 2 workers per GPU, and 280 training epochs. The results of the comparative experiment are shown in Table 5.
selected as the neck network activation function, PANet- Table 6 shows the comparison results of target detection
Tiny was combined with different backbone networks, and speed for four YOLO models and ours in the GPU 3090
its parameter size was compared with that of traditional environment. It can be seen that the target detection speed of
PANet networks along with detection accuracy. The ex- the model in this paper is much better than that of the
perimental results are shown in Table 4. traditional lightweight YOLO series models.
From Table 4, it can be seen that under the COCO Figure 6 shows a histogram of the training convergence
dataset, the combined parameters of PANet-Tiny and the process and a performance comparison of different models.
four feature extraction networks are all lower than those of Figure 6(a) depicts the loss curve of the three YOLO series
the traditional PANet network [16], and the detection ac- models during the training process. It can be seen that in the
curacy is not compromised. In addition, under the same case of 280 rounds of model training, the embedded
feature fusion module conditions, the backbone network YOLOv4 proposed in this paper has the lowest loss—about
ASU-SPP network in this paper has the best detection effect, 3%—compared with embedded YOLO and YOLOv3-Tiny.
with an accuracy of 23.3, which is 0.1% higher than To better compare the accuracy of the detection performance
ThunderNet [25] and 0.4% and 1.7% higher than MobileNet of each model, Figure 6(b) shows the performance (AP-FPS)
v3 [24] and ShuffleNet v2 [22], respectively. This also verifies broken line histogram of the four models. It can be seen that

Figure 4: Network structure of Head-Tiny (alternating 1 × 1 convolutions and 5 × 5 depthwise convolutions on 96-channel features, followed by a final 1 × 1 output convolution).

Figure 5: Our collected dataset from the view of a smart car.

Figure 7 shows examples of target detection by the lightweight YOLO series models. It can be seen that the accuracy of the model in this paper is better than that of the other traditional YOLO series models for small targets in complex environments.

Figure 8 provides the comparison results of each kind of target detection and the Precision-Recall performance curves of the embedded YOLO and YOLOv4-Tiny models based on the VOC dataset [26]. Figures 8(a) and 8(b) show three object detection results of the embedded YOLO and YOLOv4-Tiny models, respectively. It can be seen that the YOLOv4-Tiny model misses some objects in complex environments, whereas embedded YOLO provides good detection results. Figures 8(c) and 8(d) show the detection accuracy for each kind of target and the Precision-Recall performance of embedded YOLO and YOLOv4-Tiny, respectively. The mAP of our proposed embedded YOLO is 77.89%, while the mAP of YOLOv4-Tiny is 71.20%. The Precision-Recall performance also illustrates that the object detection accuracy of the proposed embedded YOLO is higher than that of YOLOv4-Tiny.

In addition, we also conducted simulated target detection experiments from the perspective of the smart car for the YOLOv3-Tiny, YOLOv4-Tiny, and embedded YOLO models, and the test results for each model are shown in Figure 9. Figure 9(a) shows the detection results for simple markers in the three models, Figure 9(b) shows the detection results for fuzzy markers in the three models, and Figure 9(c) shows the detection results for complex markers in the three models.

Table 5: Results of comparison of light target detection models (edgeboard environment).


Model name mAP Parameters (M) Latency (ms) Size (MB) FPS
YOLOv3-Tiny [18] 16.6% 8.86 12.89 33.7 77.6
YOLOv4-Tiny [17] 21.7% 6.06 10.92 23.0 91.5
Embedded YOLO 23.3% 3.53 6.4 7.9 155.1
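As a rough consistency check (our own arithmetic, not taken from the paper), the FPS column is approximately the reciprocal of the latency column: 1000/12.89 ≈ 77.6, 1000/10.92 ≈ 91.6, and 1000/6.4 ≈ 156.3, which agrees closely with the reported 77.6, 91.5, and 155.1 fps.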

Table 6: Comparison results of lightweight models (GPU 3090 environment).

Model    YOLOv3 [14]    YOLOv3-Tiny [18]    YOLOv4 [15]    YOLOv4-Tiny [17]    Ours
FPS      30.32          40.62               24.9           49.06               86.29

Figure 6: Loss curve and AP-FPS performance results. (a) Training loss versus epoch for embedded YOLO, YOLOv3-Tiny, and YOLOv4-Tiny. (b) MS COCO object detection AP versus FPS.

Figure 7: Comparison of target detection of YOLO series models (YOLOv3-Tiny, YOLOv4-Tiny, and the proposed embedded YOLO).

Figure 8: Examples of target detection results for YOLO series models on the VOC dataset [26]. (a) Object detection results of embedded YOLO. (b) Object detection results of YOLOv4-Tiny. (c) Per-class detection accuracy of embedded YOLO and YOLOv4-Tiny. (d) Precision-Recall performance of embedded YOLO and YOLOv4-Tiny.

Figure 9: Comparison of target detection results of YOLO series models (YOLOv3-Tiny, YOLOv4-Tiny, and embedded YOLO). (a) Simple marker detection results. (b) Fuzzy marker detection results. (c) Complex marker detection results.

In Figure 9, the red boxes indicate detection results and the white boxes indicate missed targets.

As can be seen from Figure 9, for simple markers, every model achieves a good recognition effect. For fuzzy markers, the YOLOv3-Tiny model displayed a certain degree of misjudgment or missed detection, while the YOLOv4-Tiny and embedded YOLO models performed better, without misjudgment. For complex markers, the detection results were compared with those of the original YOLO series: it was difficult for those networks to find a suitable target frame in small-target detection, and they produced more misjudgments with lower confidence. The embedded YOLO network can more accurately give the correct target frame, improve the confidence of target discrimination, and reduce the misjudgment of target categories. It can be seen that the embedded YOLO model recognizes the category and location of each marker far better than the other lightweight YOLO series models, which reflects that the improvement of the underlying feature extraction method and the GFL loss function in this paper improves the detection accuracy of the model and the quality of the detection frame.

3.5. Online Experiment Results in Smart Terminal Environments. Lastly, we applied the proposed model to the small intelligent trajectory car terminal system. The small intelligent trajectory car is equipped with an edgeboard, a computing card based on an FPGA (Xilinx Zynq ZU3), with 2 GB of DDR4 memory, a 64-bit OS, and 1.2 TOPS of computing power. The experimental car and its main accessories are shown in Figure 10. The model in this paper achieves 155.1 fps on the smart car mobile terminal, which meets the real-time performance requirements of the vehicular system. Figure 11 shows the target detection results and an automatic driving schematic diagram from the perspective of the car after loading the proposed model.

Figure 10: The small intelligent trajectory car's main accessories (camera with 320 × 240 × 3 input; edgeboard with 2 GB DDR4 memory).

Figure 11: Examples of smart car first-view target detection and its autonomous driving trajectory.

4. Conclusions

In response to the requirement for fast target detection in a smart mobile terminal environment, this paper proposes an ultralightweight target detection network model, called embedded YOLO. First, the attention mechanism SE and the spatial pyramid pooling (SPP) structure are used to optimize the ShuffleNet network. Then, a multiscale feature fusion network module, called PANet-Tiny, is proposed to further lighten the detection network. Lastly, a lightweight structure is used on the detection head to simplify calculations.

After verification on the COCO test dataset and in online experiments, the embedded YOLO model was compared with traditional lightweight models, and it was found that, while maintaining the mAP performance, the proposed model compresses the volume of model parameters and increases the computing speed. The attention mechanism SE and SPP structure introduced in this paper can enhance the information expression ability of the network neck feature map and significantly improve the detection accuracy of small targets. In the smart car environment, the processing speed is 155.1 fps, which meets the requirements of the smart car mobile terminal environment for the performance of the target detection network.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of China under grant 61872425.

References

[1] Q. Xu, B. Wang, F. Zhang, D. S. Regani, F. Wang, and K. J. R. Liu, "Wireless AI in smart car: how smart a car can be?" IEEE Access, vol. 8, pp. 55091–55112, 2020.
[2] T. Okuyama, T. Gonsalves, and J. Upadhay, "Autonomous driving system based on deep Q learning," in Proceedings of the International Conference on Intelligent Autonomous Systems (ICoIAS), pp. 201–205, Singapore, March 2018.
[3] S. Ghosh, P. Amon, A. Hutter, and A. Kaup, "Reliable pedestrian detection using a deep neural network trained on pedestrian counts," in Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 685–689, Beijing, China, September 2017.

[4] W. Dong, Z. Yang, W. Ling, Z. Yonghui, L. Ting, and Q. Xiaoliang, "Research on vehicle detection algorithm based on convolutional neural network and combining color and depth images," in Proceedings of the International Conference on Information Systems and Computer Aided Education (ICISCAE), pp. 274–277, Dalian, China, September 2019.
[5] A. Chetouani and M. Pedersen, "On the use of a convolutional neural network to predict perceptual quality of images without reference for different viewing distances," in Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 1009–1013, Taipei, Taiwan, September 2019.
[6] S. F. Zhang, C. Chi, Y. Q. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9756–9765, Seattle, WA, USA, June 2020.
[7] D. F. Zhou, J. Fang, X. B. Song et al., "Joint 3D instance segmentation and object detection for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1836–1846, Seattle, WA, USA, June 2020.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587, Columbus, OH, USA, June 2014.
[9] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, December 2015.
[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, Las Vegas, NV, USA, June 2016.
[11] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, Honolulu, HI, USA, July 2017.
[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755, Zurich, Switzerland, September 2014.
[13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, Miami, FL, USA, June 2009.
[14] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[15] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: optimal speed and accuracy of object detection," 2020, https://arxiv.org/abs/2004.10934.
[16] S. Liu, L. Qi, H. F. Qin, J. P. Shi, and J. Y. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768, Salt Lake City, UT, USA, June 2018.
[17] https://github.com/AlexeyAB/darknet.
[18] P. Adarsh, P. Rathi, and M. Kumar, "YOLO v3-Tiny: object detection and recognition using one stage improved model," in Proceedings of the International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 687–694, Coimbatore, India, March 2020.
[19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944, Honolulu, HI, USA, July 2017.
[20] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: scalable and efficient object detection," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10778–10787, Seattle, WA, USA, June 2020.
[21] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: fully convolutional one-stage object detection," in Proceedings of the International Conference on Computer Vision (ICCV), pp. 9626–9635, Seoul, Korea (South), November 2019.
[22] N. Ma, X. Y. Zhang, H. T. Zheng, and J. Sun, "ShuffleNet V2: practical guidelines for efficient CNN architecture design," 2018, https://arxiv.org/abs/1807.11164.
[23] D. Misra, "Mish: a self regularized non-monotonic neural activation function," 2019, https://arxiv.org/abs/1908.08681.
[24] A. Howard, M. Sandler, G. Chu et al., "Searching for MobileNetV3," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1314–1324, Seoul, Korea (South), November 2019.
[25] Q. Zheng, Z. Li, Z. Zhang et al., "ThunderNet: towards real-time generic object detection on mobile devices," in Proceedings of the International Conference on Computer Vision (ICCV), pp. 6717–6726, Seoul, Korea (South), November 2019.
[26] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes challenge: a retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
