Research Article
A Real-Time Object Detector for Autonomous Vehicles
Based on YOLOv4
Rui Wang,1 Ziyue Wang,1 Zhengwei Xu,2 Chi Wang,1 Qiang Li,1 Yuxin Zhang,1 and Hua Li1
1Changchun University of Science and Technology, School of Computer Science and Technology, Changchun, Jilin 130022, China
2Chengdu University of Technology, Department of Geophysics, Chengdu, Sichuan 610059, China
Received 21 October 2021; Revised 25 November 2021; Accepted 26 November 2021; Published 10 December 2021
Copyright © 2021 Rui Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Object detection is an important part of autonomous driving technology. To ensure the safe running of vehicles at high speed, real-time and accurate detection of all objects on the road is required. How to balance the speed and accuracy of detection has been a hot research topic in recent years. This paper puts forward a one-stage object detection algorithm based on YOLOv4, which improves the detection accuracy and supports real-time operation. The backbone of the algorithm doubles the stacking times of the last residual block of CSPDarkNet53. The neck of the algorithm replaces the SPP with the RFB structure, improves the PAN structure of the feature fusion module, and adds the attention mechanisms CBAM and CA to the backbone and neck; finally, the overall width of the network is reduced to 3/4 of the original, so as to reduce the model parameters and improve the inference speed. Compared with YOLOv4, the algorithm in this paper improves the mean average precision on the KITTI dataset by 2.06% and on the BDD dataset by 2.95%. When the detection accuracy is almost unchanged, the inference speed of this algorithm is increased by 9.14%, and it can detect in real time at more than 58.47 FPS.
generates region proposals in the first stage and performs bbox regression and object classification prediction on these regions in the second stage, e.g., R-CNN [8], Fast R-CNN [9], Faster R-CNN [10], and R-FCN [11]. Two-stage algorithms usually have a high accuracy but a relatively slow detection speed. One-stage algorithms, such as SSD [12] and YOLO [13], perform classification and regression in just one stage. These methods generally have a lower accuracy but a high detection speed. In recent years, object detectors combining various optimization methods have been widely studied [14–18] in order to take advantage of both types of method. MS-CNN [14], a two-stage object detection algorithm, improves detection speed through a series of intermediate layers. RFBNet [18], a one-stage algorithm, proposes receptive field blocks that expand the receptive field to improve accuracy. However, previous studies [14–17] can no longer keep the detection speed above 30 fps, one of the prerequisites for autonomous driving, when the input resolution reaches 512 × 512 or higher. This indicates that the previous schemes are incomplete in terms of the trade-off between accuracy and speed and are therefore difficult to apply in the field of autonomous driving.

The problem with most object detection algorithms is that large objects are easily detected, while small objects are often ignored by the detector. It is extremely dangerous to miss pedestrians, traffic lights, and traffic signs in autonomous driving. In recent years, many feature fusion algorithms for small object detection have been proposed [19–22]. Kaiming He proposed SPPNet [19] in 2014 to extract features from regions of any aspect ratio, which provided an idea for detection algorithms such as YOLOv3 [23] and YOLOv4 [24]. FPN [20] is a multiscale feature fusion network structure; it combines high-level semantic features and low-level location features to effectively improve the detection accuracy of small targets. PANet [21] is an improved version of FPN, which adopts a top-down and bottom-up transmission mode to eliminate the problem of information loss from the bottom features to the high features. ASFF [22] is a novel feature fusion strategy, which reduces the conflict and inconsistency between different feature layers through adaptive spatial feature fusion and improves the effectiveness of the feature pyramid.

In addition, some researchers [25, 26] try to add P6 and P7 detection layers after P5, which has a 32-times downsampling rate, to improve the detection accuracy of small objects, but this brings a huge computational cost and speed loss. The YOLO series of algorithms [13, 23, 24, 27] is among the faster one-stage algorithms, especially YOLOv4. It improves on the low accuracy of YOLO [13], YOLOv2 [27], and YOLOv3 by combining the advantages of a large number of excellent models and adding a large number of training tricks. However, both YOLOv4 and the previous algorithms are trained and optimized for MS-COCO [28], which requires a large number of categories to be detected and whose context is highly variable. These models are therefore suboptimal when applied to the field of autonomous driving. Accordingly, this paper proposes a new method that improves the accuracy of the model by embedding the RFB module [18] into the backbone network, optimizing the PAN, and adding the attention modules CBAM [29] and CA [33], and that reduces the computation and improves the real-time performance by scaling the width of the network.

2. Related Work

YOLO [13] differs from the two-stage algorithms that use region proposals to obtain regions of interest. Instead, it detects objects by segmenting the image into grid cells. Its output layer information includes bbox coordinates, confidence, and classification scores. Therefore, it can detect multiple objects in a single stage, and its speed is much faster than that of two-stage algorithms. However, because it predicts coordinates directly rather than based on anchors, it is difficult for it to detect small objects. YOLOv2 [27] adds a BN layer after each convolution layer, applies anchor-based bbox prediction and multiscale training, and uses a passthrough layer to fuse fine-grained features, which improves the accuracy compared with YOLO. YOLOv3 [23] goes further: its backbone DarkNet53 applies residual connections to solve the problem of vanishing gradients in deep networks; FPN feature fusion retains the fine-grained features of small objects; and multiscale prediction lets the network detect objects of different sizes. It shows a more obvious improvement compared with YOLO and YOLOv2. The structure of YOLOv4 [24] is shown in Figure 1. On the basis of YOLOv3, it tries a large number of excellent methods and training tricks from recent years. The backbone CSPDarkNet53 is DarkNet53 integrated with the CSP structure [31]. The SPP module [19] after the backbone significantly increases the receptive field but hardly affects the inference speed. The repeated feature extraction of the PAN [21] structure alleviates the serious information loss that occurs when bottom information is transferred to the top in FPN. As with YOLOv3, prediction is carried out on three different scales to detect objects of different sizes. The inference speed of YOLOv4 is faster than that of YOLO and YOLOv2 because it only consists of 1 × 1 and 3 × 3 small convolution layers. The parameters of the backbone with the CSP structure are greatly reduced, and the information exchange between layers is greatly improved. Therefore, the inference speed and accuracy are better than those of YOLOv3, and it can also satisfy the high real-time requirement of an autonomous driving system. However, generally speaking, its accuracy is still lower than that of two-stage algorithms, and it is not optimized for the many small objects in autonomous driving scenes. To make up for this, we use YOLOv4, which has a lower complexity than two-stage algorithms, and improve its accuracy and speed through additional methods, so as to design a more efficient detector for autonomous driving.

Figure 1: YOLOv4 structure.
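As a concrete illustration of this output layout (a sketch under assumed sizes, not the authors' code), the snippet below splits a YOLO-style prediction tensor with na × (4 + 1 + nc) channels per grid cell into box, objectness, and class terms; the anchor count, class count, and grid size used here are assumptions.

```python
# Illustrative sketch of a YOLO-style head output layout; na, nc, and the grid
# size are assumed values, not the configuration used in this paper.
import torch

na, nc = 3, 3                      # anchors per scale, number of classes (assumed)
grid = 13                          # grid size of one detection scale (assumed)
pred = torch.randn(1, na * (4 + 1 + nc), grid, grid)   # raw head output

# Reshape so the last dimension holds [tx, ty, tw, th, objectness, class scores]
pred = pred.view(1, na, 4 + 1 + nc, grid, grid).permute(0, 1, 3, 4, 2)
box = pred[..., :4]                  # bbox regression terms
obj = pred[..., 4:5].sigmoid()       # confidence
cls = pred[..., 5:].sigmoid()        # classification scores
print(box.shape, obj.shape, cls.shape)   # torch.Size([1, 3, 13, 13, 4]) ...
```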
Since SENet [32] shone in the last ImageNet classification competition in 2017, plug-and-play attention modules, which can be directly applied to existing neural networks because of their flexibility, have become popular in computer vision tasks. CBAM [29] takes into account the location information ignored by the SE module: it reduces the number of channels and uses a large-kernel convolution to exploit location information, which gives it better interpretability than the SE module. CA [33] is a newly proposed attention module. In order to alleviate the loss of location information caused by 2D global pooling, it decomposes channel attention into two parallel 1D feature encoding processes, so that location information is effectively embedded into channel attention.

Traditional object detection algorithms usually use mean square error (MSE, L2) or smooth L1 [9] loss to regress the center point coordinates and the width and height of the bbox directly, i.e., {x_center, y_center, w, h}, or the upper left and lower right corners, i.e., {x_top left, y_top left, x_bottom right, y_bottom right}. Anchor-based object detection algorithms instead regress the offsets, that is, {x_offset, y_offset, w_offset, h_offset}. But regressing the bbox directly treats the four bbox values as independent variables, without considering the correlation between them, and during training it is biased toward large objects, because the loss of small objects is inherently small. Therefore, to better deal with this problem, IoU loss [34] was proposed to treat the bbox as a whole in regression and take the ground truth (GT) into account. IoU has scale invariance, so it solves the problem that the loss grows with scale in direct regression. With continuous improvement by researchers, GIoU loss [30] was then proposed. In addition to IoU, GIoU loss also considers the shape and direction of the object, solving the problem that IoU loss cannot reflect the degree of overlap and returns no gradient when the IoU is zero. DIoU loss [35] replaces the GIoU penalty term of maximizing the overlap area with the minimum circumscribed rectangle by minimizing the Euclidean distance between the bbox and GT center points, so as to accelerate convergence. CIoU loss [35] further considers the aspect ratio on the basis of DIoU. This year, some researchers put forward EIoU loss [36], arguing that the relative aspect ratio in CIoU loss cannot reflect the real differences in width and height and their confidence, so the width loss and height loss are calculated separately and then added up.

The autonomous driving scene is different from daily life scenes in that it does not need to pay attention to unimportant classes. Therefore, most of the advanced models optimized for MS-COCO [28] are suboptimal for it. KITTI [37] is a common dataset in autonomous driving scenes. It is collected in urban areas, rural areas, and on expressways. Each image has up to 15 cars and more than 30 pedestrians, with various degrees of occlusion and truncation. BDD100k [38] is a large and diverse public driving dataset released by the Berkeley AI Research (BAIR) lab in recent years, covering different weather conditions, day and night, as well as different lighting conditions and occlusion. This paper proposes two algorithms based on YOLOv4. The first algorithm improves the accuracy by adding the CSP [31] structure into feature fusion, inserting attention mechanisms, and using the EIoU regression loss function to accelerate model convergence. The second algorithm improves the detection accuracy of dense small objects by inserting the RFB [18] module. Finally, the width is reduced to 3/4 of the original to improve the inference speed, as shown in Figure 2.
Figure 2: Structure of the proposed network (backbone with CBAM, RFB module, CSP-based neck, and predictor).
3. Proposed Work

According to YOLOv4 [24], an anchor-based one-stage detection algorithm is generally composed of a backbone, a neck, and a predictor head. The first model proposed in this paper inserts the attention mechanism into the bottleneck of the residual structure and adds the CSP structure into the neck.

3.1. Backbone. CSPDarkNet53 of YOLOv4 is an excellent backbone, which can handle the task of feature extraction in most detection scenes. The first model proposed in this paper continues to use CSPDarkNet53 and only adds the CA attention module into the bottleneck (see Figure 3). The effectiveness of attention mechanisms has been fully verified in many detection models; they can greatly increase the feature extraction ability while adding only a small number of parameters. In order to further enhance the feature extraction ability of the backbone in complex traffic scenes, the second model doubles the number of repetitions of the last residual stage (i.e., increases it to 8). In the experiments, it was found to be better to change the attention mechanism to CBAM and to move the insertion position outside the residual structure and inside the CSP structure, as shown in Figure 4(b).

The CBAM [29] and CA [33] modules are shown in Figure 5. Both CBAM and CA are attention mechanisms that mix channel and spatial attention. Compared with the channel-only attention mechanism SE [32], they make the neural network pay more attention to the object areas containing important information, suppress irrelevant information, and improve the overall accuracy of object detection. Figure 3 shows the insertion position of the CA attention mechanism in model 1.

3.2. Neck. For a CNN, the later layers are richer in semantic information. YOLOv4 uses SPP [19] after the backbone to increase the receptive field of the network. Compared with the pure pooling of SPP, RFB [18] draws lessons from Inception in its structure, adopts horizontally connected and fused network layers, and increases the receptive field while reducing the amount of calculation through dilated convolution, which makes it more robust. As shown in Figure 6, the RFB block is composed of 3 × 3 convolutions and three dilated convolution layers.
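To make the structure just described concrete, the following is a minimal sketch of an RFB-style block in the spirit of [18], with three parallel branches ending in dilated convolutions; the dilation rates (1, 3, 5), channel widths, and the use of a shortcut are illustrative assumptions rather than the exact settings used here.

```python
# Minimal sketch of an RFB-style block (idea from [18]); branch widths and
# dilation rates are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

class RFBLite(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        # Each branch: 1x1 reduction, a small conv, then a 3x3 dilated conv
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c, 1), nn.Conv2d(c, c, 3, padding=1, dilation=1))
        self.branch2 = nn.Sequential(
            nn.Conv2d(c_in, c, 1), nn.Conv2d(c, c, 3, padding=1),
            nn.Conv2d(c, c, 3, padding=3, dilation=3))
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, c, 1), nn.Conv2d(c, c, 5, padding=2),
            nn.Conv2d(c, c, 3, padding=5, dilation=5))
        self.fuse = nn.Conv2d(3 * c, c_out, 1)       # concatenate branches, then fuse
        self.shortcut = nn.Conv2d(c_in, c_out, 1)    # residual shortcut

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y) + self.shortcut(x)

out = RFBLite(1024, 512)(torch.randn(1, 1024, 16, 16))
print(out.shape)   # torch.Size([1, 512, 16, 16])
```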
Figure 4: CSPN block structures (a) and (b); variant (b) places CBAM inside the CSP structure, after the stacked bottlenecks.
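A hedged sketch of the arrangement in Figure 4(b) follows: an attention module placed after the stacked bottlenecks inside one path of a CSP block. The simplified CBAM, channel split, and activation choices below are assumptions for illustration only, not the paper's implementation.

```python
# Sketch of a CSP block with attention inserted after the residual stack
# (cf. Figure 4(b)); SimpleCBAM is a stand-in channel+spatial attention.
import torch
import torch.nn as nn

class SimpleCBAM(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(), nn.Conv2d(c // r, c, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca                                     # channel attention
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                  # spatial attention

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(c, c, 1), nn.SiLU(),
                                   nn.Conv2d(c, c, 3, padding=1), nn.SiLU())

    def forward(self, x):
        return x + self.block(x)

class CSPWithAttention(nn.Module):
    def __init__(self, c, n=2):
        super().__init__()
        self.split1 = nn.Conv2d(c, c // 2, 1)
        self.split2 = nn.Conv2d(c, c // 2, 1)
        self.blocks = nn.Sequential(*[Bottleneck(c // 2) for _ in range(n)],
                                    SimpleCBAM(c // 2))   # attention after the residual stack
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.blocks(self.split1(x)), self.split2(x)], dim=1))

print(CSPWithAttention(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```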
PAN [21] is a feature enhancement structure for feature fusion. It adopts a top-down and bottom-up transmission mode to eliminate the loss of feature information from the bottom features to the high features. However, the layers inside PAN are connected with ordinary convolutions. The CSP [31] structure has already shown its advantages in the backbone: it strengthens the information exchange between channels and reduces the amount of calculation. Therefore, adding the CSP structure to the layers between PAN is more refined and has fewer parameters than the CSP structure in CSPDarkNet53 (see Figure 4).

3.3. Predictor Head. In object detection, the conflict between the classification and regression tasks is a well-known problem, so a prediction head for classification and regression is widely used in most detectors. YOLOv4 follows the predictor head of YOLOv3, which consists of one 3 × 3 and one 1 × 1 convolution layer. The final predicted output channel is na × (4 + 1 + nc), where na is the number of anchors in each detection layer and nc is the number of classes. The proposed work follows this structure.

3.4. Loss Function. For an object detection model, the loss function is generally the sum of the confidence loss, the classification loss, and the bbox regression loss. Binary cross entropy (BCE) is used for the confidence loss and the classification loss, and EIoU loss is used for the bbox regression loss. The confidence and classification losses are

\[
L_{obj} = -\frac{1}{N}\sum_{i}\Big[ O_{i}\ln \hat{C}_{i} + (1-O_{i})\ln\big(1-\hat{C}_{i}\big) \Big], \tag{2}
\]

\[
L_{cls} = -\frac{1}{N_{pos}}\sum_{i\in pos}\sum_{j\in cls}\Big[ O_{ij}\ln \hat{C}_{ij} + (1-O_{ij})\ln\big(1-\hat{C}_{ij}\big) \Big]. \tag{3}
\]

In formula (1), λ1, λ2, and λ3 are the coefficients of each loss, which are hyperparameters. In formula (2), O_i ∈ [0, 1] represents the IoU of the predicted bounding box and the ground truth, Ĉ_i = sigmoid(C_i), where C_i is the predicted value, and N is the number of positive and negative samples. In formula (3), O_ij ∈ {0, 1} indicates whether the jth class is present in the ith predicted bounding box, Ĉ_ij = sigmoid(C_ij), where C_ij is the predicted value, and N_pos is the number of positive samples. In formula (4), ρ²(b, b^gt) denotes the Euclidean distance between the center points of the bbox and the GT, C is the diagonal of the smallest circumscribed rectangle of the two boxes, and C_w, C_h are the width and height of the minimum circumscribed rectangle.
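Formulas (1) and (4) do not survive in legible form here; based on the surrounding description and the EIoU definition in [36], they presumably take the following form (a reconstruction, not necessarily the authors' exact notation):

```latex
% Presumed forms of formulas (1) and (4), reconstructed from the surrounding
% text and the EIoU definition in [36]; an assumption, not the authors' notation.
\begin{align}
  L &= \lambda_{1} L_{obj} + \lambda_{2} L_{cls} + \lambda_{3} L_{EIoU}, \tag{1}\\
  L_{EIoU} &= 1 - IoU
      + \frac{\rho^{2}\!\left(b, b^{gt}\right)}{C^{2}}
      + \frac{\rho^{2}\!\left(w, w^{gt}\right)}{C_{w}^{2}}
      + \frac{\rho^{2}\!\left(h, h^{gt}\right)}{C_{h}^{2}}. \tag{4}
\end{align}
```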
Figure 5: Attention modules: (a) CBAM; (b) CA.
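For concreteness, the following is a minimal sketch of the coordinate attention idea shown in Figure 5(b), following [33]: 2D global pooling is replaced by two parallel 1D poolings along height and width, and the resulting encodings re-weight the feature map. The reduction ratio and layer choices are assumptions rather than the configuration used in this paper.

```python
# Minimal sketch of coordinate attention (idea from [33]); reduction ratio and
# layer choices are assumptions, not this paper's configuration.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        m = max(8, c // r)
        self.conv1 = nn.Sequential(nn.Conv2d(c, m, 1), nn.BatchNorm2d(m), nn.ReLU())
        self.conv_h = nn.Conv2d(m, c, 1)
        self.conv_w = nn.Conv2d(m, c, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # pool along W -> (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # pool along H -> (n, c, w, 1)
        y = self.conv1(torch.cat([x_h, x_w], dim=2))           # shared encoding of both directions
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (n, c, 1, w)
        return x * a_h * a_w                                   # re-weight along both directions

print(CoordAttention(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```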
Figure 6: RFB block structure.
3.5. The Performance of Different Models. The parameter counts and computation of the different network models are shown in Table 1. All models are tested at 512 × 512 resolution with FP16 precision.

It can be seen that proposed work (1) has 11.61M fewer parameters than YOLOv4 and 6.35M fewer than YOLOv3. The parameters of proposed work (2) are reduced by 41.3% and 36.1%, respectively, compared with YOLOv4 and YOLOv3. In addition, from the perspective of FLOPs, the proposed work greatly reduces the complexity. At the same time, in terms of model size, proposed work (2) only occupies 72.1 MB, which is 40.9% less than YOLOv4; this largely stems from the CSP structure introduced in the neck and the 3/4 reduction in overall width. It is therefore suitable for deployment in autonomous driving.

4. Experiment

4.1. Dataset. In the experiments, we used KITTI [37] and BDD100k [38], which are commonly used in autonomous driving research. The KITTI dataset consists of 7481 training images and 7518 test images, covering three classes: Car, Cyclist, and Pedestrian. Since the test set has no labels, the training set and the validation set are obtained by randomly dividing the original training set into two halves [39, 40]. The BDD100k dataset is composed of 70,000 training images, 10,000 validation images, and 20,000 test images, covering ten classes: person, rider, car, bus, truck, bike, motor, traffic light, traffic sign, and train. The ratio of the training set to the validation set is 7:1. There are about 1.46 million object instances in the training and validation sets, of which about 0.8 million are car instances, while only 151 are train instances. This kind of unbalanced distribution among categories leads to a decline in the network's feature extraction ability, so train, rider, and motor are ignored in the final evaluation. The final BDD dataset includes seven classes: person, car, bus, truck, bike, traffic light, and traffic sign. Since we only study the differences between models, 1/5 of the training set and validation set is randomly sampled as the final dataset. The experiments were carried out on Ubuntu 18.04 with an NVIDIA Quadro M4000, CUDA 10.1, and cuDNN v7.6.5. Inference speed depends on the hardware; the inference FPS reported in this paper is measured on an NVIDIA RTX 2080Ti.
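As a sketch of the KITTI split described above (a random half/half division of the 7481 labeled images into training and validation lists), the snippet below builds such lists; the directory layout and file names are assumptions, not the authors' exact setup.

```python
# Sketch of randomly splitting the labeled KITTI training images into two halves;
# the directory layout and file names below are assumptions.
import random
from pathlib import Path

random.seed(0)
image_dir = Path("kitti/training/image_2")          # assumed KITTI layout
ids = sorted(p.stem for p in image_dir.glob("*.png"))
random.shuffle(ids)

half = len(ids) // 2
train_ids, val_ids = ids[:half], ids[half:]

Path("train.txt").write_text("\n".join(train_ids))
Path("val.txt").write_text("\n".join(val_ids))
print(len(train_ids), "train /", len(val_ids), "val images")
```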
Figure 7: Precision-recall curves comparing YOLOv4, ours (1), and ours (2); the panels include the traffic light AP50 and [email protected] curves.
Figure 8: (a) YOLOv4 inference results. (b) Proposed work (1) inference. (c) Proposed work (2) inference.
fourth row, model 1 can supplement the traffic sign detection that YOLOv4 gets wrong. In rows 5 and 6, model 1 and model 2 can find more small objects than YOLOv4. The weather in the first row and the last row is better, and the detection boxes of the improved algorithm are more accurate.

Based on these results, model 1 and model 2 can significantly improve the detection accuracy, so as to improve driving stability and efficiency, prevent fatal accidents, meet the needs of the real-time object detection task in autonomous driving, and have practical application value.

5. Conclusions

Real-time object detection technology is of great significance in the field of autonomous driving. Aimed at the problem of insufficient accuracy of one-stage detectors in autonomous driving scenes, this paper, based on YOLOv4, replaces SPP with the RFB structure after the backbone, integrates the less computation-intensive CSP structure into the neck, and adds the CBAM and CA attention mechanisms to make the neural network pay more attention to the object areas containing important information, suppress irrelevant information, and improve detection accuracy. The experimental results show that the improved model 1 has higher accuracy than the original YOLOv4 in the object detection task: the mAP is improved by 2.06% on the KITTI validation set and by 2.95% on the BDD validation set. The mAP50 of model 2 is increased by 1.73%, and the inference speed is increased by 4.83 fps, which verifies the effectiveness of the improved algorithm. It provides a theoretical reference for further practical application. In follow-up work, some researchers [7, 41, 42] are concerned with how to improve detection accuracy at night and under bad weather conditions, and further improving the detection accuracy under such conditions will also be our next research direction.

Data Availability

All data included in this study can be downloaded from the official websites of KITTI and BDD100k or obtained by contacting the corresponding authors.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was financially supported by the Natural Science Foundation of Jilin Province (no. 20200201053JC).

References

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.
[2] F. Liu, B. Liu, C. Sun, M. Liu, and X. Wang, "Deep learning approaches for link prediction in social network services," in Proceedings of the International Conference on Neural Information Processing, pp. 425–432, Springer, Daegu, South Korea, November 2013.
[3] X. Dai, "Hybridnet: a fast vehicle detection system for autonomous driving," Signal Processing: Image Communication, vol. 70, pp. 79–88, 2019.
[4] M. Bassani, L. Rossetti, and L. Catani, "Spatial analysis of road crashes involving vulnerable road users in support of road safety management strategies," Transportation Research Procedia, vol. 45, pp. 394–401, 2020.
[5] C. Zhang, Y. Liu, D. Zhao, and Y. Su, "Roadview: a traffic scene simulator for autonomous vehicle simulation testing," in Proceedings of the 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 1160–1165, IEEE, Qingdao, China, October 2014.
[6] G. S. R. Satyanarayana, S. Majhi, and S. K. Das, "A vehicle detection technique using binary images for heterogeneous and lane-less traffic," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021.
[7] Z. Liu, Y. Cai, H. Wang et al., "Robust target recognition and tracking of self-driving cars with radar and camera information fusion under severe weather conditions," IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–14, 2021.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.
[9] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of Advances in Neural Information Processing Systems, pp. 91–99, Montreal, Quebec, Canada, December 2015.
[11] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of Advances in Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.
[12] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, Netherlands, October 2016.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.
[14] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in Proceedings of the European Conference on Computer Vision, pp. 354–370, Springer, Amsterdam, Netherlands, October 2016.
[15] X. Hu, X. Xu, Y. Xiao et al., "SINet: a scale-insensitive convolutional neural network for fast vehicle detection," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 3, pp. 1010–1019, 2019.
[16] Q. Zhao, Y. Wang, T. Sheng, and Z. Tang, "Comprehensive feature enhancement module for single-shot object detector," in Proceedings of the Asian Conference on Computer Vision, Springer, Perth, Australia, December 2018.
[17] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212, Salt Lake City, UT, USA, June 2018.
[18] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 385–400, Munich, Germany, September 2018.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125, Honolulu, HI, USA, July 2017.
[21] S. Liu, Q. Lu, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768, Salt Lake City, UT, USA, June 2018.
[22] S. Liu, D. Huang, and Y. Wang, "Learning spatial fusion for single-shot object detection," 2019, https://arxiv.org/abs/1911.09516.
[23] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[24] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: optimal speed and accuracy of object detection," 2020, https://arxiv.org/abs/2004.10934.
[25] Y. Cai, T. Luan, H. Gao et al., "YOLOv4-5D: an effective and efficient object detector for autonomous driving," IEEE Transactions on Instrumentation and Measurement, vol. 70, 2021.
[26] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: scalable and efficient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 2020.
[27] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7263–7271, Honolulu, HI, USA, July 2017.
[28] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755, Zurich, Switzerland, September 2014.
[29] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, Munich, Germany, September 2018.
[30] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: a metric and a loss for bounding box regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 658–666, Long Beach, CA, USA, June 2019.
[31] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: a new backbone that can enhance learning capability of CNN," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Seattle, WA, USA, June 2020.
[32] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141, Salt Lake City, UT, USA, June 2018.
[33] Q. Hou, D. Zhou, and J. Feng, "Coordinate attention for efficient mobile network design," 2021, https://arxiv.org/abs/2103.02907.
[34] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, "UnitBox: an advanced object detection network," in Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520, Amsterdam, Netherlands, October 2016.
[35] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: faster and better learning for bounding box regression," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, February 2020.
[36] Y.-F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang, and T. Tan, "Focal and efficient IoU loss for accurate bounding box regression," 2021, https://arxiv.org/abs/2101.08158.
[37] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, IEEE, Providence, Rhode Island, June 2012.
[38] F. Yu, W. Xian, Y. Chen et al., "BDD100K: a diverse driving video database with scalable annotation tooling," 2018, https://arxiv.org/abs/1805.04687.
[39] J. Choi, D. Chun, H. Kim, and H.-J. Lee, "Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 502–511, Seoul, South Korea, October 2019.
[40] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, "SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 129–137, Honolulu, HI, USA, July 2017.
[41] A. Bell, T. Mantecón, C. Díaz, C. R. del-Blanco, F. Jaureguizar, and N. García, "A novel system for nighttime vehicle detection based on foveal classifiers with real-time performance," IEEE Transactions on Intelligent Transportation Systems, 2021.
[42] M. Hnewa and H. Radha, "Object detection under rainy conditions for autonomous vehicles: a review of state-of-the-art and emerging techniques," IEEE Signal Processing Magazine, vol. 38, no. 1, pp. 53–67, 2020.