
Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving

Jiwoong Choi^1, Dayoung Chun^1, Hyun Kim^2, and Hyuk-Jae Lee^1
^1 Seoul National University, ^2 Seoul National University of Science and Technology
{jwchoi, jjeonda}@capp.snu.ac.kr, [email protected], [email protected]

Abstract

The use of object detection algorithms is becoming increasingly important in autonomous vehicles, and object detection at high accuracy and a fast inference speed is essential for safe autonomous driving. A false positive (FP) from a false localization during autonomous driving can lead to fatal accidents and hinder safe and efficient driving. Therefore, a detection algorithm that can cope with mislocalizations is required in autonomous driving applications. This paper proposes a method for improving the detection accuracy while supporting a real-time operation by modeling the bounding box (bbox) of YOLOv3, the most representative of the one-stage detectors, with a Gaussian parameter and redesigning the loss function. In addition, this paper proposes a method for predicting the localization uncertainty that indicates the reliability of the bbox. By using the predicted localization uncertainty during the detection process, the proposed schemes can significantly reduce the FP and increase the true positive (TP), thereby improving the accuracy. Compared to a conventional YOLOv3, the proposed algorithm, Gaussian YOLOv3, improves the mean average precision (mAP) by 3.09 and 3.5 on the KITTI and Berkeley Deep Drive (BDD) datasets, respectively. Nevertheless, the proposed algorithm is capable of real-time detection at faster than 42 frames per second (fps) and shows a higher accuracy than previous approaches with a similar fps. Therefore, the proposed algorithm is the most suitable for autonomous driving applications.

1. Introduction

In recent years, deep learning has been actively applied in various fields including computer vision [9], autonomous driving [5], and social network services [15]. The development of sensors and GPUs along with deep learning algorithms has accelerated research into autonomous vehicles based on artificial intelligence. An autonomous vehicle with self-driving capability and without driver intervention must accurately detect cars, pedestrians, traffic signs, traffic lights, etc. in real time to ensure safe and correct control decisions [25]. To detect such objects, various sensors such as cameras, light detection and ranging (Lidar), and radio detection and ranging (Radar) are generally used in autonomous vehicles [27]. Among these various types of sensors, a camera sensor can accurately identify the object type based on texture and color features and is more cost-effective [24] than the other sensors. In particular, deep-learning based object detection using camera sensors is becoming more important in autonomous vehicles because it achieves a better level of accuracy than humans in terms of object detection, and it has consequently become an essential method [11] in autonomous driving systems.

An object detection algorithm for autonomous vehicles should satisfy the following two conditions. First, a high detection accuracy for road objects is required. Second, a real-time detection speed is essential for a rapid response of the vehicle controller and a reduced latency. Deep-learning based object detection algorithms, which are indispensable in autonomous vehicles, can be classified into two categories: two-stage and one-stage detectors. Two-stage detectors, e.g., Fast R-CNN [8], Faster R-CNN [22], and R-FCN [4], conduct a first stage of region proposal generation, followed by a second stage of object classification and bbox regression. These methods generally show a high accuracy but have the disadvantage of a slow detection speed and lower efficiency. One-stage detectors, e.g., SSD [17] and YOLO [19], conduct object classification and bbox regression concurrently without a region proposal stage. These methods generally have a fast detection speed and high efficiency but a low accuracy. In recent years, to take advantage of both types of method and to compensate for their respective disadvantages, object detectors combining various schemes have been widely studied [1, 11, 29, 28, 16]. MS-CNN [1], a two-stage detector, improves the detection speed by conducting detection on various intermediate network layers. SINet [11], also a two-stage detector, enables fast detection using a scale-insensitive network.
CFENet [29], a one-stage detector, uses a comprehensive feature enhancement module based on SSD to improve the detection accuracy. RefineDet [28], also a one-stage detector, improves the detection accuracy by applying an anchor refinement module and an object detection module. Another one-stage detector, RFBNet [16], applies a receptive field block to improve the accuracy. However, at an input resolution of 512 × 512 or higher, which is widely applied in object detection algorithms for achieving a high accuracy, previous studies [1, 11, 29, 28] have been unable to meet a real-time detection speed of above 30 fps, which is a prerequisite for self-driving applications. Even though real-time detection is possible in [16], it is difficult to apply to autonomous driving due to its low accuracy. This indicates that these previous schemes are incomplete in terms of the trade-off between accuracy and detection speed and consequently have limited applicability to self-driving systems.

In addition, one of the most critical problems of most conventional deep-learning based object detection algorithms is that, whereas the bbox coordinates (i.e., localization) of the detected object are known, the uncertainty of the bbox result is not. Thus, conventional object detectors cannot prevent mislocalizations (i.e., FPs) because they output deterministic bbox results without information regarding their uncertainty. In autonomous driving, an FP denotes an incorrect bbox detection on something that is not the ground truth (GT), or an inaccurate bbox detection on the GT, whereas a TP denotes an accurate bbox detection on the GT. An FP is extremely dangerous during autonomous driving because it causes excessive reactions such as unexpected braking, which can reduce the stability and efficiency of driving and lead to a fatal accident [18, 23], as well as confusion in determining an accurate object detection. In other words, it is extremely important to predict the uncertainty of the detected bboxes and to consider this factor along with the objectness score and class scores in order to reduce the FP and prevent autonomous driving accidents. For this reason, various studies have been conducted on predicting uncertainty in deep learning. Kendall et al. [12] proposed a modeling method for uncertainty prediction using a Bayesian neural network. Feng et al. [6] proposed a method for predicting uncertainty by applying Kendall et al.'s scheme [12] to 3D vehicle detection using a Lidar sensor. However, the methods proposed by Kendall et al. [12] and Feng et al. [6] only predict the level of uncertainty and do not utilize this factor in actual applications. Choi et al. [2] proposed a method for predicting uncertainty in real time using a Gaussian mixture model and applied it to an autonomous driving application. However, it was applied to the steering angle rather than object detection, and a complicated distribution is modeled, which increases the computational complexity. He et al. [10] proposed an approach for predicting uncertainty and utilized it for object detection. However, because they focused on a two-stage detector, their method cannot support a real-time operation, and a bbox overlap problem remains, so it is unsuitable for self-driving applications.

To overcome the problems of previous object detection studies, this paper proposes a novel object detection algorithm suitable for autonomous driving based on YOLOv3 [21]. YOLOv3 can detect multiple objects with a single inference, and its detection speed is therefore extremely fast; in addition, by applying a multi-stage detection method, it can complement the low accuracy of YOLO [19] and YOLOv2 [20]. Based on these advantages, YOLOv3 is suitable for autonomous driving applications, but it generally achieves a lower accuracy than a two-stage method. It is therefore essential to improve the accuracy while maintaining a real-time object detection capability. To achieve this goal, the present paper proposes a method for improving the detection accuracy by modeling the bbox coordinates of YOLOv3, which only outputs deterministic values, as Gaussian parameters (i.e., the mean and variance), and by redesigning the loss function of the bbox. Through this Gaussian modeling, a localization uncertainty for the bbox regression task in YOLOv3 can be estimated. Furthermore, to further improve the detection accuracy, a method for reducing the FP and increasing the TP by utilizing the predicted localization uncertainty of the bbox during the detection process is proposed. This study is therefore the first attempt to model the localization uncertainty in YOLOv3 and to utilize this factor in a practical manner. As a result, the proposed Gaussian YOLOv3 can cope with mislocalizations in autonomous driving applications. In addition, because the proposed method is modeled only in the bbox of the YOLOv3 detection layer (i.e., the output layer), the additional computation cost is negligible, and the proposed algorithm consequently maintains a real-time detection speed of over 42 fps with an input resolution of 512 × 512 despite the significant improvements in performance. Compared to the baseline algorithm (i.e., YOLOv3), the proposed Gaussian YOLOv3 improves the mAP by 3.09 and 3.5 on the KITTI [7] and BDD [26] datasets, respectively. In addition, the proposed algorithm reduces the FP by 41.40% and 40.62% and increases the TP by 7.26% and 4.3% on the KITTI and BDD datasets, respectively. As a result, in terms of the trade-off between accuracy and detection speed, the proposed algorithm is suitable for autonomous driving because it significantly improves the detection accuracy and addresses the mislocalization problem while supporting a real-time operation.

2. Background

Instead of the region proposal method used in two-stage detectors, YOLO [19] detects objects by dividing an image into grid units.
[Figure 1: (a) Network architecture of YOLOv3 and (b) attributes of its prediction feature map. Each prediction box consists of the bbox coordinates (tx, ty, tw, th), the objectness score Pobj, and the class scores P0, P1, ..., Pn.]

The feature map of the YOLO output layer is designed to output the bbox coordinates, the objectness score, and the class scores, and thus YOLO enables the detection of multiple objects with a single inference. Therefore, the detection speed is much faster than that of conventional methods. However, owing to the processing of the grid unit, localization errors are large and the detection accuracy is low, making it unsuitable for autonomous driving applications. To address these problems, YOLOv2 [20] has been proposed. YOLOv2 improves the detection accuracy compared to YOLO by using batch normalization for the convolution layers and by applying an anchor box, multi-scale training, and fine-grained features. However, the detection accuracy is still low for small or dense objects. Therefore, YOLOv2 is unsuitable for autonomous driving applications, where a high accuracy is required for dense road objects and small objects such as traffic signs and lights.

To overcome the disadvantages of YOLOv2, YOLOv3 [21] has been proposed. YOLOv3 consists of convolution layers, as shown in Figure 1a, and is constructed as a deep network for improved accuracy. YOLOv3 applies residual skip connections to solve the vanishing gradient problem of deep networks and uses an up-sampling and concatenation method that preserves fine-grained features for small object detection. Its most prominent feature is detection at three different scales, in a similar manner to a feature pyramid network [13], which allows YOLOv3 to detect objects of various sizes. In more detail, when an image with the three channels R, G, and B is input into the YOLOv3 network, as shown in Figure 1a, information on the object detection (i.e., bbox coordinates, objectness score, and class scores) is output from three detection layers. The predicted results of the three detection layers are combined and processed using non-maximum suppression, after which the final detection results are determined. Because YOLOv3 is a fully convolutional network consisting only of small-sized convolution filters of 1 × 1 and 3 × 3, like YOLOv2 [20], its detection speed is as fast as that of YOLO [19] and YOLOv2 [20]. Therefore, in terms of the trade-off between accuracy and speed, YOLOv3 is suitable for autonomous driving applications and is widely used in autonomous driving research [3]. However, in general, it still has a lower accuracy than a two-stage detector using a region proposal stage. To compensate for this drawback, taking advantage of the smaller complexity of YOLOv3 compared with a two-stage detector, a more efficient detector for autonomous driving applications can be designed by applying an additional method for improving accuracy to YOLOv3 [21]. The Gaussian modeling and loss function reconstruction of YOLOv3 proposed in this paper can improve the accuracy by reducing the influence of noisy data during training and can predict the localization uncertainty. In addition, the detection accuracy can be further enhanced by using this predicted localization uncertainty. A detailed description of the above aspects is provided in Section 3.
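To make the post-processing step described above concrete, the following is a minimal sketch of how predictions from the three detection layers are typically concatenated and filtered with greedy non-maximum suppression. It is an illustration under assumptions of my own (box format (x1, y1, x2, y2), an IOU threshold of 0.45, and the helper names), not the implementation used in the paper.

import numpy as np

def iou(box, boxes):
    # IOU between one box and an array of boxes, all in (x1, y1, x2, y2) format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_th=0.45):
    # Greedy non-maximum suppression; returns indices of the kept boxes.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_th]
    return keep

def merge_scales(per_scale_boxes, per_scale_scores):
    # Predictions from the three detection layers are simply concatenated before NMS.
    boxes = np.concatenate(per_scale_boxes, axis=0)
    scores = np.concatenate(per_scale_scores, axis=0)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]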
3. Gaussian YOLOv3

3.1. Gaussian modeling

As shown in Figure 1b, the prediction feature map of YOLOv3 [21] has three prediction boxes per grid cell, where each prediction box consists of the bbox coordinates (i.e., tx, ty, tw, and th), the objectness score, and the class scores. YOLOv3 outputs the objectness (i.e., whether an object is present in the bbox or not) and the class (i.e., the category of the object) as scores between zero and one, and an object is then detected based on the product of these two values. Unlike the objectness and class information, the bbox coordinates are output as deterministic coordinate values instead of scores, and thus the confidence of the detected bbox is unknown. Moreover, the objectness score does not reflect the reliability of the bbox well, so how uncertain the bbox result is remains unknown. In contrast, the uncertainty of the bbox, which is predicted by the proposed method, serves as a bbox score and can thus be used as an indicator of how uncertain the bbox is. The results for this are described in Section 4.1.

In YOLOv3, the bbox regression extracts the bbox center information (i.e., tx and ty) and the bbox size information (i.e., tw and th). Because there is only one correct answer (i.e., the GT) for the bbox of an object, complex modeling is not required for predicting the localization uncertainty. In other words, the uncertainty of the bbox can be modeled using a single Gaussian model for each of tx, ty, tw, and th. A single Gaussian model of output y for a given test input x, whose output consists of Gaussian parameters, is as follows:

p(y|x) = N(y; \mu(x), \Sigma(x)),   (1)

where \mu(x) and \Sigma(x) are the mean and variance functions, respectively.

To predict the uncertainty of the bbox, each of the bbox coordinates in the prediction feature map is modeled as a mean (µ) and a variance (Σ), as shown in Figure 2. The outputs for the bbox are µ̂_tx, Σ̂_tx, µ̂_ty, Σ̂_ty, µ̂_tw, Σ̂_tw, µ̂_th, and Σ̂_th.

[Figure 2: Components in the prediction box of the proposed algorithm: the Gaussian parameters of the bbox coordinates (µ_tx, Σ_tx, µ_ty, Σ_ty, µ_tw, Σ_tw, µ_th, Σ_th), the objectness score, and the class scores.]

Considering the structure of the detection layer in YOLOv3, the Gaussian parameters for tx, ty, tw, and th are preprocessed as follows:

\mu_{t_x} = \sigma(\hat{\mu}_{t_x}), \quad \mu_{t_y} = \sigma(\hat{\mu}_{t_y}), \quad \mu_{t_w} = \hat{\mu}_{t_w}, \quad \mu_{t_h} = \hat{\mu}_{t_h}   (2)

\Sigma_{t_x} = \sigma(\hat{\Sigma}_{t_x}), \quad \Sigma_{t_y} = \sigma(\hat{\Sigma}_{t_y}), \quad \Sigma_{t_w} = \sigma(\hat{\Sigma}_{t_w}), \quad \Sigma_{t_h} = \sigma(\hat{\Sigma}_{t_h})   (3)

\sigma(x) = \frac{1}{1 + \exp(-x)}.   (4)

The mean value of each coordinate in the detection layer is the predicted coordinate of the bbox, and each variance represents the uncertainty of that coordinate. µ_tx and µ_ty in (2) must represent the center coordinates of the bbox inside the grid cell, and they are thus processed into values between zero and one with the sigmoid function in (4). The variances of each coordinate in (3) are also processed into values between zero and one with the sigmoid function. In YOLOv3, the width and height information of the bbox is processed through tw, th, the bbox priors, and exponential functions [21]. In other words, µ_tw and µ_th in (2), which correspond to the tw and th of YOLOv3, are not passed through the sigmoid function because they can have both negative and positive values.

Single Gaussian modeling for predicting the uncertainty of the bbox is applied only to the bbox coordinates of the YOLOv3 detection layers shown in Figure 1a. Therefore, the overall computational complexity of the algorithm does not increase significantly. At a 512 × 512 input resolution with ten classes, YOLOv3 requires 99 × 10^9 FLOPs; after single Gaussian modeling of the bbox, 99.04 × 10^9 FLOPs are required. Thus, the penalty on the detection speed is extremely low because the computation cost increases by only 0.04% compared with that before the modeling. The related results are shown in Section 4.

3.2. Reconstruction of loss function

For training, YOLOv3 [21] uses the sum of squared error loss for the bbox and the binary cross-entropy loss for the objectness and class. Because the bbox coordinates are output as Gaussian parameters through the Gaussian modeling, the loss function of the bbox is redesigned as a negative log likelihood (NLL) loss, whereas the loss functions for the objectness and class are unchanged. The loss function redesigned for the bbox is as follows:

L_x = -\sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{K} \gamma_{ijk} \log\big( N(x^G_{ijk} \mid \mu_{t_x}(x_{ijk}), \Sigma_{t_x}(x_{ijk})) + \varepsilon \big),   (5)

where L_x is the NLL loss of the t_x coordinate, and the others (i.e., L_y, L_w, and L_h) are defined in the same way with their respective parameters. W and H are the number of grid cells along the width and height, respectively, and K is the number of anchors. Moreover, µ_tx(x_ijk) denotes the t_x coordinate output by the detection layer of the proposed algorithm at the k-th anchor in the (i, j) grid cell. In addition, Σ_tx(x_ijk) is also an output of the detection layer, indicating the uncertainty of the t_x coordinate, and x^G_ijk is the GT of the t_x coordinate. The GT of the bbox is then computed as follows:

x^G_{ijk} = x^G \times W - i, \qquad y^G_{ijk} = y^G \times H - j   (6)

w^G_{ijk} = \log\!\left(\frac{w^G \times I_W}{A^w_k}\right), \qquad h^G_{ijk} = \log\!\left(\frac{h^G \times I_H}{A^h_k}\right),   (7)

where x^G, y^G, w^G, and h^G are the ratios of the GT bbox in the image, I_W and I_H are the width and height of the resized image, and A^w_k and A^h_k denote the width and height of the k-th anchor box prior, respectively.
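To make (2)-(7) concrete, the sketch below shows, under assumed tensor layouts and helper names of my own choosing (split_gaussian_params, gaussian_nll, encode_gt), how the raw detection-layer outputs could be turned into Gaussian parameters, how the per-coordinate NLL loss of (5) could be evaluated, and how a GT box given as image ratios could be encoded with (6) and (7). It is a sketch of the equations as written, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_gaussian_params(pred, num_classes):
    # Assumed per-anchor layout (illustrative):
    # [mu_tx, var_tx, mu_ty, var_ty, mu_tw, var_tw, mu_th, var_th, obj, class_0 .. class_n]
    mu = np.array([sigmoid(pred[0]), sigmoid(pred[2]), pred[4], pred[6]])   # Eq. (2): tw/th means stay unbounded
    var = sigmoid(np.array([pred[1], pred[3], pred[5], pred[7]]))           # Eq. (3): all variances in (0, 1)
    obj = sigmoid(pred[8])
    cls = sigmoid(pred[9:9 + num_classes])
    return mu, var, obj, cls

def gaussian_nll(gt, mean, var, gamma, eps=1e-9):
    # Eq. (5) for one coordinate; gt, mean, var, gamma have shape (H, W, K),
    # and gamma is zero wherever no GT is assigned, so those cells drop out of the loss.
    pdf = np.exp(-0.5 * (gt - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return -np.sum(gamma * np.log(pdf + eps))

def encode_gt(x, y, w, h, grid_w, grid_h, anchor_w, anchor_h, img_w, img_h):
    # x, y, w, h are GT box ratios of the image; returns the responsible cell and targets.
    i = int(np.floor(x * grid_w))
    j = int(np.floor(y * grid_h))
    x_g = x * grid_w - i                   # Eq. (6): center offsets inside the cell
    y_g = y * grid_h - j
    w_g = np.log(w * img_w / anchor_w)     # Eq. (7): log-ratio of the box size to the anchor prior
    h_g = np.log(h * img_h / anchor_h)
    return (i, j), (x_g, y_g, w_g, h_g)

A full implementation would apply split_gaussian_params over every grid cell and anchor of the three detection layers and sum the four coordinate losses together with the unchanged objectness and class losses.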
In YOLOv3, the centroid of the bbox is calculated in grid units and the size of the bbox is calculated based on an anchor box; the GT is therefore processed accordingly for training.

\gamma_{ijk} = \frac{\omega_{scale} \times \delta^{obj}_{ijk}}{2}   (8)

\omega_{scale} = 2 - w^G \times h^G.   (9)

ω_scale in (8) is calculated based on the width and height ratios of the GT bbox in the image, as shown in (9), and provides different weights according to the object size during training. In addition, δ^obj_ijk in (8) is a parameter that includes a prediction in the loss only when one of the predefined anchors best matches the current object. It is assigned a value of one when the intersection over union (IOU) between the GT and the k-th anchor box in the (i, j) grid cell is the largest, and a value of zero when there is no appropriate GT. For numerical stability of the logarithmic function, ε is assigned a value of 10^-9.

Because YOLOv3 uses the sum of squared error loss for the bbox, it is unable to cope with noisy data during training. In contrast, the redesigned loss function of the bbox can apply a penalty to the loss through the uncertainty of inconsistent data during training; that is, the model can be trained by concentrating on consistent data. The redesigned loss function of the bbox therefore makes the model more robust to noisy data [12]. Through this loss attenuation [12], it is possible to improve the accuracy of the algorithm.

3.3. Utilization of localization uncertainty

The proposed Gaussian YOLOv3 can obtain the uncertainty of the bbox for every object detected in an image. Because this is not an uncertainty for the entire image, it is possible to apply the uncertainty to each detection result. YOLOv3 considers only the objectness score and class scores during object detection and cannot consider a bbox score during the detection process because the score information for the bbox coordinates is unknown. However, Gaussian YOLOv3 can output the localization uncertainty, which serves as the score of the bbox. Therefore, the localization uncertainty can be considered along with the objectness score and class scores during the detection process. The proposed algorithm applies the localization uncertainty to the detection criterion of YOLOv3 such that bboxes with a high uncertainty among the predicted results are filtered out during the detection process. In this way, predictions with a high confidence in the objectness, class, and bbox are finally selected. Thus, Gaussian YOLOv3 can reduce the FP and increase the TP, thereby improving the detection accuracy. The proposed detection criterion considering the localization uncertainty is as follows:

Cr. = \sigma(Object) \times \sigma(Class_i) \times (1 - Uncertainty_{aver}).   (10)

Cr. in (10) indicates the detection criterion for Gaussian YOLOv3, σ(Object) is the objectness score, and σ(Class_i) is the score of the i-th class. In addition, Uncertainty_aver, which is the localization uncertainty, indicates the average of the uncertainties of the predicted bbox coordinates. The localization uncertainty has a value between zero and one, like the objectness score and class scores, and the higher the localization uncertainty, the lower the confidence of the predicted bbox. The results of the proposed Gaussian YOLOv3 are described in Section 4.
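A per-box application of the criterion in (10) can be sketched as follows; the 0.5 threshold matches the default detection threshold mentioned in Section 4.3, while the function and variable names are illustrative assumptions rather than the authors' code.

import numpy as np

def detection_criterion(objectness, class_scores, coord_uncertainties):
    # Eq. (10): the localization uncertainty is the average of the four coordinate uncertainties.
    uncertainty_aver = float(np.mean(coord_uncertainties))
    return objectness * np.asarray(class_scores) * (1.0 - uncertainty_aver)

def keep_detection(objectness, class_scores, coord_uncertainties, thresh=0.5):
    # A box is kept only if its best class passes the combined criterion, so boxes with
    # confident objectness/class but uncertain coordinates are filtered out.
    scores = detection_criterion(objectness, class_scores, coord_uncertainties)
    best = int(np.argmax(scores))
    return scores[best] >= thresh, best, float(scores[best])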
4. Experimental Results

In the experiments, the KITTI dataset [7], which is commonly used in autonomous driving research, and the BDD dataset [26], which is the latest published autonomous driving dataset, are used. The KITTI dataset consists of three classes, car, cyclist, and pedestrian, and contains 7,481 images for training and 7,518 images for testing. Because there is no GT for the test images, the training and validation sets are made by randomly splitting the training set in half [25]. The BDD dataset consists of ten classes: bike, bus, car, motor, person, rider, traffic light, traffic sign, train, and truck. The ratio of the training, validation, and test sets is 7:1:2, and the test set is used for the performance evaluation in this paper. In general, the IOU threshold (TH) of the KITTI dataset is set to 0.7 for cars and 0.5 for cyclists and pedestrians [7], whereas the IOU TH of the BDD dataset is 0.75 for all classes [26]. In both YOLOv3 and Gaussian YOLOv3 training, the batch size is 64 and the learning rate is 0.0001. The anchor sizes are extracted using k-means clustering on each training set of KITTI and BDD, and the anchors used in the training and evaluation are shown in Table 1. The other studies are trained using the default settings of the official code of each algorithm. The experiments are conducted on an NVIDIA GTX 1080 Ti with CUDA 8.0 and cuDNN v7.

                           Anchor 0    Anchor 1    Anchor 2
KITTI training set
  First detection layer    (49,240)    (82,170)    (118,206)
  Second detection layer   (45,76)     (27,172)    (67,116)
  Third detection layer    (13,30)     (23,53)     (17,102)
BDD training set
  First detection layer    (73,175)    (141,178)   (144,291)
  Second detection layer   (32,97)     (57,64)     (92,109)
  Third detection layer    (7,10)      (14,24)     (27,43)

Table 1: Results of anchor boxes of the training sets.

4.1. Validation in utilizing localization uncertainty

Figure 3 shows the relationship between the IOU and the localization uncertainty of the bbox for the KITTI and BDD validation sets. These results are plotted for cars, which is the dominant class in both datasets, and the localization uncertainty is predicted using the proposed algorithm. To show the typical tendency, the IOU is divided into increments of 0.1, and the average value of the IOU and the average value of the localization uncertainty are calculated for each range and used as representative values. As shown in Figure 3, the IOU tends to increase as the localization uncertainty decreases in both datasets. A larger IOU indicates that the coordinates of the predicted bbox are closer to those of the GT. Based on these results, the localization uncertainty of the proposed algorithm effectively represents the confidence of the predicted bbox. It is therefore possible to cope with mislocalizations and improve the accuracy by utilizing the localization uncertainty predicted by the proposed algorithm.

[Figure 3: IOU versus localization uncertainty on the KITTI and BDD validation sets; the average IOU increases as the average localization uncertainty decreases for both datasets.]
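The per-bin averaging behind Figure 3 can be reproduced with a few lines; the sketch below assumes one IOU value (against the matched GT) and one predicted localization uncertainty per detected car, and only the 0.1 bin width comes from the description above.

import numpy as np

def binned_iou_vs_uncertainty(ious, uncertainties, step=0.1):
    # Average IOU and average localization uncertainty within IOU bins of width `step`,
    # giving one representative point per non-empty bin, as plotted in Figure 3.
    ious = np.asarray(ious)
    uncertainties = np.asarray(uncertainties)
    edges = np.arange(0.0, 1.0 + step, step)
    mean_iou, mean_unc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (ious >= lo) & (ious < hi)
        if mask.any():
            mean_iou.append(float(ious[mask].mean()))
            mean_unc.append(float(uncertainties[mask].mean()))
    return np.array(mean_iou), np.array(mean_unc)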
4.2. Performance evaluation of Gaussian YOLOv3

To demonstrate the superiority of the proposed algorithm, its performance (i.e., accuracy and detection speed) is compared with that of other studies [1, 11, 17, 28, 29, 16, 21]. In the experiment on the KITTI validation set, the other studies [1, 11, 17, 28, 16, 21] are trained and evaluated using the officially published code of each algorithm. In the case of CFENet [29], the result from the KITTI object detection leaderboard is used because the official code has not been published. In the experiment on the BDD test data, the results of SSD [17], CFENet [29], and RefineDet [28] on the BDD test set are specified in CFENet [29], and the results of these studies are therefore taken from [29], whereas the remaining comparative studies [1, 11, 16, 21] are trained and evaluated using the officially published codes, because these studies were not developed with the BDD dataset as a target and have therefore not been evaluated on it in previous work. For a fair comparison of the one-stage detectors, the input resolution is set as in CFENet [29]. The two-stage detectors use the default resolution of each officially published code. The official evaluation method of each dataset is used for the accuracy comparison, and the IOU TH is set to the values mentioned before. For the accuracy comparison, the mAP, which has been widely used in previous studies on object detection, is selected.

Table 2 shows the performance of the proposed algorithm and other methods on the KITTI validation set. The mAP of the proposed algorithm, Gaussian YOLOv3, improves by 3.09 compared to that of YOLOv3, and the detection speed is 43.13 fps, which enables real-time detection with only a slight difference from YOLOv3. Gaussian YOLOv3 is 3.93 fps faster than RFBNet [16], which has the fastest operation speed among the previous studies with the exception of YOLOv3, despite the mAP of Gaussian YOLOv3 outperforming that of RFBNet [16] by more than 10.17. In addition, although the mAP of Gaussian YOLOv3 at a 512 × 512 resolution is 1.81 lower than that of SINet [11], which has the highest accuracy among the previous methods, it is noteworthy that the fps of the proposed method is 1.8 times that of SINet [11]. Because there is a trade-off between accuracy and detection speed, for a fair comparison the input resolution of the proposed algorithm is changed and evaluated considering the fps of SINet [11]. The experimental results show that the mAP of Gaussian YOLOv3 at a 704 × 704 resolution, shown in the last row of Table 2, is 86.79 at 24.91 fps, and consequently Gaussian YOLOv3 outperforms SINet [11] in terms of both accuracy and detection speed.

Table 3 shows the performance of the proposed approach and other methods on the BDD test set. Gaussian YOLOv3 improves the mAP by 3.5 compared with YOLOv3, and the detection speed is 42.5 fps, which is almost the same as that of YOLOv3. In addition, Gaussian YOLOv3 is 3.5 fps faster than RFBNet [16], which has the fastest operation speed among the previous studies except for YOLOv3, despite the accuracy of Gaussian YOLOv3 outperforming that of RFBNet [16] by 3.9 mAP. Moreover, compared to CFENet [29], which has the highest accuracy among the previous methods, Gaussian YOLOv3 with a 736 × 736 input resolution, shown in the last row of Table 3, achieves an mAP higher by 1.7 and an operation speed faster by 1.5 fps, and consequently Gaussian YOLOv3 outperforms CFENet [29] in terms of both accuracy and detection speed.

Furthermore, on the COCO dataset [14], the AP of Gaussian YOLOv3 is 36.1, which is 3.1 higher than that of YOLOv3. In particular, the AP75 (i.e., the strict metric) of Gaussian YOLOv3 is 39.0, which is 4.6 higher than that of YOLOv3. These results indicate that the proposed algorithm outperforms YOLOv3 on a general dataset as well as on KITTI and BDD.

Based on these experimental results, because the proposed algorithm can significantly improve the accuracy with little penalty in speed compared to YOLOv3, Gaussian YOLOv3 is superior to the previous methods.
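For reference, the mAP reported in this section is the mean over classes of the average precision computed from ranked detections; a generic all-point-interpolation version is sketched below. The exact evaluation protocols of KITTI, BDD, and COCO differ in details such as difficulty levels and interpolation, so this is only an assumption-level illustration, not any of the official evaluation codes.

import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: confidence of each detection of one class; is_tp: 1 if it matched a GT box, else 0.
    order = np.argsort(scores)[::-1]
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # All-point interpolation: make precision monotonically decreasing, then sum over recall steps.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0] if len(precision) else 0.0], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# The mAP is then the mean of average_precision over all classes of the dataset.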
Table 2: Performance comparison using the KITTI validation set. E, M, and H refer to easy, moderate, and hard, respectively. Values are average precision (%).

                              Car                  Pedestrian              Cyclist
Detection algorithm      E      M      H       E      M      H       E      M      H     mAP (%)   FPS     Input size
MS-CNN [1]             92.54  90.49  79.23   87.46  81.34  72.49   90.13  87.59  81.11   84.71      8.13   1920×576
SINet [11]             99.11  90.59  79.77   88.09  79.22  70.30   94.41  86.61  80.68   85.42     23.98   1920×576
SSD [17]               88.37  87.84  79.15   50.33  48.87  44.97   48.00  52.51  51.52   61.29     28.93   512×512
RefineDet [28]         98.96  90.44  88.82   84.40  77.44  73.52   86.33  80.22  79.15   84.36     27.81   512×512
CFENet [29]            90.33  90.22  84.85     -      -      -       -      -      -       -        0.25   -
RFBNet [16]            87.41  88.35  83.41   65.85  61.30  57.71   74.46  72.73  69.75   73.44     39.20   512×512
YOLOv3 [21]            85.68  76.89  75.89   83.51  78.37  75.16   88.94  80.64  79.62   80.52     43.57   512×512
Gaussian YOLOv3        90.61  90.20  81.19   87.84  79.57  72.30   89.31  81.30  80.20   83.61     43.13   512×512
Gaussian YOLOv3        98.74  90.48  89.47   87.85  79.96  76.81   90.08  86.59  81.09   86.79     24.91   704×704

Table 3: Performance comparison using the BDD test set.

Detection algorithm    mAP (%)   FPS    Input size
MS-CNN [1]               5.7      6.0   1920×576
SINet [11]               9.0     18.2   1920×576
SSD [17]                14.1     23.1   512×512
RefineDet [28]          17.4     22.3   512×512
CFENet [29]             19.1     21.0   512×512
RFBNet [16]             14.5     39.0   512×512
YOLOv3 [21]             14.9     42.9   512×512
Gaussian YOLOv3         18.4     42.5   512×512
Gaussian YOLOv3         20.8     22.5   736×736

4.3. Visual and numerical evaluation of FP and TP

For a visual evaluation of Gaussian YOLOv3, Figures 4 and 5 show detection examples of the baseline and Gaussian YOLOv3 on the KITTI validation set and the BDD test set, respectively. The detection TH is 0.5, which is the default test TH of YOLOv3. The results in the first row of Figure 4 and in the first column of Figure 5 show that Gaussian YOLOv3 can detect objects that YOLOv3 cannot find, thereby increasing its TP. These positive results are obtained because the Gaussian modeling and loss function reconstruction of YOLOv3 proposed in this paper provide a loss attenuation effect during the learning process, so that the learning accuracy for the bbox is improved, which in turn enhances the performance of the objectness. Next, the results in the second row of Figure 4 and in the second column of Figure 5 show that Gaussian YOLOv3 can correct erroneous object detection results produced by YOLOv3. In addition, the results in the third row of Figure 4 and in the third column of Figure 5 show that Gaussian YOLOv3 can accurately detect the bbox of an object that is inaccurately localized by YOLOv3. Based on these results, Gaussian YOLOv3 can significantly reduce the FP and increase the TP, and consequently the driving stability and efficiency are improved and fatal accidents can be prevented.

[Figure 4: Detection results of the baseline and proposed algorithms on the KITTI validation set. The first column shows the detection results of YOLOv3, whereas the second column shows the detection results of Gaussian YOLOv3.]

[Figure 5: Detection results of the baseline and proposed algorithms on the BDD test set. The first and second rows show the detection results of YOLOv3 and Gaussian YOLOv3, respectively, and each color corresponds to a particular object class.]

For a numerical evaluation of the FP and TP of Gaussian YOLOv3, Table 4 shows the numbers of FPs and TPs for the baseline and Gaussian YOLOv3. The detection TH is the same as mentioned before. The KITTI and BDD validation sets are used to calculate the FP and TP because the GT is provided for the validation sets. For more accurate measurements, the FP and TP of the two datasets are calculated using the official evaluation code of BDD, because the KITTI official evaluation method does not count the FP when the bbox is within a certain size. For the KITTI and BDD validation sets, Gaussian YOLOv3 reduces the FP by 41.40% and 40.62%, respectively, compared to YOLOv3, and increases the TP by 7.26% and 4.3%, respectively. It should be noted that the reduction in the FP prevents unnecessary and unexpected braking, and the increase in the TP prevents fatal accidents caused by object detection errors. In conclusion, Gaussian YOLOv3 shows a better performance than YOLOv3 for both the FP and the TP, which are related to the safety of autonomous vehicles. Based on the results described in Sections 4.1, 4.2, and 4.3, the proposed algorithm outperforms previous studies and is the most suitable for autonomous driving applications.

Table 4: Numerical evaluation of FP and TP.

                         YOLOv3    Gaussian YOLOv3    Variation rate (%)
KITTI validation set
  # of FP                 1,681          985              -41.40
  # of TP                13,575       14,560              +7.26
  # of GT                17,607       17,607                0
BDD validation set
  # of FP                86,380       51,296              -40.62
  # of TP                57,261       59,724              +4.30
  # of GT               185,578      185,578                0
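The FP and TP counts in Table 4 come from the official BDD evaluation code; at their core, such counts reduce to greedy IOU matching between predictions and GT boxes of the same class at the thresholds given in Section 4, roughly as sketched below (my own simplification, not the evaluation code itself).

import numpy as np

def count_tp_fp(pred_boxes, pred_scores, gt_boxes, iou_fn, iou_th=0.75):
    # Greedy matching for one image and one class.
    # pred_boxes: (P, 4), pred_scores: (P,), gt_boxes: (G, 4);
    # iou_fn(box, boxes) returns the IOUs of one box against an array of boxes.
    order = np.argsort(pred_scores)[::-1]        # match the most confident predictions first
    matched = np.zeros(len(gt_boxes), dtype=bool)
    tp = fp = 0
    for p in order:
        if len(gt_boxes) == 0:
            fp += 1
            continue
        overlaps = iou_fn(pred_boxes[p], gt_boxes)
        best = int(np.argmax(overlaps))
        if overlaps[best] >= iou_th and not matched[best]:
            matched[best] = True
            tp += 1
        else:
            fp += 1
    return tp, fp

The iou helper from the sketch in Section 2 can be passed as iou_fn; unmatched GT boxes would count as false negatives.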
5. Conclusion

A high accuracy and a real-time detection speed of an object detection algorithm are extremely important for the safety and real-time control of autonomous vehicles. Various studies related to camera-based autonomous driving have been conducted, but they are unsatisfactory in terms of the trade-off between the accuracy and operation speed. For this reason, this paper proposes an object detection algorithm that achieves the best trade-off between accuracy and speed for autonomous driving. Through Gaussian modeling, loss function reconstruction, and the utilization of localization uncertainty, the proposed algorithm improves the accuracy, increases the TP, and significantly reduces the FP, while maintaining the real-time capability. Compared to the baseline, the proposed Gaussian YOLOv3 algorithm improves the mAP by 3.09 and 3.5 on the KITTI and BDD datasets, respectively. Furthermore, because the proposed algorithm has a higher accuracy than previous studies with a similar fps, it is excellent in terms of the trade-off between accuracy and detection speed. As a result, the proposed algorithm can significantly improve camera-based object detection systems for autonomous driving and is consequently expected to contribute significantly to the wide use of autonomous driving applications.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1057530) and "The Project of Industrial Technology Innovation" through the Ministry of Trade, Industry and Energy (MOTIE) (10082585, 2017).
References

[1] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354-370. Springer, 2016.
[2] Sungjoon Choi, Kyungjae Lee, Sungbin Lim, and Songhwai Oh. Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6915-6922. IEEE, 2018.
[3] Aleksa Ćorović, Velibor Ilić, Siniša Durić, Malisa Marijan, and Bogdan Pavković. The real-time detection of traffic participants using YOLO algorithm. In 2018 26th Telecommunications Forum (TELFOR), pages 1-4. IEEE, 2018.
[4] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379-387, 2016.
[5] Xuerui Dai. HybridNet: A fast vehicle detection system for autonomous driving. Signal Processing: Image Communication, 70:79-88, 2019.
[6] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3D vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266-3273. IEEE, 2018.
[7] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354-3361. IEEE, 2012.
[8] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[10] Yihui He, Xiangyu Zhang, Marios Savvides, and Kris Kitani. Softer-NMS: Rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545, 2018.
[11] Xiaowei Hu, Xuemiao Xu, Yongjie Xiao, Hao Chen, Shengfeng He, Jing Qin, and Pheng-Ann Heng. SINet: A scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems, 20(3):1010-1019, 2019.
[12] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.
[13] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[15] Feng Liu, Bingquan Liu, Chengjie Sun, Ming Liu, and Xiaolong Wang. Deep learning approaches for link prediction in social network services. In International Conference on Neural Information Processing, pages 425-432. Springer, 2013.
[16] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385-400, 2018.
[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[18] Aarian Marshall. False positive: Self-driving cars and the agony of knowing what matters. WIRED Transportation, 2018.
[19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[20] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263-7271, 2017.
[21] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[23] Young-Woo Seo, Nathan Ratliff, and Chris Urmson. Self-supervised aerial images analysis for extracting parking lot structure. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
[24] Junqing Wei, Jarrod M Snider, Junsung Kim, John M Dolan, Raj Rajkumar, and Bakhtiar Litkouhi. Towards a viable autonomous driving research platform. In 2013 IEEE Intelligent Vehicles Symposium (IV), pages 763-770. IEEE, 2013.
[25] Bichen Wu, Forrest Iandola, Peter H Jin, and Kurt Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 129-137, 2017.
[26] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
[27] Chi Zhang, Yuehu Liu, Danchen Zhao, and Yuanqi Su. RoadView: A traffic scene simulator for autonomous vehicle simulation testing. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1160-1165. IEEE, 2014.
[28] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203-4212, 2018.
[29] Qijie Zhao, Yongtao Wang, Tao Sheng, and Zhi Tang. Comprehensive feature enhancement module for single-shot object detector. In Asian Conference on Computer Vision. Springer, 2018.
