An Object Detection System Based On YOLO in Traffic Scene
Abstract—We build an object detection system for images in traffic scenes. It is fast, accurate and robust. Traditional object detectors first generate proposals, then extract features, and finally run a classifier on the proposals; their speed is slow and their accuracy is not satisfying. YOLO, an excellent object detection approach based on deep learning, uses a single convolutional neural network for both location and classification. We replace all the fully-connected layers of YOLO's network with an average pooling layer to produce a new network, and we optimize the loss function by increasing the proportion of the bounding-box coordinate error. The resulting method, OYOLO (Optimized YOLO), is 1.18 times faster than YOLO while outperforming region-based approaches such as R-CNN in accuracy. To improve accuracy further, we add a combination of OYOLO and R-FCN to our system. For challenging night images, pre-processing with histogram equalization is applied. We obtain more than 6% improvement in mAP on our testing set.

Keywords—computer vision; object detection; deep learning; convolutional neural network

I. INTRODUCTION

Computer vision is a discipline for enabling machines to "look": it makes computers and other devices take the place of the human eyes for object detection and further image processing. Image processing and computer vision have become a new research trend, driven by artificial intelligence systems that obtain information from images, such as intelligent traffic surveillance systems. Object detection is an important branch of computer vision. The general goal of object detection is to be fast and accurate, and good detection algorithms should be convenient for people's lives.

The purpose of this paper is to classify and locate objects in images captured in traffic scenes, which is important for traffic monitoring systems and unmanned vehicle systems. Objects in traffic-scene images are mainly cars, cyclists and pedestrians.

Object detection based on deep learning outperforms traditional machine learning algorithms in both speed and accuracy. Deep learning owes much of its popularity to the convolutional neural network (CNN), a feed-forward neural network with a unique advantage in object recognition thanks to its special structure of local weight sharing. A CNN mainly consists of convolution layers, pooling layers and fully-connected layers, and current object detection algorithms are based on CNNs. Object detection is divided into classification and location. Traditional detectors [10] first extract proposals and features, then do classification [11]; this is slow and makes many errors. Some region-based [9] object detection algorithms such as Fast R-CNN [2] classify with a CNN while extracting proposals with selective search, and extracting proposals takes a relatively long time. YOLO implements location and classification in a single CNN and is extremely fast.

In traffic scenes we want real-time object detection, so we take YOLO as the reference for creating a better method. The fully-connected layers of a CNN play the role of the classifier, but they bring heavy computation. Inspired by the FCN (fully convolutional network) [5], a convolutional network without fully-connected layers can also realize classification. We replace the last two fully-connected layers of the YOLO network with an average pooling layer, creating a new convolutional neural network. We name our method OYOLO because it can be seen as an optimized YOLO. The test-time speed per image of OYOLO is 18% faster than YOLO's.

We select samples from public data sets and label some samples manually to build our own training and testing sets. Many images captured in traffic scenes are night images, so we choose some night images to extend our data set.

YOLO makes some location errors which are partly caused by its loss function, so we optimize the loss function by increasing the proportion of the location variables. We get about 1% improvement in mAP (mean average precision) on our testing set.

To further reduce location errors, we propose to combine OYOLO with another object detection algorithm, R-FCN, which performs well in location. When an image is fed into OYOLO + R-FCN, the two algorithms run in parallel; we take the intersection of their bounding boxes and the average of their class probabilities as the output. The mAP of OYOLO + R-FCN is improved by around 3.3%.

Night images are dark and low in contrast, and lit lamps create partial highlights in them. These factors make night images challenging for object detection.
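The fusion rule described above (intersection of the two detectors' boxes, average of their class probabilities) can be sketched as follows. The function name and the (x1, y1, x2, y2) box format are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def fuse_detections(box_a, box_b, probs_a, probs_b):
    """Fuse one matched pair of detections from two detectors.

    Boxes are (x1, y1, x2, y2); probs are per-class probability vectors.
    Returns the intersection box and the element-wise mean of the class
    probabilities, mirroring the OYOLO + R-FCN combination rule.
    """
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    if x2 <= x1 or y2 <= y1:  # the two boxes do not overlap
        return None, None
    fused_box = (x1, y1, x2, y2)
    fused_probs = (np.asarray(probs_a, dtype=float)
                   + np.asarray(probs_b, dtype=float)) / 2.0
    return fused_box, fused_probs
```

For example, fusing boxes (10, 10, 50, 50) and (20, 20, 60, 60) yields the intersection (20, 20, 50, 50), and class probabilities [0.8, 0.2] and [0.6, 0.4] average to [0.7, 0.3].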
Authorized licensed use limited to: J.R.D. Tata Memorial Library Indian Institute of Science Bengaluru. Downloaded on April 14,2024 at 13:11:47 UTC from IEEE Xplore. Restrictions apply.
We propose to pre-process these difficult samples with histogram equalization, which brings about 3% mAP improvement.

(System flowchart: begin; if the input is a difficult sample, pre-process it first, otherwise skip pre-processing.)

A good object detection system has been built. There are two channels in the system: one is OYOLO, responding to the high-speed requirement, and the other is OYOLO + R-FCN, which meets the need for high accuracy.

The remainder of this paper is organized as follows. Section II introduces related work on object detection in images. The design of the whole system and the structure of OYOLO's network are presented in Section III. Section IV illustrates experiments on our own data set. Section V concludes and presents potential opportunities for future research.
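The histogram-equalization pre-processing mentioned above can be sketched with plain NumPy. The paper does not give its implementation; this is standard grayscale equalization (OpenCV's cv2.equalizeHist performs the same mapping):

```python
import numpy as np

def equalize_histogram(gray):
    """Histogram-equalize an 8-bit grayscale image (2-D uint8 array).

    Spreads the intensity CDF over the full 0..255 range, raising the
    contrast of dark night images before detection.
    """
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    total = gray.size
    # Classic equalization mapping: rescale the CDF to 0..255.
    lut = np.round((cdf - cdf_min) / float(total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]
```

On a toy image with pixel values [0, 64, 128, 255], the mapping stretches them to [0, 85, 170, 255], i.e., the levels become evenly spaced.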
image into S × S grids. Each grid cell predicts B bounding boxes and confidence scores for those boxes, and also predicts C conditional class probabilities [4]. We use YOLO's activation functions.

TABLE I. NETWORK

layer  name  filters  size/stride  input       output
0      conv  64       7×7/1        448×448×3   448×448×64

The proportion of the coordinate error in the loss function is increased. The loss function is defined as formula (1):

L(x, x̂, y, ŷ, w, ŵ, h, ĥ, C, Ĉ, c)
  = 5.5 · Σ_{i=0}^{S·S} Σ_{j=0}^{B} I_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
  + 5.5 · Σ_{i=0}^{S·S} Σ_{j=0}^{B} I_ij^obj [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
  + Σ_{i=0}^{S·S} Σ_{j=0}^{B} I_ij^obj (C_i − Ĉ_i)²
  + 0.5 · Σ_{i=0}^{S·S} Σ_{j=0}^{B} I_ij^noobj (C_i − Ĉ_i)²
  + Σ_{i=0}^{S·S} I_i^obj Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²        (1)
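The enlarged coordinate term of formula (1) can be sketched numerically. The 5.5 coefficient is the paper's increased coordinate weight; the tensor shapes and function name are our own illustrative assumptions:

```python
import numpy as np

LAMBDA_COORD = 5.5  # enlarged coordinate weight from formula (1)

def coord_loss(pred, target, obj_mask):
    """Coordinate part of the OYOLO loss.

    pred, target: arrays of shape (S*S, B, 4) holding (x, y, w, h);
    obj_mask:     (S*S, B) indicator I_ij^obj, 1 where box j of cell i
                  is responsible for an object.
    x and y enter squared directly; w and h enter through square roots,
    as in YOLO's loss, so large boxes do not dominate small ones.
    """
    dx = (pred[..., 0] - target[..., 0]) ** 2
    dy = (pred[..., 1] - target[..., 1]) ** 2
    dw = (np.sqrt(pred[..., 2]) - np.sqrt(target[..., 2])) ** 2
    dh = (np.sqrt(pred[..., 3]) - np.sqrt(target[..., 3])) ** 2
    return LAMBDA_COORD * np.sum(obj_mask * (dx + dy + dw + dh))
```

With a perfect prediction the term is zero; a 0.1 error in x alone contributes 5.5 × 0.1² = 0.055 to the loss.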
Histogram equalization is used to pre-process night images. With this pre-processing, more than 3% improvement in mAP is obtained on our testing data set.

IV. EXPERIMENTS

We install Darknet under Ubuntu 14.04 and use cuDNN for acceleration. Under the Darknet framework, training and testing are accelerated by an Nvidia Tesla K40 GPU.

A. Data Set

For machine learning, it is important to create a training set that fits the actual scenes.

We add some images captured by road cameras to the KITTI data set to obtain our own data set. Studies show that a model performs better when the ratio of training data to testing data is 9:1. There are 7451 images in the KITTI data set in total; we choose 6732 images as the training set, and the other 1519 images are treated as the testing set. We then label 1527 images manually, choosing 1368 of them to extend the training set; the remaining 159 images, including some night images, are added to the testing set. In total, our training set contains 8100 images and our testing set contains 900.

B. Training Process

We train OYOLO by SGD (stochastic gradient descent) with batch size 64, i.e., 64 samples are processed in every iteration on the training set. Training runs for 105 epochs. During training we vary the learning rate from small to large and back to small. At the beginning, to make the output converge, the learning rate rises from 0.001 to 0.01 during the first epoch. From the 2nd to the 65th epoch the learning rate is 0.01. To avoid strong oscillation, it is set to 0.001 from the 66th to the 95th epoch, and to 0.0001 for the 96th–105th epochs. Referring to the size of the VOC training set and its training ratio, the number of training iterations is set to 42000. The momentum is set to 0.85 and a weight decay of 0.00055 is adopted.

The initial weight parameters of OYOLO come from training on ImageNet. Using our training set, training takes about 4 days on the GPU, after which the final object detection weight model our_method.weights is obtained.

TABLE II. ACCURACY COMPARISON

mAP (trained on our training set)    mAP (trained on VOC07 + VOC12)
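The piecewise learning-rate schedule described above can be sketched as a function of the epoch index. The warm-up within the first epoch is shown as a linear ramp from 0.001 to 0.01, which is our assumption; the paper only says the rate "varies":

```python
def learning_rate(epoch, progress=1.0):
    """Learning rate for a given 1-based epoch, per the schedule above.

    progress in [0, 1] is the fraction of the first epoch completed,
    used for the warm-up ramp from 0.001 to 0.01 (linear by assumption).
    """
    if epoch == 1:
        return 0.001 + (0.01 - 0.001) * progress  # warm-up ramp
    if epoch <= 65:
        return 0.01
    if epoch <= 95:
        return 0.001   # smaller rate to avoid strong oscillation
    return 0.0001      # epochs 96-105
```

Keeping the schedule in one pure function makes it easy to log or plot the rate for every epoch before launching a multi-day training run.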
Fig. 3. Object detection results of two images.

V. CONCLUSION

The best processing speed of our system is 44 ms per image, which is 18% faster than YOLO. The mAP of our system reaches 86.4% on our testing set, while R-FCN achieves only 67.7%.

Our method OYOLO, with its new CNN based on YOLO, is very fast while outperforming other object detectors. OYOLO and R-FCN, two excellent object detection algorithms, are combined to get better accuracy for images in traffic scenes. Our object detection system becomes more accurate after a pre-processing procedure for night images is added. Thus we have presented an object detection system based on YOLO for traffic scenes which is fast, accurate and robust.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China (No. 61002011); the 863 Program of China (No. 2013AA013303); and the Fundamental Research Funds for the Central Universities (No. 2013RC1104).

REFERENCES

[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6):1137.
[2] Girshick R. Fast R-CNN[J]. Computer Science, 2015.
[3] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[J]. Computer Science, 2013:580-587.
[4] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[J]. 2015:779-788.
[5] Dai J, Li Y, He K, et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks[J]. 2016.
[6] Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger[J]. 2016.
[7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. 2015:21-37.
[8] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[C]. Computer Vision and Pattern Recognition, 2016:770-778.
[9] Shrivastava A, Gupta A, Girshick R. Training Region-Based Object Detectors with Online Hard Example Mining[J]. 2016:761-769.
[10] Felzenszwalb P, Mcallester D, Ramanan D. A discriminatively trained, multiscale, deformable part model[J]. 2008:1-8.
[11] Felzenszwalb P F, Girshick R B, Mcallester D, et al. Object detection with discriminatively trained part-based models[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2010, 32(9):1627.