
2017 6th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, Oct 21-22, 2017

An Object Detection System Based on YOLO in Traffic Scene
Jing Tao, Hongbo Wang, Xinyu Zhang, Xiaoyu Li, Huawei Yang
State Key Lab. of Networking and Switching Technology
Beijing University of Posts and Telecommunications
Beijing, China
{jtao, hbwang, zxy_bupt, 2011212836}@bupt.edu.cn, [email protected]

Abstract—We build an object detection system for images in traffic scene. It is fast, accurate and robust. Traditional object detectors first generate proposals, then extract features, and finally run a classifier on the proposals; the speed is slow and the accuracy is not satisfying. YOLO, an excellent object detection approach based on deep learning, uses a single convolutional neural network for both location and classification. We replace all the fully-connected layers of YOLO's network with an average pooling layer to produce a new network, and we optimize the loss function by increasing the proportion of the bounding-box coordinate error. The resulting object detection method, OYOLO (Optimized YOLO), is 1.18 times faster than YOLO while outperforming region-based approaches such as R-CNN in accuracy. To improve accuracy further, we add a combination of OYOLO and R-FCN to our system. For challenging night images, pre-processing with histogram equalization is presented. Altogether we obtain more than 6% improvement in mAP on our testing set.

Keywords—computer vision; object detection; deep learning; convolutional neural network

I. INTRODUCTION

Computer vision is a discipline for enabling machines to "look": computers and other devices take the place of the human eyes for object detection and further image processing. Image processing and computer vision have become a major research trend because they are used to build artificial intelligence systems, such as intelligent traffic surveillance, that obtain information from images. Object detection is an important branch of computer vision; its general goals are to be fast and accurate, and good detection algorithms should be convenient for people's lives.

The purpose of this paper is to classify and locate objects in images captured in traffic scenes, which is important for traffic monitoring systems and unmanned vehicle systems. Object detection is a very important part of artificial intelligence. Objects in traffic-scene images are mainly cars, cyclists and pedestrians.

Object detection based on deep learning is better than detection based on traditional machine learning algorithms in both speed and accuracy. The core of deep learning here is the CNN (convolutional neural network), a feed-forward network whose special structure of local weight sharing gives it a unique advantage in object recognition. A CNN mainly consists of convolution layers, pooling layers and fully-connected layers, and current object detection algorithms are based on CNNs. Object detection is divided into classification and location. Traditional detectors[10] first extract proposals and features, then classify[11]; this is slow and makes many errors. Some region-based[9] detection algorithms such as Fast-RCNN[2] achieve classification by CNN while extracting proposals with selective search, and extracting proposals takes a relatively long time. YOLO implements location and classification in a single CNN and is extremely fast.

In traffic scenes we want a real-time object detection algorithm, so we take YOLO as the reference for creating a better method. The fully-connected layers of a CNN play the role of classifier, but they bring heavy computation. Inspired by FCN (fully convolutional network)[5], a convolutional network without fully-connected layers can still realize classification. We replace the last two fully-connected layers of YOLO's network with an average pooling layer, creating a new convolutional network. We name our method OYOLO because it can be seen as an optimized YOLO. The test-time speed per image of OYOLO is 18% faster than YOLO.

We select samples from public data sets and label some samples manually to build our own training and testing sets. Many traffic-scene images are captured at night, so we include night images to expand our data set.

YOLO makes location errors that are partly caused by its loss function. We optimize the loss function by increasing the proportion of the location variables, gaining about 1% in mAP (mean average precision) on our testing set.

To reduce location errors further, we propose combining OYOLO with R-FCN, another object detection algorithm that performs well in location. When an image is put into OYOLO + R-FCN, the two algorithms run in parallel; we take the intersection of their bounding boxes and the average of their class probabilities as the output. This improves mAP by around 3.3%.

Night images are dark and low in contrast, and lit lamps create partial highlights in them. These factors make night images challenging for object detection.

We propose to pre-process these difficult samples with histogram equalization, which brings about 3% mAP improvement.

A good object detection system has thus been built. It has two channels: OYOLO, which answers the need for high speed, and OYOLO + R-FCN, which meets the need for high accuracy.

Fig. 1. Entire process of an image: a difficult sample is pre-processed first; then, if extremely high accuracy is required, the image is sent to OYOLO + R-FCN, otherwise to OYOLO alone.

The remainder of this paper is organized as follows. Section II introduces related work on object detection in images. Section III presents the design of the whole system and the structure of OYOLO's network. Section IV illustrates experiments on our own data set. Section V concludes and presents potential directions for future research.

II. RELATED WORK

Many excellent CNN structures have appeared in the evolution of CNNs. The birth of LeNet in 1998 was the beginning of CNNs. In the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), Professor Hinton and his student Alex Krizhevsky won first place in the image classification task with a convolutional neural network, now known as AlexNet. In the ILSVRC2014 competition, Andrew Zisserman's group used VGG-Net, which has more layers. With GoogLeNet and ResNet[8], the width and depth of convolutional neural networks were expanded further, and the performance of object detection improved.
When deep learning became popular, Microsoft, Google, Facebook and other companies invested heavily in research on convolutional neural networks. The difficulty of object detection in images mainly lies in changes of illumination, viewing angle and the interior of the target. In view of these difficulties, scholars at home and abroad have made many attempts and proposed region-based detection methods, which implement object detection through feature extraction, proposal selection and classification. In 2014, the R-CNN[3] framework made a great breakthrough in object detection. Building on convolutional neural networks, SPP-Net, Fast-R-CNN[2], Faster-R-CNN[1] and R-FCN appeared one after another. These algorithms implement location (proposal extraction) and classification separately; they locate objects accurately, but extracting proposals takes a great deal of time.

From R-CNN to R-FCN, the object detection pipeline has become more streamlined, more accurate and faster. But all of these algorithms are region-based: to run them, proposals must be produced first. To improve detection speed, in 2016 Redmon et al. proposed YOLO[4]. YOLO is not region-based and works end to end: a single convolutional neural network implements both location and classification, so when an image is input, the locations and classes of all objects in it are output. YOLO is very fast while still guaranteeing accuracy.

In this paper, the convolutional neural network of YOLO is optimized following FCN: the fully-connected layers are removed while the network's ability to locate and classify at the same time is preserved, giving a faster method. In addition, combining OYOLO with R-FCN and pre-processing difficult samples improve the accuracy. We construct a robust object detection system for traffic scenes that meets both the real-time speed requirement and the need for high accuracy.

III. DESIGN SCHEME

When an image enters our object detection system, it follows the process depicted in Figure 1; a minimal sketch of this dispatch logic is given below.
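To make the two-channel flow of Fig. 1 concrete, the following Python sketch reproduces the dispatch logic. It is an illustration, not the authors' implementation: the detector callables and the brightness-based "difficult sample" test are our assumptions, since the paper does not specify how difficult samples are identified.

import cv2

def is_difficult(img, brightness_thresh=60.0):
    """Heuristic stand-in for Fig. 1's 'difficult sample?' test: treat dark
    (night) images as difficult. The threshold value is our assumption."""
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).mean() < brightness_thresh

def detect(img, need_high_accuracy, run_oyolo, run_rfcn,
           preprocess_night, fuse_detections):
    """Dispatch one image through the two channels of Fig. 1. The detector
    functions are passed in because the actual models (Darknet weights,
    R-FCN) are external to this sketch."""
    if is_difficult(img):
        img = preprocess_night(img)                 # Sec. III-C
    if need_high_accuracy:
        return fuse_detections(run_oyolo(img), run_rfcn(img))  # accurate channel
    return run_oyolo(img)                           # fast channel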
A. Network Design

YOLO uses a single convolutional neural network and is not region-based; its greatest advantage is high speed. YOLO's particular convolutional network completes classification and location of multiple objects in an image at one time. This design, of course, sacrifices some accuracy, but it is still more accurate than many other object detection algorithms such as R-CNN. Inspired by FCN, in order to simplify the structure of the convolutional neural network in YOLO and reduce the amount of computation, we remove the last two fully-connected layers and add an average pooling layer instead. The network structure then resembles FCN.

FCN can only implement classification, and the location function of YOLO should be preserved. So we retain the grid-cell scheme of YOLO to locate objects.

First we divide the input image into S × S grid cells. Each grid cell predicts B bounding boxes and confidence scores for those boxes, and also predicts C conditional class probabilities.[4] We use YOLO's activation functions.

There are mainly three kinds of objects in the traffic scene: car, cyclist and pedestrian. So we change the class probabilities of the output layer to 3 categories. We use S = 7 and B = 2, like YOLO on VOC[4], and C = 3 for our three classes. The final output is a 7 × 7 × 13 three-dimensional array (per cell, 2 boxes × 5 values plus 3 class probabilities). Our network structure is shown in TABLE I, and a sketch of decoding this output follows the table.

TABLE I. NETWORK

layer  name     filters  size/stride  input         output
0      conv     64       7×7/1        448×448×3     448×448×64
1      maxpool           2×2/2        448×448×64    224×224×64
2      conv     192      3×3/1        224×224×64    224×224×192
3      maxpool           2×2/2        224×224×192   112×112×192
4      conv     128      1×1/1        112×112×192   112×112×128
5      conv     256      3×3/1        112×112×128   112×112×256
6      conv     256      1×1/1        112×112×256   112×112×256
7      conv     512      3×3/1        112×112×256   112×112×512
8      maxpool           2×2/2        112×112×512   56×56×512
9      conv     256      1×1/1        56×56×512     56×56×256
10     conv     512      3×3/1        56×56×256     56×56×512
11     conv     256      1×1/1        56×56×512     56×56×256
12     conv     512      3×3/1        56×56×256     56×56×512
13     conv     256      1×1/1        56×56×512     56×56×256
14     conv     512      3×3/1        56×56×256     56×56×512
15     conv     256      1×1/1        56×56×512     56×56×256
16     conv     512      3×3/1        56×56×256     56×56×512
17     conv     512      1×1/1        56×56×512     56×56×512
18     conv     1024     3×3/1        56×56×512     56×56×1024
19     maxpool           2×2/2        56×56×1024    28×28×1024
20     conv     512      1×1/1        28×28×1024    28×28×512
21     conv     1024     3×3/2        28×28×512     14×14×1024
22     conv     512      1×1/1        14×14×1024    14×14×512
23     conv     13       3×3/1        14×14×512     14×14×13
24     avgpool           2×2/2        14×14×13      7×7×13
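The 7 × 7 × 13 output can be decoded into detections as in the following Python sketch. We assume the common per-cell layout of B × 5 box values (x, y, w, h, confidence) followed by C class probabilities; the actual memory layout inside Darknet may order the fields differently.

import numpy as np

S, B, C = 7, 2, 3
CLASSES = ["car", "cyclist", "pedestrian"]

def decode(output, conf_thresh=0.2):
    """output: (S, S, B*5 + C) array. Returns (class, score, cx, cy, w, h)
    tuples with coordinates relative to the whole image."""
    detections = []
    for i in range(S):          # cell row
        for j in range(S):      # cell column
            cell = output[i, j]
            class_probs = cell[B * 5:]        # 3 conditional class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                scores = conf * class_probs   # class-specific confidence
                k = int(np.argmax(scores))
                if scores[k] >= conf_thresh:
                    # x, y are offsets within the cell; w, h are image-relative
                    cx, cy = (j + x) / S, (i + y) / S
                    detections.append((CLASSES[k], float(scores[k]), cx, cy, w, h))
    return detections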
i 0 cclasses
9 conv 256 1×1/1 56×56×512 56×56×256
10 conv 512 3×3/1 56×56×512 56×56×512
The biggest factor that influences detection accuracy of
YOLO is the location error. In YOLOv2[6], to improve the
11 conv 256 1×1/1 56×56×512 56×56×256 location accuracy, selecting anchor boxes manually is adopted.
12 conv 512 3×3/1 56×56×256 56×56×512 SSD was inspired by MultiBox[7], using a similar approach. In
order to improve the location accuracy, we consider the method
13 conv 256 1×1/1 56×56×512 56×56×256 selecting anchor boxes manually too. But YOLO is not an
14 conv 512 3×3/1 56×56×256 56×56×512 object detection algorithm based on regions. It uses only one
convolutional neural network to complete all the steps of object
15 conv 256 1×1/1 56×56×512 56×56×256 detection. We don’t want to destroy this characteristic of
16 conv 512 3×3/1 56×56×256 56×56×512 YOLO, so the method above is not applied in this paper. In
machine learning, combination of multiple classification
17 conv 512 1×1/1 56×56×512 56×56×512 algorithms can achieve better classification results. Of course,
18 conv 1024 3×3/1 56×56×512 56×56×1024 object detection follows the rule. Fast-RCNN is a typical object
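A minimal NumPy sketch of formula (1) is shown below. The tensor shapes and the two indicator masks are our assumptions for illustration; the actual training code is Darknet's C implementation.

import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.5, 0.5   # coordinate weight raised from YOLO's 5

def oyolo_loss(pred, truth, obj_mask, resp_mask):
    """pred/truth: dicts of arrays with keys x, y, w, h, conf of shape (S*S, B)
    and probs of shape (S*S, C).
    resp_mask (S*S, B): 1 where box j of cell i is responsible for an object.
    obj_mask  (S*S,):   1 where cell i contains an object."""
    coord = LAMBDA_COORD * np.sum(resp_mask * ((pred["x"] - truth["x"]) ** 2 +
                                               (pred["y"] - truth["y"]) ** 2))
    size = LAMBDA_COORD * np.sum(resp_mask * ((pred["w"] - truth["w"]) ** 2 +
                                              (pred["h"] - truth["h"]) ** 2))
    conf_obj = np.sum(resp_mask * (pred["conf"] - truth["conf"]) ** 2)
    conf_noobj = LAMBDA_NOOBJ * np.sum((1 - resp_mask) *
                                       (pred["conf"] - truth["conf"]) ** 2)
    cls = np.sum(obj_mask[:, None] * (pred["probs"] - truth["probs"]) ** 2)
    return coord + size + conf_obj + conf_noobj + cls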
The biggest factor that influences the detection accuracy of YOLO is the location error. In YOLOv2[6], manually selected anchor boxes are adopted to improve location accuracy, and SSD, inspired by MultiBox[7], uses a similar approach. We also considered selecting anchor boxes manually. But YOLO is not a region-based object detection algorithm: it uses only one convolutional neural network to complete all the steps of object detection, and we do not want to destroy this characteristic, so that method is not applied in this paper. In machine learning, a combination of multiple classification algorithms can achieve better classification results, and object detection follows the same rule. Fast-RCNN is a typical region-based detector: selective search selects candidate proposals, which are then classified and fine-tuned, but producing the proposals takes a lot of time. Faster-RCNN therefore uses an RPN (region proposal network)[1] to select proposals and improve speed. Later, the RPN was also applied to R-FCN, with an FCN for classification, so the whole object detection process of R-FCN is faster than Faster-RCNN's. We therefore combine our method with R-FCN, the faster variant of Faster-RCNN. The two algorithms run separately and in parallel; if the intersection of two bounding boxes in their outputs is greater than 0.4 times the total area of the two boxes, we take their intersection as the output box, and we take the average of the two algorithms' class probabilities as the class probabilities. NMS (non-maximum suppression) is used to post-process these results. A sketch of this fusion rule follows.
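The fusion rule can be made concrete as below. This is a sketch under our reading of the rule: "total area" is taken as the sum of the two box areas, each detection carries a class-probability vector, and the paper does not publish its fusion code.

import numpy as np

def intersection(a, b):
    """Intersection rectangle of boxes a, b given as (x1, y1, x2, y2),
    or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def fuse_detections(oyolo_dets, rfcn_dets, thresh=0.4):
    """Each detection: (box, class_probs). Keep the intersection box and the
    averaged class probabilities when the overlap test passes; standard NMS
    is applied to the fused results afterwards."""
    fused = []
    for box_a, probs_a in oyolo_dets:
        for box_b, probs_b in rfcn_dets:
            inter = intersection(box_a, box_b)
            if inter and area(inter) > thresh * (area(box_a) + area(box_b)):
                avg_probs = (np.asarray(probs_a) + np.asarray(probs_b)) / 2.0
                fused.append((inter, avg_probs))
    return fused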

C. Pre-processing

Because a large part of the images in traffic scenes are captured at night, these images are dim. At night the lights of most cars are on, causing partial highlights in an image. We can remove the highlights from the night images, then enhance the images' contrast and increase their brightness; after this pre-processing the images are delivered to our detection algorithm, and the performance of our method is greatly improved. In this paper, histogram equalization is used to pre-process night images; a minimal sketch is given below. More than 3% improvement in mAP is returned on our testing set with pre-processing.
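As an illustration of this step, here is a minimal OpenCV sketch. The paper does not specify the color space in which equalization is applied, so equalizing only the luma (Y) channel of YUV, which preserves colors, is our assumption.

import cv2

def preprocess_night(img_bgr):
    """Enhance the contrast and brightness of a dark traffic-scene image
    by histogram equalization on the luma channel."""
    yuv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])  # stretch the luma histogram
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)

# Usage: enhanced = preprocess_night(cv2.imread("night_sample.jpg"))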
IV. EXPERIMENTS

We install Darknet under Ubuntu 14.04 and use cuDNN for acceleration. Under the Darknet framework, we perform training and testing accelerated by an Nvidia Tesla K40 GPU.

A. Data Set

For machine learning, it is important to create a training set that fits the actual scenes. We add images captured by road cameras to the KITTI data set to obtain our data set. Studies show that when the ratio of training data to testing data is 9:1, the model performs better. There are 7451 images in total in the KITTI data set; we choose 6732 images as the training set, and the other 1519 images are treated as the testing set. We then label 1527 images manually: 1368 of them extend the training set, and the remaining 159, including some night images, are added to the testing set. So there are 8100 images in our training set and 900 images in our testing set.

Following the VOC directory structure, we first convert the annotations of our data set into XML format, and the file paths of the training and testing sets are written to train.txt and test.txt for convenience during training. In this paper, objects in traffic scenes are divided into three classes: car, cyclist and pedestrian, while there are 8 classes in the KITTI data set, so the classes need to be transformed. The 'Van', 'Truck', 'Tram' and 'Car' classes are merged into the 'car' class, 'Person_sitting' is merged into the 'pedestrian' class, and the 'cyclist' class is kept as it is. The 'Misc' and 'Dontcare' classes are directly ignored, as the background. A sketch of this mapping is given below.
B. Training Process

We train OYOLO by SGD (stochastic gradient descent). We use a batch size of 64, that is, 64 samples are processed in every iteration over the training set, and 105 epochs of training are taken. During training we move the learning rate from small to large and back to small. At the beginning, to make the output converge, the learning rate of the first epoch varies from 0.001 to 0.01. From the 2nd to the 65th epoch the learning rate is 0.01. To avoid strong oscillation, it is set to 0.001 from the 66th to the 95th epoch, and in the 96th to 105th epochs it is 0.0001. Referring to the size of the VOC training set and our training-set ratio, the number of training iterations is set to 42000. The momentum is set to 0.85 and a weight decay of 0.00055 is adopted. The initial weight parameters of OYOLO come from training results on ImageNet. Using our training set, training takes about 4 days on the GPU, after which the final object detection weight model our_method.weights is obtained. A sketch of the learning-rate schedule is given below.
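The piecewise learning-rate schedule can be written out as follows. The linear warm-up within the first epoch is our reading of "varies from 0.001 to 0.01"; the paper does not state the exact warm-up curve.

def learning_rate(epoch: int, progress_in_epoch: float = 0.0) -> float:
    """epoch is 1-based; progress_in_epoch is in [0, 1) within that epoch."""
    if epoch == 1:                      # warm-up: 0.001 -> 0.01
        return 0.001 + (0.01 - 0.001) * progress_in_epoch
    if epoch <= 65:
        return 0.01                     # main training phase
    if epoch <= 95:
        return 0.001                    # damp oscillation
    return 0.0001                       # final fine-tuning, epochs 96-105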
C. Results

The speeds of OYOLO and other object detection algorithms are shown in Figure 2. OYOLO takes 44 ms to process one image; it performs the fastest.

Fig. 2. Speeds of OYOLO and other object detection algorithms (time per image in ms; the measured values are 290, 273, 180, 90, 54 and 44 ms, with OYOLO at 44 ms).

We judge the model on single images. Results are displayed in Figure 3.

In order to improve object detection accuracy, we proposed the approach combining OYOLO and R-FCN. We adopt the model of R-FCN trained on VOC directly. When images are put into the channel where OYOLO and R-FCN are combined, we get about 3.3% improvement of mAP on our testing set. Of course, this channel's processing time is the maximum of OYOLO's and R-FCN's, which is almost equal to the time of R-FCN for processing one image. The mAP on our testing set of some well-known object detection algorithms and of our OYOLO + R-FCN approach are compared in TABLE II.

TABLE II. ACCURACY COMPARISON

method          mAP (trained on our training set)   mAP (trained on VOC07 + VOC12)
Fast-RCNN       -                                   60.9%
Faster-RCNN     -                                   62.4%
R-FCN           -                                   67.7%
YOLO            -                                   64.0%
SSD             -                                   64.5%
OYOLO           80.1%                               66.0%
OYOLO + R-FCN   83.4%                               69.3%

About 100 night images were added to the testing set. If these night images are pre-processed, we get another 3% improvement in mAP. The final mAP of our system is about 86.4% on our testing set.

Fig. 3. Object detection results of two images.

V. CONCLUSION

The best processing speed of our system is 44 ms per image, which is 18% faster than YOLO. The mAP of our system reaches 86.4% on our testing set, while R-FCN achieves only 67.7%.

Our method OYOLO, with its new CNN based on YOLO, is very fast while outperforming other object detectors. OYOLO and R-FCN, two excellent object detection algorithms, are combined to get better accuracy for images in traffic scenes. Our object detection system becomes more accurate after the pre-processing procedure for night images is added. Thus we have presented an object detection system based on YOLO in traffic scenes that is fast, accurate and robust.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China (No. 61002011), the 863 Program of China (No. 2013AA013303), and the Fundamental Research Funds for the Central Universities (No. 2013RC1104).

REFERENCES

[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6):1137-1149.
[2] Girshick R. Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), 2015.
[3] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014:580-587.
[4] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:779-788.
[5] Dai J, Li Y, He K, et al. R-FCN: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems (NIPS), 2016.
[6] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger. 2016.
[7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector. European Conference on Computer Vision (ECCV), 2016:21-37.
[8] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:770-778.
[9] Shrivastava A, Gupta A, Girshick R. Training region-based object detectors with online hard example mining. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:761-769.
[10] Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008:1-8.
[11] Felzenszwalb P F, Girshick R B, McAllester D, et al. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2010, 32(9):1627-1645.

