


Vehicle Detection with Sub-Class Training using R-CNN
for the UA-DETRAC Benchmark

Sitapa Rujikietgumjorn, Nattachai Watcharapinchai


National Electronics and Computer Technology Center (NECTEC), Thailand
{sitapa.rujikietgumjorn,nattachai.watcharapinchai}@nectec.or.th

Abstract

Different types of vehicles, such as buses and cars, can be quite different in shape and detail. This makes it difficult to learn a single feature vector that can detect all types of vehicles using a single object class. We propose an approach that performs vehicle detection with sub-class learning using R-CNN in order to improve detection performance. Instead of using a single object class, which is "vehicle" in this experiment, to train the R-CNN, we use multiple sub-classes of vehicles so that the network can better learn the features of each individual type. In the experiment, we also evaluate the result of using a transfer learning approach to adapt pre-trained weights to a new dataset.

Figure 1: Examples of detected results

1. Introduction

Vehicle detection is one of the essential processes for many vision-based traffic surveillance applications. The location, or bounding box, of a vehicle must be extracted from the traffic image. This detected location of a vehicle in an image can be further used in several applications such as vehicle tracking or counting. The cropped image of a vehicle can also be used for vehicle type and model classification.

Several challenges occur in the vehicle detection domain. Occlusion is one major challenge that can largely reduce detection performance, especially when a vehicle is highly occluded. The videos from traffic surveillance systems are usually acquired in practical environments, so weather and lighting conditions can also affect detection performance. Shadows from vehicles and other objects can be an obstacle to detection as well. Variations in shape across different orientations also make vehicle detection more difficult; for example, the front of a vehicle is visually different from its side.

A vehicle is a rigid object, and several features have been used for vehicle detection, such as Aggregate Channel Features (ACF) [3]. Since a vehicle has many distinctive parts, part-based features such as the Deformable Part Model (DPM) [4] or Linked Visual Words [9] are also used for vehicle detection and classification. Recently, convolutional neural networks have been widely used for object classification and detection, and they have been applied to vehicle detection and classification [11].

In this paper, a convolutional neural network based on the R-CNN architecture is used for vehicle detection. We propose to use multiple sub-classes of vehicles instead of a single vehicle class for training. A transfer learning approach is also evaluated for fine-tuning a pre-trained model to be used with a traffic surveillance dataset.

2. Transfer Learning for the Convolutional Neural Network

Training a convolutional neural network from scratch takes time and needs a sufficiently large dataset; an insufficient amount of data may lead to overfitting. On the other hand, several pre-trained models are available, but it may not be feasible to apply them directly to another dataset. Figure 2 shows detection results on the UA-DETRAC dataset when directly using a model trained on the COCO dataset [7], which has 90 object classes including vehicles. There are still many missed detections in the result, as well as incorrect detections such as cars classified as boats or cell phones. Nevertheless, these pre-trained weights are still useful and can be used for weight initialization when fine-tuning the network on a new dataset using a transfer learning approach.

Figure 2: Examples of vehicle detection results using models trained on the COCO dataset without transfer learning
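To make the weight-initialization idea concrete, the following minimal sketch (not the authors' implementation, which fine-tunes a COCO pre-trained Faster R-CNN detector) uses present-day tf.keras to initialize a ResNet-50 backbone from ImageNet weights and attach a classification head over the four vehicle sub-classes introduced in Section 3; the commented alternative corresponds to the small-dataset scenarios discussed below.

import tensorflow as tf

NUM_SUBCLASSES = 4  # car, van, bus, others (the sub-classes described in Section 3)

# The pre-trained weights serve only as an initialization, not as a fixed model.
backbone = tf.keras.applications.ResNet50(weights="imagenet",
                                          include_top=False, pooling="avg")
backbone.trainable = True    # "large and similar dataset": fine-tune the whole network
# backbone.trainable = False # "small dataset" scenarios: use the CNN as a fixed feature extractor

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_SUBCLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # fine-tune on the new dataset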

Two main factors affect the choice of transfer learning approach: the size of the dataset and its similarity to the dataset used for the pre-trained network. There are four major scenarios for transfer learning and fine-tuning.

Small and similar dataset
When the dataset is small, it is not advisable to fine-tune the weights, as this may result in overfitting. The CNN can instead be used as a feature extractor, since the higher-level features of the network are still relevant to the new dataset [12], and a linear classifier is trained on these CNN features.

Small and different dataset
Since the new dataset is different, the higher-level features, which are more specific to the original dataset, may not be suitable. It is better to use the lower-level features from the early layers, because they are more generic, and to train the linear classifier on these lower-level features.

Large and similar dataset
With sufficient data, we can fine-tune the weights of the pre-trained network via backpropagation. The pre-trained weights are used as weight initialization for fine-tuning to the new dataset.

Large and different dataset
Since the datasets are different and there is sufficient data, we can train the network from scratch. However, it may still be beneficial to initialize the weights from the pre-trained model instead of using random weight initialization.

The UA-DETRAC dataset is large enough for fine-tuning, and some classes in the COCO dataset are similar to those in UA-DETRAC. The COCO dataset has 90 object categories, including the vehicle classes car, bus, truck, motorcycle, and bicycle. In contrast, the UA-DETRAC dataset has four vehicle classes in its annotation: car, bus, van, and other. For our approach, we use the pre-trained weights as an initialization to fine-tune the network on the new dataset. These pre-trained weights were trained on the COCO dataset. Since our test dataset (UA-DETRAC) is similar to some categories in the COCO dataset, we used a COCO pre-trained model [6] for fine-tuning the full network with the UA-DETRAC dataset.

3. Sub-Classes Training Using R-CNN

Our algorithm is based on the Faster R-CNN architecture [8] with a residual network. In our experiment, we fine-tuned the R-CNN with a 101-layer residual network on the UA-DETRAC dataset for vehicle detection. Instead of using one output class of vehicle, we sub-categorized the vehicle class into four classes: car, van, bus, and others. Example images of these four sub-classes are shown in Figure 3. The sub-classes act as prior knowledge that helps single-object detection when the object varies considerably in shape and detail. This sub-class technique should help the network to better learn the features of each vehicle type.

Figure 3: Example images of the four sub-classes of vehicles: (a) car, (b) van, (c) bus, (d) other

In addition to vehicle type, the orientation of a vehicle can also differ considerably, for example frontal versus side view. These orientations can be used as prior knowledge as well. Figure 4 shows example images of the eight vehicle orientations we used in the experiment.

Figure 4: Example images of eight vehicle orientations
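As an illustration of how such a sub-class label set can be encoded, the sketch below maps vehicle types and orientation bins to integer training labels. This is illustrative only; the orientation bin names are assumptions, since the paper does not name its eight bins.

# Hypothetical label encoding for the sub-class training described above.
VEHICLE_TYPES = ["car", "van", "bus", "others"]
ORIENTATIONS = ["front", "front-right", "right", "rear-right",
                "rear", "rear-left", "left", "front-left"]  # assumed bin names

def subclass_label(vehicle_type, orientation=None):
    """Map a vehicle type (and optionally an orientation bin) to an integer
    training label: 4 classes, or 4 x 8 = 32 classes when orientation is used."""
    t = VEHICLE_TYPES.index(vehicle_type)
    if orientation is None:
        return t
    return t * len(ORIENTATIONS) + ORIENTATIONS.index(orientation)

def label_to_type(label, with_orientation=False):
    """Recover the vehicle type from a predicted label, e.g. when reporting
    every sub-class detection simply as a vehicle."""
    if with_orientation:
        label //= len(ORIENTATIONS)
    return VEHICLE_TYPES[label]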
4. Experimental Results

In our experiment, the algorithms were evaluated according to the submission rules specified in the UA-DETRAC benchmark [10]. In this section, we give a detailed comparison between our proposed method and the baseline methods.

4.1. Dataset

The dataset used in the experiment is the UA-DETRAC dataset, a real-world multi-object detection and multi-object tracking benchmark. It consists of 10 hours of traffic video sequences at various challenge levels, with variations in scale, pose, and illumination, night-time and daytime scenes, occlusion, and background clutter. The dataset is divided into a training and a test set with 60 and 40 sequences respectively. In our experiment, we divided the original training set into a training and a validation set. Four sequences from the training set were selected as our validation set: MVI 39851, MVI 39861, MVI 40161, and MVI 40131. In the 2017 UA-DETRAC challenge, the test set was divided into two levels, beginner and experienced. We participated in the beginner level, which had 10 test video sequences. The test set must be evaluated through the UA-DETRAC challenge submission server.

4.2. Experimental Environment

The hardware used in the experiment was two Intel Xeon E5-2630v3 CPUs with 384 GB of RAM and two Nvidia Tesla K80 GPUs. The operating system was Ubuntu 16.04 (64-bit). The code was developed using Python 2.7 and TensorFlow [1].

4.3. Evaluation Metric

Average precision (AP) is used for our detection evaluation. The proposed method, in several variations, was evaluated on the validation set and compared with the baseline methods. This validation set was used to select the best variation of our proposed method for submission to the UA-DETRAC challenge server for further evaluation on the test set.

4.4. Detection Results

Our proposed method is compared with four baseline methods: DPM [4], ACF [3], R-CNN [5], and CompACT [2]. The average precision scores on the validation set are shown in Table 1. For our method, we evaluated our R-CNN that used transfer learning (TL) with one class and with several sub-classes. For the 4 sub-classes, the vehicle types car, van, bus, and other are assigned to the outputs of the R-CNN. To further evaluate the performance when increasing the number of sub-classes, we also used 4 sub-classes combined with 8 vehicle orientations, resulting in 32 sub-classes in total. The results show that our R-CNN with 4 sub-classes improves on the R-CNN with one vehicle class by almost 5%. However, the performance decreases when using 4 sub-classes and 8 vehicle orientations. Using too many sub-classes with similar features, such as slightly different views of a vehicle in this case, does not improve performance. The sub-classes should therefore be sufficiently distinct to gain performance.

Method | Validation Set
DPM [4] | 29.21%
ACF [3] | 72.01%
R-CNN [5] | 74.36%
CompACT [2] | 76.43%
R-CNN + TL with 1 class | 88.89%
R-CNN + TL with 4 sub-classes | 93.70%
R-CNN + TL with 4 sub-classes and 8 orientations | 90.08%

Table 1: Average precision (AP) scores of all comparison methods on the validation set.

In the experiment, we also compared our method that used transfer learning from a model trained on an especially large dataset with many object classes against one trained from scratch (no transfer learning). The results in Table 2 show that average precision is much higher when transfer learning from a good model and fine-tuning on a new dataset than when training from scratch on the new dataset.

One issue with using multiple sub-classes instead of a single class for single-object detection is that detection bounding boxes of different sub-classes overlap, as shown in Figure 5 (left). From Table 1, the performance of using 4 sub-classes and 8 orientations decreases compared to using just 4 sub-classes, which is the result of an increase in false positives. However, in the experiment, performing non-maximum suppression (NMS) to remove the overlapping bounding boxes does not increase detection performance. Table 3 shows a slight decrease in average precision when performing non-maximum suppression on the case of 4 sub-classes and 8 orientations. The reason for the decrease is that non-maximum suppression also removes correct bounding boxes that happen to overlap, as shown by the yellow bounding box in Figure 5 (right).

The performance of our R-CNN with four sub-classes on the test set is shown in Table 4. This result was evaluated by the UA-DETRAC submission server. For the beginner level, the test set does not include the medium and hard sequences. Our proposed method outperforms the baseline methods, and our overall performance is 93.43%. Various conditions are also evaluated separately: cloudy, night, rainy, and sunny. Figure 6 shows detection results on the test set, where different bounding box colors indicate different sub-classes. In row 3, all results of the transfer learning approaches (columns 2-4) have a traffic sign incorrectly detected as a vehicle, whereas the scratch training (column 1) does not have this incorrect detection. The result image in row 4, column 3 shows multiple vehicle types assigned to the same object.
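For reference, the suppression step discussed above corresponds to the standard greedy NMS procedure; the NumPy sketch below is an illustrative implementation, not the code used in the experiments.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. boxes: NumPy array of [x1, y1, x2, y2]
    rows; scores: NumPy array of confidences. Applied across all sub-classes,
    it can also discard correct boxes of genuinely overlapping vehicles,
    consistent with the drop reported in Table 3."""
    order = scores.argsort()[::-1]            # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top-scoring box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]     # keep only boxes that overlap little
    return keep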
Method | Validation Set
R-CNN + no TL with 1 class | 57.08%
R-CNN + TL with 1 class | 88.89%
R-CNN + no TL with 4 sub-classes and 8 orientations | 36.31%
R-CNN + TL with 4 sub-classes and 8 orientations | 90.08%

Table 2: Average precision (AP) scores comparing our methods with transfer learning (TL) and without transfer learning.

Method | Validation Set
R-CNN + TL with 4 sub-classes and 8 orientations | 90.08%
R-CNN + TL with 4 sub-classes and 8 orientations + NMS | 88.23%

Table 3: Average precision (AP) scores comparing our method with and without the non-maximum suppression algorithm.

Figure 5: Example cases of multiple overlapping detections on the same object (left) and multiple overlapping detections on different objects (right).

5. Conclusions

Sub-class learning using R-CNN to improve the performance of vehicle detection is presented in this paper. Instead of using a single vehicle class for vehicle detection, we use multiple sub-classes of vehicles so that the R-CNN can better learn the features of each individual type. Nevertheless, the sub-classes should be sufficiently distinct to gain performance, since using too many sub-classes with similar features, such as vehicle orientation, does not improve performance. There are several approaches to transferring pre-trained weights to a new dataset. In the experiment, we compared the result of using transfer learning with the result of training from scratch; the results show that transfer learning performs better.

There is still an issue of multiple overlapping bounding boxes from different sub-classes on the same object. As future work, the network should be modified and improved so that a single sub-class type is output in a particular region. The case of fully overlapping bounding boxes with only partial occlusion of the object is another interesting point that may be improved by a better bounding box suppression method.

References

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Z. Cai, M. Saberian, and N. Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3361-3369, Dec 2015.
[3] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532-1545, Aug 2014.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, Sept 2010.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, June 2014.
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[7] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
[8] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
[9] N. Watcharapinchai, S. Aramvith, and S. Siddhichai. Automatic vehicle classification using linked visual words. Journal of Electronic Imaging, 26(4):043009, 2017.
[10] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu. UA-DETRAC: A new benchmark and protocol for multi-object tracking. CoRR, abs/1511.04136, 2015.
[11] L. Yang, P. Luo, C. C. Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3973-3981, June 2015.
[12] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? CoRR, abs/1411.1792, 2014.
Method | Overall | Easy | Medium | Hard | Cloudy | Night | Rainy | Sunny
DPM [4] | 25.70% | 34.42% | 30.29% | 17.62% | 24.78% | 30.91% | 25.55% | 31.77%
ACF [3] | 46.35% | 54.27% | 51.52% | 38.07% | 58.30% | 35.29% | 37.09% | 66.58%
R-CNN [5] | 48.95% | 59.31% | 54.06% | 39.47% | 59.73% | 39.32% | 39.06% | 67.52%
CompACT [2] | 53.23% | 64.84% | 58.70% | 43.16% | 63.23% | 46.37% | 44.21% | 71.16%
R-CNN + TL with 4 SC (ours) | 93.43% | 93.43% | N/A* | N/A* | 96.69% | 92.54% | 87.30% | 94.47%

Table 4: Average precision (AP) scores on the test set from the UA-DETRAC submission server (*the beginner level does not include the medium and hard test sets).

Figure 6: Detection result images on the test set. Different bounding box colors indicate different sub-classes. Columns: (a) R-CNN + no TL with 1 class, (b) R-CNN + TL with 1 class, (c) R-CNN + TL with 4 sub-classes, (d) R-CNN + TL with 4 sub-classes and 8 orientations.
