Chen et al. - 2018 - An Algorithm for Highway Vehicle Detection Based on Convolutional Neural Network
Abstract
In this paper, we present an efficient and effective framework for vehicle detection and classification from traffic surveillance cameras. First, we cluster the vehicle scales and aspect ratios in the vehicle datasets. Then, we use a convolutional neural network (CNN) to detect vehicles. We utilize feature fusion techniques to concatenate high-level and low-level features and detect vehicles of different sizes on different features. To improve speed, we adopt a fully convolutional architecture instead of fully connected (FC) layers. Furthermore, recent complementary advances such as batch normalization, hard example mining, and inception have been adopted. Extensive experiments on the JiangSuHighway Dataset (JSHD) demonstrate the competitive performance of our method. Our framework obtains a significant improvement over Faster R-CNN of 6.5% mean average precision (mAP). With 1.5 GB of GPU memory at the test phase, the speed of the network is 15 FPS, three times faster than Faster R-CNN.
Keywords: Vehicle detection, Convolutional neural network, k-means, Feature concatenation
Chen et al. EURASIP Journal on Image and Video Processing (2018) 2018:109 Page 2 of 7
SSD extracts candidate regions on a high-level feature map; the high-level feature has more semantic information but cannot locate objects well. (3) Vehicle detection requires high real-time performance, but Faster R-CNN adopts FC layers; it takes about 0.2 s per image with the VGG16 [13] network.

Aiming at these problems of the general object detection methods, we present an efficient and effective framework for vehicle detection and classification from traffic surveillance cameras. This method fuses the advantages of the two-stage and one-stage approaches. Meanwhile, we use some tricks such as hard example mining [14], data augmentation, and inception [15]. The main contributions of our work are summarized as follows:

1) We use the k-means algorithm to cluster the vehicle scales and aspect ratios in the vehicle datasets. This process can improve mean average precision (mAP) by 1.6%.
2) We detect vehicles on different feature maps according to their sizes.
3) We fuse the low-level and high-level feature maps, so the low-level feature map gains more semantic information.

Our detector is time and resource efficient. We evaluate our framework on the JiangSuHighway Dataset (JSHD) (Fig. 1) and obtain a significant improvement over the state-of-the-art Faster R-CNN by 6.5% mAP. Furthermore, our framework achieves 15 FPS on an NVIDIA TITAN XP, three times faster than the seminal Faster R-CNN.

2 Related work
In this section, we give a brief introduction to vehicle detection in traffic surveillance cameras. Vision-based vehicle detection algorithms can be divided into three categories: motion-based approaches, hand-crafted feature-based approaches, and CNN-based approaches.

Motion-based approaches include frame subtraction, optical flow, and background subtraction. Frame subtraction computes the differences between two or three consecutive frames to detect moving objects. It is characterized by simple calculation and adaptation to dynamic backgrounds, but it is not ideal for motion that is too fast or too slow. Optical flow [16] calculates the motion vector of each pixel and tracks these pixels, but this approach is complex and time-consuming. Background subtraction methods such as GMM are widely used in vehicle detection by modeling the distributions of the background and foreground [2]. However, these approaches cannot classify vehicles or detect still ones.

Hand-crafted feature-based approaches include Histogram of Oriented Gradients (HOG) [17], SIFT [18], and Haar-like features. Before the success of CNN-based approaches, hand-crafted feature approaches such as the deformable part-based model (DPM) [19] achieved state-of-the-art performance. DPM explores an improved HOG feature to describe each part of a vehicle, followed by classifiers like SVM and AdaBoost. However, hand-crafted features have limited representation power.

CNN-based approaches have shown rich representation power and achieved promising results [4–6, 9, 11]. R-CNN uses object proposals generated by selective search [20] to train a CNN for detection tasks. Under the R-CNN framework, SPP-Net [7] and Fast R-CNN [5] speed up detection by generating region proposals on the feature map; these approaches need to compute the features only once. Faster R-CNN [4] uses a region proposal network instead of selective search, so it can be trained end to end, and both speed and accuracy improve. R-FCN [8] tries to reduce the computation time with position-sensitive score maps. Considering their high efficiency, one-stage approaches have attracted much more attention recently. YOLO [9] uses a single feed-forward convolutional network to directly predict object classes and locations, which is extremely fast. SSD [11] extracts anchors of different aspect ratios and scales on multiple feature maps. It obtains competitive detection results at higher speed; for example, SSD runs at 58 FPS on an NVIDIA TITAN X for 300 × 300 input, nine times faster than Faster R-CNN.
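As a concrete reference for the frame-subtraction baseline described above, a minimal NumPy sketch follows; the frame size and the intensity threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Binary motion mask from two consecutive grayscale frames.

    The threshold (25 intensity levels here) is an illustrative choice;
    in practice it is tuned to the scene.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Toy example: a bright "vehicle" patch appears between two frames.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame[2:4, 2:4] = 200  # object present only in the second frame

mask = frame_difference_mask(prev_frame, curr_frame)
print(mask.sum())  # 4 pixels flagged as motion
```

As the related work notes, such a mask marks only changed pixels, which is why still vehicles are invisible to this family of methods.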
Fig. 2 Clustering vehicle sizes and aspect ratios on JSHD. Left: the result for aspect ratios. Right: the result for vehicle sizes
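The clustering shown in Fig. 2 can be sketched as a plain k-means over scalar box statistics; the sample aspect ratios, the value of k, and the 1-D simplification below are illustrative assumptions (the paper runs the clustering on JSHD annotations to obtain five vehicle sizes and three aspect ratios).

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Plain k-means on scalars (e.g., box aspect ratios or areas)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initialize from the data
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Recompute centers as cluster means; keep old center if empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

# Illustrative aspect ratios (width / height) from hypothetical annotations.
ratios = [0.8, 0.9, 1.0, 1.1, 2.3, 2.4, 2.5, 4.0, 4.2]
print(kmeans_1d(ratios, k=3))  # three representative aspect ratios
```

The cluster centers then serve directly as the reference-box shapes of Sect. 3.4, replacing hand-picked anchor settings.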
size vehicles on different-size feature pyramid layers, and this can improve detection accuracy.

3.4 Reference boxes
In order to detect vehicles of different sizes on different feature maps, we generate candidate proposals on each feature map. As we all know, different feature maps have different receptive field sizes; a low-level feature map has smaller receptive fields than a high-level one. We run k-means on JSHD to get five vehicle sizes and three aspect ratios. The width and height of each box are computed with respect to the aspect ratio. In total, there are two scales and three aspect ratios at each feature map location. Finally, we combine them together and use non-maximum suppression (NMS) to refine the results.

3.5 Inception
We use some new techniques to improve detection results, including inception and batch normalization. We employ the inception module in our network; in this paper, we just use the simplest structure, InceptionV1, as shown in Fig. 4. The inception module can improve the feature representation and detection accuracy. Batch normalization leads to significant improvements in convergence while training. By adding batch normalization on all feature layers in our network, we obtain more than 1.9% improvement in mAP. Batch normalization also helps regularize the model.

we adopt data augmentation. We mainly use two data augmentation methods to construct a robust model that adapts to vehicle variation. One is flipping the input image. The second is randomly sampling a patch whose edge length is {0.5, 0.6, 0.7, 0.8, 0.9} of the original image and within which at least one vehicle's center lies. Please refer to [11] for more details.

4.2 Hard negative mining
After the matching step, most of the default boxes are negatives; similar to [11], we use hard negative mining to mitigate the extreme class imbalance. At each mini-batch, we keep the ratio between negatives and positives at most 3:1, instead of using all negative anchors in training.

4.3 Loss function
We use a multi-task loss function to jointly train our network end to end. The loss function consists of two parts, the loss of classification and the loss of bounding box regression. We denote the loss of bounding box regression
with Lloc. It optimizes the smoothed L1 loss between the predicted bounding box locations t = (tx, ty, tw, th) and the target offsets t∗ = (t∗x, t∗y, t∗w, t∗h). We adopt the parameterizations of the four coordinates as follows (2):

tx = (x − xa)/wa,    ty = (y − ya)/ha,
tw = log(w/wa),      th = log(h/ha),
t∗x = (x∗ − xa)/wa,  t∗y = (y∗ − ya)/ha,
t∗w = log(w∗/wa),    t∗h = log(h∗/ha),        (2)

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, xa, and x∗ are for the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, and h).

The loss of classification is denoted as Lcls, which is computed by a Softmax over K + 1 outputs for each anchor.

Therefore, our loss function is defined as follows (3):

L = (1/Ncls) Lcls + λ (1/Nloc) Lloc        (3)

where λ balances the two parts. We consider the two parts equally important, and λ = 1 is set to compute the multi-task loss.

5 Experiments, results, and discussion
To evaluate the effectiveness of our network, we conduct experiments on our vehicle dataset (JSHD). The experiments are implemented on Ubuntu 16.04 with a GPU (NVIDIA TITAN XP) and an i7 7700 CPU.

5.1 Dataset
There are a large number of various types of vehicles in highway surveillance video, and it is suitable for traffic video analysis because of the large and long view of the road. So we construct a new dataset from 25 videos of the JiangSu highway, Jiangsu province, which we call JSHD, as shown in Figs. 1 and 5. The dataset contains 5000 frames captured from the videos. The vehicles are classified into four categories (bus, minibus, car, truck). We do not care about vehicles that are too small; specifically, vehicles whose height is less than 20 pixels are ignored. We use a random 3000 frames to train our network and the remaining 2000 frames to test.

5.2 Optimization
During the training phase, stochastic gradient descent (SGD) is used to optimize our network. We initialize the parameters of all newly added layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. We set the learning rate to 0.001 for the first 60k iterations and 0.0001 for the next 60k iterations. The batch size is 16 for a 320 × 320 model. We use a momentum of 0.9 and a weight decay of 0.005.

5.3 Results
To evaluate the effectiveness of our proposed network, we compare it to state-of-the-art detectors on JSHD. Table 1 shows the results of our experiments. It is obvious that our network outperforms the other algorithms: we achieve a significant overall improvement of 6.5% mAP over the state-of-the-art Faster R-CNN and 6.9% over SSD300. It is clear that the precision for the car class is lower than that for the bus and truck because the deep learning detection

Table 1 Results on the JSHD

Models         Overall (mAP)   Car    Bus    Minibus   Truck
Fast R-CNN     67.2            53.6   83.2   62.5      69.5
Faster R-CNN   69.2            55.2   85.4   64.7      71.3
YOLO           58.9            46.6   75.2   53.4      60.5
SSD300         68.8            54.4   85.1   64.3      71.5
SSD512         71.2            57.4   87.2   66.8      73.4
RON320 [25]    73.6            60.2   89.5   69.1      75.5
Ours           75.7            62.4   91.8   71.3      77.3
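The box parameterization of Eq. (2) and the multi-task loss of Eq. (3) can be sketched as follows; the smoothed L1 form and the toy boxes below are illustrative assumptions consistent with the text, not values from the paper.

```python
import math

def encode_box(box, anchor):
    """Eq. (2): offsets (tx, ty, tw, th) of a box relative to an anchor.

    Boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def smooth_l1(pred, target):
    """Smoothed L1 loss summed over the four offsets."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multi_task_loss(l_cls, n_cls, l_loc, n_loc, lam=1.0):
    """Eq. (3): L = (1/Ncls) * Lcls + lambda * (1/Nloc) * Lloc."""
    return l_cls / n_cls + lam * l_loc / n_loc

# Toy example with one anchor, one prediction, and one ground-truth box.
anchor = (50.0, 50.0, 20.0, 40.0)
pred_box = (52.0, 51.0, 22.0, 38.0)
gt_box = (54.0, 50.0, 24.0, 40.0)

t_pred = encode_box(pred_box, anchor)
t_star = encode_box(gt_box, anchor)
l_loc = smooth_l1(t_pred, t_star)
print(multi_task_loss(l_cls=0.7, n_cls=1, l_loc=l_loc, n_loc=1))
```

Encoding offsets relative to the anchor (rather than regressing raw pixels) keeps the regression targets roughly scale-invariant, which is why the same head can serve anchors of very different sizes.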
Table 2 Module ablation analysis

                                   Faster R-CNN   Ours
k-means?                                          √      √      √      √
Feature concatenate?                                     √      √      √
Detecting on different features?                                √      √
Batch normalization?                                                   √
JSHD (mAP)                         69.2           70.8   72.8   73.8   75.7

ongoing, and those data and materials are still required by the author and co-authors for further investigations.

Authors' contributions
LC and YR designed the research. FY, HF, and QC analyzed the data. LC wrote and edited the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
We approved.