Vehicle Counting Using Deep Learning Models
Article in International Journal of Advanced Computer Science and Applications · January 2020
DOI: 10.14569/IJACSA.2020.0110784
Azizi Abdullah
Universiti Kebangsaan Malaysia
All content following this page was uploaded by Azizi Abdullah on 24 May 2021.
Abstract—Recently, there has been a shift to deep learning architectures for better application in vehicle traffic control systems. One popular deep learning library used for detecting vehicles is TensorFlow. In TensorFlow, the pre-trained model is very efficient and can be transferred easily to solve other similar problems. However, inconsistency between the original dataset used in the pre-trained model and the target dataset for testing can lead to low-accuracy detection and hinder vehicle counting performance. One major obstacle in retraining deep learning architectures is that the network requires a large training corpus to secure good results. Therefore, we propose to perform data annotation and transfer learning from an existing model to construct a new model for vehicle detection and counting in real-world urban traffic scenes. Then, the new model is compared with the experimental data to verify its validity. Besides, this paper reports some experimental results, comprising a set of innovative tests to identify the best detection algorithm and system performance. Furthermore, a simple vehicle tracking method is proposed to aid the vehicle counting process in challenging illumination and traffic conditions. The results showed a significant improvement of the proposed system, with an average vehicle counting accuracy of 80.90%.

Keywords—CNN; transfer learning; deep learning; object detection; vehicle detection

I. INTRODUCTION

In the earlier days before the rise of machine learning, the process of vehicle counting was done manually. It was performed by a person standing by the roadside and using an electronic device to record the data in a tally sheet. In some cases, the person may do the counting by observing video footage captured by city cams or closed-circuit television (CCTV) placed above the road or highway. According to a study in [1], manual vehicle counting is 99% accurate. This investigation is based on manual counting of various vehicles from a 5-minute video recording. Although the manual method provides high accuracy, it requires an extensive amount of human resources. Besides, it tends to be error-prone, especially in heavy traffic flow and on multi-lane roads. Therefore, manual counts are usually performed with only a small sample of data, and the results are extrapolated for the whole year or season for long-term forecasts.

Vision-based vehicle detection through highly cluttered scenes is difficult. At present, this approach can be categorized into traditional and complex deep learning methods. Recently, deep learning networks (DLN) based on convolutional neural networks (CNN) have obtained state-of-the-art performance on many machine vision tasks. Therefore, researchers began to use them for vehicle detection and counting. A deep learning architecture learns categories gradually through its hidden layers. For example, in face image recognition, it starts with identifying low-level features such as bright and dark areas and then proceeds to recognize lines and shapes for facial recognition. Each neuron or node represents one feature, and the combination of those nodes gives a full representation of the image. The hidden node or layer is represented by a weight value that influences the outcome (output), and this value can be changed during the learning process. All these layers are learned in hierarchical order, and it is crucial to determine the high-level features of the data to make an accurate decision. The overall approach has shown high accuracy in classifying objects. Zhang et al. [2] proposed a vehicle counting system that utilized a deep learning network. The system was implemented for static images and detects vehicles in every frame. However, that paper gives no information about the flow of moving vehicles.

In the literature, many works have utilized pre-trained DLN models via transfer learning methodology for vehicle detection. The authors of [3] used a pre-trained model, i.e. YOLO, via transfer learning for vehicle counting. The model is trained using the standard MS-COCO dataset. After that, the researchers re-trained the model on different datasets, namely PASCAL VOC 2007, KITTI and a custom annotated dataset. The mean average precision of detection is around 75%, achieved on an 80-20 train-test split using 5562 video frames from four different highway locations. Another study on vehicle counting used MobileNet [4], with a model pre-trained on the ImageNet dataset at a size of 224×224 pixels for each image. With a limited set of training images, the accuracy of vehicle detection was 97.4%. As for traffic volume estimates, or counting accuracy at the intersections, it was 78%. There were two crucial observations in this study. First, the performance was unsatisfactory in cases of highly overlapping vehicles, such as occlusion, due to partial information. Second, the detection performance at night or under very low-illumination conditions was also poor. The authors of [5] proposed to use YOLOv3 Darknet-53 for a vehicle detection and counting system. The results have shown that DLN can provide higher detection and counting accuracies, especially for the detection of small vehicle objects.

Following this, some studies have been conducted to compare various available CNN models as the detector for vehicle counting systems in general, such as [6], [7] and [8], to name a few. There are also studies specifically on using deep learning
www.ijacsa.thesai.org 697 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020
models for vehicle counting systems such as [9], [3] and [10]. Each study has varying results which highlight the strengths and weaknesses of each pre-trained model. It seems that the model's performance is highly correlated with the local dataset and the characteristics of the vehicle movement. Therefore, there is no single CNN detector model that fits all situations and provides the optimal detection result. The work in [8] presents a comparative study of CNN detector models using the TensorFlow deep learning library, which provides portability and ease of use. They used the COCO dataset for evaluation.

The general availability of many pre-trained deep learning models might ease the implementation of an automated vehicle counting system. But the main challenge is to identify the best model from among sets of similar pre-trained models that can perform well on the intended datasets. A direct comparison to determine the optimal model is difficult due to the different environment settings used in experiments. Thus, a fair comparison using a similar environment for performance evaluation is needed. One possible problem with the performance of pre-trained models is the use of standard benchmark datasets for training and a completely different dataset for testing. It is a common experience for the user to get poor results from the query of desired objects. Thus, instead of using the benchmark datasets, one needs to re-train the model on other large custom data for networks to learn patterns optimally [11]. But re-training on large data can be costly and time-consuming for deployment [12]. For example, training a deep learning algorithm on huge datasets is time-consuming and computing-intensive to secure good performance results. Therefore, one possible solution is to use a pre-trained model and transfer learning for better weight scaling and convergence speed-ups. Thus, inspired by the work of [13], a set of images with different illumination is used to re-train the existing model via transfer learning. For vehicle counting, a simple method is proposed, where the coordinate locations for each vehicle are detected in every frame. The Euclidean distance is computed between frames of a given video sequence for tracking and trajectory estimation. A virtual reference line is constructed, and the vehicle is counted if it crosses the line.

The contribution of this paper is as follows: (1) we compare the most widely used models from TensorFlow's object detection model zoo, namely Faster R-CNN, SSD and YOLOv3, for a vehicle counting application on urban traffic volumes. (2) we demonstrate the effectiveness of using a data annotation tool for vehicle detection via transfer learning. TensorFlow's detection model zoo, trained on standard datasets such as COCO alone, is not the best to describe real-world vehicle traffic conditions, but re-training the model efficiently can enhance its ability in detecting features. (3) we propose a simple vehicle counting system that uses a virtual reference line and Euclidean distance for tracking and trajectory estimations.

The rest of the paper is organized as follows. Section 2 describes the fundamental principles of deep neural networks and their application to object detection. Section 3 describes our system for vehicle counting with a focus on TensorFlow's object detection model zoo with simple tracking and counting algorithms. Experimental results on the urban traffic volumes under different conditions, i.e. morning, day and night, are shown and discussed in Section 4. Section 5 concludes the paper.

II. RELATED WORK

With recent advancements in deep learning, computer vision applications such as object classification and detection can be developed and deployed more effectively. These applications have shown significant performance improvements, enabling real-time processing of streaming data for analytics and decision making.

A. Deep Neural Networks

Deep architectures are useful in learning and have shown impressive performance, for example in the classification of digits in the MNIST dataset [14] and of objects in CIFAR [15] and ImageNet [16]. In this scheme, the lowest layers, i.e. feature detectors, are used to detect simple patterns. After that, these patterns are fed into deeper, following layers that form more complex representations of the input data. There are several approaches to learn deep architectures. One of the most frequently used in computer vision is convolutional neural networks (CNNs), where the networks preserve the spatial structure of the problem by learning internal feature representations using small squares of input data. Features are learned and used across the whole image, allowing the objects in the images to be shifted or translated in the scene and still be detectable by the network. This is one of the reasons why deep architecture is so useful for object recognition, such as picking out digits, faces, objects and so on under different challenging conditions. Thus, to get a good classification result, the network is trained with a vast number of images, such as using ImageNet [16] as the dataset to classify pictures. Besides the classification task, the deep architecture is widely used for object detection, which draws a bounding box around each object of interest in the image and assigns it a class label. The bounding box indicates the position and scale of every instance of each object category. There are several approaches to object detection in computer vision such as Faster R-CNN, YOLO and SSD.

Typically, deep convolutional neural network models may take days or even weeks to train on huge datasets for good performance. A way to reduce the training time is to re-use the model weights from pre-trained models that were trained using millions of natural images, such as those from the ImageNet dataset. Such a methodology is called transfer learning. In this technique, the constructed models can be downloaded and used directly, whereby a neural network model is first trained on a problem similar to the one we have chosen. One or more layers from the pre-trained model are then used in a new model trained on the problem of interest. The pre-trained model has the advantage that it has already learned a rich set of image features. Besides, the model is transferable to the new task by fine-tuning the network. In this case, the model can be re-trained on a small number of images such that the network weights are slightly adjusted to support the new task. Thus, it has the benefit not only of decreasing the training time for a neural network model but also of possibly lowering the generalization error. For example, [17] uses ImageNet-initialized models for object detection on the Pascal VOC dataset challenge, and [18] uses ImageNet-initialized models for semantic segmentation. Other works have utilized the ImageNet dataset for training deep learning models for image classification, such as [19] and [16].
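The re-use of pre-trained layers described in Section II-A can be illustrated with a deliberately tiny sketch. It is not the paper's training pipeline and uses no deep learning library: `FROZEN_WEIGHTS` is a hypothetical stand-in for layers copied from a pre-trained model, and only a new output layer (here a logistic-regression head, an assumption made for brevity) is fitted on a handful of labelled samples.

```python
import math

# Hypothetical stand-in for the layers re-used from a pre-trained model.
# These weights are fixed and are never updated during re-training.
FROZEN_WEIGHTS = [(0.8, -0.3), (0.2, 0.9)]

def frozen_features(x):
    """Map a 2-D input to two features using the frozen weights."""
    return [math.tanh(w0 * x[0] + w1 * x[1]) for w0, w1 in FROZEN_WEIGHTS]

def predict(head, x):
    """Probability of class 1 from the new head applied to the frozen features."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, frozen_features(x))) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit only the new output layer by stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            # Gradient of the log-loss with respect to the pre-activation z.
            error = predict((weights, bias), x) - y
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

# A small labelled set for the new task (made-up numbers for illustration).
samples = [(2.0, 0.0), (1.5, 0.5), (-2.0, 0.0), (-1.5, -0.5)]
labels = [1, 1, 0, 0]

head = train_head(samples, labels)
predictions = [round(predict(head, x)) for x in samples]
```

Because the re-used layers stay fixed, only three numbers are learned here, which mirrors why re-training on a small annotated dataset can be much cheaper than training a full network from scratch. Fine-tuning in the sense used above would additionally allow small updates to the re-used weights.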
Fig. 2. (a) Original Image with Two Ground Truth Boxes, (b) Two of the 8x8 Boxes (Blue Color) are Matched with the Apple, and (c) One of the 4x4 Boxes (Red Color) is Matched with the Banana. It is Important to Note that the Boxes in the 8x8 Feature Map are Smaller than those in the 4x4 Feature Map. In Total, SSD has Six Different Feature Maps, and Each Map is Responsible for a Different Scale of Objects, Enabling it to Detect Objects Covering a Large Range of Scales.

Fig. 3. Summary of Predictions made by YOLO Model. Taken from Joseph Redmon et al. in (2015) YOLO Paper.

box directly. YOLO works by taking an image and splitting it into a grid of cells. Then, each grid cell predicts a bounding box if the centre of a bounding box falls within it. Each grid cell uses a confidence value to predict a bounding box that involves the spatial coordinates x, y and the width and height of the box. For each bounding box, the network calculates a class probability value and offset values for the bounding box. The bounding boxes whose class probability is above a threshold value are then combined into a final set of bounding boxes and class labels. Fig. 3 shows the YOLO model architecture taken from [21]. In this model, the Darknet [24] CNN architecture is used for extracting in-depth features and classification.

D. Vehicle Tracking

A simple vehicle tracking algorithm is proposed in this work. The process starts with converting video clips to frames. The output from the object detection model is bounding boxes with coordinates and class labels. The coordinates can be used to determine the centre point of each object, in this case vehicles. Assume the first two frames of a video clip are as depicted in Fig. 4.

In this figure, assume that we have two vehicles at different locations in each frame. Let $(m_1, n_1)$ and $(m_2, n_2)$ denote the coordinates of the first vehicle in the first and second frame respectively, and $(x_1, y_1)$ and $(x_2, y_2)$ those of the next vehicle in the first and second frame respectively. These coordinates are midpoints of the bounding boxes provided by the object detector. For example, let's say the object detector detects a vehicle in a video frame and draws a bounding box at $(x_{start}, y_{start})$ and $(x_{end}, y_{end})$, where $x_{start}$ and $y_{start}$ are the x and y coordinates of the upper left corner of the bounding box, and $x_{end}$ and $y_{end}$ are the x and y coordinates of the lower right corner. Thus, the midpoint of the bounding box is

$$\left(x_{start} + \frac{x_{end} - x_{start}}{2},\; y_{start} + \frac{y_{end} - y_{start}}{2}\right).$$

Next, the Euclidean distance is computed for each point from the first frame to the next frame, resulting in four different distance values. After that, the minimum displacement for each point in frame #2 is determined to obtain the nearest pair from frame #1 (Fig. 4(a)). This results in pairs, as shown in Fig. 4, and virtual trajectory lines can be seen between these pairs (Fig. 4(b)). For counting, a reference line (dotted line) is defined in these frames, which is used to determine whether a car has passed and should be counted by the vehicle counter algorithm.

Fig. 4. (a) Sample Images from Two Consecutive Frames, i.e. Frame#1 and Frame#2. (b) The Vehicle is Tracked from the Minimum Displacement from the First Frame to the Second Frame.

To briefly explain the concept of the Euclidean tracking algorithm, suppose the vehicle positions are $\{x_i : i = 1, ..., L\}$ in the first frame and $\{y_j : j = 1, ..., K\}$ in the second frame. The goal of the Euclidean tracking algorithm is to identify the nearest pair as follows:

$$\mathrm{TRACK}(L) = \sum_{i=1}^{L} \min_{1 \le j \le K} \lVert x_i - y_j \rVert_2$$

Thus, the tracking algorithm works by iterating from the first point until all point pairs in the second frame are visited. After that, each observation in the first frame is assigned to the closest point in the second frame.

E. Vehicle Counting

The counting method is based on the vehicle regional bounding box marks and the virtual reference line. This technique assumes that the vehicle movement is in one direction. For counting, each vehicle detected in the detection step is assigned a unique label and tracked until it reaches the virtual line. In this work, we have used five different class labels, namely bicycle, car, motorcycle, bus and truck. All these labels are categorized as vehicle objects and are used in the counting system. After that, each vehicle position is checked for whether it has crossed the horizontal reference line ($y_{ref}$) on the y-axis as drawn in Fig. 4(b). If it has passed the line, then it is counted as one. In this case, a vehicle whose $y_2$ coordinate value is greater than the $y_{ref}$ coordinate value is said to have crossed the reference line.

F. Data Annotation

Efficient image recognition is crucial in various applications, such as vehicle detection. In the literature, deep convolutional neural network models have shown remarkable achievement on many computer recognition tasks. However, these models are heavily reliant on big datasets of images taken from a variety of conditions, such as
Fig. 6. Some Detection Results in Experiment I using Different Models, i.e. SSD, Faster R-CNN and YOLOv3, respectively.

Fig. 7. Some Detection Results using YOLOv3 DarkNet Detector Model. (a) Top Half shows Detected Vehicle before Retraining. (b) Bottom Half shows Detected Vehicle after Retraining using a Data Annotation Tool.

can be done on the meta-architectures and detectors. Besides, the model used for retraining was a lightweight version of YOLO, which is called tiny-YOLO. This is due to the limitation of the available hardware specification. To retrain YOLO, the recommended minimum GPU memory is 4GB; any specification below that is only suitable for training tiny-YOLO. Thus, it is recommended that future studies consider retraining YOLO instead of tiny-YOLO to compare the performances.

ACKNOWLEDGMENT

This work has been supported by the Malaysia's Ministry of Higher Education Fundamental Research Grant FRGS/1/2019/ICT02/UKM/02/8.

REFERENCES

[1] P. Zheng and M. Mike, “An investigation on the manual traffic count accuracy,” in 8th International Conference on Traffic and Transportation Studies (ICTTS 2012), 2012.
[2] Z. Zhang, K. Liu, F. Gao, X. Li, and G. Wang, “Vision-based vehicle detecting and counting for traffic flow analysis,” 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2267–2273, 2016.
[3] M. S. Chauhan, A. Singh, M. Khemka, A. Prateek, and R. Sen, “Embedded cnn based vehicle classification and counting in non-laned road traffic,” in ICTD ’19, 2019.
[4] B. Dey and M. K. Kundu, “Turning video into traffic data - an application to urban intersection analysis using transfer learning,” IET Image Processing, vol. 13, pp. 673–679, 2019.
[5] H. Song, H. Liang, H. Li, Z. Dai, and X. Yun, “Vision-based vehicle detection and counting system using deep learning in highway scenes,” European Transport Research Review, vol. 11, pp. 1–16, 2019.
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” 2016.
[7] Á. A. García, J. A. Álvarez, and L. M. Soria-Morillo, “Evaluation of deep neural networks for traffic sign detection systems,” Neurocomputing, vol. 316, pp. 332–344, 2018.
[8] N. Yadav and U. Binay, “Comparative study of object detection algorithms,” IRJET, vol. 11, 2017.
[9] A. Arinaldi, J. A. Pradana, and A. A. Gurusinga, “Detection and classification of vehicles for traffic video analytics,” in INNS Conference on Big Data, 2018.
[10] B. Dey and M. K. Kundu, “Turning video into traffic data – an application to urban intersection analysis using transfer learning,” IET Image Processing, vol. 13, pp. 673–679, 2019.
[11] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on deep transfer learning,” ArXiv, vol. abs/1808.01974, 2018.
[12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345–1359, 2010.
[13] M. Murugave., “How to train yolov3 to detect custom objects,” https://medium.com/@manivannan data/how-to-train-yolov3-to-detect-custom-objects-ccbcafeb13d2, 2018.
[14] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[15] A. Krizhevsky, V. Nair, and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009. [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS 2012, 2012.
[17] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015.
[18] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[20] R. B. Girshick, “Fast r-cnn,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, 2015.
[21] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2015.
[22] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2013.
[23] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, 2016.
[24] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” ArXiv, vol. abs/1804.02767, 2018.
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
[28] T. H. Lee., “Tensornets,” https://github.com/taehoonlee/tensornets, 2018.