0% found this document useful (0 votes)
23 views

Vehicle Counting Using Deep Learning Models

Uploaded by

Aman Singh
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views

Vehicle Counting Using Deep Learning Models

Uploaded by

Aman Singh
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/343422746

Vehicle Counting using Deep Learning Models: A Comparative Study

Article in International Journal of Advanced Computer Science and Applications · January 2020
DOI: 10.14569/IJACSA.2020.0110784

CITATIONS READS
7 583

2 authors, including:

Azizi Abdullah
Universiti Kebangsaan Malaysia
101 PUBLICATIONS 946 CITATIONS

SEE PROFILE

All content following this page was uploaded by Azizi Abdullah on 24 May 2021.

The user has requested enhancement of the downloaded file.


(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

Vehicle Counting using Deep Learning Models: A


Comparative Study

Azizi Abdullah1 , Jaison Oothariasamy2


Center For Artificial Intelligence Technology
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
43600 Bandar Baru Bangi Selangor, Malaysia

Abstract—Recently, there has been a shift to deep learning to use it for vehicle detection and counting. In the deep
architectures for better application in vehicle traffic control learning architecture, it learns categories gradually throughout
systems. One popular deep learning library used for detecting the hidden layers. For example, in face image recognition, it
vehicle is TensorFlow. In TensorFlow, the pre-trained model is starts with identifying low level features such as bright and
very efficient and can be transferred easily to solve other similar dark areas and then proceeds to recognize lines and shapes for
problems. However, due to inconsistency between the original
dataset used in the pre-trained model and the target dataset for
facial recognition. Each neuron or node represents one feature
testing, this can lead to low-accuracy detection and hinder vehicle and combination of those nodes will give a full representation
counting performance. One major obstacle in retraining deep of the image. The hidden node or layer is represented by a
learning architectures is that the network requires a large corpus weight value that will influence the outcome (output), and this
training dataset to secure good results. Therefore, we propose to value can be changed during the learning process. All these
perform data annotation and transfer learning from an existing layers are learned in hierarchical order and it is very crucial
model to construct a new model for vehicle detection and counting to determine the high-level features of the data to make an
in the real world urban traffic scenes. Then, the new model is accurate decision. The overall approach mentioned has shown
compared with the experimental data to verify the validity of the high accuracy in classifying objects. Zhang et al. [2] proposed
new model. Besides, this paper reports some experimental results, a vehicle counting system that utilized a deep learning network.
comprising a set of innovative tests to identify the best detection
algorithm and system performance. Furthermore, a simple vehicle
The system was implemented for a static image and detect
tracking method is proposed to aid the vehicle counting process vehicles in every frame. However, there is no information
in challenging illumination and traffic conditions. The results stated in this paper about the flow of moving vehicles.
showed a significant improvement of the proposed system with
the average vehicle counting of 80.90%. In the literature, many works have utilized pre-trained
DLN models via transfer learning methodology for vehicle
Keywords—CNN; transfer learning; deep learning; object de- detection. In [3] used a pre-trained model via transfer learning,
tection; vehicle detection i.e. Yolo on vehicle counting. The model is trained using the
standard MS-COCO dataset. After that the researchers re-train
I. I NTRODUCTION the model on different datasets, namely PASCAL VOC 2007,
In the earlier days before the rise of machine learning, KITTI and user’s custom annotated dataset. The mean accuracy
the process of vehicle counting was done manually. It was precision detection is around 75% achieved on an 80-20 train-
performed by a person standing by the roadside; using an test split using 5562 video frames from four different highway
electronic device to record the data using a tally sheet. In locations. Another research on vehicle counting was using
some cases, the person may do the counting by observing MobileNet [4]. A MobileNet model which was pre-trained on
video footage captured by city cams or closed-circuit television the ImageNet dataset with a size of 224×224 pixels for each
(CCTV) placed above the road or highway. According to a image. With a limited set of training images, the accuracy of
study in [1], manual vehicle calculation performance is 99% vehicle detection was 97.4%. As for traffic volume estimates or
accurate. This investigation is based on manual calculation of counting accuracy at the intersections, it was 78%. There were
various vehicles from a 5 minutes video recording. Although two crucial observations in this study. First, the performance
the manual method provides high accuracy, it requires an was unsatisfactory in cases of a highly overlapping vehicles
extensive amount of human resources. Besides, it tends to be such as occlusion due to partial information. The detection
error-prone, especially on severe traffic flow and multiple road performance results at night or under very low-illumination
lines. Therefore, manual calculations are usually performed conditions are also poor. In [5] proposed to use YOLOv3
with only a small sample of data, and the results are extrapo- Darknet-53 for vehicle detection and counting system. The
lated for the whole year or season for long-term forecasts. results have shown that DLN can provide higher detection
and counting accuracies, especially for the detection of small
Vision-based vehicle detection through highly cluttered vehicle objects.
scenes is difficult. At present, this approach can be categorized
into traditional and complex deep learning methods. Recently, Following this, some studies have been conducted to com-
deep learning networks (DLN) based on convolutional neural pare various available CNN models as the detector for vehicle
networks (CNN) have obtained state-of-the-art performance counting systems in general, such as [6] [7] and [8] to name a
on many machine vision task. Therefore, researchers began few. There are also studies specifically on using deep learning
www.ijacsa.thesai.org 697 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

models for vehicle counting systems such as [9] [3] and [10]. II. R ELATED W ORK
Each study has varying results which highlighted the strengths
and weaknesses of each pre-trained models. It seems that the With recent advancements in deep learning, computer
model’s performance is highly correlated with the local dataset vision applications such as object classification and detection
and the characteristics of the vehicle movement. Therefore, can be developed and deployed more effectively. These appli-
there is no single CNN detector model that fits for all situation cations have been proposed and shown significant performance
and providing the optimal detection result. In [8] presents a improvements and enabling real-time processing of streaming
comparative study of CNN detector models using deep learning data for analytic and making decision.
library of TensorFlow which provide portability and ease of
use. They used the COCO dataset for evaluation. A. Deep Neural Networks
Deep architectures are useful in learning and have shown
The general availability of many pre-trained deep learning impressive performance for example in the classification of
models might ease the implementation of an automated vehicle digits in the MNIST dataset [14]; CIFAR [15] and ImageNet
counting system. But the main challenge is to identify the [16] for object classifications. In this scheme, the lowest layer,
best model from among sets of similar pre-trained models that i.e. feature detectors are used to detect simple patterns. After
can perform well on intended datasets. The direct comparison that, these patterns are fed into deeper, following, layers that
to determine the optimal model is difficult due to different form more complex representations of the input data. There
environment settings used in experiments. Thus, a fair com- are several approaches to learn deep architectures. One of
parison using a similar environment for performance evaluation the most frequently used in computer vision is convolution
is needed. One possible problem with the pre-trained models neural networks (CNNs), where the networks preserve the
performance is that the use of standard benchmark datasets spatial structure of the problem by learning internal feature
for training and completely different dataset for testing. It is representations using small squares of input data. Features
common experience for the user to get poor results from the are learned and used across the whole image, allowing for
query of desired objects. Thus, instead of using the benchmark the objects in the images to be shifted or translated in the
datasets, one needs to re-train the model on other large custom scene and still detectable by the network. This is one of
data for networks to learn patterns optimally [11]. But, re- the reasons why deep architecture is so useful for object
training on the large data can be costly and time-consuming recognition such as in picking out digits, faces, objects and
for deployment [12]. For example, training a deep learning so on with different challenging conditions. Thus, to get a
algorithm on huge datasets is time-consuming and computing- good classification result, the network is trained with a vast
intensive to secure good performance results. Therefore, one number of images such as using ImageNet [16] as the dataset
possible solution is to use a pre-trained model and transfer to classify pictures. Besides the classification task, the deep
learning for better weight scaling and convergence speed-ups. architecture is widely used for object detection that draws a
Thus, inspired by the work of [13], a set of images with bounding box around each object of interest in the image and
different illumination is used to re-train the existing model assigns them a class label. The bounding box indicates the
via transfer learning. For vehicle counting, a simple method position and scale of every instance of each object category.
is proposed, where the coordinate locations for each vehicle There are several approaches to object detection in computer
are detected in every frame. The Euclidean distance is used vision such as Faster R-CNN, YOLO and SSD.
to computed between frames of a given video sequence for
tracking and trajectory estimation. A virtual reference line is Typically, deep convolutional neural network models may
constructed, and the vehicle is counted if it crosses the line. take days or even weeks to train on huge datasets for good
performance. A way to reduce the training time is to re-
The contribution of this paper is as follows: (1) we use the model weights from pre-trained models that were
compare the most widely used TensorFlow’s object detection trained using millions of natural images such as from ImageNet
model zoo, namely, Faster R-NN, SSD and Yolov3 for vehicle dataset. Such a methodology is called transfer learning. In this
counting application on urban traffic volumes. (2) we demon- technique, the constructed models can be downloaded and used
strate the effectiveness of using data annotation tool for vehicle directly, whereby a neural network model is first trained on
detection via transfer learning. TensorFlow’s detection model a problem similar to the one we have chosen. One or more
zoo that trains on the standard datasets such as COCO alone layers from the pre-trained model are then used in a new model
is not the best to describe real-world vehicle traffic conditions, trained on the problem of interest. The pre-trained model has
but re-training the model efficiently can enhance its ability in the advantage that it is already learned a rich set of image
detecting features. (3) we propose a simple vehicle counting features. Besides, the model is transferable to the new task
system that uses a virtual reference line and Euclidean distance by fine-tuning the network. In this case, the model can be
for tracking and trajectory estimations. re-trained on a small number of images such that the network
weights are small adjusted to support the new task. Thus, it has
The rest of the paper is organized as follows. Section 2 the benefit not only to decrease the training time for a neural
describes the fundamental principles of deep neural network network model but also can result in lower generalization
and its application to object detection. Section 3 describes error. For example, in [17] use ImageNet initialized models
our system for vehicle counting with a focus on TensorFlow’s for object detection on the Pascal VOC dataset challenge, [18]
object detection model zoo with simple tracking and counting use ImageNet initialized models for semantic segmentation.
algorithms. Experimental results on the urban traffic volumes Other works that utilized the ImageNet dataset for training
on different conditions, i.e. morning, day and night are shown deep learning models for image classification such as in [19],
and discussed in Section 4. Section 5 concludes the paper. [16].
www.ijacsa.thesai.org 698 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

B. Deep Learning for Object Detection


In general, object detection is a task in computer vision
that involves identifying the presence, location, and type of
one or more intended objects in a given test image. It is a
challenging problem that consists of three main processes,
namely object recognition, localization and classification. In
recent years, deep learning techniques have been applied to Fig. 1. Faster R-CNN Generates Anchors of Different Ratios and Scales for
many vehicle detection problems and show promising results each Sliding Windows on Convolutional Feature Map. After that, the Output
of Regressor Determines a Predicted Bounding Box.
such as on standard benchmark datasets and in computer vision
challenges.
Several approaches are using deep learning techniques for
object detection. Shaoqing Ren et al. [17] proposed a method, generated which all have the same centre, but with three differ-
namely Faster R-CNN to improve both speeds of training ent aspect ratios and three different scales. Finally, the n × n
and detection of the existing Fast R-CNN [20]. The method spatial features extracted from those convolution feature maps
consists of two modules, namely, (a) region proposal network, are fed to a smaller network which performs classification and
where the convolutional neural network is used for proposing regression. The predicted region proposals are then reshaped
regions and the type of object to consider in the region and (b) using a region pooling layer which is then used to classify
Fast R-CNN for extracting features from the proposed regions the image within the proposed region and predict the offset
and outputting the bounding box and class labels. Faster R- values for the bounding boxes. The regressor output determines
CNN has proven to be efficient for object detection and secured the position, width and height of the predicted bounding box.
the first-place on both the ILSVRC-2015 and MSCOCO-2015 The proposed method results outperform Fast R-CNN on a
object recognition and detection competition tasks. Joseph detection speed of 0.2 seconds on each image. Fig. 1 shows the
Redmon et al. [21] proposed an algorithm namely, you only Faster R-CNN model architecture. In this model, ResNet101
look once (YOLO) for object detection. The algorithm is [26] CNN architecture is used for extracting in-depth features
claimed to be much faster than the standard R-CNN [22] and classification.
and achieving object detection in real-time. The authors then
further improved the model performance and referred to as B. Single Shot MultiBox Detector (SSD)
YOLO v2 [23] and YOLO v3 [24]. Another widely used model The SSD architecture was published in 2016 by researchers
for object detection in the industry is the single-shot multi-box from Google for object detection in real-time [25]. It uses
detector (SSD) [25]. It improves R-CNN [22] detection speed VGGNet convolutional neural network [19] as the base net for
by eliminating the need of the region proposal network. feature extraction. In contrast to Faster R-CNN, SSD improves
the detection speed by eliminating the need of the region
III. P ROPOSED M ETHOD proposal network. In SSD, it provides a set of different default
boxes with varying scales for object detection. These features
The proposed method consists of three main steps. The
(multi-scale features and default boxes) are used to recover the
first step is to detect and draw bounding boxes around vehicles
drop in the object detection accuracy.
for every n-frames employing transfer learning with the deep
CNN architecture. The detection algorithms were inspired by Furthermore, each element of the feature map has several
the works of [17], [25] and [21] that introduced Faster R- default boxes associated with it. The feature map sets came
CNN, SSD and YOLO, respectively. Next, the trajectory of from different layers of the CNN network. A typical CNN
each vehicle is extracted by tracking corner points through n network gradually shrinks the feature map size and increase
frames. In this step, a simple method is introduced to identify the depth as it moves to the deeper layers. The deep layers
the trajectory of each vehicle found in the first step. Finally, a cover larger receptive fields and construct more abstract repre-
simple counting algorithm to count the number of vehicles on sentation, while the shallow layers only cover smaller receptive
the street is proposed. Details of the algorithms are as follows: fields. By utilizing this information, it is possible to detect
small objects in shallow layers and large objects in deeper
A. Faster R-CNN layers. For detection, any default box with an Intersection
over Union (IOU) of 0.5 or higher with a ground truth box
As stated before, this method was proposed by Shaoqing is considered a positive sample. Fig. 2 shows the single-shot
Ren et al. [17] which aims to improve both computational multi-box model architecture. In this model, Inception [27]
speeds and the detection accuracy of existing Fast R-CNN CNN architecture is used for extracting in-depth features and
[20]. This technique mainly comprises of two modules which classification.
are region proposal network and Fast R-CNN for extracting
features from the proposed regions. Similar to Fast R-CNN, C. You Only Look Once (YOLO)
the image is provided as an input to a convolutional network
which will output a set of convolutional feature maps on the Joseph Redmon et al. [21] proposed an algorithm namely,
last convolutional layer. Instead of using a selective search you only look once (YOLO) for object detection. The algo-
algorithm on the feature map to identify the region proposals, a rithm is claimed to be much faster than the standard R-CNN
separate network is used to predict the region proposals. In this [22] and achieving object detection in real-time. In contrast
case, a sliding window of size n × n is run spatially on these to the previous schemes, YOLO uses a neural network to
feature maps. For each sliding window, a set of 9 anchors are predict the bounding boxes and class labels for each bounding
www.ijacsa.thesai.org 699 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

Fig. 2. (a) Original Image with Two Ground Truth Boxes, (b) Two of the
Fig. 4. (a) Sample Images from Two Consequent Frames, i.e. Frame#1 and
8x8 Boxes (Blue Color) are Matched with the Apple, and (c) One of the 4x4
Frame#2. (b) The Vehicle is Tracked from the Minimum Displacement from
Boxes (Red Color) is Matched with the Banana. It is Important to Note that
the First Frame to the Second Frame.
the Boxes in the 8x8 Feature Map are Smaller than those in the 4x4 Feature
Map. In Total, SSD has Six Different Feature Maps, and Each Map
Responsible for a Different Scale of Objects, Enabling it to Detect Objects
Cover a Large Range of Scales.
is (xstart + ( xend −x
2
start
), ystart + ( yend −y
2
start
)). Next, the
euclidean distance is computed for each point from the first
frame to the next frame resulting in four different distance
values. After that the minimum displacement for each point in
frame #2 is determined to obtain the nearest pair from frame
#1 (Fig. 4(a)). This will result in pairs, as shown in Fig. 4
and virtual trajectory lines can be seen between these pairs
(Fig. 4(b)). For counting, a reference line (dot-line) is defined
in these frames, which will be used to determine if a car has
Fig. 3. Summary of Predictions made by YOLO Model. Taken from Joseph passed or not to be counted in the vehicle counter algorithm.
Redmon et al. in (2015) YOLO Paper.
To briefly explain the concept of the euclidean tracking
algorithm suppose the number of vehicles is {xi : i = 1, ..., L}
in the first frame and {yi : i = 1, ..., K} in the second frame.
box directly. YOLO works by taking an image and splits The goal of the euclidean tracking algorithm is to identify the
the input image into a grid of cells. Then, each grid cell nearest pair as follows:
predicts a bounding box if the centre of a bounding box falls
within it. Each grid cell uses a confidence value to predict a X
bounding box that involves spatial coordinate x,y and the width T RACK(L) = min ||xi − yj ||2
1≤j≤K
and height of the box. For each bounding box, the network i=L
calculates a class probability value and offset values for the
bounding box. The bounding boxes having the class probability Thus, the tracking algorithm works by iterating from the
map above a threshold value are then combined into a final set first until all point pairs in the second frame are visited. After
of bounding boxes and class labels. Fig. 3 shows the YOLO that, assign each observation in the first frame to the closest
model architecture taken from [21]. In this model, Darknet distance point in the second frame.
[24] CNN architecture is used for exacting in-depth features
and classification. E. Vehicle Counting
The counting method is based on the vehicle regional
D. Vehicle Tracking bounding box marks and the virtual reference line. This
A simple vehicle tracking algorithm is proposed in this technique assumes that the vehicle movement is in a direction.
work. The process starts with converting video clips to frames. For counting, each detected vehicle in the detection step is
The output from the object detection model is bounding boxes assigned with a unique label and tracked until it reaches
with coordinates and class labels. The coordinates can be used the virtual line. In this work, we have used five different
to determine the centre point of each object and in this case, class labels, namely bicycle, car, motorcycle, bus and truck.
vehicles. Assuming the first two frames of a video clip is And all these labels are categorized as vehicle object and
depicted in Fig. 4. will be used in the counting system. After that, each vehicle
position is checked whether it has crossed the horizontal
In this figure, assume that we have two vehicles at different reference line (yref ) at the y-axis as drawn in Fig. 4(b). If
location and frame. Here, by indicating (m1 , n1 ) and (m2 , n2 ) it passed the line, then it will be counted as one. In this case,
for the first vehicle coordinates in the first and second frame y2 coordinate value > yref coordinate value can be said to
respectively. And (x1 , y1 ) and (x2 , y2 ) for the next vehicle cross the reference line.
coordinates in the first and second frame respectively. These
coordinates are midpoints of the bounding boxes provided by F. Data Annotation
the object detector. For example, let’s say the object detector
detects a vehicle in a video frame and draws a bounding box The need for efficient image recognition is crucial to be
at (xstart , ystart ) and (xend , yend ), where xstart and ystart used in various application, such as for vehicle detection. In
are the x and y coordinates of the upper left corner of the the literature, deep convolutional neural network models have
bounding box respectively. And xend and yend are the x shown remarkable achievement on many computer recognition
and y coordinates of the lower right corner of the bound- tasks. However, these models are heavily reliant on big dataset
ing box respectively. Thus, the midpoint of the coordinate of images taken to form a variety of conditions, such as
www.ijacsa.thesai.org 700 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

different orientation, location, scale, illumination, etc. Unfor-


tunately, many existing deep learning models were trained in
a limited set of image conditions, which can increase prob-
lems of overfitting and hinder generalization performance. For
instance, a poorly trained deep learning network would give
high vehicle detection on the daytime condition but provide a
poor performance on the night time. Thus, different types of
illumination conditions would affect the model’s performance Fig. 5. Some Image Examples under Clear-Sky Condition in Kuala Lumpur,
in detecting vehicle objects. This results to lower accuracy Malaysia. Top Half shows the Day Condition and Bottom Half shows the
Night Condition.
obtained for vehicle counting systems.
Inspired by the work of [13], the data annotation tool is
used to increase the variability of training images for gener- TABLE I. D ETAIL OF V IDEO F ILES USED IN E XPERIMENT 1. T HE V IDEOS
alization performance detection models. This tool consists of ARE TAKEN FROM 06 A . M TO 09 P. M . T HE VALUE IN B RACKET SHOWS
three main steps, namely, (a) main process - used to draw THE G ROUND -T RUTH N UMBER OF V EHICLES FOR E ACH S ESSION .
the bounding box, i.e. top-left and bottom-right points on
Video Files with time
image objects for yolo training; (b) convert - to transform the Video File Time (a.m/p.m) Video File Time (a.m/p.m)
bounding box points into yolo input scheme, i.e. class id, x, Video 1 06 a.m (140) Video 6 01 p.m (266)
y, width height of the image objects, whereby x and y is the Video 2 08 a.m (453) Video 7 02 p.m (322)
Video 3 10 a.m (262) Video 8 04 p.m (358)
centre point of the bounding box; (c) the process - used to split Video 4 11 a.m (280) Video 9 07 p.m (237)
the train and test dataset for yolo training. In this work, about Video 5 12 p.m (299) Video 10 09 p.m (202)
510 new image objects from the video samples are used. The
breakdown of total annotated vehicle images are as follows:
bicycle (0), car (1866), motorcycle (457), bus (53) and truck bounding boxes with coordinate and object label. After that,
(74). three detector models are used for comparison to be selected
as the best vehicle detector for the counting system. These
IV. E XPERIMENTAL R ESULTS AND A NALYSIS model are chosen based on the popularity in both past studies
and availability of pre-trained models. Besides, they are widely
Our experiments contain two stages. In the first stage, used in industries for ease of implementation, especially on
we compare the proposed object detection algorithms, namely TensorFlow framework. These models are (a) Faster R-CNN
Faster R-CNN, SSD and Yolov3 on a set of videos. Based on (b) SSD and (c) YOLOv3. The first experiment (Experiment 1)
the results of the first stage, we further extend the experiments investigates the best deep detector models for vehicle counting
by applying data augmentation using a data annotation tool system in the day and night conditions for benchmarking. The
to improve the detection performance. All the pre-trained second experiment (Experiment II) looks at improving the best
models are trained on the COCO dataset and available on the method in the first experiment on selected conditions of traffic
TensorFlow detection model zoo (2019) and TensorNets [28]. flow.
The counting process takes some time, and it depends on the
number of image frames and system configuration. We have
performed experiments on Intel i5-8250 CPU @ 1.60GHz with B. Experiment I
8GB memory and GeForce MX150 GPU with 6GB memory.
In this work, vehicle accuracy counting is used to evaluate The best model in Experiment I is YOLOv3 and a detailed
the performance of each detection model. It is determined by result is presented in the next section. Experiment I resulted in
counting vehicles with ROI passing the reference line in image YOLOv3 as the architecture with the highest average counting
frames. The vehicle counting accuracy (VCA) is computed in accuracy for 10 sample videos tested, as shown in Table II.
equation (1) as follows: However, it was found that the performance of this model
was worse in poor illumination, especially in the morning
and night conditions. The YOLOv3 scores average vehicle
N umber of Detected V ehicles counting accuracy of 66.29% compared to Faster R-CNN,
V CA(%) = × 100 (1) which obtains the second-best average accuracy of 38.12%.
T otal N umber of V ehicles
On the other hand, SSD recorded the fastest processing time
A. Dataset of 0.135 seconds per frame but has the lowest accuracy of
14.53%. The high standard deviation of the YOLOv3 and other
In this work, 10 sample traffic video clips from the same models is due to the high variation of illumination change,
location in two different times of the day (day and night) are especially in the morning and night conditions. As shown in
used in the experiments. The video was recorded in Kuala Table III, YOLOv3 achieves very high accuracy during the
Lumpur, Malaysia from 06 a.m to 09 p.m under clear-sky daytime (10 a.m. to 2 p.m.) but in the early morning (6 a.m.),
condition. Fig. 5 shows some examples of day and night and night (9 p.m.) the accuracy is very low which is similar to
images with different traffic volume and day-night illumination other models. The overfitting of the pre-trained models can be
variations. Table I shows the video list recorded from a CCTV seen appearing in all models tested here. Fig. 6 shows some
camera and time information used in the experiments. Only detection results using different detection models, i.e. SSD,
vehicles flow in one direction is considered for counting. These Faster R-CNN and Yolo V3. The accuracy vehicle counting
videos are then converted to frames and each frame becomes comparison between all three models on two different time
the input to object detection algorithm while, the output are conditions (day and night) is shown in Table III.
www.ijacsa.thesai.org 701 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

Fig. 6. Some Detection Results in Experiment I using Different Models i.e. Fig. 7. Some Detection Results using YOLOv3 DarkNet Detector Model (a)
SSD, Faster R-CNN and YoloV3, respectively. Top Half shows Detected Vehicle before Retraining (b) Bottom Half shows
Detected Vehicle after Retraining using a Data Annotation Tool.

TABLE II. AVERAGE C OUNTING ACCURACY AND P ROCESSING T IME FOR


E ACH M ODEL TABLE IV. T HE AVERAGE D ETECTION ACCURACY BEFORE AND AFTER
DATA A NNOTATION
Model Counting Accuracy Processing Time
(%) (frame per second) Detection Before annotation After annotation
YOLOv3 66.29±33.35 0.26 ±0.013 Model (%) (%)
DarkNetv19 YOLOv3 66.29 ± 33.35 80.90 ± 11.62
FasterRCNN 38.12±26.26 0.532±0.037 DarkNet19
ResNet101
SSD Inception 14.53±14.40 0.135±0.004

tiny-yolo.cfg - is the model configuration file. The retraining


process can be executed by using the training command to
C. Experiment II YOLOv3 DarkNet framework in the YOLO data annotation
The best performing model in experiment I is selected for tool. The output of this process will be weight files for each
further evaluation in Experiment II. The previous results have 100th iteration.
shown that the pre-trained models have some problems with a The final weight file that is produced when the average
high variation of illumination in morning and night conditions. loss ratio has saturated will be used for Experiment II. Then,
Thus, to overcome the problem, the data annotation technique results from this retrained YOLOv3 model is compared with
is proposed. After that we compare the performance of the the result in Experiment I that corresponding to the same video
retrained model with the pre-trained model using the suggested clip samples. The experiment shows that the counting accuracy
data annotation tool. The retraining is done using a custom has improved very significantly with counting accuracy from
dataset taken from the video files. Firstly, the video files of 2.86% to 75.71% and 3.47% to 76.73% in the morning (06
type AVI are converted to a series of image frames of JPG a.m) and night (09 p.m) conditions respectively. This is due to
type. Using the YOLO Annotation Tool [13], which is a Python the model’s ability to detect more vehicles in poor illumination
executable program, the images of the vehicles are annotated conditions. Fig. 7 shows the average counting accuracy results
and labelled. Bounding boxes are drawn on images of vehicles of YOLOv3 before and after retraining using a data annotation
on video frames and labelled accordingly. To simulate the real for vehicle counting system. The counting system improves
traffic flow, two video clips were used one which is taken in the very significantly with average accuracy from 66.29% to
early morning (06 a.m.) and another at night (09 p.m.). In this 80.90%. Table IV shows the average detection accuracy on
work, about 510 images from poorly illuminated video samples the standard Yolo V3 and the proposed data augmentation on
were used. Breakdown of total annotated vehicles is bicycle Yolo V3.
(0 image sample), car (1866 image samples), motorcycle (457
image samples), bus (53 image samples) and truck (74 image
V. C ONCLUSION AND F UTURE W ORK
samples). In this software, three additional files are required to
perform the retraining. The files are (a) obj.names - contains This paper addresses the challenges in the selection of
the classes that need to be trained (b) obj.data - the pointers the best model for the development of a vehicle counting
towards the location of the annotation files and images and (c) system for a custom dataset. Comparison of three models
(Faster R-CNN ResNet101, SSD Inception, YOLOv3 Dark-
Net) which were pre-trained on the COCO dataset showed that
TABLE III. T HE OVERALL ACCURACY V EHICLE C OUNTING ACCURACY YOLOv3 DarkNet19 is achieving the best result. The results
FOR ALL D ETECTOR M ODELS . presented can be used as a reference for future development
of a similar counting system. However, YOLOv3 DarkNet19
Video time YOLOv3 Faster RCNN SSD Inception performs worse in the morning and night condition of the
DarkNet19 ResNet101 v2 (%)
(%) (%) custom dataset. Thus, the solution is to retrain the model
06 a.m 2.86 0.71 0.00 with a custom dataset from the poor illumination condition
08 a.m 73.73 23.18 11.26
10 a.m 94.27 61.45 17.94
environment using a data annotation tool and employs transfer
11 a.m 83.93 52.14 8.58 learning with the weight training initialization method. The
12 p.m 96.32 28.76 10.37 resulting model improves the counting accuracy very signif-
01 p.m 89.84 30.08 12.78
02 p.m 86.65 82.30 40.68 icantly. A tracking mechanism based on consecutive frames
04 p.m 74.02 70.95 41.62 comparison was also proposed to aid the counting system. This
07 p.m 57.82 29.11 2.11 mechanism may work only on vehicles moving in one direction
09 p.m 3.47 2.47 0.00
without occlusion. In future studies, perhaps some uniformity
www.ijacsa.thesai.org 702 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 7, 2020

can be done on the meta -architectures and detectors. Besides, [11] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on
the model used for retraining was a light-weighted version deep transfer learning,” ArXiv, vol. abs/1808.01974, 2018.
of YOLO, which is called tiny-YOLO. This is due to the [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans-
limitation on the available hardware specification. To retrain actions on Knowledge and Data Engineering, vol. 22, pp. 1345–1359,
2010.
YOLO the recommended minimum GPU memory is 4GB,
[13] M. Murugave., “How to train yolov3 to detect cus-
any specification below that is only suitable for training tiny- tom objects,” https://medium.com/@manivannan data/
YOLO. Thus, it is recommended that future studies need to how-to-train-yolov3-to-detect-custom-objects-ccbcafeb13d2, 2018.
consider the retraining of YOLO instead of tiny-YOLO to [14] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
compare the performances. [15] A. Krizhevsky, V. Nair, and G. Hinton, “Learning multiple layers of
features from tiny images,” Tech. Rep., 2009. [Online]. Available:
ACKNOWLEDGMENT http://www.cs.toronto.edu/∼kriz/cifar.html
This work has been supported by the Malaysia’s Min- [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in NIPS 2012, 2012.
istry of Higher Education Fundamental Research Grant
[17] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-
FRGS/1/2019/ICT02/UKM/02/8. time object detection with region proposal networks,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–
R EFERENCES 1149, 2015.
[1] P. Zheng and M. Mike, “An investigation on the manual traffic count [18] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
accuracy,” in 8th International Conference on Traffic and Transportation for semantic segmentation,” in CVPR, 2015.
Studies (ICTTS 2012), 2012. [19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
[2] Z. Zhang, K. Liu, F. Gao, X. Li, and G. Wang, “Vision-based vehicle large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
detecting and counting for traffic flow analysis,” 2016 International [20] R. B. Girshick, “Fast r-cnn,” 2015 IEEE International Conference on
Joint Conference on Neural Networks (IJCNN), pp. 2267–2273, 2016. Computer Vision (ICCV), pp. 1440–1448, 2015.
[3] M. S. Chauhan, A. Singh, M. Khemka, A. Prateek, and R. Sen, [21] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only
“Embedded cnn based vehicle classification and counting in non-laned look once: Unified, real-time object detection,” 2016 IEEE Conference
road traffic,” in ICTD ’19, 2019. on Computer Vision and Pattern Recognition (CVPR), pp. 779–788,
[4] B. Dey and M. K. Kundu, “Turning video into traffic data - an 2015.
application to urban intersection analysis using transfer learning,” IET [22] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
Image Processing, vol. 13, pp. 673–679, 2019. hierarchies for accurate object detection and semantic segmentation,”
[5] H. Song, H. Liang, H. Li, Z. Dai, and X. Yun, “Vision-based vehicle 2014 IEEE Conference on Computer Vision and Pattern Recognition,
detection and counting system using deep learning in highway scenes,” pp. 580–587, 2013.
European Transport Research Review, vol. 11, pp. 1–16, 2019. [23] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” 2017
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, pp. 6517–6525, 2016.
“Speed/accuracy trade-offs for modern convolutional object detectors,” [24] ——, “Yolov3: An incremental improvement,” ArXiv, vol.
2016. abs/1804.02767, 2018.
[7] Á. A. Garcı́a, J. A. Álvarez, and L. M. Soria-Morillo, “Evaluation of [25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and
deep neural networks for traffic sign detection systems,” Neurocomput- A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016.
ing, vol. 316, pp. 332–344, 2018. [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
[8] N. Yadav and U. Binay, “Comparative study of object detection algo- recognition,” arXiv preprint arXiv:1512.03385, 2015.
rithms,” IRJET, vol. 11, 2017. [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
[9] A. Arinaldi, J. A. Pradana, and A. A. Gurusinga, “Detection and V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
classification of vehicles for traffic video analytics,” in INNS Conference 2015 IEEE Conference on Computer Vision and Pattern Recognition
on Big Data, 2018. (CVPR), pp. 1–9, 2015.
[10] B. Dey and M. K. Kundu, “Turning video into traffic data – an [28] T. H. Lee., “Tensornets,” https://github.com/taehoonlee/tensornets,
application to urban intersection analysis using transfer learning,” IET 2018.
Image Processing, vol. 13, pp. 673–679, 2019.

www.ijacsa.thesai.org 703 | P a g e

View publication stats

You might also like