Final Report
CHAPTER NO DESCRIPTION
ACKNOWLEDGEMENT
SYNOPSIS
1 INTRODUCTION
2 LITERATURE SURVEY
3 PREAMBLE
4 REQUIREMENT SPECIFICATION
5 SYSTEM DESIGN
6 SYSTEM IMPLEMENTATION
7 SYSTEM TESTING
8 RESULT
REFERENCES
APPENDIX
SOURCE CODE
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION OF PROJECT
The police use a RADAR device to detect the speed of vehicles, and the device must be placed at a certain distance. This requires manpower to focus on the vehicles, and deploying it across most highways is expensive. Cameras have been widely used in traffic operations. While many technologically smart camera solutions on the market can be integrated into Intelligent Transport Systems (ITS) for automated detection, monitoring and data generation, many Network Operations Centres still use legacy camera systems as manual surveillance devices. Intelligent transportation frameworks have become increasingly significant in many modern cities as governments depend more on them to enable smart decision making (for both organizations and individual users) and better use of the existing infrastructure. To address this problem, we are building a system to extract traffic data from videos captured by legacy cameras. From the extracted data we can detect vehicles, classify them (e.g., car, bus), estimate the speed of the detected vehicles, and keep a count of the number of vehicles that pass the camera, using a deep learning model. In the fields of AI and computer vision, there are various techniques that can be used for object detection. These techniques can likewise be applied to detect vehicles and their speeds. The system that we are introducing uses the following: (1) the YOLOv3 algorithm for real-time object detection; (2) the TensorFlow library to create the models and neural network for detecting and classifying objects; and (3) the OpenCV library to calculate the speed of the vehicles.
The main idea of the project is to develop a vision-based pipeline for vehicle counting, speed estimation and vehicle classification. It uses computer vision techniques to extract traffic data from videos captured by cameras, applying object detectors and transfer learning to detect vehicles, pedestrians and cyclists from monocular videos. The main objective is to use a camera instead of radar, which requires manpower to focus on the vehicles and must be deployed on most highways. The existing radar-based technique is too costly; it is therefore necessary to design a system that is affordable and built from cost-effective components.
For verification and testing of the proposed approach, four different videos recorded under different environmental conditions (morning, afternoon, evening and a partially cloudy day) are used. In the proposed method, detection and tracking of the vehicles use parameters such as the position, height and width of each vehicle instead of feature extraction, which requires less computation and memory. The proposed approach stores the vehicle parameters and the estimated speeds of the detected vehicles in a database, and the proposed system can be adopted easily into an existing traffic management system.
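As an illustration of the detection step described above, the sketch below runs a pre-trained YOLOv3 network on one video frame through OpenCV's DNN module and keeps only the vehicle-related COCO classes. This is not the project's exact code: the file names (yolov3.cfg, yolov3.weights, coco.names) and the thresholds are assumptions that depend on how the pre-trained model files are obtained.

```python
import cv2
import numpy as np

# Assumed file names for a pre-trained Darknet model; adjust to your setup.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
with open("coco.names") as f:
    class_names = [line.strip() for line in f]

VEHICLE_CLASSES = {"car", "bus", "truck", "motorbike"}  # COCO labels of interest

def detect_vehicles(frame, conf_threshold=0.5, nms_threshold=0.4):
    """Return a list of (label, confidence, (x, y, w, h)) for vehicles in a frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes, confidences, labels = [], [], []
    for output in outputs:
        for detection in output:          # [cx, cy, bw, bh, objectness, class scores...]
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold and class_names[class_id] in VEHICLE_CLASSES:
                cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                labels.append(class_names[class_id])

    # Non-maximum suppression removes duplicate boxes over the same vehicle.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return [(labels[i], confidences[i], tuple(boxes[i])) for i in np.array(keep).flatten()]
```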
CHAPTER 2
LITERATURE SURVEY
[1] Alternative Automatic Vehicle Classification Method
The paper deals with a new method for automatic vehicle classification called ALT (ALTernative). Its characteristic feature is the versatility resulting from its open structure; a user can change the number of vehicle categories and their definitions according to individual requirements. It uses an algorithm for automatic vehicle recognition based on data fusion techniques and fuzzy sets. Test results demonstrated high classification effectiveness while retaining high selectivity of segmentation: a classification effectiveness of 95% for all vehicles, and 100% for goods trucks, is more than satisfactory.
Advantage: The number of categories is not fixed and can be modified according to the traffic in a given area if required.
Disadvantage: The decision is taken on the grounds of classical logic ("should" or "should not"), which is the reason for the low effectiveness of such classification algorithms.
[12] Real Time Object Detection, Tracking, and Distance and Motion
Estimation based on Deep Learning: Application to Smart Mobility
In this paper, they introduce an object detection, localization and tracking system for smart mobility applications such as road traffic and railway environments. Firstly, object detection and tracking were carried out with two deep learning approaches: You Only Look Once (YOLOv3) and SSD. Secondly, object distance estimation based on a monocular depth estimation (MonoDepth) algorithm was developed. This model is trained on a stereo image dataset, but its inference uses monocular images. As the output, they obtain a disparity map that they combine with the output of object detection. For validation, they tested two models with different backbones, including VGG and ResNet, on two datasets: Cityscapes and KITTI. As the last step of the approach, they developed a new SSD-based method to analyse the behaviour of pedestrians and vehicles by tracking their movements even when there is no detection in some images of a sequence. They developed an algorithm based on the coordinates of the output bounding boxes of the SSD algorithm.
The whole development was tested in real vehicle traffic conditions in Rouen city centre and with videos taken by embedded cameras along the Rouen tramway.
Disadvantages:
Sometimes objects may be placed too close together, so that they appear to be a single object.
If there are moving objects, detection of the objects becomes troublesome.
Table 2.1 Literature Survey Summary

1. Alternative Automatic Vehicle Classification Method (2010), Piotr Burnos
Methodology: Its characteristic feature is versatility resulting from its open structure; moreover, a user can adjust the number of vehicle categories according to individual requirements.
Advantage: The number of categories is not fixed and can be modified.
Disadvantage: The decision is taken on the grounds of classical logic.

2. Simultaneous Traffic Sign Detection and Boundary Estimation Using Convolutional Neural Network (2018), Hee Seok Lee, Kang Kim
Methodology: It simultaneously estimates the location and precise boundary of traffic signs using a convolutional neural network (CNN).
Advantage: They proposed an efficient CNN-based method for traffic sign detection.
Disadvantage: Since the latest architecture for object detection has not been used, the accuracy is lower.

3. Real-Time Foreground-Background Segmentation Using Codebook Model (2018), David Harwood, Larry S. Davis
Methodology: It can deal with scenes containing moving backgrounds or illumination variations, and it achieves robust detection for various types of videos.
Advantage: Interesting foreground objects are detected even when mixed with other stationary objects.
Disadvantage: Backgrounds having fast variations are not easily modelled.

4. Vehicle Colour Recognition with Spatial Pyramid Deep Learning (2016), Chuanping Hu, Xiang Bai
Methodology: Colour, as a prominent and stable attribute of vehicles, can serve as a useful and reliable cue in a variety of applications in intelligent transportation systems.
Advantage: The usage of spatial information further improves recognition accuracy.
Disadvantage: It might make mistakes or give wrong predictions in certain cases.

5. An Approach to Traffic Flow Detection and Improvements of Non-Contact Microwave Radar Detectors (2017), Tan-Jan Ho
Methodology: Fast Fourier transforms (FFTs) prescribe plausible smart decisions for enhancing the traffic detection of non-contact microwave radars (MRs).
Advantage: They presented a plausible approach to remarkable improvements of the traffic flow detection of an MR system.
Disadvantage: Coping with a missing vehicle signal.

6. Image-Based Learning to Measure Traffic Density Using a Deep Convolutional Neural Network (2017), Jiyong Chung, Keemin Sohn
Methodology: A deep convolutional neural network was devised to count the number of vehicles on a road segment based on video images.
Advantage: Using filters reduces the number of weight parameters.
Disadvantage: It is difficult to account for how a CNN can count the number of vehicles exactly.

7. Image Classification Using Convolutional Neural Networks (2017), Ksenia Soorkina
Methodology: It is concerned with the automatic extraction, analysis and understanding of useful information from images.
Advantage: Image classification refers to labelling the pixels of the image.
Disadvantage: This system uses the training model SSD_INCEPTION_V2_COCO.

8. Real Time Object Detection, Tracking, and Distance and Motion Estimation Based on Deep Learning: Application to Smart Mobility (2017), Aya Hassouneh, A. M. Mutawa
Methodology: It uses an object detection and tracking system for smart mobility applications.
Advantage: The model is trained on a stereo image dataset, but its inference uses monocular images.
Disadvantage: If there are moving objects, detection of the objects becomes troublesome.

9. Using Bluetooth and Sensor Networks for Intelligent Transportation Systems (2004), Hemjit Sawant, Jindong Tan, Qingyan Yang, QiZhi Wang
Methodology: The safety of road travel can be increased if vehicles can be made to form clusters for sharing information among themselves.
Advantage: This increases the overall sensing ability of the vehicle.
Disadvantage: Issues like the potential communications overhead imposed on the sensors.

10. Using GPS to Measure Traffic System Performance (2015), Glen M. D'Este, Rocco Zito, Michael A. P. Taylor
Methodology: GPS-based congestion measures within an ITS framework, procedures for implementing a congestion monitoring system, and implications for urban road networks.
Advantage: GPS equipment can be transferred quickly and easily from one vehicle to another.
Disadvantage: There is often a problem with communications.
CHAPTER 3
PREAMBLE
3.1 EXISTING SYSTEM
The word "Radar" is an acronym for Radio Detection and Ranging. The police use this device to detect the speed of a vehicle, and it must be placed at a certain distance. This requires manpower to focus on the vehicles, and it must be deployed on most highways, which is expensive.
3.1.1 Disadvantages of the Existing System
Most existing techniques rely on manual processing of the trajectories of vehicles captured by video cameras, and they are both labour intensive and inaccurate.
RADAR has a short range.
It cannot distinguish or resolve multiple targets.
3.2 PROPOSED SYSTEM
In the proposed system, we make use of a camera instead of radar. The camera captures the vehicle, detects its speed, classifies the type of vehicle, and also keeps a count of the number of vehicles that pass the camera. This does not require much manpower and generates valid proof of over-speeding.
Cost effective: The main objective of developing the algorithm for a real-time system is to provide cost effectiveness. It is necessary to design a system that is affordable and includes cost-effective components.
Fast: The main objective of this project is to develop an algorithm that is extremely fast compared to the existing ones.
Accuracy: The main objective of this project is to develop an algorithm that is more accurate compared to the existing ones.
3.2.1 Features of the Proposed System
Our proposed pipeline combines object detection and multiple object tracking to count and classify vehicles from captured video.
Our proposed pipeline also includes a visual classifier module.
A deep learning method is proposed to deal with the problem of vehicle colour recognition.
Workflow Diagram
CHAPTER 4
REQUIREMENT SPECIFICATION
The study of the requirement specification focuses on the functioning of the system. It allows the developer or analyst to understand the functions the system must carry out, the performance level to be obtained, and the corresponding interfaces to be established. The software detects vehicles with the help of a camera, so the computer needs to have a camera installed. A laptop is usually equipped with a camera, but a desktop computer sometimes is not, in which case an external camera needs to be installed and connected to the computer.
Processor : Intel i5 8th Gen
RAM : 8 GB
Graphics Card : Nvidia GeForce GTX 960MX
Python 3.5.0, which supports:
Web development
Scientific programming
Desktop GUIs
Network programming
Game programming
CHAPTER 5
SYSTEM DESIGN
Design is the first step in the development phase for any engineering product or system.
It may be defined as the process of applying various techniques and principles for the
purpose of defining a device, a process or a system in sufficient detail to permit its physical
realization. Design is a meaningful representation of something that is to be built. Software
design is an iterative process through which the requirements are translated into a
“blueprint” for constructing the software.
When designing the system, the points to be considered are:
Identifying the data to be stored
Identifying the user requirements
The need to maintain data and retrieve it whenever wanted
Identifying inputs and arriving at the user-defined output
System specification
Security specification
A view of future implementation of the project
A system architecture or systems architecture is the conceptual model that defines the
structure, behaviour, and more views of a system. An architecture description is a formal
description and representation of a system, organized in a way that supports reasoning about
the structures and behaviours of the system.
The system architecture for estimating vehicle speed from video data consists of 9 processes. Each process performs a particular task whose result is used by the next process until the estimated speed is calculated. The block diagram for this system is given in Fig. 5.1.
5.1 System Architecture
With the revival of deep neural networks (DNNs), object detection has achieved significant advances in recent years. Current top deep-network-based object detection frameworks can be divided into two categories: the two-stage approach and the one-stage approach. In the two-stage approach, a sparse set of candidate object boxes is first generated by selective search or a region proposal network, and then these boxes are classified and regressed. In the one-stage approach, the network directly generates dense samples over locations, scales and aspect ratios; at the same time, these samples are classified and regressed. The main advantage of the one-stage approach is real-time speed; however, its detection accuracy usually lags behind the two-stage approach, and one of the main reasons is the class imbalance problem.
5.4 Use Case Diagram
A use case diagram is a behavioural UML diagram type that is frequently used to analyse various systems. It enables you to visualize the different types of roles in a system and how those roles interact with the system.
CHAPTER 6
SYSTEM IMPLEMENTATION
6.1 SPEED DETECTION METHOD
Initially, each video frame is fed as input to a cascade classifier.
Once the ROI is calculated, it is sent to the neural network.
The speed of the vehicle is then detected using the Euclidean distance between its positions in consecutive frames.
The function estimateSpeed takes two parameters, location1 and location2.
Speed is calculated by multiplying the frames per second, the distance travelled per frame in metres (obtained from the pixel displacement and a pixels-per-metre calibration), and the constant 3.6, which converts metres per second into kilometres per hour (3600 seconds per hour divided by 1000 metres per kilometre). A minimal sketch of this calculation is given after this list.
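The sketch below assumes a fixed frame rate and a pixels-per-metre calibration constant; both values are placeholders that depend on the camera setup and mounting.

```python
import math

# Hypothetical calibration constants; the actual values depend on the camera setup.
PIXELS_PER_METRE = 8.8   # assumed image-to-world scale
FPS = 18                 # assumed frame rate of the processed video

def estimate_speed(location1, location2):
    """Estimate vehicle speed in km/h from two tracked (x, y) pixel positions
    observed in consecutive frames."""
    # Euclidean distance travelled in pixels between the two frames
    d_pixels = math.sqrt((location2[0] - location1[0]) ** 2 +
                         (location2[1] - location1[1]) ** 2)
    # Convert the pixel displacement to metres
    d_metres = d_pixels / PIXELS_PER_METRE
    # metres/frame * frames/second = metres/second; * 3.6 gives km/h
    return d_metres * FPS * 3.6
```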
CHAPTER 7
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each test type addresses a specific testing requirement.
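As an example of component-level testing in this project's context, the sketch below unit-tests a speed estimation helper like the one in Chapter 6. The calibration constants and expected values are hypothetical; the expected speed is computed by hand from the same formula.

```python
import math
import unittest

PIXELS_PER_METRE = 8.8   # same assumed calibration as in Chapter 6
FPS = 18

def estimate_speed(location1, location2):
    d_pixels = math.sqrt((location2[0] - location1[0]) ** 2 +
                         (location2[1] - location1[1]) ** 2)
    return d_pixels / PIXELS_PER_METRE * FPS * 3.6

class TestEstimateSpeed(unittest.TestCase):
    def test_stationary_vehicle(self):
        # A vehicle that has not moved between frames has zero speed.
        self.assertEqual(estimate_speed((100, 100), (100, 100)), 0.0)

    def test_known_displacement(self):
        # 88 pixels = 10 metres per frame; at 18 fps that is 180 m/s = 648 km/h.
        self.assertAlmostEqual(estimate_speed((0, 0), (88, 0)), 648.0, places=6)

if __name__ == "__main__":
    unittest.main()
```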
CHAPTER 8
RESULT
The above figure shows the details of the detected car and its speed, together with the count of the number of vehicles detected.
CHAPTER 9
9.1 CONCLUSION
The proposed project aims to require less computation and memory, and it stores the vehicle parameters and estimated speeds of the detected vehicles in a database. Detection and tracking of the vehicles use parameters such as the position, height and width of each vehicle instead of feature extraction; hence the proposed system can be adopted easily into an existing traffic management system.
Every application has its own merits and demerits. The project has covered almost all the requirements. Further requirements and improvements can easily be accommodated, since the code is structured and modular in nature. Changing the existing modules or adding new modules can bring improvements. Further enhancements can be made to the application, so that its functions are more accurate and efficient than the present ones.
REFERENCES
[1] Chuanping Hu, Xiang Bai, Li Qi, Pan Chen, Gengjian Xue, and Lin Mei, "Vehicle Color Recognition with Spatial Pyramid Deep Learning", 2016.
[2] Glen M. D'Este, Rocco Zito, and Michael A. P. Taylor, "Using GPS to Measure Traffic System Performance", 2015.
[3] Hee Seok Lee and Kang Kim, "Simultaneous Traffic Sign Detection and Boundary Estimation Using Convolutional Neural Network", 2018.
[4] Chris Stauffer and W. E. L. Grimson, The Artificial Intelligence Laboratory, MIT, Cambridge, "Adaptive Background Mixture Models for Real-Time Tracking".
[5] Piotr Burnos, "Alternative Automatic Vehicle Classification Method", 2010.
[6] Hemjit Sawant, Jindong Tan, Qingyan Yang, and QiZhi Wang, "Using Bluetooth and Sensor Networks for Intelligent Transportation Systems", 2004.
[7] Jiyong Chung and Keemin Sohn, "Image-Based Learning to Measure Traffic Density Using a Deep Convolutional Neural Network", 2017.
[8] Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Evaluating Color Descriptors for Object and Scene Recognition", 2010.
APPENDIX
Additionally, implementing such a system reduces reliance on camera vendors, so consistent data can be obtained regardless of the variety of cameras used to collect the data.
Perth, the capital city of Western Australia, is one of the most car-dependent cities in the world. The Perth Metropolitan Area covers an area larger than London, but has a population of merely 2 million. Over the years, Main Roads Western Australia (MRWA), the Government road agency responsible for managing roads of the State, has installed closed-circuit television cameras (CCTV) throughout the metropolitan area as part of its Intelligent Transport System. In particular, there are many cameras mounted above different segments of its two major freeways, Mitchell Freeway and Kwinana Freeway, which form the spine of this linear city, connecting the Northern and Southern suburbs with CBD in the middle. These cameras are set up in a way that no more than one camera views the same segment of the road. This means that it is not possible to carry out 3D analysis using stereo vision techniques. Some of these cameras are fixed cameras while others are of pan-tilt-zoom type. All of them can be remotely controlled from the Network Operations Centre. This camera network is a valuable asset of the state and offers important traffic information of the city.
In this paper, we present a computer vision pipeline for vehicle classification, counting, and speed estimation from traffic videos captured by these cameras. Our system can reach a counting accuracy as high as 98% of pedestrians and cyclists on the shared footpath with respect to their directions of travel. Figure 1 shows two example traffic scenes that our system deals with in this paper.

Fig. 1. Example images from the Narrows Bridge North site (left) on Kwinana Freeway before the afternoon peak hours of a Thursday and the Hutton Street site (right) on Mitchell Freeway on a Friday morning rush hour period.

One of the important components of the pipeline is the detection of moving vehicles. A common approach adopted in the literature on vehicle detection is to perform background subtraction [12], [13] and then identify vehicles as the foreground pixel blobs. The vehicle counting problem is thus turned into a pixel blob counting exercise. This way of vehicle counting works well only for simple scenes where the vehicles are far away from each other so that pixel blobs are not connected together due to partial occlusion or due to long shadow cast by the late afternoon sun. Using background subtraction, it is also unclear how to track the irregular foreground pixel blobs for vehicle speed estimation. Recently, object detection algorithms based on deep learning have achieved impressive detection accuracy [14]–[22]. These algorithms read in the whole video frame and output the bounding boxes that enclose the target objects (e.g., vehicles). Compared to background subtraction techniques, deep learning based object detectors are more robust to illumination variation, shadows, and partial occlusion in the video.
Our research contributions are summarized below:
• Our proposed pipeline combines object detection and multiple object tracking to count and classify vehicles from video captured by a single camera at each site. We employ state-of-the-art object detectors rather than adaptively modelling the background so that the vehicle detection module is robust against illumination change and effect of shadows. By weakly calibrating the camera, a novelty of our method is using the 3×3 image-to-world homography to warp the bounding box that encloses each detected vehicle onto the ground plane to yield real-world measurements. This gives our pipeline the efficacy of counting vehicles by lane and classifying vehicles based on their lengths in metres. Furthermore, using the same homography, our monocular vision pipeline can estimate the speeds of vehicles in kilometres per hour. To the best of our knowledge, this is the first time where object detection and projective geometry are combined to yield 3D measurement from monocular videos of traffic scenes.
• Our proposed pipeline also includes a visual classifier module that can be combined through a voting scheme with the vehicle length obtained from projective geometry to further improve the vehicle classification results. Further modules constituting our pipeline are pedestrian and cyclist counting based on their direction of travel in a zoom-in view. All of the metadata produced by the modules in the pipeline are automatically saved to a spreadsheet to facilitate traffic analysis.
We have empirically shown that a clever application of our computer vision based technology could dramatically boost the utilisation of legacy assets and achieve counting accuracy that is on a par with dedicated traffic counting devices. Furthermore, our solution can also classify vehicles to a reasonable degree of success. The estimation of vehicle speed in km/h using the image plane to ground plane homography computed from the weakly calibrated camera is also a justified contribution to the research in intelligent transportation systems.
The rest of the paper is structured as follows. Section II presents some related work. The proposed tracking algorithm is elaborated in Section III. Section IV details our experimental results and Section V outlines our conclusion and future work.

II. RELATED WORK
Our proposed pipeline for traffic video analysis covers research topics in object detection, object classification, multiple object tracking, and traffic surveillance. Related work in the literature for these topics is briefly reviewed in this section.

A. Object Detection
Currently the area of object detection is dominated by two main approaches: one stage detection and two stage detection.
One stage detectors, such as YOLO [14] and SSD [17], treat object detection as a regression problem where the object classes and the bounding box coordinates are predicted directly. On the other hand, two stage detectors (e.g., R-CNN) include two stages. The first stage is to generate many region proposals using a search method (e.g., using a selective method [23] or a region proposal network (RPN) [20]) and the second stage is to pass these region proposals for classification and bounding box regression. Compared to the one stage detectors, two stage detectors usually achieve better detection rates but are also slower due to the number of steps involved. Proposed in 2016, YOLO [14] is a one stage object detector that achieves a frame rate of 45 on a Titan X GPU. YOLO has 24 convolutional layers and 2 fully connected layers. A faster version of YOLO has 9 convolutional layers instead of 24. The final output prediction of the network is a 7×7×30 tensor. A later version of YOLO, known as YOLOv2 (also referred to as YOLO9000) [15], can process up to 67 frames per second depending on the video resolution. Like its two previous versions, YOLOv3 [16] is another fast object detector. It contains some incremental improvement to YOLOv2. Another one stage method is the Single Shot MultiBox Detector (SSD) method proposed by Liu et al. [17], where the object classes and bounding boxes are predicted together for a set of default anchor boxes.
Girshick et al. [19] propose a two stage detector known as R-CNN, where convolutional neural network (CNN) features are computed for each region proposal. The feature vectors are scored using the SVM trained for each object class. The final scored regions then go through a non-maximum suppression step to reject overlapping regions. Fast R-CNN [18], a later method from the same group of authors, extends R-CNN by selecting the region-of-interest (ROIs) from feature maps computed from convolution. It is shown in the paper that their method improves the speed as well as accuracy. Faster R-CNN [20] further improves Fast R-CNN with an RPN and can run at a speed of 5 frames per second. However, this frame rate is still quite low compared to YOLOv2 [15] mentioned above. By using the so-called focal loss function instead of the traditional cross entropy loss function, Lin et al. [21] report that RetinaNet can better handle the foreground-background class imbalance issue. In the same year, the same group of authors also propose the Mask R-CNN detector [22] based on an extension of Faster R-CNN, giving similar detection performance and run-time as the RetinaNet.

B. Multiple Object Tracking
Tracking by detection is a widely used approach in multiple object tracking (MOT). After putting bounding boxes on the detected targets in each video frame, tracking by detection techniques formulate the MOT task as a process of associating the detected bounding boxes between successive video frames. The task in MOT can therefore be considered as a data association problem [24], [25]: estimating the correct assignment events between the targets found in the previous frames and the detections in the current frame. Unlike single object tracking, MOT techniques need to be robust in dealing with the identity switch problem when the objects to be tracked have similar appearances. Perera et al. [26] tackle this issue by evaluating all the possible hypotheses in the trajectory splitting and merging step. Their tracking algorithm is applied to traffic scenes and they use the Stauffer-Grimson background modelling algorithm to detect moving objects. Different from the work above, Huang et al. [27] propose to associate detected results globally in a three-level framework. However, their algorithm can only work offline, which means that the whole video sequence must be read in advance. Other MOT papers targeting at handling interaction between tracked objects [28], camera motion [29], online tracking [30], and incorporating different motion models for the objects [31], [32] have also been reported.

C. Object Classification
Object classification is perhaps the computer vision research area that has received the most attention from researchers. Related to this research area is the creation of many large benchmark datasets, such as ImageNet [33], VOC [34], and COCO [35], making it possible to train complex deep learning methods for object classification. Typical deep network architectures for object classification include the classical LeNet-5 [36], AlexNet [37], GoogLeNet [38], VGGNet [39], and ResNet [40]. These architectures are famous for their outstanding performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The backbone of these architectures are convolutional layers, pooling layers, and suitable activation functions. Although these architectures have been used in classifying objects of different types (e.g., vehicles, people, bicycles, etc), more recently they have been adapted for classifying different classes of objects of the same types (e.g., different species of birds [41]). Most of these fine grain visual classification techniques involve localizing different parts or modelling the subtle differences of object parts [42]–[45]. For fine grain object classification, our approach differs from the methods reviewed above in that we classify vehicles based on their length and appearance (see Section III-D).

D. Traffic Surveillance Systems
Earlier work focuses on exploring handcrafted features in traffic scenes, e.g., the bag-of-words descriptors have been applied to detect pedestrians and bicyclists for counting [46], a wheel contour extraction method [47] has been used for traffic accident analysis. Intelligent traffic systems based on surveillance cameras have made significant progress recently. The faster R-CNN object detector and MobileNets [48] have been combined for traffic sign recognition [49]. To deal with the problem of scale variation in vehicle detection, a scale-insensitive model is proposed in [50], where a context-aware ROI pooling layer is designed to extract feature maps for vehicles with small scales. They further use a multi-branch decision network to classify vehicles with a large variance of scales. To estimate the vehicle density in video images directly, a model based on deep CNN is designed in [9]. The problem of estimating the vehicle density from the input image is then formulated as a regression problem where the number
traffic is heavy (some vehicles are partially occluded). In such scenarios, we would use Faster R-CNN. In most other scenarios, we use either YOLOv2 or SSD because of their fast detection speeds.
To adapt the three detectors for our proposed pipeline, we make the following modifications to the network architectures:
• For YOLOv2, the number of filters (C) for the final 1×1 convolutional layer is the only layer that needs to be changed for retraining. According to the YOLOv2 paper, C = (5 + K)B, where B denotes the number of anchor boxes, K denotes the number of classes, and the constant 5 is for storing the probability, the 2D location, width, and height of each bounding box. In our case we have the following classes: vehicle, motorcycle, pedestrian, and cyclist. So K = 4. We follow the YOLOv2 paper and set B = 5. The output is therefore C = (5 + 4) × 5 = 45, i.e., a 45-dimensional tensor.
• For SSD, 6 feature maps extracted from the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 layers are used for detection. While there are 6 anchor boxes defined for the Conv7, Conv8_2 and Conv9_2 layers, only 4 anchor boxes are defined for the other three layers. In SSD, the predictions for bounding boxes and for classes are separated. Thus, with B anchor boxes for each feature, the numbers of bounding boxes and classification confidences are 4B and (4 + 1)B (including the "background" class) respectively.
• Similarly, for Faster R-CNN, we set the output number of the final fully connected layer to 5 (including the "background" class).
For each detector, the network is retrained by fine-tuning the initial weights of the network that has been pretrained using ImageNet [33]. A dataset is composed by manually cropping 5,000 bounding boxes of vehicles, motorbikes, pedestrians, and cyclists from 1,150 video frames of different sites. In particular, we focus on cases such as very long vehicles,

B. Multiple Object Tracking and Tracklet Construction
For each bounding box detected in each video frame, based on the class ID (vehicle, motorcycle, pedestrian, and cyclist), the tracklet construction module uses the Kalman filter (KF) to predict the bounding box's state vector for the next video frame and uses data association for prediction-detection assignment.
Our simplified Kalman filter model has a fixed state transition matrix F and a fixed matrix H which maps the state space to the observation space. The prediction and update steps of our Kalman filter are:
Predict:
\hat{x}_{k|k-1} = F \hat{x}_{k-1|k-1}  (3)
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_k  (4)
Update:
\tilde{z}_k = H \hat{x}_{k|k-1}  (5)
\tilde{y}_k = z_k - \tilde{z}_k  (6)
S_k = R_k + H P_{k|k-1} H^T  (7)
K_k = P_{k|k-1} H^T S_k^{-1}  (8)
P_{k|k} = (I - K_k H) P_{k|k-1} (I - K_k H)^T + K_k R_k K_k^T  (9)
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k  (10)
where the subscript j|k denotes the estimation at frame j given the observations up to (and including) frame k. Q_k \in \mathbb{R}^{7\times 7} (appears in Eq. (4)) and R_k \in \mathbb{R}^{4\times 4} (Eqs. (7) and (9)) are two covariance matrices.
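A minimal NumPy sketch of the predict/update cycle in Eqs. (3)-(10) follows. The 7-dimensional state layout (bounding-box position and size plus velocities for the first three components) is an assumption consistent with the F and H reconstructed here, and the noise covariances Q and R are placeholder values.

```python
import numpy as np

# Assumed constant-velocity model: state = [x, y, w, h, vx, vy, vw],
# observation = [x, y, w, h].
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0              # add velocities to the first three components
H = np.hstack([np.eye(4), np.zeros((4, 3))])   # projects the state onto the observation
Q = np.eye(7) * 1e-2                           # placeholder process noise covariance
R = np.eye(4) * 1e-1                           # placeholder observation noise covariance

def kf_predict(x, P):
    """Eqs. (3)-(4): propagate the state estimate and its covariance."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z):
    """Eqs. (5)-(10): correct the prediction with the assigned bounding box z."""
    z_pred = H @ x                        # Eq. (5), predicted observation
    y = z - z_pred                        # Eq. (6), innovation
    S = R + H @ P @ H.T                   # Eq. (7)
    K = P @ H.T @ np.linalg.inv(S)        # Eq. (8), optimal Kalman gain
    I_KH = np.eye(7) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T   # Eq. (9), Joseph-form covariance update
    x = x + K @ y                         # Eq. (10)
    return x, P
```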
Fig. 5. Some example images used for training the object detectors.
Fig. 6. The training and validation loss plots for detectors SSD, YOLOv2 and Faster R-CNN.
The term P_{k|k} denotes the posterior error covariance matrix of the state vector estimate; z_k, \tilde{z}_k \in \mathbb{R}^4 denote the actual and predicted observation vectors; \tilde{y}_k is referred to as the innovation vector; and K_k is the optimal Kalman gain. As a constant velocity model for the state transition is adopted, F takes on the form given in Eq. (11) below:

F = \begin{bmatrix} I_{4\times 4} & \begin{bmatrix} I_{3\times 3} \\ 0_3^T \end{bmatrix} \\ 0_{3\times 4} & I_{3\times 3} \end{bmatrix} \in \mathbb{R}^{7\times 7}  (11)

and H in Eqs. (5)-(9) is a standard projection matrix: H = [I_{4\times 4} \; 0_{4\times 3}] \in \mathbb{R}^{4\times 7}, where 0_m denotes an \mathbb{R}^m zero vector.
The challenge in this module is: at each video frame, a number of bounding boxes are returned by the object detector. That is, for each tracklet constructed from the previous video frames using a Kalman filter, a suitable observation z_k must be selected from a list of bounding boxes detected in the current frame. We tackle this as an assignment problem using the Hungarian algorithm, i.e., the assignment of z_k's to all the tracklets that have been constructed so far is done simultaneously by maximizing the global intersection-over-union (IoU) value [53]. Figure 7 illustrates an example where there are 2 tracklets constructed up to frame k−1 and there are 3 bounding boxes detected at frame k. The blue rectangles are the predicted observations \tilde{z}_k computed using Eqs. (3) and (5) from the Kalman filter. The obvious assignment that maximizes the IoU value is as shown in the figure, where the two red and two blue bounding boxes have large overlapping regions. The red bounding box at the top-left corner is not assigned to any tracklet as the Hungarian algorithm imposes a 1-to-1 assignment. This bounding box may form a new tracklet or may be removed. Whichever case that would take place depends

Fig. 7. An example of the construction of two tracklets and a possible new tracklet. The 3 red rectangles are the detected bounding boxes for the current frame. The 2 blue rectangles are the predicted bounding boxes for the 2 existing tracklets using Eqs. (3) and (5).

Fig. 9. Perspective view of a vehicle's bounding box being warped onto the ground. Relative to the orthogonal coordinate system O^r X^r Y^r, the four M_i^r points form an arbitrary quadrilateral. The three virtual points P^r, Q^r, and R^r give the width and length of the vehicle.

TABLE I: Vehicle classes according to the Australian vehicle classification system. In our pipeline, the long vehicles in Classes 10 to 12 are grouped together to form one large class, giving a total of four classes.

fall into Class 3-5; articulated vehicles fall into Class 6-9; road trains (medium to long trucks with trailers) fall into Class 10-12. Figure 8 shows some example vehicles in each of these classes.
1) Vehicle Length Estimation Using Homography: Let {M_i | i = 1, ..., 4} be the image coordinates of the four corners of a vehicle's bounding box returned by a vehicle detector. From the homography H, the points {M_i^r | i = 1, ..., 4},¹ which are the warped coordinates of the M_i's onto the ground plane defined by O^r X^r Y^r, form an arbitrary quadrilateral. For the vehicle inside the bounding box, of interest is the bottom (the wheels) of the vehicle that touches the ground. The three points P^r, Q^r, and R^r are virtual points on the ground at the bottom of the vehicle. With respect to O^r X^r Y^r, we have Q^r P^r ⊥ Q^r R^r, and Q^r P^r is approximately orthogonal to the direction of the closest lane-mark (denoted by A^r). As shown in Fig. 9, P^r, Q^r, R^r, and {M_i^r | i = 1, ..., 4} are all points defined on the ground plane. P^r Q^r and P^r R^r correspond to the width and length of the vehicle. Our goal is to define P^r, Q^r, and R^r in terms of the four known M_i^r points.
Without loss of generality, we drop the third component (which is just 1) of the homogeneous representation of all the points involved and work directly on 2D inhomogeneous coordinates. From Fig. 9, we have

P^r = \alpha M_3^r + (1 - \alpha) M_4^r  (12)
Q^r = \beta M_1^r + (1 - \beta) M_4^r  (13)
R^r = \gamma M_2^r + (1 - \gamma) M_3^r  (14)

The two unknowns α and β (both are in the range 0..1) can be solved using the following two constraints:
• ‖P^r − Q^r‖ = w, where w is the width of the vehicle;
• P^r Q^r is orthogonal to A^r, where A^r is the line coordinates of the lane-mark that is closest to the vehicle.
Once α is known, P^r can be computed from Eq. (12). The parameter γ ∈ (0, 1) (see Fig. 9) can be computed in a similar manner using the constraint that P^r R^r is parallel to A^r. Finally, R^r is obtained from Eq. (14) and the vehicle's length P^r R^r is deduced.
Figure 9 shows the setup where the overhead camera views the scene from the right. For the case where the overhead camera views the scene from the left (like the Hutton Street site shown in Fig. 1), the positions of P^r and R^r would be reversed. However, the computation remains the same if one properly swaps the M_i^r terms.
The above computation requires the width w of the vehicle to be known. Most Class 1 vehicles have w ≈ 1.6 m; larger

¹Each M_i^r is computed using Eq. (1): \lambda_i M_i^r = H M_i, which gives the homogeneous coordinates of M_i^r. It is then straightforward to normalize M_i^r so
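The footnoted relation \lambda_i M_i^r = H M_i is what OpenCV's perspectiveTransform computes (apply H in homogeneous coordinates, then normalize). A sketch of warping a bounding box onto the ground plane follows; the homography entries are placeholders standing in for the result of the weak calibration step.

```python
import cv2
import numpy as np

# Placeholder 3x3 image-to-world homography from the weak calibration step.
H = np.array([[0.05, 0.00, -10.0],
              [0.00, 0.08, -25.0],
              [0.00, 0.001,  1.0]])

def warp_box_to_ground(box):
    """Map the four corners of an image bounding box (x, y, w, h) onto the
    ground plane, returning the four M_i^r points as a (4, 2) array."""
    x, y, w, h = box
    corners = np.array([[[x, y]], [[x + w, y]],
                        [[x + w, y + h]], [[x, y + h]]], dtype=np.float64)
    # perspectiveTransform applies H in homogeneous coordinates and normalizes,
    # which is exactly the lambda_i * M_i^r = H * M_i relation of Eq. (1).
    return cv2.perspectiveTransform(corners, H).reshape(4, 2)

# Example: distances between warped points are in ground-plane units (metres),
# from which the vehicle length P^r R^r can be derived as in Eqs. (12)-(14).
ground_pts = warp_box_to_ground((320, 240, 80, 60))
print(np.linalg.norm(ground_pts[0] - ground_pts[2]))
```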
Fig. 11. The training and validation error plot of the visual classifier.
Fig. 13. Vehicle and pedestrian/cyclist detection results from videos from the (a)-(d) Narrows Bridge North site, (e)&(f) Hutton Street site, and (g)&(h) Narrows
Bridge South site. The object classes are indicated by the colours of their bounding boxes. The Classes 1, 2-5, and 6-9 vehicles have, respectively, blue, green,
and red bounding boxes while pedestrians and cyclists have dark blue and light green ones. White colour boxes mean that the detected objects have not yet been
classified and counted.
cameras on the other two sites only saw the back parts of vehicles, the YOLOv2 detector was able to robustly detect vehicles at all the three sites. Overall, the detector is immune to rainy conditions (Fig. 13(a)), night time (Fig. 13(b)), and complicated illumination (Figs. 13(e) and 13(f)). When the traffic is busy and vehicles occlude each other (Fig. 13(h)), the detector is still robust enough to get accurate bounding boxes of vehicles. Fig. 13(c) shows a group of pedestrians correctly detected and Fig. 13(d) shows the detection result of a cyclist.

D. Quantitative Evaluation
1) Vehicle Counting and Classification: As the vehicle counting and vehicle classification modules rely on the tracking results from the upstream module, it is necessary to investigate the performance of the object tracking and tracklet construction module first. We found in our experiments that the IoU value worked well with the Hungarian algorithm, since each input video passed to the pipeline has a reasonably high frame rate. This guarantees that the bounding boxes of the same object have sufficient overlap between frames. The tracking performance of the Kalman filters depends on how accurate the detector is in locating the targets and whether the false positive and false negative detection rates are sufficiently low. This means that the accuracy of the vehicle counting and classification is ultimately governed by which detector is used. The vehicle counting results from the proposed method are shown in Table II. The vehicle counts from YOLOv2 and Faster R-CNN are clearly closer to the ground truth values (obtained from manual counting) than those from SSD. This is because of the shallower network of SSD for feature extraction compared to those of YOLOv2 and Faster R-CNN, which means that the features learned by SSD are not discriminative enough for detection. One can notice that SSD failed to detect all the vehicles in the 6-9 and 10-12 classes. In summary, SSD has 66% to 87% counting accuracy, while YOLOv2 and Faster R-CNN have 90% to 98%.
In terms of the performance on the three sites, all the three detectors had higher detection rates for the Narrows Bridge North and Narrows Bridge South sites. This is due to the camera installed at the Hutton Street site being further away from the scene, making the vehicles smaller. Another reason is that the Hutton Street video was recorded in a morning where shadows from the trees and nearby vehicles greatly affect the performances of all detectors (Figs. 13(e)&(f)).
The results from Faster R-CNN were better than those from the other detectors as expected, as the detector gave more accurate bounding boxes of the vehicles, which are very important for tracking, counting and classification. While YOLOv2 achieved very similar performance as Faster R-CNN in all three sites, SSD dealt with the Narrows Bridge North site relatively better, where the front parts of vehicles are visible. All the detectors performed well on counting the Class 1 vehicles and less so on the Class 2-5 vehicles. While SSD misclassified many vehicles from Classes 2-5 and 6-9, YOLOv2 and Faster R-CNN achieved much better results for these two classes. Incorporating visual information appeared to help improve the detection rate but occasionally reduce the detection rate on the smaller vehicles. For example, for the Narrows Bridge South site, there is one vehicle from Class 10-12. While using the homography method to compute the vehicle length gave 3 vehicle counts for both YOLOv2 and Faster R-CNN, incorporating visual information helped to remove the two misclassified vehicles.
2) Speed Analysis: The piezoelectric data available at the Narrows Bridge South site makes it possible to benchmark the vehicle speed estimation module of our computer vision pipeline. The recorded speed data in Fig. 14(a) shows 2 regimes: free flow (at time 14:00-14:35) and congested (14:35-15:10). The vehicle speeds from our method and from the piezoelectric sensor are shown as red and blue dots. We note that our method computes the average speed over the whole trajectory length of each vehicle whereas the piezoelectric sensor measures the instantaneous speed of the vehicle. In the figure, the red and blue curves show the average speed of all vehicles falling into the same one-minute period. While the average speeds from both methods are similar for the congested period, there is a relatively large gap for the free flow period. The fact that the gap appears to be consistent suggests that it was caused by a systematic error.
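The IoU-based prediction-detection assignment used by the tracklet construction module (Section B above) can be sketched with the Hungarian algorithm as provided by scipy.optimize.linear_sum_assignment; the IoU gating threshold below is an assumed value, and the paper's exact formulation may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign_detections(predicted, detected, min_iou=0.3):
    """Match each predicted tracklet box to at most one detected box so that
    the total IoU is maximized; unmatched detections may start new tracklets."""
    cost = np.zeros((len(predicted), len(detected)))
    for i, p in enumerate(predicted):
        for j, d in enumerate(detected):
            cost[i, j] = -iou(p, d)        # negate: the Hungarian solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    # Discard weak matches so a far-away detection is not forced onto a tracklet.
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= min_iou]
```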
TABLE II: Vehicle counting results for the three sites using the classification method based on vehicle length only (Section III-D.1) versus the classification method based on both vehicle length and visual information (Section III-D.3) on the bounding boxes extracted by the three detectors SSD, YOLOv2, and Faster R-CNN.
Fig. 14. Speed analysis of data from our method and from the piezoelectric sensor.
Fig. 15. The indices of vehicles for the time-space diagram in Fig. 16.
With the absence of absolute ground truth, it is not possible to judge whether our method overestimated the vehicle speeds or the piezoelectric sensor underestimated them, or a combination of both. A contributing factor to our possible overestimation is the slope of the road segment there. This small slope is not detectable in the 2D Google Maps, and so the 3D coordinates of landmark points that were used for our homography computation might not be accurately provided by Google Maps. For slow vehicle speeds (the congested period), the effect from the small errors of the homography matrix is negligible; however, for higher vehicle speeds, the errors in the homography matrix are the possible cause of the overestimation of vehicle speeds.
Figure 14(b) shows the line fitted to the two sets of average speed values (denoted by μ_i and ν_i, for all i) computed by the two methods. After readjusting the vision-based average speed values using the slope and y-intercept parameters obtained from the fitted line, the root mean square error (RMSE) between the two sets of values is 1.85, which corresponds to a normalized RMSE (computed as RMSE/μ̄, where μ̄ is the mean of the μ_i's) of 3%. This shows that after correcting the systematic error, the two sets of values agree well.
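The readjustment just described amounts to a linear fit followed by an RMSE computation; a sketch with placeholder data:

```python
import numpy as np

# Placeholder one-minute average speeds (km/h) from the two methods.
mu = np.array([78.0, 80.5, 82.1, 60.2, 45.3])   # piezoelectric sensor averages
nu = np.array([84.2, 86.9, 88.0, 61.0, 46.1])   # vision-based averages

# Fit nu ~ slope * mu + intercept, then map the vision speeds back onto mu's scale.
slope, intercept = np.polyfit(mu, nu, 1)
nu_adjusted = (nu - intercept) / slope

rmse = np.sqrt(np.mean((nu_adjusted - mu) ** 2))
normalized_rmse = rmse / np.mean(mu)    # RMSE divided by the mean of the mu_i's
print(rmse, normalized_rmse)
```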
The speed-density plot is another typical diagram commonly used in traffic analysis (interested readers can refer to Fig. 2 of the paper by [59]). Figure 14(c) shows the average speed versus density plot for the two methods. For each point in the diagram, the estimated density value is the number of vehicles per km over 5 lanes. This diagram agrees with Fig. 14(a) in that the larger average speed values estimated by our method give slightly lower density values in this plot (the red dots are slightly to the left of the blue dots). We can see that in the congested period, the average speed drops to below 50 km/h. By dividing the density by 5 to yield the density per lane, one can obtain the average headway (spacing) (approximately 45 metres and 24 metres respectively for the free and congested periods), which is a useful piece of information in traffic engineering.

E. Technique Analysis
1) "Zoom-in" Detection: As shown in Fig. 13(a), pedestrians and cyclists are very small targets in the whole video frame and are difficult to be detected by many object detectors, especially for YOLO. In our proposed pipeline, cropped regions instead of the whole video frames can be passed to the detector. By feeding the small cropped regions near the bottom-right corner of Fig. 13(a), the zoom-in region has most of the pedestrians (Fig. 13(c)) and the cyclist (Fig. 13(d)) successfully detected. "Zoom-in" detection allows our pipeline to detect small pedestrians and cyclists when the scene is viewed from afar.
2) Handling Wrong Detections: The detectors sometimes gave false positive detections, which means some wrong areas were detected as targets (vehicles, pedestrians, motorbikes, or cyclists). In this case, a tracklet for each false positive detection would be constructed as well. If the area occupied by the false detection is "dynamic" (e.g., vehicles keep moving through it), then there is little chance the false detection would occur again in that area for the subsequent frames. As long as the tracklet is not assigned to any new detection for 5 frames, the tracklet would be deleted. If the area is relatively static (e.g., a piece of rock on the road side), then the false detection might persist over many frames. In this case, it would not affect the final counting results because it would not cross the counting line (see Section III-F). Our method can therefore robustly deal with false positive detections from the detector through the tracking process. For false negative errors, Faster R-CNN is more robust than YOLOv2, which is more robust than SSD. Due to partial occlusion, especially when the traffic is heavily congested, all the three detectors have higher false

F. Traffic Analysis
Time-space diagrams describe the relationship between the locations of vehicles in a traffic stream over time as the vehicles progress along a highway. These diagrams are a useful diagnostic tool especially for freeway merging and traffic weaving analysis. However, they are not frequently used because of the difficulties in getting the metadata. Most existing techniques rely on manual processing of the trajectories of vehicles captured by video cameras and they are both labour intensive and inaccurate. Applying our proposed pipeline to an example video sequence from the Hutton Street site (Fig. 15), such a diagram can be easily generated, as shown in Fig. 16. A traffic shockwave is evident in the figure (marked by the long diagonal downward arrow) generated from the merging of traffic on the far left lane. In frame 202 (Fig. 15(a)), vehicles 1-4 follow each other closely. The merging of vehicle 5 in frame 508 caused a delay, resulting in a headway of around 7 seconds between vehicles 4 and 5. Consequently, vehicles 10-12 had to stop momentarily, as shown by the two ovals which mark the time and location at which vehicles entered the stop-and-start status. Other valuable information that is useful for traffic analysis, such as the spatial distance between adjacent vehicles (the spacing between trajectories measured along the vertical axis) and headway (the time difference along the horizontal axis), can be obtained from the time-space diagram generated by our proposed pipeline.

Fig. 16. The time-space diagram produced by our method for the Hutton Street site video shown in Fig. 15.

V. CONCLUSION AND FUTURE WORK
We have presented a vision-based pipeline consisting of modules for vehicle counting by lane, vehicle speed estimation, vehicle classification, and pedestrian/cyclist counting from monocular videos. We achieve these tasks by adapting state-of-the-art object detection techniques through transfer learning and through a novel application of projective geometry. Our vehicle classification module incorporates a novel fusion of visual appearance and geometry information of vehicles in the scene. Our pipeline has been demonstrated through extensive experiments to give promising counting and speed estimation results.
In our ongoing research effort, we have already extended the work reported in this paper to traffic scenes captured by a drone at four-way intersections, where vehicles moving in different directions need to be separated and more severe occlusion problems need to be dealt with. A limitation of our current method is the weak camera calibration step being performed once only at the beginning of the pipeline. If the camera is panned, tilted, or drifted (especially if the camera is carried by a drone), then the pre-computed homography is no longer valid. This limitation can be overcome by adding a new module to track the calibration landmarks. For any detected displacement of these landmarks in the image, the module can invoke an update of the homography.
Our future research work can be extended in several direc- [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierar-
tions. We intend to deploy the proposed method to daily chies for accurate object detection and semantic segmentation,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
operations and use the data generated for network operations, [20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
traffic flow monitoring, and also anomaly detection. One of time object detection with region proposal networks,” in Proc. Adv.
our research interests is to detect abnormal events such as Neural Inf. Process. Syst., 2015, pp. 91–99.
[21] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
pedestrians (or cyclists) on a freeways, vehicles driving in dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
wrong directions, vehicles driving at abnormal speeds (too fast Oct. 2017, pp. 2980–2988.
or too slow), and traffic accidents. As mentioned above, we [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
Proc. ICCV, Oct. 2017, pp. 2980–2988.
have already started to process traffic videos captured by [23] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and
drone cameras at intersections. Adding a second camera to the A. W. M. Smeulders, “Selective search for object recognition,” Int. J.
system is part of our future research endeavour. This will help Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013, doi: 10.1007/
overcome occlusion problems and also the homography error s11263-013-0620-5.
[24] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association.
due to the slope of the road mentioned in ourexperiments. London, U.K.: Academic, 1988.
[25] S. S. Blackman, “Multiple hypothesis tracking for multiple target
ACKNOWLEDGMENT tracking,” IEEE Aerosp. Electron. Syst. Mag., vol. 19, no. 1, pp. 5–18,
Jan. 2004.
The Titan Xp GPU funded by Nvidia for this research is [26] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu,
greatly appreciated. “Multi- object tracking through simultaneous long occlusions and split-
merge conditions,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 666–673.
Du Q. Huynh (Senior Member, IEEE) received the Ph.D. degree in computer vision from The University of Western Australia, Perth, WA, Australia, in 1994. Since then, she has been with the Australian Cooperative Research Centre for Sensor Signal and Information Processing and Murdoch University, Perth, WA, Australia. She is currently an Associate Professor with the Department of Computer Science and Software Engineering, The University of Western Australia. She has previously researched shape from motion, multiple view geometry, and 3-D reconstruction. Her current research interests include visual object tracking, video image processing, machine learning, and pattern recognition.

Yuchao Sun received the master's degree in information management and the Ph.D. degree from The University of Western Australia in 2003 and 2016, respectively. He has worked in both industry and academia, in which he has participated in a range of research and consulting activities, including some large-scale infrastructure and mining projects. With a focus on applying modern computing techniques to solve transport problems, his main research interests include emergent behavior, bio-inspired algorithms, the impact of future transport technologies (especially connected and autonomous vehicles), transport modeling, data analytics, optimization, and discrete event simulation for supply chain management.

Mark Reynolds (Member, IEEE) received the B.Sc. degree (Hons.) in pure mathematics and statistics from The University of Western Australia (UWA), Perth, WA, Australia, in 1984, the Ph.D. degree in computing from the Imperial College of Science and Technology, University of London, London, U.K., in 1989, and the Diploma of Education degree from UWA in 1989. He is currently a Professor and the Head of the School of Physics, Mathematics, and Computing with UWA. His current research interests include artificial intelligence, optimization of schedules and real-time systems, optimization of electrical power distribution networks, machine learning, and data analytics.