Final Report

The document outlines a project focused on developing a vision-based system for vehicle detection, speed estimation, and classification using deep learning techniques. It emphasizes the use of a single camera and various algorithms, such as YOLOv3 and TensorFlow, to improve traffic data extraction from videos captured by legacy cameras, aiming to create a cost-effective alternative to traditional radar systems. The literature survey reviews various methods and technologies related to traffic monitoring and vehicle classification, highlighting their advantages and disadvantages.


CONTENTS

CHAPTER NO DESCRIPTION

ACKNOWLEDGEMENT

SYNOPSIS

SOURCE CODE

1 INTRODUCTION

2 LITERATURE SURVEY

3 PREAMBLE

4 REQUIREMENT SPECIFICATION

5 SYSTEM DESIGN

6 SYSTEM IMPLEMENTATION

7 SYSTEM TESTING

8 RESULT

9 CONCLUSIONS AND FUTURE ENHANCEMENT

REFERENCES

APPENDIX
SOURCE CODE
CHAPTER 1

INTRODUCTION
1.1 INTRODUCTION OF PROJECT

The police use RADAR devices to measure vehicle speed. The device must be set up at a
certain distance from the road, requires an operator to aim it at passing vehicles, and has to
be deployed along most highways, which is expensive. Cameras have been widely used in
traffic operations. While many technologically smart camera solutions on the market can be
integrated into Intelligent Transport Systems (ITS) for automated detection, monitoring and
data generation, many Network Operations Centres still use legacy camera systems only as
manual surveillance devices. Intelligent transportation systems have become increasingly
important in many modern cities as governments rely more on them to enable smart decision
making (for both agencies and individual users) and better use of the existing infrastructure.
To address the above problem, we propose a system that extracts traffic data from videos
captured by legacy cameras. From the extracted data we detect vehicles, classify them (e.g.,
car, bus), estimate the speed of each detected vehicle, and keep a count of the number of
vehicles that pass the camera, using a deep learning model. In the fields of AI and computer
vision there are various techniques that can be used for object detection, and these techniques
can also be applied to detect vehicles and estimate their speed. The system that we are
introducing uses the following: (1) the YOLOv3 algorithm for real-time object detection,
(2) the TensorFlow library to create the models and neural networks that detect and classify
objects, and (3) the OpenCV library to calculate the speed of the vehicles.

1.2 OBJECTIVES AND SCOPE OF THE PROJECT

1.2.1 Key Objectives


This project proposes a novel approach and technique to efficiently detect and track
vehicles. The proposed technique detects, tracks and extracts vehicle parameters for speed
estimation using a single camera. The project also proposes a cropping method to minimise
false positive vehicle detections. In such a system, the camera must be situated on a traffic
signal pole, approximately 10 metres or more above road level and directed towards the
centre of the road. This installation minimises the effect of occlusion.

1.2.2 Scope of the Project:

The main idea of the project is to develop a vision-based pipeline for vehicle counting,
speed estimation and vehicle classification. It uses computer vision techniques to extract
traffic data from videos captured by cameras, applying object detectors and transfer learning
to detect vehicles, pedestrians and cyclists from monocular videos. The main objective is to
use a camera instead of Radar, which requires manpower for aiming at vehicles and must be
deployed on most highways. The existing Radar-based technique is too costly, so it is
necessary to design a system that is affordable and built from cost-effective components.

1.2.3 IMPORTANCE OF THE PROJECT

The system that we are introducing uses the following: (1) the YOLOv3 algorithm for
real-time object detection, (2) the TensorFlow library to create the models and neural
networks that detect and classify objects, and (3) the OpenCV library to calculate the speed
of the vehicles. For verification and testing of the proposed approach, four different videos
recorded under different environmental conditions (morning, afternoon, evening and a partly
cloudy day) are used. In the proposed method, detection and tracking of the vehicles use
parameters such as the position, height and width of each vehicle instead of feature
extraction, which requires less computation and memory. The proposed approach stores the
vehicle parameters and the estimated speeds of the detected vehicles in a database. The
proposed system can be adopted easily in an existing traffic management system.
CHAPTER 2

LITERATURE SURVEY
[1] Alternative Automatic Vehicle Classification Method
The paper deals with a new method for automatic vehicle classification called ALT
(ALTernative). Its characteristic feature is flexibility resulting from its open structure: a user
can change the number of vehicle classes and their definitions according to individual
requirements. It uses an algorithm for automatic vehicle recognition based on data fusion
techniques and fuzzy sets. Test results showed high classification effectiveness while
retaining high selectivity of segmentation: a classification effectiveness of 95% for all
vehicles and 100% for goods trucks is more than satisfactory.
Advantage: The number of categories is not fixed and can be modified according to the
traffic in a given area if required.
Disadvantage: The decision is taken on the grounds of classical logic ("should" or "should
not"), which is the reason for the low effectiveness of such classification algorithms.

[2] Adaptive background mixture models for real-time tracking


A common method for real-time segmentation of moving regions in image sequences
involves "background subtraction", i.e. thresholding the error between an estimate of the
image without moving objects and the current image. The numerous approaches to this
problem differ in the type of background model used and the procedure used to update the
model. This paper discusses modelling each pixel as a mixture of Gaussians and using an
on-line approximation to update the model. The Gaussian distributions of the adaptive
mixture model are then evaluated to determine which are most likely to result from a
background process. Each pixel is classified based on whether the Gaussian distribution that
represents it most effectively is considered part of the background model. This results in a
stable, real-time outdoor tracker that reliably deals with lighting changes, repetitive motions
from clutter, and long-term scene changes. The system has been run continuously for
months, 24 hours a day, through rain and snow.
Advantage: One of the significant advantages of this Gaussian method is that when
something is allowed to become part of the background, it does not destroy the existing
model of the background.
Disadvantage: Robust tracking under rapid lighting changes is not achieved.

[3] Real-time foreground–background segmentation using codebook model


The authors present a real-time algorithm for foreground-background segmentation.
Sample background values at each pixel are quantized into codebooks, which represent a
compressed form of the background model for a long image sequence. This makes it
possible to capture structural background variation due to periodic-like motion over a long
period of time under limited memory. The codebook representation is efficient in memory
and speed compared with other background modelling techniques. The method can handle
scenes containing moving backgrounds or illumination variations, and it achieves robust
detection for different types of videos. The method is compared with other multimode
modelling techniques. In addition to the basic algorithm, two features improving the
algorithm are presented: layered modelling/detection and adaptive codebook updating.
For performance evaluation, perturbation detection rate analysis is applied to four
background subtraction algorithms and two videos of different types of scenes.
Advantage: Interesting foreground objects (e.g., people) can be detected even when mixed
with other stationary objects (e.g., cars).
Disadvantage: Backgrounds with fast variations are not easily modelled accurately with just
a few Gaussians, and the method may fail to provide sensitive detection.

[4] Vehicle Colour Recognition with Spatial Pyramid Deep Learning


Colour, as a prominent and stable attribute of vehicles, can serve as a useful and reliable
cue in a variety of applications in intelligent transportation systems. Vehicle colour
recognition in natural scenes has therefore become an important research topic. In this
paper, a deep learning based algorithm for automatic vehicle colour recognition is proposed.
Different from conventional methods, which usually adopt manually designed features, the
proposed algorithm can adaptively learn a representation that is more effective for the task
of vehicle colour recognition, which leads to higher recognition accuracy and avoids
pre-processing. In addition, the widely used spatial pyramid strategy is combined with the
original convolutional neural network architecture, which further boosts the recognition
accuracy. To the best of the authors' knowledge, this is the first work that applies deep
learning to vehicle colour recognition. The experiments show that the proposed approach
achieves better performance than conventional methods.
Advantage: They introduce spatial information into the vehicle colour recognition algorithm
by combining the spatial pyramid (SP) strategy with the framework of deep learning. The
use of spatial information further improves recognition accuracy.
Disadvantage: It might make mistakes or give wrong predictions in certain cases. A majority
of the incorrect predictions are caused by severe illumination or indistinguishable colours.

[5] An Approach to Traffic Flow Detection Improvements of Non-Contact Microwave Radar Detectors


This study applies fast Fourier transforms (FFTs) and prescribed smart decisions to
enhance the traffic detection of non-contact microwave radars (MRs). Adequate thresholds
are selected for filtering FFTs of regional lane contexts in multiple-lane environments, so
that the frequency-modulated continuous waves of an MR reflected from vehicles are
distinguished from clutter and noise. Although lane-crossing FFT side-lobes may occur,
smart decisions improve the reliability of the detected traffic in each lane. On-site urban
traffic experiments demonstrate the advantage and feasibility of the proposed method.
Advantage: They have presented a plausible approach to remarkable improvements in the
traffic flow detection of an MR system.
Disadvantage: Coping with a missing vehicle signal within a given lane due to obstruction
by a large-sized bus in its adjacent lane remains a problem.

[6] Image-Based Learning to Measure Traffic Density Using a Deep Convolutional Neural Network


Existing methods to count vehicles from a road image have relied on both hand-crafted
feature engineering and rule-based algorithms. These require numerous predefined
thresholds to detect and track vehicles. This paper provides a supervised learning framework
that requires no such feature engineering. A deep convolutional neural network was devised
to count the number of vehicles on a road segment based on video images. The method does
not regard an individual vehicle as an object to be detected separately; rather, it collectively
counts the number of vehicles as a human would. The test results show that the proposed
framework outperforms existing schemes.
Advantage: Using filters reduces the number of weight parameters to be estimated, since
each filter shares weight parameters wherever it resides within an image.
Disadvantage: It is difficult to account for how a CNN counts the number of vehicles
exactly.

[7] Using Bluetooth and Sensor Networks for Intelligent Transportation Systems


The safety of road travel can be increased if vehicles can be made to form groups for
sharing data among themselves. The Bluetooth protocol can be used for inter-vehicle
communication among vehicles equipped with Bluetooth devices. This paper presents a
novel approach to increasing the safety of road travel using the concepts of wireless sensor
networks and the Bluetooth protocol. The authors discuss how vehicles can form mobile ad
hoc networks and exchange data sensed by the on-board sensors. The fusion of these data
could give a better understanding of the surrounding traffic conditions. The feasibility of
using Bluetooth for data exchange among vehicles is evaluated. Coverage area and
probability-of-detection plots for isotropic and non-isotropic sensors are analysed to
consider their use in avoiding potentially dangerous traffic situations. As the simulation
results show, Bluetooth and sensor networks can be used cooperatively to increase the
safety of road travel.
Advantage: This increases the overall sensing ability of the sensors mounted on the vehicle.
Disadvantage: The paper does not consider several issues, such as the potential
communications overhead and its effect on communication efficiency, or the integration of
the vehicle-based ad hoc wireless sensor networks with roadside infrastructure, which would
involve vehicle-to-roadside sensor communications.

[8] Simultaneous Traffic Sign Detection and Boundary Estimation Using Convolutional Neural Network


A novel traffic sign detection system is proposed that simultaneously estimates the
location and precise boundary of traffic signs using a convolutional neural network (CNN).
Estimating the precise boundary of traffic signs is important in navigation systems for
intelligent vehicles, where traffic signs can be used as 3-D landmarks of the road
environment. Previous traffic sign detection systems, including recent methods based on
CNNs, only provide bounding boxes of traffic signs as output, and thus require additional
processes such as contour estimation or image segmentation to obtain the precise boundary
of signs. With the predicted 2-D pose and the shape class of a targeted traffic sign in the
input, the actual boundary of the target sign is estimated by projecting the boundary of a
corresponding template sign image into the input image plane. With the architectural
optimization of the CNN-based traffic sign detection network, the proposed method achieves
a detection frame rate higher than seven frames per second while providing highly accurate
and robust traffic sign detection and boundary estimation results on a low-power mobile
platform.
Advantages: They proposed an efficient traffic sign detection method where the locations of
traffic signs are estimated together with their precise boundaries.
Disadvantages: Since the latest architecture for object detection has not been used, the
accuracy is lower.

[9] Using GPS to Measure Traffic System Performance


Traffic system performance can be measured in various ways, but from the user's
perspective, congestion is a key measure. This article examines some novel uses of GPS in
the measurement of vehicle speeds and travel times and their combination into measures of
congestion and, ultimately, of the performance of the metropolitan road system. The article
also discusses the incorporation of GPS-based congestion measures into an ITS framework,
procedures for implementing a congestion monitoring system, and implications for urban
road system planners, managers, and users.
Advantage: The GPS equipment can be transferred quickly and easily from one vehicle to
another and can be installed in any type of vehicle, from large trucks down to motorcycles.
Disadvantage: There is often a problem with communications and efficient exchange of
real-time data when GPS is implemented for large fleets of vehicles (i.e., it suffers from
signal blockage).

[10] Evaluating Color Descriptors for Object and Scene Recognition


Because many different descriptors exist, a structured overview of color-invariant
descriptors is required in the context of image category recognition. Therefore, this paper
studies the invariance properties and the distinctiveness of color descriptors in a structured
way. The analytical invariance properties of color descriptors are explored, using a
taxonomy based on invariance properties with respect to photometric transformations, and
tested experimentally using a data set with known illumination conditions. In addition, the
distinctiveness of color descriptors is assessed experimentally using two benchmarks, one
from the image domain and one from the video domain. Furthermore, a combined set of
color descriptors outperforms intensity-based SIFT and improves category recognition by 8
percent on PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.
Advantages: To increase illumination invariance and discriminative power, color descriptors
have been proposed.
Disadvantages: The three color SIFT descriptors, HSV-SIFT, C-SIFT and rgSIFT, lack
performance under large shifts when compared to the other SIFT variants.

[11] Image Classification using Convolutional Neural Networks


In recent years, due to the explosive growth of digital content, automatic classification of
images has become one of the most critical challenges in visual information indexing and
retrieval systems. Computer vision is an interdisciplinary subfield of artificial intelligence
that aims to give computers a capability similar to that of humans for understanding
information from images. Several research efforts have been made to overcome these
problems, but these methods consider only the low-level features of image primitives, and
focusing on low-level image features alone does not help to understand the images. Image
classification has been a major problem in computer vision for decades.
Humans understand and classify images very easily, but for computers it is a very expensive
task. In general, each image is composed of a set of pixels and each pixel is represented by
different values. Hence, to store an image the computer needs a large amount of space, and
to classify images it must perform a large number of calculations. This requires systems
with higher configurations and more computing power, and taking real-time decisions based
on the input becomes difficult because these computations take time before a result is
produced. The paper discusses extraction of features from Hyper Spectral Images (HSI)
using the Convolutional Neural Network (CNN) deep learning concept. It uses the different
pooling layers in a CNN to extract features (nonlinear, invariant) from the HSI which are
useful for accurate classification of images and target detection. It also addresses general
issues concerning HSI image features. From the engineering perspective, computer vision
seeks to automate tasks that the human visual system can do; it is concerned with the
automatic extraction, analysis and understanding of useful information from images.
In the last decade, several approaches for image classification have been described and
compared with one another. In general, image classification refers to the task of extracting
information from an image by labelling the pixels of the image into different classes. It can
be done in two ways: supervised classification and unsupervised classification.
Disadvantage:
 This system uses the training model SSD_INCEPTION_V2_COCO

[12] Real Time Object Detection, Tracking, and Distance and Motion
Estimation based on Deep Learning: Application to Smart Mobility
In this paper, the authors introduce an object detection, localization and tracking system
for smart mobility applications such as traffic road and railway environments. Firstly, an
object detection and tracking approach was carried out with two deep learning approaches:
You Only Look Once (YOLO) v3 and SSD. Secondly, object distance estimation based on
the Monodepth algorithm was developed. This model is trained on a stereo image dataset
but its inference uses monocular images. As the output, a disparity map is obtained and
combined with the output of object detection. For validation, two models with different
backbones, including VGG and ResNet, were tested on two datasets: Cityscapes and KITTI.
As the last step of the approach, a new SSD-based method was developed to analyse the
behaviour of pedestrians and vehicles by tracking their movements even when there is no
detection on some images of a sequence. An algorithm was developed based on the
coordinates of the output bounding boxes of the SSD algorithm.
The whole development was tested in real vehicle traffic conditions in Rouen city centre,
and with videos taken by embedded cameras along the Rouen tramway.
Disadvantage:
 Sometimes objects may be placed so close together that they appear to be a
single object.
 If there are moving objects, detection of the object becomes troublesome.
Table 2.1 Literature Survey Summary

1. 2019 – Alternative Automatic Vehicle Classification Method – Piotr Burnos.
Methodology: Its characteristic feature is versatility resulting from its open structure; moreover, a user can adjust the number of vehicles and their categories according to individual requirements.
Advantage: The number of categories is not fixed and can be modified.
Disadvantage: The decision is taken on the grounds of classical logic.

2. 2019 – Simultaneous Traffic Sign Detection and Boundary Estimation Using Convolutional Neural Network – Hee Seok Lee, Kang Kim.
Methodology: It simultaneously estimates the location and precise boundary of traffic signs using a convolutional neural network (CNN).
Advantage: They proposed an efficient traffic sign detection method.
Disadvantage: Since the latest architecture for object detection has not been used, the accuracy is lower.

3. 2018 – Real-time foreground–background segmentation using codebook model – David Harwood, Larry S. Davis.
Methodology: It can deal with scenes containing moving backgrounds or illumination variations, and it achieves robust detection for various types of videos.
Advantage: Interesting foreground objects can be detected even when mixed with other stationary objects.
Disadvantage: Backgrounds having fast variations are not easily modelled.

4. 2018 – Vehicle Colour Recognition with Spatial Pyramid Deep Learning – Chuanping Hu, Xiang Bai.
Methodology: Colour, as a prominent and stable attribute of vehicles, can serve as a useful and reliable cue in a variety of applications in intelligent transportation systems.
Advantage: The use of spatial information further improves recognition accuracy.
Disadvantage: It might make mistakes or give wrong predictions in certain cases.

5. 2017 – An Approach to Traffic Flow Detection Improvements of Non-Contact Microwave Radar Detectors – Tan-Jan Ho.
Methodology: Fast Fourier transforms (FFTs) and prescribed smart decisions are used to enhance traffic detection of non-contact microwave radars (MRs).
Advantage: They presented a plausible approach to remarkable improvements in the traffic flow detection of an MR system.
Disadvantage: Coping with a missing vehicle signal within a given lane due to obstruction by a large-sized bus in the adjacent lane.

6. 2017 – Image-Based Learning to Measure Traffic Density Using a Deep Convolutional Neural Network – Jiyong Chung, Keemin Sohn.
Methodology: A deep convolutional neural network was devised to count the number of vehicles on a road segment based on video images.
Advantage: Using filters reduces the number of weight parameters to be estimated.
Disadvantage: It is difficult to account for how a CNN counts the number of vehicles exactly.

7. 2017 – Image Classification using Convolutional Neural Networks – Ksenia Soorkina.
Methodology: It is concerned with the automatic extraction, analysis and understanding of useful information from images.
Advantage: Image classification extracts information from an image by labelling the pixels of the image.
Disadvantage: This system uses the training model SSD_INCEPTION_V2_COCO.

8. 2017 – Real Time Object Detection, Tracking, and Distance and Motion Estimation based on Deep Learning: Application to Smart Mobility – Aya Hassouneh, A. M. Mutawa.
Methodology: It uses an object detection and tracking system for smart mobility applications.
Advantage: The model is trained on a stereo image dataset but its inference uses monocular images.
Disadvantage: If there are moving objects, detection of the object becomes troublesome.

9. 2016 – Using Bluetooth and Sensor Networks for Intelligent Transportation Systems – Hemjit Sawant, Jindong Tan, Qingyan Yang, QiZhi Wang.
Methodology: The safety of road travel can be increased if vehicles can be made to form groups for sharing information among themselves.
Advantage: This increases the overall sensing ability of the sensors mounted on the vehicle.
Disadvantage: Issues such as the potential communications overhead.

10. 2015 – Using GPS to Measure Traffic System Performance – Glen M. D'Este, Rocco Zito, Michael A. P. Taylor.
Methodology: GPS-based congestion measures are incorporated into an ITS framework, with procedures for implementing congestion monitoring and suggestions for the urban road system.
Advantage: GPS equipment can be transferred quickly and easily from one vehicle to another.
Disadvantage: There is often a problem with communications.
CHAPTER 3

PREAMBLE
3.1 EXISTING SYSTEM

The word "Radar" is an acronym for Radio Detection and Ranging. The police use this
device to detect the speed of a vehicle, and it must be placed at a certain distance from the
road. This requires manpower to aim at vehicles, and it must be deployed on most highways,
which is expensive.
3.1.1 Disadvantages of the Existing System
 Most existing techniques rely on manual processing of the trajectories of vehicles captured by
video cameras, and they are both labour intensive and inaccurate.
 RADAR has shorter range.
 It cannot distinguish or resolve multiple targets.

3.2 PROPOSED SYSTEM

In the proposed system, we make use of a camera instead of Radar. The camera
captures the vehicle, detects its speed, classifies the type of vehicle and also keeps a count
of the number of vehicles that pass the camera. This does not require much manpower and
generates valid proof of over-speeding.
 Cost effective: The main objective of developing the algorithm for a real-time system
is to make it cost effective. It is necessary to design a system which is affordable
and includes cost-effective components.
 Fast: The main objective of this project is to develop an algorithm which is extremely
fast compared to the existing ones.
 Accuracy: The main objective of this project is to develop an algorithm which is
more accurate compared to the existing ones.
3.2.1 Features of Proposed System
 Our proposed pipeline combines object detection and multiple object tracking to count and
classify vehicles from video captured.
 Our proposed pipeline also includes a visual classifier module.
 A deep learning method is proposed to deal with the problem of vehicle colourrecognition.

Fig 3.1: Features of proposed system


3.3 METHODOLOGY

Workflow Diagram

Fig. 3.2 Workflow Diagram

 Apply object detection to detect vehicles in a video stream.


 Compute the bounding box coordinates of all detected vehicles.
 Compute the Euclidean distance between the new bounding boxes and the existing
bounding box of each vehicle (a minimal tracking sketch is given below).
 Update the bounding box coordinates of each existing vehicle.
 Compute the speed of each vehicle using distance/time.
 If any new vehicle is detected, register it and assign an ID.
 By keeping track of the vehicle ID, extract the new bounding box coordinates
and update them.
 When a vehicle crosses the threshold FOV (field of view) of the camera, de-register
the vehicle ID.
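
The registration, matching and de-registration steps above amount to a simple centroid tracker. The following sketch is a minimal illustration of that idea, assuming each bounding box is given as an (x1, y1, x2, y2) tuple; the class name, the 50-pixel matching threshold and the single-miss de-registration rule are illustrative assumptions rather than part of the project code. A full pipeline would normally wait for several consecutive missed frames before de-registering a vehicle.

import numpy as np

class CentroidTracker:
    # Minimal sketch: register new vehicles with an ID and match detections to
    # existing vehicles by the Euclidean distance between their centroids.
    def __init__(self, max_distance=50.0):
        self.next_id = 0
        self.vehicles = {}                 # vehicle ID -> last known centroid (x, y)
        self.max_distance = max_distance   # assumed matching threshold in pixels

    def update(self, boxes):
        centroids = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for (x1, y1, x2, y2) in boxes]
        unmatched = set(range(len(centroids)))
        # match every existing vehicle to its nearest new centroid
        for vid, old in list(self.vehicles.items()):
            if not unmatched:
                break
            j = min(unmatched, key=lambda k: np.hypot(centroids[k][0] - old[0],
                                                      centroids[k][1] - old[1]))
            if np.hypot(centroids[j][0] - old[0], centroids[j][1] - old[1]) <= self.max_distance:
                self.vehicles[vid] = centroids[j]   # update the existing vehicle
                unmatched.discard(j)
            else:
                del self.vehicles[vid]              # de-register: assumed out of the field of view
        # register every remaining detection as a new vehicle
        for j in unmatched:
            self.vehicles[self.next_id] = centroids[j]
            self.next_id += 1
        return self.vehicles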
CHAPTER 4

REQUIREMENT SPECIFICATION
The study of the requirement specification focuses specifically on the functioning of the
system. It allows the developer or analyst to understand the functions the system has to
carry out, the performance levels to be obtained, and the corresponding interfaces to be
established.

4.1 HARDWARE REQUIREMENTS

The software detects vehicles with the help of a camera, so the computer needs access to a
camera or to recorded video. Laptops are usually equipped with a camera, but desktop
computers sometimes are not, in which case an external camera needs to be installed and
connected to the computer.
 Processor : Intel I5 8th Gen
 Ram : 8GB
 Graphic Card : Nvidia GeForce GTX 960mx

4.2 SOFTWARE REQUIREMENTS


 Open CV

It is a library of programming functions mainly aimed at real-time computer vision.


Originally developed by Intel, it was later supported by Willow Garage and is now
maintained by Itseez. The library is cross-platform and free for use under the open-source
BSD license. OpenCV is written in C++ and its primary interface is in C++.
There are bindings in Python, Java and MATLAB/Octave.
Open CV application areas include:

 2D and 3D features toolkits

 Facial recognition system

 Gesture Recognition

 Human Computer Interaction

 Mobile Robot

 Segmentation and Recognition

 Augmented reality and Motion Tracking

 Python 3.5.0

It is an interpreted, high-level programming language for general-purpose programming.


Python has a design philosophy that emphasizes code readability and a syntax that allows
programmers to express concepts in fewer lines of code, notably using significant
whitespace. It provides constructs that enable clear programming on both small and large
scales. Python features a dynamic type system and automatic memory management. It
supports multiple programming paradigms, including object-oriented, imperative, functional
and procedural, and has a large and comprehensive standard library. Python interpreters are
available for many operating systems. Most Python implementations include a
read-eval-print loop, permitting them to function as a command line interpreter in which the
user enters statements sequentially and receives results immediately.

Some things that Python is often used for are:

 Web development.
 Scientific Programming
 Desktop GUIs
 Network Programming.
 Game Programming.
CHAPTER 5

SYSTEM DESIGN

Design is the first step in the development phase for any engineering product or system.
It may be defined as the process of applying various techniques and principles for the
purpose of defining a device, a process or a system in sufficient detail to permit its physical
realization. Design is a meaningful representation of something that is to be built. Software
design is an iterative process through which the requirements are translated into a
“blueprint” for constructing the software.
When designing the system, the points to be taken are:
 Identifying the data to be stored
 Identifying the user requirements
 Need to maintain data and retrieve it whenever wanted
 Identifying of inputs and arriving at the user define output
 System specification
 Security specification
 View of future implementation of the projects
A system architecture or systems architecture is the conceptual model that defines the
structure, behaviour, and more views of a system. An architecture description is a formal
description and representation of a system, organized in a way that supports reasoning about
the structures and behaviours of the system.
The system architecture for estimating vehicle speed from video data consists of nine
processes. Each process performs a particular task whose result is used by the next
process until the estimated speed is calculated. The block diagram of this system is given in
Fig. 5.1.
5.1 System Architecture

Fig 5.1: System architecture


5.2 Flow Diagram
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called event
diagrams, event scenarios, and timing diagrams.

Fig 5.2 Flow Diagram

 Apply object detection to detect vehicles in a video stream.


 Compute the bounding box coordinates of all detected vehicles.
 Compute the Euclidean distance between the new bounding boxes and the existing
bounding box of each vehicle.
 Update the bounding box coordinates of each existing vehicle.
 Compute the speed of each vehicle using distance/time.
 If any new vehicle is detected, register it and assign an ID.
 By keeping track of the vehicle ID, extract the new bounding box coordinates
and update them.
 When a vehicle crosses the threshold FOV (field of view) of the camera, de-register
the vehicle ID.

5.3 DNN Architecture

Fig 5.3 DNN Architecture

With the revival of DNNs, object detection has achieved significant advances in recent years.
Current top deep-network-based object detection frameworks can be divided into two
categories: the two-stage approach and the one-stage approach. In the two-stage approach, a
sparse set of candidate object boxes is first generated by selective search or a region
proposal network, and then these boxes are classified and regressed. In the one-stage
approach, the network directly generates dense samples over locations, scales and aspect
ratios, and these samples are classified and regressed at the same time. The main advantage
of the one-stage approach is real-time speed; however, its detection accuracy usually lags
behind the two-stage approach, and one of the main reasons is the class imbalance problem.
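
Since the report uses YOLOv3 as its one-stage detector for real-time detection, the sketch below shows how such a detector could be loaded and run with OpenCV's DNN module. It is only an illustrative sketch: the configuration and weights file names are placeholders, and the 0.5 confidence threshold is an assumption.

import cv2
import numpy as np

# load a pretrained YOLOv3 network (file names are placeholders)
net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')
layer_names = net.getUnconnectedOutLayersNames()

def detect(frame, conf_threshold=0.5):
    # YOLOv3 expects a square, normalised blob (416x416, values scaled to [0, 1])
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)
    h, w = frame.shape[:2]
    boxes = []
    for output in outputs:
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                # det[0:4] holds centre x, centre y, width and height relative to the frame
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh),
                              class_id, confidence))
    return boxes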
5.4 Use Case Diagram
A use case diagram is a behavioural UML diagram type that is frequently used to analyse
various systems. It enables you to visualise the different types of roles in a system and how
those roles interact with the system.

Fig 5.4: Use Case Diagram


Chapter 6

SYSTEM IMPLEMENTATION
6.1 SPEED DETECTION METHOD
 Initially, each video frame is fed as input to the Cascade classifier.
 Once the ROI is calculated, it is sent to the neural network.
 Then the speed of the vehicle is computed using the EUCLIDEAN DISTANCE.
 The function estimateSpeed takes two parameters, "location1" and "location2".
 Speed is calculated by multiplying the distance travelled in metres per frame by the
frames per second (giving metres per second) and by 3.6, the factor that converts metres per
second into kilometres per hour.

PSEUDO CODE FOR SPEED ESTIMATION USING EUCLIDEAN DISTANCE:


import math

def estimateSpeed(location1, location2):
    # Euclidean distance (in pixels) travelled between two consecutive frames
    d_pixels = math.sqrt(math.pow(location2[0] - location1[0], 2) +
                         math.pow(location2[1] - location1[1], 2))
    ppm = 8.8                      # pixels per metre (depends on the camera setup)
    d_meters = d_pixels / ppm
    fps = 18                       # frames per second of the video
    speed = d_meters * fps * 3.6   # m/s converted to km/h
    return speed
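
The sketch below shows how estimateSpeed could be applied frame by frame to the tracked centre point of each vehicle. The carLocations and carSpeeds dictionaries, the vehicle ID and the example coordinates are illustrative assumptions, not part of the original code.

# remember the last known position of each vehicle (keyed by its tracker ID)
# and estimate its speed whenever a new position arrives
carLocations = {}   # carID -> (x, y) position in the previous frame
carSpeeds = {}      # carID -> latest speed estimate in km/h

def updateVehicle(carID, position):
    if carID in carLocations:
        carSpeeds[carID] = estimateSpeed(carLocations[carID], position)
    carLocations[carID] = position

# e.g. vehicle 3 moves from (120, 340) to (138, 352) between consecutive frames
updateVehicle(3, (120, 340))
updateVehicle(3, (138, 352))
print(carSpeeds[3])   # speed in km/h, given ppm = 8.8 and fps = 18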

6.2 GENERATING ROI


Region of interest (ROI) is an operation widely used in object detection tasks using
convolutional neural networks, for example to detect multiple cars and pedestrians in a
single image. Unlike image classification, where the system only has to label the dominant
object in an image, detection needs proposals for several objects. It is important to
remember that an RoI is NOT a bounding box. It might look like one, but it is just a
proposal for further processing.

PSEUDO CODE FOR GENERATING ROI


import cv2

# load the Haar cascade features XML and initialise the classifier
carCascade = cv2.CascadeClassifier('/content/drive/MyDrive/myhaar.xml')
# convert the BGR image to a grayscale image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# detect ROIs using the cascade classifier
cars = carCascade.detectMultiScale(gray, 1.1, 13, 18, (24, 24))
# add 30 and 40 extra pixels to each ROI for accuracy
for (x, y, w, h) in cars:
    roi = image[y:y + h + 30, x:x + w + 40]

6.3 DETECTING VEHICLES


 The output from the ROI stage is fed as input to the Deep Neural Network.
 In the Deep Neural Network, each vehicle is detected.
 Each vehicle detected in the image gets a bounding box and a confidence score.
 If the confidence is greater than 40%, the detection is accepted and the bounding box is stored.
 If the confidence is less than 40%, it is rejected.

PSEUDO CODE FOR VEHICLE DETECTION


# give the ROI as input to the DNN for detecting vehicles
blob = cv2.dnn.blobFromImage(roi, size=(300, 300), ddepth=cv2.CV_8U)
net.setInput(blob, scalefactor=1.0 / 127.5, mean=[127.5, 127.5, 127.5])
# run the network forward to obtain predictions
detections = net.forward()
# check the confidence of every detected vehicle
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    # if the confidence is greater than 40% the detection is accepted, else rejected
    if confidence > 0.4:
        print('accept')
    else:
        print('reject')
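
To turn the accepted detections into labelled bounding boxes for counting and classification, the network output can be decoded as sketched below. This continues the snippet above and assumes an SSD-style output of shape (1, 1, N, 7), whose rows hold [image_id, class_id, confidence, x_min, y_min, x_max, y_max] in normalised coordinates; the CLASSES list is illustrative and depends on the model actually used.

CLASSES = ['background', 'bicycle', 'bus', 'car', 'motorbike', 'person']  # illustrative labels

(h, w) = roi.shape[:2]
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.4:
        class_id = int(detections[0, 0, i, 1])
        # scale the normalised corner coordinates back to pixel units
        x1, y1, x2, y2 = [int(v) for v in detections[0, 0, i, 3:7] * [w, h, w, h]]
        label = CLASSES[class_id] if class_id < len(CLASSES) else str(class_id)
        cv2.rectangle(roi, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(roi, label, (x1, max(y1 - 5, 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)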

6.4 Feasibility Study


The feasibility of the project is analyzed in this phase and business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis the
feasibility study of the proposed system is to be carried out. This is to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
 Economical feasibility
 Technical feasibility
 Social feasibility

6.5 Economical Feasibility


This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditure must be justified. The developed
system is well within the budget, and this was achieved because most of the technologies
used are freely available. Only the customized products had to be purchased.

6.6 Technical Feasibility


This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the
available technical resources, as this would lead to high demands being placed on the client.
The developed system must have modest requirements, as only minimal or no changes are
required for implementing this system.

6.7 Social Feasibility


This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity. The level of acceptance
by the users depends solely on the methods that are employed to educate the user about the
system and to make the user familiar with it. The user's level of confidence must be raised
so that the user is also able to offer constructive criticism, which is welcomed, since the user
is the final user of the system.
CHAPTER 7

SYSTEM TESTING

The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of test, and each
test type addresses a specific testing requirement.

7.1 Types of Tests


7.1.1 Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the application.
It is done after the completion of an individual unit, before integration. This is structural testing that
relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component
level and test a specific business process, application, and/or system configuration. Unit tests ensure
that each unique path of a business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.

7.1.2 Integration Testing


Integration tests are designed to test integrated software components to determine if they actually
run as one program. Testing is event driven and is more concerned with the basic outcome of screens
or fields. Integration tests demonstrate that although the components were individually satisfactory,
as shown by successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the combination of
components.

7.2 System Testing


System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.

7.3 White Box Testing


White box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to test
areas that cannot be reached from a black box level.
7.4 Black Box Testing
Black box testing is testing the software without any knowledge of the inner workings, structure
or language of the module being tested. Black box tests, like most other kinds of tests, must be
written from a definitive source document, such as a specification or requirements document. It is
testing in which the software under test is treated as a black box: you cannot "see" into it. The test
provides inputs and responds to outputs without considering how the software works.
7.5 Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the software life
cycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases.
Test strategy and approach: field testing will be performed manually.
 All field entries must work properly.
 The entry screen, messages and responses must not be delayed.

7.6 Integration Testing


Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
 The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.
Chapter 8

RESULTS AND DISCUSSIONS


As described earlier, the input video is extracted into frames and each frame is first
pre-processed. Pre-processing is used to minimise solid shadows that could be detected as
objects in the background subtraction process; contrast and brightness adjustment are used for
this. The image resulting from the shadow removal process can contain many gaps inside an
object, which can split one object into separate objects. Region of Interest (ROI) selection for
this camera position, together with the Euclidean distance used to calculate the distance
travelled by an object, needs to be checked carefully: viewed from a camera above the
highway, an object appears to move faster the nearer it comes to the camera. With this
perspective in mind, we divide the captured video into three regions, a slow, a medium and a
fast region, marked in red, green and yellow respectively, as shown in Fig. 8.1. To determine
the best ROI region for each camera angle, we use three recordings of the same speed taken
at three different camera angles.

Fig 8.1 : ROI Regions


SNAPSHOTS

Fig 8.2 Snapshots

The above figure shows the details of the detected cars and their speeds, together with the
count of the number of vehicles detected.
CHAPTER 9

CONCLUSIONS AND FUTURE ENHANCEMENT

9.1 CONCLUSION

The proposed system requires less computation and memory, and it stores the vehicle
parameters and the estimated speeds of the detected vehicles in a database. Detection and
tracking of the vehicles use parameters such as the position, height and width of each vehicle
instead of feature extraction; hence the proposed system can be adopted easily in an existing
traffic management system.

9.2 FUTURE ENHANCEMENT

Every application has its own merits and demerits. The project has covered almost all the
requirements. Further requirements and improvements can easily be done since the coding is
mainly structured or modular in nature. Changing the existing modules or adding new
modules can append improvements. Further enhancements can be made to the application,
such that the functions are more accurate and efficient than the present one.
REFERENCES
[1] Chuanping Hu, Xiang Bai, Li Qi, Pan Chen, Gengjian Xue, and Lin Mei, "Vehicle Color
Recognition With Spatial Pyramid Deep Learning", 2016.
[2] Glen M. D'Este, Rocco Zito and Michael A. P. Taylor, "Using GPS to Measure Traffic
System Performance", 2015.
[3] Hee Seok Lee and Kang Kim, "Simultaneous Traffic Sign Detection and Boundary
Estimation Using Convolutional Neural Network", 2018.
[4] Chris Stauffer and W.E.L. Grimson, The Artificial Intelligence Laboratory, MIT,
Cambridge, "Adaptive Background Mixture Models for Real-Time Tracking".
[5] Piotr Burnos, "Alternative Automatic Vehicle Classification Method", 2010.
[6] Hemjit Sawant, Jindong Tan, Qingyan Yang and QiZhi Wang, "Using Bluetooth and
Sensor Networks for Intelligent Transportation Systems", 2004.
[7] Jiyong Chung and Keemin Sohn, "Image-Based Learning to Measure Traffic Density
Using a Deep Convolutional Neural Network", 2017.
[8] Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Evaluating Color
Descriptors for Object and Scene Recognition", 2010.
APPENDIX
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

A Vision-Based Pipeline for Vehicle Counting, Speed Estimation, and Classification

Chenghuan Liu, Du Q. Huynh, Senior Member, IEEE, Yuchao Sun, Mark Reynolds, Member, IEEE, and Steve Atkinson

Abstract— Cameras have been widely used in traffic operations. While many technologically smart camera solutions in the market can be integrated into Intelligent Transport Systems (ITS) for automated detection, monitoring and data generation, many Network Operations (a.k.a Traffic Control) Centres still use legacy camera systems as manual surveillance devices. In this paper, we demonstrate effective use of these older assets by applying computer vision techniques to extract traffic data from videos captured by legacy cameras. In our proposed vision-based pipeline, we adopt recent state-of-the-art object detectors and transfer-learning to detect vehicles, pedestrians, and cyclists from monocular videos. By weakly calibrating the camera, we demonstrate a novel application of the image-to-world homography which gives our monocular vision system the efficacy of counting vehicles by lane and estimating vehicle length and speed in real-world units. Our pipeline also includes a module which combines a convolutional neural network (CNN) classifier with projective geometry information to classify vehicles. We have tested it on videos captured at several sites with different traffic flow conditions and compared the results with the data collected by piezoelectric sensors. Our experimental results show that the proposed pipeline can process 60 frames per second for pre-recorded videos and yield high-quality metadata for further traffic analysis.

Index Terms— Intelligent transportation systems (ITS), machine learning, object detection, traffic image analysis, object tracking, camera calibration.

Manuscript received May 31, 2019; revised October 1, 2019, December 19, 2019, and April 23, 2020; accepted June 15, 2020. This work was supported by Main Roads Western Australia. Chenghuan Liu was supported by a Scholarship from UWA and in part by the China Scholarship Council. The Associate Editor for this article was J. M. Alvarez. (Corresponding author: Chenghuan Liu.) Chenghuan Liu, Du Q. Huynh, and Mark Reynolds are with the Department of Computer Science and Software Engineering, The University of Western Australia, Perth, WA 6009, Australia. Yuchao Sun is with the Planning and Transport Research Centre (PATREC), The University of Western Australia, Perth, WA 6009, Australia. Steve Atkinson is with Strategy and Communications, Main Roads Western Australia, Perth, WA 6004, Australia. Digital Object Identifier 10.1109/TITS.2020.3004066

I. INTRODUCTION

INTELLIGENT transportation systems have become increasingly important in many modern cities as governments rely more on them to enable smart decision making (for both agencies and individual users) [1], and better utilisation of the existing infrastructure. Many hardware solutions are available today for traffic data collection, including pneumatic tube counters, piezoelectric sensors, induction loops, microwave radar [2], bluetooth detectors [3], and the Global Positioning System (GPS) [4]. However, every technology has its limitations. For example, compared to most other technologies, counters that are paired with pneumatic tubes and piezoelectric sensors can provide richer information, such as vehicle volume, speed, and classification; however, they only provide point measurements on specific spots so their spatial coverage is limited. Furthermore, although they are generally accurate, their error rates can increase under certain circumstances, e.g., in stop-and-start traffic flow, one slow-moving or stationary vehicle could stay on top of a set of sensors for a prolonged period, making the pairing of axles problematic [5]. Inductive loops are prevalent in modern urban traffic control and monitoring systems. However, similar to tube or piezoelectric counters, they only offer point measurements and their accuracy tends to suffer under high-density traffic flows when the electromagnetic field of a vehicle overlaps with that from its leading and/or trailing vehicles [5], [6]. Most subsurface sensors also have high installation and maintenance costs. An alternative technology is the GPS logs. While, theoretically, GPS logs can offer accurate data over more measurement points, they tend to have insufficient sample sizes due to the limited number of probe vehicles [7]. Bluetooth detectors, on the other hand, normally have larger sample sizes, but their technological nature such as large detection ranges limit their precision. Since there is no perfect traffic data collection solution, it is necessary to fuse different sources to achieve synergy [8].

The fast-growing camera networks in modern cities and the rapid advances in algorithms and computing power make computer vision an increasingly popular part of traffic technologies. Applications of computer vision techniques in traffic analysis include vehicle density measure [9], traffic sign detection [10], vehicle colour recognition [11], etc. Compared to underground sensors, cameras require minimal installation and are non-intrusive. Sophisticated off-the-shelf solutions are offered by commercial suppliers today. Some of these solutions combine cameras with other types of sensors such as infrared, but they often come with high price tags. Given that many agencies already have legacy camera networks for manual traffic surveillance, developing computer vision software solutions that utilise these assets is a more cost-effective option. By using centralised computer vision systems on the existing camera networks, such as the proposed method described in this paper, new data can be obtained from legacy assets. In this way, thousands of previously deployed cameras can be used to end-of-life, reducing replacement costs.

Fig. 1. Example images from the Narrows Bridge North site (left) on Kwinana Freeway before the afternoon peak hours of a Thursday and the Hutton Street site (right) on Mitchell Freeway on a Friday morning rush hour period.

Additionally, implementing such a system reduces reliance on camera vendors, so consistent data can be obtained regardless of the variety of cameras used to collect the data.

Perth, the capital city of Western Australia, is one of the most car-dependent cities in the world. The Perth Metropolitan Area covers an area larger than London, but has a population of merely 2 million. Over the years, Main Roads Western Australia (MRWA), the Government road agency responsible for managing roads of the State, has installed closed-circuit television cameras (CCTV) throughout the metropolitan area as part of its Intelligent Transport System. In particular, there are many cameras mounted above different segments of its two major freeways, Mitchell Freeway and Kwinana Freeway, which form the spine of this linear city, connecting the Northern and Southern suburbs with the CBD in the middle. These cameras are set up in a way that no more than one camera views the same segment of the road. This means that it is not possible to carry out 3D analysis using stereo vision techniques. Some of these cameras are fixed cameras while others are of pan-tilt-zoom type. All of them can be remotely controlled from the Network Operations Centre. This camera network is a valuable asset of the state and offers important traffic information of the city.

In this paper, we present a computer vision pipeline for vehicle classification, counting, and speed estimation from traffic videos captured by these cameras. Our system can reach a counting accuracy as high as 98% for pedestrians and cyclists on the shared footpath with respect to their directions of travel. Figure 1 shows two example traffic scenes that our system deals with in this paper.

One of the important components of the pipeline is the detection of moving vehicles. A common approach adopted in the literature on vehicle detection is to perform background subtraction [12], [13] and then identify vehicles as the foreground pixel blobs. The vehicle counting problem is thus turned into a pixel blob counting exercise. This way of vehicle counting works well only for simple scenes where the vehicles are far away from each other so that pixel blobs are not connected together due to partial occlusion or due to long shadows cast by the late afternoon sun. Using background subtraction, it is also unclear how to track the irregular foreground pixel blobs for vehicle speed estimation. Recently, object detection algorithms based on deep learning have achieved impressive detection accuracy [14]–[22]. These algorithms read in the whole video frame and output the bounding boxes that enclose the target objects (e.g., vehicles). Compared to background subtraction techniques, deep learning based object detectors are more robust to illumination variation, shadows, and partial occlusion in the video.

Our research contributions are summarized below:
• Our proposed pipeline combines object detection and multiple object tracking to count and classify vehicles from video captured by a single camera at each site. We employ state-of-the-art object detectors rather than adaptively modelling the background so that the vehicle detection module is robust against illumination change and the effect of shadows. By weakly calibrating the camera, a novelty of our method is using the 3×3 image-to-world homography to warp the bounding box that encloses each detected vehicle onto the ground plane to yield real-world measurements. This gives our pipeline the efficacy of counting vehicles by lane and classifying vehicles based on their lengths in metres. Furthermore, using the same homography, our monocular vision pipeline can estimate the speeds of vehicles in kilometres per hour. To the best of our knowledge, this is the first time where object detection and projective geometry are combined to yield 3D measurement from monocular videos of traffic scenes.
• Our proposed pipeline also includes a visual classifier module that can be combined through a voting scheme with the vehicle length obtained from projective geometry to further improve the vehicle classification results. Further modules constituting our pipeline are pedestrian and cyclist counting based on their direction of travel in a zoom-in view. All of the metadata produced by the modules in the pipeline are automatically saved to a spreadsheet to facilitate traffic analysis.

We have empirically shown that a clever application of our computer vision based technology could dramatically boost the utilisation of legacy assets and achieve counting accuracy that is on a par with dedicated traffic counting devices. Furthermore, our solution can also classify vehicles to a reasonable degree of success. The estimation of vehicle speed in km/h using the image plane to ground plane homography computed from the weakly calibrated camera is also a justified contribution to the research in intelligent transportation systems.

The rest of the paper is structured as follows. Section II presents some related work. The proposed tracking algorithm is elaborated in Section III. Section IV details our experimental results and Section V outlines our conclusion and future work.

II. RELATED WORK

Our proposed pipeline for traffic video analysis covers research topics in object detection, object classification, multiple object tracking, and traffic surveillance. Related work in the literature for these topics is briefly reviewed in this section.

A. Object Detection

Currently the area of object detection is dominated by two main approaches: one stage detection and two stage detection.
One stage detectors, such as YOLO [14] and SSD [17], treat object detection as a regression problem where the object classes and the bounding box coordinates are predicted directly. On the other hand, two stage detectors (e.g., R-CNN) include two stages. The first stage is to generate many region proposals using a search method (e.g., using a selective method [23] or a region proposal network (RPN) [20]) and the second stage is to pass these region proposals for classification and bounding box regression. Compared to the one stage detectors, two stage detectors usually achieve better detection rates but are also slower due to the number of steps involved. Proposed in 2016, YOLO [14] is a one stage object detector that achieves a frame rate of 45 on a Titan X GPU. YOLO has 24 convolutional layers and 2 fully connected layers. A faster version of YOLO has 9 convolutional layers instead of 24. The final output prediction of the network is a 7×7×30 tensor. A later version of YOLO, known as YOLOv2 (also referred to as YOLO9000) [15], can process up to 67 frames per second depending on the video resolution. Like its two previous versions, YOLOv3 [16] is another fast object detector. It contains some incremental improvement to YOLOv2. Another one stage method is the Single Shot MultiBox Detector (SSD) method proposed by Liu et al. [17], where the object classes and bounding boxes are predicted together for a set of default anchor boxes.

Girshick et al. [19] propose a two stage detector known as R-CNN, where convolutional neural network (CNN) features are computed for each region proposal. The feature vectors are scored using the SVM trained for each object class. The final scored regions then go through a non-maximum suppression step to reject overlapping regions. Fast R-CNN [18], a later method from the same group of authors, extends R-CNN by selecting the regions-of-interest (ROIs) from feature maps computed from convolution. It is shown in the paper that their method improves the speed as well as accuracy. Faster R-CNN [20] further improves Fast R-CNN with an RPN and can run at a speed of 5 frames per second. However, this frame rate is still quite low compared to YOLOv2 [15] mentioned above. By using the so-called focal loss function instead of the traditional cross entropy loss function, Lin et al. [21] report that RetinaNet can better handle the foreground-background class imbalance issue. In the same year, the same group of authors also propose the Mask R-CNN detector [22] based on an extension of Faster R-CNN, giving similar detection performance and run-time as the RetinaNet.

B. Multiple Object Tracking

Tracking by detection is a widely used approach in multiple object tracking (MOT). After putting bounding boxes on the detected targets in each video frame, tracking by detection techniques formulate the MOT task as a process of associating the detected bounding boxes between successive video frames. The task in MOT can therefore be considered as a data association problem [24], [25]: estimating the correct assignment events between the targets found in the previous frames and the detections in the current frame. Unlike single object tracking, MOT techniques need to be robust in dealing with the identity switch problem when the objects to be tracked have similar appearances. Perera et al. [26] tackle this issue by evaluating all the possible hypotheses in the trajectory splitting and merging step. Their tracking algorithm is applied to traffic scenes and they use the Stauffer-Grimson background modelling algorithm to detect moving objects. Different from the work above, Huang et al. [27] propose to associate detected results globally in a three-level framework. However, their algorithm can only work offline, which means that the whole video sequence must be read in advance. Other MOT papers targeting at handling interaction between tracked objects [28], camera motion [29], online tracking [30], and incorporating different motion models for the objects [31], [32] have also been reported.

C. Object Classification

Object classification is perhaps the computer vision research area that has received the most attention from researchers. Related to this research area is the creation of many large benchmark datasets, such as ImageNet [33], VOC [34], and COCO [35], making it possible to train complex deep learning methods for object classification. Typical deep network architectures for object classification include the classical LeNet-5 [36], AlexNet [37], GoogLeNet [38], VGGNet [39], and ResNet [40]. These architectures are famous for their outstanding performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The backbone of these architectures are convolutional layers, pooling layers, and suitable activation functions. Although these architectures have been used in classifying objects of different types (e.g., vehicles, people, bicycles, etc), more recently they have been adapted for classifying different classes of objects of the same types (e.g., different species of birds [41]). Most of these fine grain visual classification techniques involve localizing different parts or modelling the subtle differences of object parts [42]–[45]. For fine grain object classification, our approach differs from the methods reviewed above in that we classify vehicles based on their length and appearance (see Section III-D).

D. Traffic Surveillance Systems

Earlier work focuses on exploring handcrafted features in traffic scenes, e.g., the bag-of-words descriptors have been applied to detect pedestrians and bicyclists for counting [46], and a wheel contour extraction method [47] has been used for traffic accident analysis. Intelligent traffic systems based on surveillance cameras have made significant progress recently. The Faster R-CNN object detector and MobileNets [48] have been combined for traffic sign recognition [49]. To deal with the problem of scale variation in vehicle detection, a scale-insensitive model is proposed in [50], where a context-aware ROI pooling layer is designed to extract feature maps for vehicles with small scales. They further use a multi-branch decision network to classify vehicles with a large variance of scales. To estimate the vehicle density in video images directly, a model based on deep CNN is designed in [9].
The problem of estimating the vehicle density from the input image is then formulated as a regression problem where the number of vehicles in video images is set as the regression target. Chen et al. [51] propose a framework to detect vehicles' turning signals, using the RPN of [20] to generate potential vehicle proposals. A classifier is further adopted to distinguish the lights of vehicles from street lights. In [11], a deep learning method is proposed to deal with the problem of vehicle colour recognition. The vehicle image is partitioned and then fed into a CNN for feature extraction. They further use the Support Vector Machine (SVM) instead of a softmax classifier to obtain the colour class of the input vehicle image.

III. PROPOSED METHOD

Figure 2 shows the processing modules of our proposed pipeline. Each of these processing modules is detailed in this section. As the main objects that our pipeline deals with are vehicles, apart from Sections III-B, III-C, and III-G, all other subsections are relevant for vehicles only. The metadata output by the four modules in the middle row of Fig. 2 is saved to a spreadsheet file for further traffic analysis.

Fig. 2. Block diagram showing the pipeline of processing modules.

A. Weak Camera Calibration

Prior to the core vehicle tracking and counting processes, a weak calibration process is carried out. If the camera is panned or tilted, this calibration process will need to be repeated. In the current version of our pipeline, we assume that the camera view remains unchanged in the entire video. It is not possible to fully calibrate the camera as there are not enough known 3D points in the scene that can be identified. Some of the sites have lamp posts but their exact heights above the ground are not known. So our calibration component targets at mapping/warping image points onto the ground plane only. Such a mapping is the well-known homography transformation in computer vision [52]: Given N ≥ 4 image points {x_i = (x_i, y_i, 1)^T | i = 1, ..., N} that are projections of points {x_i^r = (x_i^r, y_i^r, 1)^T | i = 1, ..., N} on the ground plane in the scene, these two sets of points are related by the following equation:

λ_i x_i^r = H x_i, for i = 1, ..., N,    (1)

where H ∈ R^{3×3} is the homography that needs to be recovered, and each λ_i is an unknown scalar that can be eliminated. Each pair of points from the two sets provides 2 equations. Since H is only recoverable up to an unknown scale, there are 8 unknowns in H. So N = 4 is the minimum number of point pairs required to compute H.

Dual to the homography of point pairs above, lines on the ground {A_i^r = (a_i^r, b_i^r, c_i^r) | i = 1, ..., N} and corresponding lines in the image {A_i = (a_i, b_i, c_i) | i = 1, ..., N} can also be used. Indeed, it is a lot easier to identify lines than points in our video. So we compute H^{-T} from the following instead:

γ_i A_i^r = H^{-T} A_i, for i = 1, ..., N,    (2)

where each γ_i is an unknown scalar that can be eliminated. Figure 3 shows the lines identified in an image of the Narrows Bridge North site and the corresponding lines on the ground from the overhead view of the region on Google Maps. Similar to the case for points, it is necessary that N ≥ 4 in order to recover H^{-T}. Once H^{-T} is known, it is straightforward to get H for the downstream process. Points measured in the image are in pixel units whereas points on the ground in the scene are in metres.

Fig. 3. Lines marked in an image of the Narrows Bridge North site and the corresponding lines on the ground from Google Maps.

The origin of the image coordinate system can be set at the default top-left corner of the image, whereas in the scene, the origin of the 2D ground plane coordinate system can be set at any point that is visible in the image. The mutually orthogonal x- and y-axes on the ground plane can run along the lanes and across the lanes for convenience. Both H and H^{-T} are defined in terms of measurements from these two coordinate systems.

The computed H^{-T} also allows the line equation of each white lane-mark to be computed on the ground in metres. For example, there are 5 lanes in the Narrows Bridge North site shown in Fig. 4. These lanes are separated by 4 lane-mark lines that can be manually identified in the video. Suppose that a lane-mark has the line equation ax + by + c = 0 in the image. In vector form, this can be written as A^T x = 0, where A = (a, b, c)^T and x = (x, y, 1)^T. The corresponding line equation on the ground is simply A^{rT} x^r = 0, where A^r ∼ H^{-T} A and x^r = (x^r, y^r, 1)^T denotes any point on the lane-mark on the ground plane. The symbol ∼ denotes equality up to a non-zero scale.

Fig. 4. Computation of lane-mark equations {A_i^r | i = 1, ..., 4} on the ground using H^{-T}. The coordinate system in the image is OXY; the coordinate system on the ground in the scene is O^r X^r Y^r. All entities defined in O^r X^r Y^r have the superscript r. The yellow bounding box returned by the vehicle detector is defined by the top-left corner (x1, y1) and bottom-right corner (x2, y2).
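To make the calibration step concrete, the following minimal Python sketch shows how the image-to-ground homography of Eq. (1) could be recovered with OpenCV from four manually picked correspondences; the pixel and metre coordinates below are hypothetical placeholders, and the matrix H^{-T} for warping line coordinates (Eq. (2)) follows directly from the same result.

import numpy as np
import cv2

# Hypothetical image points (pixels) and their ground-plane positions (metres),
# e.g. endpoints of lane-marks identified manually in a video frame and on Google Maps.
img_pts = np.array([[412, 220], [880, 235], [955, 640], [310, 610]], dtype=np.float32)
ground_pts = np.array([[0.0, 0.0], [3.5, 0.0], [3.5, 30.0], [0.0, 30.0]], dtype=np.float32)

# H maps homogeneous image points to ground-plane points (Eq. (1)), up to scale.
H, _ = cv2.findHomography(img_pts, ground_pts)

# The dual transformation for line coordinates (Eq. (2)).
H_lines = np.linalg.inv(H).T

def to_ground(H, uv):
    """Warp a single image point (u, v) onto the ground plane, returning metres."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]   # normalise the homogeneous coordinate

print(to_ground(H, (412, 220)))   # approximately (0.0, 0.0)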
B. Object Detection With Transfer Learning

Rather than reinventing the wheel and developing our own object detector, we evaluate a few of the well-known detectors discussed in Section II-A for our vehicle detection task through transfer learning. We test the three most popular object detectors, namely, SSD [17], YOLOv2 [15], and Faster R-CNN [20].
We find that SSD and YOLOv2 tend to miss small objects and their performance drops slightly when the traffic is heavy (some vehicles are partially occluded). In such scenarios, we would use Faster R-CNN. In most other scenarios, we use either YOLOv2 or SSD because of their fast detection speeds.

To adapt the three detectors for our proposed pipeline, we make the following modifications to the network architectures:
• For YOLOv2, the number of filters (C) for the final 1×1 convolutional layer is the only layer that needs to be changed for retraining. According to the YOLOv2 paper, C = (5 + K)B, where B denotes the number of anchor boxes, K denotes the number of classes, and the constant 5 is for storing the probability, the 2D location, width, and height of each bounding box. In our case we have the following classes: vehicle, motorcycle, pedestrian, and cyclist. So K = 4. We follow the YOLOv2 paper and set B = 5. The output is therefore C = (5 + 4) × 5 = 45, i.e., a 45-dimensional tensor.
• For SSD, 6 feature maps extracted from the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 layers are used for detection. While there are 6 anchor boxes defined for the Conv7, Conv8_2 and Conv9_2 layers, only 4 anchor boxes are defined for the other three layers. In SSD, the predictions for bounding boxes and for classes are separated. Thus, with B anchor boxes for each feature, the numbers of bounding boxes and classification confidences are 4B and (4 + 1)B (including the "background" class) respectively.
• Similarly, for Faster R-CNN, we set the output number of the final fully connected layer to 5 (including the "background" class).

For each detector, the network is retrained by fine-tuning the initial weights of the network that has been pretrained using ImageNet [33]. A dataset is composed by manually cropping 5,000 bounding boxes of vehicles, motorbikes, pedestrians, and cyclists from 1,150 video frames of different sites. In particular, we focus on cases such as very long vehicles, vehicles towing a trailer, caravans, etc, that none of the detectors was able to detect. Figure 5 shows a few example bounding boxes of our training set. Notice that since our training images are cropped from videos rather than from still high-resolution images, they include challenging issues, such as motion blur, in real applications.

We apportion the collected data described above into training and validation sets, using an 80-20 split, to ensure that there is no overlap between the two sets. In each epoch, all the three detectors are trained in accordance with the protocol proposed in the original papers [15], [17], [20] using the training set. The detectors are then tested on the validation set. Both the training loss and the validation loss are calculated using the loss functions in the original papers. The plot of the training and validation losses is shown in Figure 6. Thanks to the pretraining on ImageNet, the training processes of the three detectors all converged very quickly as the pretrained model weights only need to be finetuned slightly on our own dataset.
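As an illustration of this transfer-learning recipe (not the authors' exact training code), the sketch below adapts a detector pretrained on a large dataset to the four classes plus background by replacing only its prediction head; torchvision's Faster R-CNN implementation is assumed here purely as a readily available stand-in.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# 4 object classes (vehicle, motorcycle, pedestrian, cyclist) + 1 background class.
NUM_CLASSES = 5

# Start from a detector pretrained on a large dataset and replace only the box
# predictor head, then fine-tune on the manually cropped video-frame dataset.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                            lr=0.005, momentum=0.9, weight_decay=5e-4)

# Training then proceeds as usual: images and target dicts with 'boxes' and 'labels'
# taken from the annotated frames, split 80-20 into training and validation sets.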

C. Multiple Object Tracking and Tracklet Construction

For each bounding box detected in each video frame, based on the class ID (vehicle, motorcycle, pedestrian, and cyclist), the tracklet construction module uses the Kalman filter (KF) to predict the bounding box's state vector for the next video frame and uses data association for prediction-detection assignment.

Our simplified Kalman filter model has a fixed state transition matrix F and a fixed matrix H which maps the state space to the observation space. The prediction and update steps of our Kalman filter are:

Predict:
x̂_{k|k−1} = F x̂_{k−1|k−1}    (3)
P_{k|k−1} = F P_{k−1|k−1} F^T + Q_k    (4)

Update:
z̃_k = H x̂_{k|k−1}    (5)
ỹ_k = z_k − z̃_k    (6)
S_k = R_k + H P_{k|k−1} H^T    (7)
K_k = P_{k|k−1} H^T S_k^{−1}    (8)
P_{k|k} = (I − K_k H) P_{k|k−1} (I − K_k H)^T + K_k R_k K_k^T    (9)
x̂_{k|k} = x̂_{k|k−1} + K_k ỹ_k    (10)
where the subscript j|k denotes the estimation at frame j given the observations up to (and including) frame k. Our state vector x̂_{k|k} is an R^7 vector of the form (x_k, y_k, a_k, r_k, ẋ_k, ẏ_k, ȧ_k)^T, where (x_k, y_k), a_k, and r_k denote the centre, the area (width × height), and the aspect ratio (width/height) of the bounding box, respectively; and all the entities with an overhead dot denote the velocity terms. The process noise vector w_k ∈ R^7 and observation noise vector v_k ∈ R^4 (not shown in the equations above) are both assumed to follow zero-mean Gaussian distributions, i.e., w_k ∼ N(0, Q_k) and v_k ∼ N(0, R_k), where Q_k ∈ R^{7×7} (appears in Eq. (4)) and R_k ∈ R^{4×4} (Eqs. (7) and (9)) are two covariance matrices.

Fig. 5. Some example images used for training the object detectors.

Fig. 6. The training and validation loss plots for detectors SSD, YOLOv2 and Faster R-CNN.

The term P_{k|k} denotes the posterior error covariance matrix of the state vector estimate; z_k, z̃_k ∈ R^4 denote the actual and predicted observation vectors; ỹ_k is referred to as the innovation vector; and K_k is the optimal Kalman gain. As a constant velocity model for the state transition is adopted, F takes on the form given in Eq. (11) below:

F = [ I_{4×4}   B ; 0_{3×4}   I_{3×3} ] ∈ R^{7×7}, where B = [ I_{3×3} ; 0_3^T ] ∈ R^{4×3},    (11)

and H in Eqs. (5)-(9) is a standard projection matrix: H = [ I_{4×4}  0_{4×3} ] ∈ R^{4×7}, where 0_m denotes an R^m zero vector, and I_{n×n} and 0_{m×n} denote the n×n identity matrix and m×n zero matrix.

In the Kalman filter formulation above, the scale of each object is assumed to be able to vary but the change of the aspect ratio is negligible. Thus, there is an ȧ_k but not an ṙ_k term in the state vector. We find that this works better as it helps to constrain the shape of the object. It should be noted that the formulation above is for single object tracking only. The challenge in this module is: at each video frame, a number of bounding boxes are returned by the object detector. That is, for each tracklet constructed from the previous video frames using a Kalman filter, a suitable observation z_k must be selected from a list of bounding boxes detected in the current video frame. Rather than treating each tracklet individually, we tackle this as an assignment problem using the Hungarian algorithm, i.e., the assignment of z_k's to all the tracklets that have been constructed so far is done simultaneously by maximizing the global intersection-over-union (IoU) value [53]. Figure 7 illustrates an example where there are 2 tracklets constructed up to frame k−1 and there are 3 bounding boxes detected at frame k. The blue rectangles are the predicted observations z̃_k computed using Eqs. (3) and (5) from the Kalman filter. The obvious assignment that maximizes the IoU value is as shown in the figure where the two red and two blue bounding boxes have large overlapping regions. The red bounding box at the top-left corner is not assigned to any tracklet as the Hungarian algorithm imposes a 1-to-1 assignment. This bounding box may form a new tracklet or may be removed. Whichever case would take place depends on whether a bounding box near that region is detected at frame k + 1.
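A minimal sketch of this prediction-detection assignment step is given below, assuming boxes in (x1, y1, x2, y2) form and SciPy's Hungarian solver; the IoU threshold for accepting a match is an assumed parameter not specified above.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, min_iou=0.3):
    """Globally assign detections to tracklet predictions by maximising total IoU."""
    cost = np.zeros((len(predicted_boxes), len(detected_boxes)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detected_boxes):
            cost[i, j] = -iou(p, d)          # the Hungarian algorithm minimises cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= min_iou]
    unmatched_detections = set(range(len(detected_boxes))) - {j for _, j in matches}
    return matches, unmatched_detections     # unmatched detections may start new tracklets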
D. Vehicle Classification

Vehicles in Australia are traditionally classified based on axle configurations [54] (see also the web page [55]). However, due to the fact that the axles are not always visible because of camera angle and occlusion, a simplified system based on vehicle length is used for objects that are labelled as vehicle by the detector. Table I shows the classes of vehicles handled by our pipeline. Normal light vehicles fall into Class 1 and they are the majority of vehicles found on the road. Class 2 contains trailers, caravans, etc. Although they are towed by light vehicles, their length and vehicle dynamics (slow acceleration etc.) are distinct from Class 1 light vehicles.
Fig. 7. An example of the construction of two tracklets and a possible new tracklet. The 3 red rectangles are the detected bounding boxes for the current frame. The 2 blue rectangles are the predicted bounding boxes for the 2 existing tracklets using Eqs. (3) and (5).

Fig. 9. Perspective view of a vehicle's bounding box being warped onto the ground. Relative to the orthogonal coordinate system O^r X^r Y^r, the four M_i^r points form an arbitrary quadrilateral. The three virtual points P^r, Q^r, and R^r give the width and length of the vehicle.

TABLE I. VEHICLE CLASSES ACCORDING TO THE AUSTRALIAN VEHICLE CLASSIFICATION SYSTEM. IN OUR PIPELINE, THE LONG VEHICLES IN CLASSES 10 TO 12 ARE GROUPED TOGETHER TO FORM ONE LARGE CLASS, GIVING A TOTAL OF FOUR CLASSES.

Fig. 8. Some example vehicles and their Class IDs.

Buses and rigid trucks fall into Classes 3-5; articulated vehicles fall into Classes 6-9; road trains (medium to long trucks with trailers) fall into Classes 10-12. Figure 8 shows some example vehicles in each of these classes.

1) Vehicle Length Estimation Using Homography: Let {M_i | i = 1, ..., 4} be the image coordinates of the four corners of a vehicle's bounding box returned by a vehicle detector. From Fig. 4, one can see that, using the homography H, the points {M_i^r | i = 1, ..., 4},¹ which are the warped coordinates of the M_i's onto the ground plane defined by O^r X^r Y^r, form an arbitrary quadrilateral. For the vehicle inside the bounding box, of interest is the bottom (the wheels) of the vehicle that touches the ground. The three points P^r, Q^r, and R^r are virtual points on the ground at the bottom of the vehicle. With respect to O^r X^r Y^r, we have Q^r P^r ⊥ Q^r R^r, and Q^r P^r is approximately orthogonal to the direction of the closest lane-mark (denoted by A^r). As shown in Fig. 9, P^r, Q^r, R^r, and {M_i^r | i = 1, ..., 4} are all points defined on the ground plane. P^r Q^r and P^r R^r correspond to the width and length of the vehicle. Our goal is to define P^r, Q^r, and R^r in terms of the four known M_i^r points.

Without loss of generality, we drop the third component (which is just 1) of the homogeneous representation of all the points involved and work directly on 2D inhomogeneous coordinates. From Fig. 9, we have

P^r = α M_3^r + (1 − α) M_4^r    (12)
Q^r = β M_1^r + (1 − β) M_4^r    (13)
R^r = γ M_2^r + (1 − γ) M_3^r.    (14)

The two unknowns α and β (both are in the range 0..1) can be solved using the following two constraints:
• ‖P^r − Q^r‖ = w, where w is the width of the vehicle;
• P^r Q^r is orthogonal to A^r, where A^r is the line coordinates of the lane-mark that is closest to the vehicle.

Once α is known, P^r can be computed from Eq. (12). The parameter γ ∈ (0, 1) (see Fig. 9) can be computed in a similar manner using the constraint that P^r R^r is parallel to A^r. Finally, R^r is obtained from Eq. (14) and the vehicle's length P^r R^r is deduced.

Figure 9 shows the setup where the overhead camera views the scene from the right. For the case where the overhead camera views the scene from the left (like the Hutton Street site shown in Fig. 1), the positions of P^r and R^r would be reversed. However, the computation remains the same if one properly swaps the M_i^r terms.

¹ Each M_i^r is computed using Eq. (1): λ_i M_i^r = H M_i, which gives the homogeneous coordinates of M_i^r. It is then straightforward to normalize M_i^r so that its third component is 1.
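The following simplified Python sketch illustrates the idea of obtaining a metric length from a detected bounding box: the four corners are warped onto the ground plane with H and their spread along the lane direction is measured. It is a rough stand-in for the exact P^r, Q^r, R^r construction of Eqs. (12)-(14), with the lane direction assumed to be supplied as a 2-vector.

import numpy as np

def warp_to_ground(H, pts_uv):
    """Warp an array of image points (N, 2) onto the ground plane using Eq. (1)."""
    homog = np.hstack([pts_uv, np.ones((len(pts_uv), 1))])
    ground = (H @ homog.T).T
    return ground[:, :2] / ground[:, 2:3]

def approx_vehicle_length(H, box, lane_direction):
    """Rough length estimate: warp the 4 bounding-box corners M1..M4 to the ground
    and measure their spread along the lane direction (normalised inside)."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)
    ground = warp_to_ground(H, corners)
    direction = np.asarray(lane_direction, dtype=float)
    along_lane = ground @ (direction / np.linalg.norm(direction))
    return along_lane.max() - along_lane.min()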
The above computation requires the width w of the vehicle to be known. Most Class 1 vehicles have w ≈ 1.6 m; larger vehicles have w ≈ 2 m. We use w = 1.6 and w = 2 to obtain two length values for each vehicle in each video frame and then determine the consensus length value along the tracklet. The above computation also requires the vehicles to travel along the lanes. In the case where a vehicle is changing lanes, the assumption about P^r Q^r ⊥ A^r (and similarly P^r R^r ∥ A^r) would not hold. Again, so long as there are a sufficient number of video frames along the tracklet where the vehicle's direction of travel is parallel to A^r, the system would still be able to get a good estimate of the vehicle's length. It should also be noted that it is not crucial to get the exact length of each vehicle as our goal is to classify vehicles based on their lengths into the four classes shown in Table I.

2) Visual Classification Using a Convolutional Neural Network: Apart from using homography to estimate the lengths of vehicles, we also develop a visual classifier (Fig. 10) to further improve the accuracy of vehicle classification. In the training process of the visual classifier, when an object of class vehicle is detected and tracked to form a tracklet, we extract all the bounding boxes, manually label their classes, and pass them to a CNN model.

The CNN used in this sub-module is a ResNet18 that has been pretrained on the ImageNet 1000-class dataset [56]. The output layer of our classifier is a fully connected (FC) layer of 4 classes instead of the 1000 classes. Similar to the training of the detectors before, the collected data are also divided into training and validation sets using an 80-20 split. We use the standard softmax with cross entropy as the loss function for training. In each epoch, the error percentages (wrong predictions over all predictions) for both sets are calculated. The training and testing error plot for our visual classifier is shown in Figure 11. In the testing process (on a different video), each object that is labelled as vehicle by the object detector is then passed to the visual classifier to obtain its best Class ID.

Fig. 10. Network architecture of the visual classifier for classifying each vehicle object into one of the 4 classes shown in Table I.

Fig. 11. The training and validation error plot of the visual classifier.
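A minimal sketch of such a visual classifier, assuming PyTorch/torchvision and ImageNet-normalised vehicle crops, is shown below; only the final fully connected layer is replaced with a 4-class layer.

import torch
import torch.nn as nn
from torchvision import models

# ResNet18 pretrained on ImageNet, with the 1000-class FC layer replaced by a
# 4-class layer for the vehicle classes of Table I.
classifier = models.resnet18(weights="DEFAULT")
classifier.fc = nn.Linear(classifier.fc.in_features, 4)

criterion = nn.CrossEntropyLoss()   # softmax with cross entropy
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3, momentum=0.9)

def predict_class(crop_tensor):
    """crop_tensor: a (1, 3, 224, 224) normalised vehicle crop; returns class index 0-3."""
    classifier.eval()
    with torch.no_grad():
        return int(classifier(crop_tensor).argmax(dim=1))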
3) Voting for the Final Class ID: If the visual classification option is switched on in the pipeline, this sub-module would combine the outputs from both the vehicle length computation sub-module and the visual classification sub-module described above to yield the final classification result. This is done using a voting scheme outlined below. Suppose that the vehicle tracklet covers t video frames. Then it would have 2t Class IDs from the two sub-modules, giving four frequency count values for the four vehicle classes: {count(i) | i = 1, ..., 4}. The final Class c* assigned to the tracklet is the one that has the most number of votes, i.e., c* = Class î, where î = argmax_{i=1,...,4} count(i).
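The voting rule can be captured in a few lines; the sketch below simply pools the per-frame Class IDs from the two sub-modules and returns the majority class.

from collections import Counter

def final_class(length_based_ids, visual_ids):
    """Majority vote over the 2t per-frame Class IDs of a tracklet (Section III-D.3)."""
    votes = Counter(length_based_ids) + Counter(visual_ids)
    return votes.most_common(1)[0][0]

# Example: a 5-frame tracklet.
print(final_class([1, 1, 2, 1, 1], [1, 2, 1, 1, 2]))   # -> 1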
E. Speed Estimation

The speed estimation module is only applicable to objects that have been classified under the class vehicle. Once the Class (see Table I) of a vehicle object is determined from its tracklet, the speed estimation process is straightforward. In Fig. 9, M^r, the mid-point of the front part of the vehicle, can be computed from P^r Q^r in each video frame. As both P^r and Q^r are points on the ground plane in every frame of the tracklet, so is their mid-point M^r, which gives us the instantaneous location of the vehicle on the ground plane. The average speed of the vehicle over the whole tracklet is the total distance travelled by the vehicle divided by the length of the tracklet. Using the known frame rate of the video therefore allows the vehicle speed in kilometres per hour to be estimated.
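Assuming the per-frame mid-points M^r of a tracklet have already been warped onto the ground plane (in metres), the average speed in km/h reduces to a few lines:

import numpy as np

def average_speed_kmh(ground_positions, fps):
    """ground_positions: (t, 2) array of the per-frame mid-points M^r (metres) of a
    vehicle tracklet; fps: video frame rate. Returns the average speed in km/h."""
    steps = np.diff(ground_positions, axis=0)
    distance_m = np.linalg.norm(steps, axis=1).sum()
    duration_s = (len(ground_positions) - 1) / fps
    return (distance_m / duration_s) * 3.6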

F. Vehicle Counting by Lane

For each site, a reference line is specified as the counting line. When an object crosses the reference line, it would be counted. We adopt this convention so that vehicles are counted against the lane they occupy when they cross the reference line. For objects of class vehicle, the mid-point M^r (see the subsection above) computed at the frame just before the vehicle exits the reference line is substituted into all the line equations A^r. The vehicle is said to be in a particular lane when M^r gives opposite signs for the line equations of two consecutive lane-marks.

For an object of class motorbike, the bottom-left point of the detected bounding box, M_4^r, is used to indicate its position. Similarly, point M_4^r is substituted into all the line equations A^r to get the lane occupied by the motorbike. After an object exits the reference line, its whole tracklet is removed and the metadata of this particular vehicle is saved for further analysis.
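A sketch of this lane test, assuming the lane-mark line coordinates A^r are ordered across the carriageway, is given below; lanes not bounded by two marks on both sides would need the road edges as additional lines.

import numpy as np

def lane_of_point(point_xy, lane_marks):
    """lane_marks: list of line coordinates A^r = (a, b, c), ordered across the road,
    so that lane i lies between lane_marks[i] and lane_marks[i + 1].
    Returns the lane index of the ground point, or None if no bounding pair is found."""
    x, y = point_xy
    signs = [np.sign(a * x + b * y + c) for (a, b, c) in lane_marks]
    for i in range(len(signs) - 1):
        if signs[i] * signs[i + 1] < 0:   # opposite signs => point lies between the two marks
            return i
    return None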

G. Pedestrian/Cyclist Counting and Direction Determination

For pedestrian/cyclist counting, a reference line A_ref is defined and separates the video frame into two areas A and B. For each object of class pedestrian or cyclist, the bottom middle point of the detected bounding box, M_5 = (M_3 + M_4)/2, is used to indicate the position of the object. The area which the object occupies is obtained by inspecting the sign when M_5 is substituted into the equation of the reference line A_ref. The count of pedestrians/cyclists is then collected for the two directions A → B and B → A.
Fig. 12. The piezoelectric sensor (purple lines) on the Narrows Bridge South site. (a) Its location on Google Maps. (b) An example video frame captured by the camera monitoring that region. (c) Setup of the sensor.

IV. EXPERIMENTS

A. Experimental Settings

We processed the video clips from MRWA for three different sites along Kwinana Freeway and Mitchell Freeway in Western Australia: Narrows Bridge North, Narrows Bridge South, and Hutton Street. To evaluate the proposed method, ground truth data was collected for all three sites. For the Narrows Bridge South site, an existing piezoelectric sensor installed on the site (Fig. 12) was used to get vehicle classes (see Table I) and vehicle speeds for each of the 5 lanes. For the two Narrows Bridge North and Hutton Street sites, data from manual counting was collected and used as ground truth; however, these two sites have no ground truth for vehicle speeds. So far we have processed a total of 7 video clips from the three sites (3 videos for Narrows Bridge North, 2 for Narrows Bridge South, and 2 for Hutton Street). These videos are on average one hour long, with traffic flow varying from low (2000 vehicles per hour) to high (8000 vehicles per hour).

For the object detectors YOLOv2 [15], SSD [17] and Faster R-CNN [20], the weights of the neural nets are obtained by training them on our collected datasets as described in Section III-B. Other parameters such as number of anchors and input image size are selected following the authors' papers.

Recall that each vehicle tracklet is tracked independently by a Kalman filter. Depending on how busy the traffic is, the number of Kalman filters varies roughly from 0 to 20 in each frame. For each Kalman filter, individual components of the process noise and observation noise were assumed to be uncorrelated, so the covariance matrices Q_k and R_k, for all k, in Eqs. (4) and (9) were set to diagonal matrices, with

Q_k = diag(10^−1, 10^−1, 10^−1, 10^−1, 1, 1, 1),
R_k = diag(10^−1, 10^−1, 10^−1, 10^−1).

As ẋ_k, ẏ_k and ȧ_k were velocity terms estimated from the other terms in the state vector, they had larger uncertainties. The above settings of Q_k and R_k were found to work well for the level of noise of the Kalman filters in our pipeline. Again, individual terms of the state vector were assumed to have uncorrelated noise, so P_{0|0} was initialized to the same diagonal matrix as Q_k, which we found to work well in all our experiments.

The object tracking and tracklet construction module used a time threshold T_miss which denotes the minimum number of consecutive frames that a tracklet must have bounding boxes of some object associated to it. Tracklets having fewer than T_miss frames were considered to be outlying tracklets and were removed from further processing. In all the experiments reported in this paper, T_miss was set to 10.
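For reference, these settings amount to the following initialisation (a minimal NumPy sketch; T_MISS is included for completeness):

import numpy as np

# Diagonal process/observation noise covariances used by every per-vehicle Kalman
# filter, matching the values quoted above; P0 is initialised to the same diagonal as Q.
Q = np.diag([0.1, 0.1, 0.1, 0.1, 1.0, 1.0, 1.0])   # 7x7, larger noise on velocity terms
R = np.diag([0.1, 0.1, 0.1, 0.1])                  # 4x4 observation noise
P0 = Q.copy()
T_MISS = 10   # tracklets shorter than this many frames are discarded as outliers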

B. Runtime and Computational Complexity

The main computational work in our proposed model (Figure 2) is in the object detection module and the tracking module. The latter consists of Kalman filtering and the Hungarian algorithm for forming object tracklets. For the object detection module, the bottleneck of all the object detectors is the length of training time. However, the training step only needs to be performed once, prior to the testing (or deployment) stage. The computational complexities of all these object detectors cannot be measured in terms of the traditional big-O notation. Instead, they are measured in terms of the numbers of frames they can process at the testing stage. Our proposed pipeline was implemented in Python and tested on a desktop equipped with an Intel(R) Core(TM) i7-7700K 3.60 GHz CPU, 32 GB RAM and an NVIDIA Titan Xp GPU. Being a two stage detector, Faster R-CNN has an improved RPN compared to other two stage detectors and can run at 5 frames per second. Both SSD and YOLOv2 are one stage detectors; their frame rates are 37 and 60 respectively. We obtained the above frame rates by taking the average time taken for processing all the modules shown in the pipeline (Fig. 2) per frame over the entire video. Theoretically, the time complexity of the Kalman filter in the object tracking module is O(n^2.8 + m^2) [57], where n and m are the dimensions of the observation and state vectors. In our proposed pipeline, n = 4 and m = 7 (Section III-C). The complexity of the Hungarian algorithm for multiple vehicle tracking (building tracklets) is O(n^3) [58], where n is the number of detected vehicles (i.e., number of Kalman filters) in the video frame.

C. Detection Results of Transfer Learning

Some of the detection results from YOLOv2 are shown in Fig. 13. The detector is able to get tight and accurate bounding boxes of vehicles in different camera settings.

Fig. 13. Vehicle and pedestrian/cyclist detection results from videos from the (a)-(d) Narrows Bridge North site, (e)&(f) Hutton Street site, and (g)&(h) Narrows
Bridge South site. The object classes are indicated by the colours of their bounding boxes. The Classes 1, 2-5, and 6-9 vehicles have, respectively, blue, green,
and red bounding boxes while pedestrians and cyclists have dark blue and light green ones. White colour boxes mean that the detected objects have not yet been
classified and counted.

For example, although the camera on the Narrows Bridge South site was set up to view the front parts of vehicles and the cameras on the other two sites only saw the back parts of vehicles, the YOLOv2 detector was able to robustly detect vehicles at all the three sites. Overall, the detector is immune to rainy conditions (Fig. 13(a)), night time (Fig. 13(b)), and complicated illumination (Figs. 13(e) and 13(f)). When the traffic is busy and vehicles occlude each other (Fig. 13(h)), the detector is still robust enough to get accurate bounding boxes of vehicles. Fig. 13(c) shows a group of pedestrians correctly detected and Fig. 13(d) shows the detection result of a cyclist.

D. Quantitative Evaluation

1) Vehicle Counting and Classification: As the vehicle counting and vehicle classification modules rely on the tracking results from the upstream module, it is necessary to investigate the performance of the object tracking and tracklet construction module first. We found in our experiments that the IoU value worked well with the Hungarian algorithm, since each input video passed to the pipeline has a reasonably high frame rate. This guarantees that the bounding boxes of the same object have sufficient overlap between frames. The tracking performance of the Kalman filters depends on how accurate the detector is in locating the targets and whether the false positive and false negative detection rates are sufficiently low. This means that the accuracy of the vehicle counting and classification is ultimately governed by which detector is used. The vehicle counting results from the proposed method are shown in Table II. The vehicle counts from YOLOv2 and Faster R-CNN are clearly closer to the ground truth values (obtained from manual counting) than those from SSD. This is because of the shallower network of SSD for feature extraction compared to those of YOLOv2 and Faster R-CNN, which means that the features learned by SSD are not discriminative enough for detection. One can notice that SSD failed to detect all the vehicles in the 6-9 and 10-12 classes. In summary, SSD has 66% to 87% counting accuracy, while YOLOv2 and Faster R-CNN have 90% to 98%.

In terms of the performance on the three sites, all the three detectors had higher detection rates for the Narrows Bridge North and Narrows Bridge South sites. This is due to the camera installed at the Hutton Street site being further away from the scene, making the vehicles smaller. Another reason is that the Hutton Street video was recorded in a morning where shadows from the trees and nearby vehicles greatly affect the performances of all detectors (Figs. 13(e)&(f)).

The results from Faster R-CNN were better than those from other detectors as expected, as the detector gave more accurate bounding boxes of the vehicles, which are very important for tracking, counting and classification. While YOLOv2 achieved very similar performance as Faster R-CNN on all three sites, SSD dealt with the Narrows Bridge North site relatively better, where the front parts of vehicles are visible. All the detectors performed well on counting the Class 1 vehicles and less so on the Class 2-5 vehicles. While SSD misclassified many vehicles from Classes 2-5 and 6-9, YOLOv2 and Faster R-CNN achieved much better results for these two classes. Incorporating visual information appeared to help improve the detection rate but occasionally reduce the detection rate on the smaller vehicles. For example, for the Narrows Bridge South site, there is one vehicle from Class 10-12. While using the homography method to compute the vehicle length gave 3 vehicle counts for both YOLOv2 and Faster R-CNN, incorporating visual information helped to remove the two misclassified vehicles.

2) Speed Analysis: The piezoelectric data available at the Narrows Bridge South site makes it possible to benchmark the vehicle speed estimation module of our computer vision pipeline. The recorded speed data in Fig. 14(a) shows 2 regimes: free flow (at time 14:00-14:35) and congested (14:35-15:10). The vehicle speeds from our method and from the piezoelectric sensor are shown as red and blue dots. We note that our method computes the average speed over the whole trajectory length of each vehicle whereas the piezoelectric sensor measures the instantaneous speed of the vehicle. In the figure, the red and blue curves show the average speed of all vehicles falling into the same one-minute period. While the average speeds from both methods are similar for the congested period, there is a relatively large gap for the free flow period. The fact that the gap appears to be consistent suggests that it was caused by a systematic error.
TABLE II
VEHICLE COUNTING RESULTS FOR THE THREE SITES USING THE CLASSIFICATION METHOD BASED ON VEHICLE LENGTH ONLY (SECTION III-D.1) VERSUS THE
CLASSIFICATION METHOD BASED ON BOTH VEHICLE LENGTH AND VISUAL INFORMATION (SECTION III-D.3) ON THE BOUNDING
BOXES EXTRACTED BY THE THREE DETECTORS SSD, YOLOV2, AND FASTER R-CNN

Fig. 14. Speed analysis of data from our method and from the piezoelectric sensor.

Fig. 15. The indices of vehicles for the time-space diagram in Fig. 16.

With the absence of absolute ground truth, it is not possible to judge whether our method overestimated the vehicle speeds, or the piezoelectric sensor underestimated them, or a combination of both. A contributing factor to our possible overestimation is the slope of the road segment there. This small slope is not detectable in the 2D Google Maps, and so the 3D coordinates of landmark points that were used for our homography computation might not be accurately provided by Google Maps. For slow vehicle speeds (the congested period), the effect from the small errors of the homography matrix is negligible; however, for higher vehicle speeds, the errors in the homography matrix are the possible cause of the overestimation of vehicle speeds.

Figure 14(b) shows the line fitted to the two sets of average speed values (denoted by μ_i and ν_i, for all i) computed by the two methods. After readjusting the vision-based average speed values using the slope and y-intercept parameters obtained from the fitted line, the root mean square error (RMSE) between the two sets of values is 1.85, which corresponds to a normalized RMSE (computed as RMSE/μ̄, where μ̄ is the mean of the μ_i's) of 3%. This shows that after correcting the systematic error, the two sets of values agree well.

The speed-density plot is another typical diagram commonly used in traffic analysis (interested readers can refer to Fig. 2 of the paper by [59]). Figure 14(c) shows the average speed versus density plot for the two methods.
For each point in the diagram, the estimated density value is the number of vehicles per km over 5 lanes. This diagram agrees with Fig. 14(a) in that the larger average speed values estimated by our method give slightly lower density values in this plot (the red dots are slightly to the left of the blue dots). We can see that in the congested period, the average speed drops to below 50 km/h. By dividing the density by 5 to yield the density per lane, one can obtain the average headway (spacing) (approximately 45 metres and 24 metres respectively for the free and congested periods), which is a useful piece of information in traffic engineering.

E. Technique Analysis

1) "Zoom-in" Detection: As shown in Fig. 13(a), pedestrians and cyclists are very small targets in the whole video frame and are difficult to be detected by many object detectors, especially for YOLO. In our proposed pipeline, cropped regions instead of the whole video frames can be passed to the detector. By feeding the small cropped regions near the bottom-right corner of Fig. 13(a), the zoom-in region has most of the pedestrians (Fig. 13(c)) and the cyclist (Fig. 13(d)) successfully detected. "Zoom-in" detection allows our pipeline to detect small pedestrians and cyclists when the scene is viewed from afar.

2) Handling Wrong Detections: The detectors sometimes gave false positive detections, which means some wrong areas were detected as targets (vehicles, pedestrians, motorbikes, or cyclists). In this case, a tracklet for each false positive detection would be constructed as well. If the area occupied by the false detection is "dynamic" (e.g., vehicles keep moving through it), then there is little chance the false detection would occur again in that area for the subsequent frames. As long as the tracklet is not assigned to any new detection for 5 frames, the tracklet would be deleted. If the area is relatively static (e.g., a piece of rock on the road side), then the false detection might persist over many frames. In this case, it would not affect the final counting results because it would not cross the counting line (see Section III-F). Our method can therefore robustly deal with false positive detections from the detector through the tracking process. For false negative errors, Faster R-CNN is more robust than YOLOv2, which is more robust than SSD. Due to partial occlusion, especially when the traffic is heavily congested, all three detectors have higher false negative errors, leading to underestimation of vehicle counts (see Table II). This problem can be alleviated by adjusting the view angle of the camera so that it has an overhead view of the scene.
bottom- right corner of Fig. 13(a), the zoom-in region has most We have presented a vision-based pipeline consisting of
of the pedestrians (Fig. 13(c)) and the cyclist (Figs. 13(d)) modules for vehicle counting by lane, vehicle speed esti-
successfully detected. “Zoom-in” detection allows our pipeline mation, vehicle classification, and pedestrian/cyclist counting
to detect small pedestrians and cyclists when the scene is from monocular videos. We achieve these tasks by adapting
viewed from afar. state-of-the-art object detection techniques through transfer
2) Handling Wrong Detections: The detectors sometimes learning and through a novel application of projective geom-
gave false positive detections, which means some wrong areas etry. Our vehicle classification module incorporates a novel
were detected as targets (vehicles, pedestrians, motorbikes, or fusion of visual appearance and geometry information of
cyclists). In this case, a tracklet for each false positive vehicles in the scene. Our pipeline has been demonstrated
detection would be constructed as well. If the area occupied by through extensive experiments to give promising counting and
the false detection is “dynamic” (e.g., vehicles keep moving speed estimation results.
through it), then there is little chance the false detection would In our ongoing research effort, we have already extended
occur again in that area for the subsequent frames. As long as the work reported in this paper to traffic scenes captured by a
the tracklet is not assigned to any new detection for 5 frames, drone at four-way intersections, where vehicles moving in
the tracklet would be deleted. If the area is relatively static different directions need to be separated and more severe
(e.g., a piece of rock on the road side), then the false detection occlusion problems need to be dealt with. A limitation of our
might persist over many frames. In this case, it would not current method is the weak camera calibration step being
affect the final counting results because it would not cross the performed once only at the beginning of the pipeline. If the
counting line (see Section III-F). Our method can therefore camera is panned, tilted, or drifted (especially if the camera is
robustly deal with false positive detections from the detector carried by a drone), then the pre-computed homography is no
through the tracking process. For false negative errors, Faster longer valid. This limitation can be overcome by adding a new
R-CNN is more robust than YOLOv2, which is more robust module to track the calibration landmarks. For any detected
than SSD. Due to partial occlusion, especially when the traffic displacement of these landmarks in the image, the module can
is heavily congested, all the three detectors have higher false invoke an update of the homography.
Our future research work can be extended in several direc- [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierar-
tions. We intend to deploy the proposed method to daily chies for accurate object detection and semantic segmentation,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
operations and use the data generated for network operations, [20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
traffic flow monitoring, and also anomaly detection. One of time object detection with region proposal networks,” in Proc. Adv.
our research interests is to detect abnormal events such as Neural Inf. Process. Syst., 2015, pp. 91–99.
[21] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
pedestrians (or cyclists) on a freeways, vehicles driving in dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
wrong directions, vehicles driving at abnormal speeds (too fast Oct. 2017, pp. 2980–2988.
or too slow), and traffic accidents. As mentioned above, we [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
Proc. ICCV, Oct. 2017, pp. 2980–2988.
have already started to process traffic videos captured by [23] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and
drone cameras at intersections. Adding a second camera to the A. W. M. Smeulders, “Selective search for object recognition,” Int. J.
system is part of our future research endeavour. This will help Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013, doi: 10.1007/
overcome occlusion problems and also the homography error s11263-013-0620-5.
[24] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association.
due to the slope of the road mentioned in ourexperiments. London, U.K.: Academic, 1988.
[25] S. S. Blackman, “Multiple hypothesis tracking for multiple target
ACKNOWLEDGMENT tracking,” IEEE Aerosp. Electron. Syst. Mag., vol. 19, no. 1, pp. 5–18,
Jan. 2004.
The Titan Xp GPU funded by Nvidia for this research is [26] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu,
greatly appreciated. “Multi- object tracking through simultaneous long occlusions and split-
merge conditions,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 666–673.
Chenghuan Liu received the bachelor's degree in control engineering from the Harbin Institute of Technology, Harbin, China, in 2015. He is currently pursuing the Ph.D. degree in computer vision with The University of Western Australia, Perth, WA, Australia. His current research interests include visual object tracking and smart transportation systems.

Du Q. Huynh (Senior Member, IEEE) received the Ph.D. degree in computer vision from The University of Western Australia, Perth, WA, Australia, in 1994. Since then, she has been with the Australian Cooperative Research Centre for Sensor Signal and Information Processing and Murdoch University, Perth, WA, Australia. She is currently an Associate Professor with the Department of Computer Science and Software Engineering, The University of Western Australia. She has previously researched shape from motion, multiple view geometry, and 3-D reconstruction. Her current research interests include visual object tracking, video image processing, machine learning, and pattern recognition.

Yuchao Sun received the master's degree in information management and the Ph.D. degree from The University of Western Australia in 2003 and 2016, respectively. He has worked in both industry and academia, in which he has participated in a range of research and consulting activities, including some large-scale infrastructure and mining projects. With a focus on applying modern computing techniques to solve transport problems, his main research interests include emergent behavior, bio-inspired algorithms, the impact of future transport technologies (especially connected and autonomous vehicles), transport modeling, data analytics, optimization, and discrete event simulation for supply chain management.

Mark Reynolds (Member, IEEE) received the B.Sc. degree (Hons.) in pure mathematics and statistics from The University of Western Australia (UWA), Perth, WA, Australia, in 1984, the Ph.D. degree in computing from the Imperial College of Science and Technology, University of London, London, U.K., in 1989, and the Diploma of Education degree from UWA in 1989. He is currently a Professor and the Head of the School of Physics, Mathematics, and Computing with UWA. His current research interests include artificial intelligence, optimization of schedules and real-time systems, optimization of electrical power distribution networks, machine learning, and data analytics.
Steve Atkinson received the B.Sc. degree in computer science, the M.B.A. degree (Hons.), and the Master of Business Leadership degree (Hons.) from Curtin University, Perth, WA, Australia, in 1996, 2006, and 2010, respectively. He has worked for a variety of leading Western Australian organizations. He is currently the Principal of Analyst Strategic Planning with Main Roads Western Australia, a role which includes management of the corporate Innovation and Research Program.