Travel Time Prediction Using Machine Learning and Weather Impact On Traffic Conditions


2019 5th International Conference for Convergence in Technology (I2CT)

Pune, India. Mar 29-31, 2019

Travel Time Prediction using Machine Learning and


Weather Impact on Traffic Conditions

Bilash Deb Salehin Rahman Khan Khandker Tanvir Hasan


CSE Department CSE Department CSE Department
BRAC University BRAC University BRAC University
Dhaka, Bangladesh Dhaka, Bangladesh Dhaka, Bangladesh
[email protected] [email protected] [email protected]

Ashikul Haque Khan Dr. Md. Ashraful Alam


CSE Department CSE Department
BRAC University BRAC University
Dhaka, Bangladesh Dhaka, Bangladesh
[email protected] [email protected]

Abstract: The growth of Intelligent Traffic Systems (ITS) has recently been fast and impressive. Analysis and prediction of network traffic have become a priority in day-to-day planning in social, economic and a wider set of areas. To contribute further to this field of research, we propose an approach to forecast the level of traffic congestion on the basis of a time series analysis of collected data using machine learning. Moreover, the proposed approach allows us to find a correlation between varying weather parameters and the level of traffic congestion. Traffic data collected from Uber Movement for the city of Mumbai, India were fed to multiple pre-assessed machine learning algorithms. Comparative analysis of the results of the different machine learning algorithms shows that logistic regression works best, with an accuracy of 85% on the collected Uber data. Thus our model can predict the time to travel between different nodes (locations) in Mumbai city based on the data collected from Uber Movement.

Keywords: Machine Learning, Traffic congestion, Forecasting, Weather, Intelligent Transport System (ITS), Support Vector Machine (SVM), Logistic Regression, Correlation, Uber Movement

I. INTRODUCTION
A. Introduction
Congestion can arise from various conditions such as road work, peak-hour traffic, accidents, and inclement weather [1]. In this work, the traffic congestion caused by weather conditions is studied, the effect of weather conditions on mean travel time between different nodes of Mumbai city is analyzed, and a prediction model is proposed, based on the results of several learning algorithms, to forecast traffic congestion accurately. For the most part, two strategies are considered to lessen congestion on urban expressways. One is to expand overall road capacity by adding lanes to existing streets or building new streets, yet this requires additional land and a very large infrastructure expense, which is not feasible in most cases in many urban regions. The other is to use traffic control techniques so as to make effective use of the existing expressways. Such control systems regularly include forecasting congestion levels and proactively managing the traffic before congestion sets in.

B. Motivation
The increase in population density and in the relative number of car owners makes traffic jams an important problem of modern societies. Traffic jams are a major source of discomfort for drivers, and also the cause of an increased number of traffic accidents, especially in large cities. According to The New Indian Express, a study has pegged the avoidable social cost of traffic congestion in Bengaluru at 38,000 crore Indian rupees annually [2]. The cost covers time delays, man-hours lost, extra fuel consumed, vehicle wear and tear, traffic accidents and environmental damage. The study, commissioned by taxi aggregator Uber and done by Boston Consulting Group, claimed India loses about 1.5 lakh crore Indian rupees annually due to traffic congestion in Delhi, Mumbai, Bengaluru and Kolkata. Moreover, it is not only cities with very large populations that face this problem. Developed cities all around the globe incur this avoidable cost through time wasted on the road [3]. With little being done to provide efficient transport solutions, people are getting used to spending more and more time commuting from one point to another. In recognition of this problem, we wanted to create a model that helps to predict this congestion accurately and can be used in various sectors, from government planning to personal daily planning.

C. Contribution summary
The overall target was to build a forecasting model that also takes weather conditions as a contributing factor in predicting traffic congestion. The results from this model can be interpreted and used in different ways according to the user's point of view. It can be used to improve the ITS traffic management systems of individual cities by analyzing their data through our system and generalizing a pattern of traffic movement so that action can be taken beforehand. This paper also attempts to find a correlation between several weather variables and traffic congestion and outlines the effect of weather on traffic. To analyze the pattern of traffic flow we have fed our secondary data to

978-1-5386-8075-9/19/$31.00 ©2019 IEEE 1


Authorized licensed use limited to: Ontario Tech University. Downloaded on June 17,2024 at 14:13:37 UTC from IEEE Xplore. Restrictions apply.
different machine learning algorithms to compare the performance. We have also outlined the prediction analysis of the selected algorithm that works best on our collected data from Uber Movement. Thus the contribution of this paper can be summarized as follows.
• A model is proposed for long-term prediction of traffic congestion that can be used both for urban planning and by the general public.
• The impact of several weather variables on traffic congestion is studied.
• A performance analysis of different machine learning algorithms on our secondary data from Uber Movement is outlined.

II. LITERATURE REVIEW
Although much work has been done on creating models that accurately forecast traffic congestion, only a few papers take machine learning into account. Furthermore, to our knowledge there has not previously been an approach that creates a learning model based on data collected from a secondary source such as Uber Movement. It is difficult to estimate traffic conditions accurately, since they can vary widely in the spatial and time domains, and a good amount of research has been done in the related areas of this field. To outline some of the relevant works, Bauza et al. [6] proposed cooperative traffic congestion detection based upon vehicle-to-vehicle communication for road traffic congestion prediction and obtained congestion detection probabilities of 90%. Manish R. Joshi and Theyazn Hassn Hadi [7] carried out intensive research on different prediction techniques and reviewed network traffic analysis. Eric Horvitz et al. [8] studied a deployed traffic forecasting service. Their research led to the deployment of a service named JamBayes that is being actively used by over 2,500 users via smartphones and desktop versions of the system. Jerome Treboux et al. [9] performed short-term prediction with more than 99% accuracy, collecting the data with sensors that they placed at different locations in Santander City. This is one of the few papers that includes the weather constraint. In research on a prediction model, Jiwan Lee, Bonghee Hong and Kyungmin Lee [10] used 48 weather forecasting factors and attempted to correlate these data using multiple linear regression analysis. Research has mainly been carried out in the following three categories.

A. Data Mining Research :
Several studies have systematically reviewed data collection methodologies, in particular the collection of section-based data such as travel time [11]. In [12], the authors proposed a model for video-based data collection. Recently, the proliferation of wireless communication infrastructures and navigation technologies has enhanced data collection and data coverage. These technologies (i) collect vehicle positions, (ii) infer relevant information concerning vehicular kinematic characteristics and congestion, and (iii) provide congestion information to drivers [13].

B. Traffic Flow Prediction Research :
Historically, several authors have claimed that the configuration of a city's street network plays an important role in vehicular flow and, hence, have used centrality measures of a street graph to model and predict traffic. Specifically, authors such as Turner [14] proposed betweenness centrality as a good predictor of traffic flow. Gao et al. [15], however, criticized this approach and proposed a new model of traffic flow based on the non-uniform distribution of human activity and the distance-decay law.

C. Research with other taxicab data from GPS
Zheng et al. [16] provide an interesting framework for analyzing taxicab data, which consists of linking pairs of regions (i, j) to three key features: (1) the number of taxis going from region i to region j, (2) the average speed of these taxis when commuting from region i to region j, and (3) the ratio between the actual travel distance and the distance between the centroids of these two regions. By mapping trajectory data from 30,000 taxis driving in Beijing from March to May in 2009 and 2010 onto this framework, Zheng et al. seek flaws in current urban planning.

III. METHODOLOGY
A. Understanding the Data Set
All of the data used in our research are secondary data. That is, the data were collected by another organization or entity, possibly in its own way and for some other purpose [4]. Traffic data for Mumbai city from 2016 to date were collected from Uber through a project called Uber Movement. Weather data for the city of Mumbai were collected from Wunderground for the same time period.

1) Traffic Data : This January, Uber unveiled "Uber Movement", a tool intended for use by city planners and researchers looking into ways to improve urban mobility. It provides anonymized data from over two billion Uber trips in the cities of Bogota, Boston, Johannesburg, Manila, Paris, Sydney, Washington D.C. and Mumbai, and more cities are being added to the list to help urban planning around the world. These data are open and are targeted so that city officials can measure the impact of road improvements, major events and transit lines, so that planners and policy makers can analyze transportation patterns and make smart investments in future infrastructure projects, and to power breakthrough insights and ideas with open data for all. Specifically, the data include the arithmetic mean, geometric mean, and standard deviations of aggregated travel times over a selected date range between every zone pair in each of these cities. Uber Movement is open to the public and the data can be downloaded in .csv [comma separated value] format directly from [Uber Movement's Website]. Below are some tables that depict the type of data that we have used from the Uber Movement site. Table 1 shows an example of the mean time taken and its upper and lower bound for each day between two particular nodes of Mumbai city during a specific date range. This helps in classifying a pattern across the days of the week between two nodes for the whole year. The features used for our learning model were mainly the origin ID, the destination ID, the day of the week (encoded as an integer from 1 to 7, with Sunday as 1) and the mean travel time. Some of the sample data collected from Uber Movement are shown below.

TABLE I. MEAN TRAVEL TIMES BY DAY OF THE WEEK

Day of Week   Origin ID   Destination ID   Date Range (M/D/Y)        Mean Travel Time (s)
Mon           541         108              12/1/2017 - 12/31/2017    2340
Tues          541         108              12/1/2017 - 12/31/2017    2365
Wed           541         108              12/1/2017 - 12/31/2017    2466
Thu           541         108              12/1/2017 - 12/31/2017    2623
Fri           541         108              12/1/2017 - 12/31/2017    2764

Fig. 1. The Uber Web interface colors cells in the city grid based on the average travel time to them from the specified pin [courtesy: Uber Movement]
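
As a minimal sketch of the feature encoding described for the learning model (day of the week as an integer from 1 to 7 with Sunday as 1), the rows of Table 1 could be turned into feature tuples as follows. The helper name encode_day and the tuple layout are assumptions made for illustration and are not taken from the authors' code.

```python
# Hypothetical helper names; only the "Sunday = 1" convention comes from the text.
DAY_LABELS = {"sun": 1, "mon": 2, "tue": 3, "wed": 4, "thu": 5, "fri": 6, "sat": 7}


def encode_day(day_name):
    """Map a day name such as 'Mon' or 'Tues' to an integer label (Sunday = 1)."""
    return DAY_LABELS[day_name.strip().lower()[:3]]


# Rows copied from Table 1: (day, origin ID, destination ID, mean travel time in s)
rows = [
    ("Mon", 541, 108, 2340),
    ("Tues", 541, 108, 2365),
    ("Wed", 541, 108, 2466),
    ("Thu", 541, 108, 2623),
    ("Fri", 541, 108, 2764),
]

# Feature tuples (origin ID, destination ID, day label) and regression targets
features = [(origin, dest, encode_day(day)) for day, origin, dest, _ in rows]
targets = [mean_time for _, _, _, mean_time in rows]
```

Keeping the encoding in one small function makes it easy to apply the same labelling to every quarterly file.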
Each day is also segmented into five periods (AM Peak, Midday, PM Peak, Evening and Early Morning) to study the rush-hour pattern. Table 2 shows an example of the mean time taken and its upper and lower bound for each day for each of the aforementioned segments between two particular nodes of Mumbai city during a specific date range.

TABLE II. MEAN TRAVEL TIMES BY HOUR OF THE DAY

Time of Day     Origin ID   Destination ID   Date Range (M/D/Y)        Mean Travel Time (s)
Daily Average   541         108              12/1/2017 - 12/31/2017
AM Peak         541         108              12/1/2017 - 12/31/2017
Midday          541         108              12/1/2017 - 12/31/2017    2642
PM Peak         541         108              12/1/2017 - 12/31/2017    2677
Evening         541         108              12/1/2017 - 12/31/2017    2446

Table 3 shows the hourly aggregated data that further helped us to analyze the hourly pattern.

TABLE III. HOURLY AGGREGATED DATA BETWEEN SEVERAL COMBINATIONS OF NODES

sourceid   dstid   hour of the day   mean travel time   standard deviation travel time
1          3       18                4825.54            836.44
1          4       12                4154.69            627.68
1          5       6                 1093               538.85
1          6       0                 2860.52            611.29
1          7       19                4526.77            772.83
2          1       19                6936.63            1179.43
2          3       7                 4215.85            884.97
2          6       14                2703.1             427.93
2          9       21                1384.31            314.54
3          1       8                 3359               1128.74
3          4       15                2855.21            481.24
3          5       9                 4318.09            1729.94

Similarly, the data were also aggregated monthly to study the seasonal variation and the pattern in the mean travel time between selected nodes.

2) Weather Data : Weather data for Mumbai were collected from the year 2016 to date to match the traffic timeline. Many weather factors have a combined effect on different regions of a country. So far, we have narrowed these down to a few key factors: average temperature, humidity, dew point, wind speed, pressure and precipitation, collected from Wunderground. Weather Underground, or Wunderground, is a commercial weather service providing real-time weather information via the Internet. It provides weather reports for most major cities across the world on its website, as well as local weather reports for newspapers and websites. Its information comes from the National Weather Service (NWS) and over 250,000 personal weather stations (PWS) [5]. The table below shows a portion of the weather data that we have collected from Wunderground. The averages were calculated by Wunderground using the following equation:

\[ \mathrm{avg} = \frac{\max + \min}{2} \tag{1} \]

TABLE IV. PARAMETERS OF WEATHER DATA

Date (d/m/y)   Avg Temp (°F)   Avg Dew Point (°F)   Max Humidity (%)   Max Wind Speed (mph)   Max Pressure (inHg)   Avg Precipitation (in)
13/09/16       80              76                   89                 13                     29.83                 0.04
14/09/16       82              76                   89                 13                     29.83                 0
15/09/16       78              77                   100                13                     29.8                  3.15
16/09/16       79              77                   94                 8                      29.74                 0.35
17/09/16       78              77                   100                14                     29.69                 1.54
18/09/16       77              77                   100                23                     29.78                 1.38
19/09/16       79              76                   100                14                     29.83                 0.87
20/09/16       78              76                   100                14                     29.83                 5.16
21/09/16       76              76                   100                15                     29.83                 2.68

B. Uber Movement: Travel Time Calculation Methodology
Although the collected data come with the mean travel time between nodes, it is worth describing how the travel times are actually calculated. The Uber Partner app, while on trip, records latitude, longitude, and a timestamp (date/time) every 4 seconds. These GPS trace pings are commonly used to provide navigational routing, fare calculations, matching of partners with riders, and user experience elements, such as plotting the position of the car

in the Uber Rider app. When aggregated, these GPS trace pings can also be used to derive average travel times between the zones in a given region. Uber Movement processes these GPS trace pings using the following high-level steps:

STEP 1 - Zone Assignment: For each trip, unsorted GPS trace pings are assigned an appropriate zone as defined by a shape file.

STEP 2 - Mean Epoch: For each zone a trip passes through, the mean GPS ping within that zone is computed. After this step, the overall trajectory is lost but we do know the average timestamp within each zone a trip passed through.

STEP 3 - Zone to Zone Travel Time: The elapsed time from each mean GPS ping to all subsequent GPS pings is measured, thereby providing zone-to-zone travel times for each trip.

STEP 4 - Aggregate Trips: Zone-to-zone travel times are aggregated from all trips. After this step, trip-level information is lost and we only know statistical measures of zone-to-zone travel times aggregated from many trips.

STEP 5 - Privacy Constraints: Travel time statistics are removed for zone pairs that do not meet either a) a minimum number of trips or b) the minimum count of unique riders necessary to preserve rider privacy. (This step is implemented in tandem with Step 4 but is listed as a separate step for ease of understanding.)

STEP 6 - Release: Zone-to-zone travel time averages are made available via Movement's interactive travel times solution, including several available CSV export options.

Fig. 2. The zones passed by an Uber trip between the colored starting and end points.

C. Learning Models
The data, in .csv format, were read as input, and the selected classifiers, called from an array, were trained on this data set. The data were split into two parts, the training set and the testing set, using the train_test_split function in Sklearn. The argument test_size=0.2 indicates the fraction of the data that is held out for testing (here 20%). The ratio was altered to see the performance of each algorithm for different ratios of training and testing data. Below is a figure of the workflow of our methodology.

Fig. 3. Workflow of the system

The selected classifiers for our model are described in the section below. They were selected on the basis of their ability to recognize patterns in the data set.

1) Decision Tree : A decision tree is a predictor h: X → Y that predicts the label associated with an instance x by travelling from the root node of a tree to a leaf. At each node on the root-to-leaf path, the successor child is chosen on the basis of a splitting of the input space. The main purpose of the decision tree is to represent the training data set with the smallest possible tree [17].

Pseudo code:

    ID3(S, A)
    INPUT: training set S, feature subset A ⊆ [d]
    if all examples in S are labeled by 1, return a leaf 1
    if all examples in S are labeled by 0, return a leaf 0
    if A = ∅, return a leaf whose value = majority of labels in S
    else:
        Let j = argmax_{i ∈ A} Gain(S, i)
        if all examples in S have the same label
            Return a leaf whose value = majority of labels in S
        else
            Let T1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j}).
            Let T2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j}).
            Return the tree that splits on x_j, with T2 as the x_j = 0 subtree and T1 as the x_j = 1 subtree

Different algorithms use different implementations of Gain(S, i). The simplest definition of gain is the decrease in training error. One such gain measure is information gain; the equation for information gain is given below:

\[ IG(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(A) \tag{2} \]

where

\[ \mathrm{remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right) \tag{3} \]

Here p and n are the numbers of positive and negative examples, and attribute A splits them into v subsets containing p_i positive and n_i negative examples each.

2) Random Forest : A random forest is a classifier consisting of a collection of decision trees, where each tree is constructed by applying an algorithm A on the training set S and an additional random vector θ, where θ is sampled independently and identically distributed from some distribution. The prediction of the random forest is obtained by a majority vote over the predictions of the individual trees. It is very efficient on large data sets. Random forest uses the Gini index for deciding the final class of each tree. If data set T contains examples from n classes, the Gini index Gini(T) is defined as

\[ \mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} (P_j)^2 \tag{4} \]

where P_j is the relative frequency of class j in T.
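
The impurity measures in Equations (2) to (4) can be sketched in plain Python. The sketch assumes that I(·,·) denotes binary entropy measured in bits, a common convention that the text does not state explicitly, and the function names are illustrative only.

```python
from math import log2


def entropy(p, n):
    """I(p/(p+n), n/(p+n)): entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # the term 0 * log2(0) is taken to be 0
            q = count / total
            result -= q * log2(q)
    return result


def information_gain(p, n, subsets):
    """Equations (2)-(3); subsets holds one (p_i, n_i) pair per value of the attribute."""
    remainder = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in subsets)
    return entropy(p, n) - remainder


def gini(class_counts):
    """Equation (4): one minus the sum of squared class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)
```

An attribute that separates five positive from five negative examples perfectly, information_gain(5, 5, [(5, 0), (0, 5)]), recovers the full one bit of entropy, while a split that leaves both subsets evenly mixed gains nothing.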

3) Linear Regression for polynomial regression task
The hypothesis of the multivalued regression analysis is

\[ h_\theta(x) = \theta^{T} x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \tag{5} \]

Linear regression tries to fit a best-fit line through a scatter diagram of recorded points. The linear equation of the best-fit line is the least-squares regression equation, from which the value of the dependent variable can be found from one or more independent variables. The best-fit line is found by minimizing the average squared distance between the original values and the points on the line. This distance is called the cost function, which is calculated by the formula

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{6} \]

4) Logistic Regression : Logistic regression searches the whole data set to find the hyperplane that best separates the classes. The core of logistic regression is the logistic function, also called the sigmoid function. This function was originally developed to describe the properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1. The function is given below:

\[ \phi_{\mathrm{sig}}(z) = \frac{1}{1 + \exp(-z)} \tag{7} \]

5) Support Vector Machine (SVM) : Support vector machines (SVM) are kernel machines that implement maximum margin methods. The maximum margin is generated by the kernel using a set of weighted vectors of training data called support vectors. The basic concept of this algorithm is finding a hyperplane in order to classify the data sets. There are two kinds of SVM classifiers:
1) SVM linear classifier
2) SVM non-linear classifier
SVM uses a quadratic approach to define the problem of maximizing separability between classes. The margin is subject to a constraint on the smoothness of the solution.

D. Analyzing Weather Impact On Traffic Congestion
In order to analyze how weather conditions affect traffic congestion, the product moment correlation coefficient is calculated between each of the weather parameters discussed in Section III-A and the mean travel time between two nodes on different days in varying weather conditions.

1) Correlation Coefficient : Correlation coefficients for two variables signify the degree of linearity between them. For a sufficient amount of data, the degree of linearity can be measured as strong, positive, negative or no correlation. The correlation coefficient ρ for two variables x and y is defined as:

\[ \rho = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\; \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{8} \]

The value of ρ lies in the range -1 ≤ ρ ≤ 1, where ρ = 1 represents a perfectly positive correlation, that is, the sample data points (x, y) lie on a straight line with a positive slope, and ρ = -1 indicates a perfectly negative correlation. The months used for analysis were chosen as typical winter, rainy and summer months to observe seasonal effects.

E. Data Used in Learning Models
If we look back at the type of data shown in the tables in Section III-A, we see that the data from Uber Movement consist of data from one node to all possible nodes of Mumbai city. The source ID (sourceid) and destination ID (dstid) were labeled by Uber Movement. That is, a sourceid of 541 will always represent the location Mantranalay Road and the dstid 108 will always represent the location R.B.I Branch of Mumbai city. So for the machine learning part, the feature set for training included the source ID, the destination ID, the time of the day and the day of the week for every available combination of origin and destination nodes. The data were subdivided into four .csv files, one for each quarter of the year. The accuracy is calculated for testing data in the same quarter of the same year. To analyze the pattern, we selected a single mother node and analyzed the data from that node to 10 other nodes over the period from 2016 to date. These nodes were not chosen at random but were selected from prior knowledge, as travel between them requires passing through busy highways and intersections. The algorithms were trained on three models of data:

Model 1: Using data from Table 1 (travel time by day of the week) to analyze the overall mean time to travel from the mother node to the other nodes on each day of the week.

Model 2: Using data from Table 2 (travel time by hour of the day) to see a pattern of congestion during the rush hour.

Model 3: Using monthly aggregated data to see seasonal variation in the pattern.

The data set was split into two parts: i) the training set, which was used to train the learning models, and ii) the testing set, which was used to measure the accuracy of our prediction. The ratio of training set to testing set was also varied to study the learning curve of each of the machine learning algorithms. Weather data were not used directly as a parameter to train the algorithms; rather, our study sought only a correlation between weather variables and mean travelling time. The correlation coefficient between each weather factor and the mean travel time for the same day was measured.

IV. RESULT ANALYSIS
A. Learning Performance
Throughout these models we are predicting the mean travel time between nodes. As the output is a numerical value, the performance of the algorithms was evaluated using the root mean squared error. Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. The formula is shown below:

\[ \mathrm{RMSE} = \left[ \sum_{i=1}^{N} \frac{(z_{f_i} - z_{o_i})^2}{N} \right]^{1/2} \tag{9} \]

where z_{f_i} - z_{o_i} is the difference between the predicted and original values and N is the sample size.
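
For reference, Equation (9) and the correlation coefficient of Equation (8) translate directly into a few lines of Python. This is a minimal sketch with illustrative function names and made-up sample values, not the evaluation code used in the study.

```python
from math import sqrt


def rmse(predicted, observed):
    """Equation (9): square root of the mean squared residual."""
    n = len(observed)
    return sqrt(sum((f - o) ** 2 for f, o in zip(predicted, observed)) / n)


def pearson(xs, ys):
    """Equation (8): product moment correlation coefficient of two equal-length samples."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))


# Illustrative values only (not taken from the paper's data)
example_rmse = rmse([2.0, 4.0], [1.0, 3.0])
example_rho = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Both functions use only running sums, so they can be applied to each node pair or each weather variable in turn without extra dependencies.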
The table below shows the percentage accuracy that we
have obtained for the learning algorithms that we have used
for different size of training set.

TABLE V. PERCENTAGE ACCURACY OF ALGORITHMS FOR DIFFERENT SIZE OF DATA SETS

                                 Accuracy by Percentage of Data in Test Set
Algorithm Name                   40%     50%     60%     80%
Decision Tree Regression         42%     54%     68%     73%
Random Forest                    56%     62%     74%     83%
Multivalued Linear Regression    48%     69%     75%     84%
Logistic Regression              48%     63%     78%     85%
SVM                              35%     60%     68%     70%

Percentage accuracy was measured according to the formula below:

\[ \text{Percentage accuracy} = (1 - \text{error}) \times 100 \tag{10} \]

Below is a chart that shows the percentage accuracy of each of the machine learning algorithms for different ratios of testing and training set.

Fig. 4. Percentage accuracy for different algorithms

The learning rate of multivalued linear regression for different sizes of training and testing set is shown below.

Fig. 5. Multivalued linear regression learning rate.

B. Trend Analysis
Some expected patterns were found in the behavior of commuters in Mumbai city. Predictably, the mean travel time during rush hour is at most 11.9% longer than when travelling in the early morning. The pattern is the same for all of the routes from the mother node to the other nodes. The algorithms generated some charts that depict the patterns in our study, which are shown below. Figures 6 to 9 show the average travel times by day of the week for the four quarters of the year 2017.

Fig. 6. Mean travel time by day for quarter 1 [January 2017 - March 2017]

Fig. 7. Mean travel time by day for quarter 2 [April 2017 - June 2017]

Fig. 8. Mean travel time by day for quarter 3 [July 2017 - September 2017]

Note that the mean travel time increased on almost every day in quarter 3 compared with quarter 2, as that quarter is subject to heavy rainfall in the Mumbai region.

Fig. 9. Mean travel time by day for quarter 4 [October 2017 - December 2017]

Figures 10 and 11 show the average travelling time by period for each day for two of the quarters in the year 2017.

Fig. 10. Mean travel time by period for quarter 2 [April 2017 - June 2017]

Fig. 11. Mean travel time by period for quarter 3 [July 2017 - September 2017]

Again, Figure 11 shows that the average travelling times throughout the days in the third quarter were longer than those of the second.

C. Correlation Analysis
For our study, we tried to find a correlation between travelling time and each of the parameters listed earlier in Table 4. Except for precipitation, none of the other parameters had any effect on the mean travelling time between the nodes. The figure below shows that there is almost no correlation between temperature and mean travel time. The data points are too scattered to draw a best-fit line.

Fig. 12. Scatter diagram of temperature against mean travel time.

On the other hand, the data show an increase in travel time of around 12.6% for a 1-inch increase in precipitation. Although it cannot be claimed that precipitation has a direct correlation with traffic demand or traffic flow, it surely causes other mishaps, such as road blockage and waterlogging, that lead to traffic congestion. Below is a figure that shows the positive correlation between precipitation and mean travel time between two nodes of Mumbai city. The correlation coefficient found from the data collected between the two randomly selected nodes is 0.78.

Fig. 13. Positive correlation between mean travel time and precipitation

V. CONCLUSION
In this research we presented a model using machine learning algorithms to forecast the mean travelling time with an accuracy of up to 85%, trained on data for Mumbai city collected from Uber Movement. The study is divided into three components. 1) The performance analysis of different machine learning algorithms trained on different sizes of training set. It is seen that regression analysis worked best on the data that we had collected. 2) Recognizing the pattern of daily commuters and analyzing data on a quarterly basis to study seasonal variation. This showed us that travelling in the evening can sometimes take longer than in the PM peak period, and that certain holidays can cause a particular day to have an irregular pattern. 3) The impact of weather events on the travel time prediction was investigated. The third part of our research showed us that the other weather factors do not affect travelling time as much as precipitation. It cannot be said that precipitation has a direct link to traffic congestion in general, but a link is observed in the case of Mumbai city. One possible cause might be that precipitation reduced the freeway capacity while the trip demand to use the freeway did not decrease accordingly.

The limitation of our system comes mainly from the data that we obtained. Uber Movement data do not contain all the information that could have made our predictive model more realistic or its output more precise for urban planning. Uber Movement does not contain the following data that we think could have helped our cause:
• The route taken by the Uber trips.
• The road width and the traffic flow and density.
• The number of trips made between every pair of nodes per day.

The model learned from data obtained only from Uber Movement. Uber mainly works as a taxi aggregator in the city of Mumbai, where the majority of the vehicles are sedan cars. So the prediction model is specific to sedan cars and taxi-like vehicles. Other means of

transport might have different mean travelling times than
the result obtained from our model.

In the future, we would like to collect our own data, tailored to our needs, instead of relying on secondary sources. We would also like to include further parameters such as traffic flow, traffic density, and average speed, and to conduct a similar experiment in another city where other weather factors might have an impact on the flow of traffic.
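
The per-factor correlation check from Section C could be repeated for another city along the following lines. This is only a minimal sketch: the paper does not state how the correlation or the percentage change per inch of precipitation was computed, so the Pearson coefficient and the least-squares slope used here are assumptions, and the function names and sample values are hypothetical, not taken from the Uber Movement or weather data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def percent_change_per_unit(xs, ys):
    """Least-squares slope of ys on xs, as a percentage of the fitted value at x = 0."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when xs is constant")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    if intercept == 0:
        raise ValueError("fitted baseline at x = 0 is zero")
    return 100.0 * slope / intercept


if __name__ == "__main__":
    # Hypothetical daily observations, for illustration only.
    precipitation_in = [0.0, 0.5, 1.0, 1.5, 2.0]
    temperature = [70.0, 80.0, 90.0, 80.0, 70.0]
    mean_travel_time = [1000.0, 1050.0, 1100.0, 1150.0, 1200.0]

    print("r(precipitation, travel time):", pearson_r(precipitation_in, mean_travel_time))
    print("r(temperature, travel time):", pearson_r(temperature, mean_travel_time))
    print("% change per inch:", percent_change_per_unit(precipitation_in, mean_travel_time))
```

A coefficient near zero corresponds to the scattered cloud seen for temperature, while a coefficient near one together with a positive slope corresponds to the precipitation effect reported above.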

Instead of just analyzing and forecasting somewhat predictable patterns in the data, we would like to look for solutions that reduce the level of traffic congestion. Geospatial and geotemporal data can be analyzed with betweenness centrality to identify busy intersections, helping urban planners come up with ways to ease the flow of traffic further. In the past few years ITS has made great strides, but there is still a long way to go in this area of research.
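
As an illustration of how betweenness centrality could flag busy intersections, the sketch below applies Brandes' algorithm to a small unweighted, undirected road graph. The paper does not specify an algorithm or a graph representation, so both are assumptions here, and the intersection names are hypothetical.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph.

    graph maps each node to a list of its neighbours (Brandes' algorithm).
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths.
        order = []                      # nodes in non-decreasing distance
        preds = {v: [] for v in graph}  # predecessors on shortest paths
        sigma = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in centrality.items()}


# Hypothetical road network: intersections A..E, roads as undirected edges.
roads = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}

scores = betweenness_centrality(roads)
busiest = max(scores, key=scores.get)
print(busiest, scores)
```

In a real study the edges would likely be weighted by travel time, which requires a Dijkstra-based variant of the same algorithm rather than breadth-first search.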
[14] Alasdair Turner. From axial to road-centre lines: a new representation
REFERENCES
for space syntax and a new model of route choice for transport
[1] Chin, Kwai-Sang &Tummala, V & P. F. Leung, Jendy& Tang, network analysis. Environment and Planning B: Planning and Design,
Xiaoqing. (2004). A study on supply chain management practices: 34(3):539–555, 2007.
The Hong Kong manufacturing perspective. International Journal of [15] Song Gao, Yaoli Wang, Yong Gao, and Yu Liu. Understanding urban
Physical Distribution & Logistics Management. 34. 505-524. traffic-flow characteristics: a rethinking of betweenness centrality.
10.1108/09600030410558586. Environment and Planning B: Planning and Design, 40(1):135–153,
[2] “The money lost on our Roads” The New Indian Express, N.p.,20 2013.
April 2018. Web.[Online] [16] Yu Zheng, Yanchi Liu, Jing Yuan, and Xing Xie. Urban computing
http://www.newindianexpress.com/opinions/editorials/2018/apr/20/th with taxicabs. In Proceedings of the 13th international conference on
e-money-lost-on-our-roads-1803847.html Ubiquitous computing, pages 89–98. ACM, 2011.
[3] The Data Team, "The Economist," 28 February 2018. [Online]. Avail [17] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer,
able: “SMOTE:synthetic minority over-sampling technique”, Journal of
https://www.economist.com/blogs/graphicdetail/2018/02/dailyǦ chart artificial intelligence Research, 2002, pp. 321-357.
Ǧ 20. [18] Texture Feature Extraction and Classification of SEM Images of
[4] Salkind, N. J. (2010). Encyclopedia of research design Thousand Wheat Straw/Polypropylene Composites in Accelerated Aging Test -
Oaks, CA: SAGE Publications Ltd doi: 10.4135/9781412961288 Scientific Figure on ResearchGate. Available from:
[5] Wikipedia contributors. (2018, July 11). Weather Underground https://www.researchgate.net/Flow-chart-of-SVM-algorithm-on-
(weather service). In Wikipedia, The Free Encyclopedia. Retrieved predicting-classification_fig6_283469343 [accessed 18 Jul, 2018]
20:22, July 17, 2018, [19] Finding the Most Significant Elements for the Classification of
from https://en.wikipedia.org/w/index.php?title=Weather_Undergrou Organic Orange Leaves: A Data Mining Approach - Scientific Figure
nd_(weather_service)&oldid=849770444 on ResearchGate. Available from:
[6] Bauza, R. and Gozálvez, J., 2013. Traffic congestion detection in https://www.researchgate.net/Decision-boundary-margins-and-
large-scale scenarios using vehicle-to-vehicle communications. parameters-of-a-support-vector-machine_fig1_317928807[accessed
Journal of Network and Computer Applications, 36(5), pp.1295-1307. 18 Jul, 2018]
[7] Joshi, Manish &Aldhayni, TheyaznTheyazn. (2015). A Review of
Network Traffic Analysis and Prediction Techniques.
