Predicting Taxi Demand Using Machine Learning: International Research Journal of Engineering and Technology (Irjet)
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Taxi service is imbalanced in big cities. Taxi drivers have to decide where to wait so that they can pick up a passenger as soon as possible, and passengers always prefer a quick taxi service whenever they need one. The control centre can decide which busy areas to concentrate on. The sensors installed in these vehicles help in automatically discovering new facts, and this data is already used by transport systems for finding time-saving routes, taxi dispatching and other such tasks. By organizing the availability of taxis, more customers can be served in a short time. In this paper we use six different algorithms along with streaming data to improve the performance of demand prediction and of taxi-passenger distribution over a short-term time horizon. We evaluate our method on a dataset from New York City. We do this by dividing the city into smaller areas and then analyzing and predicting the demand in each area. The data set includes around nineteen different features, such as the GPS location, pickup point and drop-off location. The model can be used to predict the demand in the different areas of the city at a particular time, and we show which algorithm gives the best results.

Key Words: Taxi Demand Prediction, Baseline Models, Regression Models, Time Series Data

1. INTRODUCTION

Taxi drivers need to choose a place to wait for passengers so that they can pick someone up quickly. Likewise, passengers need to find a cab quickly. Dispatching taxis efficiently helps both customers and drivers by reducing waiting time for both. In this system, real-time taxi demand prediction is proposed, and streaming data is used to predict the future demand for taxis in a particular area at a particular time. The real-time objectives include managing a large number of taxis in a crowded area, using resources effectively to lessen waiting time, and organizing the available taxis to serve more customers in a short time. Our system uses features such as GPS location, pickup point and drop-off point to predict taxi demand.

Our work focuses on the real-time choice problem of which taxi stand to go to after a passenger drop-off (i.e. where the next passenger can be picked up soonest). The reliability of the taxi network for both companies and clients can be improved with a smart approach to this issue: a clever allotment of vehicles across stands will reduce the average waiting time to pick up a passenger, while the distance travelled will be profitable. Passengers will also experience a lower waiting time to get a taxi, which will be automatically dispatched or directly picked up at a taxi stand.

2. PROPOSED METHODOLOGY

Prediction of taxi demand is a time series analysis problem. The steps involved are: cleaning the data, clustering, applying the Fourier transform, and making predictions using machine learning models. The system is implemented in Python and requires at minimum a Pentium 2.266 MHz processor, 1 GB of RAM and 250 MB of disk space. A collection of libraries such as dask, folium, numpy, pandas and matplotlib is also used. The input data was collected from the New York City Taxi and Limousine Commission's website. The collected data comprised around 7000 examples, which had to be cleaned, and it was brought up to 92% accuracy. The dataset is cleaned during preprocessing: records were removed if the pickup point was outside the city, if the trip lasted more than 24 hours, or if the record was incomplete. Once the cleaned data set is available, it is clustered using the K-means algorithm. All the time series data is then converted into the frequency domain using Fourier transforms to obtain frequencies and amplitudes. These are given as input to various baseline models and regression models, and the accuracy of each model is computed. The model with the best accuracy is selected for prediction. The baseline models used in this system are Simple Moving Average, Weighted Moving Average and Exponential Moving Average, and the regression models are Linear Regression, Random Forest and XGBoost.

A moving average analyzes data points by creating a series of averages of various subsets of the complete data set. It is also called a moving mean or rolling mean, and it is a type of finite impulse response filter. The different types are simple, weighted and exponential. A simple moving average (SMA) is an arithmetic moving average calculated by taking the sum of the recent values and dividing it by the number of values. Short-term averages react quickly to changes in the values of the underlying series, while long-term averages are
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 7338
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 05 | May 2020 www.irjet.net p-ISSN: 2395-0072
slower to react. The formula for SMA is:

SMA = (A_1 + A_2 + ... + A_n) / n

where A_1, ..., A_n are the n most recent values.

An exponentially weighted moving average (EWMA), also known as an exponential moving average (EMA), is a first-order infinite impulse response filter that applies weighting factors which decrease exponentially. The weighting for each older datum decreases exponentially, never reaching zero. Fig. 1 shows the simple and exponential moving average indicators.

Linear regression is a type of regression model. It is a linear approach to modelling the relationship between a scalar response and one or more independent variables. The case with one independent variable is called simple linear regression; the case with more than one independent variable is called multiple linear regression. This is distinct from multivariate linear regression, in which multiple correlated dependent variables are predicted rather than a single scalar variable. In linear regression, the relationships are modelled using linear predictor functions whose unknown parameters are estimated from the data. Such models are called linear models. Given the values of the predictors, linear regression focuses on the conditional probability distribution of the response, rather than on the joint probability distribution of all of these variables. Linear regression is used extensively because models that depend linearly on their unknown parameters are easier to fit than models that are non-linearly related to their parameters. Fig. 3 shows how the values are divided in a simple linear regression.

Fig -4: Random Forest Structure
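As an illustration of the simple linear regression case, the sketch below fits a line with one independent variable. It assumes the parameters are estimated by ordinary least squares, which the paper does not state explicitly. The function names and the sample pickup counts are hypothetical and are not taken from the paper's dataset.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, x_new):
    """Evaluate the fitted line at a new value of the independent variable."""
    return slope * x_new + intercept


# Hypothetical example: pickups in the previous time bin (x) against the current bin (y).
previous_pickups = [10, 20, 30, 40, 50]
current_pickups = [12, 22, 33, 41, 52]
slope, intercept = fit_simple_linear_regression(previous_pickups, current_pickups)
print(predict(slope, intercept, 60))
```

For this sample the fitted slope is approximately 0.99 and the intercept approximately 2.3. A multiple linear regression would extend the same idea to several independent variables and is usually fitted with a library rather than by hand.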
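The three moving-average baselines named in Section 2 (simple, weighted and exponential) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The SMA follows the formula given in the paper. The linear weights used for the weighted average and the smoothing factor alpha used for the exponential average are assumptions, since the paper does not specify them.

```python
def simple_moving_average(values, n):
    """Arithmetic mean of the n most recent values: (A_1 + ... + A_n) / n."""
    if n <= 0 or len(values) < n:
        raise ValueError("need at least n values and n > 0")
    window = values[-n:]
    return sum(window) / n


def weighted_moving_average(values, n):
    """Weighted mean of the n most recent values.

    Assumption: linear weights 1, 2, ..., n, so the newest value counts most.
    """
    if n <= 0 or len(values) < n:
        raise ValueError("need at least n values and n > 0")
    window = values[-n:]
    weights = range(1, n + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


def exponential_moving_average(values, alpha):
    """Recursive EMA: each step blends the new value with the running average.

    Assumption: alpha in (0, 1] is a hypothetical smoothing factor; older values
    receive exponentially smaller weights that never reach zero.
    """
    if not values or not 0 < alpha <= 1:
        raise ValueError("need at least one value and 0 < alpha <= 1")
    ema = values[0]
    for v in values[1:]:
        ema = alpha * v + (1 - alpha) * ema
    return ema


# Hypothetical pickup counts for one area in consecutive time bins.
pickups = [10, 20, 30, 40]
print(simple_moving_average(pickups, 3))         # 30.0
print(weighted_moving_average(pickups, 3))
print(exponential_moving_average(pickups, 0.5))  # 31.25
```

Each function returns a one-step-ahead estimate of demand for the next time bin. A smaller window n, or a larger alpha, makes the baseline react faster to recent changes.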
XGBoost, which stands for eXtreme Gradient Boosting, is a highly flexible and versatile tool that can handle most regression, classification and ranking problems. It can be easily accessed and used through different platforms. The algorithm was developed to reduce computing time and to make efficient use of memory resources. Important features include handling of missing values and support for parallelization in tree construction. Fig. 5 shows an example structure of XGBoost.

3. GENERAL STRUCTURE

[...] only the features required are extracted. After data mining, we perform clustering using the k-means algorithm. The data is then passed as input to the different models, and the predictions are obtained as output. The model which gives the most accurate prediction is selected.

4. CONCLUSIONS

Random Forest Regression appears to be the best model: its train MAPE decreases below 12% with no sign of overfitting or underfitting, whereas the other models appear to overfit slightly. All models have a test MAPE in the range of 12.6% to 13.6%.
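The models are compared by their mean absolute percentage error (MAPE), and the most accurate one is selected. The sketch below shows one way such a comparison could be computed. The function names, model names and numbers are hypothetical and do not reproduce the paper's results.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumption: every actual value is non-zero; time bins with zero demand
    would need separate handling because the ratio is undefined there.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def select_best_model(actual, predictions_by_model):
    """Return (name of the model with the lowest MAPE, dict of all scores)."""
    scores = {name: mape(actual, preds) for name, preds in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical test-set demand and predictions from two models.
actual_demand = [100, 200]
predictions = {
    "model_a": [90, 220],
    "model_b": [100, 210],
}
best, scores = select_best_model(actual_demand, predictions)
print(best, scores)
```

Computing MAPE on both the training and the test split, and comparing the two values, is a common way to check for the overfitting mentioned in the conclusions.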