Document Reference
BY
Project Report
This is to certify that this is the bonafide record of the mini project titled “Uber Fares
Prediction”, done during the second year, first semester, for the completion of the MSc.
“Task successful” makes everyone happy, but that happiness would be gold without glitter if we did not
acknowledge the people who supported us in making it a success. Success is credited to those who made it
a reality, but the people whose constant guidance and encouragement made it possible deserve to be
credited first on the eve of success.
This acknowledgement transcends the reality of formality when I express my deep gratitude and respect
to all those people behind the scenes who guided, inspired and helped me in the completion of my
project work.
I consider myself lucky to have got such a project. This mini project will be an asset to my
academic profile.
I express my thanks to our beloved principal Rev Fr Dr L. Joji Reddy SJ, who has always inspired and
motivated us to perform in a better and more efficient manner.
I would like to express my thankfulness to my project guide Dr. G. Anitha Mary, Head of the
Department, for her constant motivation and valuable help throughout the project work, and I express my
gratitude to her for her supervision, guidance and cooperation throughout the project.
This is to inform that I, Sneha Sharma, a student of MSc. Data Science, have completed the
mini project work on “Uber Fares Prediction” in the year 2022-23.
I have completed my mini project under the guidance of Dr. G. Anitha Mary (HOD),
Department of MSc Data Science.
I hereby declare that this mini project report submitted by me is original work done as a part of
my academic course and has not been submitted to any university or institution for the award of any
degree or diploma.
TABLE OF CONTENTS
1. CHAPTER 1 : PROBLEM STATEMENT
2. CHAPTER 2 : PROCEDURE
3. CHAPTER 3 : EVALUATION OF THE MODEL
4. CHAPTER 4 : DEPLOYMENT
5. CHAPTER 5 : REFERENCES
CHAPTER 1: PROBLEM STATEMENT
The project is about the world's largest taxi company, Uber Inc. In this project, we are looking to predict the
fare for their future transactional cases. Uber delivers its service to lakhs of customers daily, so it becomes
really important to manage the data properly in order to come up with new business ideas and get the best
results. Hence, it becomes really important to estimate the fare prices accurately.
CHAPTER 2: PROCEDURE
According to industry standards, the process of data analysis mainly includes six main steps, and this
process is abbreviated as the CRISP-DM process, the Cross-Industry Standard Process for Data Mining. The six
main steps of the CRISP-DM methodology for developing a model are:
1. Business understanding
2. Data understanding
3. Data preparation/Data Preprocessing
4. Modeling
5. Evaluation
6. Deployment
In this project, all of the above steps are followed to develop the model.
2.1 Business Understanding
It is important to understand the business idea behind the data set. The given data set asks us to
predict the fare amount, and it is important for us to predict the fare amount accurately; otherwise,
there might be a great loss to the revenue of the firm. Thus, we have to concentrate on making the model
as efficient as possible.
In this project, I have used various libraries to perform several tasks, such as:
Numpy
Pandas
Sklearn
Matplotlib
Seaborn
Tkinter
Pandas_Profiling
Datetime
Pickle
Statistics
2.2 Data Understanding
To get the best results and the most effective model, it is really important to understand our data very well.
This dataset is downloaded from Kaggle. The given train data is a CSV file that consists of 9 variables and
200,000 observations.
Table 2.1 : Train Data
From the given train data it is understood that we have to predict the fare amount, and the other variables
will help us achieve that. Here, pickup_latitude/longitude and dropoff_latitude/longitude signify the
locations of pick up and drop off; they describe the starting point and end point of the ride, so these
variables are crucial for us. Passenger_count is another variable, which describes how many passengers
boarded the ride between the pickup and drop off locations. The pickup_datetime variable gives
information about the time the passenger was picked up and the ride started. But unlike the pick up and
drop off locations, which have both start and end details in the given data, the time data has only start
details and no time value or time-related information for the end of the ride. So, during pre-processing of
the data we will drop this variable, as the time information seems incomplete.
2.3 Data Preparation
The next step in the CRISP-DM process is data preprocessing. It is a data mining process that involves the
transformation of raw data into a format that helps us execute our model well, as the data we get is often
incomplete, inconsistent and may contain many errors. Data preprocessing is a generic method to deal with
such issues and to obtain a data format that is easily understood by the machine and that helps in
developing our model in the best way. In this project we have also followed data pre-processing methods to
rectify the errors and issues in our data. This is done using popular data preprocessing techniques, which
are described below.
Note: I have removed the variable “Unnamed : 0” as it is not of much significance; in this data set it
seems it will have no impact on the target variable, and it also leads to redundancy and model accuracy
issues, so I preferred to drop it.
Missing Value Analysis
A missing value is the presence of an incomplete observation in the dataset. This occurs because of reasons
such as incomplete submission, wrong input, manual error, etc. These missing values affect the accuracy of
the model, so it becomes important to check for missing values in our given data.
In the given dataset it is found that there are a few values which are missing. They are found in the following
form:
1. Blank spaces : which are converted to NA and NaN in Python for further operations
After the identification of the missing values, the next step is to impute them. This imputation is normally
done using standard methods such as central tendency imputation (mean/median/mode) or fill-based and
predictive imputation.
To choose the best method, it is necessary for us to check which method predicts values closest to the
original data. This is done by taking a subset of the data, taking an example variable, noting down its
original value, replacing that value with NA, and then applying the available methods. After noting down
the value produced by each method for the example variable, we choose the method which gives the
closest value.
In this project, the backward fill method worked best, so I am using it to impute the missing values; a
sketch is given below.
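A minimal sketch of this step, assuming the train data has already been loaded into the pandas DataFrame data (as in the CODE section):

import pandas as pd

# count the missing values in each column of the train data
data = pd.read_csv("uber.csv")
print(data.isnull().sum())

# backward fill: each missing value is replaced by the next valid observation in the column
data = data.fillna(method='bfill')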
Outlier Analysis
An outlier is an abnormal observation that stands apart or deviates from the other observations. Outliers
occur because of manual error, poor data quality, or data that is correct but exceptional. They can cause
errors in predicting the target variable, so we have to check for outliers in our data set and remove or
replace them wherever required.
In this dataset, I have found some irregular data, which are considered outliers. These are explained
below.
Passenger_count:
Cabs usually have 4 seats, up to a maximum of around 8 seats. But in this dataset I have found passenger
counts larger than this, and in some cases very large values. This seems to be irregular data, or a manual
error. Thus, these values are outliers and need to be removed or treated. A few instances are shown below.
All the outliers mentioned above occur because of manual error, interchange of data, or data that may be
correct but exceptional. All these outliers can hamper our model, so there is a requirement to eliminate or
replace them, and to impute them with proper methods to get better accuracy from the model.
In this project, I used the median method to impute the outliers in passenger count, as sketched below; a
visualization of the passenger count variable after the outliers are filled also follows.
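A minimal sketch of the median imputation used for passenger_count (mirroring the CODE section), assuming the data is already loaded in the DataFrame data:

import numpy as np

# passenger counts above 6 or equal to 0 are treated as outliers and replaced by the median
median = float(data['passenger_count'].median())
data['passenger_count'] = np.where(data['passenger_count'] > 6, median, data['passenger_count'])
data['passenger_count'] = np.where(data['passenger_count'] == 0, median, data['passenger_count'])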
Fare Amount:
I have observed that there are a few outliers present in the fare amount variable as well. In this project, I
used the median method to impute the outliers in fare amount. A visualization of the fare amount variable
after the outliers are filled is shown below. From it we can see that the target variable is highly skewed,
with most data points lying near 0.
Feature Selection
Sometimes all the variables in our data may not be useful enough to predict the target variable. In such
cases we need to analyze and understand our data and select the dataset variables that are most useful for
our model; this is feature selection. Feature selection helps by reducing the computation time of the model
and also reduces the complexity of the model.
After understanding the data, preprocessing it and selecting specific features, new variables can be
engineered, if required, to improve the accuracy of the model.
Feature Extraction :
Distance :
In this project the data contains only the pick up and drop off points as longitudes and latitudes. The
fare_amount will mainly depend on the distance covered between these two points. Thus, we have to
create a new variable before processing the data further. The variable I have created is the distance
variable (Distance), which is a numeric value giving the distance covered between the pick up and drop off
points. After researching, I found a formula called the haversine formula, which determines the distance
between two points on a sphere based on their longitudes and latitudes. This formula calculates the
shortest distance between two points on a sphere.
A sketch of the haversine function, which helped me engineer the new Distance variable, is given below.
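The exact implementation used in the report is not shown, so the following numpy-based version (which also works column-wise on pandas Series, matching the call in the CODE section) is an assumption:

import numpy as np

def haversine(lon1, lon2, lat1, lat2):
    # convert degrees to radians
    lon1, lon2, lat1, lat2 = map(np.radians, [lon1, lon2, lat1, lat2])
    # haversine formula for the great-circle distance between two points on a sphere
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean radius of the Earth in kilometres
    return c * r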
Hour:
Our dataset contains only the pickup datetime variable, so we find out at what hour the journey started
using the datetime functionality in Python, naming it the Hour variable. The extraction is sketched below;
the visualization for the Hour feature is also given below:
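A minimal sketch, assuming pandas is imported as pd and the data is loaded in the DataFrame data (as in the CODE section):

# parse the pickup timestamp and keep only the hour of the day (0-23)
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['Hour'] = data['pickup_datetime'].apply(lambda time: time.hour)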
After executing the haversine function and adding the Hour feature in our project, I got the new variables
Distance and Hour; some instances of the data are shown below.
EXPLORATORY DATA ANALYSIS :
Exploratory data analysis (EDA) is used by data scientists to analyse and investigate datasets and
summarize their main characteristics, often employing data visualization methods. It helps determine
how best to manipulate data sources to get the answers you need, making it easier for data scientists
to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis testing
task, and it provides a better understanding of data set variables and the relationships between them.
It can also help determine whether the statistical techniques you are considering for data analysis are
appropriate. Originally developed by the American mathematician John Tukey in the 1970s, EDA
techniques continue to be a widely used method in the data discovery process today.
Data Visualization:
Data visualization is a graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization techniques provide an accessible way to see and understand
trends, outliers, and patterns in data.
In modern days we have a lot of data in our hands; in this world of Big Data, data visualization tools and
technologies are crucial to analyse massive amounts of information and make data-driven decisions.
It is used in many areas, for example:
To model complex events.
To visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions,
or mathematical relationships.
Univariate Analysis :
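The univariate plots in this section were produced with seaborn; a minimal sketch for two of the variables, assuming the cleaned DataFrame data and mirroring the CODE section:

import matplotlib.pyplot as plt
import seaborn as sns

# distribution of a single continuous variable
sns.distplot(data['fare_amount'], color='green')
plt.title('The distribution of Fare Amount')
plt.show()

# frequency of each value of a discrete variable
sns.countplot(data['passenger_count'])
plt.title('Passenger count per ride')
plt.show()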
Bivariate Analysis :
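The bivariate plots relate pairs of variables; a minimal sketch, again assuming the DataFrame data with the engineered Distance and Hour columns and the imports from the previous sketch:

# fare amount against the distance travelled
sns.relplot(x="Distance", y="fare_amount", data=data, kind="scatter", color='maroon')
plt.show()

# fare amount against the hour of pickup, aggregated as a bar chart
sns.catplot(x="Hour", y="fare_amount", kind="bar", data=data)
plt.show()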
Correlation Analysis:
Some models require the independent variables to be free from collinearity issues. This can be checked by
correlation analysis for the categorical and continuous variables. Correlation analysis is a process used to
identify the level of relationship between two variables.
In this project our target variable is continuous, so we plot a correlation table that shows the strength of
correlation between the independent variables and the ‘fare_amount’ variable.
Correlation Plot:
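A minimal sketch of the correlation plot (as in the CODE section), assuming the DataFrame data with the engineered Distance and Hour columns and the seaborn/matplotlib imports from the earlier sketches:

plt.figure(figsize=(8, 8))
# annot=True writes the correlation coefficient inside each cell of the heatmap
sns.heatmap(data.corr(), annot=True)
plt.show()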
From the correlation plot it is found that most of the variables are highly correlated with each other; for
example, fare amount is highly correlated with the distance variable. The dark blue cells represent pairs of
variables that are highly positively correlated. As there are no dark red cells, which would represent
negative correlation, it can be summarized that our dataset has strong, highly positive correlations
between the variables.
Model Development
After all the above processes, the next step is developing the model based on our prepared data.
In this project our target variable is “fare_amount”, and the model has to predict a numeric value.
Thus, it is identified that this is a regression problem statement. To develop a regression model, the
various models that can be used are:
Decision Tree,
Random Forest,
Linear Regression,
KNN Regression.
Decision Tree
A decision tree is a supervised learning predictive model that uses a set of binary rules to calculate the
target value/dependent variable.
Decision trees are composed of three main parts; these are the root node, the decision (internal) nodes,
and the leaf (terminal) nodes.
In this project the decision tree is applied in Python; the details are described below.
The fit plot shows the criteria used in developing the decision tree in Python. To develop the model in
Python, I have not provided any input argument of my choice, except setting the depth to 2 to visualize
the tree better; all other arguments of the model are left at their defaults. After this, fit_DT is used to
predict on the test data, and the error rate and accuracy are calculated. A sketch is given below.
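A minimal sketch of this step, assuming the train/test split (xtr, xte, ytr, yte) from the CODE section; max_depth=2 is set only so that the plotted tree stays readable:

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# fit a shallow tree so that it can be visualized easily; all other arguments are defaults
fit_DT = DecisionTreeRegressor(max_depth=2)
fit_DT.fit(xtr, ytr)

# draw the fitted tree
plot_tree(fit_DT, feature_names=list(xtr.columns), filled=True)
plt.show()

# accuracy (R-squared as a percentage) on the test data
print("Accuracy of the model is ", r2_score(yte, fit_DT.predict(xte)) * 100)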
Random Forest
The next model followed in this project is the random forest. It is an ensemble learning method for
classification and regression that operates by building a number of decision trees at training time and
outputting the class that is the mode of the classes of the individual trees (or, for regression, the mean of
their predictions).
In this project the random forest is applied in Python; the details are described below.
Random Forest in Python
As with the decision tree above, these are the criteria values used to develop the random forest model
in Python.
Linear Regression
The next method in the process is linear regression. It is used to predict the value of a variable Y based on
one or more input predictor variables X. The goal of this method is to establish a linear relationship
between the predictor variables and the response variable, such that we can use this relationship to
estimate the value of the response Y when only the predictors (X values) are known.
In this project linear regression is applied in Python; the details are described below.
The results show that this model performs very poorly. This may be because the relationship between the
independent and dependent variables is nonlinear.
KNN Regression
The next model is KNN regression. It finds the nearest neighbours of a point and uses them to predict the
target value: the value assigned to a new point is based on how closely this point resembles the points in
the training set. Implementing the KNN methodology is relatively easy compared to the other models. To
implement the KNN model, the KNN function is imported from the “scikit-learn” library in Python. After
that the model is run, and the fitted model is used to predict on the test data. Finally, the error and
accuracy are calculated.
KNN Regression in Python
Model Summary:
The Decision Tree, Random Forest, Linear Regression and KNN methods mentioned above are the various
models that can be developed for the given data. First, the data is divided into train and test sets. Then
the models are developed on the train data. After that, each model is used to predict the target variable
on the test data. After predicting the target variable on the test data, the actual and predicted values of
the target variable are compared to get the error and accuracy. Looking at the error and accuracy rates,
the best model for the data is identified and kept for future use.
CHAPTER 3: EVALUATION OF THE MODEL
Now that we have developed a few models for predicting the target variable, the next step is to identify
which one to choose for deployment. To decide this, according to industry standards, we follow several
criteria. Among these, the accuracy is the one used in our project; RMSE is not used because we are not
working with a timestamp value.
3.1 Accuracy
The method used to identify and compare the better model is accuracy. For classification, accuracy is the
ratio of the number of correct predictions to the total number of predictions made; in this regression
project, the reported accuracy is the R² score of the predictions on the test data, expressed as a
percentage, as sketched below.
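A minimal sketch, assuming a fitted model (for example model_rf) and the test split (xte, yte) from the CODE section:

from sklearn.metrics import r2_score

# compare the predicted and actual fare amounts on the held-out test data
ypred = model_rf.predict(xte)
print("Accuracy of the model is ", r2_score(yte, ypred) * 100)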
CHAPTER 4 : DEPLOYMENT
The Python pickle module is used for serializing and de-serializing Python object structures. The process
of converting any kind of Python object (list, dict, etc.) into a byte stream (0s and 1s) is called pickling,
serialization, flattening or marshalling. We can convert the byte stream (generated through pickling)
back into Python objects by a process called unpickling. In real-world scenarios, pickling and unpickling
are widely used, as they allow us to easily transfer data from one server/system to another and then
store it in a file or database.
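A minimal sketch of pickling and unpickling the trained model; the file name used here is only illustrative and is not taken from the report:

import pickle

# serialize the trained model to a byte stream on disk (pickling)
with open("uber_fare_model.pkl", "wb") as f:   # the file name is an assumption
    pickle.dump(model_rf, f)

# later, for example in the GUI script, unpickle it back into a Python object
with open("uber_fare_model.pkl", "rb") as f:
    duration = pickle.load(f)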
Deployment:
Machine learning deployment is the process of deploying a machine learning model in a live
environment. The model can be deployed across a range of different environments and will often be
integrated with apps through an API. Deployment is a key step in an organization gaining operational
value from machine learning. For this project I deployed the model using the tkinter package in Python.
Python offers multiple options for developing GUIs (graphical user interfaces); out of all the GUI methods,
tkinter is the most commonly used. It is the standard Python interface to the Tk GUI toolkit shipped with
Python, and Python with tkinter is the fastest and easiest way to create GUI applications. Creating a GUI
using tkinter is an easy task.
A snapshot of the output window, that is, the graphical user interface, is given below:
CODE
import numpy as np
import pandas as pd
import tkinter as tk
import pickle
import tkinter.font as font
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import string
import random
import statistics as st
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split,cross_val_score
from pandas_profiling import ProfileReport
from tkintermapview import TkinterMapView
import tkintermapview
from tkinter import Frame
from tkinter import *
from tkinter import messagebox
from tkinter import ttk
data=pd.read_csv("uber.csv")
data.shape
data.columns
data.head()
data=data.drop(['Unnamed: 0'],axis=1)
data.isnull().sum()
data=data.fillna(method='bfill')
data.duplicated().any()
data.passenger_count.value_counts()
sns.boxplot(data['passenger_count'],color='red')
median = float(data['passenger_count'].median())
data["passenger_count"] = np.where(data["passenger_count"] > 6, median, data['passenger_count'])
data["passenger_count"] = np.where(data["passenger_count"] ==0, median, data['passenger_count'])
data.passenger_count.value_counts()
sns.countplot(data['passenger_count'])
data.fare_amount.value_counts()
Finding the largest fare amount paid by the customers
data.fare_amount.nlargest()
med = float(data['fare_amount'].median())
data["fare_amount"] = np.where(data["fare_amount"] < 0, med, data['fare_amount'])
data["fare_amount"] = np.where(data["fare_amount"] > 499, med, data['fare_amount'])
fig,ax=plt.subplots(figsize=(8,8))
sns.distplot(data.fare_amount,color='green')
sns.boxplot(data['fare_amount'],color='blue')
Adding another column in the dataset to find when the customer started his journey
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['Hour'] = data['pickup_datetime'].apply(lambda time: time.hour)
Adding another column to determine how much distance is travelled during the journey.
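# Note: the haversine() helper is not defined anywhere in this listing; the numpy-based
# sketch from Chapter 2 (or an equivalent vectorized function with the same argument
# order) is assumed to have been defined before this call.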
data['Distance'] = haversine(data['pickup_longitude'], data['dropoff_longitude'], data['pickup_latitude'], data['dropoff_latitude'])
data.info()
Viewing the first few rows of the column after feature extraction
data.head()
min_val = data[:].min()
max_val = data[:].max()
min_val
max_val
sns.distplot(data['pickup_longitude'],color='orange')
plt.title('The distribution of Pick up Longitude')
sns.distplot(data['dropoff_longitude'],color='purple')
plt.title('The distribution of Drop off Longitude')
sns.distplot(data['dropoff_latitude'],color='brown')
plt.title('The distribution of drop off Latitude')
sns.distplot(data['pickup_latitude'],color='pink')
plt.title('The distribution of pick up Latitude')
plt.scatter(data['fare_amount'],data['passenger_count'])
plt.xlabel("fare_amount")
plt.ylabel("passenger_count")
sns.catplot(x="Hour", y="fare_amount",kind="bar",data=data)
sns.relplot(x="Distance", y="fare_amount", data=data, kind="scatter",color='maroon')
grids=sns.PairGrid(data)
grids.map(plt.scatter,color='yellow')
data.corr()
plt.figure(figsize=(8,8))
sns.heatmap(data.corr(),annot=True)
x=data[["pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude",'Hour']]
y=data["fare_amount"]
xtr,xte,ytr,yte=train_test_split(x,y,test_size=0.3,random_state=442)
xtr.shape
ytr.shape
model_rf=RandomForestRegressor()
model_rf.fit(xtr,ytr)
ypred=model_rf.predict(xte)
RandomForestRegressor=r2_score(yte,ypred)*100
print("Accuracy of the model is ",RandomForestRegressor)
model_dt=DecisionTreeRegressor()
model_dt.fit(xtr,ytr)
ypred=model_dt.predict(xte)
DecisionTreeRegressor=r2_score(yte,ypred)*100
print("Accuracy of the model is ",DecisionTreeRegressor)
model_knn=KNeighborsRegressor()
model_knn.fit(xtr,ytr)
ypred=model_knn.predict(xte)
KNeighborsRegressor=r2_score(yte,ypred)*100
print("Accuracy of the model is ",KNeighborsRegressor)
model_lr=LinearRegression()
model_lr.fit(xtr,ytr)
ypred=model_lr.predict(xte)
LinearRegression=r2_score(yte,ypred)*100
print("Accuracy of the model is ",LinearRegression)
cv_score=cross_val_score(model_rf,x,y,cv=5)
print("Cross Validation of the model is ",cv_score)
mean_accuracy=sum(cv_score)/len(cv_score)
mean_accuracy=mean_accuracy*100
mean_accuracy=round(mean_accuracy,2)
print("Mean Accuracy of the model is ",mean_accuracy)
Front End
def predict():
    # read the pickup/dropoff coordinates and the hour from the GUI widgets
    pickup_longitude = float(scale1.get())
    pickup_latitude = float(scale2.get())
    dropoff_longitude = float(scale3.get())
    dropoff_latitude = float(scale4.get())
    if len(cb2.get()) == 0:
        messagebox.showerror("details missing", "please enter the details")
        return
    Hour = int(cb2.get())
    # assemble the model input; 'inp' was not shown in the original listing, so its
    # construction here is an assumption based on the features used for training
    inp = np.array([pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, Hour])
    prediction = int(duration.predict(inp.reshape(1, 5))[0])
    t.insert('1.0', prediction)
def reset():
    # clear all the input widgets and the output text box
    scale1.set(0)
    scale2.set(0)
    scale3.set(0)
    scale4.set(0)
    cb2.delete('0', 'end')
    t.delete('1.0', 'end')

def close():
    root.destroy()
Creating a GUI
root = tk.Tk()
# 'duration' (used by predict) is assumed here to be the trained random forest model;
# in the report it was presumably saved and re-loaded with pickle (see Chapter 4)
duration = model_rf
label1_font = font.Font(family='Helvetica', size=30, weight='bold')
label1 = tk.Label(root, text="UBER FARE PREDICTION", fg="dark slate blue", bg="skyblue", font=label1_font)
label1.pack(anchor=tk.CENTER)
my_label=LabelFrame(root)
my_label.pack(side=BOTTOM)
map_widget=tkintermapview.TkinterMapView(my_label,width=1500,height=440,corner_radius=0)
map_widget.set_position(40.71278,-74.00594)
map_widget.pack(side=BOTTOM)
map_widget.set_address("Auckland,NewZealand",marker=True)
map_widget.set_zoom(75)
v = DoubleVar()
label1 = Label(root)
label1.place(x=100,y=110)
w = DoubleVar()
label2 = Label(root)
label2.place(x=350,y=110)
x = DoubleVar()
label3 = Label(root)
label3.place(x=600,y=110)
y = DoubleVar()
label4 = Label(root)
label4.place(x=850,y=110)
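# Note: the Scale widgets scale1-scale4 (read by predict() and reset() above) are created
# in this part of the original GUI; their definitions were not captured in this listing.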
def my_upd2():
    cb2.set('1')
    label8.config(text=cb2.get() + ':' + str(cb2.current()))
months1=['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23']
cb2 = ttk.Combobox(root, values=months1,width=15)
cb2.place(x=1150,y=120)
b2=tk.Button(root,text="set('1')", command=lambda: my_upd2())
b2.place(x=1080,y=120)
label8=tk.Label(root,text='hour',fg='green',font=15)
label8.place(x=1040,y=120)
print(cb2.get())
t =tk.Text(root,highlightbackground='green',fg="white",bg="azure4",height=1,width=25,font=15)
t.place(x=550,y=180)
button2_font= font.Font(size=15,weight='bold')
button2 = tk.Button(root,text='RESET',bg="cyan",fg="white",command=reset,font=button2_font)
button2.place(x=520,y=220)
button3_font= font.Font(size=15,weight='bold')
button3 = tk.Button(root,text='CLOSE',bg="CadetBlue1",fg="white",command=close,font=button3_font)
button3.place(x=620,y=220)
root.mainloop()
CHAPTER 5: REFERENCES
Websites:
www.edwisor.com : Videos from Mentor
https://rpubs.com/ : Coding Doubts
https://stackoverflow.com/questions/51488949/use-haversine-package-to-compare-all-distances-possibilities-of-a-csv-list-of-lo : Haversine Doubt
https://gist.github.com/rochacbruno/2883505 : Calculate Haversine in Python
Video Channels:
https://www.youtube.com/watch?v=7YfyIhhmwq4 : Distance development in Python
https://www.youtube.com/watch?v=Uct_EbThV1E&list=PLZ7s-Z1aAtmIbaEj_PtUqkqdmI1k7libK&index=2&t=690s : Python Coding
THANK YOU