
LOYOLA ACADEMY DEGREE & PG COLLEGE

(An Autonomous Degree College affiliated to Osmania University)

Accredited with “A” grade by NAAC

A “College with Potential for Excellence” by UGC

Alwal, Secunderabad -500010


2022-2023

MINI PROJECT REPORT


ON
“UBER FARES PREDICTION”

BY

Sneha Sharma 111721720007

Department of MSc. Data Science


LOYOLA ACADEMY DEGREE & PG COLLEGE
(An autonomous Degree College affiliated to Osmania University)
Accredited with “A” grade by NAAC
A COLLEGE WITH POTENTIAL FOR EXCELLENCE BY UGC
Alwal, Secunderabad-500010

Project Report

This is to certify that this is the bonafide record of the mini project titled "Uber Fares Prediction",
done during the second year, first semester, for the completion of the MSc. Data Science degree by

Sneha Sharma (UID : 111721720007)

Internal Examiner Principal External Examiner


ACKNOWLEDGEMENT

"Task successful" makes everyone happy, but the happiness is gold without glitter if we do not acknowledge
the people who supported us in making it a success.

Success is credited to the people who made it a reality, but the people whose constant guidance and
encouragement made it possible deserve to be credited first on the eve of success.

This acknowledgement transcends the reality of formality when I express my deep gratitude and respect to
all the people behind the screen who guided, inspired and helped me in the completion of my project work.

I consider myself lucky to get such a project. This mini project will be an asset to my academic profile.

I express my thanks to our beloved principal, Rev Fr Dr L. Joji Reddy SJ, who always inspires and
motivates us to perform in a better and more efficient manner.

I would like to express my thankfulness to my project guide, Dr. G. Anitha Mary, Head of the Department,
for her constant motivation and valuable help throughout the project work, and I express my gratitude
for her supervision, guidance and cooperation throughout the project.

Sneha Sharma (111721720007)


DECLARATION

This is to declare that I, Sneha Sharma, a student of MSc. Data Science, have completed the
mini project work on "Uber Fares Prediction" in the year 2022-23.

I have completed my mini project under the guidance of Dr. G. Anitha Mary (HOD),
Department of MSc. Data Science.

I hereby declare that this mini project report submitted by me is original work done as a part of
my academic course and has not been submitted to any other university or institution for the award of any
degree or diploma.

Sneha Sharma (UID:111721720007)

TABLE OF CONTENTS

1. CHAPTER 1 : PROBLEM STATEMENT

2. CHAPTER 2 : PROCEDURE
   o 2.1 Business Understanding
   o 2.2 Data Understanding
   o 2.3 Data Preparation
      Missing Value Analysis
      Outlier Analysis
      Feature Selection
      Exploratory Data Analysis
   o 2.4 Model Development
      Decision Tree
      Random Forest
      Linear Regression
      KNN

3. CHAPTER 3 : EVALUATION OF THE MODEL
   o 3.1 Accuracy
   o 3.2 Cross Validation
   o 3.3 Model Selection

4. CHAPTER 4 : DEPLOYMENT

5. CHAPTER 5 : CONCLUSION

6. CHAPTER 6 : REFERENCES

CHAPTER 1: PROBLEM STATEMENT

The project is about the world's largest taxi company, Uber Inc. In this project, we are looking to predict the
fare for its future transactional cases. Uber serves lakhs of customers daily, so it becomes really important
to manage this data properly to come up with new business ideas and get the best results. In particular, it
becomes important to estimate the fare prices accurately.

CHAPTER 2: PROCEDURE

According to industry standards, the process of data analysis mainly includes six steps, and this process is
abbreviated as CRISP-DM, the Cross-Industry Standard Process for Data Mining. The six main steps of the
CRISP-DM methodology for developing a model are:

1. Business understanding
2. Data understanding
3. Data preparation/Data Preprocessing
4. Modeling
5. Evaluation
6. Deployment

And in this project, all the above steps are followed to develop the model.

2.1 Business Understanding

It is important to understand the business idea behind the data set. The given data set asks us to
predict the fare amount, and it really becomes important for us to predict the fare amount accurately;
otherwise, there might be a great loss to the revenue of the firm. Thus, we have to concentrate on making
the model as efficient as possible.

Importing Necessary Libraries

In this project, I have used various libraries to perform several tasks, such as:
 Numpy
 Pandas
 Sklearn
 Matplotlib
 Seaborn
 Tkinter
 Pandas_Profiling
 Datetime
 Pickle
 Statistics

2.2 Data Understanding

To get the best results and the most effective model, it is really important to understand our data well.
This dataset is downloaded from Kaggle. The given train data is a CSV file that consists of 9 variables and
200,000 observations.

A snapshot of the data is provided.

Table 2.1 : Train Data

The different variables of the data are:

fare_amount : fare of the given cab ride.
pickup_datetime : timestamp value explaining the time the ride started.
pickup_longitude : a float value explaining the longitude of the ride start.
pickup_latitude : a float value explaining the latitude of the ride start.
dropoff_longitude : a float value explaining the longitude of the ride end.
dropoff_latitude : a float value explaining the latitude of the ride end.
passenger_count : an integer indicating the number of passengers.
key : a unique identifier for each trip.

Following table explains how the variables are categorized.

Independent Variables            Dependent/Target Variable
pickup_datetime                  fare_amount
pickup_longitude
pickup_latitude
dropoff_longitude
dropoff_latitude
passenger_count

Table 2.2 : Independent Variables        Table 2.3 : Dependent/Target Variable

From the given train data it is understood that we have to predict fare_amount, and the other variables will
help achieve that. Here pickup_latitude/longitude and dropoff_latitude/longitude signify the locations of
pick-up and drop-off, that is, the starting and ending points of the ride, so these variables are crucial for us.
passenger_count is another variable that explains how many passengers boarded the ride between the
pick-up and drop-off locations. pickup_datetime gives information about the time the passenger was picked
up and the ride started. Unlike the pick-up and drop-off locations, which have both start and end details in
the given data, the time data has only the start of the ride and no information about the end of the ride. So,
during pre-processing, only an hour-of-day feature is extracted from this variable and the raw timestamp is
then dropped, as the end-of-ride time information is incomplete.
2.3 Data Preparation

The next step in the CRISP-DM process is data preprocessing. It is a data mining step that transforms raw
data into a format that helps our model run well, because the data we get is often incomplete, inconsistent,
and may contain many errors. Data preprocessing is a generic way to deal with such issues and obtain a
format that is easily understood by the machine and that helps develop the model in the best way. In this
project, data pre-processing methods are followed to rectify the errors and issues in the data, using the
popular techniques described below.

Note: I have removed the variable "Unnamed: 0" as it is not of much significance; in this data set it has no
impact on the target variable, and keeping it would only add redundancy and affect model accuracy, so I
preferred to drop it.

Missing Value Analysis

A missing value is an incomplete observation in the dataset. Missing values occur because of reasons like
incomplete submission, wrong input, or manual error, and they affect the accuracy of the model. So, it
becomes important to check for missing values in the given data.

Missing Value Analysis in Given Data:

In the given dataset it is found that a few values are missing, in the following form:

1. Blank spaces : these are converted to NA and NaN in Python for further operations

Impute the missing value:

After identifying the missing values, the next step is to impute them. This imputation is normally done by
one of the following methods:

1. Central tendencies: using the mean, median or mode
2. Distance-based or data mining methods, like KNN imputation
3. Prediction-based: using a predictive machine learning algorithm

To choose the best method, it is necessary to check which method predicts values closest to the original
data. This is done by taking a subset of the data, choosing an example variable, noting its original value,
replacing that value with NA, and then applying each of the available methods; the method whose imputed
value is closest to the original is chosen. In this project the simple fill methods worked well, so I am using
the backward fill method (filling each missing value with the next valid observation) to impute missing
values. A minimal sketch of such a comparison is given below.
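The following is a minimal sketch of such a method comparison, assuming the train data has already been
loaded into a DataFrame named data (as in the code listing in Chapter 4); the row chosen and the helper
names are purely illustrative.

import numpy as np
import pandas as pd

# Pick a row whose fare_amount is known, hide it, and see which method recovers it best.
# Assumes `data` is the Uber train DataFrame described in Table 2.1.
sample_idx = data['fare_amount'].dropna().index[10]
true_value = data.loc[sample_idx, 'fare_amount']

trial = data['fare_amount'].copy()
trial.loc[sample_idx] = np.nan              # artificially introduce a missing value

candidates = {
    'mean':   trial.fillna(trial.mean()),
    'median': trial.fillna(trial.median()),
    'bfill':  trial.fillna(method='bfill'),
}

for name, filled in candidates.items():
    error = abs(filled.loc[sample_idx] - true_value)
    print(name, 'imputes', filled.loc[sample_idx], '| absolute error =', error)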

Outlier Analysis

An outlier is an abnormal observation that deviates markedly from the other observations. Outliers arise
because of manual error or poor data quality, or they may be correct but exceptional data. Either way, they
can cause errors in predicting the target variable, so we have to check for outliers in our data set and
remove or replace them wherever required.

Outliers in this project:

In this dataset I have found some irregular data that are treated as outliers. They are explained below.

Passenger_count:

Cabs typically have 4 to a maximum of 8 seats, but in this dataset I have found passenger counts larger
than this, in some cases very large values. This looks like irregular data or manual error, so these
observations are outliers and need to be removed or treated. A few instances follow.

Table 2.7 : Passenger_count Outliers

All the outliers mentioned above arise from manual error, interchanged data, or possibly correct but
exceptional values. Such outliers can hamper the model, so they need to be eliminated or replaced and
imputed with proper methods to get better model accuracy. In this project I used the median to impute the
outliers in passenger_count; a minimal sketch of this replacement is given next, and a visualization of the
variable after the outliers are filled is shown below.
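A minimal sketch of this median replacement (the same logic appears in the full code listing in Chapter 4);
counts above 6 and counts of 0 are treated as outliers, and data is assumed to be the loaded train DataFrame.

import numpy as np

# Replace implausible passenger counts with the column median.
median_count = float(data['passenger_count'].median())
data['passenger_count'] = np.where(data['passenger_count'] > 6, median_count, data['passenger_count'])
data['passenger_count'] = np.where(data['passenger_count'] == 0, median_count, data['passenger_count'])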

Fare Amount:

I have observed that there are a few outliers in the fare_amount variable as well. In this project I used the
median to impute the outliers in fare_amount. A visualization of the variable after the outliers are filled is
shown below; from it we can see that the target variable is highly skewed, with most data points lying near 0.

Feature Selection

Sometimes not all the variables in our data are useful for predicting the target variable. In such cases we
need to analyze and understand our data and select the variables that are most useful for the model; this is
feature selection. Feature selection reduces the computation time of the model and also reduces its
complexity.

After understanding the data, preprocessing it and selecting specific features, new variables can be
engineered if required to improve the accuracy of the model.

Feature Extraction :

Distance :

In this project the data contains only the pick-up and drop-off points as longitudes and latitudes. The
fare_amount will mainly depend on the distance covered between these two points, so we have to create a
new variable before processing the data further. The variable I have created is the Distance variable (dist),
a numeric value giving the distance covered between the pick-up and drop-off points. After some research
I found the haversine formula, which determines the distance between two points on a sphere from their
longitudes and latitudes; it calculates the shortest (great-circle) distance between two points on a sphere.

The haversine function used to engineer the new Distance variable appears in the code listing; its formula
is given below.
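For reference, with the pick-up point at latitude \varphi_1 and longitude \lambda_1, the drop-off point at
latitude \varphi_2 and longitude \lambda_2 (all in radians), and R \approx 6371 km as the Earth's radius, the
haversine distance implemented in the code listing is:

d = 2R \arcsin\left( \sqrt{ \sin^2\frac{\varphi_2 - \varphi_1}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\frac{\lambda_2 - \lambda_1}{2} } \right)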

Hour:

Our dataset contains only the pickup_datetime variable, so we find out at what hour the journey started
using the datetime functionality in Python and name it the Hour variable. The visualization for the hour
feature is given below.

After executing the haversine function and extracting the hour feature, the new Distance and Hour variables
are obtained; some instances of the data are shown below.
Table 2.10: Engineering new Variable Distance

EXPLORATORY DATA ANALYSIS :

Exploratory data analysis (EDA) is used by data scientists to analyse and investigate datasets and
summarize their main characteristics, often employing data visualization methods. It helps determine
how best to manipulate data sources to get the answers you need, making it easier for data scientists
to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis-testing
task and provides a better understanding of the data set variables and the relationships between them.
It can also help determine whether the statistical techniques you are considering for data analysis are
appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques
continue to be widely used in the data discovery process today.

Data Visualization:

Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization techniques provide an accessible way to see and understand
trends, outliers, and patterns in data.
In the modern world of Big Data, data visualization tools and technologies are crucial to analyse massive
amounts of information and make data-driven decisions.
It is used in many areas, for example:
 To model complex events.
 To visualize phenomena that cannot be observed directly, such as weather patterns,
medical conditions, or mathematical relationships.

Univariate Analysis :

Bivariate Analysis :

Correlation Analysis:

Some models require the independent variables to be free from collinearity issues. This can be checked by
correlation analysis for the categorical and continuous variables; correlation analysis is a process that
identifies the level of relationship between two variables.

In this project the target variable is continuous, so we plot a correlation table that shows the correlation
strength between the independent variables and the 'fare_amount' variable.

Correlation Plot:

Graph 2.2 : Correlation Plot

From the above plot it is found that most of the variables are highly correlated with each other; for example,
fare_amount is highly correlated with the Distance variable. The dark blue cells represent pairs of variables
that are highly correlated, and as there are no dark red cells, which would represent negative correlation, it
can be summarized that the dataset has strong, highly positive correlations between the variables.

Model Development

After all the above processes, the next step is developing the model on the prepared data.
In this project the target variable is "fare_amount" and the model has to predict a numeric value, so this is
identified as a regression problem. The models used to develop a regression model here are:
 Decision Tree
 Random Forest
 Linear Regression
 KNN regression

Decision Tree

A decision tree is a supervised learning predictive model that uses a set of binary rules to calculate the
target value (dependent variable). A decision tree has three main parts:

Root node : performs the first split.
Terminal nodes : the nodes that predict the outcome, also called leaf nodes.
Branches : the arrows connecting the nodes, showing the flow from the root to the leaves.

In this project the decision tree is applied in Python, as described below.

Decision Tree in Python

The fitted tree plot shows the criteria used in developing the decision tree in Python. To develop the model,
I have not provided any input arguments of my own except setting the depth to 2, to visualize the tree better;
all other arguments are left at their defaults. After this, the fitted tree (fit_DT) is used to predict on the test
data, and the error rate and accuracy are calculated. A minimal sketch of this step follows.
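The following is a minimal sketch of this step, assuming the train/test split variables xtr, xte, ytr and yte from
the code listing in Chapter 4; the depth-2 tree here is only for easier visualization, while the accuracy reported
in Chapter 3 comes from a tree with default settings.

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Shallow tree (max_depth=2) purely to keep the plotted structure readable.
fit_DT = DecisionTreeRegressor(max_depth=2)
fit_DT.fit(xtr, ytr)

plt.figure(figsize=(10, 6))
plot_tree(fit_DT, feature_names=list(xtr.columns), filled=True)
plt.show()

# Predict on the test data and report the R^2 score as a percentage.
pred_DT = fit_DT.predict(xte)
print("Accuracy of the model is", r2_score(yte, pred_DT) * 100)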

Random Forest

The next model used in this project is Random Forest. It is an ensemble learning method for classification
and regression that operates by building a number of decision trees at training time and outputting the class
that is the mode of the individual trees' classes (for classification) or the mean of their predictions (for
regression).

In this project Random Forest is applied in Python, as described below.

Random Forest in Python

As with the decision tree above, these are the criteria values used to develop the Random Forest model
in Python.

Linear Regression

The next method in the process is linear regression. It is used to predict the value of a variable Y based on
one or more input predictor variables X. The goal of this method is to establish a linear relationship between
the predictor variables and the response variable, such that the formula can be used to estimate the value
of the response Y when only the predictors (X values) are known.

In this project Linear Regression is applied in Python, as described below.

Linear Regression in Python

After the model is developed, the following details are found.

They show that the model performs very poorly. This may be because the relationship between the
independent and dependent variables is nonlinear.

KNN regression model

The next model is KNN. It finds the nearest neighbours of a point and uses them to predict the target value:
the value for a new point is assigned on the basis of how closely the point resembles points in the training
set. Implementing KNN is a little easier compared with the other models. To implement it, the KNN regressor
is imported from the "Scikit-Learn" library in Python; the model is then run, and the fitted model is used to
predict on the test data. Finally, the error and accuracy are calculated. The prediction for a new point is
simply the average fare of its k nearest neighbours in the training data, as shown below.
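For reference, with N_k(x) denoting the set of the k training points nearest to a query point x, the KNN
regression prediction is the average of their fares:

\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i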

KNN Regression in Python

Model Summary:

The Decision Tree, Random Forest, Linear Regression and KNN methods mentioned above are the models
developed for the given data. First, the data is divided into train and test sets, and the models are developed
on the train data. Each model is then used to predict the target variable on the test data, and the actual and
predicted values of the target variable are compared to get the error and accuracy. Looking at these error
and accuracy rates, the best model for the data is identified and kept for future use.

CHAPTER 3: EVALUATION OF THE MODEL

We have now developed a few models for predicting the target variable; the next step is to identify which
one to choose for deployment. According to industry standards, several criteria can be used to decide this;
in this project the accuracy of each model is calculated and compared. RMSE was not used in this project.

3.1 Accuracy
The method used to identify and compare the models is accuracy. For classification, accuracy is the ratio of
the number of correct predictions to the total number of predictions made:

Accuracy = number of correct predictions / total number of predictions made

In this regression project, the "accuracy" reported below is the R² (coefficient of determination) score of the
predictions expressed as a percentage, as computed in the code listing; its formula is given below.
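For reference, the R² score behind these percentages is

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where y_i are the actual fares, \hat{y}_i the predicted fares and \bar{y} the mean of the actual fares.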

Method                 Accuracy (in percentage)
Decision Tree          72.09
Random Forest          76.20
Linear Regression      72.56
KNN                    65.80

Table 3.4: Accuracy of the models in Python

3.2 Cross Validation


Cross-validation is a technique for evaluating ML models by training several models on subsets of the
available input data and evaluating them on the complementary subsets. Cross-validation is used to detect
overfitting, i.e., failing to generalize a pattern.

Cross-validation for this project is calculated as sketched below:
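The following is a minimal sketch mirroring the cross-validation call in the code listing (5-fold
cross-validation of the Random Forest model, with model_rf, x and y as defined there):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the Random Forest regressor on the full feature matrix.
cv_score = cross_val_score(model_rf, x, y, cv=5)
print("Cross validation scores of the model:", cv_score)

# Mean cross-validated score, expressed as a percentage.
mean_accuracy = round(cv_score.mean() * 100, 2)
print("Mean accuracy of the model is", mean_accuracy)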

3.3 Model Selection


After comparing the accuracies, the next step is the selection of the most effective model.
From the error and accuracy values it is found that all the models perform close to each other; in such a
case any of them could be used for further processing, but Random Forest gives better results than the
other methods, so I prefer the Random Forest model for the further steps.

CHAPTER 4 : DEPLOYMENT

Pickling the File:

The Python pickle module is used for serializing and de-serializing Python object structures. The process of
converting any kind of Python object (list, dict, etc.) into a byte stream is called pickling, serialization,
flattening or marshalling. We can convert the byte stream (generated through pickling) back into Python
objects by a process called unpickling. In real-world scenarios, pickling and unpickling are widespread
because they allow us to easily transfer data from one server or system to another and then store it in a
file or database.

Deployment:

Machine learning deployment is the process of making a machine learning model available in a live
environment. The model can be deployed across a range of different environments and is often integrated
with apps through an API. Deployment is a key step in an organization gaining operational value from
machine learning. For this project I deployed the model using the tkinter package in Python. Python offers
multiple options for developing a GUI (Graphical User Interface); of these, tkinter is the most commonly
used. It is the standard Python interface to the Tk GUI toolkit shipped with Python, and it is the fastest and
easiest way to create GUI applications.

A snapshot of the output window, i.e., the graphical user interface, is given below:

CODE

Importing all the necessary packages required for the project

import numpy as np
import pandas as pd
import tkinter as tk
import pickle
import tkinter.font as font
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import string
import random
import statistics as st
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split,cross_val_score
from pandas_profiling import ProfileReport
from tkintermapview import TkinterMapView
import tkintermapview
from tkinter import Frame
from tkinter import *
from tkinter import messagebox

Loading the data

data=pd.read_csv("uber.csv")

Finding the shape of the data

data.shape

Finding the no of columns in the data

data.columns

Viewing the first few rows of the dataset

data.head()

Dropping the unnamed column

data=data.drop(['Unnamed: 0'],axis=1)

Checking for missing values

data.isnull().sum()

Filling the missing values

data=data.fillna(method='bfill')

Checking if there are any duplicates in the data

data.duplicated().any()

Checking for outliers

data.passenger_count.value_counts()

Visualizing using boxplot

sns.boxplot(data['passenger_count'],color='red')

Finding the central tendencies to treat outliers

print("Mean of fare prices is % s "


% (st.mean(data['passenger_count'])))

print("Median of fare prices is % s "


% (st.median(data['passenger_count'])))

print("Standard Deviation of Fare Prices is % s "


% (st.stdev(data['passenger_count'])))

Treating the outliers by replacing them with median

median = float(data['passenger_count'].median())
data["passenger_count"] = np.where(data["passenger_count"] > 6, median, data['passenger_count'])
data["passenger_count"] = np.where(data["passenger_count"] ==0, median, data['passenger_count'])

Checking the value count after treating outliers

data.passenger_count.value_counts()

Visualizing after treating outliers

sns.countplot(data['passenger_count'])

Checking for outliers

data.fare_amount.value_counts()
Finding the largest fare amount paid by the customers

data.fare_amount.nlargest()

Finding the central tendencies to treat outliers

print("Mean of fare prices is % s "


% (st.mean(data['fare_amount'])))

print("Median of fare prices is % s "


% (st.median(data['fare_amount'])))

print("Standard Deviation of Fare Prices is % s "


% (st.stdev(data['fare_amount'])))

Treating the outliers by replacing them with median

med = float(data['fare_amount'].median())
data["fare_amount"] = np.where(data["fare_amount"] < 0, med, data['fare_amount'])
data["fare_amount"] = np.where(data["fare_amount"] > 499, med, data['fare_amount'])

Visualizing using distplot after treating outliers

fig,ax=plt.subplots(figsize=(8,8))
sns.distplot(data.fare_amount,color='green')

Visualizing using boxplot after treating outliers

sns.boxplot(data['fare_amount'],color='blue')

Adding another column in the dataset to find when the customer started his journey

data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['Hour'] = data['pickup_datetime'].apply(lambda time: time.hour)

Visualizing the created column using histogram

plt.rcParams["figure.figsize"] = [7.50, 3.50]


plt.rcParams["figure.autolayout"] = True
fig, ax = plt.subplots()
N, bins, patches = ax.hist(data.Hour, edgecolor='black', linewidth=1)
plt.xlabel('Hour')
for i in range(len(N)):
patches[i].set_facecolor("#" + ''.join(random.choices("ABCDEF" + string.digits, k=6)))
plt.show()

Adding another column to determine how much distance is travelled during the journey.

def haversine(lon_1, lon_2, lat_1, lat_2):
    # Convert degrees to radians
    lon_1, lon_2, lat_1, lat_2 = map(np.radians, [lon_1, lon_2, lat_1, lat_2])
    diff_lon = lon_2 - lon_1
    diff_lat = lat_2 - lat_1
    # Haversine formula with 6371 km as the Earth's radius
    km = 2 * 6371 * np.arcsin(np.sqrt(np.sin(diff_lat/2.0)**2 +
          np.cos(lat_1) * np.cos(lat_2) * np.sin(diff_lon/2.0)**2))
    return km

data['Distance'] = haversine(data['pickup_longitude'], data['dropoff_longitude'],
                             data['pickup_latitude'], data['dropoff_latitude'])

Identifying the type of each column in the data

data.info()

Viewing the first few rows of the column after feature extraction

data.head()

Finding the minimum and maximum value of each column

min_val = data[:].min()
max_val = data[:].max()
min_val
max_val

Univariate Data Analysis

fig = plt.figure(figsize=(10, 7))
sns.boxplot(data['Distance'], color='orange')

sns.distplot(data['pickup_longitude'],color='orange')
plt.title('The distribution of Pick up Longitude')

sns.distplot(data['dropoff_longitude'],color='purple')
plt.title('The distribution of Drop off Longitude')

sns.distplot(data['dropoff_latitude'],color='brown')
plt.title('The distribution of drop off Latitude')

sns.distplot(data['pickup_latitude'],color='pink')
plt.title('The distribution of pick up Latitude')

Bivariate Data Analysis

plt.scatter(data['fare_amount'],data['passenger_count'])
plt.xlabel("fare_amount")
plt.ylabel("passenger_count")

sns.catplot(x="Hour", y="fare_amount",kind="bar",data=data)
sns.relplot(x="Distance", y="fare_amount", data=data, kind="scatter",color='maroon')

Visualizing the whole dataset using pair plot

grids=sns.PairGrid(data)
grids.map(plt.scatter,color='yellow')

Creating a whole report for the data

report = ProfileReport(data, title="Pandas Profiling Report")
report

data.describe()

Finding the correlation between the columns in the dataset

data.corr()

Visualizing the correlation between the columns using a heatmap

plt.figure(figsize=(8,8))
sns.heatmap(data.corr(),annot=True)

Defining the training and testing variables

x=data[["pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude",'Hour']]
y=data["fare_amount"]

Splitting the data for testing and training

xtr,xte,ytr,yte=train_test_split(x,y,test_size=0.3,random_state=442)

Viewing the shape of x train

xtr.shape

Viewing the shape of y train

ytr.shape

Fitting the model using Random Forest

model_rf = RandomForestRegressor()
model_rf.fit(xtr, ytr)
ypred = model_rf.predict(xte)
rf_accuracy = r2_score(yte, ypred) * 100
print("Accuracy of the model is ", rf_accuracy)

Fitting the model using Decision Tree

model_dt = DecisionTreeRegressor()
model_dt.fit(xtr, ytr)
ypred = model_dt.predict(xte)
dt_accuracy = r2_score(yte, ypred) * 100
print("Accuracy of the model is ", dt_accuracy)

Fitting the model using KNeighbors

model_knn = KNeighborsRegressor()
model_knn.fit(xtr, ytr)
ypred = model_knn.predict(xte)
knn_accuracy = r2_score(yte, ypred) * 100
print("Accuracy of the model is ", knn_accuracy)

Fitting the model using Linear Regression

model_lr = LinearRegression()
model_lr.fit(xtr, ytr)
ypred = model_lr.predict(xte)
lr_accuracy = r2_score(yte, ypred) * 100
print("Accuracy of the model is ", lr_accuracy)

Finding the cross validation score

cv_score=cross_val_score(model_rf,x,y,cv=5)
print("Cross Validation of the model is ",cv_score)

Finding the mean accuracy

mean_accuracy=sum(cv_score)/len(cv_score)
mean_accuracy=mean_accuracy*100
mean_accuracy=round(mean_accuracy,2)
print("Mean Accuracy of the model is ",mean_accuracy)

Saving the model using pickle file

with open('model_li.pkl', 'wb') as files:
    pickle.dump(model_rf, files)

with open('model_li.pkl', 'rb') as file:
    duration = pickle.load(file)

Front End

def predict():
    # Warn the user and stop if the hour combobox is empty.
    if len(cb2.get()) == 0:
        messagebox.showerror("details missing", "please enter the details")
        return
    pickup_longitude = float(scale1.get())
    pickup_latitude = float(scale2.get())
    dropoff_longitude = float(scale3.get())
    dropoff_latitude = float(scale4.get())
    Hour = int(cb2.get())
    inp = np.array([pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, Hour])
    prediction = int(duration.predict(inp.reshape(1, 5)))
    t.insert('1.0', prediction)

def reset():
    scale1.set(0)
    scale2.set(0)
    scale3.set(0)
    scale4.set(0)
    cb2.delete('0', 'end')
    t.delete('1.0', 'end')

def close():
    root.destroy()

Creating a GUI

from tkinter import ttk


root = tk.Tk()
root.configure(background="lavender")
root.title('fare prediction')
root.geometry("1080x768")

label1_font=font.Font(family='Helvetica',size=30,weight='bold')
label1 = tk.Label(text="UBER FARE PREDICTION",fg="dark slate blue",bg="skyblue",font=label1_font)
label1.pack(anchor=tk.CENTER)

my_label=LabelFrame(root)
my_label.pack(side=BOTTOM)
map_widget=tkintermapview.TkinterMapView(my_label,width=1500,height=440,corner_radius=0)
map_widget.set_position(40.71278,-74.00594)
map_widget.pack(side=BOTTOM)
map_widget.set_address("Auckland,NewZealand",marker=True)
map_widget.set_zoom(75)

label9 = Label(root, text="Please select from the following :", font=15)
label9.place(x=10, y=50)

v = DoubleVar()
scale1 = Scale(root, variable=v, from_=-1340.64841, to=57.418457,
               orient=HORIZONTAL, sliderlength=40, troughcolor="skyblue", relief=RAISED)
scale1.place(x=100, y=120)

btn1 = Button(root, text="Pickup Longitude", width=25, bg="darkblue", fg="white")
btn1.place(x=50, y=90)

label1 = Label(root)
label1.place(x=100, y=110)

w = DoubleVar()
scale2 = Scale(root, variable=w, from_=-74.015515, to=1644.421482,
               orient=HORIZONTAL, sliderlength=40, troughcolor="skyblue", relief=RAISED)
scale2.place(x=350, y=120)

btn2 = Button(root, text="Pickup Latitude", bg="darkblue", fg="white", width=25)
btn2.place(x=300, y=90)

label2 = Label(root)
label2.place(x=350, y=110)

x = DoubleVar()
scale3 = Scale(root, variable=x, from_=-3356.6663, to=1153.572603,
               orient=HORIZONTAL, sliderlength=40, troughcolor="skyblue", relief=RAISED)
scale3.place(x=600, y=120)

btn3 = Button(root, text="Dropoff Longitude", bg="darkblue", fg="white", width=25)
btn3.place(x=550, y=90)

label3 = Label(root)
label3.place(x=600, y=110)

y = DoubleVar()
scale4 = Scale(root, variable=y, from_=-881.985513, to=872.697628,
               orient=HORIZONTAL, sliderlength=60, troughcolor="skyblue", relief=RAISED)
scale4.place(x=850, y=120)

btn4 = Button(root, text="Dropoff Latitude", bg="darkblue", fg="white", width=25)
btn4.place(x=800, y=90)

label4 = Label(root)
label4.place(x=850, y=110)

def my_upd2():
    cb2.set('1')
    label8.config(text=cb2.get() + ':' + str(cb2.current()))

# Hours of the day offered in the combobox
hours = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23']
cb2 = ttk.Combobox(root, values=hours, width=15)
cb2.place(x=1150, y=120)
b2 = tk.Button(root, text="set('1')", command=lambda: my_upd2())
b2.place(x=1080, y=120)
label8 = tk.Label(root, text='hour', fg='green', font=15)
label8.place(x=1040, y=120)
print(cb2.get())

button1 = tk.Button(root, text='PREDICT', fg="dark slate blue", bg='bisque2', command=predict, font=15)
button1.place(x=420, y=180)

t = tk.Text(root, highlightbackground='green', fg="white", bg="azure4", height=1, width=25, font=15)
t.place(x=550, y=180)

button2_font = font.Font(size=15, weight='bold')
button2 = tk.Button(root, text='RESET', bg="cyan", fg="white", command=reset, font=button2_font)
button2.place(x=520, y=220)

button3_font = font.Font(size=15, weight='bold')
button3 = tk.Button(root, text='CLOSE', bg="CadetBlue1", fg="white", command=close, font=button3_font)
button3.place(x=620, y=220)

root.mainloop()

CHAPTER 5: REFERENCES

Websites:
 www.edwisor.com : Videos from Mentor
 https://rpubs.com/ : Coding Doubts
 https://stackoverflow.com/questions/51488949/use-haversine-package-to-compare-all-distances-possibilities-of-a-csv-list-of-lo : Haversine Doubt
 https://gist.github.com/rochacbruno/2883505 : Calculate Haversine in Python

Video Channels:
 https://www.youtube.com/watch?v=7YfyIhhmwq4 : Distance Feature Development in Python
 https://www.youtube.com/watch?v=Uct_EbThV1E&list=PLZ7s-Z1aAtmIbaEj_PtUqkqdmI1k7libK&index=2&t=690s : Python Coding

THANK YOU
