0% found this document useful (0 votes)
7 views13 pages

ML 1 16

Machine learning practical for uber ride price prediction using logistic and random
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views13 pages

ML 1 16

Machine learning practical for uber ride price prediction using logistic and random
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

In [1]: 1 # Name: Vedika Santosh Jadhav

2 # Roll_no:2441011
3
4 #Assignment No.:1
5 #Title:Predict the price of the Uber ride from a given pickup
6 #point to the agreed drop-off location.
7 # Perform following tasks:
8 # 1. Pre-process the dataset.
9 # 2. Identify outliers.
10 # 3. Check the correlation.
11 # 4. Implement linear regression and random forest regression models.
12 # 5. Evaluate the models and compare their respective scores like R2, RMSE,etc.
13 ​
14 import pandas as pd
15 import numpy as np
16 import seaborn as sns
17 import matplotlib.pyplot as plt

In [2]: 1 df=pd.read_csv("uber - uber.csv")

In [3]: 1 df

Out[3]:
Unnamed:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_lon
0

2015-
2015-05-07
0 24238194 05-07 7.5 -73.999817 40.738354 -73.9
19:52:06 UTC
19:52:06

2009-
2009-07-17
1 27835199 07-17 7.7 -73.994355 40.728225 -73.9
20:04:56 UTC
20:04:56

2009-
2009-08-24
2 44984355 08-24 12.9 -74.005043 40.740770 -73.9
21:45:00 UTC
21:45:00

2009-
2009-06-26
3 25894730 06-26 5.3 -73.976124 40.790844 -73.9
08:22:21 UTC
8:22:21

2014-
2014-08-28
4 17610152 08-28 16.0 -73.925023 40.744085 -73.9
17:47:00 UTC
17:47:00

... ... ... ... ... ... ...

2012-
2012-10-28
199995 42598914 10-28 3.0 -73.987042 40.739367 -73.9
10:49:00 UTC
10:49:00

2014-
2014-03-14
199996 16382965 03-14 7.5 -73.984722 40.736837 -74.0
01:09:00 UTC
1:09:00

2009-
2009-06-29
199997 27804658 06-29 30.9 -73.986017 40.756487 -73.8
00:42:00 UTC
0:42:00

2015-
2015-05-20
199998 20259894 05-20 14.5 -73.997124 40.725452 -73.9
14:56:25 UTC
14:56:25

2010-
2010-05-15
199999 11951496 05-15 14.1 -73.984395 40.720077 -73.9
04:08:00 UTC
4:08:00

200000 rows × 9 columns


 
In [4]: 1 df.head()

Out[4]:
Unnamed:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude
0

2015-
2015-05-07
0 24238194 05-07 7.5 -73.999817 40.738354 -73.999512
19:52:06 UTC
19:52:06

2009-
2009-07-17
1 27835199 07-17 7.7 -73.994355 40.728225 -73.994710
20:04:56 UTC
20:04:56

2009-
2009-08-24
2 44984355 08-24 12.9 -74.005043 40.740770 -73.962565
21:45:00 UTC
21:45:00

2009-
2009-06-26
3 25894730 06-26 5.3 -73.976124 40.790844 -73.965316
08:22:21 UTC
8:22:21

2014-
2014-08-28
4 17610152 08-28 16.0 -73.925023 40.744085 -73.973082
17:47:00 UTC
17:47:00

 

In [5]: 1 df.tail()

Out[5]:
Unnamed:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_lon
0

2012-
2012-10-28
199995 42598914 10-28 3.0 -73.987042 40.739367 -73.9
10:49:00 UTC
10:49:00

2014-
2014-03-14
199996 16382965 03-14 7.5 -73.984722 40.736837 -74.0
01:09:00 UTC
1:09:00

2009-
2009-06-29
199997 27804658 06-29 30.9 -73.986017 40.756487 -73.8
00:42:00 UTC
0:42:00

2015-
2015-05-20
199998 20259894 05-20 14.5 -73.997124 40.725452 -73.9
14:56:25 UTC
14:56:25

2010-
2010-05-15
199999 11951496 05-15 14.1 -73.984395 40.720077 -73.9
04:08:00 UTC
4:08:00

 

In [6]: 1 df.isna().sum()

Out[6]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
In [7]: 1 df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB

In [8]: 1 df.dtypes

Out[8]: Unnamed: 0 int64


key object
fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

In [9]: 1 df.shape

Out[9]: (200000, 9)

In [10]: 1 df['pickup_datetime']=pd.to_datetime(df['pickup_datetime'])

In [11]: 1 df.dtypes

Out[11]: Unnamed: 0 int64


key object
fare_amount float64
pickup_datetime datetime64[ns, UTC]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

In [12]: 1 df=df.drop("Unnamed: 0",axis=1)


2 df=df.drop("key",axis=1)
In [13]: 1 df

Out[13]:
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitud

2015-05-07
0 7.5 -73.999817 40.738354 -73.999512 40.7232
19:52:06+00:00

2009-07-17
1 7.7 -73.994355 40.728225 -73.994710 40.75032
20:04:56+00:00

2009-08-24
2 12.9 -74.005043 40.740770 -73.962565 40.77264
21:45:00+00:00

2009-06-26
3 5.3 -73.976124 40.790844 -73.965316 40.80334
08:22:21+00:00

2014-08-28
4 16.0 -73.925023 40.744085 -73.973082 40.76124
17:47:00+00:00

... ... ... ... ... ...

2012-10-28
199995 3.0 -73.987042 40.739367 -73.986525 40.74029
10:49:00+00:00

2014-03-14
199996 7.5 -73.984722 40.736837 -74.006672 40.73962
01:09:00+00:00

2009-06-29
199997 30.9 -73.986017 40.756487 -73.858957 40.69258
00:42:00+00:00

2015-05-20
199998 14.5 -73.997124 40.725452 -73.983215 40.6954
14:56:25+00:00

2010-05-15
199999 14.1 -73.984395 40.720077 -73.985508 40.76879
04:08:00+00:00

200000 rows × 7 columns


 

In [14]: 1 df.fillna(0,inplace=True)

In [15]: 1 df.isnull().sum()

Out[15]: fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64

In [16]: 1 df=df.assign(hour=df.pickup_datetime.dt.hour,
2 day=df.pickup_datetime.dt.day,
3 month=df.pickup_datetime.dt.month,
4 year=df.pickup_datetime.dt.year,
5 daysofweek=df.pickup_datetime.dt.dayofweek)
In [17]: 1 df

Out[17]:
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitud

2015-05-07
0 7.5 -73.999817 40.738354 -73.999512 40.7232
19:52:06+00:00

2009-07-17
1 7.7 -73.994355 40.728225 -73.994710 40.75032
20:04:56+00:00

2009-08-24
2 12.9 -74.005043 40.740770 -73.962565 40.77264
21:45:00+00:00

2009-06-26
3 5.3 -73.976124 40.790844 -73.965316 40.80334
08:22:21+00:00

2014-08-28
4 16.0 -73.925023 40.744085 -73.973082 40.76124
17:47:00+00:00

... ... ... ... ... ...

2012-10-28
199995 3.0 -73.987042 40.739367 -73.986525 40.74029
10:49:00+00:00

2014-03-14
199996 7.5 -73.984722 40.736837 -74.006672 40.73962
01:09:00+00:00

2009-06-29
199997 30.9 -73.986017 40.756487 -73.858957 40.69258
00:42:00+00:00

2015-05-20
199998 14.5 -73.997124 40.725452 -73.983215 40.6954
14:56:25+00:00

2010-05-15
199999 14.1 -73.984395 40.720077 -73.985508 40.76879
04:08:00+00:00

200000 rows × 12 columns


 

In [18]: 1 df=df.drop("pickup_datetime",axis=1)
In [19]: 1 df.plot()

Out[19]: <Axes: >

In [20]: 1 df.plot(kind="box")

Out[20]: <Axes: >


In [21]: 1 df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))

Out[21]: fare_amount Axes(0.125,0.786098;0.352273x0.0939024)


pickup_longitude Axes(0.547727,0.786098;0.352273x0.0939024)
pickup_latitude Axes(0.125,0.673415;0.352273x0.0939024)
dropoff_longitude Axes(0.547727,0.673415;0.352273x0.0939024)
dropoff_latitude Axes(0.125,0.560732;0.352273x0.0939024)
passenger_count Axes(0.547727,0.560732;0.352273x0.0939024)
hour Axes(0.125,0.448049;0.352273x0.0939024)
day Axes(0.547727,0.448049;0.352273x0.0939024)
month Axes(0.125,0.335366;0.352273x0.0939024)
year Axes(0.547727,0.335366;0.352273x0.0939024)
daysofweek Axes(0.125,0.222683;0.352273x0.0939024)
dtype: object
In [22]: 1 def remove_outlier(df1,col):
2 Q1=df1[col].quantile(0.25)
3 Q3=df1[col].quantile(0.75)
4 IQR=Q3-Q1
5 lower=Q1-1.5*IQR
6 upper=Q3+1.5*IQR
7 df[col]=np.clip(df1[col],lower,upper)
8 return df1
9 ​
10 def treat_outliers(df,col_list):
11 for c in col_list:
12 df1=remove_outlier(df,c)
13 return df1
14 ​
15 df=treat_outliers(df,df.iloc[:,0::])
16 df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))

Out[22]: fare_amount Axes(0.125,0.786098;0.352273x0.0939024)


pickup_longitude Axes(0.547727,0.786098;0.352273x0.0939024)
pickup_latitude Axes(0.125,0.673415;0.352273x0.0939024)
dropoff_longitude Axes(0.547727,0.673415;0.352273x0.0939024)
dropoff_latitude Axes(0.125,0.560732;0.352273x0.0939024)
passenger_count Axes(0.547727,0.560732;0.352273x0.0939024)
hour Axes(0.125,0.448049;0.352273x0.0939024)
day Axes(0.547727,0.448049;0.352273x0.0939024)
month Axes(0.125,0.335366;0.352273x0.0939024)
year Axes(0.547727,0.335366;0.352273x0.0939024)
daysofweek Axes(0.125,0.222683;0.352273x0.0939024)
dtype: object
In [23]: 1 corr=df.corr()
2 print(corr.shape)
3 plt.figure(figsize=(6,6))
4 sns.heatmap(corr,cbar=True,square=True,fmt='.1f',
5 annot=True,annot_kws={'size':15},cmap="Oranges")

(11, 11)

Out[23]: <Axes: >

In [24]: 1 pip install numpy

Defaulting to user installation because normal site-packages is not writeable


Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages
(1.24.3)
Note: you may need to restart the kernel to use updated packages.

In [25]: 1 pip install numpy

Defaulting to user installation because normal site-packages is not writeable


Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages
(1.24.3)
Note: you may need to restart the kernel to use updated packages.
In [26]: 1 pip install --force-reinstall haversine

Defaulting to user installation because normal site-packages is not writeable


Collecting haversine
Obtaining dependency information for haversine from https://files.pythonhosted.or
g/packages/5b/f1/b7274966f0b5b665d9114e86d09c6bc87d241781d63d8817323dcfa940c6/havers
ine-2.8.1-py2.py3-none-any.whl.metadata (https://files.pythonhosted.org/packages/5b/
f1/b7274966f0b5b665d9114e86d09c6bc87d241781d63d8817323dcfa940c6/haversine-2.8.1-py2.
py3-none-any.whl.metadata)
Using cached haversine-2.8.1-py2.py3-none-any.whl.metadata (5.9 kB)
Using cached haversine-2.8.1-py2.py3-none-any.whl (7.7 kB)
Installing collected packages: haversine
Attempting uninstall: haversine
Found existing installation: haversine 2.8.1
Uninstalling haversine-2.8.1:
Successfully uninstalled haversine-2.8.1
Successfully installed haversine-2.8.1
Note: you may need to restart the kernel to use updated packages.

In [27]: 1 import haversine as hs

In [28]: 1 travel_dist = []
2 for pos in range(len(df['pickup_longitude'])):
3 long1,lati1,long2,lati2 = [df['pickup_longitude']
4 [pos],df['pickup_latitude'][pos],df['dropoff_longitude']
5 [pos],df['dropoff_latitude'][pos]]
6 loc1=(lati1,long1)
7 loc2=(lati2,long2)
8 c = hs.haversine(loc1,loc2)
9 travel_dist.append(c)
10
11 ​
12 print(travel_dist)
13 df['dist_travel_km'] = travel_dist
14 df.head()

IOPub data rate exceeded.


The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Out[28]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count h

0 7.5 -73.999817 40.738354 -73.999512 40.723217 1.0

1 7.7 -73.994355 40.728225 -73.994710 40.750325 1.0

2 12.9 -74.005043 40.740770 -73.962565 40.772647 1.0

3 5.3 -73.976124 40.790844 -73.965316 40.803349 3.0

4 16.0 -73.929786 40.744085 -73.973082 40.761247 3.5

 
In [29]: 1 x=df[['pickup_longitude','pickup_latitude',
2 'dropoff_longitude','dropoff_latitude','passenger_count',
3 'hour','day','month','year','daysofweek','dist_travel_km']]
4 y=df['fare_amount']
5 ​
6 ​

In [30]: 1 from sklearn.model_selection import train_test_split


2 x_train,x_test,y_train,y_test=
3 train_test_split(x,y,test_size=0.2,random_state=42)

In [31]: 1 from sklearn.linear_model import LinearRegression


2 model=LinearRegression()
3 model.fit(x_train, y_train)
4 LinearRegression()
5 y_pred = model.predict(x_test)
6 y_pred

Out[31]: array([ 7.68008794, 10.60894655, 8.47854262, ..., 6.36062388,


6.57836563, 8.154517 ])

In [32]: 1 from sklearn.metrics import mean_absolute_error,


2 mean_squared_error, r2_score

In [33]: 1 r2 = r2_score(y_test, y_pred)


2 mae = mean_absolute_error(y_test, y_pred)
3 mse = mean_squared_error(y_test, y_pred)
4 rmse = np.sqrt(mse)

In [34]: 1 print(f'R² Score: {r2}')


2 print(f'Mean Absolute Error (MAE): {mae}')
3 print(f'Mean Squared Error (MSE): {mse}')
4 print(f'Root Mean Squared Error (RMSE): {rmse}')

R² Score: 0.6589516503634483
Mean Absolute Error (MAE): 2.157564875844668
Mean Squared Error (MSE): 10.146783070723341
Root Mean Squared Error (RMSE): 3.185401555647787

In [35]: 1 ​
2 def custom_accuracy(y_true, y_pred, tolerance=0.1):
3 return np.mean(np.abs(y_true - y_pred) <= tolerance)
4 ​
5 accuracy = custom_accuracy(y_test, y_pred, tolerance=0.5)
6 print(f'Custom Accuracy: {accuracy}')
7 ​

Custom Accuracy: 0.170375

In [37]: 1 from sklearn.ensemble import RandomForestRegressor

In [38]: 1 rf_model = RandomForestRegressor(n_estimators=100,


2 random_state=42)
In [40]: 1 rf_model.fit(x_train, y_train)

Out[40]: ▾ RandomForestRegressor
RandomForestRegressor(random_state=42)

In [43]: 1 y_pred_rf = rf_model.predict(x_test)


2 print("Random Forest Model:", y_pred_rf)

Random Forest Model: [ 5.855 12.729 7.338 ... 4.816 5.828 8.495]

In [46]: 1 r2_lr = r2_score(y_test,y_pred )


2 rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred))

In [47]: 1 r2_rf = r2_score(y_test, y_pred_rf)


2 rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
3 print("Random Forest Regression R2:", r2_rf)
4 print("Random Forest Regression RMSE:",rmse_rf)

Random Forest Regression R2: 0.7944021741272649


Random Forest Regression RMSE: 2.473235494329611

In [ ]: 1 ​

You might also like