Predict The Price of The Uber Ride From A Given Pickup Point To The Agreed Drop-Off Location

The document outlines an assignment to predict Uber ride prices using a dataset. It includes tasks such as data pre-processing, outlier identification, correlation checking, and the implementation of linear and random forest regression models. The assignment also emphasizes evaluating and comparing the performance of these models using metrics like R2 and RMSE.

07/08/2024, 15:10 Assignment 1_ML - Jupyter Notebook

Name : Jadhav Shruti

Roll No : 2441027

Batch : C

Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks:

Pre-process the dataset. Identify outliers. Check the correlation. Implement linear regression and random forest regression models. Evaluate the models and compare their respective scores, such as R2 and RMSE.

In [1]: import pandas as pd
import numpy as np

In [2]: df=pd.read_csv("Downloads/uber.csv")
df

Out[2]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitu...
0         24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354  -73.9995...
1         27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225  -73.9947...
2         44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770  -73.9625...
3         25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844  -73.9653...
4         17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085  -73.9730...
...            ...                            ...          ...                      ...               ...              ...  ...
199995    42598914   2012-10-28 10:49:00.00000053          3.0  2012-10-28 10:49:00 UTC        -73.987042        40.739367  -73.9865...
199996    16382965    2014-03-14 01:09:00.0000008          7.5  2014-03-14 01:09:00 UTC        -73.984722        40.736837  -74.0066...
199997    27804658   2009-06-29 00:42:00.00000078         30.9  2009-06-29 00:42:00 UTC        -73.986017        40.756487  -73.8589...
199998    20259894    2015-05-20 14:56:25.0000004         14.5  2015-05-20 14:56:25 UTC        -73.997124        40.725452  -73.9832...
199999    11951496   2010-05-15 04:08:00.00000076         14.1  2010-05-15 04:08:00 UTC        -73.984395        40.720077  -73.9855...

200000 rows × 9 columns

In [3]: df.shape

Out[3]: (200000, 9)

localhost:8888/notebooks/Assignment 1_ML.ipynb 1/9



In [4]: df.dtypes

Out[4]: Unnamed: 0 int64
key object
fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

In [5]: df.head()

Out[5]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dr...
0         24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354         -73.999512  ...
1         27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225         -73.994710  ...
2         44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770         -73.962565  ...
3         25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844         -73.965316  ...
4         17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085         -73.973082  ...

In [6]: df.tail()

Out[6]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitud...
199995    42598914   2012-10-28 10:49:00.00000053          3.0  2012-10-28 10:49:00 UTC        -73.987042        40.739367  -73.98652...
199996    16382965    2014-03-14 01:09:00.0000008          7.5  2014-03-14 01:09:00 UTC        -73.984722        40.736837  -74.00667...
199997    27804658   2009-06-29 00:42:00.00000078         30.9  2009-06-29 00:42:00 UTC        -73.986017        40.756487  -73.85895...
199998    20259894    2015-05-20 14:56:25.0000004         14.5  2015-05-20 14:56:25 UTC        -73.997124        40.725452  -73.98321...
199999    11951496   2010-05-15 04:08:00.00000076         14.1  2010-05-15 04:08:00 UTC        -73.984395        40.720077  -73.98550...

In [7]: df=df.drop("Unnamed: 0",axis=1)


In [8]: df

Out[8]:
                                  key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_...
0         2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354         -73.999512  40...
1         2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225         -73.994710  40...
2        2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770         -73.962565  40...
3         2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844         -73.965316  40...
4       2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085         -73.973082  40...
...                               ...          ...                      ...               ...              ...                ...  ...
199995   2012-10-28 10:49:00.00000053          3.0  2012-10-28 10:49:00 UTC        -73.987042        40.739367         -73.986525  40...
199996    2014-03-14 01:09:00.0000008          7.5  2014-03-14 01:09:00 UTC        -73.984722        40.736837         -74.006672  40...
199997   2009-06-29 00:42:00.00000078         30.9  2009-06-29 00:42:00 UTC        -73.986017        40.756487         -73.858957  40...
199998    2015-05-20 14:56:25.0000004         14.5  2015-05-20 14:56:25 UTC        -73.997124        40.725452         -73.983215  40...
199999   2010-05-15 04:08:00.00000076         14.1  2010-05-15 04:08:00 UTC        -73.984395        40.720077         -73.985508  40...

200000 rows × 8 columns

In [9]: df=df.drop("key",axis=1)
df

Out[9]:
        fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_...
0               7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354         -73.999512         40.723217  ...
1               7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225         -73.994710         40.750325  ...
2              12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770         -73.962565         40.772647  ...
3               5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844         -73.965316         40.803349  ...
4              16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085         -73.973082         40.761247  ...
...             ...                      ...               ...              ...                ...               ...  ...
199995          3.0  2012-10-28 10:49:00 UTC        -73.987042        40.739367         -73.986525         40.740297  ...
199996          7.5  2014-03-14 01:09:00 UTC        -73.984722        40.736837         -74.006672         40.739620  ...
199997         30.9  2009-06-29 00:42:00 UTC        -73.986017        40.756487         -73.858957         40.692588  ...
199998         14.5  2015-05-20 14:56:25 UTC        -73.997124        40.725452         -73.983215         40.695415  ...
199999         14.1  2010-05-15 04:08:00 UTC        -73.984395        40.720077         -73.985508         40.768793  ...

200000 rows × 7 columns


In [10]: df.dtypes

Out[10]: fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

In [11]: df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])  # used to change from object
df.dtypes

Out[11]: fare_amount float64
pickup_datetime datetime64[ns, UTC]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
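
The conversion in cell 11 can be tried in isolation. The following is a minimal sketch, assuming only that the raw strings look like the `pickup_datetime` values shown earlier; the names `raw` and `parsed` are illustrative and not part of the notebook. Passing `utc=True` is an explicit choice made here so the result is timezone-aware regardless of how the suffix is interpreted.

```python
import pandas as pd

# Two timestamps copied from the first rows of the dataset shown above.
raw = pd.Series(["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"])

# Parse the strings into timezone-aware datetimes.
parsed = pd.to_datetime(raw, utc=True)

print(raw.dtype)     # string-like dtype before parsing
print(parsed.dtype)  # a UTC-aware datetime64 dtype after parsing
print(parsed.dt.hour.tolist())
```

Once the column has a datetime dtype, the `.dt` accessor exposes components such as the hour, which a plain string column does not offer.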

In [12]: df.isna().sum()

Out[12]: fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64

In [13]: df.fillna(0,inplace=True)

In [14]: df.isnull().sum()

Out[14]: fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
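
Cell 13 fills the single missing drop-off coordinate pair with 0. A possible alternative, sketched below on a two-row toy frame, is to drop the incomplete row instead, since a (0, 0) coordinate lies far from the other rows in this dataset. The frame `demo` and the names `filled` and `dropped` are hypothetical; the notebook itself uses `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete row (values from row 0 above) and one row with
# missing drop-off coordinates, mimicking the single NaN row reported by isna().
demo = pd.DataFrame({
    "dropoff_longitude": [-73.999512, np.nan],
    "dropoff_latitude": [40.723217, np.nan],
})

filled = demo.fillna(0)   # what the notebook does: NaN becomes 0
dropped = demo.dropna()   # alternative: discard the incomplete row

print(filled)
print(len(dropped), "row(s) kept after dropna")
```

With one affected row out of 200000, either choice is unlikely to change the scores much; the difference matters mainly for how the coordinate outliers look afterwards.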

In [15]: df=df.assign(hour=df.pickup_datetime.dt.hour,day=df.pickup_datetime.dt.day,month=df.pick
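
The line in cell 15 is cut off at the page edge, so its full argument list is not visible. The sketch below shows the same `assign` pattern on two timestamps, assuming (from the `hour`, `day` and `month` columns that appear in later outputs) that the third feature is the calendar month. The frame `df_demo` and the variable `ts` are illustrative names.

```python
import pandas as pd

# Two pickup times taken from the first rows of the dataset.
df_demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"], utc=True
    )
})

ts = df_demo["pickup_datetime"]

# Derive numeric calendar features; the month feature is an assumption
# based on the column names seen in the later outputs.
df_demo = df_demo.assign(hour=ts.dt.hour, day=ts.dt.day, month=ts.dt.month)

print(df_demo[["hour", "day", "month"]])
```

The hour and day values agree with rows 0 and 1 of the frame printed after the datetime column is dropped.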


In [16]: df

Out[16]:
        fare_amount            pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_...
0               7.5  2015-05-07 19:52:06+00:00        -73.999817        40.738354         -73.999512         40.723217  ...
1               7.7  2009-07-17 20:04:56+00:00        -73.994355        40.728225         -73.994710         40.750325  ...
2              12.9  2009-08-24 21:45:00+00:00        -74.005043        40.740770         -73.962565         40.772647  ...
3               5.3  2009-06-26 08:22:21+00:00        -73.976124        40.790844         -73.965316         40.803349  ...
4              16.0  2014-08-28 17:47:00+00:00        -73.925023        40.744085         -73.973082         40.761247  ...
...             ...                        ...               ...              ...                ...               ...  ...
199995          3.0  2012-10-28 10:49:00+00:00        -73.987042        40.739367         -73.986525         40.740297  ...
199996          7.5  2014-03-14 01:09:00+00:00        -73.984722        40.736837         -74.006672         40.739620  ...
199997         30.9  2009-06-29 00:42:00+00:00        -73.986017        40.756487         -73.858957         40.692588  ...
199998         14.5  2015-05-20 14:56:25+00:00        -73.997124        40.725452         -73.983215         40.695415  ...
199999         14.1  2010-05-15 04:08:00+00:00        -73.984395        40.720077         -73.985508         40.768793  ...

200000 rows × 10 columns

In [17]: df=df.drop("pickup_datetime",axis=1)

In [18]: df

Out[18]:
        fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  hour  day
0               7.5        -73.999817        40.738354         -73.999512         40.723217                1    19    7
1               7.7        -73.994355        40.728225         -73.994710         40.750325                1    20   17
2              12.9        -74.005043        40.740770         -73.962565         40.772647                1    21   24
3               5.3        -73.976124        40.790844         -73.965316         40.803349                3     8   26
4              16.0        -73.925023        40.744085         -73.973082         40.761247                5    17   28
...             ...               ...              ...                ...               ...              ...   ...  ...
199995          3.0        -73.987042        40.739367         -73.986525         40.740297                1    10   28
199996          7.5        -73.984722        40.736837         -74.006672         40.739620                1     1   14
199997         30.9        -73.986017        40.756487         -73.858957         40.692588                2     0   29
199998         14.5        -73.997124        40.725452         -73.983215         40.695415                1    14   20
199999         14.1        -73.984395        40.720077         -73.985508         40.768793                1     4   15

200000 rows × 9 columns


In [19]: df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))

Out[19]: fare_amount AxesSubplot(0.125,0.786098;0.352273x0.0939024)
pickup_longitude AxesSubplot(0.547727,0.786098;0.352273x0.0939024)
pickup_latitude AxesSubplot(0.125,0.673415;0.352273x0.0939024)
dropoff_longitude AxesSubplot(0.547727,0.673415;0.352273x0.0939024)
dropoff_latitude AxesSubplot(0.125,0.560732;0.352273x0.0939024)
passenger_count AxesSubplot(0.547727,0.560732;0.352273x0.0939024)
hour AxesSubplot(0.125,0.448049;0.352273x0.0939024)
day AxesSubplot(0.547727,0.448049;0.352273x0.0939024)
month AxesSubplot(0.125,0.335366;0.352273x0.0939024)
dtype: object

In [20]: def find_outliers_IQR(df,col):
    # Clip values that fall outside the 1.5*IQR whiskers to the whisker values
    q1=df[col].quantile(0.25)
    q3=df[col].quantile(0.75)
    IQR=q3-q1
    lower_whisker = q1-1.5*IQR
    upper_whisker = q3+1.5*IQR
    df[col]=np.clip(df[col],lower_whisker,upper_whisker)
    return df

def all_outliers(df,col_list):
    # Apply the IQR clipping to every column named in col_list
    for i in col_list:
        df=find_outliers_IQR(df,i)
    return df

In [21]: df=all_outliers(df,df.columns)  # clips every column, including the target fare_amount
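
The IQR rule used in cell 20 can be checked by hand on a small series. This is a minimal sketch with made-up fares; the helper `iqr_bounds` and the series `fares` are illustrative and do not appear in the notebook. The lower whisker is Q1 - 1.5 × IQR and the upper whisker is Q3 + 1.5 × IQR, and values beyond a whisker are clipped to it rather than removed.

```python
import pandas as pd

def iqr_bounds(s):
    """Return (lower, upper) whiskers for a numeric Series using the 1.5*IQR rule."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical fares with one obvious outlier (100.0).
fares = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0, 100.0])

lower, upper = iqr_bounds(fares)

# Clip instead of dropping, so the row count stays the same.
clipped = fares.clip(lower, upper)

print(lower, upper)
print(clipped.tolist())
```

Here Q1 is 6.5 and Q3 is 11.5, so the whiskers are -1.0 and 19.0 and only the 100.0 fare is pulled down to 19.0. Clipping keeps all 200000 rows in the notebook, which is why the row count does not change after cell 21.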


In [22]: df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))

Out[22]: fare_amount AxesSubplot(0.125,0.786098;0.352273x0.0939024)
pickup_longitude AxesSubplot(0.547727,0.786098;0.352273x0.0939024)
pickup_latitude AxesSubplot(0.125,0.673415;0.352273x0.0939024)
dropoff_longitude AxesSubplot(0.547727,0.673415;0.352273x0.0939024)
dropoff_latitude AxesSubplot(0.125,0.560732;0.352273x0.0939024)
passenger_count AxesSubplot(0.547727,0.560732;0.352273x0.0939024)
hour AxesSubplot(0.125,0.448049;0.352273x0.0939024)
day AxesSubplot(0.547727,0.448049;0.352273x0.0939024)
month AxesSubplot(0.125,0.335366;0.352273x0.0939024)
dtype: object

In [23]: df.corr()

Out[23]:
                   fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
fare_amount           1.000000          0.154069        -0.110842           0.218704         -0.125898         0.015778
pickup_longitude      0.154069          1.000000         0.259497           0.425631          0.073290        -0.013213
pickup_latitude      -0.110842          0.259497         1.000000           0.048898          0.515714        -0.012889
dropoff_longitude     0.218704          0.425631         0.048898           1.000000          0.245627        -0.009325
dropoff_latitude     -0.125898          0.073290         0.515714           0.245627          1.000000        -0.006308
passenger_count       0.015778         -0.013213        -0.012889          -0.009325         -0.006308         1.000000
hour                 -0.023623          0.011579         0.029681          -0.046578          0.019783         0.020274
day                   0.004534         -0.003204        -0.001553          -0.004027         -0.003479         0.002712
month                 0.030817          0.001169         0.001562           0.002394         -0.001193         0.010351
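
When the goal is to see which features relate to the fare, it can be easier to read a single column of the correlation matrix, ordered by absolute value. The sketch below does this on five rows copied from the frame shown earlier; the frame `demo` and the name `corr_with_target` are illustrative, and five rows are far too few to say anything about the real dataset.

```python
import pandas as pd

# First five rows of three columns, copied from the output above.
demo = pd.DataFrame({
    "fare_amount": [7.5, 7.7, 12.9, 5.3, 16.0],
    "passenger_count": [1, 1, 1, 3, 5],
    "hour": [19, 20, 21, 8, 17],
})

# Take the target's column of the Pearson correlation matrix, drop the
# trivial self-correlation, and order by absolute strength.
corr_with_target = (
    demo.corr()["fare_amount"]
    .drop("fare_amount")
    .sort_values(key=abs, ascending=False)
)

print(corr_with_target)
```

Applied to the full matrix above, the same ordering would put `dropoff_longitude` (0.218704) first, and no single feature has a strong linear relationship with `fare_amount`.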


In [24]: import seaborn as sns
sns.heatmap(df.corr())

Out[24]: <AxesSubplot:>

In [25]: X = df[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
y = df['fare_amount']  # target

Out[25]: 0 7.50
1 7.70
2 12.90
3 5.30
4 16.00
...
199995 3.00
199996 7.50
199997 22.25
199998 14.50
199999 14.10
Name: fare_amount, Length: 200000, dtype: float64

In [27]: from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

In [28]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42
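
Conceptually, an 80/20 hold-out split shuffles the row indices and sets one fifth aside for testing. The sketch below illustrates that idea with NumPy only; it is not scikit-learn's implementation and will not select the same rows as `train_test_split(..., random_state=42)`. The names `n_rows`, `train_idx` and `test_idx` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the shuffle is repeatable

n_rows = 10                       # stand-in for the 200000 rows of the dataset
indices = rng.permutation(n_rows) # shuffled row positions 0..n_rows-1

n_test = int(np.ceil(0.2 * n_rows))  # 20% of the rows held out
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print("train rows:", sorted(train_idx.tolist()))
print("test rows: ", sorted(test_idx.tolist()))
```

The point of the split is that both models are scored on rows they never saw during fitting, so the R2 and RMSE values reported later describe generalization and not memorization.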

In [29]: lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

Out[29]: LinearRegression()

In [31]: rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

Out[31]: RandomForestRegressor(random_state=42)


In [32]: y_pred_lr = lr_model.predict(X_test)
print("Linear Model:",y_pred_lr)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Model:", y_pred_rf)

Linear Model: [ 9.8745977 17.13685119 10.30134461 ... 8.92996545 9.28083902 9.30188948]
Random Forest Model: [ 5.858 10.53971971 7.422 ... 5.3515 6.296 7.872 ]

In [33]: r2_lr = r2_score(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

In [34]: print("Linear Regression - R2:", r2_lr)
print("Linear Regression - RMSE:", rmse_lr)

Linear Regression - R2: 0.09111542765407288
Linear Regression - RMSE: 5.200086615056714

In [35]: r2_rf = r2_score(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Random Forest Regression R2:", r2_rf)
print("Random Forest Regression RMSE:",rmse_rf)

Random Forest Regression R2: 0.7600801674798523
Random Forest Regression RMSE: 2.67170981840233
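
To make the two scores concrete, the sketch below computes RMSE and R2 directly from their definitions with NumPy. The true values are the first five fares from the dataset; the predictions are made up purely for illustration, and the helper names `rmse` and `r2` are not from the notebook. RMSE is the square root of the mean squared error, in the same units as the fare; R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([7.5, 7.7, 12.9, 5.3, 16.0])  # fares from the first five rows
y_pred = np.array([8.0, 7.0, 12.0, 6.0, 15.0])  # hypothetical predictions

print("RMSE:", rmse(y_true, y_pred))
print("R2:  ", r2(y_true, y_pred))

# A model that always predicts the mean fare scores R2 = 0 by construction.
baseline = np.full_like(y_true, y_true.mean())
print("Baseline R2:", r2(y_true, baseline))
```

Read this way, the linear regression's R2 of about 0.09 is only slightly better than always predicting the mean fare, while the random forest's R2 of about 0.76 and lower RMSE (2.67 versus 5.20) indicate that it captures much more of the variation in fare on the held-out rows.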

In [ ]:
