Predict The Price of The Uber Ride From A Given Pickup Point To The Agreed Drop-Off Location
Name : Jadhav Shruti
Roll No : 2441027
Batch : C
Predict the price of the Uber ride from a given pickup point
to the agreed drop-off location.
Perform the following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores (R2, RMSE, etc.).
In [2]: import pandas as pd
df=pd.read_csv("Downloads/uber.csv")
df
Out[2]:
        Unnamed: 0  key                            fare_amount  pickup_datetime          pickup_longitude  pickup_latitude  dropoff_longitu
0       24238194    2015-05-07 19:52:06.0000003    7.5          2015-05-07 19:52:06 UTC  -73.999817        40.738354        -73.9995
1       27835199    2009-07-17 20:04:56.0000002    7.7          2009-07-17 20:04:56 UTC  -73.994355        40.728225        -73.9947
2       44984355    2009-08-24 21:45:00.00000061   12.9         2009-08-24 21:45:00 UTC  -74.005043        40.740770        -73.9625
3       25894730    2009-06-26 08:22:21.0000001    5.3          2009-06-26 08:22:21 UTC  -73.976124        40.790844        -73.9653
4       17610152    2014-08-28 17:47:00.000000188  16.0         2014-08-28 17:47:00 UTC  -73.925023        40.744085        -73.9730
...
199995  42598914    2012-10-28 10:49:00.00000053   3.0          2012-10-28 10:49:00 UTC  -73.987042        40.739367        -73.9865
199996  16382965    2014-03-14 01:09:00.0000008    7.5          2014-03-14 01:09:00 UTC  -73.984722        40.736837        -74.0066
199997  27804658    2009-06-29 00:42:00.00000078   30.9         2009-06-29 00:42:00 UTC  -73.986017        40.756487        -73.8589
199998  20259894    2015-05-20 14:56:25.0000004    14.5         2015-05-20 14:56:25 UTC  -73.997124        40.725452        -73.9832
199999  11951496    2010-05-15 04:08:00.00000076   14.1         2010-05-15 04:08:00 UTC  -73.984395        40.720077        -73.9855
In [3]: df.shape
Out[3]: (200000, 9)
In [4]: df.dtypes
In [5]: df.head()
Out[5]:
   Unnamed: 0  key                            fare_amount  pickup_datetime          pickup_longitude  pickup_latitude  dropoff_longitude  dr
0  24238194    2015-05-07 19:52:06.0000003    7.5          2015-05-07 19:52:06 UTC  -73.999817        40.738354        -73.999512
1  27835199    2009-07-17 20:04:56.0000002    7.7          2009-07-17 20:04:56 UTC  -73.994355        40.728225        -73.994710
2  44984355    2009-08-24 21:45:00.00000061   12.9         2009-08-24 21:45:00 UTC  -74.005043        40.740770        -73.962565
3  25894730    2009-06-26 08:22:21.0000001    5.3          2009-06-26 08:22:21 UTC  -73.976124        40.790844        -73.965316
4  17610152    2014-08-28 17:47:00.000000188  16.0         2014-08-28 17:47:00 UTC  -73.925023        40.744085        -73.973082
In [6]: df.tail()
Out[6]:
        Unnamed: 0  key                            fare_amount  pickup_datetime          pickup_longitude  pickup_latitude  dropoff_longitud
199995  42598914    2012-10-28 10:49:00.00000053   3.0          2012-10-28 10:49:00 UTC  -73.987042        40.739367        -73.98652
199996  16382965    2014-03-14 01:09:00.0000008    7.5          2014-03-14 01:09:00 UTC  -73.984722        40.736837        -74.00667
199997  27804658    2009-06-29 00:42:00.00000078   30.9         2009-06-29 00:42:00 UTC  -73.986017        40.756487        -73.85895
199998  20259894    2015-05-20 14:56:25.0000004    14.5         2015-05-20 14:56:25 UTC  -73.997124        40.725452        -73.98321
199999  11951496    2010-05-15 04:08:00.00000076   14.1         2010-05-15 04:08:00 UTC  -73.984395        40.720077        -73.98550
In [8]: df
Out[8]:
        key                            fare_amount  pickup_datetime          pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_
0       2015-05-07 19:52:06.0000003    7.5          2015-05-07 19:52:06 UTC  -73.999817        40.738354        -73.999512         40
1       2009-07-17 20:04:56.0000002    7.7          2009-07-17 20:04:56 UTC  -73.994355        40.728225        -73.994710         40
2       2009-08-24 21:45:00.00000061   12.9         2009-08-24 21:45:00 UTC  -74.005043        40.740770        -73.962565         40
3       2009-06-26 08:22:21.0000001    5.3          2009-06-26 08:22:21 UTC  -73.976124        40.790844        -73.965316         40
4       2014-08-28 17:47:00.000000188  16.0         2014-08-28 17:47:00 UTC  -73.925023        40.744085        -73.973082         40
...
199995  2012-10-28 10:49:00.00000053   3.0          2012-10-28 10:49:00 UTC  -73.987042        40.739367        -73.986525         40
199996  2014-03-14 01:09:00.0000008    7.5          2014-03-14 01:09:00 UTC  -73.984722        40.736837        -74.006672         40
199997  2009-06-29 00:42:00.00000078   30.9         2009-06-29 00:42:00 UTC  -73.986017        40.756487        -73.858957         40
199998  2015-05-20 14:56:25.0000004    14.5         2015-05-20 14:56:25 UTC  -73.997124        40.725452        -73.983215         40
199999  2010-05-15 04:08:00.00000076   14.1         2010-05-15 04:08:00 UTC  -73.984395        40.720077        -73.985508         40
In [9]: df=df.drop("key",axis=1)
df
Out[9]:
        fare_amount  pickup_datetime          pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_
0       7.5          2015-05-07 19:52:06 UTC  -73.999817        40.738354        -73.999512         40.723217
1       7.7          2009-07-17 20:04:56 UTC  -73.994355        40.728225        -73.994710         40.750325
2       12.9         2009-08-24 21:45:00 UTC  -74.005043        40.740770        -73.962565         40.772647
3       5.3          2009-06-26 08:22:21 UTC  -73.976124        40.790844        -73.965316         40.803349
4       16.0         2014-08-28 17:47:00 UTC  -73.925023        40.744085        -73.973082         40.761247
...
199995  3.0          2012-10-28 10:49:00 UTC  -73.987042        40.739367        -73.986525         40.740297
199996  7.5          2014-03-14 01:09:00 UTC  -73.984722        40.736837        -74.006672         40.739620
199997  30.9         2009-06-29 00:42:00 UTC  -73.986017        40.756487        -73.858957         40.692588
199998  14.5         2015-05-20 14:56:25 UTC  -73.997124        40.725452        -73.983215         40.695415
199999  14.1         2010-05-15 04:08:00 UTC  -73.984395        40.720077        -73.985508         40.768793
In [10]: df.dtypes
In [12]: df.isna().sum()
Out[12]: fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
In [13]: df.fillna(0,inplace=True)
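Filling the missing drop-off coordinates with 0 places the affected trip at longitude 0, latitude 0, far from every other point in the data. Since `df.isna().sum()` reports only one missing value in each drop-off column, dropping the affected row is a reasonable alternative. A minimal sketch on a hypothetical three-row frame (the values are copied or made up for illustration, and `demo` and `cleaned` are names introduced here):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the middle row lacks drop-off coordinates.
demo = pd.DataFrame({
    "fare_amount": [7.5, 7.7, 12.9],
    "dropoff_longitude": [-73.999512, np.nan, -73.962565],
    "dropoff_latitude": [40.723217, np.nan, 40.772647],
})

# Drop rows where either drop-off coordinate is missing, then renumber the index.
cleaned = (
    demo.dropna(subset=["dropoff_longitude", "dropoff_latitude"])
        .reset_index(drop=True)
)

print(cleaned.isna().sum())  # per-column count of remaining missing values
print(len(cleaned))          # number of rows kept
```

Whether to drop or impute is a modelling choice; with a single incomplete row out of 200000, either option changes the fitted models very little, but dropping avoids introducing an artificial coordinate.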
In [14]: df.isnull().sum()
Out[14]: fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
In [15]: df=df.assign(hour=df.pickup_datetime.dt.hour,day=df.pickup_datetime.dt.day,month=df.pick
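The `assign` call above is cut off in this printout, so the full list of derived columns is not visible. The sketch below shows the general pattern on two sample timestamps: parse the strings into timezone-aware datetimes, then pull components out with the `.dt` accessor. The `hour`, `day`, and `month` columns come from the visible part of the cell; `year` and `dayofweek` are assumptions added for illustration.

```python
import pandas as pd

# Two sample timestamps in the same textual form as the raw pickup_datetime column.
demo = pd.DataFrame({
    "pickup_datetime": ["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"],
})

# Parse into timezone-aware timestamps; the .dt accessor needs a datetime dtype.
demo["pickup_datetime"] = pd.to_datetime(
    demo["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)

demo = demo.assign(
    hour=demo.pickup_datetime.dt.hour,
    day=demo.pickup_datetime.dt.day,
    month=demo.pickup_datetime.dt.month,
    year=demo.pickup_datetime.dt.year,            # assumption: not confirmed by the source
    dayofweek=demo.pickup_datetime.dt.dayofweek,  # assumption: Monday=0 ... Sunday=6
)

print(demo)
```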
In [16]: df
Out[16]:
        fare_amount  pickup_datetime            pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_
0       7.5          2015-05-07 19:52:06+00:00  -73.999817        40.738354        -73.999512         40.723217
1       7.7          2009-07-17 20:04:56+00:00  -73.994355        40.728225        -73.994710         40.750325
2       12.9         2009-08-24 21:45:00+00:00  -74.005043        40.740770        -73.962565         40.772647
3       5.3          2009-06-26 08:22:21+00:00  -73.976124        40.790844        -73.965316         40.803349
4       16.0         2014-08-28 17:47:00+00:00  -73.925023        40.744085        -73.973082         40.761247
...
199995  3.0          2012-10-28 10:49:00+00:00  -73.987042        40.739367        -73.986525         40.740297
199996  7.5          2014-03-14 01:09:00+00:00  -73.984722        40.736837        -74.006672         40.739620
199997  30.9         2009-06-29 00:42:00+00:00  -73.986017        40.756487        -73.858957         40.692588
199998  14.5         2015-05-20 14:56:25+00:00  -73.997124        40.725452        -73.983215         40.695415
199999  14.1         2010-05-15 04:08:00+00:00  -73.984395        40.720077        -73.985508         40.768793
In [17]: df=df.drop("pickup_datetime",axis=1)
In [18]: df
Out[18]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day
In [19]: df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))
def all_outliers(df,col_list):
    # Apply the IQR outlier treatment to each listed column in turn
    for i in col_list:
        df=find_outliers_IQR(df,i)
    return df
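`find_outliers_IQR` is called above, but its definition does not appear in this printout. The sketch below is one plausible version, assuming it caps values at the Tukey fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR) rather than dropping rows. That assumption fits Out[25], where the series still has 200000 entries and the fare for row 199997 is 22.25 instead of 30.9. The demo frame is made up.

```python
import pandas as pd

def find_outliers_IQR(df, col):
    """Return a copy of df with `col` capped to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Assumed behaviour; the original definition is not shown in the source.
    """
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    out = df.copy()
    # clip replaces values outside the fences with the nearest fence value
    out[col] = out[col].clip(lower=lower, upper=upper)
    return out

def all_outliers(df, col_list):
    # Apply the IQR capping to each listed column in turn
    for col in col_list:
        df = find_outliers_IQR(df, col)
    return df

# Hypothetical data: four ordinary fares and one extreme value.
demo = pd.DataFrame({"fare_amount": [5.0, 6.0, 7.0, 8.0, 100.0]})
capped = all_outliers(demo, demo.columns)
print(capped)
```

Here Q1 = 6 and Q3 = 8, so the upper fence is 8 + 1.5 * 2 = 11 and the extreme fare is capped to 11.0 while the row count stays at 5. Capping keeps every trip in the dataset; filtering rows out instead would shrink it and is a different design choice.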
In [21]: df=all_outliers(df,df.columns)
In [22]: df.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20))
In [23]: df.corr()
Out[23]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
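The body of the correlation matrix did not survive in this printout. As an illustration of how such a matrix can be read, the sketch below computes Pearson correlations on a small made-up frame and ranks the features by the absolute strength of their linear relationship with `fare_amount`. The column names match the dataset, but the values and the `ranking` variable are hypothetical.

```python
import pandas as pd

# Made-up values: hour is perfectly linear in fare_amount, passenger_count is not.
demo = pd.DataFrame({
    "fare_amount": [5.0, 7.0, 9.0, 11.0, 13.0],
    "hour": [1, 2, 3, 4, 5],
    "passenger_count": [1, 2, 1, 2, 1],
})

corr = demo.corr()  # pairwise Pearson correlation by default

# Correlation of every feature with the target, strongest first.
ranking = (
    corr["fare_amount"]
    .drop("fare_amount")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Pearson correlation only measures linear association, so a feature with a low coefficient can still be useful to a non-linear model such as a random forest.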
Out[24]: <AxesSubplot:>
Out[25]: 0 7.50
1 7.70
2 12.90
3 5.30
4 16.00
...
199995 3.00
199996 7.50
199997 22.25
199998 14.50
199999 14.10
Name: fare_amount, Length: 200000, dtype: float64
Out[29]: LinearRegression()
Out[31]: RandomForestRegressor(random_state=42)
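Only the fitted estimators `LinearRegression()` and `RandomForestRegressor(random_state=42)` appear in the output above; the cells that split the data and score the models are not shown. A minimal sketch of how the two models could be compared on R2 and RMSE follows. It uses synthetic data in place of the Uber features, and the data, the 80/20 split, and `n_estimators=50` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and fare target (not the Uber data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.5 + 1.8 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1.0, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error, in the target's units.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    # R2 is the share of target variance explained on the held-out rows.
    r2 = float(r2_score(y_test, pred))
    scores[name] = {"rmse": rmse, "r2": r2}
    print(f"{name}: R2={r2:.3f}, RMSE={rmse:.3f}")
```

A higher R2 and a lower RMSE on the held-out rows indicate the better fit. Both metrics should be computed on the same test split so the comparison is like for like.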