ML Practical 1 Code
Perform the following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
Dataset link: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset
import pandas as pd
import numpy as np
df = pd.read_csv("uber.csv")
Out[3]:
   Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
0    24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354         -73.999512         40.723217                1
1    27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225         -73.994710         40.750325                1
2    44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770         -73.962565         40.772647                1
3    25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844         -73.965316         40.803349                3
4    17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085         -73.973082         40.761247                5
<class 'pandas.core.frame.DataFrame'>
Out[5]:
Index(['Unnamed: 0', 'key', 'fare_amount', 'pickup_datetime',
       'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'passenger_count'],
      dtype='object')
In [6]: df = df.drop(['Unnamed: 0', 'key'], axis=1) # Drop the 'Unnamed: 0' and 'key' columns, which are not needed
In [7]: df.head()
Out[8]: (200000, 7)
Out[9]:
fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
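The dtype of pickup_datetime changes from object (Out[9]) to datetime64[ns, UTC] (Out[11]), but the cell that performs the conversion is missing here. A minimal sketch of one way to get that result, assuming pd.to_datetime was used; the two-row frame `demo` is a hypothetical stand-in for df.

```python
import pandas as pd

# Hypothetical two-row stand-in for the Uber frame (values copied from the head() output)
demo = pd.DataFrame({
    "pickup_datetime": ["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"],
    "fare_amount": [7.5, 7.7],
})

# Assumption: the lost cell used pd.to_datetime; utc=True yields a timezone-aware column
demo["pickup_datetime"] = pd.to_datetime(demo["pickup_datetime"], utc=True)

print(demo.dtypes)
```

Once the column is a datetime type, the .dt accessor (hour, day, month, year, dayofweek) used in the later cells becomes available.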
In [11]: df.dtypes
Out[11]:
fare_amount float64
pickup_datetime datetime64[ns, UTC]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
Out[12]:
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
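df['dropoff_longitude'] = df['dropoff_longitude'].fillna(df['dropoff_longitude'].median())
# Assumed: the matching line for dropoff_latitude was lost (Out[14] shows no nulls); median is a guess
df['dropoff_latitude'] = df['dropoff_latitude'].fillna(df['dropoff_latitude'].median())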
In [14]: df.isnull().sum()
Out[14]:
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
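df = df.assign(hour = df.pickup_datetime.dt.hour,  # first line restored from the 'hour' column in Out[16]
               day = df.pickup_datetime.dt.day,
               month = df.pickup_datetime.dt.month,
               year = df.pickup_datetime.dt.year,
               dayofweek = df.pickup_datetime.dt.dayofweek)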
In [16]: df.head()
Out[16]: fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek
Here we use the Haversine formula to calculate the distance travelled on each journey from the longitude and latitude values of the pickup and drop-off points.
Haversine formula
hav(θ) = sin²(θ/2)
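For reference, the haversine definition above leads to the following great-circle distance. The symbols are notation introduced here, not taken from the original: φ is latitude and λ is longitude (both in radians), and r is the Earth's radius. The value r = 6371 km matches the constant used in the code.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2-\varphi_1}{2}\right)
   + \cos\varphi_1\,\cos\varphi_2\,\sin^2\!\left(\frac{\lambda_2-\lambda_1}{2}\right),\\
d &= 2r\,\arcsin\!\left(\sqrt{a}\right), \qquad r = 6371\ \text{km}.
\end{aligned}
```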
# function to calculate the travel distance from the longitudes and latitudes
travel_dist = []
long1,lati1,long2,lati2 = map(radians,[longitude1[pos],latitude1[pos],longitude2[pos],latitude2[pos]])
c = 2 * asin(sqrt(a))*6371
travel_dist.append(c)
return travel_dist
df['dropoff_longitude'].to_numpy(),
df['dropoff_latitude'].to_numpy()
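The fragments above are missing the function header, the loop, and the line that computes a. A self-contained sketch of what a complete version could look like follows. The function name `haversine_km` and the constant name `EARTH_RADIUS_KM` are hypothetical; the variable names inside the loop follow the surviving lines.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # same constant as in the fragment above


def haversine_km(longitude1, latitude1, longitude2, latitude2):
    """Return a list of great-circle distances (km) for paired coordinate sequences."""
    travel_dist = []
    for pos in range(len(longitude1)):
        # Convert degrees to radians before applying the trigonometric functions
        long1, lati1, long2, lati2 = map(
            radians,
            [longitude1[pos], latitude1[pos], longitude2[pos], latitude2[pos]],
        )
        dist_long = long2 - long1
        dist_lati = lati2 - lati1
        a = sin(dist_lati / 2) ** 2 + cos(lati1) * cos(lati2) * sin(dist_long / 2) ** 2
        c = 2 * asin(sqrt(a)) * EARTH_RADIUS_KM
        travel_dist.append(c)
    return travel_dist


# Coordinates of the first row shown in the head() output
print(haversine_km([-73.999817], [40.738354], [-73.999512], [40.723217]))
```

With the notebook's frame, the four arguments would be the pickup and drop-off longitude and latitude columns converted with .to_numpy(), and the result would be stored in dist_travel_km.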
In [21]: df.head()
Out[21]: fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km
df = df.drop('pickup_datetime',axis=1)
In [23]: df.head()
Out[23]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km
Out[24]: [Figure: box plots, one subplot per column: fare_amount, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, hour, day, month, year, dayofweek, dist_travel_km]
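def remove_outlier(df1, col):
    # Function header and capping step restored from context (both were lost in extraction)
    Q1 = df1[col].quantile(0.25)
    Q3 = df1[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_whisker = Q1-1.5*IQR
    upper_whisker = Q3+1.5*IQR
    # Assumed: values are capped at the whiskers; y_test below tops out at 22.25, consistent with capping
    df1[col] = np.clip(df1[col], lower_whisker, upper_whisker)
    return df1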
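def treat_outliers_all(df1, col_list):  # placeholder name; the original header was lost
    for c in col_list:
        df1 = remove_outlier(df1, c)  # operate on the frame passed in
    return df1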
In [27]: df.plot(kind = "box",subplots = True,layout = (7,2),figsize=(15,20)) #Boxplot shows that dataset is free from outliers
Out[27]: [Figure: box plots, one subplot per column: fare_amount, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, hour, day, month, year, dayofweek, dist_travel_km]
In [28]: # Uber trips do not run longer than 130 km, so restrict the travel distance to that range
In [29]: # Find incorrect coordinates: latitude outside [-90, 90] or longitude outside [-180, 180]
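Only the comments of these two cells survive, so the exact filtering code is unknown. The sketch below shows one way such filters could be written with boolean masks. The frame `demo` and its values are hypothetical; the 130 km limit and the coordinate bounds come from the comments.

```python
import pandas as pd

# Hypothetical rows: one valid trip, one over-long trip, one with an impossible latitude
demo = pd.DataFrame({
    "pickup_longitude": [-73.99, -73.98, -73.97],
    "pickup_latitude": [40.73, 40.75, 401.08],
    "dropoff_longitude": [-73.96, -73.95, -73.94],
    "dropoff_latitude": [40.77, 40.76, 40.74],
    "dist_travel_km": [4.2, 8600.0, 3.1],
})

# Keep trips of at most 130 km (threshold taken from the comment above)
demo = demo[demo["dist_travel_km"] <= 130]

# Keep rows whose latitudes lie in [-90, 90] and longitudes in [-180, 180]
lat_ok = demo[["pickup_latitude", "dropoff_latitude"]].abs().le(90).all(axis=1)
lon_ok = demo[["pickup_longitude", "dropoff_longitude"]].abs().le(180).all(axis=1)
demo = demo[lat_ok & lon_ok]

print(demo)
```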
In [31]: df.head()
Out[31]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km
In [32]: df.isnull().sum()
Out[32]:
fare_amount 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
hour 0
day 0
month 0
year 0
dayofweek 0
dist_travel_km 0
dtype: int64
Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x8d8af2a080>
In [35]: corr
Out[35]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km
fare_amount 1.000000 0.154069 -0.110842 0.218675 -0.125898 0.015778 -0.023623 0.004534 0.030817 0.141277 0.013652 0.844374
pickup_longitude 0.154069 1.000000 0.259497 0.425619 0.073290 -0.013213 0.011579 -0.003204 0.001169 0.010198 -0.024652 0.098094
pickup_latitude -0.110842 0.259497 1.000000 0.048889 0.515714 -0.012889 0.029681 -0.001553 0.001562 -0.014243 -0.042310 -0.046812
dropoff_longitude 0.218675 0.425619 0.048889 1.000000 0.245667 -0.009303 -0.046558 -0.004007 0.002391 0.011346 -0.003336 0.186531
dropoff_latitude -0.125898 0.073290 0.515714 0.245667 1.000000 -0.006308 0.019783 -0.003479 -0.001193 -0.009603 -0.031919 -0.038900
passenger_count 0.015778 -0.013213 -0.012889 -0.009303 -0.006308 1.000000 0.020274 0.002712 0.010351 -0.009749 0.048550 0.009709
hour -0.023623 0.011579 0.029681 -0.046558 0.019783 0.020274 1.000000 0.004677 -0.003926 0.002156 -0.086947 -0.038366
day 0.004534 -0.003204 -0.001553 -0.004007 -0.003479 0.002712 0.004677 1.000000 -0.017360 -0.012170 0.005617 0.003062
month 0.030817 0.001169 0.001562 0.002391 -0.001193 0.010351 -0.003926 -0.017360 1.000000 -0.115859 -0.008786 0.011628
year 0.141277 0.010198 -0.014243 0.011346 -0.009603 -0.009749 0.002156 -0.012170 -0.115859 1.000000 0.006113 0.024278
dayofweek 0.013652 -0.024652 -0.042310 -0.003336 -0.031919 0.048550 -0.086947 0.005617 -0.008786 0.006113 1.000000 0.027053
dist_travel_km 0.844374 0.098094 -0.046812 0.186531 -0.038900 0.009709 -0.038366 0.003062 0.011628 0.024278 0.027053 1.000000
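A matrix like the one above is what DataFrame.corr() returns (Pearson correlation by default); the cell that built `corr` is missing, so that is an assumption about how it was computed. The sketch below ranks features by the strength of their correlation with fare_amount, using a small hypothetical frame.

```python
import pandas as pd

# Hypothetical toy data: fare grows linearly with distance, passenger_count is unrelated
demo = pd.DataFrame({
    "fare_amount": [5.0, 7.5, 10.0, 12.5, 15.0],
    "dist_travel_km": [1.0, 2.0, 3.0, 4.0, 5.0],
    "passenger_count": [1, 3, 2, 3, 1],
})

corr = demo.corr()  # Pearson correlation by default

# Rank the other columns by absolute correlation with the target
ranking = corr["fare_amount"].drop("fare_amount").abs().sort_values(ascending=False)
print(ranking)
```

In the matrix above, the same ranking puts dist_travel_km first (0.844374), far ahead of every other column.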
Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x8d8affc588>
In [38]: y = df['fare_amount']
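The cells that define X and split the data are missing. A minimal sketch follows, assuming X holds every column except fare_amount and that train_test_split from scikit-learn was used. The frame `demo`, test_size=0.2, and random_state=42 are illustrative choices, not values recovered from the original.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned frame
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "dist_travel_km": rng.uniform(0.5, 20.0, size=100),
    "passenger_count": rng.integers(1, 6, size=100),
})
demo["fare_amount"] = 2.5 + 1.8 * demo["dist_travel_km"]

X = demo.drop("fare_amount", axis=1)  # assumption: all remaining columns are features
y = demo["fare_amount"]

# test_size and random_state are illustrative, not taken from the original notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```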
Linear Regression
In [40]: from sklearn.linear_model import LinearRegression
regression = LinearRegression()
In [41]: regression.fit(X_train,y_train)
Out[42]: 2809.192377415925
Out[43]:
5.40456388e-02, 9.46950748e-03, 1.66720620e-03, 5.40917698e-02,
In [45]: print(prediction)
5.38487972]
In [46]: y_test
Out[46]:
16850 8.50
181076 4.10
70798 9.30
87421 12.90
169443 22.25
18976 11.00
50921 13.70
199564 14.50
125215 5.30
67510 8.50
85217 22.25
156903 21.50
116795 4.10
112179 16.90
124459 3.70
173299 22.25
51448 19.70
99502 22.25
174467 10.90
78880 20.50
26798 22.25
38501 4.50
63091 12.90
171207 22.25
142238 8.50
101106 7.30
120177 4.50
154585 14.50
75840 5.50
85918 14.00
...
104227 10.10
14172 19.70
49985 3.70
183045 6.50
11927 12.90
93684 4.50
101795 13.70
21444 6.10
85147 8.50
81311 8.00
157686 11.70
194074 6.50
132558 10.50
132616 11.70
188536 5.70
179629 8.90
11277 3.70
147880 7.30
116553 5.70
157394 6.50
103519 13.30
41348 12.90
12608 4.50
6820 5.50
84612 5.00
168836 3.70
39719 21.00
124536 4.90
90432 22.10
12543 4.90
Metrics evaluation using R2, Mean Squared Error, and Root Mean Squared Error
In [47]: from sklearn.metrics import r2_score
In [48]: r2_score(y_test,prediction)
Out[48]: 0.7471032194200018
In [51]: MSE
Out[51]: 7.464818887848474
In [53]: RMSE
Out[53]: 2.7321820744321696
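The cells that compute MSE and RMSE are not shown. The sketch below shows how the three metrics are commonly obtained with scikit-learn, assuming mean_squared_error for MSE and a square root for RMSE. The data is hypothetical and noise-free, so the fit is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical noise-free data: y = 3 + 2x, so the fitted line should match exactly
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 + 2.0 * X_demo.ravel()

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

R2 = r2_score(y_demo, pred)
MSE = mean_squared_error(y_demo, pred)
RMSE = np.sqrt(MSE)  # RMSE is the square root of MSE
print(model.intercept_, model.coef_, R2, MSE, RMSE)
```

The reported values are consistent with this relationship: 2.7321820744321696 squared is approximately 7.4648, the MSE shown above.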
In [55]: rf = RandomForestRegressor(n_estimators=100) # n_estimators is the number of trees in the forest
In [56]: rf.fit(X_train,y_train)
Out[56]:
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
verbose=0, warm_start=False)
In [58]: y_pred
In [60]: R2_Random
Out[60]: 0.8024361566950065
In [64]: MSE_Random
Out[64]: 5.831542440662031
In [65]: RMSE_Random
Out[65]: 2.4148586792319815
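To make the comparison concrete, the sketch below fits both models on the same split and compares their R2 scores. RandomForestRegressor lives in sklearn.ensemble; the import cell is missing above. The data is hypothetical and deliberately non-linear, and random_state=0 is added only to make the sketch reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical non-linear data, where a forest is expected to beat a straight line
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3.0, 3.0, size=(500, 1))
y_demo = X_demo.ravel() ** 2 + rng.normal(0.0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_lin = r2_score(y_test, lin.predict(X_test))
r2_rf = r2_score(y_test, rf.predict(X_test))
print(r2_lin, r2_rf)
```

On the Uber data the gap is smaller but points the same way: R2 of 0.8024 and RMSE of 2.4149 for the random forest against 0.7471 and 2.7322 for linear regression.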