
SPPUML1

Machine learning lab assignment no1

Uploaded by

kanaseaditya800

Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks:

1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores, such as R2 and RMSE.

Dataset link: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset

In [ ]:
# Name: Kanase Aditya Madhukar
# Roll no: 2441059
# Batch: D
# Assignment no: 1

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

In [46]:
df = pd.read_csv('uber.csv')

In [47]:
df

Out[47]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  ...
0         24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817  ...
1         27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355  ...
2         44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043  ...
3         25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124  ...
4         17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023  ...
...            ...                            ...          ...                      ...               ...  ...
199995    42598914   2012-10-28 10:49:00.00000053          3.0  2012-10-28 10:49:00 UTC        -73.987042  ...
199996    16382965    2014-03-14 01:09:00.0000008          7.5  2014-03-14 01:09:00 UTC        -73.984722  ...
199997    27804658   2009-06-29 00:42:00.00000078         30.9  2009-06-29 00:42:00 UTC        -73.986017  ...
199998    20259894    2015-05-20 14:56:25.0000004         14.5  2015-05-20 14:56:25 UTC        -73.997124  ...
199999    11951496   2010-05-15 04:08:00.00000076         14.1  2010-05-15 04:08:00 UTC        -73.984395  ...

200000 rows × 9 columns

In [48]:
df.head()

Out[48]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_lat...
0         24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817       40.73...
1         27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355       40.72...
2         44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043       40.74...
3         25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124       40.79...
4         17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023       40.74...

In [49]:
df.shape

Out[49]: (200000, 9)

In [50]:
df.tail()

Out[50]:
        Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  ...
199995    42598914   2012-10-28 10:49:00.00000053          3.0  2012-10-28 10:49:00 UTC        -73.987042  ...
199996    16382965    2014-03-14 01:09:00.0000008          7.5  2014-03-14 01:09:00 UTC        -73.984722  ...
199997    27804658   2009-06-29 00:42:00.00000078         30.9  2009-06-29 00:42:00 UTC        -73.986017  ...
199998    20259894    2015-05-20 14:56:25.0000004         14.5  2015-05-20 14:56:25 UTC        -73.997124  ...
199999    11951496   2010-05-15 04:08:00.00000076         14.1  2010-05-15 04:08:00 UTC        -73.984395  ...

In [51]:
df.describe()

Out[51]:
         Unnamed: 0    fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  ...
count  2.000000e+05  200000.000000     200000.000000    200000.000000      199999.000000  ...
mean   2.771250e+07      11.359955        -72.527638        39.935885         -72.525292  ...
std    1.601382e+07       9.901776         11.437787         7.720539          13.117408  ...
min    1.000000e+00     -52.000000      -1340.648410       -74.015515       -3356.666300  ...
25%    1.382535e+07       6.000000        -73.992065        40.734796         -73.991407  ...
50%    2.774550e+07       8.500000        -73.981823        40.752592         -73.980093  ...
75%    4.155530e+07      12.500000        -73.967154        40.767158         -73.963658  ...
max    5.542357e+07     499.000000         57.418457      1644.421482        1153.572603  ...
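The summary statistics above include values that cannot be real trips: a minimum fare of -52, longitudes below -180, and latitudes above 90. One possible way to screen such rows is sketched here. It is not part of the original notebook: the helper name `drop_invalid_rows`, the choice to require a strictly positive fare, and the tiny sample frame (including its drop-off values) are assumptions made for illustration.

```python
import pandas as pd


def drop_invalid_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose coordinates are geographically valid and whose fare is positive.

    Hypothetical helper, not taken from the notebook.
    """
    valid = frame["fare_amount"] > 0
    # Latitude must lie in [-90, 90].
    for col in ["pickup_latitude", "dropoff_latitude"]:
        valid &= frame[col].between(-90, 90)
    # Longitude must lie in [-180, 180].
    for col in ["pickup_longitude", "dropoff_longitude"]:
        valid &= frame[col].between(-180, 180)
    return frame[valid]


# Small made-up sample: row 1 has a negative fare, row 2 an impossible longitude.
sample = pd.DataFrame({
    "fare_amount": [7.5, -52.0, 12.9],
    "pickup_longitude": [-73.999817, -73.994355, -1340.648410],
    "pickup_latitude": [40.73, 40.72, 40.74],
    "dropoff_longitude": [-73.99, -73.98, -73.97],
    "dropoff_latitude": [40.72, 40.75, 40.76],
})

clean_sample = drop_invalid_rows(sample)
print(len(clean_sample), "of", len(sample), "rows kept")
```

In the notebook itself this would be applied as `df = drop_invalid_rows(df)` before modelling, which would change the row counts shown in later outputs.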

In [52]:
df.isna().sum()

Out[52]:
Unnamed: 0           0
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    1
dropoff_latitude     1
passenger_count      0
dtype: int64

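In [53]:
# Fill the single missing drop-off longitude/latitude pair with 0.
df.fillna(0, inplace=True)

In [54]:
# Parse the pickup timestamps into datetime values.
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])

# Count missing values per column (all zero here because of the fillna above).
missing_values = df.isnull().sum()
print("Missing values in the dataset:")
print(missing_values)
# No rows are dropped at this point: no NaNs remain after fillna(0).
df.dropna(inplace=True)
missing_values = df.isnull().sum()
print("Missing values after handling:")
print(missing_values)
# Box plot of the fares to visualise outliers.
sns.boxplot(x=df["fare_amount"])
plt.show()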

Missing values in the dataset:
Unnamed: 0           0
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64
Missing values after handling:
Unnamed: 0           0
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

In [55]:
Q1 = df["fare_amount"].quantile(0.25)
Q3 = df["fare_amount"].quantile(0.75)
IQR = Q3 - Q1
threshold = 1.5
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
data_no_outliers = df[(df["fare_amount"] >= lower_bound) & (df["fare_am
sns.boxplot(x=data_no_outliers["fare_amount"])
plt.show()
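The cell above applies the interquartile-range (IQR) rule to `fare_amount`, although its filter line is cut off in this copy. A self-contained sketch of the same rule follows. It assumes the cut-off condition is an upper-bound check against `upper_bound`, which the source does not show; the helper name `iqr_bounds` and the short fare list are made up for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, threshold: float = 1.5):
    """Return (lower, upper) fences at Q1 - threshold*IQR and Q3 + threshold*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


# Made-up fares with two obvious outliers (-52.0 and 499.0).
fares = pd.Series([6.0, 7.5, 8.5, 10.0, 12.5, 499.0, -52.0])

lower, upper = iqr_bounds(fares)
# Assumed form of the truncated filter: keep values inside both fences.
kept = fares[(fares >= lower) & (fares <= upper)]

print("fences:", lower, upper)
print("kept:", kept.tolist())
```

Note that in the notebook the filtered frame `data_no_outliers` is only plotted; the models further down are trained on the unfiltered `df`.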
In [56]:
df.plot(kind="box", subplots=True, layout=(7, 2), figsize=(15, 20))

Out[56]:
Unnamed: 0              AxesSubplot(0.125,0.787927;0.352273x0.0920732)
fare_amount          AxesSubplot(0.547727,0.787927;0.352273x0.0920732)
pickup_longitude        AxesSubplot(0.125,0.677439;0.352273x0.0920732)
pickup_latitude      AxesSubplot(0.547727,0.677439;0.352273x0.0920732)
dropoff_longitude       AxesSubplot(0.125,0.566951;0.352273x0.0920732)
dropoff_latitude     AxesSubplot(0.547727,0.566951;0.352273x0.0920732)
passenger_count         AxesSubplot(0.125,0.456463;0.352273x0.0920732)
dtype: object

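In [57]:
# Create the figure first so the heatmap is drawn on it; calling plt.figure()
# after sns.heatmap() only opens a second, empty figure.
plt.figure(figsize=(8, 8))
# Correlate numeric columns only (key and pickup_datetime are not numeric).
correlation_matrix = df.select_dtypes(include="number").corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.1f')
plt.show()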
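Since the target is `fare_amount`, it can help to read the correlation check as a ranked list against that one column rather than as a full matrix. The sketch below is an illustration only: the helper name `fare_correlations` and the four-row sample are hypothetical, and the resulting numbers say nothing about the real dataset.

```python
import pandas as pd


def fare_correlations(frame: pd.DataFrame, target: str = "fare_amount") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations rank high too.
    return corr.sort_values(key=abs, ascending=False)


# Made-up rows; the string column "key" is ignored by select_dtypes.
sample = pd.DataFrame({
    "key": ["a", "b", "c", "d"],
    "fare_amount": [5.0, 10.0, 15.0, 20.0],
    "passenger_count": [1, 2, 1, 2],
    "pickup_longitude": [-74.00, -73.99, -73.98, -73.97],
})

ranked = fare_correlations(sample)
print(ranked)
```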

In [58]:
X = df[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dr
y = df['fare_amount']

y

Out[58]:
0          7.5
1          7.7
2         12.9
3          5.3
4         16.0
          ...
199995     3.0
199996     7.5
199997    30.9
199998    14.5
199999    14.1
Name: fare_amount, Length: 200000, dtype: float64

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2

In [60]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

Out[60]: LinearRegression()

In [61]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

Out[61]: RandomForestRegressor(random_state=42)

In [62]:
# Predict fares for the held-out test set with both models.
y_pred_lr = lr_model.predict(X_test)
print("Linear Model:", y_pred_lr)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Model:", y_pred_rf)

Linear Model: [11.29485003 12.16430284 11.55413146 ... 11.35862539 11.29405912 11.29346213]
Random Forest Model: [ 6.939 11.33572265 7.485 ... 4.773 6.357 7.639 ]
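Task 5 asks for scores such as R2 and RMSE, but the notebook stops at printing the raw predictions. The sketch below shows one way the comparison could be computed with `sklearn.metrics`. The helper name `evaluate`, the synthetic data, and the `random_state` values are assumptions for illustration, so the printed scores do not describe the Uber dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return R2 and RMSE of a fitted regressor on held-out data."""
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return {"r2": float(r2_score(y_test, pred)), "rmse": rmse}


# Synthetic stand-in data: a linear signal in the first feature plus unit noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = 2.5 + 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores = {
    "linear": evaluate(LinearRegression().fit(X_tr, y_tr), X_te, y_te),
    "forest": evaluate(
        RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr),
        X_te,
        y_te,
    ),
}

for name, result in scores.items():
    print(f"{name}: R2={result['r2']:.3f}  RMSE={result['rmse']:.3f}")
```

In the notebook, the same helper could be called as `evaluate(lr_model, X_test, y_test)` and `evaluate(rf_model, X_test, y_test)`. A higher R2 and a lower RMSE indicate the better fit.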
