
predict-the-price-of-the-uber-ride

October 19, 2024

1 ASSIGNMENT 01
Predict the price of an Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks: 1. Pre-process the dataset. 2. Identify outliers. 3. Check the correlation.
4. Implement linear regression and random forest regression models. 5. Evaluate the models and
compare their respective scores, such as R2, RMSE, etc.
[ ]: # importing necessary libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

[ ]: #loading the dataset

df = pd.read_csv("C://Users//91772//Desktop//ML assigns//uber.csv")

[ ]: df.head()

[ ]:    Unnamed: 0                            key  fare_amount  \
     0    24238194    2015-05-07 19:52:06.0000003          7.5
     1    27835199    2009-07-17 20:04:56.0000002          7.7
     2    44984355   2009-08-24 21:45:00.00000061         12.9
     3    25894730    2009-06-26 08:22:21.0000001          5.3
     4    17610152  2014-08-28 17:47:00.000000188         16.0

                pickup_datetime  pickup_longitude  pickup_latitude  \
     0  2015-05-07 19:52:06 UTC        -73.999817        40.738354
     1  2009-07-17 20:04:56 UTC        -73.994355        40.728225
     2  2009-08-24 21:45:00 UTC        -74.005043        40.740770
     3  2009-06-26 08:22:21 UTC        -73.976124        40.790844
     4  2014-08-28 17:47:00 UTC        -73.925023        40.744085

        dropoff_longitude  dropoff_latitude  passenger_count
     0         -73.999512         40.723217                1
     1         -73.994710         40.750325                1
     2         -73.962565         40.772647                1
     3         -73.965316         40.803349                3
     4         -73.973082         40.761247                5

[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB

[ ]: df.shape

[ ]: (200000, 9)

2 1. Pre-process the dataset.


[ ]: df.isnull().sum()

[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64

[ ]: #dropping rows with missing values

df.dropna(inplace = True)

[ ]: df.isnull().sum()

[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64

[ ]: #dropping unwanted columns

df.drop(labels='Unnamed: 0',axis=1,inplace=True)
df.drop(labels='key',axis=1,inplace=True)

[ ]: df.head()

[ ]: fare_amount pickup_datetime pickup_longitude pickup_latitude \


0 7.5 2015-05-07 19:52:06 UTC -73.999817 40.738354
1 7.7 2009-07-17 20:04:56 UTC -73.994355 40.728225
2 12.9 2009-08-24 21:45:00 UTC -74.005043 40.740770
3 5.3 2009-06-26 08:22:21 UTC -73.976124 40.790844
4 16.0 2014-08-28 17:47:00 UTC -73.925023 40.744085

dropoff_longitude dropoff_latitude passenger_count


0 -73.999512 40.723217 1
1 -73.994710 40.750325 1
2 -73.962565 40.772647 1
3 -73.965316 40.803349 3
4 -73.973082 40.761247 5

[ ]: df.dtypes

[ ]: fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

[ ]: df.describe()

[ ]: fare_amount pickup_longitude pickup_latitude dropoff_longitude \


count 199999.000000 199999.000000 199999.000000 199999.000000

mean 11.359892 -72.527631 39.935881 -72.525292
std 9.901760 11.437815 7.720558 13.117408
min -52.000000 -1340.648410 -74.015515 -3356.666300
25% 6.000000 -73.992065 40.734796 -73.991407
50% 8.500000 -73.981823 40.752592 -73.980093
75% 12.500000 -73.967154 40.767158 -73.963658
max 499.000000 57.418457 1644.421482 1153.572603

dropoff_latitude passenger_count
count 199999.000000 199999.000000
mean 39.923890 1.684543
std 6.794829 1.385995
min -881.985513 0.000000
25% 40.733823 1.000000
50% 40.753042 1.000000
75% 40.768001 2.000000
max 872.697628 208.000000

3 2. Identify outliers.
OUTLIER: an observation that deviates significantly from the rest of the data.
[ ]: # data visualization: plotting distribution plots
# note: sns.distplot is deprecated in recent seaborn releases; sns.histplot or
# sns.displot are the modern replacements

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
sns.distplot(df['fare_amount'])

[ ]: <AxesSubplot:xlabel='fare_amount', ylabel='Density'>

[ ]: sns.distplot(df['pickup_latitude'])

[ ]: <AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>

[ ]: sns.distplot(df['pickup_longitude'])

[ ]: <AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>

[ ]: sns.distplot(df['dropoff_longitude'])

[ ]: <AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>

[ ]: sns.distplot(df['dropoff_latitude'])

[ ]: <AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>

[ ]: # creating a function to identify outliers using the IQR rule

def find_outliers_IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[(df < (q1 - 1.5 * IQR)) | (df > (q3 + 1.5 * IQR))]
    return outliers

[ ]: # getting outlier details for column "fare_amount" using the above function

outliers = find_outliers_IQR(df["fare_amount"])
print("number of outliers: " + str(len(outliers)))
print("max outlier value: " + str(outliers.max()))
print("min outlier value: " + str(outliers.min()))
outliers

number of outliers: 17166


max outlier value: 499.0
min outlier value: -52.0

[ ]: 6 24.50
30 25.70
34 39.50
39 29.00
48 56.80

199976 49.70
199977 43.50
199982 57.33
199985 24.00
199997 30.90
Name: fare_amount, Length: 17166, dtype: float64

[ ]: # you can also pass multiple columns to the function (here "passenger_count" and "fare_amount")

outliers = find_outliers_IQR(df[["passenger_count", "fare_amount"]])
outliers

[ ]: passenger_count fare_amount
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 5.0 NaN

… … …
199995 NaN NaN
199996 NaN NaN
199997 NaN 30.9
199998 NaN NaN
199999 NaN NaN

[199999 rows x 2 columns]

[ ]: # upper and lower limits which can be used for capping outliers

upper_limit = df['fare_amount'].mean() + 3*df['fare_amount'].std()
print(upper_limit)
lower_limit = df['fare_amount'].mean() - 3*df['fare_amount'].std()
print(lower_limit)

41.06517154774204
-18.3453884488253
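
These limits are computed but never actually applied in the notebook. A minimal sketch of capping fare_amount to them with pandas' clip (this step is an addition, not part of the original run):

[ ]: # cap extreme fares to the mean ± 3*std limits computed above
df['fare_amount'] = df['fare_amount'].clip(lower=lower_limit, upper=upper_limit)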

4 3. Check the correlation.


[ ]: #creating a correlation matrix

corrMatrix = df.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
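
Note: on pandas 2.x, df.corr() raises a TypeError here because df still contains the non-numeric pickup_datetime column (older versions silently dropped it). If that happens, restrict the computation to numeric columns:

[ ]: corrMatrix = df.corr(numeric_only=True)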

[ ]: # splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year", "weekday"
# for a simplified view

import calendar

# pickup_datetime is still a plain string column at this point, so convert it
# first; without this, the .apply calls below fail
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

df['day'] = df['pickup_datetime'].apply(lambda x: x.day)
df['hour'] = df['pickup_datetime'].apply(lambda x: x.hour)
df['month'] = df['pickup_datetime'].apply(lambda x: x.month)
df['year'] = df['pickup_datetime'].apply(lambda x: x.year)
df['weekday'] = df['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])

df.drop(['pickup_datetime'], axis=1, inplace=True)

(In the original run, these apply calls were executed on the raw string column and raised AttributeError: 'str' object has no attribute 'day'; converting the column with pd.to_datetime first, as above, avoids the error.)

[ ]: # label encoding (categorical to numerical)

df.weekday = df.weekday.map({'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
                             'Thursday': 4, 'Friday': 5, 'Saturday': 6})

[ ]: df.head()

[ ]: df.info()

[ ]: #splitting the data into train and test

from sklearn.model_selection import train_test_split

[ ]: #independent variables (x)

x=df.drop("fare_amount", axis=1)
x

[ ]: #dependent variable (y)

y=df["fare_amount"]

[ ]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)

[ ]: x_train.head()

[ ]: x_test.head()

[ ]: y_train.head()

[ ]: y_test.head()

[ ]: print(x_train.shape)
print(x_test.shape)
print(y_test.shape)
print(y_train.shape)

5 4. Implement linear regression and random forest regression models.

6 5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
[ ]: #Linear Regression

from sklearn.linear_model import LinearRegression


lrmodel=LinearRegression()
lrmodel.fit(x_train, y_train)

[ ]: predictedvalues = lrmodel.predict(x_test)

[ ]: #Calculating the value of RMSE for Linear Regression

from sklearn.metrics import mean_squared_error
lrmodelrmse = np.sqrt(mean_squared_error(predictedvalues, y_test))
print("RMSE value for Linear regression is", lrmodelrmse)

[ ]: #Random Forest Regression

from sklearn.ensemble import RandomForestRegressor


rfrmodel = RandomForestRegressor(n_estimators=100, random_state=101)

[ ]: rfrmodel.fit(x_train,y_train)
rfrmodel_pred= rfrmodel.predict(x_test)

[ ]: #Calculating the value of RMSE for Random Forest Regression

rfrmodel_rmse=np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)

[ ]: rfrmodel_pred.shape

7 Predict the price of the Uber ride


[ ]: test = pd.read_csv("https://raw.githubusercontent.com/piyushpandey758/Uber-Fare-Prediction/master/testt.csv")

[ ]: test.head()

[ ]: # dropping index artefacts and the key column
test.drop(['Unnamed: 0.1.1', 'Unnamed: 0', 'Unnamed: 0.1', 'key'], axis=1, inplace=True)

[ ]: test.isnull().sum()

[ ]: #converting datatype of column "pickup_datetime" from object to DateTime

test["pickup_datetime"] = pd.to_datetime(test["pickup_datetime"])

[ ]: # splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year", "weekday"
# for a simplified view

test['day'] = test['pickup_datetime'].apply(lambda x: x.day)
test['hour'] = test['pickup_datetime'].apply(lambda x: x.hour)
test['month'] = test['pickup_datetime'].apply(lambda x: x.month)
test['year'] = test['pickup_datetime'].apply(lambda x: x.year)
test['weekday'] = test['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])

# label encoding weekdays
test.weekday = test.weekday.map({'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
                                 'Thursday': 4, 'Friday': 5, 'Saturday': 6})

test.drop(['pickup_datetime'], axis=1, inplace=True)

test.head(5)

[ ]: #Prediction!

rfrmodel_pred= rfrmodel.predict(test)

[ ]: df_pred = pd.DataFrame(rfrmodel_pred)
df_pred

[ ]: #to_csv() function exports the DataFrame to CSV format

df_pred.to_csv('pred.csv')

[ ]:

ps-05-gradient-descent

October 19, 2024

1 ASSIGNMENT NO 4: Implement Gradient Descent Algorithm to find the local minima of a function
For example, find the local minima of the function y=(x+3)² starting from the point x=2.
Gradient descent is an optimisation algorithm: a minimization method that follows the negative
of the gradient toward a minimum of the target function, using the update rule
x_new = x − alpha · f′(x).
Inputs of the gradient descent algorithm: 1. Target (objective) function. 2. Alpha, i.e. step
size or learning rate. 3. Starting point. 4. Iteration cap.
[1]: import numpy as np
import sympy as sym  # library for symbolic math
from matplotlib import pyplot

[2]: def objective(x):
    return (x + 3)**2

[3]: def derivative(x):
    return 2*(x + 3)

[4]: def gradient_descent(alpha, start, max_iter):
    x_list = list()
    x = start
    x_list.append(x)
    for i in range(max_iter):
        gradient = derivative(x)
        x = x - (alpha * gradient)
        x_list.append(x)
    return x_list
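
A quick added sanity check: each update multiplies (x + 3) by (1 − 2·alpha) = 0.8, so starting from x = 2 the iterates should converge toward the minimum at x = −3:

[ ]: X = gradient_descent(0.1, 2, 30)
print(X[-1])  # ≈ -2.9938, i.e. close to the local minimum at x = -3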

[5]: x = sym.symbols('x')
expr = (x + 3)**2.0
grad = sym.Derivative(expr, x)
print("{}".format(grad.doit()))
grad.doit().subs(x, 2)

2.0*(x + 3)**1.0

[5]: 10.0
[6]: def gradient_descent1(expr, alpha, start, max_iter):
    x_list = list()
    x = sym.symbols('x')
    grad = sym.Derivative(expr, x).doit()
    x_val = start
    x_list.append(x_val)
    for i in range(max_iter):
        gradient = grad.subs(x, x_val)
        x_val = x_val - (alpha * gradient)
        x_list.append(x_val)
    return x_list

[7]: alpha = 0.1     # step size
start = 2       # starting point
max_iter = 30   # limit on iterations
x = sym.symbols('x')
expr = (x + 3)**2   # target function

[8]: x_cordinate = np.linspace(-15, 15, 100)
pyplot.plot(x_cordinate, objective(x_cordinate))
pyplot.plot(2, objective(2), 'ro')

[8]: [<matplotlib.lines.Line2D at 0x215d1886280>]

[9]: X = gradient_descent(alpha,start,max_iter)

x_cordinate = np.linspace(-5,5,100)
pyplot.plot(x_cordinate,objective(x_cordinate))

X_arr = np.array(X)
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()

[10]: X = gradient_descent1(expr, alpha, start, max_iter)
X_arr = np.array(X)

x_cordinate = np.linspace(-5, 5, 100)
pyplot.plot(x_cordinate, objective(x_cordinate))
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()

k-nearest-neighbors-algorithm-on-diabetes-csv

October 19, 2024

1 ASSIGNMENT NO 5
Problem statement: Implement the K-Nearest Neighbors algorithm on the diabetes.csv dataset.
Compute the confusion matrix, accuracy, error rate, precision and recall on the given dataset.
[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score

[2]: df=pd.read_csv("C://Users//91772//Desktop//ML assigns//diabetes.csv")

[3]: df.head()

[3]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

Pedigree Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[4]: df.shape

[4]: (768, 9)

[5]: df.describe()

[5]: Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000

BMI Pedigree Age Outcome


count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000

[6]: # replace zeros with the column mean in columns where zero is not a valid value
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Insulin"]
for column in zero_not_accepted:
    df[column] = df[column].replace(0, np.nan)
    mean = int(df[column].mean(skipna=True))
    df[column] = df[column].replace(np.nan, mean)

[7]: df["Glucose"]

[7]: 0 148.0
1 85.0
2 183.0
3 89.0
4 137.0

763 101.0
764 122.0
765 121.0
766 126.0
767 93.0
Name: Glucose, Length: 768, dtype: float64

[8]: # split dataset
X = df.iloc[:, 0:8]
y = df.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

[9]: #feature Scaling
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)

X_test=sc_X.transform(X_test)

[10]: knn=KNeighborsClassifier(n_neighbors=11)
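
The choice of n_neighbors=11 is not justified in the notebook. A common way to pick k is to sweep a range of values and compare test error rates; a minimal sketch (an addition, reusing the scaled split above):

[ ]: # plot test error rate against k and look for the lowest, stable region
error_rates = []
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    error_rates.append(np.mean(model.predict(X_test) != y_test))
plt.plot(range(1, 26), error_rates, marker='o')
plt.xlabel('k')
plt.ylabel('error rate')
plt.show()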

[11]: knn.fit(X_train,y_train)

[11]: KNeighborsClassifier(n_neighbors=11)

[12]: y_pred=knn.predict(X_test)

[13]: #Evaluate The Model


cf_matrix=confusion_matrix(y_test,y_pred)

[14]: ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');


ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Display the visualization of the Confusion Matrix.


plt.show()

[15]: tn, fp, fn, tp = confusion_matrix(y_test, y_pred ).ravel()

[16]: tn, fp, fn, tp

[16]: (94, 13, 15, 32)

[17]: #The accuracy rate is equal to (tn+tp)/(tn+tp+fn+fp)


accuracy_score(y_test,y_pred)

[17]: 0.8181818181818182

[18]: #The precision is the ratio of tp/(tp + fp)


precision_score(y_test,y_pred)

[18]: 0.7111111111111111

[19]: ##The recall is the ratio of tp/(tp + fn)


recall_score(y_test,y_pred)

[19]: 0.6808510638297872

[20]: # error rate = 1 - accuracy, which lies between 0 and 1
error_rate = 1 - accuracy_score(y_test, y_pred)

[21]: error_rate

[21]: 0.18181818181818177
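
As a sanity check against the confusion matrix above: accuracy = (tn + tp) / total = (94 + 32) / 154 ≈ 0.818, precision = 32 / (32 + 13) ≈ 0.711, and recall = 32 / (32 + 15) ≈ 0.681, which matches the reported scores.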

03-email-classification-using-knn

October 19, 2024

1 ASSIGNMENT NO 3
Classify the email using the binary classification method. Email spam detection has two states:
a) Normal state: not spam; b) Abnormal state: spam. Use K-Nearest Neighbors and Support
Vector Machine for classification and analyse their performance. Dataset link: the emails.csv
dataset on Kaggle, https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

[2]: import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

[3]: df=pd.read_csv('emails.csv')

[4]: df.head()

[4]: Email No. the to ect and for of a you hou … connevey jay \
0 Email 1 0 0 1 0 0 0 2 0 0 … 0 0
1 Email 2 8 13 24 6 6 2 102 1 27 … 0 0
2 Email 3 0 0 1 0 0 0 8 0 0 … 0 0
3 Email 4 0 5 22 0 5 1 51 2 10 … 0 0
4 Email 5 7 6 17 1 5 2 57 0 9 … 0 0

valued lay infrastructure military allowing ff dry Prediction


0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0

[5 rows x 3002 columns]

[5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction

dtypes: int64(3001), object(1)
memory usage: 118.5+ MB

[6]: df.isnull().sum()

[6]: Email No. 0


the 0
to 0
ect 0
and 0
..
military 0
allowing 0
ff 0
dry 0
Prediction 0
Length: 3002, dtype: int64

[7]: X = df.iloc[:, 1:-1].values


y = df.iloc[:, -1].values

[8]: from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

[9]: from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

[10]: from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

[10]: KNeighborsClassifier()

[11]: y_pred = classifier.predict(X_test)

[12]: from sklearn.metrics import confusion_matrix, accuracy_score


cm = confusion_matrix(y_test, y_pred)

[13]: cm

[13]: array([[866, 248],


[ 16, 422]], dtype=int64)

[14]: from sklearn.metrics import classification_report
cl_report=classification_report(y_test,y_pred)
print(cl_report)

              precision    recall  f1-score   support

           0       0.98      0.78      0.87      1114
           1       0.63      0.96      0.76       438

    accuracy                           0.83      1552
   macro avg       0.81      0.87      0.81      1552
weighted avg       0.88      0.83      0.84      1552

[15]: print("Accuracy Score for KNN : ", accuracy_score(y_pred,y_test))

Accuracy Score for KNN : 0.8298969072164949
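
The problem statement also calls for a Support Vector Machine and a comparison of the two classifiers, but the notebook stops at KNN. A minimal sketch (an addition, reusing the scaled split above):

[ ]: from sklearn.svm import SVC

svc = SVC(kernel='linear', random_state=101)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print("Accuracy Score for SVM : ", accuracy_score(y_test, y_pred_svc))
print(confusion_matrix(y_test, y_pred_svc))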

clustering-on-sales-datasample-csv

October 19, 2024

0.1 Machine Learning - Assignment 6


[1]: # Implement K-Means clustering / hierarchical clustering on the sales_data_sample.csv dataset.
# Determine the number of clusters using the elbow method.

[2]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

[3]: data = pd.read_csv("C://Users//91772//Desktop//ML assigns//sales_data_sample.csv",
                   encoding='Latin-1')

data.head()

# Note on the encoding: the file is not valid UTF-8. It contains Latin-1
# characters such as the small letter i with diaeresis (0xEF), the
# right-pointing double angle quotation mark (0xBB) and the inverted question
# mark (0xBF), which break UTF-8 decoding, hence encoding='Latin-1'.

[3]: ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES \


0 10107 30 95.70 2 2871.00
1 10121 34 81.35 5 2765.90
2 10134 41 94.74 2 3884.34
3 10145 45 83.26 6 3746.70
4 10159 49 100.00 14 5205.27

ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID … \


0 2/24/2003 0:00 Shipped 1 2 2003 …
1 5/7/2003 0:00 Shipped 2 5 2003 …
2 7/1/2003 0:00 Shipped 3 7 2003 …
3 8/25/2003 0:00 Shipped 3 8 2003 …
4 10/10/2003 0:00 Shipped 4 10 2003 …

ADDRESSLINE1 ADDRESSLINE2 CITY STATE \


0 897 Long Airport Avenue NaN NYC NY
1 59 rue de l'Abbaye NaN Reims NaN

2 27 rue du Colonel Pierre Avia NaN Paris NaN
3 78934 Hillside Dr. NaN Pasadena CA
4 7734 Strong St. NaN San Francisco CA

POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE


0 10022 USA NaN Yu Kwai Small
1 51100 France EMEA Henriot Paul Small
2 75508 France EMEA Da Cunha Daniel Medium
3 90003 USA NaN Young Julie Medium
4 NaN USA NaN Brown Julie Medium

[5 rows x 25 columns]

[4]: data.shape

[4]: (2823, 25)

[5]: # Number of NAN values per column in the dataset


data.isnull().sum()

[5]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64

[6]: data.drop(["ORDERNUMBER", "PRICEEACH", "ORDERDATE", "PHONE", "ADDRESSLINE1",
           "ADDRESSLINE2", "CITY", "STATE", "TERRITORY", "POSTALCODE",
           "CONTACTLASTNAME", "CONTACTFIRSTNAME"], axis=1, inplace=True)

[7]: data.head()

[7]: QUANTITYORDERED ORDERLINENUMBER SALES STATUS QTR_ID MONTH_ID \


0 30 2 2871.00 Shipped 1 2
1 34 5 2765.90 Shipped 2 5
2 41 2 3884.34 Shipped 3 7
3 45 6 3746.70 Shipped 3 8
4 49 14 5205.27 Shipped 4 10

YEAR_ID PRODUCTLINE MSRP PRODUCTCODE CUSTOMERNAME COUNTRY \


0 2003 Motorcycles 95 S10_1678 Land of Toys Inc. USA
1 2003 Motorcycles 95 S10_1678 Reims Collectables France
2 2003 Motorcycles 95 S10_1678 Lyon Souveniers France
3 2003 Motorcycles 95 S10_1678 Toys4GrownUps.com USA
4 2003 Motorcycles 95 S10_1678 Corporate Gift Ideas Co. USA

DEALSIZE
0 Small
1 Small
2 Medium
3 Medium
4 Medium

[8]: data.isnull().sum()

[8]: QUANTITYORDERED 0
ORDERLINENUMBER 0
SALES 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
COUNTRY 0
DEALSIZE 0
dtype: int64

1 Exploratory Data Analysis
[9]: data.describe()

[9]: QUANTITYORDERED ORDERLINENUMBER SALES QTR_ID \


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 35.092809 6.466171 3553.889072 2.717676
std 9.741443 4.225841 1841.865106 1.203878
min 6.000000 1.000000 482.130000 1.000000
25% 27.000000 3.000000 2203.430000 2.000000
50% 35.000000 6.000000 3184.800000 3.000000
75% 43.000000 9.000000 4508.000000 4.000000
max 97.000000 18.000000 14082.800000 4.000000

MONTH_ID YEAR_ID MSRP


count 2823.000000 2823.00000 2823.000000
mean 7.092455 2003.81509 100.715551
std 3.656633 0.69967 40.187912
min 1.000000 2003.00000 33.000000
25% 4.000000 2003.00000 68.000000
50% 8.000000 2004.00000 99.000000
75% 11.000000 2004.00000 124.000000
max 12.000000 2005.00000 214.000000

[10]: sns.countplot(data = data , x = 'STATUS')

[10]: <AxesSubplot:xlabel='STATUS', ylabel='count'>

[11]: import seaborn as sns

[12]: sns.histplot(x='SALES', hue='PRODUCTLINE', data=data, element="poly")

[12]: <AxesSubplot:xlabel='SALES', ylabel='Count'>

Here we can see that every product line spans a similar range of sales prices, so next we build
clusters that target these features.
[13]: data['PRODUCTLINE'].unique()

[13]: array(['Motorcycles', 'Classic Cars', 'Trucks and Buses', 'Vintage Cars',


'Planes', 'Ships', 'Trains'], dtype=object)

[14]: # dropping duplicated rows, if any
data.drop_duplicates(inplace=True)

[15]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null object
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null object
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null object
10 CUSTOMERNAME 2823 non-null object
11 COUNTRY 2823 non-null object
12 DEALSIZE 2823 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 308.8+ KB

[16]: list_cat = data.select_dtypes(include=['object']).columns.tolist()

[17]: list_cat

[17]: ['STATUS', 'PRODUCTLINE', 'PRODUCTCODE', 'CUSTOMERNAME', 'COUNTRY', 'DEALSIZE']

[18]: for i in list_cat:
    sns.countplot(data=data, x=i)
    plt.xticks(rotation=90)
    plt.show()

[19]: # dealing with the categorical features
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# encode the labels in each categorical column
for i in list_cat:
    data[i] = le.fit_transform(data[i])

[20]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: float64(1), int32(6), int64(6)
memory usage: 307.1 KB

[21]: data['SALES'] = data['SALES'].astype(int)

[22]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null int32
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64

5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: int32(7), int64(6)
memory usage: 296.1 KB

[23]: data.describe()

[23]: QUANTITYORDERED ORDERLINENUMBER SALES STATUS \


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 35.092809 6.466171 3553.421537 4.782501
std 9.741443 4.225841 1841.865754 0.879416
min 6.000000 1.000000 482.000000 0.000000
25% 27.000000 3.000000 2203.000000 5.000000
50% 35.000000 6.000000 3184.000000 5.000000
75% 43.000000 9.000000 4508.000000 5.000000
max 97.000000 18.000000 14082.000000 5.000000

QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP \


count 2823.000000 2823.000000 2823.00000 2823.000000 2823.000000
mean 2.717676 7.092455 2003.81509 2.515055 100.715551
std 1.203878 3.656633 0.69967 2.411665 40.187912
min 1.000000 1.000000 2003.00000 0.000000 33.000000
25% 2.000000 4.000000 2003.00000 0.000000 68.000000
50% 3.000000 8.000000 2004.00000 2.000000 99.000000
75% 4.000000 11.000000 2004.00000 5.000000 124.000000
max 4.000000 12.000000 2005.00000 6.000000 214.000000

PRODUCTCODE CUSTOMERNAME COUNTRY DEALSIZE


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 53.773291 46.212186 12.029401 1.398512
std 31.585298 24.936147 6.169774 0.592498
min 0.000000 0.000000 0.000000 0.000000
25% 27.000000 29.000000 6.000000 1.000000
50% 53.000000 45.000000 14.000000 1.000000
75% 81.000000 67.000000 18.000000 2.000000
max 108.000000 91.000000 18.000000 2.000000

[24]: # target features are SALES and PRODUCTCODE
X = data[['SALES', 'PRODUCTCODE']]

[25]: data.columns

[25]: Index(['QUANTITYORDERED', 'ORDERLINENUMBER', 'SALES', 'STATUS', 'QTR_ID',
'MONTH_ID', 'YEAR_ID', 'PRODUCTLINE', 'MSRP', 'PRODUCTCODE',
'CUSTOMERNAME', 'COUNTRY', 'DEALSIZE'],
dtype='object')

1.1 K Means implementation


[26]: from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=0).fit(X)

[27]: kmeans.labels_

[27]: array([0, 0, 0, …, 3, 2, 0])

[28]: kmeans.inertia_

[28]: 1042223216.6249822

[29]: kmeans.n_iter_

[29]: 24

[30]: kmeans.cluster_centers_

[30]: array([[3416.59686888, 56.3072407 ],


[7983.1758794 , 28.05025126],
[1879.28363988, 63.25072604],
[5289.27065026, 41.01230228]])

[31]: #getting the size of the clusters


from collections import Counter
Counter(kmeans.labels_)

[31]: Counter({0: 1024, 3: 565, 2: 1035, 1: 199})

Hence, by the elbow method, the number of clusters chosen is 4.
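
The elbow curve itself is never plotted in the notebook. A minimal sketch of producing it (an addition, reusing the feature matrix X above): run K-Means for a range of k, record the inertia (within-cluster sum of squares), and look for the bend in the curve.

[ ]: wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=0).fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (WCSS)')
plt.show()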
[32]: sns.scatterplot(data=X, x="SALES", y="PRODUCTCODE", hue=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
marker="X", c="r", s=80, label="centroids")
plt.legend()
plt.show()

[ ]:
