
predict-the-price-of-the-uber-ride

October 19, 2024

1 ASSIGNMENT 01
Predict the price of an Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks: 1. Pre-process the dataset. 2. Identify outliers. 3. Check the correlation.
4. Implement linear regression and random forest regression models. 5. Evaluate the models and
compare their respective scores, such as R2, RMSE, etc.
[ ]: # importing necessary libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

[ ]: #loading the dataset

df = pd.read_csv("C://Users//91772//Desktop//ML assigns//uber.csv")

[ ]: df.head()

[ ]:    Unnamed: 0                            key  fare_amount  \
     0    24238194    2015-05-07 19:52:06.0000003          7.5
     1    27835199    2009-07-17 20:04:56.0000002          7.7
     2    44984355   2009-08-24 21:45:00.00000061         12.9
     3    25894730    2009-06-26 08:22:21.0000001          5.3
     4    17610152  2014-08-28 17:47:00.000000188         16.0

                pickup_datetime  pickup_longitude  pickup_latitude  \
     0  2015-05-07 19:52:06 UTC        -73.999817        40.738354
     1  2009-07-17 20:04:56 UTC        -73.994355        40.728225
     2  2009-08-24 21:45:00 UTC        -74.005043        40.740770
     3  2009-06-26 08:22:21 UTC        -73.976124        40.790844
     4  2014-08-28 17:47:00 UTC        -73.925023        40.744085

        dropoff_longitude  dropoff_latitude  passenger_count
     0         -73.999512         40.723217                1
     1         -73.994710         40.750325                1
     2         -73.962565         40.772647                1
     3         -73.965316         40.803349                3
     4         -73.973082         40.761247                5

[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB

[ ]: df.shape

[ ]: (200000, 9)

2 1. Pre-process the dataset.


[ ]: df.isnull().sum()

[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64

[ ]: #dropping rows with missing values

df.dropna(inplace = True)

[ ]: df.isnull().sum()

[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64

[ ]: #dropping unwanted columns

df.drop(labels='Unnamed: 0',axis=1,inplace=True)
df.drop(labels='key',axis=1,inplace=True)

[ ]: df.head()

[ ]: fare_amount pickup_datetime pickup_longitude pickup_latitude \


0 7.5 2015-05-07 19:52:06 UTC -73.999817 40.738354
1 7.7 2009-07-17 20:04:56 UTC -73.994355 40.728225
2 12.9 2009-08-24 21:45:00 UTC -74.005043 40.740770
3 5.3 2009-06-26 08:22:21 UTC -73.976124 40.790844
4 16.0 2014-08-28 17:47:00 UTC -73.925023 40.744085

dropoff_longitude dropoff_latitude passenger_count


0 -73.999512 40.723217 1
1 -73.994710 40.750325 1
2 -73.962565 40.772647 1
3 -73.965316 40.803349 3
4 -73.973082 40.761247 5

[ ]: df.dtypes

[ ]: fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object

[ ]: df.describe()

[ ]: fare_amount pickup_longitude pickup_latitude dropoff_longitude \


count 199999.000000 199999.000000 199999.000000 199999.000000

mean 11.359892 -72.527631 39.935881 -72.525292
std 9.901760 11.437815 7.720558 13.117408
min -52.000000 -1340.648410 -74.015515 -3356.666300
25% 6.000000 -73.992065 40.734796 -73.991407
50% 8.500000 -73.981823 40.752592 -73.980093
75% 12.500000 -73.967154 40.767158 -73.963658
max 499.000000 57.418457 1644.421482 1153.572603

dropoff_latitude passenger_count
count 199999.000000 199999.000000
mean 39.923890 1.684543
std 6.794829 1.385995
min -881.985513 0.000000
25% 40.733823 1.000000
50% 40.753042 1.000000
75% 40.768001 2.000000
max 872.697628 208.000000

3 2. Identify outliers.
OUTLIER: an observation that deviates significantly from the rest of the data.
[ ]: # data visualization: plotting distribution plots
# note: sns.distplot is deprecated in recent seaborn releases; sns.histplot or
# sns.displot are the modern replacements

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
sns.distplot(df['fare_amount'])

[ ]: <AxesSubplot:xlabel='fare_amount', ylabel='Density'>

[ ]: sns.distplot(df['pickup_latitude'])

[ ]: <AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>

[ ]: sns.distplot(df['pickup_longitude'])

[ ]: <AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>

[ ]: sns.distplot(df['dropoff_longitude'])

[ ]: <AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>

[ ]: sns.distplot(df['dropoff_latitude'])

[ ]: <AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>

[ ]: # creating a function to identify outliers using the IQR rule

def find_outliers_IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[(df < (q1 - 1.5 * IQR)) | (df > (q3 + 1.5 * IQR))]
    return outliers

[ ]: # getting outlier details for column "fare_amount" using the above function

outliers = find_outliers_IQR(df["fare_amount"])
print("number of outliers: " + str(len(outliers)))
print("max outlier value: " + str(outliers.max()))
print("min outlier value: " + str(outliers.min()))
outliers

number of outliers: 17166


max outlier value: 499.0
min outlier value: -52.0

[ ]: 6 24.50
30 25.70
34 39.50
39 29.00
48 56.80

199976 49.70
199977 43.50
199982 57.33
199985 24.00
199997 30.90
Name: fare_amount, Length: 17166, dtype: float64

[ ]: # you can also pass multiple columns to the function (here "passenger_count" and "fare_amount")

outliers = find_outliers_IQR(df[["passenger_count", "fare_amount"]])
outliers

[ ]: passenger_count fare_amount
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 5.0 NaN

… … …
199995 NaN NaN
199996 NaN NaN
199997 NaN 30.9
199998 NaN NaN
199999 NaN NaN

[199999 rows x 2 columns]

[ ]: # upper and lower limits which can be used for capping outliers

upper_limit = df['fare_amount'].mean() + 3*df['fare_amount'].std()
print(upper_limit)
lower_limit = df['fare_amount'].mean() - 3*df['fare_amount'].std()
print(lower_limit)

41.06517154774204
-18.3453884488253
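
These limits are computed but never actually applied in the notebook. A minimal sketch of capping fare_amount to them with pandas' clip (this step is an addition, not part of the original run):

[ ]: # cap extreme fares to the mean ± 3*std limits computed above
df['fare_amount'] = df['fare_amount'].clip(lower=lower_limit, upper=upper_limit)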

4 3. Check the correlation.


[ ]: #creating a correlation matrix

corrMatrix = df.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
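
Note: on pandas 2.x, df.corr() raises a TypeError here because df still contains the non-numeric pickup_datetime column (older versions silently dropped it). If that happens, restrict the computation to numeric columns:

[ ]: corrMatrix = df.corr(numeric_only=True)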

[ ]: # splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year", "weekday"
# for a simplified view

import calendar

# pickup_datetime is still a plain string column at this point, so convert it
# first; without this, the .apply calls below fail
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

df['day'] = df['pickup_datetime'].apply(lambda x: x.day)
df['hour'] = df['pickup_datetime'].apply(lambda x: x.hour)
df['month'] = df['pickup_datetime'].apply(lambda x: x.month)
df['year'] = df['pickup_datetime'].apply(lambda x: x.year)
df['weekday'] = df['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])

df.drop(['pickup_datetime'], axis=1, inplace=True)

(In the original run, these apply calls were executed on the raw string column and raised AttributeError: 'str' object has no attribute 'day'; converting the column with pd.to_datetime first, as above, avoids the error.)

[ ]: # label encoding (categorical to numerical)

df.weekday = df.weekday.map({'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
                             'Thursday': 4, 'Friday': 5, 'Saturday': 6})

[ ]: df.head()

[ ]: df.info()

[ ]: #splitting the data into train and test

from sklearn.model_selection import train_test_split

[ ]: #independent variables (x)

x=df.drop("fare_amount", axis=1)
x

[ ]: #dependent variable (y)

y=df["fare_amount"]

[ ]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)

[ ]: x_train.head()

[ ]: x_test.head()

[ ]: y_train.head()

[ ]: y_test.head()

[ ]: print(x_train.shape)
print(x_test.shape)
print(y_test.shape)
print(y_train.shape)

5 4. Implement linear regression and random forest regression models.

6 5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
[ ]: #Linear Regression

from sklearn.linear_model import LinearRegression


lrmodel=LinearRegression()
lrmodel.fit(x_train, y_train)

[ ]: predictedvalues = lrmodel.predict(x_test)

[ ]: #Calculating the value of RMSE for Linear Regression

from sklearn.metrics import mean_squared_error
lrmodelrmse = np.sqrt(mean_squared_error(predictedvalues, y_test))
print("RMSE value for Linear regression is", lrmodelrmse)

[ ]: #Random Forest Regression

from sklearn.ensemble import RandomForestRegressor


rfrmodel = RandomForestRegressor(n_estimators=100, random_state=101)

[ ]: rfrmodel.fit(x_train,y_train)
rfrmodel_pred= rfrmodel.predict(x_test)

[ ]: #Calculating the value of RMSE for Random Forest Regression

rfrmodel_rmse=np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)

[ ]: rfrmodel_pred.shape

7 Predict the price of the Uber ride


[ ]: test = pd.read_csv("https://raw.githubusercontent.com/piyushpandey758/Uber-Fare-Prediction/master/testt.csv")

[ ]: test.head()

[ ]: # dropping index artefacts and the key column
test.drop(['Unnamed: 0.1.1', 'Unnamed: 0', 'Unnamed: 0.1', 'key'], axis=1, inplace=True)

[ ]: test.isnull().sum()

[ ]: #converting datatype of column "pickup_datetime" from object to DateTime

test["pickup_datetime"] = pd.to_datetime(test["pickup_datetime"])

[ ]: # splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year", "weekday"
# for a simplified view

test['day'] = test['pickup_datetime'].apply(lambda x: x.day)
test['hour'] = test['pickup_datetime'].apply(lambda x: x.hour)
test['month'] = test['pickup_datetime'].apply(lambda x: x.month)
test['year'] = test['pickup_datetime'].apply(lambda x: x.year)
test['weekday'] = test['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])

# label encoding weekdays
test.weekday = test.weekday.map({'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
                                 'Thursday': 4, 'Friday': 5, 'Saturday': 6})

test.drop(['pickup_datetime'], axis=1, inplace=True)

test.head(5)

[ ]: #Prediction!

rfrmodel_pred= rfrmodel.predict(test)

[ ]: df_pred = pd.DataFrame(rfrmodel_pred)
df_pred

[ ]: #to_csv() function exports the DataFrame to CSV format

df_pred.to_csv('pred.csv')

[ ]:

ps-05-gradient-descent

October 19, 2024

1 ASSIGNMENT NO 4: Implement Gradient Descent Algorithm to find the local minima of a function
For example, find the local minima of the function y=(x+3)² starting from the point x=2.
Gradient descent is an optimisation algorithm: a minimization method that follows the negative
of the gradient toward a minimum of the target function, using the update rule
x_new = x − alpha · f′(x).
Inputs of the gradient descent algorithm: 1. Target (objective) function. 2. Alpha, i.e. step
size or learning rate. 3. Starting point. 4. Iteration cap.
[1]: import numpy as np
import sympy as sym  # library for symbolic math
from matplotlib import pyplot

[2]: def objective(x):
    return (x + 3)**2

[3]: def derivative(x):
    return 2*(x + 3)

[4]: def gradient_descent(alpha, start, max_iter):
    x_list = list()
    x = start
    x_list.append(x)
    for i in range(max_iter):
        gradient = derivative(x)
        x = x - (alpha * gradient)
        x_list.append(x)
    return x_list
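
A quick added sanity check: each update multiplies (x + 3) by (1 − 2·alpha) = 0.8, so starting from x = 2 the iterates should converge toward the minimum at x = −3:

[ ]: X = gradient_descent(0.1, 2, 30)
print(X[-1])  # ≈ -2.9938, i.e. close to the local minimum at x = -3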

[5]: x = sym.symbols('x')
expr = (x + 3)**2.0
grad = sym.Derivative(expr, x)
print("{}".format(grad.doit()))
grad.doit().subs(x, 2)

2.0*(x + 3)**1.0

[5]: 10.0
[6]: def gradient_descent1(expr, alpha, start, max_iter):
    x_list = list()
    x = sym.symbols('x')
    grad = sym.Derivative(expr, x).doit()
    x_val = start
    x_list.append(x_val)
    for i in range(max_iter):
        gradient = grad.subs(x, x_val)
        x_val = x_val - (alpha * gradient)
        x_list.append(x_val)
    return x_list

[7]: alpha = 0.1     # step size
start = 2       # starting point
max_iter = 30   # limit on iterations
x = sym.symbols('x')
expr = (x + 3)**2   # target function

[8]: x_cordinate = np.linspace(-15, 15, 100)
pyplot.plot(x_cordinate, objective(x_cordinate))
pyplot.plot(2, objective(2), 'ro')

[8]: [<matplotlib.lines.Line2D at 0x215d1886280>]

[9]: X = gradient_descent(alpha,start,max_iter)

x_cordinate = np.linspace(-5,5,100)
pyplot.plot(x_cordinate,objective(x_cordinate))

X_arr = np.array(X)
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()

[10]: X = gradient_descent1(expr, alpha, start, max_iter)
X_arr = np.array(X)

x_cordinate = np.linspace(-5, 5, 100)
pyplot.plot(x_cordinate, objective(x_cordinate))
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()

k-nearest-neighbors-algorithm-on-diabetes-csv

October 19, 2024

1 ASSIGNMENT NO 5
Problem statement: Implement the K-Nearest Neighbors algorithm on the diabetes.csv dataset.
Compute the confusion matrix, accuracy, error rate, precision and recall on the given dataset.
[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score

[2]: df=pd.read_csv("C://Users//91772//Desktop//ML assigns//diabetes.csv")

[3]: df.head()

[3]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

Pedigree Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[4]: df.shape

[4]: (768, 9)

[5]: df.describe()

[5]: Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000

BMI Pedigree Age Outcome


count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000

[6]: # replace zeros with the column mean in columns where zero is not a valid value
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Insulin"]
for column in zero_not_accepted:
    df[column] = df[column].replace(0, np.nan)
    mean = int(df[column].mean(skipna=True))
    df[column] = df[column].replace(np.nan, mean)

[7]: df["Glucose"]

[7]: 0 148.0
1 85.0
2 183.0
3 89.0
4 137.0

763 101.0
764 122.0
765 121.0
766 126.0
767 93.0
Name: Glucose, Length: 768, dtype: float64

[8]: # split dataset
X = df.iloc[:, 0:8]
y = df.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

[9]: #feature Scaling
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)

X_test=sc_X.transform(X_test)

[10]: knn=KNeighborsClassifier(n_neighbors=11)
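
The choice of n_neighbors=11 is not justified in the notebook. A common way to pick k is to sweep a range of values and compare test error rates; a minimal sketch (an addition, reusing the scaled split above):

[ ]: # plot test error rate against k and look for the lowest, stable region
error_rates = []
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    error_rates.append(np.mean(model.predict(X_test) != y_test))
plt.plot(range(1, 26), error_rates, marker='o')
plt.xlabel('k')
plt.ylabel('error rate')
plt.show()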

[11]: knn.fit(X_train,y_train)

[11]: KNeighborsClassifier(n_neighbors=11)

[12]: y_pred=knn.predict(X_test)

[13]: #Evaluate The Model


cf_matrix=confusion_matrix(y_test,y_pred)

[14]: ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');


ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Display the visualization of the Confusion Matrix.


plt.show()

[15]: tn, fp, fn, tp = confusion_matrix(y_test, y_pred ).ravel()

[16]: tn, fp, fn, tp

[16]: (94, 13, 15, 32)

[17]: #The accuracy rate is equal to (tn+tp)/(tn+tp+fn+fp)


accuracy_score(y_test,y_pred)

[17]: 0.8181818181818182

[18]: #The precision is the ratio of tp/(tp + fp)


precision_score(y_test,y_pred)

[18]: 0.7111111111111111

[19]: ##The recall is the ratio of tp/(tp + fn)


recall_score(y_test,y_pred)

[19]: 0.6808510638297872

[20]: # error rate = 1 - accuracy, which lies between 0 and 1
error_rate = 1 - accuracy_score(y_test, y_pred)

[21]: error_rate

[21]: 0.18181818181818177
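
As a sanity check against the confusion matrix above: accuracy = (tn + tp) / total = (94 + 32) / 154 ≈ 0.818, precision = 32 / (32 + 13) ≈ 0.711, and recall = 32 / (32 + 15) ≈ 0.681, which matches the reported scores.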

03-email-classification-using-knn

October 19, 2024

1 ASSIGNMENT NO 3
Classify the email using the binary classification method. Email spam detection has two states:
a) Normal state: not spam; b) Abnormal state: spam. Use K-Nearest Neighbors and Support
Vector Machine for classification and analyse their performance. Dataset link: the emails.csv
dataset on Kaggle, https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

[2]: import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

[3]: df=pd.read_csv('emails.csv')

[4]: df.head()

[4]: Email No. the to ect and for of a you hou … connevey jay \
0 Email 1 0 0 1 0 0 0 2 0 0 … 0 0
1 Email 2 8 13 24 6 6 2 102 1 27 … 0 0
2 Email 3 0 0 1 0 0 0 8 0 0 … 0 0
3 Email 4 0 5 22 0 5 1 51 2 10 … 0 0
4 Email 5 7 6 17 1 5 2 57 0 9 … 0 0

valued lay infrastructure military allowing ff dry Prediction


0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0

[5 rows x 3002 columns]

[5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction

dtypes: int64(3001), object(1)
memory usage: 118.5+ MB

[6]: df.isnull().sum()

[6]: Email No. 0


the 0
to 0
ect 0
and 0
..
military 0
allowing 0
ff 0
dry 0
Prediction 0
Length: 3002, dtype: int64

[7]: X = df.iloc[:, 1:-1].values


y = df.iloc[:, -1].values

[8]: from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

[9]: from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

[10]: from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

[10]: KNeighborsClassifier()

[11]: y_pred = classifier.predict(X_test)

[12]: from sklearn.metrics import confusion_matrix, accuracy_score


cm = confusion_matrix(y_test, y_pred)

[13]: cm

[13]: array([[866, 248],


[ 16, 422]], dtype=int64)

[14]: from sklearn.metrics import classification_report
cl_report=classification_report(y_test,y_pred)
print(cl_report)

              precision    recall  f1-score   support

           0       0.98      0.78      0.87      1114
           1       0.63      0.96      0.76       438

    accuracy                           0.83      1552
   macro avg       0.81      0.87      0.81      1552
weighted avg       0.88      0.83      0.84      1552

[15]: print("Accuracy Score for KNN : ", accuracy_score(y_pred,y_test))

Accuracy Score for KNN : 0.8298969072164949
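
The problem statement also calls for a Support Vector Machine and a comparison of the two classifiers, but the notebook stops at KNN. A minimal sketch (an addition, reusing the scaled split above):

[ ]: from sklearn.svm import SVC

svc = SVC(kernel='linear', random_state=101)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print("Accuracy Score for SVM : ", accuracy_score(y_test, y_pred_svc))
print(confusion_matrix(y_test, y_pred_svc))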

clustering-on-sales-datasample-csv

October 19, 2024

0.1 Machine Learning - Assignment 6


[1]: # Implement K-Means clustering / hierarchical clustering on the sales_data_sample.csv dataset.
# Determine the number of clusters using the elbow method.

[2]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

[3]: data = pd.read_csv("C://Users//91772//Desktop//ML assigns//sales_data_sample.csv",
                   encoding='Latin-1')

data.head()

# Note on the encoding: the file is not valid UTF-8. It contains Latin-1
# characters such as the small letter i with diaeresis (0xEF), the
# right-pointing double angle quotation mark (0xBB) and the inverted question
# mark (0xBF), which break UTF-8 decoding, hence encoding='Latin-1'.

[3]: ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES \


0 10107 30 95.70 2 2871.00
1 10121 34 81.35 5 2765.90
2 10134 41 94.74 2 3884.34
3 10145 45 83.26 6 3746.70
4 10159 49 100.00 14 5205.27

ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID … \


0 2/24/2003 0:00 Shipped 1 2 2003 …
1 5/7/2003 0:00 Shipped 2 5 2003 …
2 7/1/2003 0:00 Shipped 3 7 2003 …
3 8/25/2003 0:00 Shipped 3 8 2003 …
4 10/10/2003 0:00 Shipped 4 10 2003 …

ADDRESSLINE1 ADDRESSLINE2 CITY STATE \


0 897 Long Airport Avenue NaN NYC NY
1 59 rue de l'Abbaye NaN Reims NaN

2 27 rue du Colonel Pierre Avia NaN Paris NaN
3 78934 Hillside Dr. NaN Pasadena CA
4 7734 Strong St. NaN San Francisco CA

POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE


0 10022 USA NaN Yu Kwai Small
1 51100 France EMEA Henriot Paul Small
2 75508 France EMEA Da Cunha Daniel Medium
3 90003 USA NaN Young Julie Medium
4 NaN USA NaN Brown Julie Medium

[5 rows x 25 columns]

[4]: data.shape

[4]: (2823, 25)

[5]: # Number of NAN values per column in the dataset


data.isnull().sum()

[5]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64

[6]: data.drop(["ORDERNUMBER", "PRICEEACH", "ORDERDATE", "PHONE", "ADDRESSLINE1",
           "ADDRESSLINE2", "CITY", "STATE", "TERRITORY", "POSTALCODE",
           "CONTACTLASTNAME", "CONTACTFIRSTNAME"], axis=1, inplace=True)

[7]: data.head()

[7]: QUANTITYORDERED ORDERLINENUMBER SALES STATUS QTR_ID MONTH_ID \


0 30 2 2871.00 Shipped 1 2
1 34 5 2765.90 Shipped 2 5
2 41 2 3884.34 Shipped 3 7
3 45 6 3746.70 Shipped 3 8
4 49 14 5205.27 Shipped 4 10

YEAR_ID PRODUCTLINE MSRP PRODUCTCODE CUSTOMERNAME COUNTRY \


0 2003 Motorcycles 95 S10_1678 Land of Toys Inc. USA
1 2003 Motorcycles 95 S10_1678 Reims Collectables France
2 2003 Motorcycles 95 S10_1678 Lyon Souveniers France
3 2003 Motorcycles 95 S10_1678 Toys4GrownUps.com USA
4 2003 Motorcycles 95 S10_1678 Corporate Gift Ideas Co. USA

DEALSIZE
0 Small
1 Small
2 Medium
3 Medium
4 Medium

[8]: data.isnull().sum()

[8]: QUANTITYORDERED 0
ORDERLINENUMBER 0
SALES 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
COUNTRY 0
DEALSIZE 0
dtype: int64

1 Exploratory Data Analysis
[9]: data.describe()

[9]: QUANTITYORDERED ORDERLINENUMBER SALES QTR_ID \


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 35.092809 6.466171 3553.889072 2.717676
std 9.741443 4.225841 1841.865106 1.203878
min 6.000000 1.000000 482.130000 1.000000
25% 27.000000 3.000000 2203.430000 2.000000
50% 35.000000 6.000000 3184.800000 3.000000
75% 43.000000 9.000000 4508.000000 4.000000
max 97.000000 18.000000 14082.800000 4.000000

MONTH_ID YEAR_ID MSRP


count 2823.000000 2823.00000 2823.000000
mean 7.092455 2003.81509 100.715551
std 3.656633 0.69967 40.187912
min 1.000000 2003.00000 33.000000
25% 4.000000 2003.00000 68.000000
50% 8.000000 2004.00000 99.000000
75% 11.000000 2004.00000 124.000000
max 12.000000 2005.00000 214.000000

[10]: sns.countplot(data = data , x = 'STATUS')

[10]: <AxesSubplot:xlabel='STATUS', ylabel='count'>

[11]: import seaborn as sns

[12]: sns.histplot(x='SALES', hue='PRODUCTLINE', data=data, element="poly")

[12]: <AxesSubplot:xlabel='SALES', ylabel='Count'>

Here we can see that every product line spans a similar range of sales prices, so next we build
clusters that target these features.
[13]: data['PRODUCTLINE'].unique()

[13]: array(['Motorcycles', 'Classic Cars', 'Trucks and Buses', 'Vintage Cars',


'Planes', 'Ships', 'Trains'], dtype=object)

[14]: # dropping duplicated rows, if any
data.drop_duplicates(inplace=True)

[15]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null object
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null object
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null object
10 CUSTOMERNAME 2823 non-null object
11 COUNTRY 2823 non-null object
12 DEALSIZE 2823 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 308.8+ KB

[16]: list_cat = data.select_dtypes(include=['object']).columns.tolist()

[17]: list_cat

[17]: ['STATUS', 'PRODUCTLINE', 'PRODUCTCODE', 'CUSTOMERNAME', 'COUNTRY', 'DEALSIZE']

[18]: for i in list_cat:
    sns.countplot(data=data, x=i)
    plt.xticks(rotation=90)
    plt.show()

[19]: # dealing with the categorical features
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# encode the labels in each categorical column
for i in list_cat:
    data[i] = le.fit_transform(data[i])

[20]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: float64(1), int32(6), int64(6)
memory usage: 307.1 KB

[21]: data['SALES'] = data['SALES'].astype(int)

[22]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null int32
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64

5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: int32(7), int64(6)
memory usage: 296.1 KB

[23]: data.describe()

[23]: QUANTITYORDERED ORDERLINENUMBER SALES STATUS \


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 35.092809 6.466171 3553.421537 4.782501
std 9.741443 4.225841 1841.865754 0.879416
min 6.000000 1.000000 482.000000 0.000000
25% 27.000000 3.000000 2203.000000 5.000000
50% 35.000000 6.000000 3184.000000 5.000000
75% 43.000000 9.000000 4508.000000 5.000000
max 97.000000 18.000000 14082.000000 5.000000

QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP \


count 2823.000000 2823.000000 2823.00000 2823.000000 2823.000000
mean 2.717676 7.092455 2003.81509 2.515055 100.715551
std 1.203878 3.656633 0.69967 2.411665 40.187912
min 1.000000 1.000000 2003.00000 0.000000 33.000000
25% 2.000000 4.000000 2003.00000 0.000000 68.000000
50% 3.000000 8.000000 2004.00000 2.000000 99.000000
75% 4.000000 11.000000 2004.00000 5.000000 124.000000
max 4.000000 12.000000 2005.00000 6.000000 214.000000

PRODUCTCODE CUSTOMERNAME COUNTRY DEALSIZE


count 2823.000000 2823.000000 2823.000000 2823.000000
mean 53.773291 46.212186 12.029401 1.398512
std 31.585298 24.936147 6.169774 0.592498
min 0.000000 0.000000 0.000000 0.000000
25% 27.000000 29.000000 6.000000 1.000000
50% 53.000000 45.000000 14.000000 1.000000
75% 81.000000 67.000000 18.000000 2.000000
max 108.000000 91.000000 18.000000 2.000000

[24]: # target features are SALES and PRODUCTCODE
X = data[['SALES', 'PRODUCTCODE']]

[25]: data.columns

[25]: Index(['QUANTITYORDERED', 'ORDERLINENUMBER', 'SALES', 'STATUS', 'QTR_ID',
'MONTH_ID', 'YEAR_ID', 'PRODUCTLINE', 'MSRP', 'PRODUCTCODE',
'CUSTOMERNAME', 'COUNTRY', 'DEALSIZE'],
dtype='object')

1.1 K Means implementation


[26]: from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=0).fit(X)

[27]: kmeans.labels_

[27]: array([0, 0, 0, …, 3, 2, 0])

[28]: kmeans.inertia_

[28]: 1042223216.6249822

[29]: kmeans.n_iter_

[29]: 24

[30]: kmeans.cluster_centers_

[30]: array([[3416.59686888, 56.3072407 ],


[7983.1758794 , 28.05025126],
[1879.28363988, 63.25072604],
[5289.27065026, 41.01230228]])

[31]: #getting the size of the clusters


from collections import Counter
Counter(kmeans.labels_)

[31]: Counter({0: 1024, 3: 565, 2: 1035, 1: 199})

Hence, by the elbow method, the number of clusters chosen is 4.
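
The elbow curve itself is never plotted in the notebook. A minimal sketch of producing it (an addition, reusing the feature matrix X above): run K-Means for a range of k, record the inertia (within-cluster sum of squares), and look for the bend in the curve.

[ ]: wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=0).fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (WCSS)')
plt.show()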
[32]: sns.scatterplot(data=X, x="SALES", y="PRODUCTCODE", hue=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
marker="X", c="r", s=80, label="centroids")
plt.legend()
plt.show()

[ ]:
