21bce5695-knn-lab7
March 13, 2024
21BCE5695 M. Ashwin
1 K Nearest Neighbours
1.1 Importing required libraries
[ ]: from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
1.2 Importing Dataset
[ ]: df = pd.read_csv('apple_quality.csv')
[ ]: print(df.head(2))
A_id Size Weight Sweetness Crunchiness Juiciness Ripeness \
0 0 -3.970049 -2.512336 5.346330 -1.012009 1.844900 0.32984
1 1 -1.195217 -2.839257 3.664059 1.588232 0.853286 0.86753
Acidity Quality
0 -0.491590 good
1 -0.722809 good
Dropping the ID column since it is not relevant to the machine learning model
[ ]: df.drop(['A_id'], axis=1, inplace=True)
Splitting into input and output data
[ ]: x = df.drop(['Quality'], axis=1)
y = df['Quality']
1.3 Data Analysis
[ ]: plt.figure(figsize=(25,10))
for (i,v) in enumerate(x.columns):
    plt.subplot(3,df.shape[1],i+1)
    plt.hist(df.iloc[:,i],bins="sqrt")
    plt.title(df.columns[i],fontsize=9)
Encoding the categorical output values into binary values
[ ]: label = []
for i in tqdm(df['Quality']):
    if i=='bad':
        label.append(0)
    else:
        label.append(1)
df['Quality'] = label
100%|██████████| 4000/4000 [00:00<00:00, 945994.70it/s]
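The same encoding can be written without an explicit loop. A minimal sketch, assuming the column holds only the strings 'bad' and 'good' as in the head() output; the small frame `demo` is a made-up stand-in for `df`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from apple_quality.csv.
demo = pd.DataFrame({"Quality": ["good", "bad", "good", "bad"]})

# Same rule as the loop: 'bad' becomes 0, anything else becomes 1.
demo["Quality"] = (demo["Quality"] != "bad").astype(int)
print(demo["Quality"].tolist())
```

The comparison runs on the whole column at once, so no progress bar is needed.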
[ ]: dfinfo = pd.DataFrame(df.dtypes,columns=["dtypes"])
for (m,n) in zip([df.count(),df.isna().sum()],["count","isna"]):
    dfinfo = dfinfo.merge(pd.DataFrame(m,columns=[n]),right_index=True,left_index=True,how="inner")
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
pd.concat([dfinfo.T, df.describe()])
[ ]: Size Weight Sweetness Crunchiness Juiciness Ripeness \
dtypes float64 float64 float64 float64 float64 float64
count 4000 4000 4000 4000 4000 4000
isna 0 0 0 0 0 0
count 4000.0 4000.0 4000.0 4000.0 4000.0 4000.0
mean -0.503015 -0.989547 -0.470479 0.985478 0.512118 0.498277
std 1.928059 1.602507 1.943441 1.402757 1.930286 1.874427
min -7.151703 -7.149848 -6.894485 -6.055058 -5.961897 -5.864599
25% -1.816765 -2.01177 -1.738425 0.062764 -0.801286 -0.771677
50% -0.513703 -0.984736 -0.504758 0.998249 0.534219 0.503445
75% 0.805526 0.030976 0.801922 1.894234 1.835976 1.766212
max 6.406367 5.790714 6.374916 7.619852 7.364403 7.237837
Acidity Quality
dtypes float64 int64
count 4000 4000
isna 0 0
count 4000.0 4000.0
mean 0.076877 0.501
std 2.11027 0.500062
min -7.010538 0.0
25% -1.377424 0.0
50% 0.022609 1.0
75% 1.510493 1.0
max 7.404736 1.0
Correlation matrix
[ ]: df.corr().round(2).style.background_gradient(cmap="viridis")
[ ]: <pandas.io.formats.style.Styler at 0x78992d29c3d0>
[ ]: print(df.head(3))
Size Weight Sweetness Crunchiness Juiciness Ripeness Acidity \
0 -3.970049 -2.512336 5.346330 -1.012009 1.844900 0.329840 -0.491590
1 -1.195217 -2.839257 3.664059 1.588232 0.853286 0.867530 -0.722809
2 -0.292024 -1.351282 -1.738429 -0.342616 2.838636 -0.038033 2.621636
Quality
0 1
1 1
2 0
1.4 Model building and testing
Splitting data into training and testing
[ ]: x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,stratify=y,random_state=30)
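The effect of `stratify` can be checked by comparing class proportions in the two parts. The sketch below uses a synthetic, perfectly balanced stand-in (`X_demo`, `y_demo` are assumptions, not the apple data) with the same `test_size` and `random_state` as the split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 7 numeric features, alternating labels.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 7)))
y_demo = pd.Series(np.where(np.arange(1000) % 2 == 0, "good", "bad"))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=30
)

# With stratify, both parts keep the class ratio of the full data.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share, test_share, sep="\n")
```

Running the same two `value_counts` calls on `y_train` and `y_test` would give the corresponding check for the notebook's own split.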
[ ]: model = KNeighborsClassifier(algorithm="auto")
parameters = {"n_neighbors":[1,3,5],
              "weights":["uniform","distance"]}
model_optim = GridSearchCV(model, parameters, cv=5, scoring="accuracy")
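kNN votes by distance, so a feature with a much larger spread than the others can dominate the result. The describe() output suggests the apple features already have comparable spreads (standard deviations between roughly 1.4 and 2.1), so scaling may change little here. If scaling were wanted, one hedged way to combine it with the grid search is a `Pipeline`, which refits the scaler inside each cross-validation fold. The data below is synthetic (`make_classification`), and the step names `scale` and `knn` are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the apple features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0, random_state=0
)

# Scaler and classifier are fitted together, fold by fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid = {"knn__n_neighbors": [1, 3, 5], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
```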
Training the model
[ ]: model_optim.fit(x_train,y_train)
[ ]: GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': [1, 3, 5],
'weights': ['uniform', 'distance']},
scoring='accuracy')
[ ]: model_optim.best_estimator_
[ ]: KNeighborsClassifier(weights='distance')
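The printed estimator lists only parameters that differ from their defaults, so `n_neighbors` here is the default value of 5. To see how the other grid points compared, the fitted search object exposes `best_params_`, `best_score_` and `cv_results_`. The sketch below runs the same grid on synthetic data (an assumption, since the CSV is not bundled with this text), so its printed scores are not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the apple features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0, random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
columns = ["param_n_neighbors", "param_weights",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(search.best_params_)
print(results.to_string(index=False))
```

In the notebook the same attributes are available on `model_optim` once it has been fitted.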
Model metrics
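[ ]: # Separate loop names keep the feature frame x and target y from being overwritten
for (name,x_part,y_part) in zip(["Train","Test"],[x_train,x_test],[y_train,y_test]):
    print("Classification kNN",name," report:\n",classification_report(y_part,model_optim.predict(x_part)))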
Classification kNN Train report:
               precision    recall  f1-score   support

         bad       1.00      1.00      1.00      1397
        good       1.00      1.00      1.00      1403

    accuracy                           1.00      2800
   macro avg       1.00      1.00      1.00      2800
weighted avg       1.00      1.00      1.00      2800

Classification kNN Test report:
               precision    recall  f1-score   support

         bad       0.91      0.90      0.91       599
        good       0.90      0.91      0.91       601

    accuracy                           0.91      1200
   macro avg       0.91      0.91      0.91      1200
weighted avg       0.91      0.91      0.91      1200
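The perfect training scores are expected rather than a sign of an unusually good fit: with weights='distance', each training point is its own nearest neighbour at distance zero and so decides its own prediction. The test report is the meaningful figure, and it is easier to judge against a trivial baseline. The sketch below uses `DummyClassifier`, which is imported at the top of the notebook but not used in the cells shown, on synthetic `make_classification` data. The data is an assumption standing in for the apple features, so the printed numbers are not results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the apple features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=30
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Same configuration as the best estimator reported by the grid search.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
knn_test_acc = knn.score(X_te, y_te)
knn_train_acc = knn.score(X_tr, y_tr)

print(f"baseline test accuracy: {baseline_acc:.2f}")
print(f"kNN test accuracy:      {knn_test_acc:.2f}")
print(f"kNN train accuracy:     {knn_train_acc:.2f}")
```

On the apple data the two classes are almost evenly split (599 bad and 601 good in the test set), so a most-frequent baseline would score close to 0.50, against 0.91 for the tuned kNN model.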