
21bce5695-knn-lab7

March 13, 2024

21BCE5695 M. Ashwin

1 K Nearest Neighbours
1.1 Importing required libraries
[ ]: from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

1.2 Importing Dataset


[ ]: df = pd.read_csv('apple_quality.csv')

[ ]: print(df.head(2))

   A_id      Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness  \
0     0 -3.970049 -2.512336   5.346330    -1.012009   1.844900  0.329840
1     1 -1.195217 -2.839257   3.664059     1.588232   0.853286  0.867530

    Acidity Quality
0 -0.491590    good
1 -0.722809    good
Dropping the ID column since it is not relevant to the machine learning model
[ ]: df.drop(['A_id'], axis=1, inplace=True)

Splitting into input and output data

[ ]: x = df.drop(['Quality'], axis=1)
y = df['Quality']

1.3 Data Analysis


[ ]: plt.figure(figsize=(25, 10))
for (i, v) in enumerate(x.columns):
    plt.subplot(3, df.shape[1], i + 1)
    plt.hist(df.iloc[:, i], bins="sqrt")
    plt.title(df.columns[i], fontsize=9)

Encoding the categorical output values into binary values


[ ]: label = []
for i in tqdm(df['Quality']):
    if i == 'bad':
        label.append(0)
    else:
        label.append(1)

df['Quality'] = label

100%|██████████| 4000/4000 [00:00<00:00, 945994.70it/s]
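The loop works, but pandas can do the same binary encoding in one vectorized step; a minimal equivalent sketch (same mapping: 'bad' → 0, anything else → 1):

[ ]: # equivalent to the loop above: 'bad' -> 0, 'good' -> 1
df['Quality'] = (df['Quality'] != 'bad').astype(int)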

[ ]: dfinfo = pd.DataFrame(df.dtypes, columns=["dtypes"])
for (m, n) in zip([df.count(), df.isna().sum()], ["count", "isna"]):
    dfinfo = dfinfo.merge(pd.DataFrame(m, columns=[n]),
                          right_index=True, left_index=True, how="inner")

# frame.append is deprecated in pandas; pd.concat is the supported equivalent
pd.concat([dfinfo.T, df.describe()])

[ ]:        Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness  \
dtypes   float64   float64    float64      float64    float64   float64
count       4000      4000       4000         4000       4000      4000
isna           0         0          0            0          0         0
count     4000.0    4000.0     4000.0       4000.0     4000.0    4000.0
mean   -0.503015 -0.989547  -0.470479     0.985478   0.512118  0.498277
std     1.928059  1.602507   1.943441     1.402757   1.930286  1.874427
min    -7.151703 -7.149848  -6.894485    -6.055058  -5.961897 -5.864599
25%    -1.816765  -2.011770  -1.738425    0.062764  -0.801286 -0.771677
50%    -0.513703 -0.984736  -0.504758     0.998249   0.534219  0.503445
75%     0.805526  0.030976   0.801922     1.894234   1.835976  1.766212
max     6.406367  5.790714   6.374916     7.619852   7.364403  7.237837

         Acidity   Quality
dtypes   float64     int64
count       4000      4000
isna           0         0
count     4000.0    4000.0
mean    0.076877     0.501
std     2.110270  0.500062
min    -7.010538       0.0
25%    -1.377424       0.0
50%     0.022609       1.0
75%     1.510493       1.0
max     7.404736       1.0

Correlation matrix
[ ]: df.corr().round(2).style.background_gradient(cmap="viridis")

[ ]: <pandas.io.formats.style.Styler at 0x78992d29c3d0>
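The styled correlation table renders only in a live notebook; outside one, the result cell just shows the Styler repr above. A minimal sketch that plots the same matrix as a plain matplotlib heatmap, using only the libraries already imported:

[ ]: corr = df.corr().round(2)
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap="viridis")   # same colormap as the styled table
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.show()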

[ ]: print(df.head(3))

       Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness   Acidity  \
0 -3.970049 -2.512336   5.346330    -1.012009   1.844900  0.329840 -0.491590
1 -1.195217 -2.839257   3.664059     1.588232   0.853286  0.867530 -0.722809
2 -0.292024 -1.351282  -1.738429    -0.342616   2.838636 -0.038033  2.621636

   Quality
0        1
1        1
2        0

1.4 Model building and testing


Splitting data into training and testing. Note that y was captured before the Quality column was label-encoded, so it still holds the string labels; the classification reports below therefore show 'bad' and 'good'.
[ ]: x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=30)
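One caveat worth noting: kNN is distance-based, so features with larger spreads dominate the neighbour search. StandardScaler is imported above but never used; a sketch of how scaling could be folded into the grid search without leaking test data, via a Pipeline (a suggested refinement, not part of the original lab):

[ ]: from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scale", StandardScaler()),       # refit on each CV training fold
    ("knn", KNeighborsClassifier()),
])
pipe_params = {"knn__n_neighbors": [1, 3, 5],
               "knn__weights": ["uniform", "distance"]}
pipe_optim = GridSearchCV(pipe, pipe_params, cv=5, scoring="accuracy")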

[ ]: model = KNeighborsClassifier(algorithm="auto")
parameters = {"n_neighbors": [1, 3, 5],
              "weights": ["uniform", "distance"]}
model_optim = GridSearchCV(model, parameters, cv=5, scoring="accuracy")

Training the model. GridSearchCV evaluates all 6 parameter combinations (3 values of k × 2 weighting schemes) with 5-fold cross-validation, 30 fits in total, then refits the best combination on the full training set.


[ ]: model_optim.fit(x_train,y_train)

[ ]: GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': [1, 3, 5],
'weights': ['uniform', 'distance']},
scoring='accuracy')

[ ]: model_optim.best_estimator_

[ ]: KNeighborsClassifier(weights='distance')
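The repr only shows non-default parameters, so the winning model is k = 5 (the default n_neighbors) with distance weighting. To see the chosen parameters and their cross-validated accuracy explicitly (standard GridSearchCV attributes):

[ ]: print(model_optim.best_params_)  # e.g. {'n_neighbors': 5, 'weights': 'distance'}
print(model_optim.best_score_)        # mean CV accuracy of that combination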

Model metrics
[ ]: # fresh loop names so the loop doesn't overwrite the original x and y
for (split, x_s, y_s) in zip(["Train", "Test"], [x_train, x_test], [y_train, y_test]):
    print("Classification kNN", split, "report:\n",
          classification_report(y_s, model_optim.predict(x_s)))

Classification kNN Train report:
               precision    recall  f1-score   support

         bad       1.00      1.00      1.00      1397
        good       1.00      1.00      1.00      1403

    accuracy                           1.00      2800
   macro avg       1.00      1.00      1.00      2800
weighted avg       1.00      1.00      1.00      2800

Classification kNN Test report:
               precision    recall  f1-score   support

         bad       0.91      0.90      0.91       599
        good       0.90      0.91      0.91       601

    accuracy                           0.91      1200
   macro avg       0.91      0.91      0.91      1200
weighted avg       0.91      0.91      0.91      1200
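The perfect train-set scores are an artifact rather than a red flag: with weights='distance', every training point is its own nearest neighbour at distance zero, so it always recovers its own label. The ~0.91 test accuracy is the meaningful estimate. To see where the remaining test errors fall, a short sketch using sklearn's confusion_matrix (not part of the original lab):

[ ]: from sklearn.metrics import confusion_matrix

# rows = true labels ('bad', 'good'), columns = predictions
print(confusion_matrix(y_test, model_optim.predict(x_test)))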
