
Experiment No-4

k Nearest Neighbours

Dataset description
The dataset contains the following attributes, with Outcome as the target variable:

Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome

1. Import libraries
In [40]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [41]: import warnings

warnings.filterwarnings('ignore')

2. Import dataset


In [42]: data = 'diabetes.csv'

df = pd.read_csv(data)

3. Exploratory data analysis


Now, I will explore the data to gain insights.

In [44]: # view dimensions of dataset

df.shape

Out[44]: (768, 9)

View top 5 rows of dataset

In [45]: # preview the dataset

df.head()

Out[45]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1


View summary of dataset

In [46]: # view summary of dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Frequency distribution of values in variables

In [47]: for var in df.columns:
    print(df[var].value_counts())

1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64
99 17
100 17
...


Check data types of columns of dataframe

In [48]: df.dtypes

Out[48]: Pregnancies int64


Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object

Missing values in variables

In [49]: # check missing values in variables

df.isnull().sum()

Out[49]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64


4. Declare feature vector and target variable


In [50]: X = df.drop(['Outcome'], axis=1)

y = df['Outcome']

5. Split data into separate training and test sets


In [51]: # split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [52]: # check the shape of X_train and X_test

X_train.shape, X_test.shape

Out[52]: ((614, 8), (154, 8))

6. Feature Engineering
Feature Engineering is the process of transforming raw data into useful features that help us understand the data better and increase the model's predictive power. I will carry out feature engineering on the different types of variables.


In [53]: # check data types in X_train

X_train.dtypes

Out[53]: Pregnancies int64


Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
dtype: object

As the earlier missing-value check showed, there are no missing values in X_train and X_test.

In [54]: X_train.head()

Out[54]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age

603 7 150 78 29 126 35.2 0.692 54

118 4 97 60 23 0 28.2 0.443 22

247 0 165 90 33 680 52.3 0.427 23

157 1 109 56 21 135 25.2 0.833 23

468 8 120 0 0 0 30.0 0.183 38


In [55]: X_test.head()

Out[55]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age

661 1 199 76 43 0 42.9 1.394 22

122 2 107 74 30 100 33.6 0.404 23

113 4 76 62 0 0 34.0 0.391 25

14 5 166 72 19 175 25.8 0.587 51

529 0 111 65 0 0 24.6 0.660 31
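
The features are on very different scales (e.g. Insulin ranges into the hundreds while DiabetesPedigreeFunction stays below 3), and kNN is a distance-based method, so feature scaling usually improves it. The original notebook skips this step; below is a minimal sketch of how it could be added, assuming scikit-learn's StandardScaler (X_train_scaled and X_test_scaled are hypothetical names, and the scaler is fitted on the training set only so no test-set information leaks in):

# optional preprocessing sketch (not part of the original notebook)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same statistics to the test set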

7. Fit K Neighbours Classifier to the Training Set


In [56]: # import KNeighborsClassifier from sklearn

from sklearn.neighbors import KNeighborsClassifier


# instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)


# fit the model to the training set
knn.fit(X_train, y_train)

Out[56]: KNeighborsClassifier(n_neighbors=3)
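
Note that, by default, KNeighborsClassifier uses uniform weights and the Minkowski metric with p=2 (plain Euclidean distance), so each prediction is a majority vote among the 3 nearest training points.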


8. Predict the test-set results


In [57]: y_pred = knn.predict(X_test)

y_pred

Out[57]: array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int64)

In [58]: from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.7208

9. Check for overfitting and underfitting

In [59]: # print the scores on training and test set

print('Training set score: {:.4f}'.format(knn.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(knn.score(X_test, y_test)))

Training set score: 0.8518


Test set score: 0.7208
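
The training-set score (0.8518) is noticeably higher than the test-set score (0.7208), which suggests the k=3 model overfits somewhat; increasing k, as in the next section, smooths the decision boundary and should narrow this gap.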

10. Rebuild kNN Classification model using different values of k



I have built the kNN classification model using k=3. Now, I will increase the value of k and see its effect on accuracy (a compact loop over a range of k values is sketched after the k=9 run below).

Rebuild kNN Classification model using k=5

In [60]: # instantiate the model with k=5


knn_5 = KNeighborsClassifier(n_neighbors=5)


# fit the model to the training set
knn_5.fit(X_train, y_train)


# predict on the test-set
y_pred_5 = knn_5.predict(X_test)


print('Model accuracy score with k=5 : {0:0.4f}'.format(accuracy_score(y_test, y_pred_5)))

Model accuracy score with k=5 : 0.7532

Rebuild kNN Classification model using k=6


In [61]: # instantiate the model with k=6


knn_6 = KNeighborsClassifier(n_neighbors=6)


# fit the model to the training set
knn_6.fit(X_train, y_train)


# predict on the test-set
y_pred_6 = knn_6.predict(X_test)


print('Model accuracy score with k=6 : {0:0.4f}'.format(accuracy_score(y_test, y_pred_6)))

Model accuracy score with k=6 : 0.7792

Rebuild kNN Classification model using k=7

In [27]: # instantiate the model with k=7


knn_7 = KNeighborsClassifier(n_neighbors=7)


# fit the model to the training set
knn_7.fit(X_train, y_train)


# predict on the test-set
y_pred_7 = knn_7.predict(X_test)


print('Model accuracy score with k=7 : {0:0.4f}'.format(accuracy_score(y_test, y_pred_7)))

Model accuracy score with k=7 : 0.7597


Rebuild kNN Classification model using k=8


In [62]: # instantiate the model with k=8


knn_8 = KNeighborsClassifier(n_neighbors=8)


# fit the model to the training set
knn_8.fit(X_train, y_train)


# predict on the test-set
y_pred_8 = knn_8.predict(X_test)


print('Model accuracy score with k=8 : {0:0.4f}'.format(accuracy_score(y_test, y_pred_8)))

Model accuracy score with k=8 : 0.7792

Rebuild kNN Classification model using k=9

In [63]: # instantiate the model with k=9


knn_9 = KNeighborsClassifier(n_neighbors=9)


# fit the model to the training set
knn_9.fit(X_train, y_train)


# predict on the test-set
y_pred_9 = knn_9.predict(X_test)


print('Model accuracy score with k=9 : {0:0.4f}'.format(accuracy_score(y_test, y_pred_9)))

Model accuracy score with k=9 : 0.7727
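
Rather than rebuilding the model cell by cell, the same search can be written as a loop; a minimal sketch (not part of the original notebook), reusing the imports above. The range 1 to 20 and the name knn_k are arbitrary choices for illustration:

# sweep over k and report test-set accuracy for each value (sketch)
for k in range(1, 21):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print('k = {:2d}  accuracy = {:.4f}'.format(k, knn_k.score(X_test, y_test)))

Among the values tried above, k=6 and k=8 give the best test-set accuracy (0.7792).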


11. Confusion matrix


A confusion matrix is a tool for summarizing the performance of a classification algorithm. It gives a clear picture of the model's performance and of the types of errors it produces, presenting the counts of correct and incorrect predictions broken down by class in tabular form.

Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-

True Positives (TP) – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to
that class.

True Negatives (TN) – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually
does not belong to that class.

False Positives (FP) – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not
belong to that class. This type of error is called Type I error.

False Negatives (FN) – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This type of error is called Type II error, and in a medical setting such as this one it is usually the more serious mistake.

These four outcomes are summarized in a confusion matrix given below.

In [69]: # Print the Confusion Matrix with k=3 and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

Confusion matrix

[[83 24]
[19 28]]
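
The cell's comment mentions slicing the matrix into four pieces; a minimal sketch of that step (not shown in the original output), with the names tn, fp, fn, tp chosen for illustration. With labels (0, 1), scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]:

# unpack the four outcomes from the 2x2 matrix (sketch)
tn, fp, fn, tp = cm.ravel()
print('True Negatives :', tn)  # 83
print('False Positives:', fp)  # 24
print('False Negatives:', fn)  # 19
print('True Positives :', tp)  # 28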


In [70]: # Print the Confusion Matrix with k=7 and slice it into four pieces

cm_7 = confusion_matrix(y_test, y_pred_7)

print('Confusion matrix\n\n', cm_7)

Confusion matrix

[[90 17]
[20 27]]

In [71]: y_test.value_counts()

Out[71]: 0 107
1 47
Name: Outcome, dtype: int64


In [72]: class_names = ['0', '1']


plt.figure(figsize=(3, 3))
sns.heatmap(cm_7, annot=True, fmt='d', cmap='Blues', cbar=False, xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

12. Classification Report


The classification report is another way to evaluate classification model performance. It displays the precision, recall, f1-score and support for each class.

We can print a classification report as follows:


In [73]: from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_7))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83       107
           1       0.61      0.57      0.59        47

    accuracy                           0.76       154
   macro avg       0.72      0.71      0.71       154
weighted avg       0.76      0.76      0.76       154
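
As a sanity check, the class-1 precision and recall follow directly from the k=7 confusion matrix above; a short hand computation (the variable names tn7, fp7, fn7, tp7 are illustrative):

# derive class-1 precision and recall from cm_7 = [[90, 17], [20, 27]] (sketch)
tn7, fp7, fn7, tp7 = cm_7.ravel()
print('precision (class 1): {:.2f}'.format(tp7 / (tp7 + fp7)))  # 27/44 ≈ 0.61
print('recall    (class 1): {:.2f}'.format(tp7 / (tp7 + fn7)))  # 27/47 ≈ 0.57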
