KNN - Jupyter Notebook

The document outlines a data analysis process using Python libraries such as Pandas, NumPy, and Scikit-learn to handle a diabetes dataset. It includes steps for data cleaning, such as replacing zeros with column means, splitting the dataset into training and testing sets, and standardizing the data. Finally, it describes building a KNN classifier model for prediction on the test set.


In [1]: # import the necessary libraries
        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt

In [2]: data = pd.read_csv(
            'C:/Users/kriti/OneDrive/Desktop/machine Learning/experiments/diabetes.csv')
        data

Out[2]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns

In [3]: print(len(data))

768

In [4]: data.columns

Out[4]: Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
               'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
              dtype='object')

There are certain columns, such as Glucose, SkinThickness and Insulin, that contain 0 values where a 0 effectively represents a missing measurement.

Replace these zeros with the mean of the corresponding column.


In [5]: (data == 0).sum()

Out[5]: Pregnancies                 111
        Glucose                       5
        BloodPressure                35
        SkinThickness               227
        Insulin                     374
        BMI                          11
        DiabetesPedigreeFunction      0
        Age                           0
        Outcome                     500
        dtype: int64
In [6]: non_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
        # Iterate over all columns where 0 occurs and substitute it with NaN
        # (NaN is defined in the NumPy library),
        # then replace NaN with the mean of the column.
        for column in non_zero:
            data[column] = data[column].replace(0, np.NaN)
            mean = int(data[column].mean(skipna = True))
            data[column] = data[column].replace(np.NaN, mean)
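The loop above truncates each column mean to an integer with int(...). As an aside (a sketch, not part of the original notebook), the same cleaning can be expressed with pandas' fillna, keeping the mean as a float:

# Sketch of an equivalent cleaning step using fillna (hypothetical alternative):
# replace zeros with NaN, then fill the NaNs with the column mean as a float
# instead of truncating it to an integer.
for column in non_zero:
    col = data[column].replace(0, np.NaN)
    data[column] = col.fillna(col.mean(skipna=True))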

In [7]: # NaN is short for Not a Number.
        # It is used to represent entries that are undefined or missing values.
        # Example:
        v = np.array([1, np.NaN, 3, 4])
        np.sum(v)

Out[7]: nan

In [8]: np.nansum(v)

Out[8]: 8.0

In [9]: print(data['SkinThickness'])

0 35.0
1 29.0
2 29.0
3 23.0
4 35.0
...
763 48.0
764 27.0
765 23.0
766 29.0
767 31.0
Name: SkinThickness, Length: 768, dtype: float64

In [10]: x = data.iloc[:, 0:8]
         x

Out[10]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0              6    148.0           72.0           35.0    155.0  33.6                     0.627   50
1              1     85.0           66.0           29.0    155.0  26.6                     0.351   31
2              8    183.0           64.0           29.0    155.0  23.3                     0.672   32
3              1     89.0           66.0           23.0     94.0  28.1                     0.167   21
4              0    137.0           40.0           35.0    168.0  43.1                     2.288   33
..           ...      ...            ...            ...      ...   ...                       ...  ...
763           10    101.0           76.0           48.0    180.0  32.9                     0.171   63
764            2    122.0           70.0           27.0    155.0  36.8                     0.340   27
765            5    121.0           72.0           23.0    112.0  26.2                     0.245   30
766            1    126.0           60.0           29.0    155.0  30.1                     0.349   47
767            1     93.0           70.0           31.0    155.0  30.4                     0.315   23

768 rows × 8 columns

- sklearn library: Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction. It offers:
• Supervised learning algorithms
• Cross-validation
• Unsupervised learning algorithms
• Various toy datasets (e.g. the Iris dataset, the Boston House Prices dataset)
• Feature extraction, e.g. using Scikit-learn to extract features from images
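As a small illustration of the toy datasets mentioned above, here is a minimal sketch (not part of the original notebook) that loads one of the bundled datasets:

# Sketch: load the bundled Iris toy dataset from scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4) feature matrix
print(iris.target.shape)  # (150,) class labels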

In [11]: y = data.iloc[:, 8]
         y

Out[11]: 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64

- model_selection: a set of methods for setting up a blueprint to analyze data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction.

- train_test_split is a function in sklearn's model_selection module for splitting data arrays into two subsets: one for training data and one for testing data. With this function, you don't need to divide the dataset manually.

In [27]: # Split the data into train and test sets
         from sklearn.model_selection import train_test_split
         x = data.iloc[:, 0:8]
         y = data.iloc[:, 8]
         x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=89)
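With test_size=0.2 on 768 rows, this leaves roughly 614 rows for training and 154 for testing (154 is also the support shown in the classification report near the end of the notebook). A quick sanity check, as a sketch not present in the original notebook:

# Sketch: confirm the sizes produced by the 80/20 split above.
print(x_train.shape, x_test.shape)   # expected: (614, 8) (154, 8)
print(y_train.shape, y_test.shape)   # expected: (614,) (154,)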

In [28]: print(x)

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148.0 72.0 35.0 155.0 33.6
1 1 85.0 66.0 29.0 155.0 26.6
2 8 183.0 64.0 29.0 155.0 23.3
3 1 89.0 66.0 23.0 94.0 28.1
4 0 137.0 40.0 35.0 168.0 43.1
.. ... ... ... ... ... ...
763 10 101.0 76.0 48.0 180.0 32.9
764 2 122.0 70.0 27.0 155.0 36.8
765 5 121.0 72.0 23.0 112.0 26.2
766 1 126.0 60.0 29.0 155.0 30.1
767 1 93.0 70.0 31.0 155.0 30.4

DiabetesPedigreeFunction Age
0 0.627 50
1 0.351 31
2 0.672 32
3 0.167 21
4 2.288 33
.. ... ...
763 0.171 63
764 0.340 27
765 0.245 30
766 0.349 47
767 0.315 23

[768 rows x 8 columns]


In [14]: print(y)

0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64

Data standardization is the process of rescaling the attributes so that they have mean 0 and variance 1. The ultimate goal of standardization is to bring all the features down to a common scale without distorting the differences in the range of the values. In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature. The idea behind StandardScaler is that it will transform your data such that its distribution has a mean value of 0 and a standard deviation of 1. The Python sklearn library offers the StandardScaler() function to standardize the data values into a standard format with mean 0 and standard deviation 1.

Feature scaling is done by giving every input column a mean of 0 and a standard deviation of 1. The fit_transform function is used on x_train to learn the parameters (mean and standard deviation) so that standardization can happen; the transform function is then used on x_test to apply the parameters (mean and standard deviation) of the training set to the test set.

In [30]: from sklearn.preprocessing import StandardScaler

         sc = StandardScaler()
         x_train = sc.fit_transform(x_train)
         x_test = sc.transform(x_test)

object = StandardScaler()
object.fit_transform(data)

According to the above syntax, we initially create an object of the StandardScaler() class and then call fit_transform(). fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data (mean and standard deviation). Here, the model we build learns the mean and standard deviation of the features of the training set. These learned parameters are then used to scale the test data as well. Using the transform method, we apply the same mean and standard deviation calculated from the training data to transform the test data. Thus, the parameters learned by our model from the training data also help us to transform our test data.
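To see that the scaler behaves as described above, here is a minimal sketch (not part of the original notebook) that inspects the transformed arrays and the parameters the scaler learned:

# Sketch: after sc.fit_transform(x_train) and sc.transform(x_test),
# every training column should have mean ~0 and standard deviation ~1.
print(np.round(x_train.mean(axis=0), 3))   # ~0 for each feature
print(np.round(x_train.std(axis=0), 3))    # ~1 for each feature

# The scaler stores the parameters it learned from the training set:
print(sc.mean_)    # per-feature means of the training data
print(sc.scale_)   # per-feature standard deviations of the training data

# The test set is scaled with those same training parameters, so its
# column means are only approximately 0, not exactly 0.
print(np.round(x_test.mean(axis=0), 3))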
In [31]: x_train

Out[31]: array([[-0.83444704,  2.4607925 ,  0.2729567 , ...,  1.49054148,
                  2.83104312, -0.95428911],
                [ 0.6549439 ,  0.46408671, -0.98652789, ..., -1.1976073 ,
                 -0.97626322,  1.81271231],
                [-0.83444704,  0.23865218, -0.23083713, ..., -0.95323013,
                  0.00695909, -0.95428911],
                ...,
                [-0.83444704, -1.46820922, -1.40635608, ...,  0.11052928,
                  2.39303443, -0.78659205],
                [-0.23869066,  1.52684947, -0.73463097, ...,  0.2830308 ,
                 -0.34834866, -0.28350089],
                [ 0.35706571,  1.04377549,  0.94468182, ...,  0.88678614,
                  0.45721978,  0.05189323]])

Generating the model: let's build a KNN classifier model.

First, import the KNeighborsClassifier module and create a KNN classifier object by passing the number of neighbors as an argument to the KNeighborsClassifier() function.

In [32]: # define the model
         from sklearn.neighbors import KNeighborsClassifier
         # classifier = KNeighborsClassifier(n_neighbors = 11, p = 2, metric = 'euclidean')
         classifier = KNeighborsClassifier(n_neighbors = 11, metric = 'euclidean')
         # p = power parameter of the Minkowski metric: 1 for Manhattan and 2 for Euclidean distance

Why k is odd: Let's think for a while. The k in the KNN algorithm represents the number of closest neighbors that you are comparing, right? So, no matter whether you have 2 or n classes, if you choose an even k there is a risk of a tie in the decision of which class to assign to a new instance. This is why k is usually odd: no ties (see the sketch below).
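A minimal sketch (not part of the original notebook) of how a tie can occur with an even k in a two-class problem:

# Sketch: with k = 4 nearest neighbors, the votes can split 2 vs 2,
# so the majority vote is ambiguous; with an odd k this cannot happen
# in a two-class problem.
from collections import Counter

neighbor_labels = [0, 1, 0, 1]   # hypothetical labels of the 4 nearest neighbors
votes = Counter(neighbor_labels)
print(votes)                     # Counter({0: 2, 1: 2}) -> tie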

Fit the model on the train set using fit() and perform prediction on the test set using predict().

In [18]: classifier.fit(x_train, y_train)

Out[18]: KNeighborsClassifier(metric='euclidean', n_neighbors=11)

In [19]: # prediction on test set
         y_pred = classifier.predict(x_test)
         y_pred

Out[19]: array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
dtype=int64)

In [20]: # Generate the confusion matrix
         from sklearn.metrics import confusion_matrix
         cm = confusion_matrix(y_test, y_pred)
         print(cm)

[[82 21]
[23 28]]
In [21]: # Find accuracy
         from sklearn.metrics import accuracy_score
         print(accuracy_score(y_test, y_pred))

0.7142857142857143

In [22]: # Find F1 score
         from sklearn.metrics import f1_score
         print(f1_score(y_test, y_pred))

0.5599999999999999
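Both scores can be read straight off the confusion matrix [[82 21] [23 28]] printed above, with class 1 treated as the positive class. A minimal sketch (not part of the original notebook) that reproduces them by hand:

# Sketch: recompute accuracy and F1 from the confusion matrix above.
tn, fp, fn, tp = 82, 21, 23, 28

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 110 / 154 = 0.714...
precision = tp / (tp + fp)                            # 28 / 49
recall = tp / (tp + fn)                               # 28 / 51
f1 = 2 * precision * recall / (precision + recall)    # 0.56

print(accuracy, f1)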

Elbow method: helps to find the optimal value of k. The Elbow method helps data scientists select the optimal number of neighbors for KNN. As K increases, the error usually goes down, then stabilizes, and then rises again. Pick the optimal K at the beginning of the stable zone. This is called the Elbow Method.

In [23]: from sklearn.neighbors import KNeighborsClassifier

         error_rate = []
         for i in range(1, 40):
             knn = KNeighborsClassifier(n_neighbors=i)
             knn.fit(x_train, y_train)
             pred_i = knn.predict(x_test)
             error_rate.append(np.mean(pred_i != y_test))

In [24]: error_rate

Out[24]: [0.3246753246753247,
0.2922077922077922,
0.2792207792207792,
0.2792207792207792,
0.2922077922077922,
0.2727272727272727,
0.2727272727272727,
0.2662337662337662,
0.2792207792207792,
0.2662337662337662,
0.2857142857142857,
0.2597402597402597,
0.2792207792207792,
0.2792207792207792,
0.2597402597402597,
0.2727272727272727,
0.2532467532467532,
0.2597402597402597,
0.24675324675324675,
0.24025974025974026,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.24675324675324675,
0.24675324675324675,
0.2532467532467532,
0.2597402597402597,
0.2727272727272727,
0.2792207792207792,
0.2727272727272727,
0.2662337662337662,
0.2597402597402597,
0.2597402597402597,
0.2532467532467532,
0.2792207792207792,
0.2597402597402597,
0.2662337662337662]
In [25]: plt.figure(figsize=(10, 6))
         plt.plot(range(1, 40), error_rate, color="blue", linestyle="dashed", marker="o",
                  markerfacecolor="red", markersize=10)
         plt.title("Error Rate vs. K Value")
         plt.xlabel("K")
         plt.ylabel("Error Rate")

Out[25]: Text(0, 0.5, 'Error Rate')
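The error_rate list bottoms out at k = 20 (error ≈ 0.2403), which is why k = 20 is used in the next cell. As a sketch (not part of the original notebook), the same choice can be made programmatically instead of reading it off the plot:

# Sketch: pick the k with the lowest error rate from the list computed above.
best_k = int(np.argmin(error_rate)) + 1    # +1 because k started at 1
print(best_k, error_rate[best_k - 1])      # expected: 20 0.2402...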

Use the confusion_matrix function from sklearn.metrics to compute the confusion matrix.
classification_report: gives a text report showing the main classification metrics.

In [26]: from sklearn.metrics import confusion_matrix
         from sklearn.metrics import classification_report
         # NOW WITH K=20
         knn = KNeighborsClassifier(n_neighbors=20)
         knn.fit(x_train, y_train)
         pred = knn.predict(x_test)

         print(confusion_matrix(y_test, pred))

         print(classification_report(y_test, pred))

[[90 13]
 [24 27]]
              precision    recall  f1-score   support

           0       0.79      0.87      0.83       103
           1       0.68      0.53      0.59        51

    accuracy                           0.76       154
   macro avg       0.73      0.70      0.71       154
weighted avg       0.75      0.76      0.75       154
