Loading the Dataset: 'diabetes.csv'

This notebook loads and cleans a diabetes dataset, splits it into training and test sets, tunes a KNN classifier with grid search, and evaluates the best model on the held-out test set. A KNN model with 27 neighbors and p=2 achieves a mean cross-validation accuracy of about 0.77 and a test-set accuracy of about 0.80; a confusion matrix and classification report are included.


In [27]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Loading the Dataset

First we load the dataset and inspect its shape, column types, and missing-value counts.

In [2]: df = pd.read_csv('diabetes.csv')

In [3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Pregnancies    768 non-null    int64
 1   Glucose        768 non-null    int64
 2   BloodPressure  768 non-null    int64
 3   SkinThickness  768 non-null    int64
 4   Insulin        768 non-null    int64
 5   BMI            768 non-null    float64
 6   Pedigree       768 non-null    float64
 7   Age            768 non-null    int64
 8   Outcome        768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [4]: df.head()

Out[4]:    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  Pedigree  Age  Outcome
        0            6      148             72             35        0  33.6     0.627   50        1
        1            1       85             66             29        0  26.6     0.351   31        0
        2            8      183             64              0        0  23.3     0.672   32        1
        3            1       89             66             23       94  28.1     0.167   21        0
        4            0      137             40             35      168  43.1     2.288   33        1

Cleaning

We inspect the correlation matrix and drop the features that are only weakly correlated with the Outcome label.
In [11]: df.corr().style.background_gradient(cmap='BuGn')

Out[11]:                Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  Pedigree
         Pregnancies       1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683 -0.033523
         Glucose           0.129459  1.000000       0.152590       0.057328  0.331357  0.221071  0.137337
         BloodPressure     0.141282  0.152590       1.000000       0.207371  0.088933  0.281805  0.041265
         SkinThickness    -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573  0.183928
         Insulin          -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859  0.185071
         BMI               0.017683  0.221071       0.281805       0.392573  0.197859  1.000000  0.140647
         Pedigree         -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647  1.000000
         Age               0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242  0.033561
         Outcome           0.221898  0.466581       0.065068       0.074752  0.130548  0.292695  0.173844

(The Age and Outcome columns of the full 9 x 9 matrix are cut off in this rendering.)

In [13]: df.drop(['BloodPressure', 'SkinThickness'], axis=1, inplace=True)
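BloodPressure and SkinThickness are the two features most weakly correlated with Outcome (0.065 and 0.075 in the matrix above), which motivates dropping them. A minimal sketch of making the same cut programmatically; the 0.1 cutoff is an assumed value chosen to reproduce the manual choice, and it would have to run on the dataframe before the drop above:

```python
# Assumption: drop features whose absolute correlation with Outcome is below 0.1
# (run on the original dataframe, before the manual drop above)
corr_with_outcome = df.corr()['Outcome'].abs()
weak = corr_with_outcome[corr_with_outcome < 0.1].index.tolist()
print(weak)   # expected: ['BloodPressure', 'SkinThickness']
df = df.drop(columns=weak)
```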

In [14]: df.isna().sum()

Out[14]: Pregnancies 0
Glucose 0
Insulin 0
BMI 0
Pedigree 0
Age 0
Outcome 0
dtype: int64

In [15]: df.describe()

Out[15]:        Pregnancies     Glucose     Insulin         BMI    Pedigree         Age     Outcome
         count   768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
         mean      3.845052  120.894531   79.799479   31.992578    0.471876   33.240885    0.348958
         std       3.369578   31.972618  115.244002    7.884160    0.331329   11.760232    0.476951
         min       0.000000    0.000000    0.000000    0.000000    0.078000   21.000000    0.000000
         25%       1.000000   99.000000    0.000000   27.300000    0.243750   24.000000    0.000000
         50%       3.000000  117.000000   30.500000   32.000000    0.372500   29.000000    0.000000
         75%       6.000000  140.250000  127.250000   36.600000    0.626250   41.000000    1.000000
         max      17.000000  199.000000  846.000000   67.100000    2.420000   81.000000    1.000000
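The minimums of 0 for Glucose, Insulin, and BMI are physiologically implausible; in this well-known Pima diabetes dataset they conventionally encode missing readings. The notebook keeps them as-is; a hedged sketch of one common alternative, median imputation, which is not part of the original analysis:

```python
# Assumption: zeros in these columns encode missing measurements
cols_with_zero_as_missing = ['Glucose', 'Insulin', 'BMI']
df_imputed = df.copy()
df_imputed[cols_with_zero_as_missing] = (
    df_imputed[cols_with_zero_as_missing].replace(0, np.nan)
)
# Fill each column with its median computed from the non-missing values
df_imputed = df_imputed.fillna(df_imputed.median())
```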

Visualization
In [16]: hist = df.hist(figsize=(20,16))
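seaborn is imported at the top but never used; a minimal companion plot showing the class balance of the target (an addition, not part of the original notebook):

```python
# Outcome is the binary target: 0 = no diabetes, 1 = diabetes
sns.countplot(x='Outcome', data=df)
plt.title('Class balance of the Outcome label')
plt.show()
```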

Separating the features and the labels

In [17]: X = df.iloc[:, :-1]   # independent variables (everything except Outcome)
         y = df.iloc[:, -1]    # dependent variable (Outcome)
         X.shape, y.shape

Out[17]: ((768, 6), (768,))

Splitting the Dataset

We hold out 20% of the rows as a test set, then standardize the features. Scaling matters for KNN because it compares raw distances; the scaler is fit on the training set only and then applied to both sets, so no test-set statistics leak into training.

In [21]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

         # Standardize: fit on the training set, apply the same transform to the test set
         scaler = StandardScaler()
         X_train = scaler.fit_transform(X_train)
         X_test = scaler.transform(X_test)
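To see why scaling matters, note that a wide-range feature such as Insulin (0 to 846) would swamp a narrow-range one such as Pedigree (0.078 to 2.42) in the raw Euclidean distance. A minimal sketch with two hypothetical patients; the 100-unit and 1.0-unit differences are made-up values for illustration:

```python
# Hypothetical pair of patients described only by [Insulin, Pedigree]
raw = np.array([[100.0, 0.5],
                [200.0, 1.5]])
print(np.linalg.norm(raw[0] - raw[1]))        # ~100.005: the Insulin gap dominates

# After standardizing with statistics from the full feature table,
# the Pedigree difference is no longer drowned out
demo_scaler = StandardScaler().fit(X[['Insulin', 'Pedigree']].values)
scaled = demo_scaler.transform(raw)
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~3.1
```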
Machine Learning model
In [30]: def knn(X_train, X_test, y_train, y_test, neighbors, power):
             model = KNeighborsClassifier(n_neighbors=neighbors, p=power)
             # Train on the training set and predict on the test set
             y_pred = model.fit(X_train, y_train).predict(X_test)
             print(f"Accuracy for K-Nearest Neighbors model \t: {accuracy_score(y_test, y_pred)}")

             # Note: sklearn orders the matrix by class label, so row 0 is class 0,
             # which this printout treats as the "Positive Class"
             cm = confusion_matrix(y_test, y_pred)
             print(f'''Confusion matrix :\n
                | Positive Prediction\t| Negative Prediction
 ---------------+------------------------+----------------------
 Positive Class | True Positive (TP) {cm[0, 0]}\t| False Negative (FN) {cm[0, 1]}
 ---------------+------------------------+----------------------
 Negative Class | False Positive (FP) {cm[1, 0]}\t| True Negative (TN) {cm[1, 1]}\n''')

             cr = classification_report(y_test, y_pred)
             print('Classification report : \n', cr)

Hyperparameter tuning

We grid-search the number of neighbors (1 to 50) and the Minkowski distance power p (1 to 3) with 5-fold cross-validation.

In [28]: param_grid = {
             'n_neighbors': range(1, 51),
             'p': range(1, 4)
         }
         grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, cv=5)
         grid.fit(X_train, y_train)
         grid.best_estimator_, grid.best_params_, grid.best_score_

Out[28]: (KNeighborsClassifier(n_neighbors=27),
          {'n_neighbors': 27, 'p': 2},
          0.7719845395175262)
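To see how sensitive the score is to the choice of k, the mean cross-validation accuracy for every combination tried can be plotted from grid.cv_results_; a minimal sketch:

```python
# Mean cross-validation accuracy for each (n_neighbors, p) pair searched above
results = pd.DataFrame(grid.cv_results_)
for p_val, group in results.groupby('param_p'):
    plt.plot(group['param_n_neighbors'].astype(int),
             group['mean_test_score'], label=f'p={p_val}')
plt.xlabel('n_neighbors')
plt.ylabel('mean CV accuracy')
plt.legend()
plt.show()
```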

In [31]: knn(X_train, X_test, y_train, y_test, grid.best_params_['n_neighbors'], grid.best_params_['p'])

Accuracy for K-Nearest Neighbors model 	: 0.7987012987012987

Confusion matrix :

                | Positive Prediction	| Negative Prediction
 ---------------+------------------------+----------------------
 Positive Class | True Positive (TP) 91	| False Negative (FN) 11
 ---------------+------------------------+----------------------
 Negative Class | False Positive (FP) 20	| True Negative (TN) 32

Classification report :
               precision    recall  f1-score   support

           0       0.82      0.89      0.85       102
           1       0.74      0.62      0.67        52

    accuracy                           0.80       154
   macro avg       0.78      0.75      0.76       154
weighted avg       0.79      0.80      0.79       154
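As a sanity check, the report's headline numbers follow directly from the four confusion-matrix counts; a short worked sketch using the counts as printed (where the notebook treats class 0 as the "Positive Class"):

```python
# Re-deriving the headline metrics from the confusion-matrix counts above
tp0, fn0, fp0, tn0 = 91, 11, 20, 32
print(tp0 / (tp0 + fp0))                       # precision for class 0: 91/111 ≈ 0.82
print(tp0 / (tp0 + fn0))                       # recall for class 0:    91/102 ≈ 0.89
print((tp0 + tn0) / (tp0 + fn0 + fp0 + tn0))   # accuracy:              123/154 ≈ 0.80
```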
