Diabetes Prediction System With KNN Algorithm

The document describes building a diabetes prediction system using a KNN classification algorithm. It details exploring a diabetes dataset, cleaning and preprocessing the data, selecting features, splitting the data into training and testing sets, implementing a KNN function to make predictions for different K values, and plotting the results to analyze model performance.

Uploaded by

Mahtab Ansari

Diabetes Prediction System with KNN Algorithm

Submitted By: Mahtab Alam (31) and Mohd Arish (13)
Submitted To: Pallvi Mam
Project Overview:

• A classification project using the K-Nearest Neighbors (KNN) predictive model to determine whether a person has diabetes. We'll work with a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, available on Kaggle. This dataset contains valuable healthcare information. We'll start by exploring the dataset and then build our predictive model using KNN with cross-validation.
Description of the dataset:

• The dataset, obtained from Kaggle, originates from the National Institute of Diabetes and Digestive and Kidney Diseases and consists of predictor variables and an outcome indicating whether a person is diabetic or not. It contains data from 768 patients and serves as the basis for our classification task.
• Now, before diving into the K-Nearest Neighbors (KNN) algorithm, let's briefly
discuss what it entails. KNN is a type of supervised machine learning
algorithm used for classification and regression. In classification, like our
case, it predicts the class of a given data point by finding the most common
class among the k closest data points in the feature space. The choice of k,
the number of neighbors, is a crucial hyperparameter that can significantly
impact the model's performance.
KNN algorithm:
• K-Nearest Neighbors (KNN) is a supervised
machine learning algorithm that focuses on
similarity. It classifies a target variable by
predicting its class based on a specified number
of nearest neighbors. To make a prediction, KNN
calculates the distance from the instance being
classified to every instance in the training
dataset. It then assigns a class to the instance
based on the majority class of its k nearest
neighbors.
Distance between data points in the KNN algorithm:
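The distance formula pictured on this slide did not survive the export. As a minimal sketch, the Euclidean distance between two feature vectors can be computed like this (the function name is my own, not from the slides):

```python
import numpy as np

def euclidean_distance(a, b):
    # Square-root of the sum of squared per-feature differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# Example: two patients described by three features each
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # → 5.0
```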
Reading and exploring the dataset:
• We begin by loading the dataset using pandas' `read_csv()`
function, which reads the dataset and converts it into a
structured tabular format that we can easily analyze.
Input Code:
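The code screenshot is not preserved here. In the project this step would be `pd.read_csv("diabetes.csv")` on the downloaded Kaggle file (the filename is an assumption); the inline three-row sample below stands in for the file so the sketch is self-contained:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the Kaggle CSV; in the project this would be
# pd.read_csv("diabetes.csv") on the full 768-row file.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""
df = pd.read_csv(StringIO(csv_text))

print(df.shape)   # (3, 9) for this sample; the real dataset is (768, 9)
print(df.head())  # first rows in tabular form for a quick look
```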
Output:
Manipulating and Cleaning our dataset

• In this phase of data preprocessing, we focus on cleaning the dataset by addressing zero and missing values. These values can significantly impact the accuracy of our predictive model. We concentrate on specific columns, including 'Glucose', 'Blood Pressure', 'Skin Thickness', 'Insulin', 'BMI', and 'Pedigree', as they are key indicators for determining diabetes. By replacing these problematic values with the mean of their respective columns, we ensure that our dataset is more reliable and suitable for training our KNN model.
Input Code:
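The cleaning screenshot is likewise missing. A sketch of the zero-imputation step described above, on a tiny invented frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Tiny invented frame standing in for the real dataset
df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 26.6, 0.0],
})

# Zero is not a physiologically valid reading for these columns,
# so treat zeros as missing and impute with the column mean.
cols = ["Glucose", "BloodPressure", "BMI"]
df[cols] = df[cols].replace(0, np.nan)
df[cols] = df[cols].fillna(df[cols].mean())

print(df)  # the zeros are now replaced by each column's mean
```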
Output:
Let's take a quick statistical view of the data provided

Input Code:
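A sketch of the statistical-summary step, again on invented values; in the project this would simply be `df.describe()` on the full cleaned dataset:

```python
import pandas as pd

# Invented stand-in values for two of the dataset's columns
df = pd.DataFrame({"Glucose": [148, 85, 183], "BMI": [33.6, 26.6, 23.3]})

# describe() reports count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats)
```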
Output:
Plotting the dataset

• The updated diabetes dataset is now prepared for basic visualization. Plotting the data will provide a visual understanding of its distribution and help in selecting the most suitable columns for the K-Nearest Neighbors (KNN) experiment. To visualize the data, I utilized the `pairplot()` function from the Seaborn library, which generates a series of plots showing relationships between different variables in the dataset.
Input Code:
Output:
Reducing The Attributes

• Given the complexity of our multidimensional dataset, with numerous data points across various variables, we aim to simplify our analysis by selecting only a subset of variables to test our model. This approach will streamline our experimentation process and make it more manageable. For simplicity, and to focus on the most relevant data, we will select three features of the dataset.
Input Code:
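The slides do not record which three features were kept, so Glucose, BMI and Age in the sketch below are assumptions chosen for illustration (and the data values are invented):

```python
import pandas as pd

# Invented stand-in frame with several candidate predictors
df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "BloodPressure": [72, 66, 64],
    "Outcome": [1, 0, 1],
})

# Keep three predictors plus the target; Glucose, BMI and Age are
# placeholder choices, not necessarily the ones used in the project.
features = ["Glucose", "BMI", "Age"]
X = df[features]
y = df["Outcome"]

print(X.columns.tolist())  # → ['Glucose', 'BMI', 'Age']
```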
Output:
Splitting the dataset into training and testing sets
• An essential step in machine learning modeling is dividing our
dataset into training and testing sets. This division allows us to
evaluate the performance of our model accurately.
• During testing, the machine learning algorithm predicts outcomes
for the testing set based on what it learned from the training set.
This process helps us assess how well the model generalizes to
new, unseen data. By comparing the predicted outcomes to the
actual outcomes in the testing set, we can compute the accuracy
rate of the machine learning algorithm. A higher accuracy rate
indicates that the model is better at predicting outcomes for the
presented sample data, which is a desirable outcome.
Input Code:
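A sketch of the split, using synthetic stand-in data of the same size as the real dataset (768 rows); the 80/20 split ratio and `random_state` value are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected feature matrix and outcome vector
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 3))
y = rng.integers(0, 2, size=768)

# Hold out 20% of patients for testing; stratify keeps the class balance,
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # → (614, 3) (154, 3)
```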
We run a quick check to confirm that the data were split correctly
Output:
KNN function
• The K-Nearest Neighbors (KNN) algorithm assesses similarity between sample test data and training data. Similarity is evaluated for a set of K values, each giving the number of nearest data points considered around the sample. A distance measure is needed to determine which training points lie closest to the test data; the distance measure selected for this exercise is the Euclidean distance.
Continued…

• To execute these operations efficiently, I utilized the scikit-learn library, which provides built-in functions for running the KNN algorithm and calculating distances.
• I have created a function that implements the K-Nearest Neighbors (KNN) algorithm and presents the results as a line plot. This function runs the KNN algorithm once for each K value and displays the results.
Input Code:
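The function itself is not preserved in this export. A sketch of one plausible shape for it, using scikit-learn's `KNeighborsClassifier` with the Euclidean metric on synthetic stand-in data (the function name and data are my own; plotting of the scores is omitted here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real project uses the Kaggle dataset.
# The labels follow a simple rule so KNN has something to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def knn_accuracies(k_values):
    """Fit a KNN classifier (Euclidean distance) for each K and
    return the test-set accuracy of each model."""
    scores = []
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return scores

accs = knn_accuracies(range(1, 11))
print(accs)  # one accuracy value per K from 1 to 10
```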
For this exercise, I will test and plot the model with K values from 1 up to 500 to see which K values perform best overall.

Input Code:
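A sketch of the K sweep on synthetic stand-in data. A coarser grid than every value from 1 to 500 keeps the example fast, and the line plot is saved to a file whose name is invented here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 768-row stand-in; note K can never exceed the training-set size
rng = np.random.default_rng(1)
X = rng.normal(size=(768, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k_values = range(1, 501, 25)  # coarser grid than 1..500 for speed
scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)  # Euclidean by default
    scores.append(model.fit(X_train, y_train).score(X_test, y_test))

# Accuracy as a function of K, as a line plot
plt.plot(list(k_values), scores)
plt.xlabel("K (number of neighbors)")
plt.ylabel("Test accuracy")
plt.savefig("knn_k_sweep.png")
print(max(scores))
```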
Output:
Thank You
