
K-Nearest Neighbors for Diabetes Prediction

Malik Yousaf (F2020019038)


Ahsan Rauf (F2020019057)
Introduction:
Diabetes is a prevalent chronic disease that affects millions of individuals worldwide, posing
significant challenges to public health systems and individual well-being. Early diagnosis and
effective management are crucial for mitigating its complications and improving patient
outcomes. In this project, we apply machine learning to predict the onset of diabetes from patient data. Using a dataset of health metrics and diabetes outcomes, our objective is to develop a predictive model capable of identifying individuals at risk of developing diabetes based on their physiological attributes.
The dataset used in this project encompasses diverse features such as glucose levels, blood pressure, BMI, insulin levels, age, and pregnancy history, among others. These features serve as
critical indicators in understanding the risk factors associated with diabetes. Through
exploratory data analysis (EDA) techniques, we aim to uncover patterns, correlations, and
anomalies within the data that may influence diabetes prediction. Visualizations such as
histograms, box plots, and correlation matrices will aid in gaining insights into the distribution
and relationships between these variables, thereby guiding feature selection and
preprocessing steps.
Machine learning algorithms, particularly classification techniques, will be employed to build
predictive models. Algorithms like K-Nearest Neighbors (KNN), Decision Trees, and Logistic
Regression will be evaluated for their efficacy in predicting diabetes based on the dataset's
attributes. Model evaluation metrics such as accuracy, precision, recall, and F1-score will be
used to assess and compare the performance of these models. Ultimately, this project aims to
contribute to the development of robust predictive tools that can assist healthcare
professionals in early diabetes detection and personalized patient care strategies.

Procedure:

Importing Libraries:
This Python code sets up an environment for data analysis by importing several key libraries:
numpy for numerical operations, pandas for data manipulation and analysis, matplotlib.pyplot
and seaborn for data visualization, and warnings to suppress unnecessary warning messages
for cleaner output. It includes a script to traverse and list all files in the `/kaggle/input`
directory using the os module, ensuring that all necessary data files are identified and
accessible. This setup prepares the groundwork for efficiently loading, processing, and
visualizing the dataset, crucial for any data-driven analysis or machine learning project.
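A minimal sketch of this setup, assuming the standard Kaggle notebook preamble (the exact cell is not reproduced in this report):

```python
import os
import warnings

import numpy as np               # numerical operations
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization

warnings.filterwarnings('ignore')  # suppress warning messages for cleaner output

# Traverse /kaggle/input and list all available data files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```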
Loading Dataset:
The first line of code, `df = pd.read_csv("/content/diabetes.csv")`, uses the pandas library to
read a CSV (Comma Separated Values) file named "diabetes.csv" located at the specified path
"/content/". The function `pd.read_csv` loads the data from the CSV file into a pandas
DataFrame, which is a two-dimensional labeled data structure with columns of potentially
different types. This DataFrame, stored in the variable `df`, allows for efficient data
manipulation and analysis, providing a structure similar to a table in a database or an Excel
spreadsheet.

The second line, `df.head()`, calls the `head()` method on the DataFrame `df` to display the first
five rows of the dataset. This method is useful for quickly inspecting the dataset to understand
its structure and contents. It shows the top records, which include the column names and the
initial entries, providing an overview of the data and helping to verify that it has been loaded
correctly. This initial inspection is crucial for getting familiar with the dataset, identifying any
obvious issues, and planning further data processing or analysis steps.
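For reference, the two lines described above:

```python
df = pd.read_csv("/content/diabetes.csv")  # load the CSV into a DataFrame
df.head()  # display the first five rows to verify the load
```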

Refining the Dataset:
Inspecting Shape, Columns, and Data Types:
The first line of code, `df.shape`, is used to determine the dimensions of the DataFrame `df`.
The `shape` attribute returns a tuple representing the number of rows and columns in the
DataFrame. In this case, it would return `(768, 9)`, indicating that the dataset consists of 768
rows and 9 columns. This information is essential for understanding the size of the dataset,
which can influence various aspects of data processing and analysis, such as memory
requirements and computational complexity.

Checking the shape


The second line, `df.columns`, is used to retrieve the column names of the DataFrame. This
attribute returns an Index object containing the names of all columns in the DataFrame.
Knowing the column names is crucial for understanding what data is available and how it is
organized. It also helps in referencing specific columns for operations like data selection,
filtering, and analysis. For instance, you might want to know the names to perform operations
like `df['Glucose']` to access the glucose column specifically.

Listing the columns


The third line, `df.dtypes`, provides information about the data types of each column in the
DataFrame. This attribute returns a Series with the data type of each column. Understanding
the data types is important because it affects how the data can be manipulated and analyzed.
For example, numerical operations can only be performed on numeric data types, and certain
methods or functions require data to be of a specific type. Knowing the data types helps in
planning data preprocessing steps, such as converting data types or handling missing values
appropriately.
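Together, the three inspection calls look like the following (expected output noted in comments; the column order assumes the standard layout of this dataset):

```python
df.shape    # (768, 9): 768 rows and 9 columns
df.columns  # Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            #        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
df.dtypes   # the data type (int64 or float64) of each column
```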
Defining Data Types:
The first line of code, `df.dtypes`, returns the data types of each column in the DataFrame `df`.
This attribute produces a pandas Series with the data type for each column, such as integers,
floats, or objects (usually strings). Understanding the data types is crucial for various
computations and operations, as it informs you which mathematical operations can be
performed, which statistical methods are applicable, and how to handle data during
preprocessing. For instance, numeric columns can be used for calculations and aggregations,
while object types might need conversion or different handling methods.

Data Information Collection:


The second line, `df.info()`, provides a concise summary of the DataFrame. This method prints
detailed information about the DataFrame, including the index data type, column data types,
non-null values count for each column, and the memory usage. The output helps in
understanding the structure and completeness of the dataset, identifying any missing values,
and assessing the overall size of the data in memory. This summary is particularly useful for
getting a quick overview of the dataset's health and ensuring that it is in a suitable state for
further analysis or machine learning tasks.
Data Description:
The third line, `df.describe()`, generates descriptive statistics that summarize the central
tendency, dispersion, and shape of the dataset’s distribution, excluding NaN values. This
method returns a DataFrame with statistical measures such as count, mean, standard
deviation, minimum, quartiles, and maximum for each numeric column. It helps in
understanding how the data is distributed, identifying any outliers or anomalies, and gaining
insights into the data's overall characteristics. These summary statistics are valuable for initial
data exploration and can guide subsequent data cleaning and analysis steps.
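For reference, the two summary calls described above:

```python
df.info()      # index and column dtypes, non-null counts, memory usage
df.describe()  # count, mean, std, min, quartiles, and max per numeric column
```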
Data Cleaning:
Removing Duplicates:
The first line of code, `df = df.drop_duplicates()`, is used to remove any duplicate rows from
the DataFrame `df`. The `drop_duplicates` method checks for duplicate rows based on all
columns by default and removes any such duplicates, keeping only the first occurrence.
Assigning the result back to `df` ensures that the DataFrame is updated with these changes.
This step is crucial for data integrity, as duplicates can skew analysis results, lead to incorrect
insights, and affect the performance of machine learning models.

Finding Missing Values:


The second block of code, `df.isnull().sum()`, checks for missing values in the DataFrame. The
`isnull` method returns a DataFrame of the same shape as `df`, where each element is a
boolean indicating whether it is a null value. The `sum` method then aggregates these boolean
values column-wise, providing a count of missing values for each column. This helps in
identifying columns with missing data, which is essential for data cleaning and preprocessing.
The comment indicates that after executing this code, it was concluded that there are no null
values in the dataset, ensuring data completeness for subsequent analysis.
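The two cleaning steps, for reference:

```python
df = df.drop_duplicates()  # remove duplicate rows, keeping the first occurrence
df.isnull().sum()          # per-column count of missing values (all zero here)
```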

Checking for Zero Values:


The third block of code checks for zero values in specific columns where zero is not a valid or
possible value. The `print` statements are used to display the count of rows where
`BloodPressure`, `Glucose`, `SkinThickness`, `Insulin`, and `BMI` have zero values. For each
column, `df[df['ColumnName'] == 0].shape[0]` filters the DataFrame to include only rows
where the specified column has a value of zero and then retrieves the count of such rows. This
step is important for identifying and handling potential data entry errors or missing data
represented as zeros. The comment clarifies that columns like Age, DiabetesPedigreeFunction,
and the number of pregnancies do not need this check, as zeros in these columns are either
not present or are valid values.
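The report describes one `print` statement per column; a compact, equivalent sketch:

```python
# Count rows where a zero is not a physiologically plausible value
for col in ['BloodPressure', 'Glucose', 'SkinThickness', 'Insulin', 'BMI']:
    print(f"Rows with zero {col}: {df[df[col] == 0].shape[0]}")
```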

Data Visualization:
Plotting the Outcome Distribution:
The line of code `sns.countplot(x='Outcome', data=df)` uses the seaborn library to create a
count plot that visualizes the frequency distribution of the 'Outcome' column in the DataFrame
`df`. In this context, `sns` is an alias for the seaborn library, which is widely used for creating
attractive and informative statistical graphics. The `countplot` function generates a bar plot showing the count of observations for each unique value of the specified categorical variable, 'Outcome' in this case, making it particularly useful for visualizing the distribution of categorical data.
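For reference (`plt.show()` added here to render the plot outside a notebook):

```python
sns.countplot(x='Outcome', data=df)  # bar of counts for each Outcome value
plt.show()
```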
Forming the Histograms:
The lines of code `df.hist(bins=10, figsize=(10, 10))` followed by `plt.show()` generate
histograms for each feature (column) in the DataFrame `df` and then display them. Here is a
detailed explanation of what each part of the code does:

df.hist: This method is a quick way to generate histograms for each numeric column in the
DataFrame `df`. A histogram is a type of bar plot that represents the distribution of a dataset
by showing the frequency of data points that fall within certain ranges, or bins. Each numeric
feature in the DataFrame will have its histogram plotted, allowing for a visual inspection of the
distribution of values within that feature.

bins=10: This parameter specifies the number of bins (intervals) to use for each histogram.
In this case, the range of values for each feature is divided into 10 equal-width bins. The choice
of the number of bins can affect the granularity and readability of the histogram; fewer bins
might oversimplify the data distribution, while too many bins might overcomplicate it.

figsize=(10, 10) : This parameter sets the size of the entire figure (the collection of
histograms) to 10 inches by 10 inches. Adjusting the figure size ensures that the histograms are
readable and well-spaced, which is especially important when dealing with multiple subplots.

plt.show() : This function from the matplotlib library displays the plotted figure. Without this line, the histograms might not be rendered in some environments, particularly when the code runs as a plain Python script rather than interactively in a Jupyter notebook.
When these lines are executed, a grid of histograms is created, with each histogram
corresponding to one numeric feature in the DataFrame. This visualization allows for an
immediate assessment of the data distribution for each feature, highlighting aspects such as
skewness, spread, and the presence of outliers. Understanding the distribution of each feature
is crucial for various stages of data analysis and preprocessing, such as normalization, handling
outliers, and selecting appropriate machine learning algorithms.
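For reference, the two lines discussed:

```python
df.hist(bins=10, figsize=(10, 10))  # one histogram per numeric column
plt.show()
```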
Data Plotting:
The provided lines of code create a figure with multiple subplots, each displaying a box plot for a specific feature (column) of the DataFrame `df`. Here is a detailed explanation of what each part of the code accomplishes:
plt.figure(figsize=(16,12)): This line initializes a new figure with a specified size of 16
inches in width and 12 inches in height. This size ensures that the overall figure is large enough
to accommodate multiple subplots without them being cramped.

sns.set_style(style='whitegrid'): This seaborn function sets the aesthetic style of the plots. In this case, the `whitegrid` style is chosen, which adds grid lines to the background of the plots for better readability without overshadowing the plotted data.
plt.subplot(3,3,1) to plt.subplot(3,3,8): These lines define the layout of the subplots
within the figure. The `plt.subplot` function divides the figure into a 3x3 grid (3 rows and 3
columns). The third argument in each subplot call specifies the position of the subplot in the
grid. For example, `(3,3,1)` corresponds to the first subplot in the first row, `(3,3,2)` to the
second subplot in the first row, and so on up to `(3,3,8)` which is the eighth subplot in the grid.

sns.boxplot(x='FeatureName', data=df): These lines use seaborn's `boxplot` function to create a box plot for each specified feature (`Glucose`, `BloodPressure`, `Insulin`, `BMI`, `Age`, `SkinThickness`, `Pregnancies`, `DiabetesPedigreeFunction`). A box plot is a
graphical summary of the distribution of numerical data through quartiles. The box represents
the interquartile range (IQR) between the first (Q1) and third (Q3) quartiles, with a line inside
marking the median (Q2). Whiskers extend from the box to the minimum and maximum values
within 1.5 times the IQR from the quartiles, while outliers beyond this range are plotted
individually.
Each subplot displays a box plot for a specific feature, providing insights into its distribution,
central tendency, and presence of outliers. These visualizations are particularly useful for
identifying potential data anomalies, understanding the spread of values, and making informed
decisions about data preprocessing steps such as outlier removal or feature scaling before
applying machine learning algorithms. The structured layout of subplots helps in comparing
different features side by side, facilitating a comprehensive analysis of the dataset's numeric
attributes.
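The report writes out eight separate `plt.subplot`/`sns.boxplot` calls; a compact, loop-based sketch of the same figure:

```python
plt.figure(figsize=(16, 12))
sns.set_style(style='whitegrid')

features = ['Glucose', 'BloodPressure', 'Insulin', 'BMI', 'Age',
            'SkinThickness', 'Pregnancies', 'DiabetesPedigreeFunction']
for i, feature in enumerate(features, start=1):
    plt.subplot(3, 3, i)             # i-th cell of the 3x3 grid
    sns.boxplot(x=feature, data=df)  # box plot of this feature's distribution
plt.show()
```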
K-Nearest Neighbors:
These lines of code demonstrate the process of tuning hyperparameters for a K-Nearest
Neighbors (KNN) classifier using GridSearchCV from scikit-learn. Initially, a KNeighborsClassifier
is instantiated without specifying any hyperparameters. A range of hyperparameters to tune is
defined, including `n_neighbors` (number of neighbors), `p` (power parameter for Minkowski
distance), `weights` (weight function used in prediction), and `metric` (distance metric). These
parameters are organized into a dictionary named `hyperparameters`.
Next, RepeatedStratifiedKFold cross-validation is configured (`cv`) to ensure robust evaluation
of the model's performance across different splits of the dataset. GridSearchCV is then
employed to exhaustively search through the specified hyperparameter space
(`param_grid=hyperparameters`), using F1 score (`scoring='f1'`) as the evaluation metric. The
best performing model is selected based on the highest F1 score across all cross-validation
folds and repetitions (`best_model = grid_search.fit(X_train, y_train)`). Finally, the best
hyperparameters are printed out, providing insights into the optimal configuration for the KNN
classifier on the given dataset (`best_model.best_estimator_.get_params()` retrieves these
values). This methodical approach helps in fine-tuning the model to achieve better predictive
performance and generalizability.
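A sketch of the tuning pipeline described above. The exact candidate values, the cross-validation split and repeat counts, and the train/test split are not given in the report, so those details below are illustrative assumptions:

```python
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Assumed train/test split on the cleaned dataset
X = df.drop(columns='Outcome')
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

knn = KNeighborsClassifier()

# Hyperparameter grid; the specific candidate values are assumptions
hyperparameters = {
    'n_neighbors': list(range(1, 31)),
    'p': [1, 2],                         # Minkowski power: 1 = Manhattan, 2 = Euclidean
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski'],
}

# Repeated stratified k-fold cross-validation for robust performance estimates
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid_search = GridSearchCV(knn, param_grid=hyperparameters,
                           scoring='f1', cv=cv)
best_model = grid_search.fit(X_train, y_train)
print(best_model.best_estimator_.get_params())  # best hyperparameter values
```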
Conclusion:
In conclusion, this project has demonstrated the application of machine learning techniques in
predicting diabetes based on a comprehensive dataset of health metrics. Through exploratory
data analysis, we identified significant insights into the distribution and relationships among
various physiological attributes and diabetes outcomes. Histograms and box plots revealed the spread of the data and flagged potential outliers or anomalies,
informing our preprocessing strategies.
We implemented and evaluated multiple machine learning models including K-Nearest
Neighbors (KNN), Decision Trees, and Logistic Regression. These models were assessed using
performance metrics like accuracy, precision, recall, and F1-score, with KNN showing promising
results in predicting diabetes onset. The findings highlight the potential of machine learning in
healthcare for early disease detection and management, emphasizing the importance of data-
driven approaches in improving diagnostic accuracy and patient outcomes. Future work could
explore more advanced algorithms, incorporate additional data sources, and focus on
integrating predictive models into clinical practice to enhance diabetes prevention and
treatment strategies effectively.
