
K-Nearest Neighbors for Diabetes Prediction

Malik Yousaf (F2020019038)


Ahsan Rauf (F2020019057)
Introduction:
Diabetes is a prevalent chronic disease that affects millions of individuals worldwide, posing
significant challenges to public health systems and individual well-being. Early diagnosis and
effective management are crucial for mitigating its complications and improving patient
outcomes. In this project, we apply machine learning to predict the onset of diabetes from patient data. Using a dataset of health metrics and diabetes outcomes, our objective is to develop a predictive model capable of identifying individuals at risk of developing diabetes based on their physiological attributes.
The dataset used in this project encompasses diverse features such as glucose levels, blood pressure, BMI, insulin levels, age, and pregnancy history, among others. These features serve as
critical indicators in understanding the risk factors associated with diabetes. Through
exploratory data analysis (EDA) techniques, we aim to uncover patterns, correlations, and
anomalies within the data that may influence diabetes prediction. Visualizations such as
histograms, box plots, and correlation matrices will aid in gaining insights into the distribution
and relationships between these variables, thereby guiding feature selection and
preprocessing steps.
Machine learning algorithms, particularly classification techniques, will be employed to build
predictive models. Algorithms like K-Nearest Neighbors (KNN), Decision Trees, and Logistic
Regression will be evaluated for their efficacy in predicting diabetes based on the dataset's
attributes. Model evaluation metrics such as accuracy, precision, recall, and F1-score will be
used to assess and compare the performance of these models. Ultimately, this project aims to
contribute to the development of robust predictive tools that can assist healthcare
professionals in early diabetes detection and personalized patient care strategies.

Procedure:

Importing Libraries:
This Python code sets up an environment for data analysis by importing several key libraries:
numpy for numerical operations, pandas for data manipulation and analysis, matplotlib.pyplot
and seaborn for data visualization, and warnings to suppress unnecessary warning messages
for cleaner output. It includes a script to traverse and list all files in the `/kaggle/input`
directory using the os module, ensuring that all necessary data files are identified and
accessible. This setup prepares the groundwork for efficiently loading, processing, and
visualizing the dataset, crucial for any data-driven analysis or machine learning project.
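A minimal sketch of this setup, assuming the standard Kaggle notebook preamble (the exact cell is not reproduced in this report):

```python
import os
import warnings

import numpy as np               # numerical operations
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization

warnings.filterwarnings('ignore')  # suppress warning messages for cleaner output

# Traverse /kaggle/input and list all available data files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```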
Loading Dataset:
The first line of code, `df = pd.read_csv("/content/diabetes.csv")`, uses the pandas library to
read a CSV (Comma Separated Values) file named "diabetes.csv" located at the specified path
"/content/". The function `pd.read_csv` loads the data from the CSV file into a pandas
DataFrame, which is a two-dimensional labeled data structure with columns of potentially
different types. This DataFrame, stored in the variable `df`, allows for efficient data
manipulation and analysis, providing a structure similar to a table in a database or an Excel
spreadsheet.

The second line, `df.head()`, calls the `head()` method on the DataFrame `df` to display the first
five rows of the dataset. This method is useful for quickly inspecting the dataset to understand
its structure and contents. It shows the top records, which include the column names and the
initial entries, providing an overview of the data and helping to verify that it has been loaded
correctly. This initial inspection is crucial for getting familiar with the dataset, identifying any
obvious issues, and planning further data processing or analysis steps.
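For reference, the two lines described above:

```python
df = pd.read_csv("/content/diabetes.csv")  # load the CSV into a DataFrame
df.head()  # display the first five rows to verify the load
```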

Refining the Dataset:
Inspecting Shape, Columns, and Data Types:
The first line of code, `df.shape`, is used to determine the dimensions of the DataFrame `df`.
The `shape` attribute returns a tuple representing the number of rows and columns in the
DataFrame. In this case, it would return `(768, 9)`, indicating that the dataset consists of 768
rows and 9 columns. This information is essential for understanding the size of the dataset,
which can influence various aspects of data processing and analysis, such as memory
requirements and computational complexity.

Checking the shape


The second line, `df.columns`, is used to retrieve the column names of the DataFrame. This
attribute returns an Index object containing the names of all columns in the DataFrame.
Knowing the column names is crucial for understanding what data is available and how it is
organized. It also helps in referencing specific columns for operations like data selection,
filtering, and analysis. For instance, you might want to know the names to perform operations
like `df['Glucose']` to access the glucose column specifically.

Listing the columns


The third line, `df.dtypes`, provides information about the data types of each column in the
DataFrame. This attribute returns a Series with the data type of each column. Understanding
the data types is important because it affects how the data can be manipulated and analyzed.
For example, numerical operations can only be performed on numeric data types, and certain
methods or functions require data to be of a specific type. Knowing the data types helps in
planning data preprocessing steps, such as converting data types or handling missing values
appropriately.
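Together, the three inspection calls look like the following (expected output noted in comments; the column order assumes the standard layout of this dataset):

```python
df.shape    # (768, 9): 768 rows and 9 columns
df.columns  # Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            #        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
df.dtypes   # the data type (int64 or float64) of each column
```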
Defining Data Types:
The first line of code, `df.dtypes`, returns the data types of each column in the DataFrame `df`.
This attribute produces a pandas Series with the data type for each column, such as integers,
floats, or objects (usually strings). Understanding the data types is crucial for various
computations and operations, as it informs you which mathematical operations can be
performed, which statistical methods are applicable, and how to handle data during
preprocessing. For instance, numeric columns can be used for calculations and aggregations,
while object types might need conversion or different handling methods.

Data Information Collection:


The second line, `df.info()`, provides a concise summary of the DataFrame. This method prints
detailed information about the DataFrame, including the index data type, column data types,
non-null values count for each column, and the memory usage. The output helps in
understanding the structure and completeness of the dataset, identifying any missing values,
and assessing the overall size of the data in memory. This summary is particularly useful for
getting a quick overview of the dataset's health and ensuring that it is in a suitable state for
further analysis or machine learning tasks.
Data Description:
The third line, `df.describe()`, generates descriptive statistics that summarize the central
tendency, dispersion, and shape of the dataset’s distribution, excluding NaN values. This
method returns a DataFrame with statistical measures such as count, mean, standard
deviation, minimum, quartiles, and maximum for each numeric column. It helps in
understanding how the data is distributed, identifying any outliers or anomalies, and gaining
insights into the data's overall characteristics. These summary statistics are valuable for initial
data exploration and can guide subsequent data cleaning and analysis steps.
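For reference, the two summary calls described above:

```python
df.info()      # index and column dtypes, non-null counts, memory usage
df.describe()  # count, mean, std, min, quartiles, and max per numeric column
```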
Data Cleaning:
Removing Duplicates:
The first line of code, `df = df.drop_duplicates()`, is used to remove any duplicate rows from
the DataFrame `df`. The `drop_duplicates` method checks for duplicate rows based on all
columns by default and removes any such duplicates, keeping only the first occurrence.
Assigning the result back to `df` ensures that the DataFrame is updated with these changes.
This step is crucial for data integrity, as duplicates can skew analysis results, lead to incorrect
insights, and affect the performance of machine learning models.

Finding Missing Values:


The second block of code, `df.isnull().sum()`, checks for missing values in the DataFrame. The
`isnull` method returns a DataFrame of the same shape as `df`, where each element is a
boolean indicating whether it is a null value. The `sum` method then aggregates these boolean
values column-wise, providing a count of missing values for each column. This helps in
identifying columns with missing data, which is essential for data cleaning and preprocessing.
The comment indicates that after executing this code, it was concluded that there are no null
values in the dataset, ensuring data completeness for subsequent analysis.
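The two cleaning steps, for reference:

```python
df = df.drop_duplicates()  # remove duplicate rows, keeping the first occurrence
df.isnull().sum()          # per-column count of missing values (all zero here)
```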

Checking for Zero Values:


The third block of code checks for zero values in specific columns where zero is not a valid or
possible value. The `print` statements are used to display the count of rows where
`BloodPressure`, `Glucose`, `SkinThickness`, `Insulin`, and `BMI` have zero values. For each
column, `df[df['ColumnName'] == 0].shape[0]` filters the DataFrame to include only rows
where the specified column has a value of zero and then retrieves the count of such rows. This
step is important for identifying and handling potential data entry errors or missing data
represented as zeros. The comment clarifies that columns like Age, DiabetesPedigreeFunction,
and the number of pregnancies do not need this check, as zeros in these columns are either
not present or are valid values.
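The report describes one `print` statement per column; a compact, equivalent sketch:

```python
# Count rows where a zero is not a physiologically plausible value
for col in ['BloodPressure', 'Glucose', 'SkinThickness', 'Insulin', 'BMI']:
    print(f"Rows with zero {col}: {df[df[col] == 0].shape[0]}")
```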

Data Visualization:
Plotting the Outcome Distribution:
The line of code `sns.countplot(x='Outcome', data=df)` uses the seaborn library to create a
count plot that visualizes the frequency distribution of the 'Outcome' column in the DataFrame
`df`. In this context, `sns` is an alias for the seaborn library, which is widely used for creating
attractive and informative statistical graphics. The `countplot` function generates a bar plot showing the count of observations for each unique value of the specified categorical variable, 'Outcome' in this case, making it particularly useful for visualizing the distribution of categorical data.
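For reference (`plt.show()` added here to render the plot outside a notebook):

```python
sns.countplot(x='Outcome', data=df)  # bar of counts for each Outcome value
plt.show()
```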
Forming the Histograms:
The lines of code `df.hist(bins=10, figsize=(10, 10))` followed by `plt.show()` generate
histograms for each feature (column) in the DataFrame `df` and then display them. Here is a
detailed explanation of what each part of the code does:

df.hist: This method is a quick way to generate histograms for each numeric column in the
DataFrame `df`. A histogram is a type of bar plot that represents the distribution of a dataset
by showing the frequency of data points that fall within certain ranges, or bins. Each numeric
feature in the DataFrame will have its histogram plotted, allowing for a visual inspection of the
distribution of values within that feature.

bins=10: This parameter specifies the number of bins (intervals) to use for each histogram.
In this case, the range of values for each feature is divided into 10 equal-width bins. The choice
of the number of bins can affect the granularity and readability of the histogram; fewer bins
might oversimplify the data distribution, while too many bins might overcomplicate it.

figsize=(10, 10) : This parameter sets the size of the entire figure (the collection of
histograms) to 10 inches by 10 inches. Adjusting the figure size ensures that the histograms are
readable and well-spaced, which is especially important when dealing with multiple subplots.

plt.show() : This function from the matplotlib library displays the plotted figure. Without this line, the histograms might not be rendered in some environments, particularly when the code runs as a plain Python script rather than interactively in a Jupyter notebook.
When these lines are executed, a grid of histograms is created, with each histogram
corresponding to one numeric feature in the DataFrame. This visualization allows for an
immediate assessment of the data distribution for each feature, highlighting aspects such as
skewness, spread, and the presence of outliers. Understanding the distribution of each feature
is crucial for various stages of data analysis and preprocessing, such as normalization, handling
outliers, and selecting appropriate machine learning algorithms.
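For reference, the two lines discussed:

```python
df.hist(bins=10, figsize=(10, 10))  # one histogram per numeric column
plt.show()
```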
Data Plotting:
The provided lines of code create a figure with multiple subplots, each displaying a box plot for a specific feature (column) of the DataFrame `df`. Here is a detailed explanation of what each part of the code accomplishes:
plt.figure(figsize=(16,12)): This line initializes a new figure with a specified size of 16
inches in width and 12 inches in height. This size ensures that the overall figure is large enough
to accommodate multiple subplots without them being cramped.

sns.set_style(style='whitegrid'): This seaborn function sets the aesthetic style of the plots. In this case, the `whitegrid` style is chosen, which adds grid lines to the background of the plots for better readability without overshadowing the plotted data.
plt.subplot(3,3,1) to plt.subplot(3,3,8): These lines define the layout of the subplots
within the figure. The `plt.subplot` function divides the figure into a 3x3 grid (3 rows and 3
columns). The third argument in each subplot call specifies the position of the subplot in the
grid. For example, `(3,3,1)` corresponds to the first subplot in the first row, `(3,3,2)` to the
second subplot in the first row, and so on up to `(3,3,8)` which is the eighth subplot in the grid.

sns.boxplot(x='FeatureName', data=df): These lines use seaborn's `boxplot` function to create a box plot for each specified feature (`Glucose`, `BloodPressure`, `Insulin`, `BMI`, `Age`, `SkinThickness`, `Pregnancies`, `DiabetesPedigreeFunction`). A box plot is a
graphical summary of the distribution of numerical data through quartiles. The box represents
the interquartile range (IQR) between the first (Q1) and third (Q3) quartiles, with a line inside
marking the median (Q2). Whiskers extend from the box to the minimum and maximum values
within 1.5 times the IQR from the quartiles, while outliers beyond this range are plotted
individually.
Each subplot displays a box plot for a specific feature, providing insights into its distribution,
central tendency, and presence of outliers. These visualizations are particularly useful for
identifying potential data anomalies, understanding the spread of values, and making informed
decisions about data preprocessing steps such as outlier removal or feature scaling before
applying machine learning algorithms. The structured layout of subplots helps in comparing
different features side by side, facilitating a comprehensive analysis of the dataset's numeric
attributes.
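The report writes out eight separate `plt.subplot`/`sns.boxplot` calls; a compact, loop-based sketch of the same figure:

```python
plt.figure(figsize=(16, 12))
sns.set_style(style='whitegrid')

features = ['Glucose', 'BloodPressure', 'Insulin', 'BMI', 'Age',
            'SkinThickness', 'Pregnancies', 'DiabetesPedigreeFunction']
for i, feature in enumerate(features, start=1):
    plt.subplot(3, 3, i)             # i-th cell of the 3x3 grid
    sns.boxplot(x=feature, data=df)  # box plot of this feature's distribution
plt.show()
```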
K-Nearest Neighbors:
These lines of code demonstrate the process of tuning hyperparameters for a K-Nearest
Neighbors (KNN) classifier using GridSearchCV from scikit-learn. Initially, a KNeighborsClassifier
is instantiated without specifying any hyperparameters. A range of hyperparameters to tune is
defined, including `n_neighbors` (number of neighbors), `p` (power parameter for Minkowski
distance), `weights` (weight function used in prediction), and `metric` (distance metric). These
parameters are organized into a dictionary named `hyperparameters`.
Next, RepeatedStratifiedKFold cross-validation is configured (`cv`) to ensure robust evaluation
of the model's performance across different splits of the dataset. GridSearchCV is then
employed to exhaustively search through the specified hyperparameter space
(`param_grid=hyperparameters`), using F1 score (`scoring='f1'`) as the evaluation metric. The
best performing model is selected based on the highest F1 score across all cross-validation
folds and repetitions (`best_model = grid_search.fit(X_train, y_train)`). Finally, the best
hyperparameters are printed out, providing insights into the optimal configuration for the KNN
classifier on the given dataset (`best_model.best_estimator_.get_params()` retrieves these
values). This methodical approach helps in fine-tuning the model to achieve better predictive
performance and generalizability.
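A sketch of the tuning pipeline described above. The exact candidate values, the cross-validation split and repeat counts, and the train/test split are not given in the report, so those details below are illustrative assumptions:

```python
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Assumed train/test split on the cleaned dataset
X = df.drop(columns='Outcome')
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

knn = KNeighborsClassifier()

# Hyperparameter grid; the specific candidate values are assumptions
hyperparameters = {
    'n_neighbors': list(range(1, 31)),
    'p': [1, 2],                         # Minkowski power: 1 = Manhattan, 2 = Euclidean
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski'],
}

# Repeated stratified k-fold cross-validation for robust performance estimates
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid_search = GridSearchCV(knn, param_grid=hyperparameters,
                           scoring='f1', cv=cv)
best_model = grid_search.fit(X_train, y_train)
print(best_model.best_estimator_.get_params())  # best hyperparameter values
```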
Conclusion:
In conclusion, this project has demonstrated the application of machine learning techniques in
predicting diabetes based on a comprehensive dataset of health metrics. Through exploratory
data analysis, we identified significant insights into the distribution and relationships among
various physiological attributes and diabetes outcomes. Histograms and box plots revealed the spread of the data and flagged potential outliers or anomalies,
informing our preprocessing strategies.
We implemented and evaluated multiple machine learning models including K-Nearest
Neighbors (KNN), Decision Trees, and Logistic Regression. These models were assessed using
performance metrics like accuracy, precision, recall, and F1-score, with KNN showing promising
results in predicting diabetes onset. The findings highlight the potential of machine learning in
healthcare for early disease detection and management, emphasizing the importance of data-
driven approaches in improving diagnostic accuracy and patient outcomes. Future work could
explore more advanced algorithms, incorporate additional data sources, and focus on
integrating predictive models into clinical practice to enhance diabetes prevention and
treatment strategies effectively.
