Classification
Submitted By: Anubhav Rai
Submitted To: Dr. Gopa Bhaumik
1. Introduction
Diabetes is a chronic and life-threatening metabolic disorder that affects millions of
individuals worldwide, with increasing prevalence in both developed and developing
nations. Characterized by elevated blood glucose levels, diabetes can lead to serious
health complications such as heart disease, kidney failure, vision loss, and nerve
damage if not diagnosed and managed in a timely manner. Early prediction and
effective classification of individuals at risk of developing diabetes are crucial steps in
promoting public health and reducing the long-term burden on healthcare systems.
The dataset utilized for this study is sourced from the open-source YBI Foundation
repository, containing anonymized patient health records specifically designed for
machine learning applications. The dataset can be accessed at the following link:
Through this project, we aim not only to develop a reliable classification model but
also to gain insights into the relationships between various health attributes and their
influence on diabetes prediction. The model's performance is assessed using
standard classification metrics, ensuring a balanced evaluation of its accuracy,
precision, recall, and overall effectiveness in identifying diabetic cases.
2. Model Selection and Training
In the field of supervised machine learning, selecting an appropriate algorithm is a
critical step that directly influences the performance and interpretability of the
predictive model. For this project, Logistic Regression was chosen as the primary
classification algorithm due to its simplicity, efficiency, and effectiveness in handling
binary classification problems, where the target variable has only two possible
outcomes — in this case, diabetic or non-diabetic.
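For reference, logistic regression models the probability that a patient is diabetic by passing a weighted sum of the input features through the sigmoid (logistic) function:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}$$

A predicted probability above 0.5 is mapped to the diabetic class and below 0.5 to the non-diabetic class (0.5 is the Scikit-learn default threshold, not a value stated in this report).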
Once the model was selected, the following steps were undertaken during the
training process:
Data Preprocessing:
The raw dataset was cleaned, encoded, and scaled to ensure uniformity and
suitability for model training.
Train-Test Split:
The dataset was divided into two subsets — 80% for training the model and
20% for testing its predictive capability on unseen data.
Feature Scaling:
Standardization techniques such as StandardScaler were applied to normalize
feature values, enhancing the model’s performance and convergence speed.
Model Training:
The Logistic Regression model was trained on the prepared training dataset.
During training, the algorithm iteratively optimized the model parameters
(coefficients) to minimize the classification error and improve predictive
accuracy.
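As an illustration of the scaling and training steps above, a minimal Scikit-learn sketch could look as follows; the variable names X_train, X_test, and y_train are assumed placeholders for the prepared feature matrix and labels from the train-test split, not code copied from the project.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize features to zero mean and unit variance; fit the scaler on the
# training data only and reuse it for the test data to avoid information leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression classifier on the scaled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)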
The trained model was then evaluated on the test dataset using established
performance metrics to assess its accuracy, precision, recall, and F1-score. This
systematic approach ensures that the model is both reliable and generalizable for
practical use in diabetes risk prediction.
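A minimal sketch of this evaluation step, assuming the trained model and scaled test set from the previous sketch, is shown below.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict class labels for the held-out test set.
y_pred = model.predict(X_test_scaled)

# Standard binary-classification metrics.
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))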
3. Data Overview
A crucial step before training any machine learning model is to understand the
structure, distribution, and relationships within the dataset. In this project,
exploratory data analysis (EDA) was performed to examine the variables, their values,
and how they interact with each other. Below is an overview of the dataset through
visual and tabular representations:
a. Patient Health Records (Table)
The dataset records several medical attributes for each patient, including Pregnancies, Glucose, BMI, and Age, along with the diabetes outcome label. The preview table highlights the variation in patient health metrics across these attributes.
b. Distribution of Diabetes Cases (Pie Chart)
The pie chart illustrates the distribution of diabetic versus non-diabetic cases in the dataset.
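A class-distribution chart of this kind can be produced with pandas and Matplotlib, for example as in the sketch below; the DataFrame name df and the column name 'Outcome' are assumptions about the dataset rather than code taken from the report.

import matplotlib.pyplot as plt

# Count non-diabetic (0) and diabetic (1) records; 'Outcome' is an assumed column name.
counts = df['Outcome'].value_counts().sort_index()

# Plot the class distribution as a pie chart.
counts.plot.pie(labels=['Non-diabetic', 'Diabetic'], autopct='%1.1f%%')
plt.title('Distribution of Diabetes Cases')
plt.ylabel('')
plt.show()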
c. Correlation of Features with the Outcome
Glucose (0.47) and BMI (0.29) show a relatively strong positive correlation with the diabetes outcome.
Age (0.24) and Pregnancies (0.22) also have noticeable correlations with
diabetes.
This visualization helps identify which features might be most influential in predicting
the presence of diabetes. In this case, glucose and BMI appear to be the most
impactful predictors.
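The correlation values quoted above can be reproduced directly from the DataFrame, as sketched below; df and the 'Outcome' column are again assumed names.

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation of each feature with the diabetes outcome, strongest first.
correlations = df.corr()['Outcome'].sort_values(ascending=False)
print(correlations)

# Optional: visualize the full correlation matrix as a heatmap.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()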
4. Train-Test Split
In machine learning workflows, it is essential to evaluate the model’s performance on
unseen data to assess its generalizability and predictive accuracy. To achieve this, the
original dataset is divided into two distinct subsets:
Training Set: Used to train the machine learning model by allowing it to learn
patterns and relationships from the input data.
Test Set: Used to evaluate the model's performance on new, unseen data to
check how well it generalizes.
For this project, the dataset was split using an 8:2 ratio, where:
80% of the data (614 records) was allocated for training the Logistic Regression
model.
20% of the data (154 records) was reserved for testing the trained model’s
performance.
The split was performed using the train_test_split() function from the Scikit-learn
library, ensuring randomness in the selection process while maintaining
reproducibility by specifying a random state value of 30.
This step is crucial in preventing overfitting, where a model performs well on the
training data but poorly on new, unseen data. By evaluating on the test set, it is
possible to gauge the model's real-world effectiveness and reliability before
deploying it for practical use.
Code Reference:
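A sketch consistent with the split described above (20% of the data held out for testing, random state 30) is shown below; the variable names X and y for the feature matrix and target column are assumptions.

from sklearn.model_selection import train_test_split

# X holds the feature columns, y the diabetes outcome label (assumed names).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30
)

print(X_train.shape)  # expected to contain 614 records
print(X_test.shape)   # expected to contain 154 records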
This ensures that the model has a balanced opportunity to learn from a significant
portion of the data while being fairly evaluated on a separate set it has never
encountered during training.
5. Results and Comparison
After successfully training the Logistic Regression model on the prepared dataset, the
model was evaluated on the test set using standard performance metrics widely
employed in binary classification problems. These metrics provide a comprehensive
view of the model’s predictive capability, particularly in the context of healthcare-
related classification where both false positives and false negatives can have
significant implications.
Metric     Score
Accuracy   0.79
Recall     0.74
F1-Score   0.75
Interpretation:
The model achieved an accuracy of 79%, meaning it correctly classified
approximately four out of five test instances.
A recall of 74% indicates that the model successfully identified 74% of all
actual diabetic cases within the test set.
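Because missed diabetic cases (false negatives) are the most costly error in this setting, the recall figure can be examined through the confusion matrix; a brief sketch, assuming the same y_test and y_pred as in the earlier evaluation sketch, is given below.

from sklearn.metrics import confusion_matrix

# Rows are actual classes (0 = non-diabetic, 1 = diabetic), columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Recall = TP / (TP + FN): the share of actual diabetic cases the model identified.
print("Recall:", tp / (tp + fn))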