Classification

Uploaded by

raisahab2199
AI-Project

Submitted By:
Anubhav Rai
2023PGCSCA004
MCA Sem IV

Submitted To:
Dr. Gopa Bhaumik
Department of CSE
NIT Jamshedpur

TOPIC: DIABETES PREDICTION AND CLASSIFICATION BASED ON LOGISTIC REGRESSION
Diabetes Detection

1. Introduction
Diabetes is a chronic and life-threatening metabolic disorder that affects millions of
individuals worldwide, with increasing prevalence in both developed and developing
nations. Characterized by elevated blood glucose levels, diabetes can lead to serious
health complications such as heart disease, kidney failure, vision loss, and nerve
damage if not diagnosed and managed in a timely manner. Early prediction and
effective classification of individuals at risk of developing diabetes are crucial steps in
promoting public health and reducing the long-term burden on healthcare systems.

This project aims to develop a machine learning-based predictive model that
leverages Logistic Regression, a widely used statistical method for binary
classification problems, to estimate the likelihood of an individual being
diabetic from a set of medical attributes. Using patient health indicators such as
glucose level, blood pressure, insulin level, and BMI, the model classifies each
individual as either diabetic or non-diabetic.

The dataset utilized for this study is sourced from the open-source YBI Foundation
repository, containing anonymized patient health records specifically designed for
machine learning applications. The dataset can be accessed at the following link:

Data Link: https://github.com/YBIFoundation/Dataset/raw/main/Diabetes.csv

Through this project, we aim not only to develop a reliable classification model but
also to gain insights into the relationships between various health attributes and their
influence on diabetes prediction. The model's performance is assessed using
standard classification metrics, ensuring a balanced evaluation of its accuracy,
precision, recall, and overall effectiveness in identifying diabetic cases.
2. Model Selection and Training
In the field of supervised machine learning, selecting an appropriate algorithm is a
critical step that directly influences the performance and interpretability of the
predictive model. For this project, Logistic Regression was chosen as the primary
classification algorithm due to its simplicity, efficiency, and effectiveness in handling
binary classification problems, where the target variable has only two possible
outcomes — in this case, diabetic or non-diabetic.

Logistic Regression is a statistical technique that models the probability of a
categorical dependent variable based on one or more independent variables. It
operates by applying a logistic (sigmoid) function to predict the probability of an
outcome, ensuring that the output values remain between 0 and 1. This makes it
particularly suitable for medical diagnosis scenarios where outcomes are often
dichotomous.
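As a minimal sketch of the sigmoid mapping described above: the weights, bias, and feature values below are hypothetical, chosen only for illustration, and are not taken from the trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical standardized inputs and coefficients, for illustration only.
example_features = [1.2, -0.3]
example_weights = [0.9, 0.4]
example_bias = -0.5

probability = predict_probability(example_features, example_weights, example_bias)
# A 0.5 cut-off turns the probability into a class label (1 = diabetic).
predicted_label = 1 if probability >= 0.5 else 0
```

The 0.5 threshold is a common default; a different cut-off trades precision against recall.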

Once the model was selected, the following steps were undertaken during the
training process:

 Data Preprocessing:
The raw dataset was cleaned, encoded, and scaled to ensure uniformity and
suitability for model training.

 Train-Test Split:
The dataset was divided into two subsets — 80% for training the model and
20% for testing its predictive capability on unseen data.

 Feature Scaling:
Standardization with scikit-learn's StandardScaler was applied to rescale each
feature to zero mean and unit variance, which helps the optimizer converge faster.

 Model Training:
The Logistic Regression model was trained on the prepared training dataset.
During training, the algorithm iteratively adjusted the model parameters
(coefficients and intercept) to minimize the log-loss on the training data.

The trained model was then evaluated on the test dataset using established
performance metrics to assess its accuracy, precision, recall, and F1-score. This
systematic approach ensures that the model is both reliable and generalizable for
practical use in diabetes risk prediction.
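The training steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: synthetic data from make_classification stands in for the diabetes dataset, max_iter=1000 is an arbitrary choice, and wrapping the scaler and classifier in a pipeline (so the scaler is fitted on the training portion only) is one reasonable arrangement rather than the documented one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 samples with 8 numeric features and a binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# 80/20 split with the random state quoted in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30
)

# The pipeline standardizes features, then fits the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
```

Because the data here is synthetic, test_accuracy says nothing about the diabetes model; only the workflow carries over.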
3. Data Overview
A crucial step before training any machine learning model is to understand the
structure, distribution, and relationships within the dataset. In this project,
exploratory data analysis (EDA) was performed to examine the variables, their values,
and how they interact with each other. Below is an overview of the dataset through
visual and tabular representations:

a. Dataset Sample (First Five Records)

The dataset consists of the following medical attributes for each patient:

 pregnancies: Number of times the patient has been pregnant

 glucose: Plasma glucose concentration

 diastolic: Diastolic blood pressure (mm Hg)

 triceps: Skinfold thickness (mm)

 insulin: 2-hour serum insulin (μU/mL)

 bmi: Body Mass Index (weight in kg/(height in m)^2)

 dpf: Diabetes Pedigree Function (a function which scores likelihood of
diabetes based on family history)

 age: Age of the patient in years

This table offers a preview of the numerical values within the dataset, highlighting
the variation in patient health metrics.
b. Distribution of Diabetes Cases (Pie Chart)

The pie chart illustrates the distribution of diabetic versus non-diabetic cases in the
dataset:

 65.10% (Blue section) of patients do not have diabetes (label 0)

 34.90% (Green section) of patients are diabetic (label 1)

This distribution indicates a moderately imbalanced dataset, where non-diabetic
cases are more prevalent than diabetic cases. It is important for classification
models to handle such imbalances to avoid biased predictions.
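A short sketch of how the class shares can be computed. The label counts (500 and 268 out of 768) are an assumption inferred from the reported percentages and the 614 + 154 record total, not values read from the file.

```python
from collections import Counter

# Assumed label counts: 500 non-diabetic (0) and 268 diabetic (1),
# inferred from the reported percentages rather than read from the dataset.
labels = [0] * 500 + [1] * 268

counts = Counter(labels)
total = sum(counts.values())
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total
```

Under these assumed counts, always predicting "non-diabetic" would already be right about 65% of the time, so a model's accuracy is best read against that baseline.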
c. Correlation Heatmap
The heatmap displays the Pearson correlation coefficients between all pairs of
features and the target variable:

 Glucose (0.47) and BMI (0.29) show the strongest positive correlations
with the diabetes outcome.

 Age (0.24) and Pregnancies (0.22) also have noticeable correlations with
diabetes.

 The diagonal values represent perfect correlation (value of 1.0) of a feature
with itself.

This visualization helps identify which features might be most influential in predicting
the presence of diabetes. In this case, glucose and BMI appear to be the most
impactful predictors.
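For reference, the Pearson coefficient shown in each heatmap cell can be computed as sketched below. The glucose and outcome values are made-up toy numbers for illustration, not rows from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values, hypothetical and not taken from the dataset.
glucose = [85, 168, 122, 99, 140, 110]
outcome = [0, 1, 0, 0, 1, 1]

r = pearson(glucose, outcome)
```

Correlating a feature with itself returns 1.0, which is why the heatmap diagonal is all ones.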
4. Train-Test Split
In machine learning workflows, it is essential to evaluate the model’s performance on
unseen data to assess its generalizability and predictive accuracy. To achieve this, the
original dataset is divided into two distinct subsets:

 Training Set: Used to train the machine learning model by allowing it to learn
patterns and relationships from the input data.

 Test Set: Used to evaluate the model's performance on new, unseen data to
check how well it generalizes.

For this project, the dataset was split using an 8:2 ratio, where:

 80% of the data (614 records) was allocated for training the Logistic Regression
model.

 20% of the data (154 records) was reserved for testing the trained model’s
performance.

The split was performed using the train_test_split() function from the Scikit-learn
library, ensuring randomness in the selection process while maintaining
reproducibility by specifying a random state value of 30.

This step is crucial for detecting overfitting, where a model performs well on the
training data but poorly on new, unseen data. Evaluating on the held-out test set
makes it possible to gauge the model's real-world effectiveness and reliability
before deploying it for practical use.

Code Reference:
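A minimal sketch of the split described above, assuming scikit-learn is installed. The integer list stands in for the dataset rows, and the variable names are illustrative rather than the project's own.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 768 dataset rows; the real features and labels would go here.
row_ids = list(range(768))

# 20% of the rows are held out for testing; random_state fixes the shuffle.
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=30)

# Repeating the call with the same random_state reproduces the same partition.
train_again, test_again = train_test_split(row_ids, test_size=0.2, random_state=30)
```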

This ensures that the model has a balanced opportunity to learn from a significant
portion of the data while being fairly evaluated on a separate set it has never
encountered during training.
5. Results and Comparison
After successfully training the Logistic Regression model on the prepared dataset, the
model was evaluated on the test set using standard performance metrics widely
employed in binary classification problems. These metrics provide a comprehensive
view of the model’s predictive capability, particularly in the context of healthcare-
related classification where both false positives and false negatives can have
significant implications.

The following evaluation metrics were computed:

 Accuracy: The ratio of correctly predicted observations to the total
observations.

 Precision: The ratio of correctly predicted positive observations to the total
predicted positive observations, indicating how precise the model is when it
predicts a patient has diabetes.

 Recall (Sensitivity): The ratio of correctly predicted positive observations to all
actual positive cases, reflecting the model’s ability to detect diabetic
patients.

 F1-Score: The harmonic mean of precision and recall, providing a balanced
measure between the two.

 Model Performance Results:

Metric       Value
Accuracy     0.79
Precision    0.76
Recall       0.74
F1-Score     0.75

 Interpretation:
 The model achieved an accuracy of 79%, meaning it correctly classified
approximately four out of five test instances.

 A precision of 76% suggests that when the model predicts a patient as
diabetic, it is correct 76% of the time.

 A recall of 74% indicates that the model successfully identified 74% of all
actual diabetic cases within the test set.

 The F1-score of 0.75 confirms a balanced performance between precision and
recall, which is particularly important in medical predictions where both false
positives and false negatives carry consequences.
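The metric definitions above can be written out directly from confusion-matrix counts, as in the sketch below. The counts passed to classification_metrics are hypothetical and are not the project's confusion matrix; only the last line uses reported figures, to check that the stated F1-score follows from the stated precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, fn=4, tn=6)

# Consistency check on the reported precision (0.76) and recall (0.74).
reported_f1 = f1_from(0.76, 0.74)
```

Rounded to two decimals, reported_f1 agrees with the 0.75 in the results table.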
