# Diabetes Prediction Report
#### Objective
The primary goal of this study is to develop and evaluate machine
learning models capable of predicting diabetes in individuals based
on diagnostic features. The dataset used for this purpose originates
from the National Institute of Diabetes and Digestive and Kidney
Diseases and focuses on Pima Indian women aged 21 and older.
Diabetes, being a major metabolic disorder, demands early detection
and management to mitigate severe complications, making predictive
models crucial in healthcare.
| Feature                  | Description                                                                        |
|--------------------------|------------------------------------------------------------------------------------|
| Pregnancies              | Number of times the patient has been pregnant                                      |
| Glucose                  | Plasma glucose concentration (mg/dL) during a 2-hour oral glucose tolerance test   |
| BloodPressure            | Diastolic blood pressure (mm Hg)                                                   |
| SkinThickness            | Triceps skinfold thickness (mm)                                                    |
| Insulin                  | 2-hour serum insulin (mu U/ml)                                                     |
| BMI                      | Body mass index (weight in kg / (height in m)^2)                                   |
| DiabetesPedigreeFunction | A function representing diabetes history in the family                             |
| Age                      | Patient's age in years                                                             |
| Outcome                  | Class variable (0 = no diabetes, 1 = diabetes)                                     |
**Key Observations:**
- The dataset encodes missing measurements as zeros in critical features
  such as `Glucose`, `BloodPressure`, `SkinThickness`, `BMI`, and
  `Insulin`. Since a value of zero is physiologically implausible for
  these features, such entries were treated as missing data.
- The distribution of the target variable (`Outcome`) revealed an
imbalance with 65% non-diabetic cases (Outcome = 0) and 35%
diabetic cases (Outcome = 1).
- Correlation analysis identified strong positive relationships between
`Glucose`, `BMI`, and `Outcome`, highlighting their predictive
importance.
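The zero-as-missing treatment above can be sketched in pandas as follows. The miniature `DataFrame` here is a hypothetical stand-in for the NIDDK Pima dataset (the real file and its loading path are not shown in this report):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Pima dataset used only for illustration.
df = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin":       [0, 0, 0, 94],
    "BMI":           [33.6, 26.6, 23.3, 28.1],
    "Outcome":       [1, 0, 1, 0],
})

# Zeros in these features are physiologically implausible, so replace
# them with NaN and treat them as missing data.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

print(df.isna().sum())                           # missing counts per feature
print(df["Outcome"].value_counts(normalize=True))  # class balance check
```

After this step the missing values can be imputed (e.g. with per-class medians) before modeling.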
**Models Evaluated:**
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree Classifier (CART)
- Random Forest Classifier
- XGBoost
- LightGBM
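A baseline comparison of these models can be sketched with scikit-learn's uniform estimator API. The snippet below covers the scikit-learn models from the list (XGBoost and LightGBM follow the same `fit`/`predict` interface via their own packages); the synthetic data is an assumption standing in for the preprocessed Pima features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset, mimicking the 8
# features and the 65/35 class imbalance described above.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "CART": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each baseline model.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
          for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name:20s} {acc:.4f}")
```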
**Results of Baseline Models:**
| Model               | Accuracy | Precision | Recall | F1 Score | AUC-ROC |
|---------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | 0.7674   | 0.74      | 0.68   | 0.71     | 0.84    |
| Random Forest       | 0.8472   | 0.83      | 0.78   | 0.80     | 0.90    |
| XGBoost             | 0.8703   | 0.85      | 0.80   | 0.82     | 0.92    |
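The five metrics in the table can be computed with scikit-learn as sketched below. The data and classifier here are illustrative assumptions, not a reproduction of the reported numbers; note that AUC-ROC is scored on predicted probabilities rather than hard class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Illustrative synthetic data with the same 65/35 imbalance.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # AUC-ROC uses probabilities

metrics = {
    "Accuracy":  accuracy_score(y_test, pred),
    "Precision": precision_score(y_test, pred),
    "Recall":    recall_score(y_test, pred),
    "F1 Score":  f1_score(y_test, pred),
    "AUC-ROC":   roc_auc_score(y_test, proba),
}
for name, value in metrics.items():
    print(f"{name:10s} {value:.4f}")
```

With stratification on `train_test_split`, the test set preserves the class imbalance, which keeps precision and recall comparable across models.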
#### Conclusions