
Name: Syed Naweed S N

Reg.no: 22BIT0061

Digital assignment - 1

Comparative Analysis of Machine Learning Models for Diabetes Prediction Using the Pima Indians Diabetes Dataset
Introduction: (Why I Chose the Diabetes Dataset)

I have chosen the Pima Indians Diabetes Dataset because it is one of the most
widely used datasets for predicting the presence of diabetes. This dataset, originally
from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK),
consists of 768 samples with 8 numerical features that provide essential medical
information related to diabetes risk factors.

The features include:

- Pregnancies (number of times pregnant)
- Glucose (plasma glucose concentration)
- Blood Pressure (diastolic blood pressure)
- Skin Thickness (triceps skin fold thickness)
- Insulin (2-hour serum insulin)
- BMI (body mass index)
- Diabetes Pedigree Function (family history of diabetes)
- Age

The target variable is binary, where 1 represents the presence of diabetes, and 0
represents its absence.

Why This Dataset?

1. Medical Significance: Diabetes is a major global health concern, and machine
learning models trained on this dataset can help in early diagnosis and risk
assessment.
2. Structured & Standardized Features: The dataset is well-structured and
contains measurable medical attributes, making it ideal for classification
algorithms such as Logistic Regression, k-Nearest Neighbors (k-NN),
Decision Trees, Random Forests, and Support Vector Machines (SVMs).
3. Widely Used in Research: This dataset has been extensively used in machine
learning and healthcare studies, making it a benchmark for evaluating
classification models.
4. Binary Classification Problem: The dataset allows us to compare different
classification models in terms of accuracy, precision, recall, and F1-score,
helping determine the most effective algorithm for diabetes prediction.

By applying machine learning models to this dataset, we aim to analyze their
performance and identify the most reliable model for predicting diabetes,
contributing to better healthcare decision-making.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
           "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, names=columns)

X = df.drop(columns=['Outcome'])
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Support Vector Machine": SVC(kernel='linear')
}
results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-Score": f1_score(y_test, y_pred)
    }
    print(f"\n🔹 {name} 🔹")
    print(classification_report(y_test, y_pred))

results_df = pd.DataFrame(results).T
print("\nPerformance Comparison:\n", results_df)

best_model = results_df['Accuracy'].idxmax()
best_accuracy = results_df['Accuracy'].max()
print(f"\n🏆 The best performing model is **{best_model}** with an accuracy of **{best_accuracy:.4f}**.")

colors = ["#3498DB", "#2ECC71", "#F1C40F", "#E74C3C", "#9B59B6"]
# results_df.plot creates its own figure, so a separate plt.figure call is not needed
ax = results_df.plot(kind='bar', figsize=(12, 7), width=0.8, alpha=0.85,
                     edgecolor='black', color=colors)
plt.title("Comparison of Classification Models on Diabetes Dataset", fontsize=14,
          fontweight='bold')
plt.ylabel("Score", fontsize=12)
plt.xticks(rotation=45, fontsize=11)
plt.yticks(fontsize=11)
plt.legend(loc='lower right', fontsize=11)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Output:
My F1-score is likely low due to class imbalance, which causes the model to favor the
majority class and reduces precision or recall for the minority class. Improper feature
scaling or an unoptimized classification threshold could also contribute. To improve
it, I can balance the dataset using SMOTE, fine-tune the decision threshold, and
ensure proper feature preprocessing.
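
For example, the decision threshold could be tuned with a small sketch like the one below. It reuses X_train, X_test, y_train, y_test from the script above and, for simplicity, picks the cutoff that maximizes F1 on the test set; a separate validation split would be more rigorous.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Fit a logistic regression as in the code section and get class-1 probabilities
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Scan candidate thresholds and keep the one with the best F1-score
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(f"Best threshold: {best_t:.2f}, F1-score: {max(f1s):.4f}")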

I attempted to use SMOTE to balance the dataset, but I encountered multiple issues
during implementation. The errors were mainly due to preprocessing conflicts and
compatibility with certain models after oversampling. I need to further debug these
issues to ensure that the synthetic samples do not introduce biases or affect the
model's performance incorrectly. For now, I proceeded with the original dataset to
analyze the model’s behavior on real-world imbalanced data.
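
For reference, the SMOTE setup I intend to debug looks roughly like the sketch below. It assumes the imbalanced-learn package is installed and reuses the scaled training split from the script above; oversampling is applied to the training data only so the test set stays untouched.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Generate synthetic minority-class samples on the (scaled) training split only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, rf.predict(X_test)))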
Classification Algorithms and Pseudocode:

1. Logistic Regression
Explanation:
Logistic Regression is a statistical model used for binary classification. It predicts the
probability that a given input belongs to one of the two classes using a sigmoid
function. The model finds the best-fitting line that separates the classes by
minimizing the error using optimization techniques like Gradient Descent.

Pseudocode:

1. Initialize weights (w) and bias (b) to small random values.
2. Repeat until convergence:
a. Compute the weighted sum: Z = wX + b
b. Apply the sigmoid function: Y_pred = 1 / (1 + e^(-Z))
c. Compute the loss using Binary Cross-Entropy:
Loss = - [Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred)]
d. Compute the gradients:
dw = (1/m) * Σ [(Y_pred - Y) * X]
db = (1/m) * Σ (Y_pred - Y)
e. Update the weights and bias using Gradient Descent:
w = w - learning_rate * dw
b = b - learning_rate * db
3. Return the optimized weights and bias.
4. Use the model to make predictions:
If Y_pred ≥ 0.5, classify as 1; else classify as 0.
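
A compact NumPy translation of this pseudocode is sketched below (illustrative only, with an assumed learning rate and iteration count; the code section itself uses scikit-learn's LogisticRegression):

import numpy as np

def train_logistic_regression(X, Y, learning_rate=0.1, n_iters=1000):
    m, n = X.shape
    w = np.zeros(n)   # weights
    b = 0.0           # bias
    for _ in range(n_iters):
        Z = X @ w + b                         # weighted sum
        Y_pred = 1.0 / (1.0 + np.exp(-Z))     # sigmoid
        dw = (1 / m) * (X.T @ (Y_pred - Y))   # gradient w.r.t. weights
        db = (1 / m) * np.sum(Y_pred - Y)     # gradient w.r.t. bias
        w -= learning_rate * dw               # gradient descent update
        b -= learning_rate * db
    return w, b

def predict(X, w, b):
    Y_pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (Y_pred >= 0.5).astype(int)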

2. k-Nearest Neighbors (k-NN)


Explanation:
k-NN is a non-parametric, instance-based learning algorithm used for classification. It
classifies a new data point based on the majority class of its nearest neighbors in the
feature space. The distance is typically measured using Euclidean distance.

Pseudocode:

1. Choose the number of neighbors (k).
2. For each test data point:
a. Compute the distance between the test point and all training points.
b. Sort the distances in ascending order.
c. Select the k nearest neighbors.
d. Count the occurrences of each class in the k neighbors.
e. Assign the test point to the class with the highest count.
3. Return the predicted class labels.
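
Following these steps, a minimal NumPy version (Euclidean distance, majority vote; the code section uses scikit-learn's KNeighborsClassifier instead) could look like:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5):
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_test:
        # a. Distance from the test point to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # b/c. Indices of the k nearest neighbors
        nearest = np.argsort(distances)[:k]
        # d/e. Majority vote among their labels
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(predictions)
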
3. Decision Tree Classifier
Explanation:
A Decision Tree is a hierarchical model that splits the dataset based on the most
significant feature at each node. The splits are made to maximize information gain or
minimize Gini impurity. The tree continues growing until a stopping criterion is met.

Pseudocode:

1. Select the best feature to split the dataset using Information Gain or Gini Impurity.
2. Split the dataset into subsets based on the selected feature.
3. Repeat recursively for each subset:
a. If all data points in a subset belong to the same class, create a leaf node.
b. Otherwise, continue splitting until the stopping condition is met.
4. Assign class labels to leaf nodes.
5. For a new test instance:
a. Traverse the tree from the root node.
b. Follow the decision rules at each node until a leaf node is reached.
c. Assign the label of the leaf node as the prediction.
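
To make the splitting criterion concrete, the Gini impurity of a candidate split can be computed with a small helper like this (illustrative only; the recursion and the search over thresholds are omitted):

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_k^2)
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(feature_values, labels, threshold):
    # Weighted impurity after splitting on feature <= threshold;
    # the best split is the one that minimizes this value
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)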

4. Random Forest Classifier


Explanation:
Random Forest is an ensemble learning method that builds multiple Decision Trees
and combines their outputs to improve accuracy and reduce overfitting. Each tree is
trained on a random subset of the data, and the final prediction is determined by
majority voting.

Pseudocode:

1. Select the number of trees (N).
2. For each tree:
a. Randomly sample data points (with replacement) from the training set.
b. Randomly select a subset of features for each split.
c. Grow a Decision Tree until the stopping condition is met.
3. For a new test instance:
a. Pass the instance through each Decision Tree to get predictions.
b. Use majority voting to determine the final class.
4. Return the predicted class label.
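
The bagging-and-voting idea can be sketched with scikit-learn's DecisionTreeClassifier as the base learner (a toy illustration; RandomForestClassifier, used in the code section, handles all of this internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, X_new, n_trees=100, random_state=42):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    votes = []
    for _ in range(n_trees):
        # a. Bootstrap sample: draw rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # b. max_features='sqrt' picks a random feature subset at each split
        tree = DecisionTreeClassifier(max_features='sqrt')
        tree.fit(X[idx], y[idx])
        votes.append(tree.predict(X_new))
    # 3b. Majority vote across trees (binary labels 0/1)
    return (np.mean(votes, axis=0) >= 0.5).astype(int)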
5. Support Vector Machine (SVM)
Explanation:
SVM is a supervised learning model that finds the optimal hyperplane that separates
data points of different classes. It maximizes the margin between the nearest data
points (support vectors) and the decision boundary.

Pseudocode:

1. Initialize the dataset with labeled instances.
2. Transform the dataset using kernel functions if necessary.
3. Compute the optimal hyperplane by solving the optimization problem:
a. Minimize ||w||² subject to:
y_i (w · x_i + b) ≥ 1 for all training samples.
4. Compute the support vectors that define the hyperplane.
5. For a new test instance:
a. Compute the decision function: f(x) = w * x + b
b. Assign class label based on the sign of f(x):
If f(x) ≥ 0, classify as 1; else classify as 0.
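
Rather than re-implementing the optimization, the sketch below shows how the fitted linear SVC from the code section exposes the learned w and b, and how the sign of f(x) = w·x + b gives the class (it assumes the scaled X_train, X_test, y_train from the script above):

import numpy as np
from sklearn.svm import SVC

svm = SVC(kernel='linear').fit(X_train, y_train)

w = svm.coef_[0]        # weight vector of the separating hyperplane
b = svm.intercept_[0]   # bias term
f = X_test @ w + b      # decision function, equivalent to svm.decision_function(X_test)

y_pred = (f >= 0).astype(int)   # classify by the sign of f(x)
print("Number of support vectors:", len(svm.support_vectors_))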

Conclusion
In this study, I have compared five classification algorithms—Logistic Regression, k-
Nearest Neighbors (k-NN), Decision Trees, Random Forests, and Support Vector
Machines (SVMs)—using the Diabetes Dataset.

Among all models, Random Forest achieved the highest accuracy (74.68%) and the
best F1-score (61.39%), making it the most effective classifier in our analysis. It
outperformed other models by balancing precision and recall, which is crucial for
medical predictions.

The Decision Tree classifier followed closely with 74.02% accuracy, but its F1-score
(57.44%) was slightly lower, indicating a higher variance in predictions.

Support Vector Machine (SVM) provided 72.08% accuracy with an F1-score of
56.57%, performing better than Logistic Regression (71.43% accuracy, 56.00% F1-
score) and k-NN (70.13% accuracy, 54.90% F1-score).

The lower F1-scores of Logistic Regression and k-NN suggest that these models
struggled with class imbalance, leading to misclassification of the minority class.

This analysis shows that ensemble methods like Random Forest are more effective
in handling complex datasets, making them a strong choice for medical predictions.
