
Name: Syed Naweed S N

Reg.no: 22BIT0061

Digital assignment - 1

Comparative Analysis of Machine Learning Models for Diabetes Prediction Using the Pima Indians Diabetes Dataset
Introduction: (Why I Chose the Diabetes Dataset)

I have chosen the Pima Indians Diabetes Dataset because it is one of the most
widely used datasets for predicting the presence of diabetes. This dataset, originally
from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK),
consists of 768 samples with 8 numerical features that provide essential medical
information related to diabetes risk factors.

The features include:

- Pregnancies (number of times pregnant)
- Glucose (plasma glucose concentration)
- Blood Pressure (diastolic blood pressure)
- Skin Thickness (triceps skin fold thickness)
- Insulin (2-hour serum insulin)
- BMI (body mass index)
- Diabetes Pedigree Function (family history of diabetes)
- Age

The target variable is binary, where 1 represents the presence of diabetes, and 0
represents its absence.

Why This Dataset?

1. Medical Significance: Diabetes is a major global health concern, and machine
learning models trained on this dataset can help in early diagnosis and risk
assessment.
2. Structured & Standardized Features: The dataset is well-structured and
contains measurable medical attributes, making it ideal for classification
algorithms such as Logistic Regression, k-Nearest Neighbors (k-NN),
Decision Trees, Random Forests, and Support Vector Machines (SVMs).
3. Widely Used in Research: This dataset has been extensively used in machine
learning and healthcare studies, making it a benchmark for evaluating
classification models.
4. Binary Classification Problem: The dataset allows us to compare different
classification models in terms of accuracy, precision, recall, and F1-score,
helping determine the most effective algorithm for diabetes prediction.

By applying machine learning models to this dataset, we aim to analyze their
performance and identify the most reliable model for predicting diabetes,
contributing to better healthcare decision-making.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
           "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, names=columns)

X = df.drop(columns=['Outcome'])
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Support Vector Machine": SVC(kernel='linear')
}
results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-Score": f1_score(y_test, y_pred)
    }
    print(f"\n🔹 {name} 🔹")
    print(classification_report(y_test, y_pred))

results_df = pd.DataFrame(results).T
print("\nPerformance Comparison:\n", results_df)

best_model = results_df['Accuracy'].idxmax()
best_accuracy = results_df['Accuracy'].max()
print(f"\n🏆 The best performing model is **{best_model}** with an accuracy of **{best_accuracy:.4f}**.")

colors = ["#3498DB", "#2ECC71", "#F1C40F", "#E74C3C", "#9B59B6"]
# results_df.plot creates its own figure, so a separate plt.figure call is not needed
ax = results_df.plot(kind='bar', figsize=(12, 7), width=0.8, alpha=0.85,
                     edgecolor='black', color=colors)
plt.title("Comparison of Classification Models on Diabetes Dataset", fontsize=14,
          fontweight='bold')
plt.ylabel("Score", fontsize=12)
plt.xticks(rotation=45, fontsize=11)
plt.yticks(fontsize=11)
plt.legend(loc='lower right', fontsize=11)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Output:
My F1-score is likely low due to class imbalance, which causes the model to favor the
majority class and reduces precision or recall for the minority class. Improper feature
scaling or an unoptimized classification threshold could also contribute. To improve
it, I can balance the dataset using SMOTE, fine-tune the decision threshold, and
ensure proper feature preprocessing.
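
For example, the decision threshold could be tuned with a small sketch like the one below. It reuses X_train, X_test, y_train, y_test from the script above and, for simplicity, picks the cutoff that maximizes F1 on the test set; a separate validation split would be more rigorous.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Fit a logistic regression as in the code section and get class-1 probabilities
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Scan candidate thresholds and keep the one with the best F1-score
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(f"Best threshold: {best_t:.2f}, F1-score: {max(f1s):.4f}")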

I attempted to use SMOTE to balance the dataset, but I encountered multiple issues
during implementation. The errors were mainly due to preprocessing conflicts and
compatibility with certain models after oversampling. I need to further debug these
issues to ensure that the synthetic samples do not introduce biases or affect the
model's performance incorrectly. For now, I proceeded with the original dataset to
analyze the model’s behavior on real-world imbalanced data.
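
For reference, the SMOTE setup I intend to debug looks roughly like the sketch below. It assumes the imbalanced-learn package is installed and reuses the scaled training split from the script above; oversampling is applied to the training data only so the test set stays untouched.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Generate synthetic minority-class samples on the (scaled) training split only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, rf.predict(X_test)))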
Classification Algorithms and Pseudocode:

1. Logistic Regression
Explanation:
Logistic Regression is a statistical model used for binary classification. It predicts the
probability that a given input belongs to one of the two classes using a sigmoid
function. The model finds the best-fitting line that separates the classes by
minimizing the error using optimization techniques like Gradient Descent.

Pseudocode:

1. Initialize weights (w) and bias (b) to small random values.
2. Repeat until convergence:
a. Compute the weighted sum: Z = wX + b
b. Apply the sigmoid function: Y_pred = 1 / (1 + e^(-Z))
c. Compute the loss using Binary Cross-Entropy:
Loss = - [Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred)]
d. Compute the gradients:
dw = (1/m) * Σ [(Y_pred - Y) * X]
db = (1/m) * Σ (Y_pred - Y)
e. Update the weights and bias using Gradient Descent:
w = w - learning_rate * dw
b = b - learning_rate * db
3. Return the optimized weights and bias.
4. Use the model to make predictions:
If Y_pred ≥ 0.5, classify as 1; else classify as 0.
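
A compact NumPy translation of this pseudocode is sketched below (illustrative only, with an assumed learning rate and iteration count; the code section itself uses scikit-learn's LogisticRegression):

import numpy as np

def train_logistic_regression(X, Y, learning_rate=0.1, n_iters=1000):
    m, n = X.shape
    w = np.zeros(n)   # weights
    b = 0.0           # bias
    for _ in range(n_iters):
        Z = X @ w + b                         # weighted sum
        Y_pred = 1.0 / (1.0 + np.exp(-Z))     # sigmoid
        dw = (1 / m) * (X.T @ (Y_pred - Y))   # gradient w.r.t. weights
        db = (1 / m) * np.sum(Y_pred - Y)     # gradient w.r.t. bias
        w -= learning_rate * dw               # gradient descent update
        b -= learning_rate * db
    return w, b

def predict(X, w, b):
    Y_pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (Y_pred >= 0.5).astype(int)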

2. k-Nearest Neighbors (k-NN)


Explanation:
k-NN is a non-parametric, instance-based learning algorithm used for classification. It
classifies a new data point based on the majority class of its nearest neighbors in the
feature space. The distance is typically measured using Euclidean distance.

Pseudocode:

1. Choose the number of neighbors (k).
2. For each test data point:
a. Compute the distance between the test point and all training points.
b. Sort the distances in ascending order.
c. Select the k nearest neighbors.
d. Count the occurrences of each class in the k neighbors.
e. Assign the test point to the class with the highest count.
3. Return the predicted class labels.
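
Following these steps, a minimal NumPy version (Euclidean distance, majority vote; the code section uses scikit-learn's KNeighborsClassifier instead) could look like:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5):
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_test:
        # a. Distance from the test point to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # b/c. Indices of the k nearest neighbors
        nearest = np.argsort(distances)[:k]
        # d/e. Majority vote among their labels
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(predictions)
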
3. Decision Tree Classifier
Explanation:
A Decision Tree is a hierarchical model that splits the dataset based on the most
significant feature at each node. The splits are made to maximize information gain or
minimize Gini impurity. The tree continues growing until a stopping criterion is met.

Pseudocode:

1. Select the best feature to split the dataset using Information Gain or Gini Impurity.
2. Split the dataset into subsets based on the selected feature.
3. Repeat recursively for each subset:
a. If all data points in a subset belong to the same class, create a leaf node.
b. Otherwise, continue splitting until the stopping condition is met.
4. Assign class labels to leaf nodes.
5. For a new test instance:
a. Traverse the tree from the root node.
b. Follow the decision rules at each node until a leaf node is reached.
c. Assign the label of the leaf node as the prediction.
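
To make the splitting criterion concrete, the Gini impurity of a candidate split can be computed with a small helper like this (illustrative only; the recursion and the search over thresholds are omitted):

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_k^2)
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(feature_values, labels, threshold):
    # Weighted impurity after splitting on feature <= threshold;
    # the best split is the one that minimizes this value
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)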

4. Random Forest Classifier


Explanation:
Random Forest is an ensemble learning method that builds multiple Decision Trees
and combines their outputs to improve accuracy and reduce overfitting. Each tree is
trained on a random subset of the data, and the final prediction is determined by
majority voting.

Pseudocode:

1. Select the number of trees (N).
2. For each tree:
a. Randomly sample data points (with replacement) from the training set.
b. Randomly select a subset of features for each split.
c. Grow a Decision Tree until the stopping condition is met.
3. For a new test instance:
a. Pass the instance through each Decision Tree to get predictions.
b. Use majority voting to determine the final class.
4. Return the predicted class label.
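
The bagging-and-voting idea can be sketched with scikit-learn's DecisionTreeClassifier as the base learner (a toy illustration; RandomForestClassifier, used in the code section, handles all of this internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, X_new, n_trees=100, random_state=42):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    votes = []
    for _ in range(n_trees):
        # a. Bootstrap sample: draw rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # b. max_features='sqrt' picks a random feature subset at each split
        tree = DecisionTreeClassifier(max_features='sqrt')
        tree.fit(X[idx], y[idx])
        votes.append(tree.predict(X_new))
    # 3b. Majority vote across trees (binary labels 0/1)
    return (np.mean(votes, axis=0) >= 0.5).astype(int)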
5. Support Vector Machine (SVM)
Explanation:
SVM is a supervised learning model that finds the optimal hyperplane that separates
data points of different classes. It maximizes the margin between the nearest data
points (support vectors) and the decision boundary.

Pseudocode:

1. Initialize the dataset with labeled instances.
2. Transform the dataset using kernel functions if necessary.
3. Compute the optimal hyperplane by solving the optimization problem:
a. Minimize ||w||² subject to:
y_i (w · x_i + b) ≥ 1 for all training samples.
4. Compute the support vectors that define the hyperplane.
5. For a new test instance:
a. Compute the decision function: f(x) = w * x + b
b. Assign class label based on the sign of f(x):
If f(x) ≥ 0, classify as 1; else classify as 0.
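
Rather than re-implementing the optimization, the sketch below shows how the fitted linear SVC from the code section exposes the learned w and b, and how the sign of f(x) = w·x + b gives the class (it assumes the scaled X_train, X_test, y_train from the script above):

import numpy as np
from sklearn.svm import SVC

svm = SVC(kernel='linear').fit(X_train, y_train)

w = svm.coef_[0]        # weight vector of the separating hyperplane
b = svm.intercept_[0]   # bias term
f = X_test @ w + b      # decision function, equivalent to svm.decision_function(X_test)

y_pred = (f >= 0).astype(int)   # classify by the sign of f(x)
print("Number of support vectors:", len(svm.support_vectors_))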

Conclusion
In this study, I have compared five classification algorithms—Logistic Regression, k-
Nearest Neighbors (k-NN), Decision Trees, Random Forests, and Support Vector
Machines (SVMs)—using the Diabetes Dataset.

Among all models, Random Forest achieved the highest accuracy (74.68%) and the
best F1-score (61.39%), making it the most effective classifier in our analysis. It
outperformed other models by balancing precision and recall, which is crucial for
medical predictions.

The Decision Tree classifier followed closely with 74.02% accuracy, but its F1-score
(57.44%) was slightly lower, indicating a higher variance in predictions.

Support Vector Machine (SVM) provided 72.08% accuracy with an F1-score of
56.57%, performing better than Logistic Regression (71.43% accuracy, 56.00% F1-
score) and k-NN (70.13% accuracy, 54.90% F1-score).

The lower F1-scores of Logistic Regression and k-NN suggest that these models
struggled with class imbalance, leading to misclassification of the minority class.

This analysis shows that ensemble methods like Random Forest are more effective
in handling complex datasets, making them a strong choice for medical predictions.
