MLDA1
Reg.no: 22BIT0061
Digital assignment - 1
I have chosen the Pima Indians Diabetes Dataset because it is one of the most
widely used datasets for predicting the presence of diabetes. This dataset, originally
from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK),
consists of 768 samples with 8 numerical features that provide essential medical
information related to diabetes risk factors.
The target variable is binary, where 1 represents the presence of diabetes, and 0
represents its absence.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
           "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, names=columns)
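# Sanity check: the dataset should contain 768 rows and 8 feature columns
# plus the binary Outcome target (0 = no diabetes, 1 = diabetes)
print(df.shape)
print(df['Outcome'].value_counts())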
X = df.drop(columns=['Outcome'])
y = df['Outcome']
# Split the data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features: fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
models = {
"Logistic Regression": LogisticRegression(),
"k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(n_estimators=100),
"Support Vector Machine": SVC(kernel='linear')
}
# Train each model and evaluate it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-score": f1_score(y_test, y_pred)
    }
results_df = pd.DataFrame(results).T
print("\nPerformance Comparison:\n", results_df)
best_model = results_df['Accuracy'].idxmax()
best_accuracy = results_df['Accuracy'].max()
print(f"\n🏆 The best performing model is **{best_model}** with an accuracy of
**{best_accuracy:.4f}**.")
colors = ["#3498DB", "#2ECC71", "#F1C40F", "#E74C3C", "#9B59B6"]
ax = results_df.plot(kind='bar', figsize=(12, 7), width=0.8, alpha=0.85,
                     edgecolor='black', color=colors)
plt.title("Comparison of Classification Models on Diabetes Dataset", fontsize=14,
fontweight='bold')
plt.ylabel("Score", fontsize=12)
plt.xticks(rotation=45, fontsize=11)
plt.yticks(fontsize=11)
plt.legend(loc='lower right', fontsize=11)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Output:
My F1-score is low, most likely because of class imbalance: the models favor the majority class, which lowers precision and recall for the minority (diabetic) class. Improper feature scaling or an unoptimized classification threshold could also contribute. To improve it, I can balance the dataset with SMOTE, fine-tune the decision threshold, and make sure the features are preprocessed properly.
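As a quick check of both points, the class distribution and the effect of the decision threshold can be inspected directly. This is only a sketch that reuses df, the scaled X_train/X_test, and y_train/y_test from the code above; the threshold range is an illustrative choice.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Class distribution: roughly 65% non-diabetic vs 35% diabetic
print(df['Outcome'].value_counts(normalize=True))

# Sweep the decision threshold for Logistic Regression and watch the F1-score
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = log_reg.predict_proba(X_test)[:, 1]     # predicted probability of class 1
for threshold in np.arange(0.30, 0.65, 0.05):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  F1={f1_score(y_test, preds):.4f}")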
I attempted to use SMOTE to balance the dataset, but I ran into multiple issues during implementation. The errors were mainly due to preprocessing conflicts and compatibility problems with certain models after oversampling. I still need to debug these issues to make sure the synthetic samples do not introduce bias or distort the models' performance. For now, I proceeded with the original dataset to analyze the models' behavior on real-world imbalanced data.
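For reference, one common way to apply SMOTE without leaking synthetic samples into the test data is an imbalanced-learn pipeline. The sketch below only outlines that idea, assuming the imbalanced-learn package is installed; it is not the exact code I attempted.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling and SMOTE are applied inside each training fold, so the test folds stay untouched
pipeline = ImbPipeline(steps=[
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, scoring="f1", cv=5)
print("Mean F1 with SMOTE:", scores.mean())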
Classification Algorithms and Pseudocode:
1. Logistic Regression
Explanation:
Logistic Regression is a statistical model used for binary classification. It predicts the probability that a given input belongs to one of the two classes by applying the sigmoid function to a weighted sum of the input features. The model learns the weights that best separate the classes by minimizing the error (log-loss) with optimization techniques like Gradient Descent.
Pseudocode:
1. Initialize the weights and bias to zero (or small random values).
2. For each training iteration:
   a. Compute the weighted sum z = w·x + b for every training sample.
   b. Apply the sigmoid function to get the predicted probability p = 1 / (1 + e^(-z)).
   c. Compute the error (log-loss) between the predictions and the true labels.
   d. Update the weights and bias with Gradient Descent to reduce the error.
3. Repeat until the error converges or the maximum number of iterations is reached.
4. For a new test instance, predict class 1 if p ≥ 0.5, otherwise predict class 0.
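To make the sigmoid and Gradient Descent steps concrete, here is a minimal from-scratch sketch in NumPy; the learning rate and iteration count are illustrative choices, and the assignment's actual code uses scikit-learn's LogisticRegression instead.

import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # Illustrative Gradient Descent on the log-loss (not the assignment's code)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        z = X @ w + b                         # weighted sum for every sample
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid -> probability of class 1
        grad_w = X.T @ (p - y) / n_samples    # gradient of the log-loss w.r.t. the weights
        grad_b = np.mean(p - y)               # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (p >= threshold).astype(int)       # class 1 if the probability is at least 0.5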
2. Decision Tree
Explanation:
A Decision Tree classifies samples by repeatedly splitting the data on the feature that best separates the classes, measured with Information Gain or Gini Impurity, until each subset is pure or a stopping condition is met.
Pseudocode:
1. Select the best feature to split the dataset using Information Gain or Gini Impurity.
2. Split the dataset into subsets based on the selected feature.
3. Repeat recursively for each subset:
a. If all data points in a subset belong to the same class, create a leaf node.
b. Otherwise, continue splitting until the stopping condition is met.
4. Assign class labels to leaf nodes.
5. For a new test instance:
a. Traverse the tree from the root node.
b. Follow the decision rules at each node until a leaf node is reached.
c. Assign the label of the leaf node as the prediction.
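As a small illustration of the impurity measure the pseudocode refers to, here is a sketch that computes Gini impurity and fits scikit-learn's DecisionTreeClassifier on the same split; the max_depth value is an illustrative choice rather than the setting used in the comparison above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(labels):
    # 1 minus the sum of squared class proportions; 0 means the node is pure
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print("Gini impurity of the training labels:", gini_impurity(np.asarray(y_train)))

# A shallow tree is easier to read and less prone to overfitting (illustrative depth)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))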
Conclusion
In this study, I have compared five classification algorithms—Logistic Regression, k-
Nearest Neighbors (k-NN), Decision Trees, Random Forests, and Support Vector
Machines (SVMs)—using the Diabetes Dataset.
Among all models, Random Forest achieved the highest accuracy (74.68%) and the best F1-score (61.39%), making it the most effective classifier in my analysis. It outperformed the other models by better balancing precision and recall, which is crucial for medical predictions.
The Decision Tree classifier followed closely with 74.02% accuracy, but its F1-score
(57.44%) was slightly lower, indicating a higher variance in predictions.
The lower F1-scores of Logistic Regression and k-NN suggest that these models
struggled with class imbalance, leading to misclassification of the minority class.
This analysis shows that ensemble methods like Random Forest are more effective
in handling complex datasets, making them a strong choice for medical predictions.