ML2
ROLL NO : 12212161
1) Step 1: Import Libraries and Load the Data (code)
import pandas as pd
df = pd.read_csv('your_dataset.csv')
print(df.head())
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
X = df.drop('target_column', axis=1)  # Replace 'target_column' with the actual target column for CHD
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the decision tree classifier with Gini index and max depth of 4
clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)
clf.fit(X_train, y_train)
# Make predictions and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
2) Step 1: Train a Decision Tree and Logistic Regression (code)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
df = pd.read_csv('your_dataset.csv')
X = df.drop('target_column', axis=1)  # Replace 'target_column' with the actual target column for CHD
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the Decision Tree
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
# Make predictions
dt_pred = dt_clf.predict(X_test)
# Initialize Logistic Regression
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train, y_train)
# Make predictions
lr_pred = lr_clf.predict(X_test)
print(confusion_matrix(y_test, dt_pred))
print(confusion_matrix(y_test, lr_pred))
print(classification_report(y_test, dt_pred))
print(classification_report(y_test, lr_pred))
Summary: The model with the higher ROC-AUC score combined with good precision, recall, and F1-
score should be considered better for classifying CHD cases.
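The ROC-AUC comparison described in this summary can be sketched as follows. This is a minimal, self-contained illustration: the dataset is a synthetic stand-in generated with `make_classification`, not the actual CHD data, and the model settings and split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the CHD dataset (hypothetical, for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
lr_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC-AUC measures ranking quality, so it needs probability scores, not hard labels
dt_auc = roc_auc_score(y_test, dt_clf.predict_proba(X_test)[:, 1])
lr_auc = roc_auc_score(y_test, lr_clf.predict_proba(X_test)[:, 1])
print("Decision Tree ROC-AUC:", round(dt_auc, 3))
print("Logistic Regression ROC-AUC:", round(lr_auc, 3))
```

Whichever model scores higher here would then be checked against precision, recall, and F1 before being declared the better CHD classifier.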
3) Step 1: Code to Plot the Decision Tree
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=list(X.columns), class_names=['No CHD', 'CHD'], filled=True)  # clf is the tree fitted earlier
plt.show()
Once you plot the tree, the top node is the root node of the tree. The feature shown at this node is
the most important feature, i.e., the one that best splits the data according to the Gini index (or
whichever criterion is used).
Look for the first feature in the plot, which will be the top node.
The threshold shown at this node is the value that splits the dataset into two branches.
Step-by-Step Calculation:
o Count how many samples belong to class 0 (No CHD) and class 1 (CHD).
o Compute the class proportions p0 and p1, then apply Gini = 1 - (p0^2 + p1^2).
o A Gini index of 0 means the node is pure (all samples belong to one class); 0.5 is the maximum impurity for two classes.
Summary:
1. The most important splitting criterion at the top node is the feature that gives the best
separation between classes, based on the Gini index.
2. The Gini index is a measure of how impure a node is. The lower the Gini index, the purer the
node.
3. The Gini index is calculated using the class proportions at the node, and it tells you how well
the split divides the data into classes.
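To make the calculation in point 3 concrete, here is a small worked example. The 70/30 class counts are made up for illustration; they are not taken from the dataset.

```python
# Hypothetical node: 70 samples without CHD, 30 with CHD
n_no_chd, n_chd = 70, 30
total = n_no_chd + n_chd

# Class proportions at the node
p0 = n_no_chd / total  # 0.7
p1 = n_chd / total     # 0.3

# Gini index: 1 minus the sum of squared class proportions
gini = 1 - (p0**2 + p1**2)
print(round(gini, 2))  # 1 - (0.49 + 0.09) = 0.42
```

A pure node (100/0 split) would give Gini = 0, and a 50/50 split would give the two-class maximum of 0.5, matching points 2 and 3 above.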
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score
df = pd.read_csv('your_dataset.csv')
X = df.drop('target_column', axis=1)  # Replace 'target_column' with the actual target column for CHD
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Search over max_depth, scored by cross-validated ROC-AUC
param_grid = {'max_depth': [2, 3, 4, 5, 6, 7, 8]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)
best_max_depth = grid_search.best_params_['max_depth']
best_score = grid_search.best_score_
# Refit at the optimal depth and plot the ROC curve on the test set
optimal_tree = DecisionTreeClassifier(max_depth=best_max_depth, random_state=42)
optimal_tree.fit(X_train, y_train)
y_pred_prob = optimal_tree.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC-AUC = {roc_auc_score(y_test, y_pred_prob):.3f}')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
The optimal max_depth is the one that gives the highest ROC-AUC score during cross-
validation.
After finding the optimal max_depth, you evaluate the model on the test set to ensure its
performance is consistent.
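The two points above can be sketched end to end as follows. This is a self-contained illustration: the data comes from `make_classification` as a stand-in for the CHD dataset, and the depth grid of 2 to 8 is an assumed search range.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the CHD dataset (hypothetical, for illustration only)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Cross-validated search for the max_depth with the highest ROC-AUC
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    {'max_depth': list(range(2, 9))}, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)

# Compare the cross-validated score against the held-out test score
test_auc = roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
print("Best max_depth:", grid.best_params_['max_depth'])
print("CV ROC-AUC:", round(grid.best_score_, 3))
print("Test ROC-AUC:", round(test_auc, 3))
```

If the test ROC-AUC is close to the cross-validated score, the chosen max_depth generalizes; a large gap would suggest the depth was tuned to noise in the training folds.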