ML2

NAME : P KOUSHIK REDDY

ROLL NO : 12212161

1) Step 1: Import Libraries

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

Step 2: Load the Dataset

# Load the dataset

df = pd.read_csv('your_dataset.csv')

# Check the first few rows of the dataset

print(df.head())

Step 3: Prepare Features and Target Variables

# Define features (X) and target (y)

X = df.drop('target_column', axis=1) # Replace 'target_column' with your actual target column

y = df['target_column'] # Replace 'target_column' with your actual target column

Step 4: Split the Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Build the Decision Tree Model

# Build the decision tree classifier with Gini index and max depth of 4

clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)

# Train the model

clf.fit(X_train, y_train)

# Make predictions on the test set

y_pred = clf.predict(X_test)

# Calculate accuracy

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

# Confusion matrix

cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:\n", cm)

2) CODE TO COMPARE BOTH MODELS:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Load the dataset

df = pd.read_csv('your_dataset.csv')

# Define features (X) and target (y)

X = df.drop('target_column', axis=1) # Drop the target column

y = df['target_column'] # Target is CHD (0 or 1)

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier

dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)

# Train the model

dt_clf.fit(X_train, y_train)

# Make predictions

dt_pred = dt_clf.predict(X_test)

# Initialize Logistic Regression

lr_clf = LogisticRegression(max_iter=1000)

# Train the model

lr_clf.fit(X_train, y_train)

# Make predictions

lr_pred = lr_clf.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_pred)

lr_accuracy = accuracy_score(y_test, lr_pred)

print("Decision Tree Accuracy:", dt_accuracy)

print("Logistic Regression Accuracy:", lr_accuracy)

print("Decision Tree Confusion Matrix:")

print(confusion_matrix(y_test, dt_pred))

print("Logistic Regression Confusion Matrix:")

print(confusion_matrix(y_test, lr_pred))

print("Decision Tree Classification Report:")

print(classification_report(y_test, dt_pred))

print("Logistic Regression Classification Report:")

print(classification_report(y_test, lr_pred))

# ROC-AUC is best computed from predicted probabilities rather than hard class labels

dt_roc_auc = roc_auc_score(y_test, dt_clf.predict_proba(X_test)[:, 1])

lr_roc_auc = roc_auc_score(y_test, lr_clf.predict_proba(X_test)[:, 1])

print("Decision Tree ROC-AUC:", dt_roc_auc)

print("Logistic Regression ROC-AUC:", lr_roc_auc)

Summary: The model with the higher ROC-AUC score combined with good precision, recall, and F1-score should be considered better for classifying CHD cases.
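If a single side-by-side number beyond the classification reports is wanted, the F1-scores can also be compared directly. A short sketch, reusing dt_pred and lr_pred from above:

from sklearn.metrics import f1_score

# Compare F1-scores of both models on the same test split
print("Decision Tree F1:", f1_score(y_test, dt_pred))
print("Logistic Regression F1:", f1_score(y_test, lr_pred))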
3) Step 1: Code to Plot the Decision Tree

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, plot_tree

import matplotlib.pyplot as plt

# Assuming you have already trained the decision tree classifier

# dt_clf is your decision tree classifier object

# Plot the decision tree

plt.figure(figsize=(20,10))

plot_tree(dt_clf, feature_names=X.columns, class_names=['No CHD', 'CHD'], filled=True, rounded=True)

plt.show()

Step 2: Identify the Top Node and Splitting Criteria

Once you plot the tree, the top node is the root node of the tree. The feature shown at this node is
the most important feature, i.e., the one that best splits the data according to the Gini index (or
whichever criterion is used).

• Look for the first feature in the plot; this is the top (root) node.

• The threshold at this node is the value that splits the dataset into two branches; both can also be read programmatically, as sketched below.
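As a minimal sketch (assuming dt_clf is the tree fitted earlier and X is the feature DataFrame), the root split can be read directly from scikit-learn's underlying tree structure:

# Read the root node's split from the fitted tree
# (node 0 is always the root in scikit-learn's tree_ arrays)
root_feature = X.columns[dt_clf.tree_.feature[0]]
root_threshold = dt_clf.tree_.threshold[0]
print(f"Root node splits on '{root_feature}' at threshold {root_threshold:.3f}")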

Step 3: Deriving the Gini Index at the Top Node

Step-by-Step Calculation:

1. Calculate class proportions at the top node:

o Count how many samples belong to class 0 (No CHD) and class 1 (CHD).

o Calculate the proportion of each class: p_0 and p_1.

2. Plug the proportions into the Gini formula:

Gini = 1 - (p_0^2 + p_1^2)

3. Interpret the result:

o A Gini index of 0 means the node is pure (all samples belong to one class).

o A higher Gini index indicates a more mixed node.

Summary:
1. The most important splitting criterion at the top node is the feature that gives the best
separation between classes, based on the Gini index.

2. The Gini index is a measure of how impure a node is. The lower the Gini index, the purer the
node.

3. The Gini index is calculated using the class proportions at the node, and it tells you how well the split divides the data into classes (a worked example follows below).
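As a worked illustration with hypothetical counts (not taken from the actual dataset): if the root node held 300 'No CHD' and 100 'CHD' samples, then p_0 = 0.75, p_1 = 0.25, and Gini = 1 - (0.75^2 + 0.25^2) = 0.375. In code:

# Gini index from class counts at a node (counts here are hypothetical)
n_no_chd, n_chd = 300, 100
total = n_no_chd + n_chd
p0, p1 = n_no_chd / total, n_chd / total
gini = 1 - (p0**2 + p1**2)
print(f"Gini index at the node: {gini:.3f}")  # 0.375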

4) CODE FOR OPTIMAL MAX_DEPTH SEARCH:

import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn.metrics import roc_auc_score

# Load the dataset

df = pd.read_csv('your_dataset.csv')

# Define features (X) and target (y)

X = df.drop('target_column', axis=1) # Replace 'target_column' with the actual target column for CHD

y = df['target_column'] # Target is CHD (0 or 1)

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree Classifier using Gini index

dt_clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Define the parameter grid for max_depth

param_grid = {'max_depth': range(3, 11)}

# Set up the GridSearchCV with ROC-AUC as scoring

grid_search = GridSearchCV(estimator=dt_clf, param_grid=param_grid, scoring='roc_auc', cv=5)

# Train the model using grid search


grid_search.fit(X_train, y_train)

# Get the optimal max_depth

best_max_depth = grid_search.best_params_['max_depth']

best_score = grid_search.best_score_

print(f"Optimal max_depth: {best_max_depth}")

print(f"Best ROC-AUC score: {best_score}")

# Train the Decision Tree with the optimal max_depth

optimal_tree = DecisionTreeClassifier(criterion='gini', max_depth=best_max_depth, random_state=42)

optimal_tree.fit(X_train, y_train)

# Predict probabilities for the test set

y_pred_prob = optimal_tree.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC score on the test set

roc_auc = roc_auc_score(y_test, y_pred_prob)

print(f"ROC-AUC Score on Test Set: {roc_auc}")

import matplotlib.pyplot as plt

from sklearn.metrics import roc_curve

# Compute ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot the ROC curve

plt.figure()

plt.plot(fpr, tpr, label=f'Decision Tree (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve')

plt.legend(loc='best')

plt.show()

Interpret the Results

• The optimal max_depth is the one that gives the highest ROC-AUC score during cross-validation (the per-depth scores can be inspected as sketched after this list).

• After finding the optimal max_depth, you evaluate the model on the test set to ensure its performance is consistent.

• A higher ROC-AUC score indicates better model performance in distinguishing between positive and negative classes (CHD and non-CHD cases).
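To see how each candidate depth scored during cross-validation, a quick sketch assuming the grid_search object from the code above:

# Inspect the mean cross-validated ROC-AUC for every candidate max_depth
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['param_max_depth', 'mean_test_score']])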
