dwm_06
Theory:
The Decision Tree classifier uses a flowchart-like structure where internal nodes represent a "test" on an
attribute, branches represent the outcome of the test, and each leaf node represents a class label (decision
outcome).
ID3 algorithm:
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm for classification tasks. It
builds a decision tree by recursively partitioning the dataset into smaller and smaller subsets until all data
points in each subset belong to the same class. It employs a top-down approach, recursively selecting
features to split the dataset based on information gain.
The algorithm works by selecting the attribute that best classifies the training data using a metric called
"Information Gain." The attribute with the highest Information Gain is chosen as the root node, and this
process is repeated recursively for each branch until the tree fully represents the data or satisfies a stopping
criterion.
● Information Gain: The reduction in entropy achieved by partitioning the data according to a
particular attribute. The attribute with the highest Information Gain is selected for a decision node:

Entropy(S) = - Σ_i p_i · log2(p_i)
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

(S is the original set of examples, p_i is the proportion of examples in S that belong to class i, A is the
attribute being tested, and Sv is the subset of S for which attribute A has value v.)
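To make the metric concrete, here is a minimal sketch of the information-gain computation and the ID3 recursion on a tiny made-up table (the column names and values are illustrative, not taken from the lab dataset):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(S, A) = Entropy(S) - sum over v of (|Sv|/|S|) * Entropy(Sv)
    n, gain = len(labels), entropy(labels)
    for v in set(row[attr] for row in rows):
        sv = [lbl for row, lbl in zip(rows, labels) if row[attr] == v]
        gain -= (len(sv) / n) * entropy(sv)
    return gain

def id3(rows, labels, attrs):
    # Recursively build a tree as nested dicts; leaves are class labels
    if len(set(labels)) == 1:                        # pure subset -> leaf
        return labels[0]
    if not attrs:                                    # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for v in set(row[best] for row in rows):
        keep = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3([r for r, _ in keep], [l for _, l in keep],
                            [a for a in attrs if a != best])
    return tree

# Toy data (made-up): attribute 0 = Income_Group, attribute 1 = Age_Group
rows   = [('High', 'Young'), ('High', 'Old'), ('Low', 'Young'), ('Low', 'Old')]
labels = ['High', 'High', 'Low', 'Low']
print(information_gain(rows, labels, 0))   # 1.0 -- Income_Group separates the classes perfectly
print(information_gain(rows, labels, 1))   # 0.0 -- Age_Group carries no information here
print(id3(rows, labels, [0, 1]))           # {0: {'High': 'High', 'Low': 'Low'}}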
Step 1: Data Preparation- The dataset is read using pandas and the target variable (Spending_Score) is
separated from the features. The categorical columns are defined for encoding.
Step 2: Encoding Categorical Features- The categorical features are converted to numerical format to
prepare the data for the Decision Tree algorithm using LabelEncoder from the sklearn.preprocessing
module.
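As a quick illustration of what this encoding step produces (the category strings below are made up for demonstration), LabelEncoder assigns each distinct value an integer code, sorted alphabetically, and keeps the mapping in classes_:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Low', 'High', 'Average', 'High'])  # hypothetical values
print(codes)                        # [2 1 0 1]
print(le.classes_)                  # ['Average' 'High' 'Low']; code i decodes to classes_[i]
print(le.inverse_transform(codes))  # recovers the original strings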
Step 3: Splitting the Data- The dataset is split into training and test sets using train_test_split with 70%
of the data used for training and 30% for testing.
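The split itself is a single call; a sketch with the same variable names as the main listing (random_state and stratify are optional additions, not requirements of the experiment):

from sklearn.model_selection import train_test_split

# 70/30 split; random_state fixes the shuffle for reproducible runs, and
# stratify=y keeps the class proportions similar in both sets (optional).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)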
Step 4: Model Training- A DecisionTreeClassifier is created with criterion='entropy', so that splits are
chosen by entropy and information gain as in ID3 (scikit-learn's trees are CART-based and binary, but this
criterion mirrors ID3's split selection). The model is trained using the training dataset.
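Once fitted, the tree object exposes a few useful diagnostics; a short sketch, assuming clf has been trained as in the Code section below:

# Assumes clf = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train)
print(clf.get_depth(), clf.get_n_leaves())    # size of the learned tree
for name, imp in zip(X_train.columns, clf.feature_importances_):
    print(name, round(imp, 3))                # impurity-based feature importances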
Step 5: Model Prediction and Evaluation- The trained model predicts the target values for the test dataset.
The accuracy of the model is then evaluated using accuracy_score.
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Step 1: Data preparation - separate the target variable from the features
df = pd.read_csv('final_customer.csv')
X = df.drop(['Spending_Score'], axis=1)
y = df['Spending_Score']
print(X.info())

# Step 2: Encode the categorical features (and the target) as integers
categorical_columns = X.columns
label_encoders = {}
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    X[col] = label_encoders[col].fit_transform(X[col])
label_encoders['Spending_Score'] = LabelEncoder()
y = label_encoders['Spending_Score'].fit_transform(y)

# Step 3: 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train an entropy-based (ID3-style) decision tree
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

# Step 5: Predict on the test set and evaluate accuracy
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

# Visualise the top three levels of the learned tree
plt.figure(figsize=(20, 10))
plot_tree(clf, max_depth=3, filled=True, feature_names=X_train.columns,
          class_names=label_encoders['Spending_Score'].classes_, rounded=True)
plt.show()
Output:
Conclusion:
In this experiment, we implemented a Decision Tree using the ID3 approach, which leverages entropy and
information gain to select the attribute that best splits the data at each node, beginning with the root. Using
this approach, we developed a decision tree to predict customer spending scores based on several factors,
such as gender, age group, profession, income group, family size, and work experience.