ML 4
Aim: To create a decision tree for the given dataset using the ID3 algorithm.
Theory:
Decision Tree:
A decision tree is a supervised learning algorithm used for classification and regression tasks. It is structured like a
flowchart, consisting of a root node, internal decision nodes, branches, and leaf nodes.
• Root Node: The starting point of the tree, representing the entire dataset.
• Internal Nodes: Represent tests or decisions based on attributes.
• Branches: Indicate the outcomes of these tests.
• Leaf Nodes: Represent final decisions or predictions.
The tree works by recursively splitting the dataset into subsets based on attribute values, aiming to create homogeneous
groups at the leaf nodes. Algorithms like ID3, C4.5, and CART use metrics such as information gain, gain ratio, or Gini
index to decide the best splits. Decision trees are popular for their interpretability and ability to handle both categorical
and continuous data.
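To make this structure concrete, below is a minimal sketch of how such a tree could be represented in Python; the Node class and its field names are illustrative, not part of any library.

class Node:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # attribute tested at an internal node (None for a leaf)
        self.branches = {}          # maps an attribute value (a branch) to a child Node
        self.label = label          # final class prediction if this is a leaf node

# Example: a root that tests 'Outlook', with three branches ending in leaf nodes
root = Node(attribute='Outlook')
root.branches = {'Overcast': Node(label='Yes'),
                 'Sunny': Node(label='No'),
                 'Rain': Node(label='Yes')}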
ID3:
The ID3 algorithm is a decision tree-building method that uses a greedy, top-down approach to classify data. It selects
attributes based on information gain, a measure derived from entropy, to create splits that maximize classification
accuracy.
Steps of the ID3 Algorithm:
1. Start with the Root Node:
• Begin with the entire dataset as the root node.
2. Calculate Entropy and Information Gain:
• Compute the entropy of the dataset to measure impurity.
• For each attribute, calculate the information gain obtained by splitting the dataset on its values (a worked sketch follows these steps).
3. Select Best Attribute:
• Choose the attribute with the highest information gain as the node for splitting.
4. Partition Data:
• Split the dataset into subsets based on the selected attribute's values.
5. Create Child Nodes:
• For each subset, create a child node and repeat steps 2–4 recursively.
6. Stop Recursion:
• Stop when all instances in a subset belong to the same class, no attributes remain, or no examples are
left in a subset.
• Label leaf nodes with the majority class if stopping conditions are met before pure classification.
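As a concrete illustration of steps 2–3, the following sketch computes entropy and information gain for one attribute; the helper functions entropy() and information_gain() are illustrative, not library calls.

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p), where p is the class proportion
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(values, labels):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Example: the Wind attribute against PlayTennis on four days
wind = ['Weak', 'Strong', 'Weak', 'Strong']
play = ['Yes', 'No', 'Yes', 'Yes']
print(information_gain(wind, play))  # ID3 picks the attribute with the highest such gain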
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Hold out a test set (the split ratio here is a typical choice), then fit an
# entropy-based tree; scikit-learn's tree is CART-based, so criterion='entropy'
# gives an ID3-style tree with binary splits rather than exact ID3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)

# Predict a sample
sample = [X_test[0]]
predicted = model.predict(sample)

# Accuracy on the test set
print(model.score(X_test, y_test))
Output:
Play Tennis:
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Play Tennis dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
             'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                   'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

# Encode each categorical column as integers
encoders = {}
for column in df.columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Features and target
X = df.drop('PlayTennis', axis=1)
y = df['PlayTennis']

# Split dataset (ratio chosen as an example) and fit an entropy-based, ID3-style tree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

# Accuracy on the test set
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Predict for a new day: Sunny, Cool, High humidity, Strong wind
sample = pd.DataFrame([{
    'Outlook': encoders['Outlook'].transform(['Sunny'])[0],
    'Temperature': encoders['Temperature'].transform(['Cool'])[0],
    'Humidity': encoders['Humidity'].transform(['High'])[0],
    'Wind': encoders['Wind'].transform(['Strong'])[0]
}])
prediction = clf.predict(sample)[0]
result = encoders['PlayTennis'].inverse_transform([prediction])[0]
print(result)
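As a follow-up, scikit-learn's export_text can print the splits the tree actually learned; note that the thresholds refer to the label-encoded integer values, not the original strings.

from sklearn.tree import export_text

# Print the learned tree; feature_names restores the readable column names
print(export_text(clf, feature_names=list(X.columns)))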