Lab 3
import math
from collections import Counter
from pprint import pprint
import pandas as pd

df = pd.read_csv('1d3.csv')        # training data; the last column is the target
print(df)
t = df.keys()[-1]                  # target (class) attribute
attribute_names = list(df.keys())
attribute_names.remove(t)          # predictor attributes only

def entropy(probs):
    # Entropy of a probability distribution: sum of -p * log2(p).
    return sum(-p * math.log(p, 2) for p in probs if p > 0)

def entropy_of_list(ls, value='S'):
    # Entropy of a list of target values; `value` only labels the subset.
    total_instances = len(ls)
    cnt = Counter(ls)
    probs = [c / total_instances for c in cnt.values()]
    return entropy(probs)

def information_gain(df, split_attribute, target_attribute, battr):
    # Reduction in entropy obtained by splitting df on split_attribute.
    df_split = df.groupby(split_attribute)
    glist = []                     # values taken by the split attribute
    for gname, group in df_split:
        glist.append(gname)
    df_agg1 = df_split.agg({target_attribute: lambda x: entropy_of_list(x, glist.pop())})
    df_agg2 = df_split.agg({target_attribute: lambda x: len(x) / len(df.index)})
    df_agg1.columns = ['Entropy']
    df_agg2.columns = ['Proportion']
    new_entropy = sum(df_agg1['Entropy'] * df_agg2['Proportion'])
    if battr != 'S':               # battr labels the parent branch ('S' = whole dataset)
        old_entropy = entropy_of_list(df[target_attribute], battr)
    else:
        old_entropy = entropy_of_list(df[target_attribute], 'S')
    return old_entropy - new_entropy

def id3(df, target_attribute, attribute_names, default_class=None, default_attr='S'):
    cnt = Counter(df[target_attribute])
    if len(cnt) == 1:                              # all instances share one class
        return next(iter(cnt))
    elif df.empty or not attribute_names:          # no data / no attributes left
        return default_class
    else:
        default_class = max(cnt.keys())
        gainz = []
        for attr in attribute_names:
            ig = information_gain(df, attr, target_attribute, default_attr)
            gainz.append(ig)
        index_of_max = gainz.index(max(gainz))
        best_attr = attribute_names[index_of_max]  # attribute with the highest gain
        tree = {best_attr: {}}
        remaining = [i for i in attribute_names if i != best_attr]
        for attr_val, data_subset in df.groupby(best_attr):
            subtree = id3(data_subset, target_attribute, remaining, default_class, best_attr)
            tree[best_attr][attr_val] = subtree
        return tree

tree = id3(df, t, attribute_names)
pprint(tree)
The code above implements a decision tree classifier using the ID3 (Iterative Dichotomiser 3)
algorithm. The numbered steps below walk through what each part of the code does.
1. Loading the Data:
df = pd.read_csv('1d3.csv')
This loads the data from the CSV file 1d3.csv into a DataFrame, and prints the dataset.
2. Target and Attribute Selection:
The target attribute (i.e., the attribute to be predicted) is the last column of the dataset:
t = df.keys()[-1]
The remaining columns are the attributes that are used to predict the target:
attribute_names = list(df.keys())
attribute_names.remove(t)
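For example, with hypothetical column names (the actual columns depend on what 1d3.csv contains):

# df.keys()       -> Index(['Outlook', 'Temperature', 'Humidity', 'Wind', 'Play'])
# t               -> 'Play'  (the target attribute)
# attribute_names -> ['Outlook', 'Temperature', 'Humidity', 'Wind']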
3. Entropy Calculation:
The function entropy(probs) calculates the entropy of a set of probabilities. The entropy is a
measure of the uncertainty or impurity of a set of classes. A lower entropy means the set is more
pure.
def entropy(probs):
    return sum(-p * math.log(p, 2) for p in probs if p > 0)
The function entropy_of_list(ls, value) calculates the entropy of a list of target values for a
particular attribute value.
def entropy_of_list(ls, value='S'):
    total_instances = len(ls)
    cnt = Counter(ls)                                     # count of each class label in ls
    probs = [c / total_instances for c in cnt.values()]   # class probabilities
    return entropy(probs)
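For example, assuming the definitions above, a perfectly mixed set has entropy 1 and a purer set
has a lower value (the class counts here are made up for illustration):

# Entropy(S) = -sum_i p_i * log2(p_i), where p_i is the fraction of instances in class i
print(entropy([9/14, 5/14]))                         # ~0.940 (9 of one class, 5 of the other)
print(entropy_of_list(['Yes', 'Yes', 'No', 'No']))   # 1.0    (perfectly mixed)
print(entropy_of_list(['Yes', 'Yes', 'Yes', 'No']))  # ~0.811 (purer, so lower entropy)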
4. Information Gain:
The Information Gain (IG) is used to determine which attribute to split on at each step of the tree.
It measures the reduction in entropy when splitting the data based on a particular attribute. The
function information_gain() calculates the information gain of splitting the dataset on a specific
attribute:
def information_gain(df, split_attribute, target_attribute, battr):
    df_split = df.groupby(split_attribute)
    glist = []                     # values taken by the split attribute
    for gname, group in df_split:
        glist.append(gname)
    # Entropy and proportion of each subset produced by the split
    df_agg1 = df_split.agg({target_attribute: lambda x: entropy_of_list(x, glist.pop())})
    df_agg2 = df_split.agg({target_attribute: lambda x: len(x) / len(df.index)})
    df_agg1.columns = ['Entropy']
    df_agg2.columns = ['Proportion']
    new_entropy = sum(df_agg1['Entropy'] * df_agg2['Proportion'])
    if battr != 'S':               # battr labels the parent branch ('S' = whole dataset)
        old_entropy = entropy_of_list(df[target_attribute], battr)
    else:
        old_entropy = entropy_of_list(df[target_attribute], 'S')
    return old_entropy - new_entropy
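As a quick sanity check (assuming the definitions above; the DataFrame and column names here are
made up for illustration, not taken from 1d3.csv):

# Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
toy = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain'],
    'Play':    ['No',    'No',    'Yes',      'Yes',  'No'],
})
print(information_gain(toy, 'Outlook', 'Play', 'S'))   # ~0.571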
5. ID3 Algorithm:
The main function, id3(), implements the ID3 algorithm for building a decision tree. This function
works recursively to split the data based on the attribute that maximizes the information gain.
def id3(df, target_attribute, attribute_names, default_class=None, default_attr='S'):
    cnt = Counter(df[target_attribute])
    if len(cnt) == 1:                              # all instances share one class
        return next(iter(cnt))
    elif df.empty or not attribute_names:          # no data / no attributes left
        return default_class
    else:
        default_class = max(cnt.keys())
        gainz = []
        for attr in attribute_names:
            ig = information_gain(df, attr, target_attribute, default_attr)
            gainz.append(ig)
        index_of_max = gainz.index(max(gainz))
        best_attr = attribute_names[index_of_max]  # attribute with the highest gain
        tree = {best_attr: {}}
        remaining = [i for i in attribute_names if i != best_attr]
        for attr_val, data_subset in df.groupby(best_attr):
            subtree = id3(data_subset, target_attribute, remaining, default_class, best_attr)
            tree[best_attr][attr_val] = subtree
        return tree
1. Base Case: If all the instances in the dataset belong to the same class (e.g., all "YES" or all
"NO"), return that class.
2. Base Case: If the dataset is empty or there are no attributes left to split on, return the
default class.
3. Choose the Best Attribute: Calculate the information gain for each attribute, and choose
the attribute with the highest information gain to split on.
4. Recursive Case: Split the dataset based on the chosen attribute and recursively build
subtrees for each subset of data.
After running the id3() function, the decision tree is printed using the pprint() function for a better
visual representation:
tree = id3(df, t, attribute_names)
pprint(tree)
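For instance, if 1d3.csv held the classic 14-row play-tennis dataset (an assumption; the lab's
data may differ), the printed tree would look something like:

{'Outlook': {'Overcast': 'Yes',
             'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
             'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}}}}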
Conclusion:
The code will generate a decision tree based on the ID3 algorithm. The tree will be constructed
step-by-step by evaluating which attribute (from the available ones) provides the highest
information gain, and then recursively applying this process to each subset of the data.
The result is a hierarchical decision tree structure (a nested dictionary) that can be used to
make predictions from the attribute values of a new instance, for example with a small helper
like the sketch below.
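The helper below is not part of the lab program; it is a hypothetical sketch of how the nested
dictionary returned by id3() could be used to classify a new instance, assuming the instance is
given as a dict mapping attribute names to values:

def classify(instance, tree, default=None):
    attr = next(iter(tree))                    # attribute tested at this node
    branch = tree[attr].get(instance.get(attr))
    if branch is None:                         # attribute value not seen during training
        return default
    if isinstance(branch, dict):               # internal node: keep descending
        return classify(instance, branch, default)
    return branch                              # leaf: the predicted class label

# e.g. classify({'Outlook': 'Sunny', 'Humidity': 'High'}, tree)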