C2 W4 Decision Tree With Markdown
December 3, 2024
## Outline
- 1 - Packages
- 2 - Problem Statement
- 3 - Dataset
  - 3.1 One hot encoded dataset
- 4 - Decision Tree Refresher
  - 4.1 Calculate entropy
    - Exercise 1
  - 4.2 Split dataset
    - Exercise 2
  - 4.3 Calculate information gain
    - Exercise 3
  - 4.4 Get best split
    - Exercise 4
- 5 - Building the tree
## 1 - Packages
First, let’s run the cell below to import all the packages that you will need during this assignment.
- numpy is the fundamental package for working with matrices in Python.
- matplotlib is a famous library to plot graphs in Python.
- utils.py contains helper functions for this assignment. You do not need to modify the code in this file.
[1]: import numpy as np
import matplotlib.pyplot as plt
from public_tests import *
from utils import *
%matplotlib inline
## 2 - Problem Statement
Suppose you are starting a company that grows and sells wild mushrooms.
- Since not all mushrooms are edible, you’d like to be able to tell whether a given mushroom is edible or poisonous based on its physical attributes.
- You have some existing data that you can use for this task.

Can you use the data to help you identify which mushrooms can be sold safely?

Note: The dataset used is for illustrative purposes only. It is not meant to be a guide on identifying edible mushrooms.
## 3 - Dataset
You will start by loading the dataset for this task. The dataset you have collected is as follows:
Therefore,
- X_train contains three features for each example
  - Brown Color (a value of 1 indicates “Brown” cap color and 0 indicates “Red” cap color)
  - Tapering Shape (a value of 1 indicates “Tapering” stalk shape and 0 indicates “Enlarging” stalk shape)
  - Solitary (a value of 1 indicates “Yes” and 0 indicates “No”)
- y_train is whether the mushroom is edible
  - y = 1 indicates edible
  - y = 0 indicates poisonous
[2]: X_train = np.array(
[
[1, 1, 1],
[1, 0, 1],
[1, 0, 0],
[1, 0, 0],
[1, 1, 1],
[0, 1, 1],
[0, 0, 0],
[1, 0, 1],
[0, 1, 0],
[1, 0, 0],
]
)
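The later cells also use a label array y_train. A placeholder consistent with the problem statement (5 edible and 5 poisonous mushrooms; the exact per-row labels here are an assumption for illustration) would look like:

```python
# Hypothetical labels for the ten examples above: 1 = edible, 0 = poisonous
# (the ordering is assumed for illustration; it is chosen so that 5 examples are edible)
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])
```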
View the variables

Let’s get more familiar with your dataset.
- A good place to start is to just print out each variable and see what it contains.

The code below prints the first few elements of X_train and the type of the variable.
[3]: print("First few elements of X_train:\n", X_train[:5])
print("Type of X_train:", type(X_train))
Check the dimensions of your variables

Another useful way to get familiar with your data is to view its dimensions.
Please print the shape of X_train and y_train and see how many training examples you have in
your dataset.
[5]: print("The shape of X_train is:", X_train.shape)
print("The shape of y_train is: ", y_train.shape)
print("Number of training examples (m):", len(X_train))
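### Exercise 1

Recall the definition of entropy that compute_entropy() should implement. For a node where a fraction $p_1$ of the examples are edible ($y = 1$),

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$

with the convention that $0 \log_2(0) = 0$, so a node that contains only edible or only poisonous examples has entropy $0$.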
Please complete the compute_entropy() function using the previous instructions.
If you get stuck, you can check out the hints presented after the cell below to help you with the
implementation.
[6]: # UNQ_C1
# GRADED FUNCTION: compute_entropy
def compute_entropy(y):
    """
    Computes the entropy for the examples at a node

    Args:
        y (ndarray): Numpy array indicating whether each example at a node is
            edible (`1`) or poisonous (`0`)

    Returns:
        entropy (float): Entropy at that node
    """
    # You need to return the following variables correctly
    entropy = 0.0

    # Your code here to compute p1 (the fraction of edible examples) and the entropy

    return entropy
If you’re still stuck, you can check the hints presented below to figure out how to calculate
p1 and entropy.
Hint to calculate p1: you can compute it as `p1 = len(y[y == 1]) / len(y)`

Hint to calculate entropy: you can compute it as `entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)`
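Putting the hints together, one way to complete the graded function is sketched below (this is a sketch, not necessarily the reference solution; the explicit handling of empty nodes and of $p_1 \in \{0, 1\}$ is an added safeguard to avoid taking $\log_2(0)$):

```python
def compute_entropy(y):
    """Entropy of a node whose labels are 1 (edible) or 0 (poisonous)."""
    if len(y) == 0:
        # An empty node carries no information
        return 0.0

    # Fraction of edible examples at this node
    p1 = len(y[y == 1]) / len(y)

    # By convention 0 * log2(0) = 0, so a pure node has zero entropy
    if p1 == 0 or p1 == 1:
        return 0.0

    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
```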
You can check if your implementation was correct by running the following test code:
[7]: # Compute entropy at the root node (i.e. with all examples)
# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1
print("Entropy at root node: ", compute_entropy(y_train))

# UNIT TESTS
compute_entropy_test(compute_entropy)
For example, if you split the root node (all ten examples) on feature 0 (Brown Cap):
- The output of the function is then left_indices = [0, 1, 2, 3, 4, 7, 9] and right_indices = [5, 6, 8]
### Exercise 2
Please complete the split_dataset() function shown below
• For each index in node_indices
– If the value of X at that index for that feature is 1, add the index to left_indices
– If the value of X at that index for that feature is 0, add the index to right_indices
If you get stuck, you can check out the hints presented after the cell below to help you with the
implementation.
[8]: # UNQ_C2
# GRADED FUNCTION: split_dataset
def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into left and right branches

    Args:
        X (ndarray): Data matrix of shape(n_samples, n_features)
        node_indices (list): List containing the active indices. I.e, the samples being considered at this step.
        feature (int): Index of feature to split on

    Returns:
        left_indices (list): Indices with feature value == 1
        right_indices (list): Indices with feature value == 0
    """
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []

    # Your code here to append each index in node_indices to left_indices or
    # right_indices depending on the value of X at that index for this feature

    return left_indices, right_indices
[9]: # The root node contains all ten examples
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Split on feature 0 (Brown Cap); feature can be 0, 1, or 2
feature = 0
left_indices, right_indices = split_dataset(X_train, root_indices, feature)
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# UNIT TESTS
split_dataset_test(split_dataset)
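One straightforward way to fill in split_dataset(), following the bullet points in Exercise 2 above (a sketch, not necessarily the reference solution), is:

```python
def split_dataset(X, node_indices, feature):
    """Splits the examples at a node on the given feature (value 1 -> left, value 0 -> right)."""
    left_indices = []
    right_indices = []

    for i in node_indices:
        # Examples where the feature is 1 go to the left branch, the rest to the right
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)

    return left_indices, right_indices
```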
Next, you will compute the information gain from splitting a node on a given feature:

$$\text{Information Gain} = H(p_1^{\text{node}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$$

where
- $H(p_1^{\text{node}})$ is entropy at the node
- $H(p_1^{\text{left}})$ and $H(p_1^{\text{right}})$ are the entropies at the left and the right branches resulting from the split
- $w^{\text{left}}$ and $w^{\text{right}}$ are the proportions of examples at the left and right branches, respectively
Note:
- You can use the compute_entropy() function that you implemented above to calculate the entropy
- We’ve provided some starter code that uses the split_dataset() function you implemented above to split the dataset
If you get stuck, you can check out the hints presented after the cell below to help you with the
implementation.
[10]: # UNQ_C3
# GRADED FUNCTION: compute_information_gain
def compute_information_gain(X, y, node_indices, feature):
    """
    Computes the information gain of splitting the node on a given feature

    Args:
        X (ndarray): Data matrix of shape(n_samples, n_features)
        y (array like): list or ndarray with n_samples containing the target variable
        node_indices (list): List containing the active indices. I.e, the samples being considered at this step.
        feature (int): Index of feature to split on

    Returns:
        information_gain (float): Information gain from splitting on this feature
    """
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]

    # Entropy at the node and at each branch
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)

    # Weights
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)

    # Weighted entropy
    weighted_entropy = w_left * left_entropy + w_right * right_entropy

    # Information gain
    information_gain = node_entropy - weighted_entropy

    return information_gain
### BEGINNING SOLUTION
# Your code here to compute the entropy at the node using compute_entropy()
node_entropy =
# Your code here to compute the entropy at the left branch
left_entropy =
# Your code here to compute the entropy at the right branch
right_entropy =
# Your code here to compute the proportion of examples at the left branch
w_left =
# Your code here to compute the proportion of examples at the right branch
w_right =
# Your code here to compute weighted entropy from the split using
# w_left, w_right, left_entropy and right_entropy
weighted_entropy =
# Your code here to compute the information gain as the entropy at the node
# minus the weighted entropy
information_gain =
### ENDING SOLUTION
return information_gain
If you’re still stuck, check out the hints below.
Hint to calculate the entropies: `node_entropy = compute_entropy(y_node)`, `left_entropy = compute_entropy(y_left)`, `right_entropy = compute_entropy(y_right)`

Hint to calculate w_left and w_right: `w_left = len(X_left) / len(X_node)` and `w_right = len(X_right) / len(X_node)`

Hint to calculate weighted_entropy: `weighted_entropy = w_left * left_entropy + w_right * right_entropy`

Hint to calculate information_gain: `information_gain = node_entropy - weighted_entropy`
You can now check your implementation using the cell below and calculate what the information gain would be from splitting on each of the features:

[11]: info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)
print("Information Gain from splitting the root on brown cap: ", info_gain0)

info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)
print("Information Gain from splitting the root on tapering stalk shape: ", info_gain1)

info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)
print("Information Gain from splitting the root on solitary: ", info_gain2)
# UNIT TESTS
compute_information_gain_test(compute_information_gain)
### Exercise 4

Please complete the get_best_split() function shown below to return the feature that gives the maximum information gain when the node is split on it.

[12]: # UNQ_C4
# GRADED FUNCTION: get_best_split
def get_best_split(X, y, node_indices):
    """
    Returns the optimal feature to split the node data

    Args:
        X (ndarray): Data matrix of shape(n_samples, n_features)
        y (array like): list or ndarray with n_samples containing the target variable
        node_indices (list): List containing the active indices. I.e, the samples being considered at this step.

    Returns:
        best_feature (int): The index of the best feature to split
    """
    # You need to return the following variable correctly
    best_feature = -1

    # Your code here: for each feature, compute the information gain from splitting
    # on this feature (info_gain) and keep track of the feature with the maximum gain

    return best_feature
If you’re still stuck, check out the hints below.
Hint to calculate info_gain: `info_gain = compute_information_gain(X, y, node_indices, feature)`

Hint to update max_info_gain and best_feature: `max_info_gain = info_gain` and `best_feature = feature`
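Combining these hints, a sketch of get_best_split() could look like the following (returning best_feature = -1 when no feature yields positive information gain is an assumption of this sketch):

```python
def get_best_split(X, y, node_indices):
    """Returns the index of the feature with the highest information gain at this node."""
    num_features = X.shape[1]

    best_feature = -1
    max_info_gain = 0

    for feature in range(num_features):
        # Information gain from splitting this node on the current feature
        info_gain = compute_information_gain(X, y, node_indices, feature)

        # Keep the feature with the largest information gain seen so far
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature

    return best_feature
```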
Now, let’s check the implementation of your function using the cell below.
[13]: best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)
# UNIT TESTS
get_best_split_test(get_best_split)
"""
Build a tree using the recursive algorithm that split the dataset into 2␣
↪subgroups at each node.
Args:
X (ndarray): Data matrix of shape(n_samples, n_features)
y (array like): list or ndarray with n_samples containing the␣
↪target variable
14
current_depth (int): Current depth. Parameter used during recursive␣
↪call.
"""
return
# continue splitting the left and the right child. Increment current depth
build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth +␣
↪1)
[15]: build_tree_recursive(
X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0
)
[16]: import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
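The cell above only shows the imports. A sketch of how scikit-learn's DecisionTreeClassifier could be fit to the same data and visualized for comparison (using entropy as the split criterion and max_depth=2 to mirror the tree built above; the feature and class names passed to plot_tree are illustrative labels) is:

```python
# Fit a scikit-learn decision tree on the same data for comparison
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Visualize the fitted tree
plt.figure(figsize=(10, 6))
plot_tree(
    clf,
    feature_names=["Brown Cap", "Tapering Stalk Shape", "Solitary"],
    class_names=["Poisonous", "Edible"],
    filled=True,
)
plt.show()
```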