trees_classification.ipynb - Colab
For the sake of simplicity, we focus the discussion on the hyperparameter max_depth , which controls the maximal depth of the decision tree.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Palmer penguins dataset; read_csv already returns a DataFrame,
# so the pd.DataFrame call below simply makes a copy of it.
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguin_df = pd.DataFrame(penguins)
penguin_df.head(5)
|   | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 11/16/07 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
row_count = len(penguin_df)
row_count
342
First, we split the data into two subsets to investigate how trees predict values based on unseen data.
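The following cells rely on a few names (culmen_columns, target_column, palette, tab10_norm, data_train, data_test, target_train, target_test) whose defining cell is not visible in this export. The next cell is a minimal setup sketch of what is assumed; the exact column choices, color settings and random_state are assumptions rather than the original notebook's code.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Features and target used throughout the notebook.
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# Keep rows with both culmen measurements and shorten the species names to
# "Adelie", "Chinstrap" and "Gentoo" (the raw CSV stores the full names).
penguins = penguin_df.dropna(subset=culmen_columns).copy()
penguins[target_column] = penguins[target_column].str.split().str[0]

data, target = penguins[culmen_columns], penguins[target_column]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0  # random_state chosen arbitrarily here
)

# Colors shared by the scatter plots and the decision-boundary displays.
palette = ["tab:blue", "tab:green", "tab:orange"]
tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)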
In a previous notebook, we learnt that linear classifiers define a linear separation between classes using a linear combination of the input features. In our 2-dimensional feature space, this means that a linear classifier finds the oblique lines that best separate the classes. This is still true for multiclass problems, except that more than one line is fitted. We can use DecisionBoundaryDisplay to plot the decision boundaries learnt by the classifier.
from sklearn.linear_model import LogisticRegression

linear_model = LogisticRegression()
linear_model.fit(data_train, target_train)
LogisticRegression()
from sklearn.inspection import DecisionBoundaryDisplay

dbd = DecisionBoundaryDisplay.from_estimator(
linear_model,
data_train,
response_method="predict",
cmap="tab10",
norm=tab10_norm,
alpha=0.5,
)
sns.scatterplot(
data=penguins,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=palette,
)
# put the legend outside the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
_ = plt.title("Decision boundary using a logistic regression")
We see that the lines are combinations of the input features since they are not perpendicular to a specific axis. Indeed, this is due to the model parametrization that we saw in previous notebooks, i.e. it is controlled by the model's weights and intercept.
Besides, it seems that the linear model would be a good candidate for such a problem, as it gives a good accuracy.
linear_model.fit(data_train, target_train)
test_score = linear_model.score(data_test, target_test)
print(f"Accuracy of the LogisticRegression: {test_score:.2f}")
Unlike linear models, the decision rule for the decision tree is not controlled by a simple linear combination of weights and feature values.
Decision trees partition the feature space by considering a single feature at a time. The number of splits depends on both the
hyperparameters and the number of data points in the training set: the more flexible the hyperparameters and the larger the training set, the
more splits can be considered by the model.
As the number of adjustable components taking part in the decision rule changes with the training size, we say that decision trees are non-
parametric models.
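To build some intuition on this (a sketch, not part of the original notebook), we can fit trees with increasing max_depth on the same training set and look at how many leaves, i.e. regions of the feature space, each of them ends up with; the tree_demo name is only used for this illustration.
from sklearn.tree import DecisionTreeClassifier

# The deeper the tree is allowed to grow, the more regions it can carve out,
# up to the limit supported by the training data.
for depth in [1, 2, 4, 8, None]:
    tree_demo = DecisionTreeClassifier(max_depth=depth)
    tree_demo.fit(data_train, target_train)
    print(f"max_depth={depth}: {tree_demo.get_n_leaves()} leaves")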
Let's now visualize the shape of the decision boundary of a decision tree when we set the max_depth hyperparameter to only allow for a
single split to partition the feature space.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=1)
tree.fit(data_train, target_train)
DecisionTreeClassifier(max_depth=1)
DecisionBoundaryDisplay.from_estimator(
tree,
data_train,
response_method="predict",
cmap="tab10",
norm=tab10_norm,
alpha=0.5,
)
sns.scatterplot(
data=penguins,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=palette,
)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
_ = plt.title("Decision boundary using a decision tree")
The partition found by the algorithm separates the data along the "Culmen Length" axis, discarding the "Culmen Depth" feature. This highlights that a decision tree does not use a combination of features when making a single split. We can look at the tree structure in more depth.
from sklearn.tree import plot_tree

_, ax = plt.subplots(figsize=(8, 6))
_ = plot_tree(
tree,
feature_names=culmen_columns,
class_names=tree.classes_.tolist(),
impurity=False,
ax=ax,
)
Tip
We use plt.subplots(figsize=(8, 6)) to create a figure and an axis with a specific size. We then pass the axis to the sklearn.tree.plot_tree function so that the drawing happens in this axis.
We see that the split was done on the culmen length feature. The original dataset was subdivided into 2 subsets depending on whether the culmen length is below or above 42.35 mm. We can double-check this threshold programmatically, as sketched below.
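The following snippet is a small sketch (not part of the original notebook) that reads the root split from the fitted tree structure; tree_.feature and tree_.threshold are attributes of scikit-learn's underlying Tree object.
# Inspect the root node (index 0) of the fitted tree: `feature` holds the
# index of the split feature and `threshold` the split value.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
print(f"Root split: {culmen_columns[root_feature]} <= {root_threshold:.2f}")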
This partition of the dataset minimizes the class diversity in each sub-partition. The measure of class diversity is known as the criterion, and it is a settable parameter of the model.
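As an illustration (a sketch, not part of the original notebook), the criterion can be chosen when constructing the classifier; "gini" is the default and "entropy" is a common alternative. The tree_entropy name below is used only for this example.
# Same depth-1 tree, but splits are chosen by minimizing entropy instead of
# the default Gini impurity.
tree_entropy = DecisionTreeClassifier(max_depth=1, criterion="entropy")
tree_entropy.fit(data_train, target_train)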
If we look more closely at the tree, the root node contains the whole training set: 107 "Adelie", 55 "Chinstrap" and 94 "Gentoo" individuals. The partition to the left of 42.35 mm contains mainly "Adelie" samples. We can make a similar interpretation for the partition defined by the threshold, to the right of 42.35 mm: in this case, the most represented class is the "Gentoo" species.
Let's see how our tree would work as a predictor. Let's start with a case where the culmen length is below the threshold.
test_penguin_1 = pd.DataFrame(
{"Culmen Length (mm)": [41], "Culmen Depth (mm)": [0]}
)
tree.predict(test_penguin_1)
The predicted class is "Adelie". We can now check what happens if we pass a culmen length above the threshold.
test_penguin_2 = pd.DataFrame(
{"Culmen Length (mm)": [43], "Culmen Depth (mm)": [0]}
)
tree.predict(test_penguin_2)
Thus, we can conclude that a decision tree classifier predicts the most represented class within a partition.
Since, during training, we have the count of samples in each partition, we can also compute the probability of belonging to a specific class within that partition.
y_pred_proba = tree.predict_proba(test_penguin_2)
y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)
y_pred_proba
y_proba_class_0
Adelie       0.069182
Chinstrap    0.345912
Gentoo       0.584906
dtype: float64
y_proba_class_0.plot.bar()
plt.ylabel("Probability")
_ = plt.title("Probability to belong to a penguin class")
We can also compute the different probabilities manually directly from the tree structure.
# Class counts in the partition to the right of the threshold:
# 11 "Adelie", 55 "Chinstrap" and 93 "Gentoo" out of 159 training samples.
adelie_proba = 11 / 159
chinstrap_proba = 55 / 159
gentoo_proba = 93 / 159
print(
"Probabilities for the different classes:\n"
f"Adelie: {adelie_proba:.3f}\n"
f"Chinstrap: {chinstrap_proba:.3f}\n"
f"Gentoo: {gentoo_proba:.3f}\n"
)
It is also important to note that the culmen depth has been disregarded so far. This means that, whatever its value, it is not used during the prediction.
test_penguin_3 = pd.DataFrame(
{"Culmen Length (mm)": [40], "Culmen Depth (mm)": [18]}
)
tree.predict_proba(test_penguin_3)
array([[0.98969072, 0. , 0.01030928]])
Going back to our classification problem, the split found with a maximum depth of 1 is not powerful enough to separate the three species, and the model accuracy is low compared to the linear model.
tree.fit(data_train, target_train)
test_score = tree.score(data_test, target_test)
print(f"Accuracy of the DecisionTreeClassifier: {test_score:.2f}")
This is not a surprise. We saw earlier that a single split on a single feature is not able to separate all three species: the model underfits. However, the previous analysis showed that by using both features we should be able to get fairly good results.
In the next exercise, you will increase the tree depth to get an intuition on how such a parameter affects the space partitioning.
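As a small preview (a sketch under the same assumptions as above, not the exercise solution), increasing max_depth simply allows the tree to make further splits, possibly on the second feature:
# Allow one more level of splits and evaluate on the held-out test set.
deeper_tree = DecisionTreeClassifier(max_depth=2)
deeper_tree.fit(data_train, target_train)
print(f"Accuracy with max_depth=2: {deeper_tree.score(data_test, target_test):.2f}")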