trees_classification.ipynb - Colab
For the sake of simplicity, we focus the discussion on the hyperparameter max_depth , which controls the maximal depth of the decision tree.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Palmer penguins dataset; read_csv already returns a DataFrame,
# so the pd.DataFrame call below simply makes a copy of it.
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguin_df = pd.DataFrame(penguins)
penguin_df.head(5)
|   | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 11/16/07 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
row_count = len(penguin_df)
row_count
342
First, we split the data into two subsets to investigate how trees predict values based on unseen data.
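The following cells rely on a few names (culmen_columns, target_column, palette, tab10_norm, data_train, data_test, target_train, target_test) whose defining cell is not visible in this export. The next cell is a minimal setup sketch of what is assumed; the exact column choices, color settings and random_state are assumptions rather than the original notebook's code.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Features and target used throughout the notebook.
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# Keep rows with both culmen measurements and shorten the species names to
# "Adelie", "Chinstrap" and "Gentoo" (the raw CSV stores the full names).
penguins = penguin_df.dropna(subset=culmen_columns).copy()
penguins[target_column] = penguins[target_column].str.split().str[0]

data, target = penguins[culmen_columns], penguins[target_column]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0  # random_state chosen arbitrarily here
)

# Colors shared by the scatter plots and the decision-boundary displays.
palette = ["tab:blue", "tab:green", "tab:orange"]
tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)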
In a previous notebook, we learnt that linear classifiers define a linear separation between classes using a linear combination of the input features. In our 2-dimensional feature space, this means that a linear classifier finds the oblique lines that best separate the classes. This is still true for multiclass problems, except that more than one line is fitted. We can use DecisionBoundaryDisplay to plot the decision boundaries learnt by the classifier.
from sklearn.linear_model import LogisticRegression

linear_model = LogisticRegression()
linear_model.fit(data_train, target_train)
LogisticRegression()
from sklearn.inspection import DecisionBoundaryDisplay

dbd = DecisionBoundaryDisplay.from_estimator(
linear_model,
data_train,
response_method="predict",
cmap="tab10",
norm=tab10_norm,
alpha=0.5,
)
sns.scatterplot(
data=penguins,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=palette,
)
# put the legend outside the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
_ = plt.title("Decision boundary using a logistic regression")
We see that the lines are combinations of the input features since they are not perpendicular to a specific axis. Indeed, this is due to the model parametrization that we saw in previous notebooks, i.e. it is controlled by the model's weights and intercept.
Besides, it seems that the linear model would be a good candidate for such a problem, as it gives a good accuracy.
linear_model.fit(data_train, target_train)
test_score = linear_model.score(data_test, target_test)
print(f"Accuracy of the LogisticRegression: {test_score:.2f}")
Unlike linear models, the decision rule for the decision tree is not controlled by a simple linear combination of weights and feature values.
Decision trees partition the feature space by considering a single feature at a time. The number of splits depends on both the
hyperparameters and the number of data points in the training set: the more flexible the hyperparameters and the larger the training set, the
more splits can be considered by the model.
As the number of adjustable components taking part in the decision rule changes with the training size, we say that decision trees are non-
parametric models.
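To build some intuition on this (a sketch, not part of the original notebook), we can fit trees with increasing max_depth on the same training set and look at how many leaves, i.e. regions of the feature space, each of them ends up with; the tree_demo name is only used for this illustration.
from sklearn.tree import DecisionTreeClassifier

# The deeper the tree is allowed to grow, the more regions it can carve out,
# up to the limit supported by the training data.
for depth in [1, 2, 4, 8, None]:
    tree_demo = DecisionTreeClassifier(max_depth=depth)
    tree_demo.fit(data_train, target_train)
    print(f"max_depth={depth}: {tree_demo.get_n_leaves()} leaves")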
Let's now visualize the shape of the decision boundary of a decision tree when we set the max_depth hyperparameter to only allow for a
single split to partition the feature space.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=1)
tree.fit(data_train, target_train)
DecisionTreeClassifier(max_depth=1)
DecisionBoundaryDisplay.from_estimator(
tree,
data_train,
response_method="predict",
cmap="tab10",
norm=tab10_norm,
alpha=0.5,
)
sns.scatterplot(
data=penguins,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=palette,
)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
_ = plt.title("Decision boundary using a decision tree")
The partition found by the algorithm separates the data along the "Culmen Length" axis, discarding the "Culmen Depth" feature. This highlights that a decision tree does not use a combination of features when making a single split. We can look at the tree structure in more depth.
from sklearn.tree import plot_tree

_, ax = plt.subplots(figsize=(8, 6))
_ = plot_tree(
tree,
feature_names=culmen_columns,
class_names=tree.classes_.tolist(),
impurity=False,
ax=ax,
)
Tip
We use plt.subplots(figsize=(8, 6)) to create a figure and an axis with a specific size. We then pass the axis to the sklearn.tree.plot_tree function so that the drawing happens in this axis.
We see that the split was done on the culmen length feature. The original dataset was subdivided into 2 subsets depending on whether the culmen length is below or above 42.35 mm. We can double-check this threshold programmatically, as sketched below.
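The following snippet is a small sketch (not part of the original notebook) that reads the root split from the fitted tree structure; tree_.feature and tree_.threshold are attributes of scikit-learn's underlying Tree object.
# Inspect the root node (index 0) of the fitted tree: `feature` holds the
# index of the split feature and `threshold` the split value.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
print(f"Root split: {culmen_columns[root_feature]} <= {root_threshold:.2f}")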
This partition of the dataset minimizes the class diversity in each sub-partition. The measure of class diversity is known as the criterion, and it is a settable parameter of the model.
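As an illustration (a sketch, not part of the original notebook), the criterion can be chosen when constructing the classifier; "gini" is the default and "entropy" is a common alternative. The tree_entropy name below is used only for this example.
# Same depth-1 tree, but splits are chosen by minimizing entropy instead of
# the default Gini impurity.
tree_entropy = DecisionTreeClassifier(max_depth=1, criterion="entropy")
tree_entropy.fit(data_train, target_train)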
If we look more closely at the tree, the root node contains the whole training set: 107 "Adelie", 55 "Chinstrap" and 94 "Gentoo" individuals. The partition to the left of 42.35 mm contains mainly "Adelie" samples. We can make a similar interpretation for the partition defined by the threshold, to the right of 42.35 mm: in this case, the most represented class is the "Gentoo" species.
Let's see how our tree would work as a predictor. Let's start with a case where the culmen length is below the threshold.
test_penguin_1 = pd.DataFrame(
{"Culmen Length (mm)": [41], "Culmen Depth (mm)": [0]}
)
tree.predict(test_penguin_1)
The predicted class is "Adelie". We can now check what happens if we pass a culmen length above the threshold.
test_penguin_2 = pd.DataFrame(
{"Culmen Length (mm)": [43], "Culmen Depth (mm)": [0]}
)
tree.predict(test_penguin_2)
Thus, we can conclude that a decision tree classifier predicts the most represented class within a partition.
Since, during training, we have the count of samples in each partition, we can also compute the probability of belonging to a specific class within that partition.
y_pred_proba = tree.predict_proba(test_penguin_2)
y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)
y_pred_proba
y_proba_class_0
Adelie       0.069182
Chinstrap    0.345912
Gentoo       0.584906
dtype: float64
y_proba_class_0.plot.bar()
plt.ylabel("Probability")
_ = plt.title("Probability to belong to a penguin class")
We can also compute the different probabilities manually directly from the tree structure.
# Class counts in the partition to the right of the threshold:
# 11 "Adelie", 55 "Chinstrap" and 93 "Gentoo" out of 159 training samples.
adelie_proba = 11 / 159
chinstrap_proba = 55 / 159
gentoo_proba = 93 / 159
print(
"Probabilities for the different classes:\n"
f"Adelie: {adelie_proba:.3f}\n"
f"Chinstrap: {chinstrap_proba:.3f}\n"
f"Gentoo: {gentoo_proba:.3f}\n"
)
It is also important to note that the culmen depth has been disregarded so far. This means that, whatever its value, it is not used during the prediction.
test_penguin_3 = pd.DataFrame(
{"Culmen Length (mm)": [40], "Culmen Depth (mm)": [18]}
)
tree.predict_proba(test_penguin_3)
array([[0.98969072, 0. , 0.01030928]])
Going back to our classification problem, the split found with a maximum depth of 1 is not powerful enough to separate the three species, and the model accuracy is low compared to the linear model.
tree.fit(data_train, target_train)
test_score = tree.score(data_test, target_test)
print(f"Accuracy of the DecisionTreeClassifier: {test_score:.2f}")
This is not a surprise. We saw earlier that a single split on a single feature is not able to separate all three species: the model underfits. However, the previous analysis showed that by using both features we should be able to get fairly good results.
In the next exercise, you will increase the tree depth to get an intuition on how such a parameter affects the space partitioning.
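As a small preview (a sketch under the same assumptions as above, not the exercise solution), increasing max_depth simply allows the tree to make further splits, possibly on the second feature:
# Allow one more level of splits and evaluate on the held-out test set.
deeper_tree = DecisionTreeClassifier(max_depth=2)
deeper_tree.fit(data_train, target_train)
print(f"Accuracy with max_depth=2: {deeper_tree.score(data_test, target_test):.2f}")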