Lesson 5.0: Supervised Learning with Decision Trees

This document provides an overview of decision trees in supervised learning, explaining their structure, how they make predictions, and the methods used to control their complexity to prevent overfitting. It discusses the use of Gini impurity for node splitting, the importance of feature selection, and introduces ensemble methods like random forests and gradient boosted trees to enhance model performance. Additionally, it highlights the advantages and limitations of decision trees and ensemble methods in machine learning applications.


SUPERVISED LEARNING: DECISION TREES

INTRODUCTION

• Trees learn a set of if/else questions with answers and use that knowledge to make a decision.
• For example, suppose you have four animals (bears, hawks, penguins, and dolphins) with Yes/No values for features such as "has feathers", "can fly", and "has fins". You could ask all the questions in order to identify the animal.
• However, if you examine the questions closely, you realize that you only need to ask a few of them to tell the animals apart. This is learning the pattern from the rules.
• The presence of feathers narrows the possibilities down to just two animals, so you only need to ask one more question to know the animal. In some cases you only need to ask one question, e.g. "does the animal have fins?" These questions let us find the answer faster; they become the rules.
• Notice that the description of a lion would make the tree classify it as a bear. This is not accurate, but it is the closest match based on the data that we have.
• In machine learning we can provide an algorithm with data that acts as examples (hence supervised learning); it then learns the rules and represents them as a tree, which is the model we later use for prediction.
DECISION TREES IN SCIKIT-LEARN
• The data is usually represented as continuous features, as in a 2D dataset with two numeric features.
• To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable. For instance, splitting the dataset at x[1] = 0.0596 yields the most information.
• The split is made by testing whether x[1] <= 0.0596, indicated by a black line in the figure. If the test is true, a point is assigned to the left node, which contains 2 points belonging to class 0 and 32 points belonging to class 1. Otherwise the point is assigned to the right node.
• These two nodes correspond to the top and bottom regions shown in the diagram. Even though the first split did a good job of separating the two classes, the bottom region still contains points belonging to class 0, and the top region still contains points belonging to class 1.
• The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) contains only a single target value (a single class or a single regression value). A leaf whose data points all share the same target value is called pure.
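
Example (a minimal sketch of the workflow described above; the exact 2D dataset behind the 0.0596 threshold is not included in these slides, so a synthetic two-class dataset from make_moons is assumed for illustration):

# Fit an unpruned decision tree on a synthetic 2D two-class dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=0)    # keeps splitting until every leaf is pure
tree.fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))       # 1.0: all leaves are pure
print("Test accuracy:", tree.score(X_test, y_test))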

IRIS FLOWER

• We use the iris flower dataset to train the model.
• To visualize the tree we use the export_graphviz function from the tree module. It makes trees easier to analyze by colouring each node to reflect its majority class and by including the feature names.
• It creates a file in the .dot format for storing the graph. To view the file we use the graphviz library, which you might need to install.
• The sequence of if/else questions that gets us to the true answer most quickly (i.e. is most informative about the target variable) forms the rules. In this dataset, if the petal width is less than 0.8 cm we are dealing with setosa; there is no need to read any other feature to identify the setosa species, so this is picked as the first rule. The other classes need a few more rules.
• Notice that this data does not come in the form of binary yes/no features as in the animal example, but is instead continuous, so the rules are represented as tests such as >= or <= rather than =. The number of data points in each node and their class is indicated.
• Where multiple classes exist, recursive partitioning of the data continues until we find a node that contains a single class. Such a node is said to be pure and forms a rule. Our tree has 7 rules.
• A prediction on a new data point is made by traversing the tree from the root, checking which region the point lies in, and predicting the target of that region.
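
Example (a sketch of the training and visualization steps described above; the output file name tree.dot is an assumption):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz   # the graphviz Python package (and system binaries) may need to be installed

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# write the tree in .dot format, colouring nodes by majority class and labelling features
export_graphviz(tree, out_file="tree.dot", class_names=iris.target_names,
                feature_names=iris.feature_names, filled=True, impurity=False)

with open("tree.dot") as f:
    graph = graphviz.Source(f.read())   # render the .dot file (e.g. in a notebook)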

GINI IMPURITY

• There are different methods used to decide how to split on the features. Gini impurity is the simplest and therefore the most common; the alternative, entropy / information gain, is somewhat more complex.
• Gini impurity is a measurement used when building decision trees to determine how the features of a dataset should split the nodes that form the tree, i.e. it measures the likelihood that a data point in a node would be misclassified if it were labelled according to the node's class distribution. It thus measures the impurity level of a split.
• For a two-class problem it yields a number between 0 and 0.5. A small Gini value signifies less impurity and hence a better split. It is calculated as shown below, where:
  • D is the dataset (the samples in the node)
  • k is the number of classes
  • p_i is the probability (proportion) of samples belonging to class i
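
The formula itself appears only as an image in the original slides; in the notation above it is the standard Gini impurity:

Gini(D) = 1 - \sum_{i=1}^{k} p_i^2

where p_i is the proportion of samples in D belonging to class i.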

CALCULATING GINI IMPURITY

• Assume the algorithm picks the feature sepal length for a node. It creates a threshold (e.g. 5 cm) and then splits the data points into those whose attribute value is above the threshold and those that are below or equal to it.
• If a pattern exists, one group will tend to contain data points belonging to one class and the other group a different class. There will be some impurities, e.g. a few circle species in a group where triangles are the majority. We calculate this impurity for each group as follows:
• Gini impurity of the True group (sepal length > 5):
  • Probability of class circle = 4/6 = 0.67
  • Probability of class triangle = 2/6 = 0.33
  • Gini impurity for True = 1 - [(0.67 * 0.67) + (0.33 * 0.33)] = 0.44
  • Similarly, Gini impurity for False = 0.38
• The weighted Gini impurity over all groups of the feature (attribute) = 6/10 * 0.44 + 4/10 * 0.38 = 0.42 (a short Python sketch of this calculation follows this list).
• The process is repeated for other features, e.g. petal length > 5, and the feature with the lowest weighted Gini impurity is selected for further splitting. A node is pure when its Gini impurity is 0; this is when splitting stops.
• An alternative approach for finding the best split uses the highest information gain (entropy).
• It is also possible to use trees for regression tasks: traverse the tree and find the leaf the new data point falls into; the output for this data point is the mean target value of the training points in that leaf.
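
The weighted-Gini calculation above can be reproduced with a short Python sketch (the class counts for the True group come from the slide; the False group is represented only by its stated impurity of 0.38):

def gini(class_counts):
    # Gini impurity of one group: 1 minus the sum of squared class probabilities
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

g_true = gini([4, 2])                      # 4 circles, 2 triangles -> about 0.44
g_false = 0.38                             # as given on the slide
weighted = (6 / 10) * g_true + (4 / 10) * g_false
print(round(g_true, 2), round(weighted, 2))    # 0.44 0.42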

CONTROLLING COMPLEXITY OF DECISION TREES
• Typically, building a tree as described earlier and continuing until all leaves are pure leads to models that are very complex and therefore highly overfit to the training data.
• The presence of pure leaves means that a tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct class. The overfitting can be seen in a tree where a node belonging to a class with few data points sits in the middle of another class, a sign that the decision boundary is not very clear.
• There are two common strategies to prevent overfitting:
  • Pre-pruning: stopping the creation of the tree early
  • Post-pruning: removing nodes that contain little information

PRE-PRUNING

• scikit-learn historically implemented only pre-pruning for decision trees (recent versions also offer minimal cost-complexity post-pruning via ccp_alpha).
• Let's use a bigger dataset (the Breast Cancer dataset) to understand the effect of pre-pruning in more detail.
• When we do not pre-prune the tree, the accuracy on the training set is 100% because the leaves are pure. The tree has grown deep enough to obtain pure leaves and has therefore perfectly memorized all the labels of the training data.
• The test set accuracy is slightly worse than for the linear models we looked at previously, which had around 95% accuracy. This means the model did not generalize as well to new data, a sign of overfitting.
• To apply pre-pruning to the tree, we can limit its depth. For example, if we set max_depth=4, only four consecutive questions can be asked.
• Limiting the depth of the tree leads to lower accuracy on the training set (the leaves are no longer pure) but an improvement on the test set, so the model generalizes better (overfitting decreases).
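
Example (a sketch of the comparison described above, using the Breast Cancer dataset; the exact train/test split used in the slides is not given, so the random_state values are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)        # unpruned
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("Unpruned    train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("max_depth=4 train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))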

FEATURE IMPORTANCE

• However, even with a tree of depth four, the tree can still be large, so inspecting it node by node is tedious. Instead of looking at the whole tree, we can use summary measures of how the tree uses the features.
• If we visualize the tree with colour, we notice that the orange (malignant) samples are concentrated in one leaf, with the other leaves containing few samples. The most commonly used summary is feature importance, which rates how important each feature is for the decisions the tree makes.
• It provides a number between 0 and 1 for each feature, where 0 means "not used at all" and 1 means "perfectly predicts the target." The feature importances always sum to 1.
• When we plot them we see that the feature "worst radius" is the most important feature.
• A low feature importance does not mean that the feature is uninformative; in some cases it encodes the same information as another feature and was simply not picked by the tree.
• However, linear model coefficients tell us more: in addition to which features are important, they also tell us which class a feature points toward. With feature importances we can tell that "worst radius" is important, but not whether a high worst radius is indicative of a sample being benign or malignant.
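
Example (a sketch of plotting the feature importances; it assumes the pruned_tree and cancer objects from the pre-pruning sketch above):

import numpy as np
import matplotlib.pyplot as plt

importances = pruned_tree.feature_importances_    # one value per feature, summing to 1
order = np.argsort(importances)

plt.barh(range(len(importances)), importances[order])
plt.yticks(range(len(importances)), np.array(cancer.feature_names)[order])
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()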

DECISION TREES FOR REGRESSION
• Decision trees for regression work the same way, using the DecisionTreeRegressor estimator.
• NB: however, tree-based models are not able to extrapolate, i.e. make predictions outside the range of the training data (a short sketch follows this list).
• Decision tree classifiers have two advantages over many other algorithms:
  o The resulting model can easily be visualized and understood by non-experts (especially for smaller trees).
  o The algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.
• The main downside of decision tree classifiers is that, even with pre-pruning, they tend to overfit, as you don't always know how much pruning is needed to avoid overfitting. This can be addressed by combining multiple trees into an ensemble.
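
Example (a small sketch of DecisionTreeRegressor and its inability to extrapolate, using assumed toy data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)    # training inputs in [0, 10]
y_train = np.sin(X_train).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

X_new = np.array([[2.5], [15.0]])    # 15.0 lies outside the training range
print(reg.predict(X_new))            # for 15.0 the tree simply returns the mean of its last leaf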

ENSEMBLE METHODS OF DECISION TREES

• Ensembles are methods that combine multiple machine learning models to create more powerful models.
• Here we look at ensembles of trees. Two types perform well:
1. Random forests
2. Gradient boosted regression trees

RANDOM FORESTS

• Random forests use a collection of slightly different trees. The idea: if you build many trees that all work reasonably well but overfit in different ways, you can reduce the amount of overfitting by averaging their results.
• To implement this strategy, you need to build many different trees. Random forests inject randomness into their trees by randomly selecting the data points used to build each tree and by randomly selecting the features considered for each split.
• To build a random forest model, specify the number of trees to build using the n_estimators parameter. The larger the number of trees, the more robust the ensemble and the better it generalizes.
• The algorithm works as follows:
  o First, it takes a bootstrap sample of the data: data points are drawn repeatedly (with replacement) until the new dataset is as big as the original one, so some data points from the original dataset will be missing while others will be repeated.
  o Next, a decision tree is built on this newly created dataset. However, only a random subset of the features is considered when deciding how to split a node. The size of this subset is controlled by the max_features parameter, and each node uses a different random subset of the features.
• If you set max_features to all n features, no randomness is injected into the feature selection, making the trees similar (which increases overfitting); if you set max_features to 1, the splits have no choice of feature and can only search over different thresholds for the feature that was selected randomly (although this reduces overfitting). This value should therefore be controlled. (A good rule of thumb: max_features=sqrt(n_features) for classification and max_features=log2(n_features) for regression.)
• To make a prediction, the algorithm first makes a prediction with every tree in the forest. For classification, a "soft voting" strategy is used: the predicted probabilities of all trees are averaged and the class with the highest probability is predicted. For regression, the individual predictions are averaged to obtain the final prediction.
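
Example (a sketch of fitting a random forest of 100 trees on the Breast Cancer dataset; parameter values other than n_estimators are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 100 slightly different trees: each is built on a bootstrap sample and
# considers a random subset of the features at every split
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))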

PARAMETER SELECTION

• The random forest gives us an accuracy of about 96%, better than the linear models or a single decision tree, even without tuning any parameters. We could adjust the max_features setting to reduce overfitting; however, the default parameters of a random forest often already work quite well.

FEATURE IMPORTANCE

• Similarly to the decision tree, the random forest provides feature importances, which are computed by aggregating the feature importances of all its trees; these are typically more reliable than those provided by a single tree.
• Note that the random forest considers more features as important, and the importance levels of individual features may differ from those of a single tree. It therefore captures a much broader picture of the data than a single tree, which contributes to its improved accuracy.
• Random forests don't tend to perform well on high-dimensional, sparse data, such as text data. Linear models are recommended for such data.
• Random forests are not easy to explain to non-experts, and their random nature makes it harder to reproduce results exactly. In such cases you can opt for a single decision tree.
• Training random forests on large datasets can be time consuming, which may require using multiple CPU cores. Alternative ensembles of trees, such as gradient boosted trees, can address this.
• Where random forests seem to overfit, alternative ensembles such as gradient boosted trees may help.

GRADIENT BOOSTED REGRESSION TREES

• Despite the name, these models can be used for both regression and classification.
• There is no randomization; instead, strong pre-pruning is used. The trees are therefore shallow (depth one to five), which uses less memory and makes predictions faster.
• Unlike random forests, gradient boosted trees are built sequentially, with each tree trying to improve on the previous ones. Each shallow tree is a simple model (a weak learner) that provides good predictions on only part of the data, so more trees are added to iteratively improve performance.
• They perform very well and hence are very popular; however, they need more careful parameter tuning than random forests to perform that well.
• The main parameters are the pre-pruning depth (max_depth) and the number of trees (n_estimators), just like random forests, but there is another important parameter called learning_rate. It controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Adding more trees to the ensemble also increases the model's capacity to correct mistakes on the training set.
Example
• Let us apply GradientBoostingClassifier to the Breast Cancer dataset with 100 trees and a learning rate of 0.1.
• Without adjusting any parameters, the training set accuracy is 100%, a possible sign of overfitting. Limiting the maximum depth of the trees (e.g. max_depth=1) reduces overfitting, while lowering the learning rate increases the generalization performance slightly. We are able to reach similar accuracy levels to random forests (see the sketch after this list).
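
Example (a sketch of the gradient boosting experiment described above; the data split and random_state are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 100 shallow trees (max_depth=1) built sequentially, each correcting its predecessors
gbrt = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                  learning_rate=0.1, random_state=0)
gbrt.fit(X_train, y_train)
print("Train accuracy:", gbrt.score(X_train, y_train))
print("Test accuracy:", gbrt.score(X_test, y_test))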

SELECTING PARAMETERS

• When you view the feature importances of the gradient boosted trees, you will notice that they are somewhat similar to those of the random forest, though gradient boosting completely ignores some of the features.
• Because both gradient boosting and random forests perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly with little tuning.
• Improved implementations exist in other packages: e.g. the xgboost package can be better for large-scale problems because it is faster than the scikit-learn implementation of gradient boosted trees (a short sketch follows this list).
• Gradient boosted decision trees also do not work well on high-dimensional, sparse data.
• Increasing the number of trees in gradient boosted trees may lead to overfitting, so adjusting the learning_rate is often the better lever.
• When you adjust max_depth in gradient boosted trees, keep it low (less than 5).
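
Sketch of the xgboost alternative mentioned above (xgboost offers a scikit-learn-compatible estimator; the parameter values here are assumptions, and X_train/y_train/X_test/y_test are reused from the earlier examples):

from xgboost import XGBClassifier   # pip install xgboost

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))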
