Lesson 5.0 Supervised Learning with Decision Trees (1)
SUPERVISED
LEARNING
DECISION TREES
INTRODUCTION
IRIS FLOWER
GINI IMPURITY
CALCULATING GINI IMPURITY
• Assume the algorithm picks the feature sepal length for a node. It creates a threshold (e.g. 5 cm) and then splits the data points into those whose attribute value is above the threshold and those that are below or equal to it.
• If a pattern exists, then one group will tend to contain data points that belong to one class and the other group will tend to contain the other class. There will still be some impurities, e.g. a few circle species in a group where triangles are the majority. We calculate this impurity for each group as follows:
• Gini impurity for True (Sepal Length > 5):
• Probability of class circle = 4/6 = 0.67
• Probability of class triangle = 2/6 = 0.33
• Gini impurity for True = 1 - [(0.67 × 0.67) + (0.33 × 0.33)] = 0.44
• Similarly, Gini impurity for False = 0.38
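As a minimal sketch of this calculation in Python: the function below computes 1 minus the sum of squared class probabilities. The counts for the True branch (4 circles, 2 triangles) come from the slide; the counts for the False branch are not shown, so a 1-circle / 3-triangle split is assumed here because it reproduces the 0.38 value.

def gini_impurity(class_counts):
    # Gini impurity = 1 - sum of squared class probabilities
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

# True branch (Sepal Length > 5): 4 circles, 2 triangles -> ~0.44
print(gini_impurity([4, 2]))

# False branch: counts assumed as 1 circle, 3 triangles -> ~0.38
print(gini_impurity([1, 3]))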
CONTROLLING COMPLEXITY OF DECISION TREES
• Typically, building a tree as described earlier and continuing until all leaves are pure can lead to models that are very complex and thus highly overfit to the training data.
• The presence of pure leaves means that a tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct class. The overfitting can be seen on a tree where a leaf belonging to a class with only a few data points sits in the middle of a region of another class, a sign that the decision boundary is not very clear.
• There are two common strategies to prevent overfitting:
• Pre-pruning - stopping the creation of the tree early (a sketch is shown under Pre-pruning below)
• Post-pruning - removing nodes that contain little information
PRE-PRUNING
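As a minimal sketch of pre-pruning with scikit-learn's DecisionTreeClassifier: limiting max_depth stops the tree from growing until every leaf is pure. The Breast Cancer dataset is used here only as an assumed illustration; the slide does not specify a dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed illustration dataset (not specified on the slide)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grown until all leaves are pure -> 100% training accuracy
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: stop growing the tree at depth 4
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("max_depth=4", pruned_tree)]:
    print(name, "train:", tree.score(X_train, y_train),
          "test:", tree.score(X_test, y_test))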
FEATURE IMPORTANCE
DECISION TREES FOR REGRESSION
• Decision trees for regression work the same way, using the DecisionTreeRegressor algorithm.
• NB: However, these models are not able to extrapolate, i.e. make predictions outside of the range of the training data (see the sketch after this list).
• Decision tree classifiers have two advantages over many other algorithms:
o The resulting model can easily be visualized and understood by non-experts (especially for smaller trees).
o The algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.
• The main downside of decision tree classifiers is that, even with the use of pre-pruning, they tend to overfit, as you don't always know how much pruning to do to avoid overfitting. This can be addressed by combining many trees into an ensemble.
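A minimal sketch of the extrapolation limitation, using synthetic data (an assumption, not taken from the slides): outside the training range a DecisionTreeRegressor can only repeat the value of its nearest leaf, so all extrapolated predictions are the same constant.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumed example)
rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 10, 60)).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# Inside the training range the tree follows the data reasonably well
print(tree.predict([[2.0], [5.0]]))

# Outside the range it cannot extrapolate: every prediction is the
# constant value of the last leaf
print(tree.predict([[12.0], [50.0], [1000.0]]))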
ENSEMBLES OF DECISION TREES
RANDOM FORESTS
• Random forests use a collection of slightly different trees. Basically, if you build many accurate trees which overfit in different ways, you can reduce the amount of overfitting by averaging their results.
• To implement this strategy, you need to build many different trees. Random forests inject randomness into their trees by randomly selecting the data points used to build each tree and by randomly selecting the features used for splitting.
• To build a random forest model, specify the number of trees to build using the n_estimators parameter. The larger the number of trees, the more robust the ensemble and the better it tends to generalize.
• The algorithm works as follows (a sketch of the two sources of randomness follows this list):
o First, it selects a sample of the data (called a bootstrap sample) by repeatedly sampling data points with replacement until it creates a dataset that is as big as the original dataset (some data points from the original dataset will be missing while some will be repeated).
o Next, a decision tree is built on this newly created dataset. However, only a subset of the features is considered when deciding how to split a node. The number of features selected is controlled by the max_features parameter. Each node uses a different random subset of the features.
• If you set max_features to all n features, no randomness is injected into the feature selection, making the trees similar (which increases overfitting); if you set max_features to 1, the splits have no choice of feature and can only search over different thresholds for the feature that was selected randomly (although this reduces overfitting). Thus this value should be tuned. (A good rule of thumb is max_features=sqrt(n_features) for classification and max_features=log2(n_features) for regression.)
• To make a prediction, the algorithm first makes a prediction with every tree in the forest. For classification, a "soft voting" strategy is used: the class probabilities of all trees are averaged and the class with the highest average probability is predicted. For regression, the results of the trees are averaged to get the final prediction.
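The sketch below illustrates the two sources of randomness described above with plain NumPy (the tiny dataset is an assumption, used only for illustration): a bootstrap sample drawn with replacement, and a random subset of features considered at a node.

import numpy as np

rng = np.random.RandomState(0)

# Toy dataset: 10 samples, 4 features (assumed, for illustration only)
X = rng.rand(10, 4)
y = rng.randint(0, 2, 10)
n_samples, n_features = X.shape

# 1) Bootstrap sample: draw n_samples points WITH replacement, so some
#    points are repeated and some are left out entirely
boot_idx = rng.randint(0, n_samples, n_samples)
X_boot, y_boot = X[boot_idx], y[boot_idx]

# 2) At each node, only a random subset of features is considered;
#    sqrt(n_features) is the rule of thumb for classification
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)

print("bootstrap indices:", boot_idx)
print("features considered at this node:", candidate_features)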
PARAMETER SELECTION
• The random forest gives us an accuracy of 96%, better than the linear models or a single decision tree, even without tuning any parameters. We could adjust the max_features setting to reduce overfitting; however, the default parameters of the random forest often already work quite well (a sketch follows below).
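A minimal scikit-learn sketch of fitting a random forest with default settings; the dataset is not named on this slide, so the Breast Cancer dataset is assumed here and the exact score will differ from the 96% quoted above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed illustration dataset (not named on the slide)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, default max_features; no further tuning
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))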
FEATURE IMPORTANCE
GRADIENT BOOSTED REGRESSION TREES
• Despite the name, these models can be used for both regression and classification.
• There is no randomization; instead, strong pre-pruning is used. Thus the trees are shallow (depth one to five), hence using less memory and prediction time.
• Unlike random forests, they build trees in a sequential manner, with each tree trying to improve on the previous ones. In this model, simple models (known as weak learners) are created as shallow trees, and each new tree is built to improve on the predictions of the trees that came before it.
• They perform very well and hence are very popular; however, they need more careful parameter tuning than random forests to perform that well.
• The parameters for pre-pruning (max_depth) and the number of trees (n_estimators) are the same as for random forests, but there is another important parameter called learning_rate. This controls how strongly each tree corrects the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Adding more trees to the ensemble also increases the model's ability to correct mistakes on the training set (a sketch of this sequential idea follows below).
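To make the sequential idea concrete, here is a minimal hand-rolled sketch for regression with squared loss (synthetic data assumed; this is an illustration of the idea, not scikit-learn's implementation): each shallow tree is fit to the residuals of the ensemble so far, and its contribution is scaled by learning_rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed example)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

n_estimators, learning_rate = 100, 0.1
prediction = np.zeros_like(y)       # start from a prediction of zero
trees = []

for _ in range(n_estimators):
    residual = y - prediction                  # the mistakes made so far
    tree = DecisionTreeRegressor(max_depth=1)  # a shallow "weak learner"
    tree.fit(X, residual)
    # each new tree only nudges the prediction, scaled by learning_rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - prediction) ** 2))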
Example
• Let us apply GradientBoostingClassifier to the Breast Cancer dataset with 100 trees of maximum depth 1 and a learning rate of 0.1.
• Without adjusting any parameters, the training set accuracy is 100%, a possible sign of overfitting. Limiting the maximum depth of the trees reduces overfitting, while lowering the learning rate slightly increases the generalization performance. We are able to get accuracy levels similar to those of random forests (see the sketch below).
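A minimal scikit-learn sketch of the example described above; the exact accuracies depend on the train/test split, so the printed numbers are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default parameters (deeper trees): training accuracy reaches 100%
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("default     train/test:", gbrt.score(X_train, y_train), gbrt.score(X_test, y_test))

# Settings from the slide: 100 trees of depth 1, learning rate 0.1
gbrt = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                  learning_rate=0.1,
                                  random_state=0).fit(X_train, y_train)
print("max_depth=1 train/test:", gbrt.score(X_train, y_train), gbrt.score(X_test, y_test))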
SELECTING PARAMETERS