Lesson 5.0: Supervised Learning with Decision Trees

This document provides an overview of decision trees in supervised learning, explaining their structure, how they make predictions, and the methods used to control their complexity to prevent overfitting. It discusses the use of Gini impurity for node splitting, the importance of feature selection, and introduces ensemble methods like random forests and gradient boosted trees to enhance model performance. Additionally, it highlights the advantages and limitations of decision trees and ensemble methods in machine learning applications.


SUPERVISED LEARNING: DECISION TREES

INTRODUCTION

• Trees learn a set of if/else questions with answers and use that knowledge to make a decision.
• For example, suppose you have four animals (bears, hawks, penguins, and dolphins) with Yes/No values for features such as "has feathers", "can fly", and "has fins". You could ask all the questions in order to identify the animal.
• However, if you examine the questions closely, you realize that you only need to ask a few of them to tell the animals apart. This is learning the pattern from the rules.
• The presence of feathers narrows the possibilities down to just two animals, so you only need to ask one more question to know the animal. In some cases you only need to ask one question, e.g. "does the animal have fins?" These questions let us find the answer faster; they become the rules.
• Notice that the description of a lion would make the tree classify it as a bear. This is not accurate, but it is the closest match based on the data that we have.
• In machine learning we can provide an algorithm with data that acts as examples (hence supervised learning); it then learns the rules and represents them as a tree, which is the model we later use for prediction.
DECISION TREES IN SCIKIT-LEARN
• The data is usually represented as continuous features, as in a 2D dataset with two numeric features.
• To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable. For instance, splitting the dataset at x[1] = 0.0596 yields the most information.
• The split is made by testing whether x[1] <= 0.0596, indicated by a black line in the figure. If the test is true, a point is assigned to the left node, which contains 2 points belonging to class 0 and 32 points belonging to class 1. Otherwise the point is assigned to the right node.
• These two nodes correspond to the top and bottom regions shown in the diagram. Even though the first split did a good job of separating the two classes, the bottom region still contains points belonging to class 0, and the top region still contains points belonging to class 1.
• The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) contains only a single target value (a single class or a single regression value). A leaf whose data points all share the same target value is called pure.
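
Example (a minimal sketch of the workflow described above; the exact 2D dataset behind the 0.0596 threshold is not included in these slides, so a synthetic two-class dataset from make_moons is assumed for illustration):

# Fit an unpruned decision tree on a synthetic 2D two-class dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=0)    # keeps splitting until every leaf is pure
tree.fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))       # 1.0: all leaves are pure
print("Test accuracy:", tree.score(X_test, y_test))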

IRIS FLOWER

• We use the iris flower dataset to train the model.
• To visualize the tree we use the export_graphviz function from the tree module. It makes trees easier to analyze by colouring each node to reflect its majority class and by including the feature names.
• It creates a file in the .dot format for storing the graph. To view the file we use the graphviz library, which you might need to install.
• The sequence of if/else questions that gets us to the true answer most quickly (i.e. is most informative about the target variable) forms the rules. In this dataset, if the petal width is less than 0.8 cm we are dealing with setosa; there is no need to read any other feature to identify the setosa species, so this is picked as the first rule. The other classes need a few more rules.
• Notice that this data does not come in the form of binary yes/no features as in the animal example, but is instead continuous, so the rules are represented as tests such as >= or <= rather than =. The number of data points in each node and their class is indicated.
• Where multiple classes exist, recursive partitioning of the data continues until we find a node that contains a single class. Such a node is said to be pure and forms a rule. Our tree has 7 rules.
• A prediction on a new data point is made by traversing the tree from the root, checking which region the point lies in, and predicting the target of that region.
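
Example (a sketch of the training and visualization steps described above; the output file name tree.dot is an assumption):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz   # the graphviz Python package (and system binaries) may need to be installed

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# write the tree in .dot format, colouring nodes by majority class and labelling features
export_graphviz(tree, out_file="tree.dot", class_names=iris.target_names,
                feature_names=iris.feature_names, filled=True, impurity=False)

with open("tree.dot") as f:
    graph = graphviz.Source(f.read())   # render the .dot file (e.g. in a notebook)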

GINI IMPURITY

• There are different methods used to decide how to split on the features. Gini impurity is the simplest and therefore the most common; the alternative, entropy / information gain, is somewhat more complex.
• Gini impurity is a measurement used when building decision trees to determine how the features of a dataset should split the nodes that form the tree, i.e. it measures the likelihood that a data point in a node would be misclassified if it were labelled according to the node's class distribution. It thus measures the impurity level of a split.
• For a two-class problem it yields a number between 0 and 0.5. A small Gini value signifies less impurity and hence a better split. It is calculated as shown below, where:
  • D is the dataset (the samples in the node)
  • k is the number of classes
  • p_i is the probability (proportion) of samples belonging to class i
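
The formula itself appears only as an image in the original slides; in the notation above it is the standard Gini impurity:

Gini(D) = 1 - \sum_{i=1}^{k} p_i^2

where p_i is the proportion of samples in D belonging to class i.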

CALCULATING GINI IMPURITY

• Assume the algorithm picks the feature sepal length for a node. It creates a threshold (e.g. 5 cm) and then splits the data points into those whose attribute value is above the threshold and those that are below or equal to it.
• If a pattern exists, one group will tend to contain data points belonging to one class and the other group a different class. There will be some impurities, e.g. a few circle species in a group where triangles are the majority. We calculate this impurity for each group as follows:
• Gini impurity of the True group (sepal length > 5):
  • Probability of class circle = 4/6 = 0.67
  • Probability of class triangle = 2/6 = 0.33
  • Gini impurity for True = 1 - [(0.67 * 0.67) + (0.33 * 0.33)] = 0.44
  • Similarly, Gini impurity for False = 0.38
• The weighted Gini impurity over all groups of the feature (attribute) = 6/10 * 0.44 + 4/10 * 0.38 = 0.42 (a short Python sketch of this calculation follows this list).
• The process is repeated for other features, e.g. petal length > 5, and the feature with the lowest weighted Gini impurity is selected for further splitting. A node is pure when its Gini impurity is 0; this is when splitting stops.
• An alternative approach for finding the best split uses the highest information gain (entropy).
• It is also possible to use trees for regression tasks: traverse the tree and find the leaf the new data point falls into; the output for this data point is the mean target value of the training points in that leaf.
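
The weighted-Gini calculation above can be reproduced with a short Python sketch (the class counts for the True group come from the slide; the False group is represented only by its stated impurity of 0.38):

def gini(class_counts):
    # Gini impurity of one group: 1 minus the sum of squared class probabilities
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

g_true = gini([4, 2])                      # 4 circles, 2 triangles -> about 0.44
g_false = 0.38                             # as given on the slide
weighted = (6 / 10) * g_true + (4 / 10) * g_false
print(round(g_true, 2), round(weighted, 2))    # 0.44 0.42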

CONTROLLING COMPLEXITY OF DECISION TREES
• Typically, building a tree as described earlier and continuing until all leaves are pure leads to models that are very complex and therefore highly overfit to the training data.
• The presence of pure leaves means that a tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct class. The overfitting can be seen in a tree where a node belonging to a class with few data points sits in the middle of another class, a sign that the decision boundary is not very clear.
• There are two common strategies to prevent overfitting:
  • Pre-pruning: stopping the creation of the tree early
  • Post-pruning: removing nodes that contain little information

PRE-PRUNING

• scikit-learn historically implemented only pre-pruning for decision trees (recent versions also offer minimal cost-complexity post-pruning via ccp_alpha).
• Let's use a bigger dataset (the Breast Cancer dataset) to understand the effect of pre-pruning in more detail.
• When we do not pre-prune the tree, the accuracy on the training set is 100% because the leaves are pure. The tree has grown deep enough to obtain pure leaves and has therefore perfectly memorized all the labels of the training data.
• The test set accuracy is slightly worse than for the linear models we looked at previously, which had around 95% accuracy. This means the model did not generalize as well to new data, a sign of overfitting.
• To apply pre-pruning to the tree, we can limit its depth. For example, if we set max_depth=4, only four consecutive questions can be asked.
• Limiting the depth of the tree leads to lower accuracy on the training set (the leaves are no longer pure) but an improvement on the test set, so the model generalizes better (overfitting decreases).
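
Example (a sketch of the comparison described above, using the Breast Cancer dataset; the exact train/test split used in the slides is not given, so the random_state values are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)        # unpruned
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("Unpruned    train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("max_depth=4 train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))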

FEATURE IMPORTANCE

• However, even with a tree of depth four, the tree can still be large, so inspecting it node by node is tedious. Instead of looking at the whole tree, we can use summary measures of how the tree uses the features.
• If we visualize the tree with colour, we notice that the orange (malignant) samples are concentrated in one leaf, with the other leaves containing few samples. The most commonly used summary is feature importance, which rates how important each feature is for the decisions the tree makes.
• It provides a number between 0 and 1 for each feature, where 0 means "not used at all" and 1 means "perfectly predicts the target." The feature importances always sum to 1.
• When we plot them we see that the feature "worst radius" is the most important feature.
• A low feature importance does not mean that the feature is uninformative; in some cases it encodes the same information as another feature and was simply not picked by the tree.
• However, linear model coefficients tell us more: in addition to which features are important, they also tell us which class a feature points toward. With feature importances we can tell that "worst radius" is important, but not whether a high worst radius is indicative of a sample being benign or malignant.
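
Example (a sketch of plotting the feature importances; it assumes the pruned_tree and cancer objects from the pre-pruning sketch above):

import numpy as np
import matplotlib.pyplot as plt

importances = pruned_tree.feature_importances_    # one value per feature, summing to 1
order = np.argsort(importances)

plt.barh(range(len(importances)), importances[order])
plt.yticks(range(len(importances)), np.array(cancer.feature_names)[order])
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()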

DECISION TREES FOR REGRESSION
• Decision trees for regression work the same way, using the DecisionTreeRegressor estimator.
• NB: however, tree-based models are not able to extrapolate, i.e. make predictions outside the range of the training data (a short sketch follows this list).
• Decision tree classifiers have two advantages over many other algorithms:
  o The resulting model can easily be visualized and understood by non-experts (especially for smaller trees).
  o The algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.
• The main downside of decision tree classifiers is that, even with pre-pruning, they tend to overfit, as you don't always know how much pruning is needed to avoid overfitting. This can be addressed by combining multiple trees into an ensemble.
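
Example (a small sketch of DecisionTreeRegressor and its inability to extrapolate, using assumed toy data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)    # training inputs in [0, 10]
y_train = np.sin(X_train).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

X_new = np.array([[2.5], [15.0]])    # 15.0 lies outside the training range
print(reg.predict(X_new))            # for 15.0 the tree simply returns the mean of its last leaf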

ENSEMBLE METHODS OF DECISION TREES

• Ensembles are methods that combine multiple machine learning models to create more powerful models.
• Here we look at ensembles of trees. Two types perform well:
1. Random forests
2. Gradient boosted regression trees

RANDOM FORESTS

• Random forests use a collection of slightly different trees. The idea: if you build many trees that all work reasonably well but overfit in different ways, you can reduce the amount of overfitting by averaging their results.
• To implement this strategy, you need to build many different trees. Random forests inject randomness into their trees by randomly selecting the data points used to build each tree and by randomly selecting the features considered for each split.
• To build a random forest model, specify the number of trees to build using the n_estimators parameter. The larger the number of trees, the more robust the ensemble and the better it generalizes.
• The algorithm works as follows:
  o First, it takes a bootstrap sample of the data: data points are drawn repeatedly (with replacement) until the new dataset is as big as the original one, so some data points from the original dataset will be missing while others will be repeated.
  o Next, a decision tree is built on this newly created dataset. However, only a random subset of the features is considered when deciding how to split a node. The size of this subset is controlled by the max_features parameter, and each node uses a different random subset of the features.
• If you set max_features to all n features, no randomness is injected into the feature selection, making the trees similar (which increases overfitting); if you set max_features to 1, the splits have no choice of feature and can only search over different thresholds for the feature that was selected randomly (although this reduces overfitting). This value should therefore be controlled. (A good rule of thumb: max_features=sqrt(n_features) for classification and max_features=log2(n_features) for regression.)
• To make a prediction, the algorithm first makes a prediction with every tree in the forest. For classification, a "soft voting" strategy is used: the predicted probabilities of all trees are averaged and the class with the highest probability is predicted. For regression, the individual predictions are averaged to obtain the final prediction.
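
Example (a sketch of fitting a random forest of 100 trees on the Breast Cancer dataset; parameter values other than n_estimators are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 100 slightly different trees: each is built on a bootstrap sample and
# considers a random subset of the features at every split
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))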

PARAMETER SELECTION

• The random forest gives us an accuracy of about 96%, better than the linear models or a single decision tree, even without tuning any parameters. We could adjust the max_features setting to reduce overfitting; however, the default parameters of a random forest often already work quite well.

FEATURE IMPORTANCE

• Similarly to the decision tree, the random forest provides feature importances, which are computed by aggregating the feature importances of all its trees; these are typically more reliable than those provided by a single tree.
• Note that the random forest considers more features as important, and the importance levels of individual features may differ from those of a single tree. It therefore captures a much broader picture of the data than a single tree, which contributes to its improved accuracy.
• Random forests don't tend to perform well on high-dimensional, sparse data, such as text data. Linear models are recommended for such data.
• Random forests are not easy to explain to non-experts, and their random nature makes it harder to reproduce results exactly. In such cases you can opt for a single decision tree.
• Training random forests on large datasets can be time consuming, which may require using multiple CPU cores. Alternative ensembles of trees, such as gradient boosted trees, can address this.
• Where random forests seem to overfit, alternative ensembles such as gradient boosted trees may help.

GRADIENT BOOSTED REGRESSION TREES

• Despite the name, these models can be used for both regression and classification.
• There is no randomization; instead, strong pre-pruning is used. The trees are therefore shallow (depth one to five), which uses less memory and makes predictions faster.
• Unlike random forests, gradient boosted trees are built sequentially, with each tree trying to improve on the previous ones. Each shallow tree is a simple model (a weak learner) that provides good predictions on only part of the data, so more trees are added to iteratively improve performance.
• They perform very well and hence are very popular; however, they need more careful parameter tuning than random forests to perform that well.
• The main parameters are the pre-pruning depth (max_depth) and the number of trees (n_estimators), just like random forests, but there is another important parameter called learning_rate. It controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Adding more trees to the ensemble also increases the model's capacity to correct mistakes on the training set.
Example
• Let us apply GradientBoostingClassifier to the Breast Cancer dataset with 100 trees and a learning rate of 0.1.
• Without adjusting any parameters, the training set accuracy is 100%, a possible sign of overfitting. Limiting the maximum depth of the trees (e.g. max_depth=1) reduces overfitting, while lowering the learning rate increases the generalization performance slightly. We are able to reach similar accuracy levels to random forests (see the sketch after this list).
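
Example (a sketch of the gradient boosting experiment described above; the data split and random_state are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 100 shallow trees (max_depth=1) built sequentially, each correcting its predecessors
gbrt = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                  learning_rate=0.1, random_state=0)
gbrt.fit(X_train, y_train)
print("Train accuracy:", gbrt.score(X_train, y_train))
print("Test accuracy:", gbrt.score(X_test, y_test))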

SELECTING PARAMETERS

• When you view the feature importances of the gradient boosted trees, you will notice that they are somewhat similar to those of the random forest, though gradient boosting completely ignores some of the features.
• Because both gradient boosting and random forests perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly with little tuning.
• Improved implementations exist in other packages: e.g. the xgboost package can be better for large-scale problems because it is faster than the scikit-learn implementation of gradient boosted trees (a short sketch follows this list).
• Gradient boosted decision trees also do not work well on high-dimensional, sparse data.
• Increasing the number of trees in gradient boosted trees may lead to overfitting, so adjusting the learning_rate is often the better lever.
• When you adjust max_depth in gradient boosted trees, keep it low (less than 5).
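
Sketch of the xgboost alternative mentioned above (xgboost offers a scikit-learn-compatible estimator; the parameter values here are assumptions, and X_train/y_train/X_test/y_test are reused from the earlier examples):

from xgboost import XGBClassifier   # pip install xgboost

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))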
