
From tree to forest

1 / 19
Decision tree
● Decision trees are very popular supervised classification
algorithms:
– They perform quite well on classification problems
– The decision path is relatively easy to interpret
– The algorithm to build (train) them is fast and simple

● A decision tree is a flowchart-like structure made of nodes and branches:
– At each node, a split on the data is performed based on one of the input features, generating two or more branches.
– More and more splits are made in the upcoming nodes to partition the original data.
– This continues until a node is generated where all or almost all of the data belong to the same class.
2 / 19
Example: Sailing plan

3 / 19
Building a decision tree
● There are several automatic procedures (like C4.5, ID3 or the
CART algorithm) to extract the rules from the data to build a
decision tree.

● These algorithms partition the training set into subsets until each partition is either “pure” in terms of target class or sufficiently small:
– A pure subset is a subset that contains only samples of one class.
– Each partitioning operation is implemented by a rule that splits the incoming data based on the values of one of the input features.
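
As a concrete illustration (not part of the original slides), the minimal sketch below trains one such algorithm, the CART implementation available in scikit-learn; the iris dataset and the parameter choices are my own assumptions.

# Minimal sketch: training a CART-style decision tree with scikit-learn.
# The dataset and parameters are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a tree; the split quality measure ("criterion") can be gini or entropy.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on unseen data
print(export_text(tree))            # the learned split rules as text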

4 / 19
Split rules

● How does an algorithm decide which feature to use at each point to split the input subset?
– At each step, the algorithm uses the feature that leads to the purest output subsets.
– Therefore, we need a metric to measure the purity of a split:
● Information gain
● Gini index
● Gain ratio

5 / 19
Entropy
● Entropy is used to measure purity, information, or disorder:

\[ \mathrm{Entropy}(p) = -\sum_{i=1}^{N} p_i \log_2 p_i \]

where p is the whole dataset, N is the number of classes, and p_i is the frequency of class i in the dataset.
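
A small sketch (not from the slides) of how this entropy could be computed from an array of class labels; the function name and the NumPy-based implementation are my own:

import numpy as np

def entropy(labels):
    """Entropy (in bits) of a 1-D array-like of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class frequencies p_i
    return float(-np.sum(p * np.log2(p)))     # -sum_i p_i * log2(p_i)

# A perfectly balanced two-class node is maximally "confused":
print(entropy(["yes"] * 4 + ["no"] * 4))      # 1.0
# A pure node (only one class) has zero entropy.
print(entropy(["yes"] * 8))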

6 / 19
Entropy based splits

The goal of each split in a decision tree is to move from a confused dataset to two (or more) purer subsets with lower entropy.

Ideally, the split should lead to subsets with an entropy of 0.0.

7 / 19
Information Gain (ID3)

● In order to evaluate how good a feature is for splitting, the difference in entropy before and after the split is calculated:

\[ \mathrm{Gain} = \mathrm{Entropy}(\mathrm{before}) - \sum_{j=1}^{K} \frac{N_j}{N} \, \mathrm{Entropy}(j, \mathrm{after}) \]

where “before” is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and N_j / N is the fraction of the samples that falls into subset j.

● We choose to split the data on the feature with the highest value of information gain.
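
A sketch (mine, not from the slides) of the size-weighted information gain described above; it reuses the entropy function from the previous example:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent_labels, child_label_subsets):
    """Entropy(before) minus the size-weighted entropy of the K subsets."""
    n = len(parent_labels)
    weighted_after = sum(len(c) / n * entropy(c) for c in child_label_subsets)
    return entropy(parent_labels) - weighted_after

# Splitting a mixed node into two pure children recovers all the entropy.
before = ["yes"] * 4 + ["no"] * 4
print(information_gain(before, [["yes"] * 4, ["no"] * 4]))   # 1.0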

8 / 19
Gain Ratio (C4.5)
The information gained by a balanced split is higher than the information gained by an unbalanced split. To correct for this bias, C4.5 uses the gain ratio: the information gain divided by the split information, i.e. the entropy of the subset size distribution.
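
A sketch of the gain ratio as used in C4.5 (the helper names and the toy split below are my own assumptions):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(parent_labels, child_label_subsets):
    """Information gain divided by the entropy of the subset sizes."""
    n = len(parent_labels)
    weights = np.array([len(c) / n for c in child_label_subsets])
    gain = entropy(parent_labels) - sum(
        w * entropy(c) for w, c in zip(weights, child_label_subsets))
    split_info = float(-np.sum(weights * np.log2(weights)))  # entropy of subset sizes
    return gain / split_info if split_info > 0 else 0.0

before = ["yes"] * 6 + ["no"] * 6
balanced   = [before[:6], before[6:]]      # two subsets of size 6
unbalanced = [before[:1], before[1:]]      # subsets of size 1 and 11
print(gain_ratio(before, balanced), gain_ratio(before, unbalanced))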

9 / 19
Gini index
● Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled:

\[ \mathrm{Gini}(p) = \sum_{i=1}^{N} p_i (1 - p_i) = 1 - \sum_{i=1}^{N} p_i^2 \]

Proof: since \( \sum_{i=1}^{N} p_i = 1 \), we have \( \sum_i p_i (1 - p_i) = \sum_i p_i - \sum_i p_i^2 = 1 - \sum_i p_i^2 \).

● The Gini index of a split:

\[ \mathrm{Gini}_{\mathrm{split}} = \sum_{j=1}^{K} \frac{N_j}{N} \, \mathrm{Gini}(j, \mathrm{after}) \]

where K is the number of subsets generated by the split, (j, after) is subset j after the split, and N_j / N is the fraction of the samples that falls into subset j.

10 / 19
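
A sketch of the Gini impurity and the size-weighted Gini index of a split (the function names are my own):

import numpy as np

def gini_impurity(labels):
    """1 - sum_i p_i^2 for the class frequencies p_i of the labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_index(child_label_subsets):
    """Size-weighted average Gini impurity of the K subsets of a split."""
    n = sum(len(c) for c in child_label_subsets)
    return sum(len(c) / n * gini_impurity(c) for c in child_label_subsets)

print(gini_impurity(["yes"] * 4 + ["no"] * 4))   # 0.5: maximally impure two-class node
print(gini_index([["yes"] * 4, ["no"] * 4]))     # 0.0: both subsets are pure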
Split candidates

Nominal features:
– We can create a child node for each possible value (a wider tree).
– We can make a binary split, grouping the values into two sets (a deeper tree).
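
A small sketch of how the two options could be enumerated for a nominal feature; the feature values are made up for illustration:

from itertools import combinations

values = ["sunny", "overcast", "rainy"]   # hypothetical values of a nominal feature

# Option 1: one child node per possible value (a wider tree).
multiway_split = [{v} for v in values]

# Option 2: candidate binary splits, each grouping the values into two sets
# (a deeper tree). For three values, subsets of size 1 already give every
# distinct two-way grouping.
binary_splits = [(set(group), set(values) - set(group))
                 for group in combinations(values, 1)]

print(multiway_split)
print(binary_splits)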

11 / 19
Split candidates

Numerical features:
– In principle, all numerical values could be split candidates (computationally expensive).
– In practice, the candidate split points are taken in between every two consecutive values of the selected numerical feature, and the binary split producing the best quality measure is adopted.
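
A sketch of this procedure: candidate thresholds are the midpoints between consecutive sorted values, and the one with the best (here: lowest size-weighted Gini) quality is kept. The helper names and the toy data are my own assumptions:

import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_numeric_split(feature, labels):
    """Try the midpoints between consecutive distinct feature values and
    return the threshold with the lowest size-weighted Gini impurity."""
    x, y = np.asarray(feature, dtype=float), np.asarray(labels)
    distinct = np.unique(x)
    candidates = (distinct[:-1] + distinct[1:]) / 2.0   # midpoints
    def quality(t):
        left, right = y[x <= t], y[x > t]
        return (len(left) * gini_impurity(left) +
                len(right) * gini_impurity(right)) / len(y)
    return min(candidates, key=quality)

temperature = [12, 18, 21, 25, 30, 31]
go_sailing  = ["no", "no", "yes", "yes", "yes", "yes"]
print(best_numeric_split(temperature, go_sailing))   # a threshold between 18 and 21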

12 / 19
Size and Overfitting

● Trees that are too deep can lead to models that are too detailed and don’t generalize to new data.
● On the other hand, trees that are too shallow might lead to overly simple models that can’t fit the data.

13 / 19
Pruning

● Pruning is a way to avoid overfitting.

● Pruning is applied to a decision tree after the training phase.

● Basically, we let the tree grow as much as allowed by its settings, without applying any explicit restrictions. At the end, we proceed to cut those branches that are not sufficiently populated.

14 / 19
Reduced error pruning

At each iteration,
– a sparsely populated branch is pruned;
– the pruned tree is applied again to the training data;
– if the pruning of the branch doesn’t decrease the accuracy on the training set, the branch is removed.
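
Reduced error pruning is not available in scikit-learn; as a loosely related illustration of post-training pruning, the sketch below uses scikit-learn's cost-complexity pruning, which likewise cuts back a fully grown tree after training. The dataset and the selection by held-out accuracy are my own choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree freely, then compute the pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with increasing pruning strength and keep the best tree on held-out data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))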

15 / 19
Early Stopping

● Another option to avoid overfitting is early stopping, based on a stopping criterion.
● One common stopping criterion is the minimum number of samples per node:
– a higher value of this minimum number leads to shallower trees,
– while a smaller value leads to deeper trees.
● What other criteria?
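
As an illustration (not from the slides), scikit-learn's decision tree exposes several stopping criteria as hyperparameters; the particular values below are arbitrary:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each parameter is a stopping criterion: growth stops when a split would
# violate the minimum number of samples, exceed the maximum depth, or not
# decrease the impurity by at least the given threshold.
tree = DecisionTreeClassifier(
    min_samples_leaf=10,          # minimum number of samples per leaf
    max_depth=4,                  # maximum tree depth
    min_impurity_decrease=0.01,   # minimum quality improvement required by a split
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())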

17 / 19
Random forest
● Many is better than one.
– Several decision trees together can produce more accurate
predictions than just one single decision tree by itself.
● The random forest algorithm builds N slightly differently trained decision trees and merges them together to get more accurate and stable predictions.
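
A minimal sketch of training such an ensemble with scikit-learn's RandomForestClassifier; the dataset and N = 100 are arbitrary choices of mine:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# N = 100 slightly differently trained trees, each fit on a bootstrap sample
# and considering a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())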

18 / 19
Bootstrapping of Training Sets
In a random forest, N decision trees are trained, each on a subset of the original training set obtained via bootstrapping of the original dataset (random sampling with replacement).
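
A sketch of drawing one bootstrapped training subset with NumPy (the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 10                        # size of the original training set
original_indices = np.arange(n_samples)

# Random sampling *with replacement*: some rows appear several times,
# others are left out of this particular bootstrap sample.
bootstrap_indices = rng.choice(original_indices, size=n_samples, replace=True)
print(np.sort(bootstrap_indices))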

19 / 19
The Majority Rule

● The N slightly differently trained trees will produce N slightly different predictions for the same input vector.
● Usually, the majority rule is applied to make the final
decision.
● The prediction offered by the majority of the N trees is
adopted as the final one.
● While the predictions from a single tree are highly
sensitive to noise in the training set, predictions from
the majority of many trees are not (if trees are diverse).
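
A sketch of the majority rule applied to the N per-tree predictions for a single input vector; the vote counts below are made up:

from collections import Counter

# Predictions of N = 7 slightly different trees for the same input vector.
tree_predictions = ["go sailing", "stay home", "go sailing", "go sailing",
                    "stay home", "go sailing", "go sailing"]

# Majority rule: the class predicted by most trees becomes the final prediction.
final_prediction, votes = Counter(tree_predictions).most_common(1)[0]
print(final_prediction, votes)      # go sailing 5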

20 / 19
