Trees
Decision tree
● Decision trees are very popular supervised classification
algorithms:
– They perform quite well on classification problems
– The decision path is relatively easy to interpret
– The algorithm to build (train) them is fast and simple
Building a decision tree
● There are several automatic procedures (such as ID3, C4.5, or the
CART algorithm) for extracting the rules from the data and building a
decision tree.
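For instance, scikit-learn's DecisionTreeClassifier implements an optimized version of CART; a minimal sketch of training a tree and printing its rules (the slides do not prescribe a library, and the dataset and parameters here are only illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a CART-style tree on a small example dataset.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Print the learned decision rules, one indented line per split.
print(export_text(clf))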
Split rules
● Gini index
● Gain ratio
Entropy
● Entropy is used to measure purity,
information, or disorder:
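For a set S with class proportions p_1, ..., p_c (the standard base-2 Shannon entropy):

H(S) = - \sum_{i=1}^{c} p_i \log_2 p_i

It is 0 for a pure set (all examples in one class) and reaches its maximum, \log_2 c, when all classes are equally frequent.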
Entropy-based splits
Information Gain (ID3)
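ID3 chooses the split that maximizes the information gain, i.e. the reduction in entropy obtained by splitting S on attribute A (standard definition):

IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)

where S_v is the subset of S for which attribute A takes value v.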
Gain Ratio (C4.5)
The information gained by a balanced split is higher than that gained
by an unbalanced split.
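C4.5 compensates for this bias by dividing the information gain by the split information, i.e. the entropy of the split itself (standard definition):

\mathrm{GainRatio}(S, A) = \frac{IG(S, A)}{\mathrm{SplitInfo}(S, A)}, \qquad \mathrm{SplitInfo}(S, A) = - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}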
Gini index
● Gini impurity is a measure of how often a randomly chosen element
from the set would be incorrectly labeled if it were labeled at random
according to the distribution of labels in the set:
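For class proportions p_1, ..., p_c, the standard formulation is:

\mathrm{Gini}(S) = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} p_i^2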
Proof:
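Expanding the sum and using \sum_{i=1}^{c} p_i = 1:

\sum_{i=1}^{c} p_i (1 - p_i) = \sum_{i=1}^{c} p_i - \sum_{i=1}^{c} p_i^2 = 1 - \sum_{i=1}^{c} p_i^2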
Split candidates
● Numerical features:
– In principle, all numerical values could be split candidates
(computationally expensive).
– Instead, the candidate split points are taken between every two
consecutive values of the selected numerical feature; the binary
split producing the best quality measure is adopted (see the sketch
below).
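A minimal sketch of this search for a single numerical feature, scored with Gini impurity (function and variable names are illustrative, not from the slides):

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_numeric_split(x, y):
    # Candidate thresholds lie halfway between consecutive distinct sorted values of x.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                      # no split point between identical values
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        # Quality of the split: weighted Gini impurity of the two children (lower is better).
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y_sorted)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score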
Size and Overfitting
● Trees that are too deep can lead to models that are too
detailed and don't generalize to new data.
● On the other hand, trees that are too shallow might lead to
overly simple models that can't fit the data.
Pruning
Reduced Error Pruning
● At each iteration:
– a sparsely populated branch is tentatively pruned;
– the pruned tree is applied again to the training data;
– if pruning the branch does not decrease the accuracy on the
training set, the branch is removed permanently (see the sketch
below).
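A small sketch of this procedure on a hand-rolled tree representation (the nested-dict structure, the "majority" field, and the bottom-up visiting order are assumptions made for illustration, not part of the slides):

def predict(tree, x):
    # Walk the tree: leaves are {"label": c}; internal nodes are
    # {"feature": j, "threshold": t, "left": ..., "right": ..., "majority": c}.
    node = tree
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(tree, X, y):
    return sum(predict(tree, xi) == yi for xi, yi in zip(X, y)) / len(y)

def reduced_error_prune(tree, node, X, y):
    # Visit the children first, then tentatively replace this subtree by a leaf
    # predicting its majority class; keep the change only if accuracy does not drop.
    if "label" in node:
        return
    reduced_error_prune(tree, node["left"], X, y)
    reduced_error_prune(tree, node["right"], X, y)
    before = accuracy(tree, X, y)
    backup = dict(node)
    node.clear()
    node["label"] = backup["majority"]
    if accuracy(tree, X, y) < before:
        node.clear()
        node.update(backup)   # pruning hurt accuracy: undo it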
Early Stopping
Random forest
● Many is better than one.
– Several decision trees together can produce more accurate
predictions than a single decision tree by itself.
● The random forest algorithm builds N slightly differently trained
decision trees and merges them to obtain more accurate and stable
predictions.
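A minimal sketch using scikit-learn's RandomForestClassifier (the slides do not name a library; the dataset and parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators plays the role of N: the number of trees in the forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the combined prediction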
Bootstrapping of Training Sets
In a random forest, the N decision trees are trained each on a subset
of the original training set, obtained via bootstrapping of the
original dataset (random sampling with replacement).
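A sketch of the bootstrapping step with NumPy (the stand-in training arrays are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 4))     # stand-in training features
y_train = rng.integers(0, 3, size=150)  # stand-in class labels

n = len(X_train)
idx = rng.integers(0, n, size=n)              # draw n row indices with replacement
X_boot, y_boot = X_train[idx], y_train[idx]   # bootstrapped training set for one tree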
The Majority Rule
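Each of the N trees votes for a class, and the class receiving the most votes becomes the forest's final prediction. A tiny sketch of the vote with NumPy (the vote values are made up):

import numpy as np

tree_votes = np.array([0, 2, 2, 1, 2])         # class predicted by each of N = 5 trees
prediction = np.bincount(tree_votes).argmax()  # majority vote: class 2 wins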