Session 04 - Tree-Based Methods
In this session
• We will learn about tree-based methods and ensemble learning
[Figure: an example decision tree with input variables x1 (Years) and x2 (Hits). The tree splits first on x1 < 4.5 and then on x2 < 117.5, partitioning the predictor space into regions R1, R2 and R3.]
• Examples of rules read off a tree (figure: splits on x2 < −2, x3 < 0 and a cut-point at 100, with leaves labelled "A" and "B"):
2) IF x1 < 4.5 AND x2 < −2 AND x4 ≥ 100 THEN "B"
3) …
Regression trees
• Let's assume we have the following regression problem: Y = f(X1, X2)
• Estimate f using decision trees.
Regression trees
• A tree training algorithm will partition the space following a variety of criteria.
• Different decision tree variants will follow different partitioning criteria.
• The figure is one possible partitioning of the space.
[Figure: a piecewise-constant estimate of Y over the (X1, X2) plane, with cut-points x1A, x1B on X1 and x2A, x2B on X2, and region values y0–y4.]
• In regression trees, the predicted response takes a discrete set of values (one per region).
• In our example: Y ∈ {y0, y1, y2, y3, y4}
• The response for a given test observation will be the mean of the training observations in the region to which that test observation belongs.
Regression trees
[Figure: the partition of the (X1, X2) plane into five regions with values y0–y4, and the equivalent binary tree with splits x1 < x1A, x2 < x2A, x2 < x2B and x1 < x1B.]
• The regions are chosen so as to minimise the RSS:

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2

• where \hat{y}_{R_j} is the mean response for the training observations within the jth box.
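As a concrete illustration of this quantity, here is a minimal sketch (the function name `partition_rss` and the toy data are ours, not from the slides) that computes the RSS of a given partition from the response vector and the indices of each region:

```python
import numpy as np

def partition_rss(y, regions):
    """RSS of a partition: within each region, squared deviations from that
    region's mean response, summed over all regions."""
    return sum(np.sum((y[idx] - y[idx].mean()) ** 2) for idx in regions)

# Toy example with two regions, R1 = observations 0..3 and R2 = observations 4..5.
y = np.array([2.0, 3.0, 2.5, 3.5, 10.0, 12.0])
print(partition_rss(y, [np.arange(4), np.arange(4, 6)]))  # 1.25 + 2.0 = 3.25
```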
• Next, we repeat the process, looking for the best input variable and best cut-point in order to split the data further so as to minimise the RSS within each of the resulting regions.
• However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions.
• Again, we look to split one of these three regions further, so as to minimise the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
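The following sketch illustrates one greedy step of this process (a simplified illustration assuming NumPy arrays `X` and `y`; the name `best_split` is ours): it scans every input variable and candidate cut-point and keeps the split with the lowest resulting RSS. Repeating the search inside each newly created region grows the tree.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (variable, cut-point) pair minimising the RSS
    of the two regions it creates."""
    best = None
    for j in range(X.shape[1]):                 # every input variable
        for s in np.unique(X[:, j]):            # every candidate cut-point
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = (np.sum((left - left.mean()) ** 2)
                   + np.sum((right - right.mean()) ** 2))
            if best is None or rss < best[0]:
                best = (rss, j, s)
    return best                                  # (RSS, variable index, cut-point)
```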
First split – searching for the best cut-point for Years (x1):
• Years < 11: Salary1 = 464 (R1), Salary2 = 755 (R2)
• Years < 16.5: Salary1 = 532, Salary2 = 613
• Years < 6.5: Salary1 = 365, Salary2 = 742
• Years < 4.5: Salary1 = 226, Salary2 = 697, RSS = 40162637 (*optimum cut-point for Years)
Second split – the Hits (x2) variable is added and the process is repeated (keeping Years < 4.5, so Salary1 = 226 in R1):
• Hits < 125: Salary2 = 494, Salary3 = 963, RSS = 30820324
• Hits < 77: Salary2 = 402, Salary3 = 795, RSS = 35161358
• Hits < 117.5: Salary2 = 465, Salary3 = 949, RSS = 30037022 (*optimum cut-point for Hits)
• For less experienced players, the number of Hits made in the previous year seems to play little role in Salary.
• But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries.
Overfitting problem
• The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.
[Figure: RSS versus number of splits, for the training data and the test data.]
Pruning
• A better strategy is to grow a very large tree T0, and then prune it back in order to obtain a subtree.
• Cost complexity pruning: for a given tuning parameter 𝛼, we seek the subtree T that minimises

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|

• where |T| is the number of terminal nodes of T, so 𝛼 penalises tree size.
• Algorithm:
1. Grow a large tree T0 on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of 𝛼.
3. Choose 𝛼, e.g. using cross-validation.
4. Return the subtree from Step 2 that corresponds to the chosen value of 𝛼.
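A minimal sketch of these four steps with scikit-learn (assuming NumPy arrays `X` and `y` for a regression problem; the helper name `prune_by_cv` is ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_by_cv(X, y, cv=5):
    # Step 1: grow a large tree T0.
    big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
    # Step 2: the sequence of subtrees is indexed by the alphas on the pruning path.
    alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas
    # Step 3: choose alpha by cross-validation (negated MSE, so larger is better).
    scores = [cross_val_score(DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a,
                                                    random_state=0),
                              X, y, cv=cv, scoring="neg_mean_squared_error").mean()
              for a in alphas]
    # Step 4: return the subtree corresponding to the chosen alpha.
    best_alpha = alphas[int(np.argmax(scores))]
    return DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha,
                                 random_state=0).fit(X, y)
```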
Classification trees
• Very similar to regression trees.
• But we predict a qualitative response (class) instead.
• Instead of using the mean, the predicted class of an observation will be the most commonly occurring class (mode).
• In the classification setting, RSS cannot be used as a criterion for making the binary splits.
• One option could be the equivalent misclassification rate:

E = \sum_{m=1}^{M} \left( 1 - \max_k \hat{p}_{mk} \right)

• Here \hat{p}_{mk} represents the proportion of training observations in the mth region that are from the kth class.
Gini index
• An alternative to misclassification rate is the Gini index.
• It is defined by

G = \sum_{m=1}^{M} \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)

• It is a measure of total variance across the K classes. The Gini index takes on a small value if all of the \hat{p}_{mk}'s are close to zero or one.
• For this reason the Gini index is referred to as a measure of node purity – a small value indicates that a node contains predominantly observations from a single class.
Information gain
• An alternative to the Gini index is cross-entropy, given by

D = - \sum_{m=1}^{M} \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
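The three criteria are easy to compare numerically. Below is a small sketch (our own helper, `node_impurities`) that computes them for a single node from its vector of class counts; summing the per-node values over the M terminal nodes gives E, G and D as defined above.

```python
import numpy as np

def node_impurities(counts):
    """Misclassification rate, Gini index and cross-entropy of one node,
    given the class counts observed in that node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    error = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # 0*log(0) terms contribute 0
    return error, gini, entropy

# Example: a node containing 3 observations of class 1 and 1 of class 2.
print(node_impurities([3, 1]))   # (0.25, 0.375, 0.562...)
```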
• Some people believe that decision trees more closely mirror human decision-making than do other regression and classification approaches.
• Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
• Trees can easily handle qualitative predictors without the need to create dummy variables.
• Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
• That depends on the data available, data complexity, the number of free model parameters, etc.
• Given a set of n independent observations Z1, …, Zn, each with variance 𝜎², the variance of their mean is 𝜎²/n.
• In other words, averaging a set of observations reduces variance.
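A quick numerical check of this fact (a toy simulation, not from the slides): with 𝜎 = 2 and n = 25, the variance of a single observation is about 4, while the variance of the mean is about 4/25 = 0.16.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=2.0, size=(10_000, 25))  # 10,000 samples of n = 25 obs
print(z[:, 0].var())         # ~4.0  -> variance of a single observation (sigma^2)
print(z.mean(axis=1).var())  # ~0.16 -> variance of the sample mean (sigma^2 / n)
```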
Bagging
[Diagram: the dataset is resampled into Sample 1, Sample 2, …, Sample N – bootstrap samples of the training data – and a separate model (Model 1, Model 2, …, Model N) is trained on each sample.]
• That is, it takes repeated samples from the training set.
Bagging
• Algorithm for regression:
1. Generate B different bootstrapped training data sets.
2. Train the method on the bth bootstrapped training set in order to get \hat{f}^{*b}(x), the prediction at a point x.
3. The final prediction is the average over all predictions:

\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)
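A minimal sketch of this algorithm for regression trees (assuming NumPy arrays `X` and `y`; the names `bagged_trees` and `predict_bagged` are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, random_state=0):
    """Train B regression trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X_new):
    """Final prediction = average of the B individual tree predictions."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)
```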
Bagging
• For classification – instead of taking the average, the overall predicted value (class) is the majority vote, i.e. the mode of all predictions:

\hat{f}_{\text{bag}}(x) = \operatorname{mode}_{b=1}^{B} \hat{f}^{*b}(x)

• In principle, bagging can be applied to any machine learning method, but it is usually used with decision trees.
Random forests
• Random forests are similar to bagged trees, but with a tweak to decrease the variance even further:
• When building these decision trees, each time a split in a tree is considered, a random selection of m input variables is chosen as split candidates from the full set of p.
• Usually we choose m ≈ √p.
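A sketch using scikit-learn's implementation, where `max_features` plays the role of m (the data arrays `X`, `y` are assumed to exist):

```python
from sklearn.ensemble import RandomForestClassifier

# m ~ sqrt(p) input variables are considered at each split;
# max_features=None would consider all p variables, which is plain bagging.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
# rf.fit(X, y)
```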
• For regression: we record the total amount that the RSS is decreased due to splits over a given input variable, averaged over all trees. A large value indicates an important variable.
• For classification: we add up the total amount that the Gini index (or cross-entropy measure) is decreased by splits over a given input variable, averaged over all trees.
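With a fitted scikit-learn forest these impurity-decrease importances are available directly; a small sketch (assuming a fitted forest `rf`, such as the one above, and a hypothetical list `feature_names`):

```python
# feature_importances_ holds the normalised impurity decrease (RSS or Gini)
# attributable to each input variable, averaged over all trees in the forest.
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```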
Ensemble learning
[Diagram: the dataset gives rise to Dataset 1, Dataset 2, …, Dataset N; a model is trained on each (Model 1, Model 2, …, Model N), producing Prediction 1, Prediction 2, …, Prediction N.]
• Ensemble methods train multiple models to solve the same problem.
• In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use.
[Diagram: stacking – the dataset is fed to Algorithm 1, 2, 3 and 4, producing Predictions 1–4; a meta-learner (combiner) forms the final prediction as a (possibly non-linear) combination of these predictions.]
Stacking – algorithm
Level 0 – Base learners:
[Table: the original dataset is split into Fold 1, Fold 2 and Fold 3 together with the true values (Truth). For each split, two folds are used for training (Tr_split_1) and one for testing (Tst_split_1); the responses of Algorithm 1, Algorithm 2 and Algorithm 3 on the test folds (Resp_1_1 … Resp_3_3) are collected alongside the corresponding true values (Truth_1, Truth_2, Truth_3).]
• New ML problem: the inputs are the algorithm responses on the test sets from level 0; the outputs are the original true values.
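A sketch of this level-0/level-1 scheme (our own function names; the out-of-fold level-0 responses are produced with `cross_val_predict`, and a plain linear model stands in for the meta-learner, although the combination may also be non-linear):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression

def fit_stack(X, y, base_learners, cv=3):
    # Level 0: out-of-fold responses of each base algorithm become the new inputs.
    Z = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_learners])
    # Level 1: the meta-learner maps algorithm responses to the original true values.
    meta = LinearRegression().fit(Z, y)
    # Refit the base learners on all the data for use at prediction time.
    return [m.fit(X, y) for m in base_learners], meta

def predict_stack(base_learners, meta, X_new):
    Z_new = np.column_stack([m.predict(X_new) for m in base_learners])
    return meta.predict(Z_new)
```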
Summary
We learnt about tree-based methods and ensemble learning (classification and regression trees – CART, bagging, random forests).
Exercise
1. Estimate the RSS of a regression tree that was generated from the dataset in the table.
2. Predict Salary for Years = 2, 5, 14.
[Figure: the tree has a single split, x1 (Years) < 11, giving regions R1 (Salary1 = ?) and R2 (Salary2 = ?).]
Solution
• RSS estimation:
Salary1 = (75 + 145 + 155 + 1600 + 600 + 1008.333) / 6 = 597.2
Salary2 = (825 + 733.333) / 2 = 779.2
RSS = (75 − 597.2)² + (145 − 597.2)² + (155 − 597.2)² + (1600 − 597.2)² + (600 − 597.2)² + (1008.333 − 597.2)² + (825 − 779.2)² + (733.333 − 779.2)²
RSS = 1851566
• Predictions:
Salary(Years = 2) = 597.2, Salary(Years = 5) = 597.2, Salary(Years = 14) = 779.2
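The same numbers can be checked with a few lines of NumPy (a verification sketch using the salaries listed above; the small difference from 1851566 comes from the rounded means 597.2 and 779.2 used in the hand calculation):

```python
import numpy as np

r1 = np.array([75, 145, 155, 1600, 600, 1008.333])  # observations with Years < 11
r2 = np.array([825, 733.333])                       # observations with Years >= 11
print(r1.mean(), r2.mean())                         # ~597.2 and ~779.2
rss = np.sum((r1 - r1.mean()) ** 2) + np.sum((r2 - r2.mean()) ** 2)
print(rss)                                          # ~1.85e6
```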
Exercise
• Draw the resulting partitioning of a 2-D feature space according to this binary tree, for x1, x2 > 0.
[Tree: root split x1 < 4. If yes: split x2 < 2; when that is also yes, split x1 < 1 (yes → split x2 < 1 into R1 and R2; no → R3); when x2 ≥ 2 → R4. If x1 ≥ 4: split x2 < 3 (yes → R5, no → R6).]
Solution
[Figure: the tree and the resulting partition of the (x1, x2) plane: R1 = {x1 < 1, x2 < 1}, R2 = {x1 < 1, 1 ≤ x2 < 2}, R3 = {1 ≤ x1 < 4, x2 < 2}, R4 = {x1 < 4, x2 ≥ 2}, R5 = {x1 ≥ 4, x2 < 3}, R6 = {x1 ≥ 4, x2 ≥ 3}.]
Exercise
• We have a 1-D dataset. Two classification trees (A and B) were fitted. The cut-points for x1 were 4.6 and 3.0, respectively.
• The table shows the dataset along with the predicted classes (ŷ_A and ŷ_B) by the two trees.
• Estimate the misclassification rate, Gini index and cross-entropy.

x1     y    ŷ_A   ŷ_B
1.3    c1   c1    c1
2.9    c1   c1    c1
3.1    c1   c1    c2
4.5    c2   c1    c2
4.8    c1   c2    c2
6.1    c2   c2    c2
7.2    c2   c2    c2
8.9    c2   c2    c2
Solution
Tree A (cut-point x1 < 4.6): the node predicting C1 contains (3 c1, 1 c2); the node predicting C2 contains (1 c1, 3 c2).
E_A = (1 − 3/4) + (1 − 3/4) = 0.50
G_A = [3/4 (1 − 3/4) + 1/4 (1 − 1/4)] + [1/4 (1 − 1/4) + 3/4 (1 − 3/4)] = 0.75
D_A = [−3/4 ln(3/4) − 1/4 ln(1/4)] + [−1/4 ln(1/4) − 3/4 ln(3/4)] = 1.12
Tree B (cut-point x1 < 3.0): the node predicting C1 contains (2 c1, 0 c2); the node predicting C2 contains (2 c1, 4 c2).
E_B = (1 − 2/2) + (1 − 4/6) = 0.33
G_B = 0 + [1/3 (1 − 1/3) + 2/3 (1 − 2/3)] = 0.44
D_B = 0 + [−1/3 ln(1/3) − 2/3 ln(2/3)] = 0.64
Notice that the misclassification rate is less sensitive to node purity than the Gini index and cross-entropy: both trees misclassify two of the eight observations, yet the Gini index and cross-entropy clearly reward tree B for its pure C1 node.
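These values can be reproduced with the node-impurity helper sketched in the Gini/cross-entropy section (repeated here so the snippet is self-contained):

```python
import numpy as np

def node_impurities(counts):
    """Misclassification rate, Gini index and cross-entropy of one node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1 - p.max(), np.sum(p * (1 - p)), -np.sum(p[p > 0] * np.log(p[p > 0]))

# Tree A nodes: (3 c1, 1 c2) and (1 c1, 3 c2); tree B nodes: (2 c1, 0 c2) and (2 c1, 4 c2).
for name, nodes in [("A", [(3, 1), (1, 3)]), ("B", [(2, 0), (2, 4)])]:
    E, G, D = (sum(values) for values in zip(*(node_impurities(c) for c in nodes)))
    print(f"Tree {name}: E = {E:.2f}, G = {G:.2f}, D = {D:.2f}")
# Tree A: E = 0.50, G = 0.75, D = 1.12
# Tree B: E = 0.33, G = 0.44, D = 0.64
```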