Session 04 - Tree-Based Methods

The document discusses a session on tree-based methods and ensemble learning. It will cover regression trees, decision trees, and their terminology. An example baseball salary dataset is used to demonstrate how the data could be stratified into regions based on average hits and experience, and represented visually as a decision tree. Regression trees are also introduced for predicting a continuous variable, where the response is the mean of training observations in the region a test observation falls into.



Session 4 – Tree-based methods and ensemble learning
Dr Ivan Olier
[email protected]
ECI – International Summer School / Machine Learning
2019

In this session
• We will learn about tree-based methods and ensemble learning


Example – Baseball salary data*


• The figure shows the salary of baseball players as a function of average hits and years of experience.
• Salary is color-coded from low (blue, green) to high (yellow, red).
• How would you stratify the data?

*Hitters data (available from ISLR R package)


Example – Baseball salary data


• Salary data could be stratified as follows:
• R1 – IF Years < 4.5 THEN Salary = low
• R2 – IF Years >= 4.5 AND Hits < 117.5 THEN Salary = average
• R3 – IF Years >= 4.5 AND Hits >= 117.5 THEN Salary = high

[Figure: the (Years, Hits) plane partitioned into R1, R2, and R3 by the cut-points Years = 4.5 and Hits = 117.5]


Representing the rules as a decision tree

[Figure: left – "Partitioning of the space…", the regions R1, R2, R3 defined by Years = 4.5 and Hits = 117.5; right – "Building a binary decision tree", the equivalent tree splitting first on x1 < 4.5 and then on x2 < 117.5. Input variables: x1 = Years, x2 = Hits]

Terminology for Trees


• We keep a tree analogy.
• Regions R1, R2, and R3 are known as terminal nodes or leaves.
• Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
• The points along the tree where the predictor space is split are referred to as internal nodes.
• Usually, the left-hand branch corresponds to Xj < tk, and the right-hand branch to Xj >= tk.
• In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.

[Figure: the Hitters tree annotated with the terms root, branches, internal nodes, and terminal nodes (leaves)]


Output of a decision tree


• The output of a decision tree is a set of rules.
• Examples:
  1) IF x1 >= 4.5 AND x3 >= 0 THEN "E"
  2) IF x1 < 4.5 AND x2 < −2 AND x4 >= 100 THEN "B"
  3) …

[Figure: example tree splitting on x1 < 4.5, then x2 < −2 and x3 < 0, then x4 < 100, with leaves A, B, C, D, and E]
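As a hedged illustration (not part of the slides), the sketch below fits a small regression tree with scikit-learn and prints its rules as text; the data are synthetic, and only the variable names Years and Hits follow the slides' Hitters example.

```python
# Illustrative sketch: fit a regression tree on synthetic "Hitters-like" data
# and print the learned IF ... THEN rules. All numbers here are made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 200
years = rng.integers(1, 20, size=n)     # hypothetical "Years"
hits = rng.integers(0, 240, size=n)     # hypothetical "Hits"
# Salary roughly follows the stratification described earlier in the slides.
salary = np.where(years < 4.5, 200,
                  np.where(hits < 117.5, 450, 950)) + rng.normal(0, 50, size=n)

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, salary)

# Each root-to-leaf path corresponds to one rule of the fitted tree.
print(export_text(tree, feature_names=["Years", "Hits"]))
```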


Regression trees
• Let’s assume we have the following regression problem:

$$Y = f(X_1, X_2)$$

• Estimate f using decision trees.
• Decision trees for regression are called “regression trees”.
• The figure displays the training data as a 3-D scatter of Y against X1 and X2.


Regression trees
• A tree training algorithm will partition the space following one of a variety of criteria.
• Different decision tree variants will follow different partitioning criteria.
• The figure shows one possible partitioning of the space.
• In regression trees, the predicted response takes a discrete set of values, one per region.
• In our example: $Y \in \{y_0, y_1, y_2, y_3, y_4\}$
• The response for a given test observation will be the mean of the training observations in the region to which that test observation belongs.

[Figure: a piecewise-constant surface over (X1, X2) with cut-points x1A, x1B, x2A, x2B and levels y0–y4]


Regression trees
[Figure: left – the partition of the (X1, X2) plane into five regions with responses y0–y4, defined by the cut-points x1A, x1B, x2A, x2B; right – the corresponding regression tree]


The tree-building process


• The goal is to find boxes 𝑅1 , … , 𝑅𝐽 that minimise the residual sum of squares (RSS), given
by:

$$RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$$

• where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.

• Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes.
• For this reason, we take a top-down, greedy approach that is known as recursive binary
partitioning.
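To make the RSS criterion concrete, here is a minimal Python sketch (an addition for illustration, not the slides' code) that computes the RSS of any assignment of observations to boxes; the toy salary values are assumptions.

```python
# Compute RSS = sum_j sum_{i in R_j} (y_i - mean(y in R_j))^2 for a partition.
import numpy as np

def partition_rss(y, box_ids):
    """Sum of squared deviations of y from its box mean, over all boxes."""
    y = np.asarray(y, dtype=float)
    box_ids = np.asarray(box_ids)
    rss = 0.0
    for j in np.unique(box_ids):
        y_j = y[box_ids == j]
        rss += np.sum((y_j - y_j.mean()) ** 2)
    return rss

# Example: toy salaries split into two boxes by Years < 4.5.
years = np.array([1, 2, 3, 5, 7, 12])
salary = np.array([75.0, 145.0, 155.0, 600.0, 825.0, 733.0])
print(partition_rss(salary, years < 4.5))
```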


Recursive binary partitioning


• We first select the input variable $x_j$ and the cut-point s such that splitting the feature space into the regions $\{x \mid x_j < s\}$ and $\{x \mid x_j \ge s\}$ leads to the greatest possible reduction in RSS.

• Next, we repeat the process, looking for the best input variable and best cut-point in order
to split the data further so as to minimise the RSS within each of the resulting regions.

• However, this time, instead of splitting the entire predictor space, we split one of the two
previously identified regions. We now have three regions.

• Again, we look to split one of these three regions further, so as to minimise the RSS. The
process continues until a stopping criterion is reached; for instance, we may continue until
no region contains more than five observations.
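The sketch below (illustrative only, not the slides' code) implements a single step of this greedy search in Python: it scans every input variable and candidate cut-point and returns the split with the lowest RSS.

```python
# One step of recursive binary splitting: find the (variable, cut-point) pair
# whose split of the current region gives the smallest RSS.
import numpy as np

def best_split(X, y):
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):                 # every input variable
        for s in np.unique(X[:, j]):            # every candidate cut-point
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                        # skip degenerate splits
            rss = np.sum((left - left.mean()) ** 2) + \
                  np.sum((right - right.mean()) ** 2)
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# The same function would then be applied again inside each resulting region.
```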


Recursive partitioning example – Hitters data (2 variables)

Input variables: x1 – Years, x2 – Hits

Iteration 1: split on x1 (Years) < 11 → R1 (Salary1 = 464) and R2 (Salary2 = 755); RSS = 49171342.


Recursive partitioning example – Hitters data (2 variables)

Iteration 2: split on x1 (Years) < 16.5 → R1 (Salary1 = 532) and R2 (Salary2 = 613); RSS = 53232041.


Recursive partitioning example – Hitters data (2 variables)

Iteration 3: split on x1 (Years) < 6.5 → R1 (Salary1 = 365) and R2 (Salary2 = 742); RSS = 44051523.


Recursive partitioning example – Hitters data (2 variables)

Split on x1 (Years) < 4.5 → R1 (Salary1 = 226) and R2 (Salary2 = 697); RSS = 40162637. Years = 4.5 is the optimum cut-point for Years.


Recursive partitioning example – Hitters data (2 variables)

With the Hits (x2) variable added, the process is repeated on the region Years >= 4.5. Splitting on x2 (Hits) < 125 gives R1 (Salary1 = 226), R2 (Salary2 = 494), and R3 (Salary3 = 963); RSS = 30820324.

Recursive partitioning example – Hitters data (2 variables)

Splitting on x2 (Hits) < 77 instead gives R1 (Salary1 = 226), R2 (Salary2 = 402), and R3 (Salary3 = 795); RSS = 35161358.


Recursive partitioning example – Hitters data (2 variables)

Splitting on x2 (Hits) < 117.5 gives R1 (Salary1 = 226), R2 (Salary2 = 465), and R3 (Salary3 = 949); RSS = 30037022. Hits = 117.5 is the optimum cut-point for Hits.

Interpretation of the results


• Years is the most important factor in determining Salary, and players with less experience
earn lower salaries than more experienced players.

• Given that a player is less experienced, the number of Hits that he made in the previous
year seems to play little role in his Salary.

• But among players who have been in the major leagues for five or more years, the number
of Hits made in the previous year does affect Salary, and players who made more Hits last
year tend to have higher salaries.


Recursive partitioning example – Hitters data (2 variables)

…but we can keep splitting the space until the RSS converges.


Overfitting problem
• The process described above may produce good predictions on the training set, but is
likely to overfit the data, leading to poor test set performance.

[Figure: RSS versus number of splits – training-data RSS keeps decreasing with more splits, while test-data RSS eventually starts to increase]


Pruning
• A better strategy is to grow a very large tree T0, and then prune it back in order to obtain a
subtree.
• Cost complexity pruning:
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|$$

• α – complexity parameter: a nonnegative tuning parameter that controls the trade-off between the subtree's complexity and its fit to the training data.
• For each value of α there corresponds a subtree T ⊂ T0.
• |T| indicates the number of terminal nodes of the tree T.
• 𝛼 can be selected by cross-validation.


Tree algorithm – Summary


1. Use recursive binary splitting to grow a large tree on the training data, stopping only
when each terminal node has fewer than some minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best
subtrees, as a function of 𝛼.

3. Use K-fold cross-validation to choose 𝛼.

4. Return the subtree from Step 2 that corresponds to the chosen value of 𝛼.
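A hedged scikit-learn sketch of these four steps is shown below; the synthetic data and settings are assumptions (the slides use the Hitters data from the ISLR R package).

```python
# Steps 1-4: grow a large tree, obtain the cost-complexity pruning path
# (candidate alphas), choose alpha by K-fold CV, and keep the pruned subtree.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Steps 1-2: the pruning path of a fully grown tree gives the candidate alphas.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)   # guard against tiny negative values

# Step 3: choose alpha by K-fold cross-validation.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X, y)

# Step 4: the best estimator is the pruned subtree for the chosen alpha.
print(search.best_params_, search.best_estimator_.get_n_leaves())
```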


Classification trees
• Very similar to regression trees
• But we predict a qualitative response (class) instead.
• Instead of using the mean, the predicted class of an observation will be the most commonly
occurring class (mode).
• In the classification setting, RSS cannot be used as a criterion for making the binary splits.
• One option could be the equivalent misclassification rate:
$$E = \sum_{m=1}^{M} \left( 1 - \max_k \hat{p}_{mk} \right)$$
• Here $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that are from the kth class.


Gini index
• An alternative to misclassification rate is the Gini index.
• It is defined by

$$G = \sum_{m=1}^{M} \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)$$

• It is a measure of total variance across the K classes. The Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are close to zero or one.

• For this reason the Gini index is referred to as a measure of node purity – a small value
indicates that a node contains predominantly observations from a single class.


Information gain
• An alternative to the Gini index is cross-entropy, given by

$$D = -\sum_{m=1}^{M} \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

• Cross-entropy is the impurity measure used to compute information gain (the reduction in entropy achieved by a split).


• It turns out that the Gini index and the cross-entropy are very similar numerically.
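The numerical behaviour of the three measures is easy to check with a small Python sketch (added here for illustration, not from the slides), evaluated for a single region with class proportions p̂_mk:

```python
# Misclassification rate, Gini index, and cross-entropy for one region.
import numpy as np

def impurities(p):
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                         # 1 - max_k p_mk
    gini = np.sum(p * (1.0 - p))                     # sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # -sum_k p_mk log p_mk
    return misclass, gini, entropy

print(impurities([0.9, 0.1]))   # nearly pure node: all three measures are small
print(impurities([0.5, 0.5]))   # evenly mixed node: all three at their maximum
```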


Trees Versus Linear Models

• Top row: true linear boundary; bottom row: true non-linear boundary.
• Left column: linear model; right column: tree-based model.


Advantages and Disadvantages of Trees


• Trees are very easy to explain to people. In fact, they are even easier to explain than linear
regression!

• Some people believe that decision trees more closely mirror human decision-making than
do other regression and classification approaches.

• Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).

• Trees can easily handle qualitative predictors without the need to create dummy variables.

• Unfortunately, trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches.


Reducing the variance


• The same data may be partitioned differently

• That depends on the data available, data complexity, number of free model parameters,
etc.


Reducing the variance


• One way to reduce the variance is by having more independent data:
• Given a set of n independent observations $Z_1, Z_2, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2 / n$.
• In other words, averaging a set of observations reduces variance.
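A quick numerical check of this statement (a sketch added here, not part of the slides):

```python
# Empirically, the variance of the mean of n independent observations with
# variance sigma^2 is close to sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 25
means = rng.normal(0.0, sigma, size=(100_000, n)).mean(axis=1)
print(means.var(), sigma**2 / n)   # both are close to 0.16
```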


Bagging
• Bagging uses the same principle, but instead of using independent datasets, it bootstraps the training data.
• That is, it takes repeated samples from the training set.

[Figure: the dataset is resampled into Sample 1 … Sample N; Model 1 … Model N are trained on the samples; the final prediction is the average over all N predictions]


Bagging
• Algorithm for regression:
1. Generate B different bootstrapped training data sets.
2. Train the method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, the prediction at a point x.
3. The final prediction is the average over all predictions:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
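A minimal Python sketch of this algorithm (illustrative only; the synthetic data and B = 100 are assumptions) using bootstrapped regression trees:

```python
# Bagging by hand: B trees on bootstrapped samples, predictions averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
B = 100

# Steps 1-2: one tree per bootstrapped training set (sampling with replacement).
trees = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))
    trees.append(DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx]))

# Step 3: the bagged prediction is the average over the B trees.
f_bag = np.mean([t.predict(X) for t in trees], axis=0)
```

scikit-learn's BaggingRegressor wraps the same idea, with a decision tree as its default base model.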


Bagging
• For classification – instead of taking the average, the overall predicted value (class) is the
majority vote – i.e. the mode of all predictions:

$$\hat{f}_{\text{bag}}(x) = \operatorname{mode}_{b=1,\ldots,B} \left\{ \hat{f}^{*b}(x) \right\}$$

• In principle, bagging can be applied to any machine learning method. But it is usually used
with decision trees.


Random forests
Random forests are similar to bagged trees, but with a tweak to decrease the variance even
further:

• As in bagging, a number of decision trees are built on bootstrapped training samples.

• But when building these decision trees, each time a split in a tree is considered, a random
selection of m input variables is chosen as split candidates from the full set of p.

• The split is allowed to use only one of those m predictors.

• Usually we choose $m \approx \sqrt{p}$.
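As a hedged sketch, scikit-learn's RandomForestRegressor exposes this restriction through max_features; the dataset and settings below are assumptions for illustration.

```python
# Random forest: 500 bootstrapped trees, each split restricted to a random
# subset of m = sqrt(p) candidate variables.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=9, noise=5.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,      # number of trees
    max_features="sqrt",   # m ~ sqrt(p) variables considered at each split
    random_state=0,
).fit(X, y)
```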


Example – Boston dataset


Task: Predict median housing value

Input variables (crime rate, zone, average age, taxes, etc)


Response variable:
Median Housing Value


Example – Solution using random forest

Solution with 500 trees


Variable importance measure


• One disadvantage of using bagged trees and RFs is the loss of interpretability compared with a single decision tree.

• However, we can still measure variable importance:

• For regression:
• We record the total amount that the RSS is decreased due to splits over a given input
variable, averaged over all trees. A large value indicates an important variable.

• For classification:
• We add up the total amount that the Gini index (or cross-entropy measure) is decreased
by splits over a given input variable, averaged over all trees.
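A small scikit-learn sketch of this idea (synthetic data, illustrative only): feature_importances_ reports the normalised impurity decrease attributable to each variable, averaged over the trees of the forest.

```python
# Impurity-based variable importance from a fitted random forest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Larger values indicate variables whose splits reduce the criterion more.
for j in np.argsort(rf.feature_importances_)[::-1]:
    print(f"x{j}: {rf.feature_importances_[j]:.3f}")
```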


Example – Variable importance

LSTAT - % lower status of the population
RM - average number of rooms per dwelling
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.


Ensemble learning
• Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem.
• In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them.
• Bagging and random forests are ensemble learning approaches.

[Figure: the dataset is resampled into Dataset 1 … Dataset N; Model 1 … Model N are trained and Prediction 1 … Prediction N are combined into the final prediction]


Stacking

[Figure: base learners (Algorithm 1–4) are trained on the dataset; their predictions (Predictions 1–4) form a new dataset on which a meta-learner (the combiner) produces the final prediction, a possibly non-linear combination]

• Stacking (or stacked generalisation, or super-learning) trains an algorithm (meta-learner)


to combine the predictions of several other algorithms.
• Performance is usually better than individual algorithms.
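A hedged scikit-learn sketch of this setup (the particular base learners and meta-learner chosen below are assumptions, not the slides' configuration):

```python
# Stacking: three base learners whose out-of-fold predictions feed a meta-learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[                         # level 0 - base learners
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("knn", KNeighborsRegressor()),
    ],
    final_estimator=Ridge(),             # level 1 - meta-learner (combiner)
    cv=3,                                # base predictions come from 3-fold splits
).fit(X, y)
```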


Stacking – algorithm
Level 0 – Base learners

Original dataset (3-fold cross-validation splits):

  Fold 1      | Fold 2      | Fold 3      | Truth
  Tr_split_1  | Tr_split_1  | Tst_split_1 | Truth_1
  Tr_split_2  | Tst_split_2 | Tr_split_2  | Truth_2
  Tst_split_3 | Tr_split_3  | Tr_split_3  | Truth_3

Algorithm responses (on the test split of each fold):

  Algorithm 1 | Algorithm 2 | Algorithm 3
  Resp_1_1    | Resp_2_1    | Resp_3_1
  Resp_1_2    | Resp_2_2    | Resp_3_2
  Resp_1_3    | Resp_2_3    | Resp_3_3

Level 1 – Meta-learner

A new ML problem: the inputs are the algorithm responses obtained on the level-0 test sets, and the outputs are the original true values.

  Algorithm 1 | Algorithm 2 | Algorithm 3 | Truth
  Resp_1_1    | Resp_2_1    | Resp_3_1    | Truth_1
  Resp_1_2    | Resp_2_2    | Resp_3_2    | Truth_2
  Resp_1_3    | Resp_2_3    | Resp_3_3    | Truth_3

Winning Data Science competitions


The following figure shows the winning solution for the 2015 KDD Cup competition


Summary
We learnt about tree-based methods and ensemble learning (classification and
regression trees – CART, bagging, random forests).


Exercise
1. Estimate the RSS of a regression tree that was generated from the dataset in the table.
2. Predict Salary for Years = 2, 5, 14.

[Figure: tree with a single split on x1 (Years) < 11 and two leaves, R1 (Salary1 = ??) and R2 (Salary2 = ??)]


Solution
• RSS estimation:

$$\hat{Salary}_1 = \frac{75 + 145 + 155 + 1600 + 600 + 1008.333}{6} = 597.2$$

$$\hat{Salary}_2 = \frac{825 + 733.333}{2} = 779.2$$

$$RSS = (75 - 597.2)^2 + (145 - 597.2)^2 + (155 - 597.2)^2 + (1600 - 597.2)^2 + (600 - 597.2)^2 + (1008.333 - 597.2)^2 + (825 - 779.2)^2 + (733.333 - 779.2)^2 = 1851566$$

• Predictions:

$$Salary(Years = 2) = 597.2, \quad Salary(Years = 5) = 597.2, \quad Salary(Years = 14) = 779.2$$


Exercise
• Draw the resulting partitioning of a 2-D feature space according to this binary tree, for x1, x2 > 0.

[Figure: binary tree –
  x1 < 4?
    yes: x2 < 2?
      yes: x1 < 1?
        yes: x2 < 1?  (yes: R1, no: R2)
        no: R3
      no: R4
    no: x2 < 3?  (yes: R5, no: R6)]


Solution
[Figure: the resulting partition of the (x1, x2) plane –
  R1: x1 < 1, x2 < 1
  R2: x1 < 1, 1 <= x2 < 2
  R3: 1 <= x1 < 4, x2 < 2
  R4: x1 < 4, x2 >= 2
  R5: x1 >= 4, x2 < 3
  R6: x1 >= 4, x2 >= 3]


Exercise
• We have a 1-D dataset. Two classification trees (A and B) were fitted. The cut-points for x1 were 4.6 and 3.0, respectively.
• The table shows the dataset along with the true classes (y) and the classes predicted by the two trees (ŷ_A and ŷ_B).
• Estimate the misclassification rate, Gini index, and cross-entropy.

  x1  | y (true class) | ŷ_A (Tree A) | ŷ_B (Tree B)
  1.3 | c1             | c1           | c1
  2.9 | c1             | c1           | c1
  3.1 | c1             | c1           | c2
  4.5 | c2             | c1           | c2
  4.8 | c1             | c2           | c2
  6.1 | c2             | c2           | c2
  7.2 | c2             | c2           | c2
  8.9 | c2             | c2           | c2


Solution

Tree A (cut-point x1 < 4.6): class counts (c1, c2) are (3, 1) in the left node (predicted C1) and (1, 3) in the right node (predicted C2).

$E_A = (1 - 3/4) + (1 - 3/4) = 0.5$
$G_A = 3/4\,(1 - 3/4) + 3/4\,(1 - 3/4) = 0.187$
$D_A = -3/4 \ln(3/4) - 3/4 \ln(3/4) = 0.216$

Tree B (cut-point x1 < 3.0): class counts (c1, c2) are (2, 0) in the left node (predicted C1) and (2, 4) in the right node (predicted C2).

$E_B = (1 - 2/4) + (1 - 4/4) = 0.5$
$G_B = 2/4\,(1 - 2/4) + 4/4\,(1 - 4/4) = 0.25$
$D_B = -2/4 \ln(2/4) - 4/4 \ln(4/4) = 0.347$

Notice that the misclassification rate is less sensitive than the Gini index and cross-entropy.