Session 6 - Decision Tree


DECISION TREE

“Nothing is particularly hard if you divide it into small jobs.”

- Henry Ford

DECISION TREES: INTRODUCTION

• A decision tree is a supervised learning algorithm used for predicting both discrete and continuous dependent variables.
• In decision tree learning, when the response (target) variable takes discrete values, the trees are called classification trees; decision trees are effective for such classification problems.
• When the response variable takes continuous values, the trees are called regression trees.

DECISION TREE

[Figure: an example decision tree, with a root node split into internal nodes and leaf nodes.]
CLASSIFICATION AND REGRESSION TREE

• Classification and Regression Tree (CART) is a common umbrella term for a Classification Tree (used when the dependent variable is discrete) and a Regression Tree (used when the dependent variable is continuous).

• A classification tree uses impurity measures such as the Gini Impurity Index and Entropy to split the nodes.

• A regression tree, on the other hand, chooses the split that minimizes the Sum of Squared Errors (SSE); a code sketch follows below.
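To make the distinction concrete, here is a minimal scikit-learn sketch (the toy arrays are made up for illustration; this is one possible implementation, not the deck's own):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data: two numeric features per sample (values are illustrative only).
X = [[0, 0], [1, 1], [0, 1], [1, 0]]

# Classification tree: discrete target, split quality measured by Gini impurity.
clf = DecisionTreeClassifier(criterion="gini")  # criterion="entropy" also works
clf.fit(X, [0, 1, 0, 1])
print(clf.predict([[1, 1]]))

# Regression tree: continuous target, splits chosen to minimize squared error (SSE).
reg = DecisionTreeRegressor(criterion="squared_error")
reg.fit(X, [0.0, 1.2, 0.4, 0.9])
print(reg.predict([[1, 1]]))
```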

TRAINING DATA EXAMPLE: GOAL IS TO PREDICT WHETHER THE PLAYER WILL PLAY TENNIS

[Table: 14 days of training data with the features Outlook, Temperature, Humidity and Wind, and the target decision Play Tennis (9 Yes, 5 No).]
• Outlook is a nominal feature. It can be sunny, overcast or rain. The final decisions for the outlook feature are summarized below.

Outlook Yes No Number of instances


Sunny 2 3 5
Overcast 4 0 4
Rain 3 2 5

• Gini(Outlook=Sunny) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
• Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0
• Gini(Outlook=Rain) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48

Then we calculate the weighted sum of the Gini indexes for the outlook feature:
• Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
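This is easy to check in code. A small sketch (the helper names are my own):

```python
def gini(counts):
    """Gini impurity of a node, given its class counts, e.g. [yes, no]."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted Gini impurity over the child nodes produced by a split."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

# [yes, no] counts for Sunny, Overcast and Rain, from the table above.
print(weighted_gini([[2, 3], [4, 0], [3, 2]]))  # 0.3428... (the slide's 0.342 truncates the terms)
```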
• Similarly, Temperature is a nominal feature and can take three values: Hot, Cool and Mild. The decisions for the temperature feature are summarized below.

Temperature Yes No Number of instances


Hot 2 2 4
Cool 3 1 4
Mild 4 2 6

• Gini(Temp=Hot) = 1 – (2/4)² – (2/4)² = 0.5
• Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
• Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445

The weighted sum of the Gini indexes for the temperature feature:
• Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439
• Humidity is a binary feature. It can be high or normal.

Humidity Yes No Number of instances
High 3 4 7
Normal 6 1 7

• Gini(Humidity=High) = 1 – (3/7)² – (4/7)² = 1 – 0.183 – 0.326 = 0.489
• Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.734 – 0.02 = 0.244

The weighted sum for the humidity feature:
• Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367

• Wind is a binary feature, similar to humidity. It can be weak or strong.

Wind Yes No Number of instances
Weak 6 2 8
Strong 3 3 6

• Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
• Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
• Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428

Feature Gini index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428

Outlook has the lowest Gini index, so it is chosen as the root split; the sketch below ranks the four candidates programmatically.
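A self-contained sketch that scores all four candidate splits from the count tables above and picks the winner:

```python
def weighted_gini(groups):
    """Weighted Gini impurity of a split, given [yes, no] counts per feature value."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * (1 - sum((c / sum(g)) ** 2 for c in g))
               for g in groups)

features = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [3, 1], [4, 2]],   # Hot, Cool, Mild
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Wind":        [[6, 2], [3, 3]],           # Weak, Strong
}
scores = {name: weighted_gini(groups) for name, groups in features.items()}
print(min(scores, key=scores.get))  # -> Outlook
```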

• Focus on the sub-dataset for the sunny outlook. We need to find the Gini index scores for the temperature, humidity and wind features on this subset.

Day Outlook Temp. Humidity Wind Decision


1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

• Gini of temperature for the sunny outlook:

Temperature Yes No Number of instances
Hot 0 2 2
Cool 1 0 1
Mild 1 1 2

• Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0
• Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
• Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
• Gini(Outlook=Sunny and Temp.) = (2/5) × 0 + (1/5) × 0 + (2/5) × 0.5 = 0.2

• Gini of humidity for the sunny outlook:

Humidity Yes No Number of instances
High 0 3 3
Normal 2 0 2

• Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0
• Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
• Gini(Outlook=Sunny and Humidity) = (3/5) × 0 + (2/5) × 0 = 0

• Gini of wind for the sunny outlook:

Wind Yes No Number of instances
Weak 1 2 3
Strong 1 1 2

• Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 1 – 0.111 – 0.444 = 0.444
• Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
• Gini(Outlook=Sunny and Wind) = (3/5) × 0.444 + (2/5) × 0.5 = 0.266 + 0.2 = 0.466

DECISION FOR SUNNY OUTLOOK
• We’ve calculated the Gini index scores for each feature when the outlook is sunny. The winner is humidity because it has the lowest value.

Feature Gini index


Temperature 0.2
Humidity 0
Wind 0.466
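A quick pandas sketch that checks this sub-table (the five rows are copied from the sunny subset above; function and column names are my own):

```python
import pandas as pd

sunny = pd.DataFrame({
    "Temp":     ["Hot", "Hot", "Mild", "Cool", "Mild"],
    "Humidity": ["High", "High", "High", "Normal", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong"],
    "Decision": ["No", "No", "No", "Yes", "Yes"],
})

def weighted_gini(df, feature):
    """Weighted Gini impurity of splitting df on the given feature."""
    score = 0.0
    for _, group in df.groupby(feature):
        p = group["Decision"].value_counts(normalize=True)
        score += len(group) / len(df) * (1 - (p ** 2).sum())
    return score

for f in ["Temp", "Humidity", "Wind"]:
    print(f, round(weighted_gini(sunny, f), 3))  # Humidity scores 0.0 and wins
```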

GINI IMPURITY

• The Gini impurity index is one of the measures of impurity that is used by classification trees to split the nodes:

$$GI(t) = \sum_{i=1}^{K} \sum_{\substack{j=1 \\ j \neq i}}^{K} P(C_i \mid t)\, P(C_j \mid t) = \sum_{i=1}^{K} P(C_i \mid t)\bigl(1 - P(C_i \mid t)\bigr) = 1 - \sum_{i=1}^{K} \bigl[P(C_i \mid t)\bigr]^2$$

where
GI(t) = Gini index at node t
P(Ci|t) = proportion of observations belonging to class Ci in node t

• The lower the Gini Impurity, the higher the homogeneity of the node. The Gini Impurity of a pure node is zero.
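The three forms of the formula are algebraically identical, which is easy to verify numerically (the class proportions below are made up for illustration):

```python
from itertools import permutations

p = [0.5, 0.3, 0.2]  # P(C_i | t) for K = 3 classes (illustrative values)

form1 = sum(p[i] * p[j] for i, j in permutations(range(len(p)), 2))
form2 = sum(pi * (1 - pi) for pi in p)
form3 = 1 - sum(pi ** 2 for pi in p)
print(form1, form2, form3)  # all three are 0.62 (up to floating-point rounding)
```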

ENTROPY & INFORMATION GAIN
• Entropy is a popular measure of impurity that is used in classification trees to split a node.
• Entropy measures the degree of randomness in the data. For a set of samples $X$ with $k$ classes:

$$\text{entropy}(X) = -\sum_{i=1}^{k} p_i \log_2(p_i)$$

• The information gain of an attribute $a$ is the expected reduction in entropy due to splitting on the values of $a$ (here $X_v$ is the subset of $X$ for which $a = v$):

$$\text{gain}(X, a) = \text{entropy}(X) - \sum_{v \in \text{Values}(a)} \frac{|X_v|}{|X|}\, \text{entropy}(X_v)$$
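These two definitions translate directly into code. A sketch (helper names are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting `rows` on the attribute at index `attr`."""
    g = entropy(labels)
    for v in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g
```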
BEST ATTRIBUTE = HIGHEST INFORMATION GAIN

Does it fly Color Class
No Brown Mammal
No White Mammal
Yes Brown Bird
Yes White Bird
No White Mammal
No Brown Bird
Yes White Bird

Overall there are 3 Mammals and 4 Birds. Splitting on Color gives: Brown → 1 Mammal, 2 Birds; White → 2 Mammals, 2 Birds.

entropy(Class) = – p_mammal·log₂(p_mammal) – p_bird·log₂(p_bird) = –(3/7)·log₂(3/7) – (4/7)·log₂(4/7) ≈ 0.985
entropy(X_color=brown) = –(1/3)·log₂(1/3) – (2/3)·log₂(2/3) ≈ 0.918
entropy(X_color=white) = 1
gain(X, color) = 0.985 – (3/7) × 0.918 – (4/7) × 1 ≈ 0.020
BEST ATTRIBUTE = HIGHEST INFORMATION GAIN

For the same data, splitting on Does it fly gives: Yes → 0 Mammals, 3 Birds; No → 3 Mammals, 1 Bird.

entropy(X_fly=yes) = 0
entropy(X_fly=no) = –(3/4)·log₂(3/4) – (1/4)·log₂(1/4) ≈ 0.811
gain(X, fly) = 0.985 – (3/7) × 0 – (4/7) × 0.811 ≈ 0.521

Since gain(X, fly) > gain(X, color), "Does it fly" is the better attribute to split on.
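The two gains can be verified with entropy/gain helpers like those sketched earlier; a self-contained check:

```python
from collections import Counter
from math import log2

# (Does it fly, Color, Class): the seven rows from the table above.
rows = [("No", "Brown", "Mammal"), ("No", "White", "Mammal"),
        ("Yes", "Brown", "Bird"),  ("Yes", "White", "Bird"),
        ("No", "White", "Mammal"), ("No", "Brown", "Bird"),
        ("Yes", "White", "Bird")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(attr):
    labels = [r[2] for r in rows]
    g = entropy(labels)
    for v in {r[attr] for r in rows}:
        subset = [r[2] for r in rows if r[attr] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

print(round(gain(0), 2), round(gain(1), 2))  # fly ≈ 0.52, color ≈ 0.02
```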
EXERCISE

• IG(Outlook) = Entropy(Play Tennis) – [(5/14) × Entropy(Play Tennis | Outlook = Sunny) + (4/14) × Entropy(Play Tennis | Outlook = Overcast) + (5/14) × Entropy(Play Tennis | Outlook = Rainy)]
= 0.940 – [(5/14) × 0.971 + (4/14) × 0.0 + (5/14) × 0.971]
= 0.246

• IG(Temperature) = Entropy(Play Tennis) – [(4/14) × Entropy(Play Tennis | Temperature = Hot) + (4/14) × Entropy(Play Tennis | Temperature = Cool) + (6/14) × Entropy(Play Tennis | Temperature = Mild)]
= 0.940 – [(4/14) × 1.0 + (4/14) × 0.811 + (6/14) × 0.918]
= 0.029

• IG(Humidity) = Entropy(Play Tennis) – [(7/14) × Entropy(Play Tennis | Humidity = High) + (7/14) × Entropy(Play Tennis | Humidity = Normal)]
= 0.940 – [(7/14) × 0.985 + (7/14) × 0.592]
= 0.152

• IG(Wind) = Entropy(Play Tennis) – [(8/14) × Entropy(Play Tennis | Wind = Weak) + (6/14) × Entropy(Play Tennis | Wind = Strong)]
= 0.940 – [(8/14) × 0.811 + (6/14) × 1.0]
= 0.048

The Outlook feature has the highest information gain, so it is chosen as the root split, in agreement with the Gini analysis; the sketch below recomputes all four gains.
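A sketch that recomputes the four gains directly from the per-feature [yes, no] count tables earlier in the deck (the overall class counts are 9 Yes / 5 No):

```python
from math import log2

def H(counts):
    """Entropy (bits) from a list of class counts; empty classes are skipped."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def ig(groups):
    """Information gain of a split, given [yes, no] counts per feature value."""
    n = sum(sum(g) for g in groups)
    return H([9, 5]) - sum(sum(g) / n * H(g) for g in groups)

print(round(ig([[2, 3], [4, 0], [3, 2]]), 3))  # Outlook     -> 0.247
print(round(ig([[2, 2], [3, 1], [4, 2]]), 3))  # Temperature -> 0.029
print(round(ig([[3, 4], [6, 1]]), 3))          # Humidity    -> 0.152
print(round(ig([[6, 2], [3, 3]]), 3))          # Wind        -> 0.048
```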

COMPARISON OF GINI INDEX AND ENTROPY

[Figure: Gini index and entropy plotted as functions of the proportion of one class, over the range 0 to 1. Both curves are zero for a pure node and peak at a 50/50 split.]
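A matplotlib sketch that reproduces this comparison for a two-class node:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                    # proportion of class 1
gini = 1 - p**2 - (1 - p)**2                          # Gini impurity
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)  # entropy in bits

plt.plot(p, gini, label="Gini")
plt.plot(p, entropy, label="Entropy")
plt.xlabel("Proportion of class 1")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```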

STEPS
The following steps are used to generate a classification or a regression tree (Breiman et al. 1984):

• Step 1: Start with the complete training data in the root node.
• Step 2: Decide on the measure of impurity (usually the Gini impurity index or entropy). Choose the predictor variable that minimizes the impurity when the parent node is split into child nodes (see the Gini impurity formula above).
• This happens when the original data is divided into two subsets using a predictor variable such that it results in the maximum reduction in impurity in the case of a discrete dependent variable, or the maximum reduction in SSE in the case of a continuous dependent variable.

STEPS

• Step 3: Repeat step 2 for each subset of the data (for each internal node) using the independent variables until:
  - All the independent variables are exhausted.
  - A stopping criterion is met. Common stopping criteria are the number of levels of the tree from the root node, the minimum number of observations in a parent/child node (e.g., 10% of the training data), and the minimum reduction in the impurity index.

• Step 4: Generate business rules for the leaf (terminal) nodes of the tree. These steps map directly onto scikit-learn's hyperparameters, as sketched below.
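A hedged sketch of this mapping (the dataset and parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",                     # Step 2: impurity measure ("entropy" also valid)
    max_depth=3,                          # Step 3: max levels from the root node
    min_samples_split=int(0.1 * len(X)),  # Step 3: min observations in a parent node
    min_impurity_decrease=0.01,           # Step 3: min reduction in impurity
)
tree.fit(X, y)

# Step 4: print the fitted tree as if-then "business rules".
print(export_text(tree, feature_names=list(load_iris().feature_names)))
```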

CREDIT RATING OR CREDIT SCORING

• Credit ratings estimate the level of risk involved in investing in or lending money to a particular business or other entity.
• A high credit rating indicates that, in the rating agency's opinion, a bond issuer is likely to repay its debts to investors without difficulty.
• A poor credit rating suggests the issuer might struggle to make its payments, or even fail to make them.
• Investors and lenders use credit ratings to decide whether to do business with the rated entity, and to determine how much interest they would expect to receive as compensation for the risk involved.

CREDIT RATING DATA

S.No | CHK_ACCT | Duration | Credit History | Credit Amount | Balance in Savings A/C | Employment | Install_rate | Marital status | Present Resident | Age | Other installment | Num_Credits | Job | Credit classification | Credit Rating
1 | 0DM | 6 | critical | 1169 | unknown | over-seven | 4 | Single | 4 | 67 | 1 | 2 | Unskilled | good. | 0
2 | less-200DM | 48 | all-paid-duly | 5951 | less100DM | four-years | 2 | female-divorced | 2 | 22 | 0 | 1 | skilled | bad. | 1
3 | no-account | 12 | critical | 2096 | less100DM | seven-years | 2 | Single | 3 | 49 | 0 | 1 | Unskilled | good. | 0
4 | 0DM | 42 | all-paid-duly | 7882 | less100DM | seven-years | 2 | Single | 4 | 45 | 0 | 1 | skilled | good. | 0
5 | 0DM | 24 | delay | 4870 | less100DM | four-years | 3 | Single | 4 | 53 | 1 | 2 | skilled | bad. | 1
6 | no-account | 36 | all-paid-duly | 9055 | unknown | four-years | 2 | Single | 4 | 35 | 0 | 1 | Unskilled | good. | 0
7 | no-account | 24 | all-paid-duly | 2835 | Between 500 and 1000 DM | over-seven | 3 | Single | 4 | 53 | 0 | 1 | skilled | good. | 0

EXAMPLE

Attribute | Variable Name | Description
1. Credit Rating | Credit_Rating | Bad (1); Good (0)
2. Checking Account Balance (DM – Deutsche Mark) | CHK_ACCT | No account; 0 DM; >0 & ≤ 200 DM; > 200 DM
3. Duration | Duration | Duration of the credit
4. Credit Amount | CreditAmount | Amount of credit given
5. Balance in Savings Account | Balance_Savings | Unknown; <100 DM; ≥ 100 DM & <500 DM; ≥ 500 DM & <1000 DM; ≥ 1000 DM
6. Employment in Years | Employment | Unemployed; <1 Year; ≥ 1 Year & <4 Years; ≥ 4 Years & <7 Years; ≥ 7 Years
Attribute | Variable Name | Description
7. Installment Rate | Install_rate | As a percentage of disposable income
8. Marital Status | Maritalstatus | Single/Married; Divorced Male; Divorced Female
9. Present Resident | PresentResident | Present residence in years
10. Age | Age | Age of the applicant in years
11. Other Installments | Other_installment | Applicant has other installments [1]; applicant has no other installments [0]
12. Number of Credits | Num_Credits | Number of existing credits in the bank
13. Job | Job | Unskilled; Skilled; Management; Unemployed
14. Credit History | CreditHistory | All paid duly [no credit taken / all credits paid back duly]; Bank paid duly [all credits at this bank paid back duly]; Delay [delay in paying off in the past]; Critical [critical account / other credits existing (not at this bank)]
Dependent Variable: Credit Rating

Node 0 (root, 700 obs, 100%): Good (0) 490 (70%), Bad (1) 210 (30%)

Split: Checking Account?
• 0 DM, >0 & ≤ 200 DM, > 200 DM → Node 1 (426 obs, 61%): Good (0) 251 (59%), Bad (1) 175 (41%)
• No Account → Node 2 (274 obs, 39%): Good (0) 239 (87%), Bad (1) 35 (13%)

Node 1 splits on Duration?
• ≤ 22.5 months → Node 3 (240 obs, 34%): Good (0) 164 (68%), Bad (1) 76 (32%)
• > 22.5 months → Node 4 (186 obs, 27%): Good (0) 87 (47%), Bad (1) 99 (53%)

Node 2 splits on Credit Amount?
• ≤ 3891 DM → Node 5 (203 obs, 29%): Good (0) 186 (92%), Bad (1) 17 (8%)
• > 3891 DM → Node 6 (71 obs, 10%): Good (0) 53 (75%), Bad (1) 18 (25%)
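A hedged sketch of how a tree like this could be grown and turned into rules with scikit-learn. The file name credit.csv, the column handling, and the hyperparameters are assumptions; the slide's tree was produced by a different tool, so the exact splits would not be reproduced:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file holding the columns from the data dictionary above.
df = pd.read_csv("credit.csv")

y = df["Credit_Rating"]                                 # 0 = good, 1 = bad
X = pd.get_dummies(df.drop(columns=["Credit_Rating"]))  # one-hot encode categoricals

tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50)
tree.fit(X, y)

# Business rules for the leaf nodes (Step 4 of the CART procedure).
print(export_text(tree, feature_names=list(X.columns)))
```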
