Session 6 - Decision Tree
DECISION TREES: INTRODUCTION
Decision trees are supervised learning algorithms used for predicting both discrete and continuous dependent variables.
When the response (target) variable takes discrete values, the resulting trees are called classification trees; decision trees are effective for such classification problems.
When the response variable takes continuous values, the resulting trees are called regression trees.
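To make the distinction concrete, here is a minimal sketch (an illustration only, not from the slides) of what each tree type predicts once an observation reaches a leaf: a classification leaf returns the majority class of its training labels, while a regression leaf returns the mean of its training targets.

```python
# Illustration: the two tree types differ in what a leaf predicts.
from collections import Counter

def classification_leaf_prediction(labels):
    """Majority class among the training labels that reached this leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf_prediction(targets):
    """Mean of the training targets that reached this leaf."""
    return sum(targets) / len(targets)

print(classification_leaf_prediction(["yes", "no", "yes"]))  # yes
print(regression_leaf_prediction([10.0, 12.0, 14.0]))        # 12.0
```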
DECISION TREE
CLASSIFICATION AND REGRESSION TREE
A classification tree uses an impurity measure such as the Gini impurity index or entropy to split its nodes.
A regression tree, on the other hand, chooses the split that minimizes the sum of squared errors (SSE).
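As a small illustration of the regression-tree criterion, the sketch below (hypothetical data and function names) scans candidate thresholds on a single feature and picks the one that minimizes the total SSE of the two resulting child nodes:

```python
# Sketch: score every candidate threshold on x by the sum of squared
# errors around each side's mean; the lowest total SSE wins.

def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_sse_split(xs, ys):
    pairs = sorted(zip(xs, ys))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint threshold
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, thr)
    return best  # (total SSE, threshold)

total, thr = best_sse_split([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
print(thr)  # 6.5 -- splits the low-target cluster from the high one
```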
TRAINING DATA EXAMPLE: GOAL IS TO PREDICT WHETHER THIS PLAYER WILL PLAY TENNIS
Outlook is a nominal feature. It can be sunny, overcast or rain. The final decisions for each feature are summarized below.
Humidity   Yes   No   Number of instances
High       3     4    7
Normal     6     1    7
Wind is a binary feature, similar to humidity. It can be weak or strong.
Wind     Yes   No   Number of instances
Weak     6     2    8
Strong   3     3    6
Feature       Gini index
Outlook       0.342
Temperature   0.439
Humidity      0.367
Wind          0.428
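The table above can be reproduced by hand: a feature's Gini index is the instance-weighted average of the Gini impurity of each of its branches. The sketch below assumes the branch class counts (yes, no) of the standard play-tennis dataset used in these slides:

```python
# Sketch: weighted Gini index of a categorical feature, given the
# (yes, no) class counts of each branch.

def gini(yes, no):
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2

def weighted_gini(branches):
    total = sum(yes + no for yes, no in branches)
    return sum((yes + no) / total * gini(yes, no) for yes, no in branches)

features = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],   # hot, mild, cool
    "Humidity":    [(3, 4), (6, 1)],           # high, normal
    "Wind":        [(6, 2), (3, 3)],           # weak, strong
}
for name, branches in features.items():
    print(f"{name}: {weighted_gini(branches):.3f}")
```

The computed values agree with the table up to rounding (the slide's values appear truncated rather than rounded).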
Focus on the sub-dataset for the sunny outlook. We need to find the Gini index scores for the temperature, humidity and wind features respectively.
Gini of temperature for sunny outlook
Temperature   Yes   No   Number of instances
Hot           0     2    2
Cool          1     0    1
Mild          1     1    2
Gini of humidity for sunny outlook
Humidity   Yes   No   Number of instances
High       0     3    3
Normal     2     0    2
Gini of wind for sunny outlook
Wind     Yes   No   Number of instances
Weak     1     2    3
Strong   1     1    2
DECISION FOR SUNNY OUTLOOK
We’ve calculated the Gini index scores for each feature when the outlook is sunny. The winner is humidity because it has the lowest value.
GINI IMPURITY

GI(t) = 1 - Σ_i P(C_i | t)^2

where
GI(t) = Gini index at node t
P(C_i | t) = proportion of observations belonging to class C_i in node t

The lower the Gini impurity, the higher the homogeneity of the node. The Gini impurity of a pure node is zero.
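A minimal sketch of this formula (illustration only), taking the class counts of a node and returning its Gini impurity:

```python
# GI(t) = 1 - sum_i P(C_i | t)^2, with P(C_i | t) the class proportions.

def gini_impurity(class_counts):
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

print(gini_impurity([5, 0]))  # 0.0 -> pure node
print(gini_impurity([5, 5]))  # 0.5 -> maximally mixed two-class node
```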
ENTROPY & INFORMATION GAIN
Entropy is a popular impurity measure used in classification trees to split a node.
Entropy measures the degree of randomness in the data.

Worked example (7 animals: 3 mammals and 4 birds):

entropy(Class) = -p_mammal * log2(p_mammal) - p_bird * log2(p_bird)
               = -(3/7) * log2(3/7) - (4/7) * log2(4/7) ≈ 0.985
entropy(X_color=brown) = -(1/3) * log2(1/3) - (2/3) * log2(2/3) ≈ 0.918
entropy(X_color=white) = -(2/4) * log2(2/4) - (2/4) * log2(2/4) = 1
gain(X, color) = 0.985 - (3/7) * 0.918 - (4/7) * 1 ≈ 0.020
BEST ATTRIBUTE = HIGHEST INFORMATION GAIN
Does it fly   Color   Class
No            Brown   Mammal
No            White   Mammal
Yes           Brown   Bird
Yes           White   Bird
No            White   Mammal
No            Brown   Bird
Yes           White   Bird

Branch class counts: brown = 1 mammal, 2 birds; white = 2 mammals, 2 birds; fly = no: 3 mammals, 1 bird.
entropy(Class) = -p_mammal * log2(p_mammal) - p_bird * log2(p_bird)
               = -(3/7) * log2(3/7) - (4/7) * log2(4/7) ≈ 0.985
entropy(X_fly=yes) = 0
entropy(X_fly=no) = -(3/4) * log2(3/4) - (1/4) * log2(1/4) ≈ 0.811
gain(X, fly) = 0.985 - (3/7) * 0 - (4/7) * 0.811 ≈ 0.521
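The two information gains can be checked numerically; this sketch hard-codes the class counts from the animal table (3 mammals, 4 birds):

```python
# Sketch: entropy and information gain from (class count) lists.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, branches):
    n = sum(parent_counts)
    children = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(parent_counts) - children

parent = [3, 4]                                       # mammals, birds
print(info_gain(parent, [[1, 2], [2, 2]]))  # color split: ~0.020
print(info_gain(parent, [[0, 3], [3, 1]]))  # fly split:   ~0.522
```

Since the "does it fly" split yields the higher gain, it is the better attribute, matching the slide's conclusion.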
EXERCISE
IG(Outlook) = Entropy(Play Tennis) - [(5/14) * Entropy(Play Tennis | Outlook = Sunny) + (4/14) * Entropy(Play Tennis | Outlook = Overcast) + (5/14) * Entropy(Play Tennis | Outlook = Rainy)]
            = 0.940 - [(5/14) * 0.971 + (4/14) * 0.0 + (5/14) * 0.971] = 0.246

IG(Temperature) = Entropy(Play Tennis) - [(4/14) * Entropy(Play Tennis | Temperature = Hot) + (6/14) * Entropy(Play Tennis | Temperature = Mild) + (4/14) * Entropy(Play Tennis | Temperature = Cool)]
            = 0.940 - [(4/14) * 1.0 + (6/14) * 0.918 + (4/14) * 0.811] = 0.029

IG(Humidity) = Entropy(Play Tennis) - [(7/14) * Entropy(Play Tennis | Humidity = High) + (7/14) * Entropy(Play Tennis | Humidity = Normal)]
            = 0.940 - [(7/14) * 0.985 + (7/14) * 0.592] = 0.151

IG(Wind) = Entropy(Play Tennis) - [(8/14) * Entropy(Play Tennis | Wind = Weak) + (6/14) * Entropy(Play Tennis | Wind = Strong)]
            = 0.940 - [(8/14) * 0.811 + (6/14) * 1.0] = 0.048

Outlook has the highest information gain, so it is selected as the root node.
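The exercise can be verified with a few lines; the branch class counts (yes, no) below are taken from the 14-row play-tennis data:

```python
# Sketch: information gain of each feature on the play-tennis data,
# computed from per-branch (yes, no) class counts.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, branches):
    n = sum(parent)
    return entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)

parent = [9, 5]  # 9 yes, 5 no
gains = {
    "Outlook":     info_gain(parent, [[2, 3], [4, 0], [3, 2]]),  # sunny, overcast, rainy
    "Temperature": info_gain(parent, [[2, 2], [4, 2], [3, 1]]),  # hot, mild, cool
    "Humidity":    info_gain(parent, [[3, 4], [6, 1]]),          # high, normal
    "Wind":        info_gain(parent, [[6, 2], [3, 3]]),          # weak, strong
}
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
```

The results agree with the slide's figures up to rounding, and Outlook indeed has the highest gain.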
COMPARISON OF GINI INDEX AND ENTROPY
[Figure: Gini index and entropy plotted against the class probability p from 0 to 1; both curves peak at p = 0.5.]
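The two curves in the figure can be generated numerically. For a two-class node with positive-class probability p, the Gini index is 2p(1-p) (maximum 0.5 at p = 0.5) and the entropy is -p*log2(p) - (1-p)*log2(1-p) (maximum 1.0 at p = 0.5):

```python
# Sketch of the two impurity curves for a two-class node.
import math

def gini(p):
    return 2 * p * (1 - p)  # equals 1 - p^2 - (1-p)^2

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.3, 0.5]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```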
STEPS
The following steps are used to generate a classification or regression tree (Breiman et al. 1984).

Step 1: Start with the complete training data in the root node.

Step 2: Decide on the measure of impurity (usually the Gini impurity index or entropy). Choose the predictor variable that minimizes the impurity when the parent node is split into child nodes [see Eq. (12.4)]. This happens when the original data is divided into two subsets using a predictor variable such that the split yields the maximum reduction in impurity (for a discrete dependent variable) or the maximum reduction in SSE (for a continuous dependent variable).
Step 3: Repeat Step 2 for each subset of the data (for each internal node) using the independent variables until a stopping criterion is met (for example, the node is pure or contains too few observations).

Step 4: Generate business rules for the leaf (terminal) nodes of the tree.
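The four steps can be sketched as a toy recursive tree builder (an assumed minimal CART-style variant for categorical features, not Breiman et al.'s full algorithm; it uses multiway splits and no pruning):

```python
# Steps 1-4 in miniature: start with all rows at the root, split on the
# feature with the lowest weighted Gini impurity, recurse until a node
# is pure or no features remain, and leaves carry the majority class.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, labels, feat):
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feat], []).append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def build(rows, labels, feats):
    if len(set(labels)) == 1 or not feats:            # stopping rule
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = min(feats, key=lambda f: weighted_gini(rows, labels, f))
    node = {"feature": best, "children": {}}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        srows, slabels = zip(*sub)
        node["children"][value] = build(list(srows), list(slabels),
                                        [f for f in feats if f != best])
    return node

# The 7-animal example from the entropy slides:
rows = [{"fly": "no", "color": "brown"}, {"fly": "no", "color": "white"},
        {"fly": "yes", "color": "brown"}, {"fly": "yes", "color": "white"},
        {"fly": "no", "color": "white"}, {"fly": "no", "color": "brown"},
        {"fly": "yes", "color": "white"}]
labels = ["mammal", "mammal", "bird", "bird", "mammal", "bird", "bird"]
tree = build(rows, labels, ["fly", "color"])
print(tree["feature"])  # fly -- the lower-impurity split
```

Reading the paths from the root to each leaf gives the business rules of Step 4 (e.g. "if it flies, predict bird").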
CREDIT RATING OR CREDIT SCORING
Credit ratings estimate the level of risk involved in investing in or lending money to a particular business or other entity.
A high credit rating indicates that, in the rating agency's opinion, a bond issuer is likely
to repay its debts to investors without difficulty.
A poor credit rating suggests it might struggle to make its payments or even fail to
make them.
Investors and lenders use credit ratings to decide whether to do business with the
rated entity and to determine how much interest they would expect to receive to
compensate them for the risk involved.
CREDIT RATING DATA
Columns: S.No, CHK_ACCT, Duration, Credit History, Credit Amount, Balance in Savings A/C, Employment, Install_rate, Marital status, Present Resident, Age, Other installment, Num_Credits, Job, Credit classification, Credit Rating.

S.No 1: 0DM, 6, critical, 1169, unknown, over-seven, 4, Single, 4, 67, 1, 2, Unskilled, good, 0
S.No 3: no-account, 12, critical, 2096, less100DM, seven-years, 2, Single, 3, 49, 0, 1, Unskilled, good, 0
S.No 4: 0DM, 42, all-paid-duly, 7882, less100DM, seven-years, 2, Single, 4, 45, 0, 1, skilled, good, 0
S.No 5: 0DM, 24, delay, 4870, less100DM, four-years, 3, Single, 4, 53, 1, 2, skilled, bad, 1
S.No 6: no-account, 36, all-paid-duly, 9055, unknown, four-years, 2, Single, 4, 35, 0, 1, Unskilled, good, 0
EXAMPLE
Attribute 1: Credit Rating (Credit_Rating) - Bad (1), Good (0)
Attribute 2: Checking Account Balance (CHK_ACCT) - No account; 0 DM; >0 & ≤ 200 DM; > 200 DM (DM = Deutsche Mark)
Attribute 3: Duration (Duration) - Duration of the credit
Attribute 8: Marital Status (Maritalstatus) - Single/Married; Divorced Male; Divorced Female

[Figure: decision tree with root node "Checking Account?"]