DM-Lecture Decision Trees (A)
Classification-I
M M Awais
SPJCM
Decision tree induction
[Figure: decision tree for the play data.
Outlook = Sunny -> test Humidity: High -> N (no), Normal -> P (yes)
Outlook = Overcast -> P (yes)
Outlook = Rain -> test Wind: Strong -> N (no), Weak -> P (yes)]
Rule 1: If (outlook=“sunny”) AND (humidity<=0.75) Then (play=“yes”)
Rule 2: If (outlook=“rainy”) AND (wind>20) Then (play=“no”)
Rule 3: If (outlook=“overcast”) Then (play=“yes”)
...
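These rules can be read directly as code. Below is a small sketch (not from the slides): the field names outlook, humidity, and wind follow the rule syntax above, and the final return stands in for the branches elided by “...”.

```python
# Sketch: the three rules above applied to a record represented as a dict.
def play(record):
    if record["outlook"] == "sunny" and record["humidity"] <= 0.75:
        return "yes"                      # Rule 1
    if record["outlook"] == "rainy" and record["wind"] > 20:
        return "no"                       # Rule 2
    if record["outlook"] == "overcast":
        return "yes"                      # Rule 3
    return None                           # remaining branches elided by "..."

print(play({"outlook": "sunny", "humidity": 0.60, "wind": 5}))   # -> yes
```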
DECISION TREES
[Figure: training examples D7, D8, D9, and D13 shown against the candidate attributes Humidity and Wind.]
DECISION TREES
[Figure: the full set of training examples D1–D14 before any split.]
DECISION TREES
[Figure: the training examples D1–D14 partitioned by Outlook into the Sunny, Overcast, and Rain branches.]
Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
Tree construction
Tree pruning
[Figure: class distributions ranging from a “No Uncertainty” region through “Valuable Information” to a “High Uncertainty” region.]
$\sum_{j=1}^{|C|} \Pr(c_j) = 1$
E(PlayGolf, humidity)
E(PlayGolf, windy)
[Figure: two candidate splits, Married? (Yes / No) and Gender (M / F), evaluated on a small example table with class labels A, B, and C. Splitting on Gender gives the class counts M: (A: 0, B: 3, C: 0) and F: (A: 3, B: 0, C: 4).]
Completing DT
[Figure: the tree after the Gender split (M / F); the remaining mixed branch is split further on a Yes/No attribute, producing leaves with Class C: 4 and Class A: 3.]
Completing DT
[Figure: the completed tree shown alongside the example table (column headers include House, History, and Class); the final Yes/No split again yields leaves with Class C: 4 and Class A: 3.]
Tree Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of examples S, a set of features F, and a target set T (the target class to be learned)
1. If every element of S is already in T, return “yes”; if no element of S is in T, return “no”
2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
3. Extend the tree from f by adding a new branch for each attribute value of f
4. Distribute training examples to leaf nodes (so each leaf node S is now the set of examples at that node, and F is the remaining set of features not yet selected)
5. Repeat steps 1-5 for each leaf node
Main Question:
how do we choose the best feature at each step?
Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes
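A minimal sketch of the ID3 steps above, assuming categorical features and a list-of-dicts representation of the examples; it uses information gain (defined on the next slides) to pick the best feature. Function and variable names are illustrative, not from the slides.

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """Expected reduction in entropy when splitting on `feature`."""
    n = len(labels)
    split = {}
    for x, y in zip(examples, labels):
        split.setdefault(x[feature], []).append(y)
    return entropy(labels) - sum(len(part) / n * entropy(part) for part in split.values())

def id3(examples, labels, features):
    if len(set(labels)) == 1:                 # step 1: pure node -> leaf
        return labels[0]
    if not features:                          # step 2: no features left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, labels, f))
    tree = {best: {}}
    for value in set(x[best] for x in examples):      # steps 3-4: one branch per value
        rows = [(x, y) for x, y in zip(examples, labels) if x[best] == value]
        sub_x, sub_y = [x for x, _ in rows], [y for _, y in rows]
        tree[best][value] = id3(sub_x, sub_y, [f for f in features if f != best])  # step 5
    return tree
```

On the 14-example weather data used in these slides, this skeleton should reproduce the Outlook / Humidity / Wind tree shown at the beginning of the lecture.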
Choosing the “Best” Feature
Using Information Gain to find the “best” (most discriminating) feature
Entropy, E(I), of a set of instances I containing p positive and n negative examples:

$E(I) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:

$Gain(A, I) = E(I) - \sum_{\text{descendant } j} \frac{p_j + n_j}{p+n}\, E(I_j)$

where the jth descendant of I is the set of instances with value $v_j$ for A
Example: S: [9+, 5-], candidate split on Outlook (sunny / overcast / rainy)

$E(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
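A quick numeric check of the two formulas above (a sketch; the per-branch [p, n] counts for Outlook are the standard weather counts, also used on the gain-ratio slide later):

```python
import math

def entropy(pos, neg):
    """E(I) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)), with 0*log2(0) = 0."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, children):
    """Gain(A, I) = E(I) - sum_j (p_j + n_j)/(p + n) * E(I_j)."""
    p, n = parent
    return entropy(p, n) - sum((pj + nj) / (p + n) * entropy(pj, nj) for pj, nj in children)

print(round(entropy(9, 5), 3))                            # E(S) ~ 0.94
print(round(gain((9, 5), [(4, 0), (2, 3), (3, 2)]), 3))   # ~ 0.247 (0.246 in the slides)
```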
The Gini index originally measures how evenly income is distributed (e.g., USA 0.41, Japan 0.25 – the lower the value, the more evenly distributed the income).
$GINI(t) = 1 - \sum_j \big(p(j \mid t)\big)^2$
[Example: comparing candidate splits by weighted Gini.
Split 1 (three child nodes): child Gini values 0, 0.48, 0.48; weighted Gini = 0.34
Split 2 (child C1/C2 counts 4/0 and 5/5): child Gini values 0, 0.5; weighted Gini = 0.36
Split 3 (child C1/C2 counts 7/2 and 2/3): child Gini values 0.345, 0.48; weighted Gini = 0.391]
Candidate split positions for a continuous attribute (class counts in the <= and > partitions, and the Gini of each split):

Split Pos   Yes<=  No<=   Yes>  No>    Gini
1             0     0       3    7     0.420
2             0     1       3    6     0.400
3             0     2       3    5     0.375
4             0     3       3    4     0.343
5             1     3       2    4     0.417
6             2     3       1    4     0.400
7             3     3       0    4     0.300
8             3     4       0    3     0.343
9             3     5       0    2     0.375
10            3     6       0    1     0.400
11            3     7       0    0     0.420
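The Gini row of this table can be reproduced with a short scan: the sorted class sequence No, No, No, Yes, Yes, Yes, No, No, No, No is read off the <= counts above, and every boundary between consecutive sorted records is a candidate split position (a sketch, not the slides' code):

```python
def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(left, right):
    """Weighted Gini of a binary (<=, >) split, given (yes, no) count tuples."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
for k in range(len(labels) + 1):      # the first k sorted records go to the <= side
    left = (labels[:k].count("Yes"), labels[:k].count("No"))
    right = (labels[k:].count("Yes"), labels[k:].count("No"))
    print(k, round(split_gini(left, right), 3))
# reproduces the Gini row: 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420
```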
$GINI(\text{outlook} = \text{sunny or rainy}) = 1 - \left(\frac{5}{10}\right)^2 - \left(\frac{5}{10}\right)^2 = \frac{1}{2}$
$GINI(\text{outlook} = \text{overcast}) = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0$
$GINI(\text{split based on outlook}) = \frac{10}{14} \times \frac{1}{2} + \frac{4}{14} \times 0 = 0.3571$

$GINI(\text{temperature} = \text{hot or cool}) = 1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2 = 0.46875$
$GINI(\text{temperature} = \text{mild}) = 1 - \left(\frac{4}{6}\right)^2 - \left(\frac{2}{6}\right)^2 = 0.44$
$GINI(\text{split based on temperature}) = \frac{8}{14} \times 0.46875 + \frac{6}{14} \times 0.44 = 0.456$

$GINI(\text{humidity} = \text{high}) = 1 - \left(\frac{3}{7}\right)^2 - \left(\frac{4}{7}\right)^2 = \frac{24}{49}$
$GINI(\text{humidity} = \text{normal}) = 1 - \left(\frac{6}{7}\right)^2 - \left(\frac{1}{7}\right)^2 = \frac{12}{49}$
$GINI(\text{split based on humidity}) = \frac{7}{14} \times \frac{24}{49} + \frac{7}{14} \times \frac{12}{49} = 0.37$

$GINI(\text{windy} = \text{FALSE}) = 1 - \left(\frac{6}{8}\right)^2 - \left(\frac{2}{8}\right)^2 = 0.375$
$GINI(\text{windy} = \text{TRUE}) = 1 - \left(\frac{3}{6}\right)^2 - \left(\frac{3}{6}\right)^2 = 0.5$
$GINI(\text{split based on windy}) = \frac{8}{14} \times 0.375 + \frac{6}{14} \times 0.5 = 0.43$
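The four weighted Gini values above can be checked from the (yes, no) counts alone; a minimal sketch:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(partitions):
    """GINI(split) = sum_k n_k / n * GINI(node k), one (yes, no) tuple per child node."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(split_gini([(5, 5), (4, 0)]), 4))   # outlook {sunny, rainy} | {overcast}: 0.3571
print(round(split_gini([(5, 3), (4, 2)]), 3))   # temperature {hot, cool} | {mild}: ~0.458 (0.456 above, from rounding 0.44)
print(round(split_gini([(3, 4), (6, 1)]), 3))   # humidity high | normal: 0.367
print(round(split_gini([(6, 2), (3, 3)]), 3))   # windy FALSE | TRUE: 0.429
```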
[Figure: the resulting decision tree, with internal nodes Humidity (Normal / High), Windy (False / True), and Outlook (Rain / Sunny), and leaf class counts such as 4 yes / 0 no, 4 yes / 1 no (N = 5), and 3 yes / 0 no (N = 2).]
Another Solution:
Take a minimum threshold M of examples of the majority class in each adjacent partition; then merge adjacent partitions with the same majority class.
Example: M = 3, candidate cut points 70.5 and 77.5
Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play?:       yes no  yes yes yes no  no  yes yes yes no  yes yes no
Final mapping: temperature <= 77.5 ==> “yes”; temperature > 77.5 ==> “---”
$SplitInfo(A, S) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$    $GainRatio(A, S) = \frac{Gain(A, S)}{SplitInfo(A, S)}$

Example: “outlook”, S: [9+, 5-] split into S1: [4+, 0-], S2: [2+, 3-], S3: [3+, 2-] (branches: overcast, sunny, rainy)
SplitInfo(outlook, S) = -(4/14).log2(4/14) - (5/14).log2(5/14) - (5/14).log2(5/14) = 1.577
GainRatio(outlook, S) = 0.246 / 1.577 = 0.156

Humidity: Info = 0.788, Gain = 0.940 - 0.788 = 0.152, Split info = info([7, 7]) = 1
Windy:    Info = 0.892, Gain = 0.940 - 0.892 = 0.048, Split info = info([8, 6]) = 0.985
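A short check of the SplitInfo and GainRatio numbers above (a sketch):

```python
import math

def info(counts):
    """info([n1, n2, ...]) = -sum_i (n_i / n) log2(n_i / n)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def gain_ratio(gain, subset_sizes):
    return gain / info(subset_sizes)

print(round(info([4, 5, 5]), 3))                # SplitInfo(outlook, S) = 1.577
print(round(gain_ratio(0.246, [4, 5, 5]), 3))   # GainRatio(outlook, S) = 0.156
print(round(info([7, 7]), 3))                   # split info for Humidity = 1.0
print(round(info([8, 6]), 3))                   # split info for Windy = 0.985
```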
Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance beyond the training data.
$entropy_{Own\_house}(D) = \frac{6}{15}\, entropy(D_1) + \frac{9}{15}\, entropy(D_2) = \frac{6}{15} \times 0 + \frac{9}{15} \times 0.918 = 0.551$

$entropy_{Age}(D) = \frac{5}{15}\, entropy(D_1) + \frac{5}{15}\, entropy(D_2) + \frac{5}{15}\, entropy(D_3) = \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722 = 0.888$

Age      Yes  No  entropy(Di)
young     2    3    0.971
middle    3    2    0.971
old       4    1    0.722
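The same kind of check for these conditional entropies; the (Yes, No) counts per Own_house value, (6, 0) and (3, 6), are inferred from the weights 6/15 and 9/15 and the child entropies 0 and 0.918 shown above:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def conditional_entropy(groups):
    """entropy_A(D) = sum_i |D_i|/|D| * entropy(D_i), one (Yes, No) tuple per attribute value."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

print(round(conditional_entropy([(6, 0), (3, 6)]), 3))           # Own_house: 0.551
print(round(conditional_entropy([(2, 3), (3, 2), (4, 1)]), 3))   # Age: 0.888
```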
[Figure: a multi-way split on the numeric attribute Height, one branch per observed value (<=1.6m, ..., >2.0m), each leaf holding a single instance.]
Tree too wide – not compact.
Since the attribute is really numerical – what to do with a different (unseen) value?
Use a threshold Z, and split into two subsets: Y <= Z and Y > Z.
More complex tests are possible, assuming discrete values and a variable number of subsets.
C4.5 decision tree algorithm
C4.5 and continuous attributes
Sort the values into v1, …, vm
Try Zi = vi or Zi = (vi + vi+1) / 2 for i = 1, …, m-1
C4.5 uses Zi = vi – a more explainable decision rule
Select the splitting value Z* so that gain(Z*) = max {gain(Zi) : i = 1, …, m-1}
For the last example – Attribute2 (see the dataset below):
Z* = 80, Gain = 0.1022
So even with this approach, the split would still have been on Attribute1.
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
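A sketch of the threshold search described above, applied to Attribute2 of the dataset T (candidate thresholds taken at the observed values, as C4.5 does); it recovers Z* = 80 with gain ~ 0.1022:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain_at_threshold(values, labels, z):
    """Information gain of the binary test `value <= z` versus `value > z`."""
    left = [y for v, y in zip(values, labels) if v <= z]
    right = [y for v, y in zip(values, labels) if v > z]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left) + len(right) / n * entropy(right))

attr2 = [70, 90, 85, 95, 70, 90, 78, 65, 75, 80, 70, 80, 80, 96]
cls = ["CLASS1", "CLASS2", "CLASS2", "CLASS2", "CLASS1", "CLASS1", "CLASS1",
       "CLASS1", "CLASS1", "CLASS2", "CLASS2", "CLASS1", "CLASS1", "CLASS1"]
candidates = sorted(set(attr2))[:-1]        # splitting at the maximum leaves one side empty
best = max(candidates, key=lambda z: gain_at_threshold(attr2, cls, z))
print(best, round(gain_at_threshold(attr2, cls, best), 4))       # 80 0.1022
```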
Prediction
The same approach – with probabilities – is used.
If the values of all attributes are known – the class is well defined.
Else all paths from the root are explored:
the probability of each class is determined (as a sum of probabilities along the paths),
and the class with the highest probability is selected.
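A minimal sketch of this probabilistic prediction, assuming a hypothetical node representation: an internal node is (attribute, {value: (branch probability, child)}), where the branch probability is the fraction of training examples that followed that branch, and a leaf is a class-to-probability dict. The tree and fractions below are the weather tree from the start of the lecture.

```python
def predict(node, instance):
    """Return a class -> probability dict for `instance`, exploring all branches
    (weighted by branch probability) whenever an attribute value is missing."""
    if isinstance(node, dict):                 # leaf: a class distribution
        return node
    attribute, branches = node
    value = instance.get(attribute)
    if value in branches:                      # value known: follow a single path
        _, child = branches[value]
        return predict(child, instance)
    totals = {}                                # value missing: sum over all paths
    for branch_prob, child in branches.values():
        for cls, p in predict(child, instance).items():
            totals[cls] = totals.get(cls, 0.0) + branch_prob * p
    return totals

tree = ("Outlook", {
    "sunny":    (5 / 14, ("Humidity", {"high":   (3 / 5, {"no": 1.0}),
                                       "normal": (2 / 5, {"yes": 1.0})})),
    "overcast": (4 / 14, {"yes": 1.0}),
    "rainy":    (5 / 14, ("Wind", {"strong": (2 / 5, {"no": 1.0}),
                                   "weak":   (3 / 5, {"yes": 1.0})})),
})
print(predict(tree, {"Humidity": "high", "Wind": "strong"}))   # Outlook missing:
# {'no': 0.714..., 'yes': 0.285...} -> the class with the highest probability ("no") is selected
```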