DM Lec8

Decision Tree Construction

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value.
• Several choices for the splitting value:
  – Number of possible splitting values = number of distinct values.
• Each splitting value v has a count matrix associated with it:
  – Class counts in each of the partitions, A < v and A ≥ v.
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index (a brute-force sketch follows below).
  – Computationally inefficient! Repetition of work.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Figure: example binary split node "Taxable Income > 80K?" with Yes and No branches]
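As an illustration of the brute-force method described above (not part of the original slides), the sketch below rescans the whole dataset for every candidate value, which is exactly the repeated work the slide warns about. Names such as naive_best_split and gini are invented for this example; the data is the Taxable Income column and Cheat labels from the table.

```python
# Brute-force Gini split for a continuous attribute: for each candidate
# value v, rescan the whole dataset to build the count matrix A <= v vs. A > v.
from collections import Counter

def gini(counts):
    """Gini index of a class-count mapping, e.g. {'Yes': 3, 'No': 3} -> 0.5."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def naive_best_split(values, labels):
    """Try every distinct value as threshold; re-scans all records per candidate."""
    best_v, best_gini = None, float("inf")
    n = len(values)
    for v in sorted(set(values)):
        left = Counter(lab for x, lab in zip(values, labels) if x <= v)
        right = Counter(lab for x, lab in zip(values, labels) if x > v)
        weighted = (sum(left.values()) / n) * gini(left) \
                 + (sum(right.values()) / n) * gini(right)
        if weighted < best_gini:
            best_v, best_gini = v, weighted
    return best_v, best_gini

# Taxable Income values and Cheat labels from the table above
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(naive_best_split(incomes, cheat))   # -> (95, ~0.300)
```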



Continuous Attributes: Computing Gini Index...

For efficient computation, for each continuous attribute:
  – Sort the records on the attribute's values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index (see the sketch after the table below).
  – Choose the split position that has the least Gini index.

Records sorted by Taxable Income:

Cheat:          No  No  No  Yes  Yes  Yes  No   No   No   No
Sorted values:  60  70  75  85   90   95   100  120  125  220

Candidate split positions and their count matrices:

Split v | Yes ≤ v | Yes > v | No ≤ v | No > v | Gini
   55   |    0    |    3    |   0    |   7    | 0.420
   65   |    0    |    3    |   1    |   6    | 0.400
   72   |    0    |    3    |   2    |   5    | 0.375
   80   |    0    |    3    |   3    |   4    | 0.343
   87   |    1    |    2    |   3    |   4    | 0.417
   92   |    2    |    1    |   3    |   4    | 0.400
   97   |    3    |    0    |   3    |   4    | 0.300  <- lowest Gini: best split
  110   |    3    |    0    |   4    |   3    | 0.343
  122   |    3    |    0    |   5    |   2    | 0.375
  172   |    3    |    0    |   6    |   1    | 0.400
  230   |    3    |    0    |   7    |   0    | 0.420
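Below is a minimal sketch (not from the original slides) of the single-pass scan just described: sort once, then sweep the sorted values while maintaining running class counts, so each candidate split only needs a constant-time update. Function and variable names (best_split_sorted, incomes, cheat) are illustrative.

```python
# Efficient version: one sort, then a linear sweep that keeps cumulative
# class counts, evaluating a candidate split between each pair of distinct values.
from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_sorted(values, labels):
    data = sorted(zip(values, labels))            # one O(n log n) sort
    total, left, n = Counter(labels), Counter(), len(data)
    best_v, best_gini = None, float("inf")
    for i in range(n - 1):
        x, lab = data[i]
        left[lab] += 1                            # record i moves into the left partition
        if data[i + 1][0] == x:                   # only split between distinct values
            continue
        right = total - left                      # remaining records form the right partition
        w = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if w < best_gini:
            best_v, best_gini = (x + data[i + 1][0]) / 2, w
    return best_v, best_gini

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_sorted(incomes, cheat))  # -> (97.5, ~0.300), the table's minimum at split 97
```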



Measures of Node Impurity

• Gini index
• Entropy
• Misclassification error



Entropy

• Information is measured in bits.
  – Given a probability distribution, the information required to predict an event is the distribution's entropy.
  – Entropy gives the information required in bits (this can involve fractions of bits!).
• Formula for computing the entropy:

  $\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$



Entropy

• Entropy at a given node t:

  $\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Measures the homogeneity of a node.
    ◆ Maximum (log nc, where nc is the number of classes) when records are equally distributed among all classes, implying least information.
    ◆ Minimum (0.0) when all records belong to one class, implying most information.
  – Entropy-based computations are similar to the Gini index computations.
Examples for computing Entropy

$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

Node with class counts C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

Node with class counts C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

Node with class counts C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
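The small Python sketch below (not from the slides) reproduces these three worked examples; the helper name entropy is illustrative, and 0 · log(0) is treated as 0 by skipping empty classes.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node given its class counts; 0*log(0) is treated as 0."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]          # skip empty classes
    return sum(-p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

print(round(entropy([0, 6]), 2))   # 0.0  -> all records in one class
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```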



Information Gain

• Information Gain:

  $\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$

  where parent node p is split into k partitions and n_i is the number of records in partition i.

  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN); a small code sketch follows below.
  – Used in ID3 and C4.5.
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
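A minimal sketch of the GAIN_split formula, assuming class counts are already available for the parent node and for each partition. The function names are illustrative, and the quick check uses the Outlook counts from the weather-data example that follows.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)."""
    n = sum(parent_counts)
    after = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - after

# Quick check with the Outlook split of the weather data shown below:
# parent node [9 Yes, 5 No], partitions sunny [2,3], overcast [4,0], rainy [3,2]
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```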



Weather Data: Play or not Play?

Note: "Outlook" here is the weather forecast; it has no relation to the Microsoft email program.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No



Which attribute to select?



Example: attribute “Outlook”

• "Outlook" = "Sunny":
  info([2,3]) = entropy(2/5, 3/5) = – (2/5) log2(2/5) – (3/5) log2(3/5) = 0.971 bits

• "Outlook" = "Overcast":
  info([4,0]) = entropy(1, 0) = – 1 log2(1) – 0 log2(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0 · log(0) as zero.)

• "Outlook" = "Rainy":
  info([3,2]) = entropy(3/5, 2/5) = – (3/5) log2(3/5) – (2/5) log2(2/5) = 0.971 bits

• Expected information for the attribute:
  info([3,2], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing the information gain

Information gain = (information before split) – (information after split)

gain("Outlook") = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits

Information gain for the attributes of the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
Continuing to split

gain(" Humidity" ) = 0.971 bits


gain(" Temperatur e" ) = 0.571 bits

gain(" Windy" ) = 0.020 bits



The final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes.
• Splitting stops when the data can't be split any further.
Highly-branching attributes

• Problematic: attributes with a large number of values (extreme case: an ID code).
• Subsets are more likely to be pure if there is a large number of values.
• Information gain is therefore biased towards choosing attributes with a large number of values.
• This may result in overfitting (selection of an attribute that is non-optimal for prediction).



Weather Data with ID code

ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No
Split for ID Code Attribute

• Entropy of the split = 0 (since each leaf node is "pure", having only one case).
• Information gain is therefore maximal for the ID code: 0.940 bits, the full entropy of the parent node.
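A short sketch (illustrative, not from the slides) of why the ID code wins: every ID value isolates a single record, so the expected entropy after the split is 0 and the gain equals the parent's full 0.940 bits.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

parent = [9, 5]                       # 9 "Yes" / 5 "No" at the root
id_partitions = [[1]] * 14            # each ID value isolates exactly one record
split_entropy = sum(1 / 14 * entropy(p) for p in id_partitions)
print(split_entropy)                                   # 0.0
print(f"{entropy(parent) - split_entropy:.3f}")        # 0.940 -> larger than any real attribute's gain
```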
