DM Lec8
[Table: Gini index at candidate split positions of a sorted continuous attribute. Class "No" counts (≤, >) at each position: (0,7), (1,6), (2,5), (3,4), (3,4), (3,4), (3,4), (4,3), (5,2), (6,1), (7,0); corresponding Gini values: 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420. The minimum (0.300) marks the best split point.]
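The Gini values in that table come from scanning the candidate split points of a sorted continuous attribute and computing the weighted Gini of the two partitions induced at each point. A minimal sketch of that scan (the helper names gini and best_gini_split are ours, not from the slides):

```python
from collections import Counter

def gini(counts):
    # Gini index of a node: 1 - sum_j p(j|t)^2
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_gini_split(values, labels):
    # Sort the records by the continuous attribute, then evaluate a candidate
    # split point between each pair of adjacent distinct values, keeping the
    # one with the lowest weighted Gini of the two partitions.
    pairs = sorted(zip(values, labels))
    total = Counter(label for _, label in pairs)
    left = Counter()
    n = len(pairs)
    best_point, best_score = None, float("inf")
    for i in range(1, n):
        left[pairs[i - 1][1]] += 1
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical attribute values: no split point between them
        right = total - left
        score = (i / n) * gini(list(left.values())) + ((n - i) / n) * gini(list(right.values()))
        if score < best_score:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_point, best_score
```

Run over the sorted attribute values behind the table above, a scan like this reproduces the Gini row, whose minimum (0.300) identifies the best split.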
Measures of node impurity:
• Gini Index
• Entropy
• Misclassification error
Entropy(t) = −Σ_j p(j|t) log₂ p(j|t)
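For reference, a small sketch of the three impurity measures listed above, each computed from a node's class counts (function names are ours, not from the slides):

```python
import math

def gini(counts):
    # Gini(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), taking 0*log 0 = 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    # Error(t) = 1 - max_j p(j|t)
    n = sum(counts)
    return 1.0 - max(counts) / n

# For a node with class counts [2, 3] (the "Sunny" branch in the example below):
# gini([2, 3]) = 0.48, entropy([2, 3]) ≈ 0.971, misclassification_error([2, 3]) = 0.4
```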
• Information Gain:
  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)
  where the parent node p is split into k partitions and n_i is the number of records in partition i.
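The same formula, transcribed directly (the entropy helper is repeated here so the snippet runs on its own; names are ours):

```python
import math

def entropy(counts):
    # Entropy of a node from its class counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]) ≈ 0.247,
# the gain for "Outlook" worked out in the example below.
```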
“Outlook” = “Sunny”:
  info([2,3]) = entropy(2/5, 3/5) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971 bits
“Outlook” = “Overcast”:
  info([4,0]) = entropy(1, 0) = −1 log₂(1) − 0 log₂(0) = 0 bits (taking 0 log₂ 0 = 0)
“Outlook” = “Rainy”:
  info([3,2]) = entropy(3/5, 2/5) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971 bits
Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing the information gain
Information gain:
(information before split) – (information after split)
gain(" Outlook" ) = info([9,5] ) - info([2,3] , [4,0], [3,2]) = 0.940 - 0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(" Outlook" ) = 0.247 bits
gain(" Temperatur e" ) = 0.029 bits
gain(" Humidity" ) = 0.152 bits
gain(" Windy" ) = 0.048 bits
witten&eibe
Continuing to split