
Decision Tree Induction: Using Entropy for Attribute Selection
Principles of Data Mining, Chap. 5
Dra. María Hallo
Attribute selection
• Depending on the order in which attributes are selected for the tree, we obtain different trees.
• No attribute may be selected twice in the same branch.
• Figure 5.1 shows the results of running the TDIDT algorithm with the attribute selection strategies takefirst, takelast and random in turn to generate decision trees for the seven datasets contact lenses, lens24, chess, vote, monk1, monk2 and monk3.
Number of Branches Generated by TDIDT with Three Attribute Selection Methods
Example of a Dataset
Decision Tree 1 - TDIDT algorithm
Decision Tree 2 - TDIDT algorithm
Choosing Attributes to Split On: Using Entropy
• One commonly used method is to select the attribute that minimises the value of entropy, thus maximising the information gain.
• Entropy is an information-theoretic measure of the ‘uncertainty’ contained in a training set, due to the presence of more than one possible classification.
Choosing Attributes to Split On: Using Entropy
• If there are K classes, we can denote the proportion of instances with classification i by pi, for i = 1 to K. The value of pi is the number of occurrences of class i divided by the total number of instances, which is a number between 0 and 1 inclusive.
• The entropy of the training set is denoted by E. It is measured in ‘bits’ of information and is defined by the formula
E = -Σ pi log2 pi, summed over the classes for which pi is non-zero.
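A minimal Python sketch of this definition (the function name entropy and the use of collections.Counter are illustrative, not taken from the book):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy in bits of a collection of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

For example, entropy(['a', 'a', 'b', 'b']) returns 1.0 bit, the uncertainty of two equally likely classes.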
Choosing Attributes to Split On: Using Entropy (number of generated branches)
There is no guarantee that using entropy will always lead to a small decision tree, but experience shows that it generally produces trees with fewer branches than other attribute selection criteria.
Estart
For the initial lens24 training set of 24 instances, there are 3 classes. There are 4 instances with classification 1, 5 instances with classification 2 and 15 instances with classification 3, so p1 = 4/24, p2 = 5/24 and p3 = 15/24.
The entropy Estart is given by
Estart = -(4/24) log2 (4/24) - (5/24) log2 (5/24) - (15/24) log2 (15/24)
= 0.4308 + 0.4715 + 0.4238
= 1.3261 bits
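As a quick check of this arithmetic, the same value can be computed directly from the class counts (a sketch; the variable names are illustrative):

from math import log2

counts = [4, 5, 15]                 # classes 1, 2 and 3 in lens24
total = sum(counts)                 # 24 instances
e_start = -sum((n / total) * log2(n / total) for n in counts)
print(round(e_start, 4))            # 1.3261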
Using Entropy for Attribute Selection
• Training set 1 (age = 1)
Entropy E1 = -(2/8) log2 (2/8) - (2/8) log2 (2/8) - (4/8) log2 (4/8)
= 0.5 + 0.5 + 0.5 = 1.5
• Training set 2 (age = 2)
Entropy E2 = -(1/8) log2 (1/8) - (2/8) log2 (2/8) - (5/8) log2 (5/8)
= 0.375 + 0.5 + 0.4238 = 1.2988
• Training set 3 (age = 3)
Entropy E3 = -(1/8) log2 (1/8) - (1/8) log2 (1/8) - (6/8) log2 (6/8)
= 0.375 + 0.375 + 0.3113 = 1.0613
Average entropy
The values E1, E2 and E3 need to be weighted by the proportion of the original instances in each of the three subsets. In this case all the weights are the same, i.e. 8/24.
If the average entropy of the three training sets produced by splitting on attribute age is denoted by Enew, then
Enew = (8/24)E1 + (8/24)E2 + (8/24)E3 = 1.2867
Information Gain = Estart - Enew
The information gain from splitting on attribute age is 1.3261 - 1.2867 = 0.0394 bits.
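The same weighted-average calculation can be sketched in Python using the class counts of the three age subsets given above (helper and variable names are illustrative):

from math import log2

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

subsets = [[2, 2, 4], [1, 2, 5], [1, 1, 6]]   # class counts for age = 1, 2, 3
total = sum(sum(s) for s in subsets)          # 24 instances in all
e_new = sum(sum(s) / total * entropy_from_counts(s) for s in subsets)
e_start = entropy_from_counts([4, 5, 15])
print(round(e_new, 4))              # 1.2867
print(round(e_start - e_new, 4))    # 0.0394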
Minimising the value of Enew
The ‘entropy method’ of attribute selection is to choose to split on the attribute that gives the greatest reduction in (average) entropy, i.e. the one that maximises the value of Information Gain. This is equivalent to minimising the value of Enew, as Estart is fixed.
Information Gain
Maximising Information Gain
• attribute age: Enew = 1.2867
Information Gain = 1.3261 - 1.2867 = 0.0394 bits
• attribute specRx: Enew = 1.2866
Information Gain = 1.3261 - 1.2866 = 0.0395 bits
• attribute astig: Enew = 0.9491
Information Gain = 1.3261 - 0.9491 = 0.3770 bits
• attribute tears: Enew = 0.7773
Information Gain = 1.3261 - 0.7773 = 0.5488 bits
Thus the largest value of Information Gain (and the smallest value of the new entropy Enew) is obtained by splitting on attribute tears.
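The selection step itself then amounts to comparing these gains, as in this short sketch (the Enew values are the ones listed above; the dictionary layout is illustrative):

e_start = 1.3261
e_new = {'age': 1.2867, 'specRx': 1.2866, 'astig': 0.9491, 'tears': 0.7773}
gains = {attr: round(e_start - e, 4) for attr, e in e_new.items()}
best = max(gains, key=gains.get)
print(best, gains[best])            # tears 0.5488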
Process of splitting on nodes
The process of splitting on nodes is repeated for each branch of the evolving decision tree, terminating when the subset at every leaf node has entropy zero.
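A compact sketch of how this recursion might look in Python, assuming each instance is stored as an (attribute_dict, class_label) pair; all names are illustrative and this is not the book's pseudocode:

from collections import Counter
from math import log2

def entropy_of(instances):
    counts = Counter(cls for _, cls in instances)
    total = len(instances)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split(instances, attr):
    # Partition the instances by the value they take for attr.
    parts = {}
    for row, cls in instances:
        parts.setdefault(row[attr], []).append((row, cls))
    return parts

def build_tree(instances, attributes):
    # Stop when the subset is pure (entropy zero) or no attributes remain.
    if entropy_of(instances) == 0 or not attributes:
        return Counter(cls for _, cls in instances).most_common(1)[0][0]
    # Choose the attribute whose split minimises the weighted new entropy,
    # i.e. maximises information gain.
    def new_entropy(attr):
        return sum(len(sub) / len(instances) * entropy_of(sub)
                   for sub in split(instances, attr).values())
    best = min(attributes, key=new_entropy)
    remaining = [a for a in attributes if a != best]   # never reuse an attribute in a branch
    return {(best, value): build_tree(subset, remaining)
            for value, subset in split(instances, best).items()}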
