Decision Tree – Using Entropy
Example
Decision tree representation (PlayTennis)
Decision trees expressivity
Decision trees represent a disjunction of conjunctions of
constraints on the values of the attributes:
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
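As a small illustration (not part of the original slides), this disjunction of conjunctions can be evaluated directly as a Boolean rule. The sketch below assumes each example is a Python dict keyed by the attribute names used on the slide:

```python
def play_tennis(example):
    """Evaluate the disjunction of conjunctions read off the PlayTennis tree."""
    return (
        (example["Outlook"] == "Sunny" and example["Humidity"] == "Normal")
        or (example["Outlook"] == "Overcast")
        or (example["Outlook"] == "Rain" and example["Wind"] == "Weak")
    )

# Hypothetical examples, for illustration only
print(play_tennis({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))       # False
print(play_tennis({"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"}))  # True
```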
Top-down induction of Decision Trees
ID3 (Quinlan, 1986) is a basic algorithm used to build decision trees.
Given a training set of examples, the algorithm performs a search
in the space of decision trees.
The construction of the tree is top-down, and the algorithm is greedy.
The fundamental question is “which attribute should be tested next?
Which attribute gives us the most information?”
Select the best attribute.
A descendant node is then created for each possible value of this
attribute, and the data set is partitioned accordingly.
The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
(a minimal sketch of the procedure is given below).
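The sketch below is a minimal, illustrative rendering of this greedy top-down procedure, not Quinlan's original code. It assumes examples are Python dicts keyed by attribute name and relies on an `information_gain` helper like the one sketched after the information-gain slide further below:

```python
from collections import Counter

def id3(examples, target, attributes):
    """Greedy top-down induction of a decision tree (ID3-style sketch).

    examples   : non-empty list of dicts mapping attribute name -> value
    target     : name of the class attribute
    attributes : attribute names still available for testing
    Returns a class label (leaf) or {"attribute": A, "branches": {value: subtree}}.
    """
    labels = [e[target] for e in examples]

    # Stop if all examples share one class, or no attributes are left to test.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class

    # Greedy choice: the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, target, a))

    tree = {"attribute": best, "branches": {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = id3(subset, target, remaining)
    return tree
```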
Which attribute is the best classifier?
Entropy in binary classification
Entropy measures the impurity of a collection of examples. It
depends on the distribution of the random variable p.
S is a collection of training examples
p+ is the proportion of positive examples in S
p– is the proportion of negative examples in S
Entropy(S) = – p+ log2 p+ – p– log2 p–   [with 0 log2 0 = 0]
Entropy([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 0 = 0
Entropy([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2 + 1/2 = 1   [log2 1/2 = – 1]
Note: the log of a number < 1 is negative; 0 ≤ p+ ≤ 1 and 0 ≤ Entropy(S) ≤ 1
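As a quick, informal check of these numbers (not part of the original slides):

```python
import math

def binary_entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    entropy = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                     # convention: 0 * log2(0) = 0
            entropy -= p * math.log2(p)
    return entropy

print(binary_entropy(14, 0))  # 0.0
print(binary_entropy(9, 5))   # ~0.940
print(binary_entropy(7, 7))   # 1.0
```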
Entropy in general
Entropy measures the amount of information in a random
variable.
H(X) = – p+ log2 p+ – p– log2 p–,   X = {+, –}
for binary classification [a two-valued random variable]
H(X) = – Σi=1..c pi log2 pi = Σi=1..c pi log2 (1/pi),   X = {1, …, c}
for classification in c classes
Example: rolling a die with 8 equally probable sides
H(X) = – Σi=1..8 1/8 log2 (1/8) = – log2 (1/8) = log2 8 = 3
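A minimal sketch of the general formula, again only for illustration:

```python
import math

def entropy(probabilities):
    """H(X) = - sum_i p_i * log2(p_i) for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/8] * 8))     # 3.0    (fair 8-sided die)
print(entropy([9/14, 5/14]))  # ~0.940 (matches the binary example above)
```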
Information gain as entropy reduction
Information gain is the expected reduction in entropy caused by
partitioning the examples on an attribute.
The higher the information gain, the more effective the attribute is
in classifying the training data.
Expected reduction in entropy knowing A:
Gain(S, A) = Entropy(S) – Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of values of attribute A and Sv is the subset of S for which A has value v.
[Figure: partition of the training examples with their class labels: {D1, D2, D8} → No, {D9, D11} → Yes, {D4, D5, D10} → Yes, {D6, D14} → No]
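The following helper (illustrative only; the names and the list-of-dicts data format are assumptions) computes Gain(S, A) as defined above and serves as the `information_gain` used in the ID3 sketch earlier:

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, target, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [e[target] for e in examples]
    gain = entropy_of_labels(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy_of_labels(subset)
    return gain
```

With this helper in place, the ID3 sketch above can be run directly on a list of example dicts.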
Thanks