Decision Tree
CUKUROVA UNIVERSITY
BIOMEDICAL ENGINEERING DEPARTMENT
2016
Also known as
– Hierarchical classifiers
– Tree classifiers
– Multistage classification
– Divide & conquer strategy
Decision Trees
Classify a pattern through a sequence of questions (as in the 20-questions game); the next
question asked depends on the answer to the current question.
This approach is particularly useful for non-metric data; questions can be asked in a
"yes-no" or "true-false" style that does not require any notion of a metric.
Classification of a pattern begins at the root node and proceeds until a leaf node is
reached; the pattern is assigned the category of that leaf node.
When to consider Decision Trees
Instances describable by attribute-value pairs
e.g. Humidity: High, Normal
Decision tree representation
ID3 learning algorithm
Entropy & Information gain
Examples
Gini Index
Overfitting
Decision Trees
A decision tree is a simple but powerful learning paradigm. In this
method a set of training examples is broken down into smaller and
smaller subsets while, at the same time, an associated decision tree is
incrementally developed. At the end of the learning process, a decision
tree covering the training set is returned.
The decision tree can be thought of as a set of sentences (in disjunctive
normal form) written in propositional logic.
Some characteristics of problems that are well suited to decision tree
learning are:
– Attribute-value paired elements
– Discrete target function
– Disjunctive descriptions (of the target function)
– Works well with missing or erroneous training data
Decision Tree Example
[Figure: a two-dimensional feature space with Income on one axis and Debt on the other.
The tree first splits on Income > t1, then on Debt > t2, and finally on Income > t3,
carving the space into rectangular regions.]
Decision Tree Construction
Note: tree decision boundaries are piecewise linear and axis-parallel.
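To make the figure concrete, here is a minimal hand-coded version of such a tree in Python. The threshold values and the risk labels are illustrative assumptions and are not given in the slides; only the structure of the splits follows the figure.

# A hand-coded tree over (income, debt) mirroring the structure of the figure.
# The thresholds T1, T2, T3 and the "low risk"/"high risk" labels are assumed
# for illustration only.
T1, T2, T3 = 30_000, 10_000, 60_000


def classify(income, debt):
    """Each test compares one feature with a threshold, so every decision
    boundary is axis-parallel."""
    if income <= T1:                      # first split: Income > t1 ?
        return "high risk"
    if debt <= T2:                        # second split: Debt > t2 ?
        return "low risk"
    return "low risk" if income > T3 else "high risk"   # third split: Income > t3 ?


if __name__ == "__main__":
    for point in [(20_000, 5_000), (40_000, 5_000), (40_000, 20_000), (80_000, 20_000)]:
        print(point, "->", classify(*point))

Because every test compares a single feature against a threshold, each split adds one axis-parallel boundary to the partition of the (Income, Debt) plane.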
ID3 Algorithm
Top-Down Induction of Decision Trees (ID3)
Building a Decision Tree
1. First test all attributes and select the one that would function as the best
root;
2. Break up the training set into subsets based on the branches of the root
node;
3. Test the remaining attributes to see which ones fit best underneath the
branches of the root node;
4. Continue this process for all other branches until
a. all examples of a subset are of one type,
b. there are no examples left (return the majority classification of the
parent), or
c. there are no more attributes left (the default value should be the majority
classification).
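A compact Python sketch of steps 1-4 above (not the course's reference implementation). It assumes examples are dictionaries mapping attribute names to values plus a "label" key, and that `gain` is some attribute-scoring function, for example the information gain defined later in these slides.

from collections import Counter


def majority_label(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]


def build_tree(examples, attributes, gain, parent_majority=None):
    # 4b. no examples left: return the majority classification of the parent
    if not examples:
        return parent_majority
    labels = {e["label"] for e in examples}
    # 4a. all examples of this subset are of one type
    if len(labels) == 1:
        return labels.pop()
    # 4c. no attributes left: default to the majority classification
    if not attributes:
        return majority_label(examples)
    # 1. / 3. score every remaining attribute and pick the best one
    best = max(attributes, key=lambda a: gain(examples, a))
    tree = {best: {}}
    # 2. break the training set into subsets, one branch per value of `best`
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, gain, majority_label(examples))
    return tree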
Decision Tree for Playing Tennis
Consider the following table of 14 training examples with attributes Day, Outlook,
Temperature, Humidity, and Wind, and target Play Tennis (9 positive and 5 negative
examples).
[Table: the PlayTennis training set from Mitchell (1997).]
Decision Tree for Playing Tennis
Attributes and their values:
– Outlook: Sunny, Overcast, Rain
– Temperature: Hot, Mild, Cool
– Humidity: High, Normal
– Wind: Strong, Weak
Decision Tree for Playing Tennis
[Figure: Outlook at the root; the Sunny branch tests Humidity (High → No, Normal → Yes),
the Overcast branch is Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
[See: Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997]
Complete Tree
[Figure: the same tree with the training examples at its leaves:
Sunny/High → No [D1,D2,D8], Sunny/Normal → Yes [D9,D11],
Rain/Strong → No [D6,D14], Rain/Weak → Yes [D4,D5,D10].]
OR Complete Tree
[Figure: an alternative consistent tree with Humidity at the root (Normal / High), Wind
tested next (Strong / Weak) on both branches, and Outlook tested below the Wind nodes.]
Steps for decision tree construction:
Entropy & Gain
Determining which attribute is best (Entropy & Gain)
Entropy E(S) is the minimum number of bits needed to encode the classification
(yes or no) of an arbitrary example of S:
E(S) = Σ_{i=1..c} -p_i log2(p_i)
E(S) = 0 if all examples belong to the same class
E(S) = 1 if the examples are split equally between the two classes
0 < E(S) < 1 if the classes are mixed in any other proportion
The information gain G(S,A), where A is an attribute:
G(S,A) = E(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) * E(Sv)
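The two formulas translate almost directly into code. The function names and the representation of examples as dictionaries with a "label" key are our own conventions for this sketch.

import math
from collections import Counter


def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, label_key="label"):
    """G(S,A) = E(S) - sum over values v of A of (|Sv|/|S|) * E(Sv)."""
    labels = [e[label_key] for e in examples]
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[label_key] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

Classes that do not occur in a subset contribute no term, which implements the usual convention 0 * log2(0) = 0.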
Entropy
The average amount of information needed to classify an object is given by the entropy
measure.
[Figure: entropy of a two-class set plotted against p(c1); it is 0 when p(c1) is 0 or 1
and reaches its maximum of 1 bit at p(c1) = 0.5.]
Example: the entropy for the split on A1 is computed as follows.
The full set is [29+,35-], and A1 splits it into a True branch [21+,5-] and a False
branch [8+,30-].
E(S) = -29/(29+35)*log2(29/(29+35)) - 35/(35+29)*log2(35/(35+29)) = 0.9937
The entropy of the True branch:
E(True) = -21/(21+5)*log2(21/(21+5)) - 5/(5+21)*log2(5/(5+21)) = 0.7063
The entropy of the False branch:
E(False) = -8/(8+30)*log2(8/(8+30)) - 30/(30+8)*log2(30/(30+8)) = 0.7425
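The numbers above can be checked with a two-class entropy helper (a standalone snippet, repeating the formula rather than importing the earlier sketch):

import math


def entropy2(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x)


print(round(entropy2(29, 35), 4))   # 0.9937 -> E(S) for [29+,35-]
print(round(entropy2(21, 5), 4))    # 0.7063 -> E(True) for [21+,5-]
print(round(entropy2(8, 30), 4))    # 0.7425 -> E(False) for [8+,30-]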
Entropy reduction by data set partitioning
[Figure: a set of examples is partitioned by the attribute Color? into red, green, and
yellow subsets.]
Information Gain
[Figure: the same Color? partition, used to illustrate the information gain of the split.]
Information Gain of the Attributes
Gain(Color) = 0.246
Gain(Outline) = 0.151
Gain(Dot) = 0.048
Heuristic: the attribute with the highest gain is chosen.
This heuristic is local (local minimization of impurity).
[Figures: the partition is refined step by step; the subsets produced by Color? are
further split using the attributes Outline? (solid / dashed) and Dot? (yes / no),
illustrating how the tree is grown recursively.]
Decision Tree
[Figure: the resulting decision tree with Color at the root and branches for red, green,
and yellow.]
Which Attribute is "best"?
Information Gain
For attribute A1: Entropy([21+,5-]) = 0.71 and Entropy([8+,30-]) = 0.74
For attribute A2: Entropy([18+,33-]) = 0.94 and Entropy([11+,2-]) = 0.62
Gain(S,A1) = Entropy(S) - 26/64*Entropy([21+,5-]) - 38/64*Entropy([8+,30-]) = 0.27
Gain(S,A2) = Entropy(S) - 51/64*Entropy([18+,33-]) - 13/64*Entropy([11+,2-]) = 0.12
Outlook splits S = [9+,5-] into Sunny [2+,3-] (E = 0.971), Overcast [4+,0-] (E = 0.0),
and Rain [3+,2-] (E = 0.971):
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
Selecting the Next Attribute
The information gain values for the four attributes are:
• Gain(S,Outlook) = 0.247
• Gain(S,Humidity) = 0.151
• Gain(S,Wind) = 0.048
• Gain(S,Temperature) = 0.029
Note: 0 * log2(0) = 0
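These gain values can be reproduced with a short script. The 14-example PlayTennis table is the standard one from Mitchell (1997), which the slides reference but do not reproduce, so the rows below are included as an assumption.

import math
from collections import Counter

# The 14 PlayTennis examples (Outlook, Temperature, Humidity, Wind, PlayTennis).
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          # D1
    ("Sunny", "Hot", "High", "Strong", "No"),        # D2
    ("Overcast", "Hot", "High", "Weak", "Yes"),      # D3
    ("Rain", "Mild", "High", "Weak", "Yes"),         # D4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       # D5
    ("Rain", "Cool", "Normal", "Strong", "No"),      # D6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), # D7
    ("Sunny", "Mild", "High", "Weak", "No"),         # D8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      # D9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),       # D10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    # D11
    ("Overcast", "Mild", "High", "Strong", "Yes"),   # D12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    # D13
    ("Rain", "Mild", "High", "Strong", "No"),        # D14
]
COLUMNS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}


def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute):
    col = COLUMNS[attribute]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


for attribute in COLUMNS:
    print(f"Gain(S,{attribute}) = {gain(DATA, attribute):.3f}")
# Matches the slide values (0.247, 0.029, 0.151, 0.048) up to rounding in the last digit.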
Decision Tree Learning: A Simple Example
Now our decision tree looks like: [figure]
Let E([X+,Y-]) denote the entropy of a set containing X positive and Y negative
training examples.
The entropy of the training data, E(S), can therefore be written as E([9+,5-]),
because 9 of the 14 training examples are yes and 5 of them are no.
G(S,Humidity) = 0.94 - [7/14*E(Humidity=high) + 7/14*E(Humidity=normal)]
G(S,Humidity) = 0.94 - [7/14*E([3+,4-]) + 7/14*E([6+,1-])]
G(S,Humidity) = 0.94 - [7/14*0.985 + 7/14*0.592]
G(S,Humidity) = 0.1515
The information gain for Outlook is:
G(S,Outlook) = E(S) - [5/14*E(Outlook=sunny) + 4/14*E(Outlook=overcast) + 5/14*E(Outlook=rain)]
G(S,Outlook) = E([9+,5-]) - [5/14*E([2+,3-]) + 4/14*E([4+,0-]) + 5/14*E([3+,2-])]
G(S,Outlook) = 0.94 - [5/14*0.971 + 4/14*0.0 + 5/14*0.971]
G(S,Outlook) = 0.246
Then the complete tree should look like this:
[Figure: Outlook at the root; Sunny → Humidity (High → No [D1,D2,D8], Normal → Yes [D9,D11]),
Overcast → Yes, Rain → Wind (Strong → No [D6,D14], Weak → Yes [D4,D5,D10]).]
Decision Tree for PlayTennis
[Figure: the final tree; Outlook at the root, Sunny → Humidity (High: No, Normal: Yes),
Overcast → Yes, Rain → Wind (Strong: No, Weak: Yes).]
Decision Tree for Conjunction: Outlook=Sunny ∧ Wind=Weak
[Figure: Outlook at the root; Sunny → test Wind (Strong: No, Weak: Yes); Overcast → No;
Rain → No.]
Decision Tree for Disjunction: Outlook=Sunny ∨ Wind=Weak
[Figure: Outlook at the root; Sunny → Yes; Overcast and Rain → test Wind (Strong: No,
Weak: Yes).]
Decision Tree
• Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
[Figure: the PlayTennis tree whose Yes leaves correspond to these three conjunctions.]
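To make the "disjunction of conjunctions" reading concrete, the following sketch encodes the tree and the formula as two hypothetical Python predicates and checks that they agree on every combination of the attribute values used in the deck.

from itertools import product


def tree_says_yes(outlook, humidity, wind):
    """Prediction of the PlayTennis tree shown above."""
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    return wind == "Weak"            # outlook == "Rain"


def dnf_says_yes(outlook, humidity, wind):
    """The disjunction of conjunctions stated on the slide."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))


for combo in product(["Sunny", "Overcast", "Rain"], ["High", "Normal"], ["Strong", "Weak"]):
    assert tree_says_yes(*combo) == dnf_says_yes(*combo)
print("tree and DNF agree on all 12 attribute combinations")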
ID3 Algorithm (Note: 0 * log2(0) = 0)
[Figure: ID3 applied to the full set [D1,D2,...,D14] = [9+,5-]; Outlook is chosen at the
root, and the leaves receive No [D1,D2,D8], Yes [D9,D11], No [D6,D14], Yes [D4,D5,D10].]
Converting a Tree to Rules
[Figure: the PlayTennis tree with Outlook at the root; each path from the root to a leaf
becomes one IF-THEN rule.]
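A small sketch of "one rule per root-to-leaf path", assuming the nested-dictionary tree representation used in the earlier sketches; the tree literal mirrors the PlayTennis tree from the previous slides.

# Tree encoded as nested dicts {attribute: {value: subtree_or_leaf_label}}.
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}


def tree_to_rules(tree, conditions=()):
    """Yield one IF-THEN rule for every root-to-leaf path."""
    if not isinstance(tree, dict):                       # reached a leaf
        body = " AND ".join(f"{a}={v}" for a, v in conditions) or "TRUE"
        yield f"IF {body} THEN PlayTennis={tree}"
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))


for rule in tree_to_rules(TREE):
    print(rule)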
Information Gain
Gain(Sample, Attribute), or Gain(S,A), is the expected reduction in entropy due to
sorting S on attribute A.
Information Gain Ratio
[Figure: the gain ratio illustrated on the Color? partition (red / green / yellow).]
Information Gain and Information Gain Ratio
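The slide's formula for the gain ratio did not survive, so this sketch assumes the standard C4.5 definition: GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A), where SplitInformation(S,A) = -Σ_v (|Sv|/|S|) log2(|Sv|/|S|). The split information grows with the number of attribute values, so the ratio penalizes attributes (such as Day) that split the data into many tiny subsets.

import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(examples, attribute, label_key="label"):
    """Information gain divided by the split information of the partition."""
    labels = [e[label_key] for e in examples]
    sizes, remainder = [], 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[label_key] for e in examples if e[attribute] == value]
        sizes.append(len(subset))
        remainder += len(subset) / len(examples) * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = -sum(s / len(examples) * math.log2(s / len(examples)) for s in sizes)
    return gain / split_info if split_info else 0.0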
Gini Index
Gini Index for Color
[Figure: the Gini index evaluated on the Color? partition (red / green / yellow).]
Gain of Gini Index
The Gini index is used by the CART (classification and regression trees) algorithm.
Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were randomly labeled according to the distribution of labels
in the subset.
Gini impurity can be computed by summing the probability of each item being chosen times
the probability of a mistake in categorizing that item.
It reaches its minimum (zero) when all cases in the node fall into a single target
category.
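The description above maps directly onto code: Σ_i p_i(1 - p_i) simplifies to 1 - Σ_i p_i². The GiniGain compared below is assumed here to be the impurity decrease computed with the same weighted average used for information gain.

from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(examples, attribute, label_key="label"):
    """Impurity decrease of splitting on `attribute` (weighted-average form)."""
    labels = [e[label_key] for e in examples]
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[label_key] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * gini(subset)
    return gini(labels) - remainder


print(gini(["Yes"] * 4))                          # 0.0: a pure node has minimum impurity
print(round(gini(["Yes"] * 9 + ["No"] * 5), 3))   # 0.459 for the [9+,5-] PlayTennis set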
Three Impurity Measures
[Table: Gain(A), GainRatio(A), and GiniGain(A) compared for each attribute A.]
Why prefer short hypotheses? A short hypothesis that fits the data is unlikely to be a
coincidence; a long hypothesis that fits the data might be a coincidence.
Arguments opposed: there are many ways to define small sets of hypotheses.
Overfitting in Decision Trees
One of the biggest problems with decision trees is overfitting.
Avoid Overfitting
How can we avoid overfitting?
– Stop growing when the data split is not statistically significant
– Grow the full tree, then post-prune
Over the training data, select the tree that minimizes
size(tree) + size(misclassifications(tree))
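The next slide refers to reduced error pruning; the following is a simplified sketch of that post-pruning idea, assuming the nested-dictionary trees from the earlier sketches and a held-out validation set. As a simplification, the replacement label for a node is taken from the validation examples that reach it.

from collections import Counter


def predict(tree, example, default="No"):
    """Sort an example through a nested-dict tree and return the leaf label."""
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches.get(example[attribute], default)
    return tree


def accuracy(tree, examples):
    return sum(predict(tree, e) == e["label"] for e in examples) / len(examples)


def reduced_error_prune(tree, val_examples):
    """Bottom-up: replace a subtree by the majority label of the validation
    examples reaching it whenever accuracy on those examples does not drop
    (other examples are unaffected, so whole-tree accuracy cannot drop)."""
    if not isinstance(tree, dict) or not val_examples:
        return tree
    (attribute, branches), = tree.items()
    for value in list(branches):
        reaching = [e for e in val_examples if e[attribute] == value]
        branches[value] = reduced_error_prune(branches[value], reaching)
    majority = Counter(e["label"] for e in val_examples).most_common(1)[0][0]
    leaf_accuracy = sum(e["label"] == majority for e in val_examples) / len(val_examples)
    return majority if leaf_accuracy >= accuracy(tree, val_examples) else tree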
Effect of Reduced Error Pruning
Unknown Attribute Values
What if some examples have missing values of attribute A?
Use the training example anyway and sort it through the tree:
– if node n tests A, assign the most common value of A among the examples at node n
Model selection
Feature selection
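A minimal sketch of the "most common value" strategy, with a hypothetical helper `fill_missing`:

from collections import Counter


def fill_missing(example, attribute, examples_at_node):
    """Replace a missing value of `attribute` with its most common value
    among the examples at the current node."""
    if example.get(attribute) is None:
        values = [e[attribute] for e in examples_at_node if e.get(attribute) is not None]
        example = dict(example, **{attribute: Counter(values).most_common(1)[0][0]})
    return example


node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"}, {"Wind": "Strong"}]
print(fill_missing({"Outlook": "Rain", "Wind": None}, "Wind", node_examples))
# -> {'Outlook': 'Rain', 'Wind': 'Weak'}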
References:
Alpaydin, E., Introduction to Machine Learning, 2nd ed., MIT Press, 2010.
Dr. Lee's slides, San Jose State University, Spring 2007.
Andrew Colin, "Building Decision Trees with the ID3 Algorithm", Dr. Dobb's Journal, June 1996.
Paul E. Utgoff, "Incremental Induction of Decision Trees", Kluwer Academic Publishers, 1989.
http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
http://decisiontrees.net/node/27
Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997.
Barros, R. C., Cerri, R., Jaskowiak, P. A., Carvalho, A. C. P. L. F., "A bottom-up oblique
decision tree induction algorithm" (http://dx.doi.org/10.1109/ISDA.2011.6121697),
Proceedings of the 11th International Conference on Intelligent Systems Design and
Applications (ISDA), 2011.
Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification and
Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
ISBN 978-0-412-04841-8.
Barros, R. C., Basgalupp, M. P., Carvalho, A. C. P. L. F., Freitas, A. A., "A Survey of
Evolutionary Algorithms for Decision-Tree Induction"
(http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5928432), IEEE Transactions on
Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 3,
pp. 291-312, May 2012.