
Decision Trees

CUKUROVA UNIVERSITY
BIOMEDICAL ENGINEERING DEPARTMENT
2016
• Also known as:
  – Hierarchical classifiers
  – Tree classifiers
  – Multistage classification
  – Divide & conquer strategy

2
Decision Trees
• Classify a pattern through a sequence of questions (as in the 20-questions game); the next
  question asked depends on the answer to the current question

• This approach is particularly useful for non-metric data: questions can be asked in a
  "yes-no" or "true-false" style that does not require any notion of distance

• The sequence of questions is displayed as a directed decision tree

• Components: root node, links or branches, and leaf (terminal) nodes

• Classification of a pattern begins at the root node and proceeds until a leaf node is reached;
  the pattern is assigned the category of that leaf node

• Benefits of decision trees:
  – Interpretability: a tree can be expressed as a logical expression
  – Rapid classification: a sequence of simple queries
  – Good accuracy and speed in practice
3
Decision Tree Structure

4
When to consider Decision Trees
• Instances describable by attribute-value pairs
  – e.g. Humidity: High, Normal

• Target function is discrete valued
  – e.g. PlayTennis: Yes, No

• Disjunctive hypothesis may be required
  – e.g. Outlook=Sunny ∨ Wind=Weak

• Possibly noisy training data

• Missing attribute values

• Application examples:
  – Medical diagnosis
  – Credit risk analysis
  – Object classification for robot manipulator (Tan 1993)

5
• Decision tree representation
• ID3 learning algorithm
• Entropy & information gain
• Examples
• Gini index
• Overfitting
6
Decision Trees
• A decision tree is a simple but powerful learning paradigm. In this
  method a set of training examples is broken down into smaller and
  smaller subsets while, at the same time, an associated decision tree gets
  incrementally developed. At the end of the learning process, a decision
  tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences (in Disjunctive
  Normal Form) written in propositional logic.
• Some characteristics of problems that are well suited to decision tree
  learning are:
  – Attribute-value paired elements
  – Discrete target function
  – Disjunctive descriptions (of the target function)
  – Works well with missing or erroneous training data

7
Decision Tree Example
[Figure: training examples plotted in the Income–Debt plane]

8
Decision Tree Example
[Figure: a first split of the Income–Debt plane at Income > t1; the region with Income ≤ t1 is still unresolved (??)]

9
Decision Tree Example
[Figure: after the Income > t1 split, the remaining region is split at Debt > t2; one region is still unresolved (??)]

10
Decision Tree Example
[Figure: a further split at Income > t3 completes the partition (tests: Income > t1, Debt > t2, Income > t3)]

11
Decision Tree Construction
[Figure: the resulting partition of the Income–Debt plane by the tests Income > t1, Debt > t2, Income > t3]
Note: tree decision boundaries are piecewise linear and axis-parallel

12
ID3 Algorithm

• ID3 is an algorithm for constructing a decision tree
• It uses entropy to compute the information gain of each attribute
• The attribute with the best (highest) gain is then selected

13
Top-Down Induction of Decision Trees (ID3)

1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to
   the attribute value of the branch
5. If all training examples are perfectly classified
   (same value of the target attribute) stop; else
   iterate over the new leaf nodes.

14
Building a Decision Tree
1. First test all attributes and select the one that would function as the best
   root;
2. Break up the training set into subsets based on the branches of the root
   node;
3. Test the remaining attributes to see which ones fit best underneath the
   branches of the root node;
4. Continue this process for all other branches until
   a. all examples of a subset are of one type,
   b. there are no examples left (return the majority classification of the
      parent), or
   c. there are no more attributes left (the default value should be the majority
      classification).
A code sketch of this recursive procedure follows.

15
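As a rough illustration of the procedure above, here is a minimal ID3-style sketch in Python. It is not code from these slides: the function names, the dictionary representation of examples, and the "PlayTennis" target key are assumptions made only for this example.

import math
from collections import Counter

def entropy(examples, target="PlayTennis"):
    # Entropy of the class-label distribution in this subset of examples.
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target="PlayTennis"):
    # Expected reduction in entropy obtained by splitting on `attribute`.
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target="PlayTennis"):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # (a) all examples of one type -> leaf
        return labels[0]
    if not attributes:                 # (c) no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # Step 1: choose the attribute with the highest information gain as the node test.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    # Steps 2-4: one branch per observed value of `best`, built recursively.
    # (Case (b) would apply if we also iterated over values not present in `examples`.)
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

With the PlayTennis table from the following slides represented as a list of dictionaries (keys Outlook, Temperature, Humidity, Wind, PlayTennis), id3(data, ["Outlook", "Temperature", "Humidity", "Wind"]) returns a nested dictionary with Outlook at the root, matching the tree derived later in these slides.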
Decision Tree for Playing Tennis
Consider the following table
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

16
Decision Tree for Playing Tennis

• We want to build a decision tree for the tennis matches
• Whether a match is played depends on the weather
  (Outlook, Temperature, Humidity, and Wind)
• So we apply what we know to build a decision tree
  based on this table

17
Decision Tree for Playing Tennis

• Calculating the information gains for each of the weather attributes:
  – For Wind
  – For Humidity
  – For Outlook
  – For Temperature

18
Decision Tree for Playing Tennis

• Attributes and their values:
  – Outlook: Sunny, Overcast, Rain
  – Humidity: High, Normal
  – Wind: Strong, Weak
  – Temperature: Hot, Mild, Cool
• Target concept – PlayTennis: Yes, No

19
Decision Tree for Playing Tennis

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
20
[See: Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997]
Complete Tree
Outlook

Sunny Overcast Rain

Humidity Yes Wind


[D3,D7,D12,D13]

High Normal Strong Weak

No Yes No Yes
[D1,D2] [D8,D9,D11] [D6,D14] [D4,D5,D10]
21
OR Complete Tree
[Figure: an alternative tree that also fits the data, with Humidity at the root
(High / Normal), a Wind test (Strong / Weak) under each humidity branch, and
Outlook tests (Sunny / Overcast / Rainy) below those – a much larger tree than
the one above]
22
• Decision trees built arbitrarily are a good solution, but not optimal.
• To obtain an optimal decision tree, a selection rule is needed.
• We need to find the root node according to some
  calculations (Entropy & Gain).
• After finding the root node, the process continues down to the leaves.

23
Steps for decision tree construction:

1. A decision tree needs a decision (go to play tennis or not)

2. Entropy calculation for the whole system

3. Root node selection (according to information gain)

24
Entropy & Gain
• Determining which attribute is best (Entropy & Gain)
• Entropy (E) is the minimum number of bits needed in
  order to classify an arbitrary example as yes or no
• E(S) = Σ (i = 1..c) –pi log2 pi
• E(S) = 0 if all examples belong to the same class
• E(S) = 1 if the examples are split equally between the two classes
• 0 < E(S) < 1 for any other class distribution
• The information gain G(S,A), where A is an attribute:
  G(S,A) = E(S) – Σ (v in Values(A)) (|Sv| / |S|) * E(Sv)
A code sketch of both quantities follows.

25
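To make these formulas concrete, here is a small sketch in plain Python (the helper names entropy and gain are assumptions, not from the slides) that evaluates E(S) and G(S,A) directly from class counts; the printed values match the numbers used on the later slides.

import math

def entropy(counts):
    # counts: class counts in a set S, e.g. [9, 5] for [9+, 5-]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts_list):
    # G(S, A) = E(S) - sum over values v of |Sv|/|S| * E(Sv)
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - remainder

print(entropy([9, 5]))                 # ~0.940 : E(S) for the PlayTennis data
print(gain([9, 5], [[3, 4], [6, 1]]))  # ~0.151 : Gain(S, Humidity)  (High / Normal)
print(gain([9, 5], [[6, 2], [3, 3]]))  # ~0.048 : Gain(S, Wind)      (Weak / Strong)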
Entropy
• The average amount of information needed to classify an object is given by the entropy
  measure
    E(S) = Σ (i = 1..c) –pi log2 pi
  where S is a set of training examples,
  c is the number of classes, and
  pi is the proportion of the training set that is of class i
• Zero entropy occurs when the entire set is from one class
  (for our entropy equation, 0 log2 0 = 0)
• For two classes it can also be written as: E(S) = –(p+) log2(p+) – (p–) log2(p–)
  where p+ is the proportion of positive examples in S and
  p– is the proportion of negative examples in S

[Figure: for a two-class problem, entropy plotted as a function of p(c1); it is 0 when
p(c1) = 0 or 1 and reaches its maximum of 1 at p(c1) = 0.5]

26
Example: the entropy values for the A1 split are computed as follows.
A1 = ? splits the set [29+, 35-] into True = [21+, 5-] and False = [8+, 30-].

• Entropy of the full set:
  E(A1) = –29/64 * log2(29/64) – 35/64 * log2(35/64) = 0.9937
• Entropy of the True branch:
  E(TRUE) = –21/26 * log2(21/26) – 5/26 * log2(5/26) = 0.7063
• Entropy of the False branch:
  E(FALSE) = –8/38 * log2(8/38) – 30/38 * log2(30/38) = 0.7426

27
Entropy reduction by data set partitioning
[Figure: a mixed set of shapes split by the question Color? into red, green and yellow subsets]

28
[Figure: the shapes partitioned by Color? (red / green / yellow)]

29
Information Gain
[Figure: the Color? partition (red / green / yellow) for which the information gain is computed]

30
Information Gain of the Attributes
• Gain(Color) = 0.246
• Gain(Outline) = 0.151
• Gain(Dot) = 0.048
• Heuristic: the attribute with the highest gain is chosen
• This heuristic is local (local minimization of impurity)

31
[Figure: shapes partitioned by Color? (red / green / yellow)]

Gain(Outline) = 0.971 – 0 = 0.971 bits
Gain(Dot) = 0.971 – 0.951 = 0.020 bits

32
Gain(Outline) = 0.971 – 0.951 = 0.020 bits
Gain(Dot) = 0.971 – 0 = 0.971 bits

[Figure: shapes partitioned by Color? (red / green / yellow), with an Outline? test
(solid / dashed) added below one of the branches]

33
[Figure: shapes partitioned by Color? (red / green / yellow), with a Dot? test (yes / no)
and an Outline? test (solid / dashed) added below the branches]

34
Decision Tree
The resulting tree:
  Color?
    red    -> Dot?      (yes -> triangle, no -> square)
    green  -> square
    yellow -> Outline?  (dashed -> triangle, solid -> square)

35
Which Attribute is ”best”?

[29+,35-] A1=? A2=? [29+,35-]

True False True False

[21+, 5-] [8+, 30-] [18+, 33-] [11+, 2-]

36
Information Gain
Entropy([21+,5-]) = 0.71          Entropy([18+,33-]) = 0.94
Entropy([8+,30-]) = 0.74          Entropy([11+,2-]) = 0.62
Gain(S,A1) = Entropy(S)           Gain(S,A2) = Entropy(S)
  - 26/64*Entropy([21+,5-])         - 51/64*Entropy([18+,33-])
  - 38/64*Entropy([8+,30-])         - 13/64*Entropy([11+,2-])
  = 0.27                            = 0.12

[29+,35-] A1=? A2=? [29+,35-]

True False True False

[21+, 5-] [8+, 30-] [18+, 33-] [11+, 2-]


37
Playing Tennis or Not Example
Calculations
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
38
Selecting the Next Attribute
S=[9+,5-] S=[9+,5-]
E=0.940 E=0.940
Humidity Wind

High Normal Weak Strong

[3+, 4-] [6+, 1-] [6+, 2-] [3+, 3-]


E=0.985 E=0.592 E=0.811 E=1.0
Gain(S,Humidity) Gain(S,Wind)
=0.940-(7/14)*0.985 =0.940-(8/14)*0.811
– (7/14)*0.592 – (6/14)*1.0
=0.151 =0.048
Humidity provides greater info. gain than Wind, w.r.t target classification. 39
Selecting the Next Attribute
S=[9+,5-]
E=0.940
Outlook

Over
Sunny Rain
cast
[2+, 3-] [4+, 0] [3+, 2-]
E=0.971 E=0.0 E=0.971
Gain(S,Outlook)
=0.940-(5/14)*0.971
-(4/14)*0.0 – (5/14)*0.971
=0.247
40
Selecting the Next Attribute
The information gain values for the 4 attributes
are (recomputed in the code sketch after this slide):
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0Log20 =0

41
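The gains above can be recomputed directly from the table on slide 38. The following self-contained sketch (the variable and function names are assumptions, not from the slides) does exactly that:

import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows from slide 38.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    # Information gain of the attribute in column `col` with respect to PlayTennis.
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

for col, name in enumerate(ATTRIBUTES):
    print(f"Gain(S,{name}) = {gain(DATA, col):.3f}")
# Gain(S,Outlook) = 0.247, Gain(S,Temperature) = 0.029,
# Gain(S,Humidity) = 0.152, Gain(S,Wind) = 0.048

At full precision Gain(S,Humidity) comes out as 0.152; the 0.151 used on the slides results from rounding the intermediate entropies to 0.985 and 0.592.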
Decision Tree Learning:
A Simple Example

• Outlook is our winner!

42
• Now our decision tree looks like:
[Figure: the partial tree with Outlook at the root and branches Sunny, Overcast, Rain]

43
• Let E([X+,Y-]) represent a set with X positive training examples and Y
  negative examples.
• The entropy of the training data, E(S), can therefore be written as
  E([9+,5-]), because of the 14 training examples 9 are yes and 5 are no.

• Let's start off by calculating the entropy of the training set:
  E(S) = E([9+,5-]) = (–9/14 log2 9/14) + (–5/14 log2 5/14)
       = 0.94
• Next we need to calculate the information gain G(S,A) for each attribute
  A, where A is taken from the set {Outlook, Temperature, Humidity, Wind}.

44
• G(S,Humidity) = 0.94 – [7/14*E(Humidity=high) + 7/14*E(Humidity=normal)]
• G(S,Humidity) = 0.94 – [7/14*E([3+,4-]) + 7/14*E([6+,1-])]
• G(S,Humidity) = 0.94 – [7/14*0.985 + 7/14*0.592]
• G(S,Humidity) = 0.1515

• Now that we have discovered the root of our decision tree,
  we must recursively find the nodes that should go below Sunny, Overcast,
  and Rain.

• G(Outlook=Rain, Humidity) = 0.971 – [2/5*E(Outlook=Rain ∧ Humidity=high)
                                     + 3/5*E(Outlook=Rain ∧ Humidity=normal)]
• G(Outlook=Rain, Humidity) = 0.02
• G(Outlook=Rain, Wind) = 0.971 – [3/5*0 + 2/5*0]
• G(Outlook=Rain, Wind) = 0.971

45
• The information gain for Outlook is:
• G(S,Outlook) = E(S) – [5/14 * E(Outlook=sunny) + 4/14 * E(Outlook=overcast) +
  5/14 * E(Outlook=rain)]
• G(S,Outlook) = E([9+,5-]) – [5/14*E([2+,3-]) + 4/14*E([4+,0-]) + 5/14*E([3+,2-])]
• G(S,Outlook) = 0.94 – [5/14*0.971 + 4/14*0.0 + 5/14*0.971]
• G(S,Outlook) = 0.246

• G(S,Temperature) = 0.94 – [4/14*E(Temperature=hot) + 6/14*E(Temperature=mild) +
  4/14*E(Temperature=cool)]
• G(S,Temperature) = 0.94 – [4/14*E([2+,2-]) + 6/14*E([4+,2-]) + 4/14*E([3+,1-])]
• G(S,Temperature) = 0.94 – [4/14*1.0 + 6/14*0.918 + 4/14*0.811]
• G(S,Temperature) = 0.029

46
• G(S,Humidity) = 0.94 – [7/14*E(Humidity=high) + 7/14*E(Humidity=normal)]
• G(S,Humidity) = 0.94 – [7/14*E([3+,4-]) + 7/14*E([6+,1-])]
• G(S,Humidity) = 0.94 – [7/14*0.985 + 7/14*0.592]
• G(S,Humidity) = 0.1515

• G(S,Wind) = 0.94 – [8/14*0.811 + 6/14*1.00]
• G(S,Wind) = 0.048

47
Then the Complete Tree should be like this

Outlook

Sunny Overcast Rain

Humidity Yes Wind


[D3,D7,D12,D13]

High Normal Strong Weak

No Yes No Yes
[D1,D2] [D8,D9,D11] [D6,D14] [D4,D5,D10]
48
Decision Tree for PlayTennis
Outlook

Sunny Overcast Rain

Humidity            Each internal node tests an attribute

High   Normal       Each branch corresponds to an attribute value

No     Yes          Each leaf node assigns a classification
49
Decision Tree for PlayTennis
Outlook   Temperature   Humidity   Wind   PlayTennis
Sunny     Hot           High       Weak   ? → No
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes 50
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak

Outlook

Sunny Overcast Rain

Wind No No

Strong Weak

No Yes
51
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak
Outlook

Sunny Overcast Rain

Yes Wind Wind

Strong Weak Strong Weak

No Yes No Yes
52
Decision Tree
• Decision trees represent disjunctions of conjunctions
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak


No Yes No Yes

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
53
ID3 Algorithm Note: 0Log20 =0

[D1,D2,…,D14] Outlook
[9+,5-]

Sunny Overcast Rain

Ssunny=[D1,D2,D8,D9,D11] [D3,D7,D12,D13] [D4,D5,D6,D10,D14]


[2+,3-] [4+,0-] [3+,2-]
Test for this node
? Yes ?
Gain(Ssunny , Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny , Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny , Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
54
ID3 Algorithm
Outlook

Sunny Overcast Rain

Humidity Yes Wind


[D3,D7,D12,D13]

High Normal Strong Weak

No Yes No Yes
[D1,D2] [D8,D9,D11] [D6,D14] [D4,D5,D10]
55
Converting a Tree to Rules
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak


No Yes No Yes
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
A sketch that generates such rules from the tree automatically follows.
56
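Here is a minimal sketch of the conversion, assuming the nested-dictionary tree representation used in the earlier ID3 sketch; the function name and the hard-coded tennis_tree are illustrative only.

def tree_to_rules(tree, conditions=()):
    # tree: a class label (leaf) or {attribute: {value: subtree}}
    if not isinstance(tree, dict):
        test = " AND ".join(f"{a}={v}" for a, v in conditions) or "TRUE"
        return [f"If {test} Then PlayTennis={tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

tennis_tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                           "Overcast": "Yes",
                           "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
for rule in tree_to_rules(tennis_tree):
    print(rule)
# One rule per root-to-leaf path, e.g.
# If Outlook=Sunny AND Humidity=High Then PlayTennis=No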
Advantages/Disadvantages
of Decision Trees
Advantages
• Easy construction
• Simple to understand and interpret
• Understandable rules can be obtained
• Requires little data preparation (other techniques often
  require data normalisation, dummy variables to be
  created and blank values to be removed)
Disadvantages
• A large number of classes makes the tree hard to learn and its predictions less reliable
• The cost of constructing and then pruning the tree can be high, and the resulting
  complexity can make the model confusing to interpret


57
A Defect of Information Gain

• Information gain favors attributes with many values
• Such an attribute splits S into many subsets, and if these
  are small, they will tend to be pure anyway
• One way to rectify this is through a corrected
  measure, the information gain ratio.

58
Information Gain
• Gain(Sample, Attributes), or Gain(S,A), is the expected
  reduction in entropy due to sorting S on attribute A:

  Gain(S,A) = Entropy(S) - Σ (v in Values(A)) |Sv|/|S| * Entropy(Sv)

So, for the previous example, the information gain is
calculated as:

  G(A1) = E(A1) - (21+5)/(29+35) * E(TRUE)
                - (8+30)/(29+35) * E(FALSE)
        = E(A1) - 26/64 * E(TRUE) - 38/64 * E(FALSE)
        = 0.9937 - 26/64 * 0.7063 - 38/64 * 0.7426
        = 0.2658

59
Information Gain Ratio

• I(A) is the amount of information needed to
  determine the value of attribute A for an example:
  I(A) = – Σ (v in Values(A)) (|Sv|/|S|) log2(|Sv|/|S|)

• Information gain ratio:
  GainRatio(A) = Gain(S,A) / I(A)

A code sketch of the corrected measure follows.
60
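A small sketch of the corrected measure (the function names are assumptions): the split information I(A) penalises attributes that fragment S into many small subsets, and the gain is divided by it.

import math

def split_information(subset_sizes):
    # I(A): entropy of the partition that attribute A induces on S (sizes of the Sv).
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(information_gain, subset_sizes):
    # GainRatio(A) = Gain(S, A) / I(A)
    return information_gain / split_information(subset_sizes)

# PlayTennis example: Outlook splits the 14 examples into subsets of size 5, 4 and 5.
print(gain_ratio(0.247, [5, 4, 5]))   # ~0.157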
Information Gain Ratio
[Figure: the Color? partition (red / green / yellow) for which the gain ratio is computed]

61
Information Gain
and
Information Gain Ratio

A |v(A)| Gain(A) GainRatio(A)


Color 3 0.247 0.156
Outline 2 0.152 0.152
Dot 2 0.048 0.049

62
Gini Index

• Another sensible measure of impurity
  (i and j are classes):
  Gini(S) = Σ (i ≠ j) p(i) p(j) = 1 – Σ i p(i)^2

• After applying attribute A, the resulting Gini index is
  Gini_A(S) = Σ (v in Values(A)) (|Sv|/|S|) Gini(Sv)

• Gini can be interpreted as an expected error rate

A code sketch of both quantities follows.
63
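Here is a sketch of both quantities in plain Python (the names are assumptions); the usage line applies them to the PlayTennis data split on Outlook.

def gini(counts):
    # Gini(S) = 1 - sum_i p_i^2   (equivalently sum_{i != j} p_i * p_j)
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_after_split(child_counts_list):
    # Gini_A(S) = sum over values v of |Sv|/|S| * Gini(Sv)
    total = sum(sum(child) for child in child_counts_list)
    return sum(sum(child) / total * gini(child) for child in child_counts_list)

# PlayTennis data [9+,5-] split on Outlook into [2+,3-], [4+,0-], [3+,2-]:
gini_gain = gini([9, 5]) - gini_after_split([[2, 3], [4, 0], [3, 2]])
print(gini_gain)   # ~0.116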
Gini Index
[Figure: the mixed set of shapes for which the Gini index is computed]

64
Gini Index for Color
[Figure: the Color? partition (red / green / yellow) for which the Gini index after the split is computed]

65
Gain of Gini Index

GiniGain(A) = Gini(S) – Gini_A(S)
66
• Used by the CART (classification and regression trees) algorithm.
• Gini impurity is a measure of how often a randomly chosen element
  from the set would be incorrectly labeled if it were randomly labeled
  according to the distribution of labels in the subset.
• Gini impurity can be computed by summing the probability of each item
  being chosen times the probability of a mistake in categorizing that item.
• It reaches its minimum (zero) when all cases in the node fall into a
  single target category.

67
Three Impurity Measures
A Gain(A) GainRatio(A) GiniGain(A)

Color 0.247 0.156 0.058


Outline 0.152 0.152 0.046
Dot 0.048 0.049 0.015

• These impurity measures assess the effect of a single
  attribute
• The "most informative" criterion they define is local
  (and "myopic")
• It does not reliably predict the effect of several
  attributes applied jointly
68
Occam's Razor
"If two theories explain the facts equally well, then the
simpler theory is to be preferred"
Arguments in favor:
• There are fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a
  coincidence
• A long hypothesis that fits the data might be a
  coincidence
Arguments opposed:
• There are many ways to define small sets of
  hypotheses

69
Overfitting in Decision Trees
• One of the biggest problems with decision trees is
  overfitting

70
Avoid Overfitting
How can we avoid overfitting?
• Stop growing when a data split is not statistically
  significant
• Grow the full tree, then post-prune (a pruning sketch follows)

How to select the "best" tree:
• Measure performance over a separate validation data
  set (held out from the training data)
• Minimize |tree| + |misclassifications(tree)|
• Minimum description length (MDL):
  minimize size(tree) + size(misclassifications(tree))
71
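As one concrete post-pruning strategy, here is a minimal sketch of reduced-error pruning. It assumes the nested-dictionary tree representation from the earlier ID3 sketch, examples as dictionaries, and a tree grown from the training set (so every branch is supported by at least one training example); all names are illustrative, and real implementations differ in details such as tie handling.

from collections import Counter

def classify(tree, example, default="Yes"):
    # tree is either a class label (leaf) or {attribute: {value: subtree}}
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(example[attribute], default)
    return tree

def accuracy(tree, examples, target="PlayTennis"):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def reduced_error_prune(tree, train, val, target="PlayTennis"):
    # Post-prune bottom-up: try replacing each subtree with the majority class of
    # the training examples reaching it; keep the replacement whenever it is no
    # worse on the validation examples reaching the same node.
    if not isinstance(tree, dict):
        return tree
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        sub_train = [e for e in train if e[attribute] == value]
        sub_val = [e for e in val if e[attribute] == value]
        branches[value] = reduced_error_prune(subtree, sub_train, sub_val, target)
    majority = Counter(e[target] for e in train).most_common(1)[0][0]
    if not val or accuracy(majority, val, target) >= accuracy(tree, val, target):
        return majority   # pruning this node does not hurt validation accuracy
    return tree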
Effect of Reduced Error Pruning

72
Unknown Attribute Values
What if some examples have missing values of A?
Use the training example anyway and sort it through the tree:
• If node n tests A, assign the most common value of A
  among the other examples sorted to node n, or
• Assign the most common value of A among the other examples
  with the same target value, or
• Assign probability pi to each possible value vi of A, and
  assign fraction pi of the example to each descendant in the
  tree.

Classify new examples in the same fashion.
(A small sketch of the first strategy follows.)
73
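A minimal sketch of the first strategy only (names are assumed; for simplicity it is applied to a whole example set rather than to the examples at a particular node n):

from collections import Counter

def fill_most_common(examples, attribute, missing=None):
    # Replace missing values of `attribute` with its most common observed value.
    observed = [e[attribute] for e in examples if e[attribute] is not missing]
    most_common_value = Counter(observed).most_common(1)[0][0]
    for e in examples:
        if e[attribute] is missing:
            e[attribute] = most_common_value
    return examples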
Cross-Validation
• Estimate the accuracy of a hypothesis
  induced by a supervised learning algorithm
• Predict the accuracy of a hypothesis over
  future unseen instances
• Select the optimal hypothesis from a given
  set of alternative hypotheses:
  – Pruning decision trees
  – Model selection
  – Feature selection
  – Combining multiple classifiers (boosting)
(A usage sketch follows.)

74
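As a minimal illustration of estimating accuracy this way, here is a sketch that assumes scikit-learn is installed and uses its built-in Iris data as a stand-in for a prepared numeric feature matrix (categorical attributes such as Outlook would first need encoding):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for the sketch
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)

# 5-fold cross-validation: accuracy on held-out folds, e.g. for comparing
# depths or pruning settings (model selection).
scores = cross_val_score(tree, X, y, cv=5)
print(scores.mean(), scores.std())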
References:
• Alpaydin, E., Introduction to Machine Learning, 2nd ed., 2010.
• Dr. Lee's slides, San Jose State University, Spring 2007.
• Colin, A., "Building Decision Trees with the ID3 Algorithm", Dr. Dobb's Journal, June 1996.
• Utgoff, P. E., "Incremental Induction of Decision Trees", Kluwer Academic Publishers, 1989.
• http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
• http://decisiontrees.net/node/27
• Mitchell, T. M., Machine Learning, McGraw-Hill, 1997.
• Barros, R. C., Cerri, R., Jaskowiak, P. A., Carvalho, A. C. P. L. F., "A bottom-up oblique
  decision tree induction algorithm" (http://dx.doi.org/10.1109/ISDA.2011.6121697),
  Proceedings of the 11th International Conference on Intelligent Systems Design and
  Applications (ISDA), 2011.
• Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification and
  Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
  ISBN 978-0-412-04841-8.
• Barros, R. C., Basgalupp, M. P., Carvalho, A. C. P. L. F., Freitas, A. A. (2011). "A Survey of
  Evolutionary Algorithms for Decision-Tree Induction"
  (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5928432), IEEE Transactions on
  Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 3,
  pp. 291-312, May 2012.
75
