Decision Tree
Representation of Concepts
Concept learning: conjunction of attributes
(Sunny AND Hot AND Humid AND Windy) +
Rectangle learning
◦ Conjunctions correspond to a single rectangle in the instance space
◦ Disjunctions of conjunctions correspond to a union of rectangles
[Figure: axis-parallel rectangles enclosing the + instances in a 2-D instance space]
Decision Tree:
A decision tree is a powerful and popular tool for classification and
prediction. It is a flowchart-like tree structure in which each internal
node denotes a test on an attribute, each branch represents an outcome of
the test, and each leaf (terminal) node holds a class label.
Fig. 1: A decision tree for the concept PlayTennis.
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Trees
◦ A decision tree represents a learned target function
◦ Each internal node tests an attribute
◦ Each branch corresponds to an attribute value
◦ Each leaf node assigns a classification
Representation in decision trees
Example of representing a rule in decision trees:
if (outlook = sunny AND humidity = normal)
OR (outlook = overcast)
OR (outlook = rain AND wind = weak)
then PlayTennis = Yes
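As a small illustration (not part of the original slides), the disjunction above can be written directly as a Boolean predicate; the attribute values are taken from the PlayTennis table.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Predict PlayTennis = Yes using the rule read off the decision tree:
    (sunny AND normal humidity) OR overcast OR (rain AND weak wind)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "Normal", "Weak"))   # True  (e.g. example D9)
print(play_tennis("Sunny", "High", "Weak"))     # False (e.g. example D1)
```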
Applications of Decision Trees
Decision trees are well suited to problems where:
– Instances are describable by a fixed set of attributes and their values
– The target function is discrete valued (2-valued or N-valued;
  continuous functions can be approximated)
– The hypothesis space is disjunctive
– The training data may be noisy (errors, missing values, …)
Examples:
– Equipment or medical diagnosis
– Credit risk analysis
– Calendar scheduling preferences
Decision Trees
Given a distribution of training instances over two attributes, draw
axis-parallel lines to separate the instances of each class.
[Figure: 2-D scatter of + and - training instances over Attribute 1 and Attribute 2]
Decision Tree Structure
◦ Decision node = a condition (a split such as Attribute 1 at 20 or 40, or
  Attribute 2 at 30)
◦ Decision leaf = a box, i.e. the collection of examples satisfying the
  conditions on the path to it
◦ Alternate splits are possible
[Figure: the same 2-D scatter partitioned into boxes by axis-parallel splits]
The strengths of decision tree methods are:
– Decision trees are able to handle both continuous and categorical variables.
– Decision trees provide a clear indication of which fields are most important.
The weaknesses of decision tree methods are:
– Decision trees are less appropriate for estimation tasks where the goal is to
  predict the value of a continuous attribute.
– Decision trees are prone to errors in classification problems with many
  classes and a relatively small number of training examples.
Top-Down Construction
Start with an empty tree.
Main loop:
1. Choose the “best” decision attribute A for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, STOP;
   else iterate over the new leaf nodes
Grow the tree just deep enough for perfect classification
– if possible (or approximate at a chosen depth)
Which attribute is best?
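A minimal sketch of this top-down loop in Python (not from the slides); the choice of the "best" attribute is passed in as a function and is filled in later by the information-gain measure.

```python
from collections import Counter

def build_tree(examples, attributes, best_attribute):
    """Recursively grow a decision tree top-down.

    examples   : list of (features_dict, label) pairs
    attributes : attributes still available for splitting
    best_attribute(examples, attributes) -> attribute name (e.g. by info gain)
    """
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # pure node: stop
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    a = best_attribute(examples, attributes)                 # steps 1-2
    tree = {a: {}}
    for value in {feats[a] for feats, _ in examples}:        # step 3
        subset = [(f, l) for f, l in examples if f[a] == value]   # step 4
        remaining = [x for x in attributes if x != a]
        tree[a][value] = build_tree(subset, remaining, best_attribute)  # step 5
    return tree
```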
Principle of Decision Tree Construction
◦ Ultimately we want to form pure leaves
◦ i.e. leaves giving the correct classification
◦ Greedy approach to reach correct classification:
1. Initially treat the entire data set as a single box
2. For each box, choose the split that reduces its impurity (in
   terms of class labels) by the maximum amount
3. Split the box having the highest reduction in impurity
4. Continue from Step 2
5. Stop when all boxes are pure
Best attribute to split?
[Figure: the same 2-D scatter, shown on successive slides with different
candidate axis-parallel splits on Attribute 1 and Attribute 2]
Which split to make next?
After a split, a pure box/node contains examples of only one class; it is
already a pure leaf and there is no further need to split it. Only mixed
boxes/nodes are split further (e.g. with the condition A2 > 30?).
[Figure: the partitioned scatter with pure and mixed boxes marked]
In a decision tree, the major challenge is the identification of the attribute
for the root node at each level. This process is known as attribute selection.
We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into
smaller subsets, the entropy changes. Information gain is a measure of this
change in entropy.
Entropy:
Entropy is a measure of the uncertainty of a random variable; it characterizes
the impurity of an arbitrary collection of examples. The higher the entropy,
the higher the information content.
Example:
Choosing Best Attribute?
◦ Consider 64 examples, 29+ and 35-, and two candidate attributes A1 and A2
◦ Which one is better?
Entropy
◦ A measure of
  ◦ uncertainty
  ◦ purity
  ◦ information content
◦ Information theory: the optimal length code assigns (-log2 p) bits to a
  message having probability p
◦ S is a sample of training examples
◦ p+ is the proportion of positive examples in S
◦ p- is the proportion of negative examples in S
◦ Entropy of S: the average optimal number of bits to encode information
  about the certainty/uncertainty of S
Entropy(S) = p+ (-log2 p+) + p- (-log2 p-) = -p+ log2 p+ - p- log2 p-
◦ Can be generalized to more than two values
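A short sketch of the binary entropy formula above (Python, not part of the slides), checked against the sample sizes used in these slides.

```python
import math

def entropy(pos: int, neg: int) -> float:
    """Binary entropy of a sample with pos positive and neg negative examples:
    Entropy(S) = -p+ log2 p+ - p- log2 p-   (0 log2 0 is taken as 0)."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

print(entropy(29, 35))   # the 64-example sample: about 0.993
print(entropy(9, 5))     # the PlayTennis sample: about 0.940
print(entropy(32, 32))   # maximum uncertainty: 1.0
print(entropy(64, 0))    # pure sample: 0.0
```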
Entropy
Entropy can also be viewed as measuring
– the purity of S,
– the uncertainty in S,
– the information in S, …
E.g.: the entropy values for p+ = 1, p+ = 0, and p+ = 0.5 are 0, 0, and 1
Easy generalization beyond binary values:
– Entropy = sum over i of pi (-log2 pi), for i = 1, …, n
  where i is + or - in the binary case, and i varies from 1 to n in general
Choosing Best Attribute?
◦ Consider 64 examples (29+, 35-), with E(S) = 0.993, and compute the
  entropies of the subsets produced by each candidate split.
First pair of candidate attributes:
– A1: t branch (25+, 5-), E = 0.650; f branch (4+, 30-), E = 0.522
– A2: t branch (15+, 19-), E = 0.989; f branch (14+, 16-), E = 0.997
◦ Which one is better?
Second pair of candidate attributes:
– A1: t branch (21+, 5-), E = 0.708; f branch (8+, 30-), E = 0.742
– A2: t branch (18+, 33-), E = 0.937; f branch (11+, 2-), E = 0.619
◦ Which is better?
Information Gain
◦ Gain(S, A): the reduction in entropy after splitting on attribute A

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

For the 64-example sample (29+, 35-), E(S) = 0.993:
– A1 (first pair): t (25+, 5-) E = 0.650, f (4+, 30-) E = 0.522, Gain = 0.395
– A2 (first pair): t (15+, 19-) E = 0.989, f (14+, 16-) E = 0.997, Gain = 0.000
– A1 (second pair): t (21+, 5-) E = 0.708, f (8+, 30-) E = 0.742, Gain = 0.265
– A2 (second pair): t (18+, 33-) E = 0.937, f (11+, 2-) E = 0.619, Gain = 0.121
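A self-contained sketch (not from the slides) of the Gain(S, A) formula, applied to the second pair of candidate attributes above.

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

def gain(parent, children):
    """Gain(S, A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv).
    parent   : (pos, neg) counts of S
    children : list of (pos, neg) counts, one per value of A."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - weighted

# Second pair of candidates on the (29+, 35-) sample:
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1: about 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2: about 0.12
```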
Gain function
Gain measures how much a split can
– reduce uncertainty
Its value lies between 0 and 1
What is the significance of
– a gain of 0?
  e.g. a 50/50 split of +/- both before and after discriminating on the
  attribute's values
– a gain of 1?
  e.g. going from "perfect uncertainty" to perfect certainty after splitting
  on a perfectly predictive attribute
Splitting on high-gain attributes finds "patterns" in the training examples
relating to attribute values and moves toward a locally minimal
representation of the training examples
Training Examples (see the table above)
Determine the Root Attribute
[Figure: candidate root splits of S (9+, 5-), E = 0.940, on Humidity and Wind]
Sort the Training Examples
With Outlook chosen as the root, S = {D1, …, D14} (9+, 5-) is split into the
Sunny, Overcast, and Rain branches; the Overcast branch is already pure (Yes),
while the other two branches still need an attribute:
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temp) = 0.570
Gain(Ssunny, Wind) = 0.019
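For concreteness (not part of the slides), a sketch that recomputes these gains directly from the training table; the data layout and helper names are my own.

```python
import math
from collections import Counter

# The PlayTennis training examples from the table above (Outlook, Temp, Humidity, Wind, label).
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    col = ATTRS[attr]
    n = len(rows)
    remainder = sum(len(sub) / n * entropy(sub)
                    for v in {r[col] for r in rows}
                    for sub in [[r for r in rows if r[col] == v]])
    return entropy(rows) - remainder

for a in ATTRS:                              # Outlook has the highest gain
    print(a, round(gain(DATA, a), 3))

sunny = [r for r in DATA if r[0] == "Sunny"]
print(round(gain(sunny, "Humidity"), 2))     # about 0.97, matching the slide
```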
Final Decision Tree for Example
Outlook
  Sunny    -> Humidity
               High   -> No
               Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
               Strong -> No
               Weak   -> Yes
GINI Index:
The Gini index (or Gini impurity) measures the degree or probability of a
particular variable being wrongly classified when it is chosen at random. But
what is actually meant by 'impurity'? If all the elements belong to a single
class, the node can be called pure. The Gini index varies between 0 and 1,
where 0 denotes that all elements belong to a single class (or that only one
class exists), and 1 denotes that the elements are randomly distributed across
many classes. A Gini index of 0.5 denotes elements equally distributed over
two classes.
Formula for Gini Index:
Gini(S) = 1 - Σ_i p_i^2
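A tiny sketch of this formula (Python, not part of the slides):

```python
def gini(counts):
    """Gini index of a node: 1 - sum_i p_i^2, where p_i is the fraction of
    examples in class i (counts is a list of per-class example counts)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))    # pure node: 0.0
print(gini([5, 5]))     # two equally likely classes: 0.5
print(gini([9, 5]))     # the PlayTennis sample: about 0.459
```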
Ease of decision making: a decision tree handles decision making
automatically, whereas other approaches require a decision threshold to be set.
[Figure: accuracy on training data and on testing data plotted against the
complexity of the tree]
Overfitting
When to stop splitting further?
A very deep tree is required to fit just one odd training example.
[Figure: the 2-D scatter with one odd training example inside the
opposite-class region, forcing extra splits]
Avoiding Overfitting
Reduced-Error Pruning
• A post-pruning, cross-validation approach
  - Partition the training data into a “grow” set and a “validation” set.
  - Build a complete tree from the “grow” data.
  - Until accuracy on the validation set decreases, do:
      For each non-leaf node in the tree:
        Temporarily prune the tree below it, replacing it by a majority vote.
        Test the accuracy of the resulting hypothesis on the validation set.
      Permanently prune the node whose removal gives the greatest increase
      in accuracy on the validation set.
• Problem: uses less data to construct the tree
• Sometimes done at the rules level
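A compact sketch of this procedure (not from the slides), assuming the nested-dict tree representation from the earlier build_tree sketch and data given as (attribute-dict, label) pairs; the helper names are my own.

```python
import copy
from collections import Counter

# A tree is either a class label (leaf) or {attribute: {value: subtree, ...}}.

def predict(tree, example, default="No"):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example.get(attr), default)
    return tree

def accuracy(tree, data):
    return sum(predict(tree, ex) == y for ex, y in data) / len(data)

def internal_paths(tree, path=()):
    """Yield the path (sequence of (attribute, value) steps) to every internal node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, sub in tree[attr].items():
            yield from internal_paths(sub, path + ((attr, value),))

def pruned_copy(tree, path, grow):
    """Copy of tree with the node at `path` replaced by a majority-vote leaf,
    where the majority is taken over the grow-set examples reaching that node."""
    new_tree = copy.deepcopy(tree)
    reaching = [(ex, y) for ex, y in grow
                if all(ex.get(a) == v for a, v in path)]
    majority = Counter(y for _, y in (reaching or grow)).most_common(1)[0][0]
    if not path:                              # pruning the root itself
        return majority
    node = new_tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = majority
    return new_tree

def reduced_error_prune(tree, grow, validation):
    """Greedily prune while some replacement does not hurt validation accuracy."""
    while isinstance(tree, dict):
        best, best_acc = None, accuracy(tree, validation)
        for path in internal_paths(tree):
            candidate = pruned_copy(tree, path, grow)
            if accuracy(candidate, validation) >= best_acc:
                best, best_acc = candidate, accuracy(candidate, validation)
        if best is None:
            break
        tree = best
    return tree
```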
Rule post-pruning
Thanks to the audience.
Gini Index
◦ Another sensible measure of impurity (i and j are classes):
  Gini = Σ_{i ≠ j} p(i) p(j) = 1 - Σ_i p(i)^2
Gini Index for Color
[Figure: a split on Color? with branches red, green, and yellow, and the Gini
index of each resulting subset]
Gain of Gini Index
Three Impurity Measures
[Table: for each attribute A, the values of Gain(A), GainRatio(A), and GiniGain(A)]
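The slides do not spell out the GainRatio and GiniGain formulas; the sketch below (not from the slides) uses the standard definitions, GainRatio(A) = Gain(A) / SplitInformation(A) and GiniGain(A) = Gini(S) minus the size-weighted Gini of the subsets, applied to the Outlook split of the PlayTennis data.

```python
import math

def _proportions(counts):
    n = sum(counts)
    return [c / n for c in counts if c > 0]

def entropy(counts):
    return -sum(p * math.log2(p) for p in _proportions(counts))

def gini(counts):
    return 1.0 - sum(p ** 2 for p in _proportions(counts))

def impurity_reduction(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

def gain(parent, children):
    return impurity_reduction(parent, children, entropy)

def gain_ratio(parent, children):
    """Gain divided by the split information (entropy of the branch sizes)."""
    split_info = entropy([sum(c) for c in children])
    return gain(parent, children) / split_info if split_info else 0.0

def gini_gain(parent, children):
    return impurity_reduction(parent, children, gini)

# Outlook on the PlayTennis data: S = (9+, 5-), branches Sunny (2+, 3-),
# Overcast (4+, 0-), Rain (3+, 2-).
branches = [(2, 3), (4, 0), (3, 2)]
print(round(gain((9, 5), branches), 3))        # about 0.247
print(round(gain_ratio((9, 5), branches), 3))  # about 0.156
print(round(gini_gain((9, 5), branches), 3))   # about 0.116
```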