Ch02 DecisionTree
Decision Tree
1. Decision-Tree Learning
2. Decision-Trees
Decision-Tree Learning
Introduction
• Decision Trees
• TDIDT: Top-Down Induction of Decision Trees
ID3
• Attribute selection
• Entropy, Information, Information Gain
• Gain Ratio
C4.5
• Numeric Values
• Missing Values
• Pruning
Regression and Model Trees
Decision-Trees: Divide-And-Conquer Algorithms
The classical family of decision-tree learning algorithms is known as TDIDT: Top-Down Induction of Decision Trees.
These algorithms learn trees in a top-down fashion by recursively splitting the data (divide-and-conquer).
Decision-Trees: ID3 Algorithm
Function ID3(S):
If all examples in S belong to the same class c, return a new leaf and label it with c.
Else: select the attribute A with the best evaluation (attribute-selection measures are discussed below), create an inner node labeled with A, split S into subsets Sᵢ according to the values of A, and recursively apply ID3 to each non-empty subset Sᵢ, attaching the resulting subtrees to the node.
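To make the recursion concrete, here is a minimal Python sketch of this top-down scheme. It is illustrative only, not Quinlan's original implementation: it assumes nominal attributes, represents examples as dicts, and takes the attribute-evaluation measure (e.g. the information gain defined later in this chapter) as a parameter `score`.

```python
from collections import Counter

def id3(examples, attributes, target, score):
    """Illustrative TDIDT/ID3 sketch.
    examples:   list of dicts mapping attribute names to (nominal) values
    attributes: list of attribute names still available for splitting
    target:     name of the class attribute
    score:      evaluation measure, e.g. information gain (higher = better)
    Returns a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:                      # all examples share one class -> leaf
        return classes.pop()
    if not attributes:                         # nothing left to split on -> majority-class leaf
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: score(examples, a, target))
    node = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = id3(subset, remaining, target, score)
    return node
```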
Decision-Trees: A Different Decision Tree
Decision-Trees: What is a good Attribute?
A good attribute-selection measure prefers attributes that split the data so that each successor node is as pure as possible.
In other words, we want a measure that prefers attributes producing a high degree of “order” in the resulting subsets.
Decision-Trees: Entropy (for two classes)
• S is a set of examples
• p⊕ is the proportion of examples in class ⊕
• p⊖ = 1 − p⊕ is the proportion of examples in class ⊖
Entropy:
E(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖   (1)
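As a sanity check, the two-class entropy of equation (1) is easy to compute directly; the sketch below (an illustration, not part of the original slides) takes the proportion p⊕ as input.

```python
import math

def entropy_two_class(p_pos):
    """E(S) = -p_pos*log2(p_pos) - p_neg*log2(p_neg), with 0*log2(0) taken as 0."""
    result = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            result -= p * math.log2(p)
    return result

print(entropy_two_class(0.5))   # 1.0  -> maximally impure (50/50 split)
print(entropy_two_class(1.0))   # 0.0  -> pure node
print(entropy_two_class(0.75))  # ~0.811
```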
Decision-Trees: Entropy (for more classes)
For c classes with proportions pᵢ (i = 1, …, c), the definition generalizes to:
E(S) = −Σᵢ pᵢ log₂ pᵢ
Decision-Trees: Average Entropy / Information
When attribute A splits S into subsets Sᵢ, the average entropy (information) of the split is the entropy of the subsets weighted by their relative sizes:
I(S, A) = Σᵢ (|Sᵢ| / |S|) · E(Sᵢ)
Decision-Trees: Information Gain
When an attribute A splits the set S into subsets Sᵢ, we compute the average entropy of the subsets and compare it to the entropy of the original set S.
Information Gain for attribute A:
Gain(S, A) = E(S) − I(S, A) = E(S) − Σᵢ (|Sᵢ| / |S|) · E(Sᵢ)   (4)
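Equation (4) translated into code, in a form that plugs directly into the `id3` sketch shown earlier (same assumed representation: examples as dicts, `target` naming the class attribute):

```python
import math
from collections import Counter

def entropy(examples, target):
    """E(S) over the class distribution of `examples`."""
    total = len(examples)
    counts = Counter(ex[target] for ex in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = E(S) - sum_i |S_i|/|S| * E(S_i)."""
    total = len(examples)
    avg_entropy = 0.0                          # I(S, A): average entropy of the split
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        avg_entropy += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - avg_entropy
```

With these definitions a tree could be grown as `id3(examples, attributes, "class", information_gain)`; the attribute and class names here are placeholders.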
Decision-Trees: Properties of Entropy
Entropy is the only function that satisfies all of the following three properties:
Decision-Trees: Highly-branching attributes
Attributes with many values (e.g. an identification code) split the data into many small, almost pure subsets, so information gain is biased towards them even though they generalize poorly.
Decision-Trees: Intrinsic Information of an Attribute
The intrinsic information of a split is the entropy of the distribution of instances over the branches (ignoring the class labels):
IntI(S, A) = −Σᵢ (|Sᵢ| / |S|) · log₂ (|Sᵢ| / |S|)
Decision-Trees: Gain Ratio
Modification of the information gain that reduces its bias towards multi-valued attributes.
Takes number and size of branches into account when choosing an attribute. Corrects the
information gain by taking the intrinsic information of a split into account.
Definition of Gain Ratio:
GR(S, A) = Gain(S, A) / IntI(S, A)   (6)
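A possible implementation in the same plug-in style; note that IntI(S, A) is just the entropy of how the instances are distributed over the branches, so the same helper can compute both quantities (again an illustrative sketch, not C4.5's exact code):

```python
import math
from collections import Counter

def _entropy_of_counts(counts, total):
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

def gain_ratio(examples, attribute, target):
    """GR(S, A) = Gain(S, A) / IntI(S, A)."""
    total = len(examples)
    branch_counts = Counter(ex[attribute] for ex in examples)
    avg_entropy = 0.0
    for value, n in branch_counts.items():
        subset_classes = Counter(ex[target] for ex in examples if ex[attribute] == value)
        avg_entropy += n / total * _entropy_of_counts(subset_classes, n)
    gain = _entropy_of_counts(Counter(ex[target] for ex in examples), total) - avg_entropy
    intrinsic = _entropy_of_counts(branch_counts, total)   # IntI(S, A)
    return gain / intrinsic if intrinsic > 0 else 0.0
```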
Decision-Trees: Gini Index
There are many alternative measures to information gain. The most popular alternative is the Gini index.
Impurity measure (instead of entropy):
Gini(S) = 1 − Σᵢ pᵢ²   (7)
A Gini gain could be defined analogously to information gain, but in practice the average Gini index of the split is minimized rather than maximizing the Gini gain.
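A small sketch of both quantities (same assumed dict representation as above). Since the average Gini index is minimized, a scoring function for the `id3` sketch would return its negative:

```python
from collections import Counter

def gini(examples, target):
    """Gini(S) = 1 - sum_i p_i^2 over the class proportions p_i."""
    total = len(examples)
    return 1.0 - sum((n / total) ** 2
                     for n in Counter(ex[target] for ex in examples).values())

def avg_gini(examples, attribute, target):
    """Weighted Gini index of the split on `attribute` (lower is better)."""
    total = len(examples)
    result = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        result += len(subset) / total * gini(subset, target)
    return result
```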
Decision-Trees: Comparison among Splitting Criteria
Decision-Trees: Industrial-strength algorithms
To be useful in practice, a decision-tree learner must also handle numeric attributes, missing values, and noisy data, and must prune the resulting trees; C4.5, the successor of ID3, adds these capabilities.
Decision-Trees: Numeric attributes
• Evaluate info gain (or other measure) for every possible split point of attribute
• Choose “best” split point
• Info gain for best split point is info gain for attribute
Decision-Trees: Efficient Computation
• Linearly scan the sorted values, each time updating the count matrix and computing the evaluation measure
• Choose the split position that has the best value (see the sketch after this list)
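A sketch of this single-pass scheme for one numeric attribute. It uses the weighted Gini index as the evaluation measure purely for brevity; information gain (or any other measure) could be computed from the same incrementally updated counts:

```python
from collections import Counter

def gini_from_counts(counts, total):
    return 1.0 - sum((n / total) ** 2 for n in counts.values()) if total else 0.0

def best_numeric_split(values, labels):
    """Scan the sorted values once, moving instances from the right to the left
    count matrix, and keep the split point with the lowest weighted Gini index."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    left = Counter()
    right = Counter(label for _, label in pairs)
    best_score, best_threshold = float("inf"), None
    for i in range(total - 1):
        value, label = pairs[i]
        left[label] += 1                       # move one instance from right to left
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue                           # no valid split point between identical values
        n_left, n_right = i + 1, total - (i + 1)
        score = (n_left / total) * gini_from_counts(left, n_left) \
              + (n_right / total) * gini_from_counts(right, n_right)
        if score < best_score:
            best_score = score
            best_threshold = (value + pairs[i + 1][0]) / 2   # midpoint as the split point
    return best_threshold, best_score
```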
Decision-Trees: Binary vs. Multiway Splits
Decision-Trees: Missing values
During training, an instance whose value for the split attribute is missing is split into fractional instances that are sent down every branch, weighted by the proportion of training instances in each branch. Info gain and gain ratio still work with such fractional instances: use sums of weights instead of counts.
During classification, an instance with a missing value is split in the same way; the class distributions of the leaves it reaches are merged using the weights of the fractional instances.
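A toy sketch of the fractional-instance idea (names and data are made up for illustration): the weight of an instance with a missing value is distributed over the branches in proportion to the training instances that went down each branch.

```python
def branch_weights(train_values):
    """Proportion of training instances (with known values) per branch."""
    known = [v for v in train_values if v is not None]
    return {v: known.count(v) / len(known) for v in set(known)}

def distribute(instance_weight, weights):
    """Split one (possibly already fractional) instance across the branches."""
    return {branch: instance_weight * w for branch, w in weights.items()}

# Hypothetical node: 6 of 8 known training instances went "left", 2 went "right".
w = branch_weights(["left"] * 6 + ["right"] * 2 + [None])
print(distribute(1.0, w))   # {'left': 0.75, 'right': 0.25} (order may vary)
```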
Decision-Trees: Overfitting and Pruning
The lower the complexity of a concept, the smaller the danger that it overfits the data. Learning algorithms therefore try to keep the learned concepts simple.
Decision-Trees: Prepruning
Based on a statistical significance test: stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node.
Most popular test: the chi-squared test. Only statistically significant attributes are allowed to be selected by the information-gain procedure.
Pre-pruning may stop the growth process prematurely (early stopping).
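One way such a significance check could look in code, using SciPy's chi-squared test of independence on the contingency table of attribute values vs. classes (the threshold alpha = 0.05 is an arbitrary choice for illustration):

```python
from scipy.stats import chi2_contingency

def significant_association(contingency_table, alpha=0.05):
    """contingency_table: one row per attribute value, one column per class,
    containing the counts of training instances at the current node."""
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    return p_value < alpha

# Toy tables: strong association -> keep splitting; weak association -> stop.
print(significant_association([[8, 1], [1, 8]]))   # True
print(significant_association([[5, 4], [4, 5]]))   # False
```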
Decision-Trees: Post-Pruning
First learn a complete and consistent decision tree that classifies all examples in the training set correctly.
Then prune the tree back, step by step, as long as the estimated performance increases.
Decision-Trees: Post-Pruning
• Subtree replacement
• Subtree raising
Decision-Trees: Subtree replacement
Decision-Trees: Subtree raising
Decision-Trees: Estimating Error Rates
Decision-Trees: Reduced Error Pruning
Decision-Trees: Decision Lists and Decision Graphs
Decision Lists
Decision Graphs
Decision-Trees: Rules vs. Trees
Each decision tree can be converted into a rule set: a decision tree can be viewed as a set of non-overlapping rules, one per leaf, and is typically learned via divide-and-conquer algorithms (recursive partitioning).
The reverse transformation of rule sets / decision lists into trees is less trivial.
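Reading the rules off a tree is a straightforward traversal: every root-to-leaf path becomes one rule whose conditions are the attribute tests along the path. A small sketch, using the nested-dict tree representation from the ID3 sketch earlier and a made-up weather-style tree:

```python
def tree_to_rules(node, conditions=()):
    """Return a list of (conditions, class_label) pairs, one per leaf."""
    if not isinstance(node, dict):                     # leaf: emit the accumulated rule
        return [(list(conditions), node)]
    (attribute, branches), = node.items()
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

tree = {"outlook": {"sunny": {"humidity": {"high": "no", "normal": "yes"}},
                    "overcast": "yes",
                    "rainy": {"windy": {"true": "no", "false": "yes"}}}}
for conds, label in tree_to_rules(tree):
    print(" AND ".join(f"{a} = {v}" for a, v in conds), "->", label)
```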
Decision-Trees: Regression Problems
Decision-Trees: Regression Trees
• Leaf Nodes: Predict the average value of all instances in this leaf
• Splitting criterion: Minimize the variance of the target values in each subset (see the sketch after this list)
• Termination criteria: Lower bound on standard deviation in a node and lower bound on
number of examples in a node
• Pruning criterion: Numeric error measures, e.g. Mean-Squared Error
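A minimal sketch of the variance-based splitting criterion for a single numeric feature (illustrative only; candidate thresholds are the distinct feature values, and the leaf prediction is the mean target value):

```python
def variance(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_variance_split(x, y):
    """Choose the threshold on feature x that minimizes the weighted variance
    of the target values y in the two resulting subsets."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(x))[:-1]:
        left = [t for v, t in zip(x, y) if v <= threshold]
        right = [t for v, t in zip(x, y) if v > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

def leaf_prediction(y):
    """A leaf predicts the average target value of its instances."""
    return sum(y) / len(y)
```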