Week 11 - Decision Tree Learning
Methods
Lecture 11
Decision Tree Learning
Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes
test features; there is one branch for each value of the feature, and leaves
specify the category.
[Figure: two example decision trees. Both first test color, with branches for red, blue, and green; the red branch then tests shape, with branches for circle, square, and triangle. In the left tree the leaves are pos/neg class labels; in the right tree the leaves are the categories A, B, and C.]
Properties of Decision Tree Learning
• Continuous (real-valued) features can be handled by
allowing nodes to split a real-valued feature into two ranges
based on a threshold (e.g. length < 3 and length ≥ 3); a sketch of
picking such a threshold appears after this list.
• Classification trees have discrete class labels at the leaves,
regression trees allow real-valued outputs at the leaves.
• Algorithms for finding consistent trees are efficient for
processing large amounts of training data for data mining
tasks.
• Methods developed for handling noisy training data (both
class and feature noise).
• Methods developed for handling missing feature values.
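The bullet on continuous features above mentions threshold splits. Here is a minimal Python sketch of choosing such a threshold by information gain; the helper names, the midpoint candidate thresholds, and the example call are my own assumptions, not part of the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the threshold on a real-valued feature that maximizes information gain.

    Candidate thresholds are midpoints between consecutive distinct sorted values;
    each candidate t splits the data into a "< t" branch and a ">= t" branch.
    """
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        gain = (base
                - (len(left) / len(labels)) * entropy(left)
                - (len(right) / len(labels)) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# e.g. best_threshold([1.0, 2.5, 3.2, 4.8], ["neg", "neg", "pos", "pos"])
# returns (2.85, 1.0): splitting at length < 2.85 separates the classes perfectly.
```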
Problems for Decision Tree Learning
• Instances are represented by attribute-value pairs. Instances are described
by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The
easiest situation for decision tree learning is when each attribute takes on a small
number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to
the basic algorithm allow handling real-valued attributes as well (e.g., representing
Temperature numerically).
• The target function has discrete output values. A decision tree such as those
shown earlier assigns a boolean classification (e.g., yes or no) to each example. Decision tree
methods easily extend to learning functions with more than two possible output
values. A more substantial extension allows learning target functions with real-
valued outputs, though the application of decision trees in this setting is less
common.
• Disjunctive descriptions may be required. As noted above, decision trees
naturally represent disjunctive expressions.
Problems for Decision Tree Learning
CLASSIFICATION PROBLEMS
Decision Tree Induction Pseudocode
DTree(examples, features) returns a tree:
  If all examples are in one category, return a leaf node with that category label.
  Else if the set of features is empty, return a leaf node with the category label that
    is the most common in examples.
  Else pick a feature F and create a node R for it:
    For each possible value v_i of F:
      Let examples_i be the subset of examples that have value v_i for F.
      Add an out-going edge E to node R labeled with the value v_i.
      If examples_i is empty
        then attach a leaf node to edge E labeled with the category that
          is the most common in examples.
        else call DTree(examples_i, features − {F}) and attach the resulting
          tree as the subtree under edge E.
    Return the subtree rooted at R.
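A runnable Python sketch of this pseudocode follows. The (feature, {value: subtree}) tree representation, the argument names, and the choose_feature hook are my own choices, not part of the slides.

```python
from collections import Counter

def most_common_label(examples):
    """Majority class among (features_dict, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree(examples, features, values, choose_feature):
    """Sketch of the DTree pseudocode above.

    examples:       list of (features_dict, label) pairs
    features:       set of feature names still available for splitting
    values:         dict mapping each feature name to its set of possible values
    choose_feature: heuristic f(examples, features) -> feature, e.g. one based on
                    the information gain introduced later in the lecture
    A leaf is returned as a bare label; an internal node as (feature, {value: subtree}).
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # all examples are in one category
        return labels.pop()
    if not features:                          # no features left to test
        return most_common_label(examples)

    f = choose_feature(examples, features)    # pick a feature F, create node R
    branches = {}
    for v in values[f]:                       # one out-going edge per value of F
        subset = [(feats, label) for feats, label in examples if feats[f] == v]
        if not subset:                        # empty subset: majority-class leaf
            branches[v] = most_common_label(examples)
        else:
            branches[v] = dtree(subset, features - {f}, values, choose_feature)
    return (f, branches)

# A trivial feature picker (take any remaining feature) would be:
# tree = dtree(train, set(values), values, lambda ex, fs: next(iter(fs)))
```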
Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam’s razor.
• Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard
optimization problem.
• Top-down divide-and-conquer method does a greedy search for a simple
tree but does not guarantee to find the smallest.
– General lesson in ML: “Greed is good.”
• Want to pick a feature that creates subsets of examples that are relatively
“pure” in a single class so they are “closer” to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is
based on information gain, which originated with the ID3 system of
Quinlan (1979).
Which attribute is the best classifier?
Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary
classification is:
$\text{Entropy}(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$
where p1 is the fraction of positive examples in S and p0 is the fraction of
negatives.
• If all examples are in one category, entropy is zero (we define
0log(0)=0)
• If examples are equally mixed (p1=p0=0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to
encode the class of an example in S where data compression (e.g.
Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:
$\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$
where $p_i$ is the fraction of examples in category i.
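As a concrete check of this formula, a small Python helper (the same entropy sketch used for the threshold example earlier; the function name is my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Multi-class entropy of a collection of class labels, in bits.

    Counter never reports zero counts, so the 0*log(0) = 0 convention is implicit.
    """
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# entropy(["+", "+", "-", "-"]) == 1.0    (equally mixed binary case)
# entropy(["+", "+", "+"]) == 0.0         (all examples in one category)
```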
Entropy Plot for Binary Classification
Information Gain
$\mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(F)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$
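A sketch of the same computation in Python, reusing the entropy helper above; the representation of examples as (features_dict, label) pairs is an assumption for illustration, not from the slides.

```python
def information_gain(examples, feature):
    """Gain(S, F): entropy of S minus the size-weighted entropy of each subset S_v.

    `examples` is assumed to be a list of (features_dict, label) pairs, and
    `entropy` is the helper sketched above.
    """
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {feats[feature] for feats, _ in examples}:
        subset = [label for feats, label in examples if feats[feature] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain
```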
Information Gain
• Example:
– <big, red, circle>: + <small, red, circle>: +
– <small, red, square>: − <big, blue, circle>: −
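A quick check of this example using the information_gain sketch above (my own computation; values rounded):

```python
S = [({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "square"}, "-"),
     ({"size": "big",   "color": "blue", "shape": "circle"}, "-")]

# Entropy(S) = 1.0 (two positive, two negative examples)
# information_gain(S, "size")  == 0.0     (both size subsets are still 50/50)
# information_gain(S, "color") ~= 0.311   (the blue subset is pure)
# information_gain(S, "shape") ~= 0.311   (the square subset is pure)
# So color or shape, not size, would be chosen as the first split.
```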
Hypothesis Space Search
• Performs batch learning that processes all training instances
at once rather than incremental learning that updates a
hypothesis after each example.
• Performs hill-climbing (greedy search) that may only find a
locally-optimal solution. Guaranteed to find a tree consistent
with any conflict-free training set (i.e. identical feature vectors
always assigned the same class), but not necessarily the
simplest tree.
• Finds a single discrete hypothesis, so there is no way to
provide confidences or create useful queries.
Bias in Decision-Tree Induction
History of Decision-Tree Research
• In the 1960's, Hunt and colleagues used exhaustive-search decision-tree
methods (CLS) to model human concept learning.
• In the late 70's, Quinlan developed ID3 with the information gain
heuristic to learn expert systems from examples.
• Around the same time, Breiman, Friedman, and colleagues developed
CART (Classification and Regression Trees), which is similar to ID3.
• In the 1980's, a variety of improvements were introduced to
handle noise, continuous features, missing features, and
improved splitting criteria. Various expert-system
development tools resulted.
• Quinlan’s updated decision-tree package (C4.5) released in
1993.
• Weka includes Java version of C4.5 called J48.
Computational Complexity
• Worst case builds a complete tree where every path tests every
feature. Assume n examples and m features.
[Figure: worst-case complete tree, testing F1 at the root down to Fm at the deepest level, with a maximum of n examples spread across all nodes at each of the m levels.]
$\sum_{i=1}^{m} i \cdot n = O(nm^2)$
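Expanding the sum confirms the quadratic dependence on the number of features:

```latex
\sum_{i=1}^{m} i \cdot n \;=\; n\,\frac{m(m+1)}{2} \;=\; O(nm^{2})
```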
• However, the learned tree is rarely complete (the number of leaves is at most n). In
practice, complexity is linear in both the number of features (m) and the
number of training examples (n).
Overfitting
• Learning a tree that classifies the training data perfectly may not
lead to the tree with the best generalization to unseen data.
– There may be noise in the training data that the tree is erroneously
fitting.
– The algorithm may be making poor decisions towards the leaves of
the tree that are based on very little data and may not reflect reliable
trends.
• A hypothesis, h, is said to overfit the training data if there exists
another hypothesis, h′, such that h has lower error than h′ on
the training data but higher error on independent test data.
[Figure: accuracy plotted against hypothesis complexity, with one curve for accuracy on training data and one for accuracy on test data; past some complexity, training accuracy keeps rising while test accuracy falls.]
Overfitting Example
Testing Ohm's Law: V = IR (i.e. I = (1/R)V). Experimentally measure 10 points.
[Figure: plots of current (I) versus voltage (V) for the measured points. A curve that passes exactly through every noisy measurement overfits the data; the straight line implied by Ohm's law fits the training points less exactly but generalizes better.]
Overfitting Noise in Decision Trees
• Category or feature noise can easily cause overfitting.
– Add noisy instance <medium, blue, circle>: pos (but really neg)
[Figure: the tree learned from the original examples. The root tests color; the red branch tests shape (circle → pos, square → neg, triangle → pos), while the green and blue branches are neg leaves.]
Overfitting Noise in Decision Trees
• Category or feature noise can easily cause overfitting.
– Add noisy instance <medium, blue, circle>: pos (but really neg)
[Figure: the tree learned after adding the noisy instance. The red branch still tests shape (circle → pos, square → neg, triangle → pos) and the green branch is a neg leaf, but the blue branch must now separate the conflicting examples <big, blue, circle>: − and <medium, blue, circle>: +, so it grows an additional test on size (small → neg, med → pos, big → neg).]
• Noise can also cause different instances of the same feature vector
to have different classes. It is impossible to fit such data exactly, so the
corresponding leaf must be labeled with the majority class.
– <big, red, circle>: neg (but really pos)
• Conflicting examples can also arise if the features are incomplete
and inadequate to determine the class or if the target concept is
non-deterministic.
Overfitting Prevention (Pruning) Methods
• Two basic approaches for decision trees
– Prepruning: Stop growing the tree at some point during top-down
construction when there is no longer sufficient data to make
reliable decisions.
– Postpruning: Grow the full tree, then remove subtrees that do not
have sufficient evidence.
• Label leaf resulting from pruning with the majority class of
the remaining data, or a class probability distribution.
• Method for determining which subtrees to prune:
– Cross-validation: Reserve some training data as a hold-out set
(validation set, tuning set) to evaluate utility of subtrees.
– Statistical test: Use a statistical test on the training data to
determine whether an observed regularity can be dismissed as likely
due to random chance.
– Minimum description length (MDL): Keep a subtree only if its
additional complexity costs less to describe than explicitly
remembering the exceptions that pruning it would introduce.
Reduced Error Pruning
• A post-pruning, cross-validation approach.
Partition training data into "grow" and "validation" sets.
Build a complete tree from the “grow” data.
Until accuracy on validation set decreases do:
For each non-leaf node, n, in the tree do:
Temporarily prune the subtree below n and replace it with a
leaf labeled with the current majority class at that node.
Measure and record the accuracy of the pruned tree on the validation set.
Permanently prune the node that results in the greatest increase in accuracy on
the validation set.
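A runnable sketch of this procedure over the (feature, {value: subtree}) tree representation from the earlier DTree sketch; all helper names are mine, labels are assumed to be strings, and `default` is a fallback label for unseen feature values.

```python
from collections import Counter

def classify(tree, feats, default):
    """Classify with a tree where a leaf is a label string and an internal
    node is (feature, {value: subtree})."""
    while not isinstance(tree, str):
        feature, branches = tree
        tree = branches.get(feats.get(feature), default)
    return tree

def accuracy(tree, examples, default):
    return sum(classify(tree, f, default) == y for f, y in examples) / len(examples)

def internal_paths(tree, path=()):
    """Yield the branch path ((feature, value), ...) to every internal node."""
    if isinstance(tree, str):
        return
    yield path
    feature, branches = tree
    for value, child in branches.items():
        yield from internal_paths(child, path + ((feature, value),))

def majority_at(grow, path):
    """Majority class among grow-set examples that reach the node at `path`."""
    reached = [y for f, y in grow if all(f.get(feat) == val for feat, val in path)]
    return Counter(reached).most_common(1)[0][0] if reached else None

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the subtree at `path` replaced."""
    if not path:
        return new_subtree
    feature, branches = tree
    (feat, val), rest = path[0], path[1:]
    new_branches = dict(branches)
    new_branches[val] = replace_at(branches[val], rest, new_subtree)
    return (feature, new_branches)

def reduced_error_prune(tree, grow, validation, default):
    """Repeatedly replace the subtree whose pruning helps validation accuracy
    most, stopping once every candidate prune would decrease that accuracy."""
    while not isinstance(tree, str):
        current = accuracy(tree, validation, default)
        candidates = []
        for path in internal_paths(tree):
            leaf = majority_at(grow, path)
            if leaf is not None:
                pruned = replace_at(tree, path, leaf)
                candidates.append((accuracy(pruned, validation, default), pruned))
        if not candidates:
            break
        best_acc, best_tree = max(candidates, key=lambda c: c[0])
        if best_acc < current:   # any further pruning hurts validation accuracy
            break
        tree = best_tree
    return tree
```

A typical call would be `reduced_error_prune(tree, grow_set, validation_set, most_common_label(grow_set))`, using the helper from the DTree sketch.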
Issues with Reduced Error Pruning
Cross-Validating without Losing Training Data
• If the algorithm is modified to grow trees breadth-first
rather than depth-first, we can stop growing after
reaching any specified tree complexity.
• First, run several trials of reduced-error pruning using
different random splits of grow and validation sets.
• Record the complexity of the pruned tree learned in each
trial. Let C be the average pruned-tree complexity.
• Grow a final tree breadth-first from all the training data
but stop when the complexity reaches C.
• A similar cross-validation approach can be used to set
arbitrary algorithm parameters in general.
Additional Decision Tree Issues
• Better splitting criteria
– Information gain prefers features with many values.
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
– ID4
– ID5
• Mining large databases that do not fit in main memory
What is ID3?
• A mathematical algorithm for building the decision tree.
• Invented by J. Ross Quinlan in 1979.
• Uses Information Theory invented by Shannon in 1948.
• Builds the tree from the top down, with no backtracking.
• Information Gain is used to select the most useful
attribute for classification.
Information Gain (IG)
• The information gain is based on the decrease in entropy after a
dataset is split on an attribute.
• Which attribute creates the most homogeneous branches?
• First the entropy of the total dataset is calculated.
• The dataset is then split on the different attributes.
• The entropy for each branch is calculated, then added
proportionally to get the total entropy for the split.
• The resulting entropy is subtracted from the entropy before the split.
• The result is the Information Gain, or decrease in entropy.
• The attribute that yields the largest IG is chosen for the decision
node.
Information Gain (cont’d)
• A branch set with entropy of 0 is a leaf node.
• Otherwise, the branch needs further splitting to classify
its dataset.
• The ID3 algorithm is run recursively on the non-leaf
branches, until all data is classified.
Advantages of using ID3
• Understandable prediction rules are created from the
training data.
• Builds the fastest tree.
• Builds a short tree.
• Only need to test enough attributes until all data is
classified.
• Finding leaf nodes enables test data to be pruned,
reducing number of tests.
• Whole dataset is searched to create tree.
Disadvantages of using ID3
• Data may be over-fitted or over-classified if only a small
sample is tested.
• Only one attribute at a time is tested for making a
decision.
• Classifying continuous data may be computationally
expensive, as many trees must be generated to see
where to break the continuum.
Example: The Simpsons
Person   Hair Length   Weight   Age   Class
Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     10"           180      38    M
Krusty   6"            200      45    M
Comic    8"            290      38    ?
$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
(p = number of positive examples, n = number of negative examples)
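As a worked check against the table above (my own arithmetic, values rounded): the nine labeled examples contain 4 F and 5 M, and a split on Weight <= 160 sends 4 F and 1 M down one branch and 4 M down the other, so

```latex
\mathrm{Entropy}(S) = -\tfrac{4}{9}\log_2\tfrac{4}{9} - \tfrac{5}{9}\log_2\tfrac{5}{9} \approx 0.991
\mathrm{Gain}(S,\ \mathrm{Weight}\le 160)
  \approx 0.991 - \tfrac{5}{9}\left(-\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5}\right) - \tfrac{4}{9}\cdot 0
  \approx 0.991 - \tfrac{5}{9}(0.722) \approx 0.59
```

which is consistent with the tree below testing Weight <= 160 at its root.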
[Figure: the decision tree learned from this data. The root tests Weight <= 160?; the "no" branch is a Male leaf, and the "yes" branch tests Hair Length <= 2?, whose "yes" branch is Male and "no" branch is Female.]
We don't need to keep the data around, just the test conditions.
It is trivial to convert Decision Trees to rules…
SUMMARY AND FURTHER READING
• Overfitting the training data is an important issue in decision tree learning.
Because the training examples are only a sample of all possible instances, it is
possible to add branches to the tree that improve performance on the training
examples while decreasing performance on other instances outside this set.
Methods for post-pruning the decision tree are therefore important to avoid
overfitting in decision tree learning (and other inductive inference methods that
employ a preference bias).
• A large variety of extensions to the basic ID3 algorithm has been developed by
different researchers. These include methods for post-pruning trees, handling real-
valued attributes, accommodating training examples with missing attribute values,
incrementally refining decision trees as new training examples become available,
using attribute selection measures other than information gain, and considering
costs associated with instance attributes.