2. Decision Tree
Outline
• Decision tree representation
• ID3 learning algorithm
• Entropy, Information gain
• Overfitting
Representation of Concepts
• Concept learning: conjunction of attributes
• (Sunny AND Hot AND Humid AND Windy) +
• Decision trees: disjunction of conjunctions of attributes (see the small example after this list)
• (Sunny AND Normal) OR (Overcast) OR (Rain AND Weak) +
• More powerful representation
• Larger hypothesis space H
• Can be represented as a tree
• Common form of decision making in humans
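As a small illustration (a sketch not in the original slides, assuming string-valued PlayTennis attributes), the disjunction of conjunctions above can be written directly as a Python predicate:

def play_tennis(outlook, humidity, wind):
    """The disjunction of conjunctions above:
    (Sunny AND Normal) OR (Overcast) OR (Rain AND Weak)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "Normal", "Strong"))  # True: the first conjunct matches
print(play_tennis("Rain", "High", "Strong"))     # False: no conjunct matches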
Decision Trees
Decision tree for Play Tennis
• There are as many rules as there are leaf nodes in the decision tree.
(Example tree: the root tests Employed?; its two branches test Income? and Credit Score?, each with High and Low branches leading to leaf decisions.)
Examples:
• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preference
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel rectangles.
• Each rectangular region is labeled with one label, or a probability distribution over labels.
Expressiveness
Decision trees can represent any function of the input attributes
– Boolean operations (and, or, xor, etc.)
– all Boolean functions (a small XOR example follows below)
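For example (a minimal sketch, not from the slides), a depth-2 tree over two boolean attributes computes XOR:

def xor_tree(a: bool, b: bool) -> bool:
    """A depth-2 decision tree for XOR: test a at the root, then test b."""
    if a:               # a = True branch
        return not b    # leaf labels: b = True -> False, b = False -> True
    else:               # a = False branch
        return b        # leaf labels: b = True -> True, b = False -> False

for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))   # prints the XOR truth table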
Issues
• Given some training examples, what decision tree should be generated?
• One proposal: prefer the smallest tree that is consistent with the data (a bias).
• Possible method: search the space of decision trees for the smallest decision tree that fits the data.
Searching for a good tree
• The space of decision trees is too big for systematic search, so the tree is grown greedily (a sketch follows this list):
• Stop and
• return a value for the target feature, or
• a distribution over target feature values;
• or choose a test (e.g. an input feature) to split on, and
• for each value of the test, build a subtree for those examples with this value for the test.
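A minimal sketch of this greedy, recursive procedure (the names and the dict-based tree encoding are illustrative assumptions, not the slides' notation; choose_test is whatever split-selection heuristic is plugged in, e.g. information gain):

from collections import Counter

def build_tree(examples, labels, attributes, choose_test):
    """Greedy top-down construction: stop, or choose one test and recurse."""
    # Stop: return a value for the target feature (here, the majority label).
    if not attributes or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    # Choose a test (an input feature) to split on.
    attr = choose_test(examples, labels, attributes)
    tree = {attr: {}}
    # For each value of the test, build a subtree for the examples with that value.
    for value in set(ex[attr] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[attr] == value]
        tree[attr][value] = build_tree([examples[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != attr],
                                       choose_test)
    return tree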
Top-Down Induction of Decision Trees ID3
1. Which attribute should be tested at the next node?
1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
2. When to stop?
The Basic ID3 Algorithm
Choices
• When to stop
• no more input features
• all examples are classified the same
• too few examples to make an informative split
• Entropy(S) = - Σ_i p_i log2 p_i, where p_i is the proportion of examples in S that belong to class i (and 0 log2 0 is taken to be 0).
• Gain(S, A) = Entropy(S) - Σ_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)
• The first term is the entropy of the original collection S; the second term is the expected value of the entropy after S is partitioned using attribute A (S_v is the subset of S for which attribute A has value v).
• ID3 uses information gain to select the best attribute at each step in growing the tree.
Information Gain
Gain(S,A): expected reduction in entropy due to partitioning S
on attribute A
Gain(S, Humidity)
= 0.940 - (7/14)*0.985 - (7/14)*0.592
= 0.151
Gain(S, Wind)
= 0.940 - (8/14)*0.811 - (6/14)*1.0
= 0.048
Humidity provides greater info. gain than Wind, w.r.t target classification.
Selecting the Next Attribute
For the full training set, S = [9+, 5-] and Entropy(S) = 0.940.
Gain(S, Outlook)
= 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971
= 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029
Note: 0 log2 0 = 0
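The entropy and gain values above can be reproduced with a short Python sketch (not part of the slides); the class-count vectors are read off the Gain computations shown earlier:

import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts.
    Empty classes are skipped, i.e. 0 * log2(0) is taken to be 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        (sum(cc) / n) * entropy(cc) for cc in child_counts)

print(round(entropy([9, 5]), 3))                         # 0.940  (S = [9+, 5-])
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # 0.152  Humidity (0.151 in the slides, which round intermediate values)
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))          # 0.048  Wind
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247  Outlook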
ID3 Algorithm
(Partially learned tree: Outlook is chosen as the test for the root node over the examples [D1, D2, …, D14] = [9+, 5-]; one branch is already a Yes leaf, while the two branches marked “?” still need a test to be chosen for them.)
GINI index (an alternative attribute selection measure):
• GINI(N) = 1 - Σ_i p(i|N)^2, summed over the c classes at node N
• GINI_split(A) = Σ_{v in Values(A)} (|S_v| / |S|) GINI(N_v), where N_v is the child node receiving the subset S_v of S
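A small sketch of these two formulas (not from the slides), evaluated on the Humidity split of the PlayTennis set used above:

def gini(counts):
    """GINI(N) = 1 - sum_i p(i|N)^2 for a node N with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(child_counts):
    """GINI_split(A) = sum_v (|S_v| / |S|) * GINI(N_v)."""
    n = sum(sum(cc) for cc in child_counts)
    return sum((sum(cc) / n) * gini(cc) for cc in child_counts)

# Splitting S = [9+, 5-] on Humidity (High: [3+, 4-], Normal: [6+, 1-]):
print(round(gini_split([[3, 4], [6, 1]]), 3))   # 0.367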
Splitting Based on Continuous Attributes
Continuous Attribute – Binary Split
• For a continuous attribute:
• partition the continuous values of attribute A into a discrete set of intervals (e.g. a threshold of 82.5 for Temperature), or
• create a new boolean attribute A_c by choosing a threshold c:
A_c = true if A ≥ c, false otherwise.
• How to choose c? Consider all possible splits and find the best cut (see the sketch after the example below):
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
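A sketch of the threshold search for this example (not from the slides): candidate cuts are placed midway between adjacent sorted values, and the cut with the highest information gain is kept.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold c, information gain) for the best boolean split A >= c."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best = (None, -1.0)
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue
        c = (x1 + x2) / 2                       # candidate cut between adjacent values
        left = [y for x, y in pairs if x < c]
        right = [y for x, y in pairs if x >= c]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best[1]:
            best = (c, g)
    return best

temperature = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play))   # (54.0, ~0.459): the cut between 48 (No) and 60 (Yes)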
Hypothesis Space Search in Decision Trees
The ID3 algorithm
• searches the hypothesis space of all possible decision trees
• performs a simple-to-complex, hill-climbing search
• starts with the empty tree
• is guided by information gain in its hill-climbing search
Hypothesis Space Search in Decision Trees
ID3 vs. Candidate-Elimination
• ID3 maintains only a single current hypothesis as it searches; the Candidate-Elimination method maintains the set of all hypotheses consistent with the available training examples.
• Consequently, ID3 does not know how many alternative decision trees are consistent with the available training data.
• ID3 performs no backtracking in its search.
• ID3 uses all training examples at each step in the search to make statistically based decisions; FIND-S and CANDIDATE-ELIMINATION make decisions incrementally, based on individual training examples.
• ID3's goal is to find the best decision tree.
Hypothesis Space Search in Decision Trees
• ID3 searches a complete hypothesis space but does so incompletely: once it finds a good hypothesis it stops (it cannot find the others).
• Candidate-Elimination searches an incomplete hypothesis space (it can represent only some hypotheses) but does so completely.
• A preference bias is an inductive bias where some hypotheses are preferred over others.
• A restriction bias is an inductive bias where the set of hypotheses considered is restricted to a smaller set.
INDUCTIVE BIAS IN DECISION TREE LEARNING
• Inductive bias is the set of assumptions that, together with the training data, justifies the classifications assigned by the learner to future instances.
• Which of these decision trees does ID3 choose?
• It chooses the first acceptable tree it encounters in its simple-to-complex, hill
climbing search through the space of possible trees
• the ID3 search strategy
• selects in favor of shorter trees over longer ones
• selects trees that place the attributes with highest information gain closest to the
root
Approximate inductive bias of ID3: Shorter trees are preferred over larger trees.
• Consider breadth-first search algorithm BFS-ID3 which finds a shortest decision tree and
thus exhibits precisely the bias "shorter trees are preferred over longer trees."
• BFS-ID3 conducts the entire breadth-first search through the hypothesis space
• ID3 can be viewed as an efficient approximation to BFS-ID3
• Because ID3 uses the information gain heuristic and a hill climbing strategy, it exhibits a
more complex bias than BFS-ID3
• it does not always find the shortest consistent tree
• it is biased to favor trees that place attributes with high information gain closest to the root.
A closer approximation to the inductive bias of ID3: Shorter trees are preferred over
longer trees. Trees that place high information gain attributes close to the root are preferred
over those that do not.
Restriction Biases and Preference Biases
• The inductive bias of ID3 is thus a preference for certain hypotheses over others
(e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can
be eventually enumerated. This form of bias is typically called a preference bias (or,
alternatively, a search bias).
• The bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical
restriction on the set of hypotheses considered. This form of bias is typically called
a restriction bias (or, alternatively, a language bias).
Why prefer shorter hypotheses (Occam's razor)? Argument in favor:
• Fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence
Issues in Decision Tree Learning
• Determine how deeply to grow the decision tree, underfitting and overfitting
• Handling continuous attributes
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Improving computational efficiency
Overfitting
An algorithm can produce trees that overfit the training examples in the following two
cases
• There is noise in the data
• The number of training examples is too small to produce a representative sample of
the true target function
Overfitting
• Learning a tree that classifies the training data perfectly may not
lead to the tree with the best generalization performance.
• There may be noise in the training data
• May be based on insufficient data
• A hypothesis h is said to overfit the training data if there is another
hypothesis, h’, such that h has smaller error than h’ on the training
data but h has larger error on the test data than h’.
(Plot: accuracy vs. complexity of the tree; accuracy on the training data keeps increasing, while accuracy on test data eventually decreases.)
Underfitting and Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise or Insufficient Examples
• A lack of data points in a region of the feature space makes it difficult to predict correctly the class labels of that region.
Notes on Overfitting
• Overfitting results in decision trees that are more complex than
necessary
• Training error no longer provides a good estimate of how well the
tree will perform on previously unseen records
• Overfitting happens when a model is capturing idiosyncrasies of the
data rather than generalities.
• Often caused by too many parameters relative to the amount of
training data.
• E.g. an order-N polynomial can pass through any N+1 data points (with distinct x values), as illustrated below.
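A quick check of the polynomial remark (a sketch using NumPy; the data here are arbitrary random numbers):

import numpy as np

rng = np.random.default_rng(0)
N = 5
x = np.arange(N + 1, dtype=float)           # N+1 points with distinct x values
y = rng.normal(size=N + 1)                  # arbitrary targets (pure "noise")
coef = np.polyfit(x, y, deg=N)              # an order-N polynomial has N+1 coefficients
print(np.allclose(np.polyval(coef, x), y))  # True: it passes through all N+1 points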
Avoid Overfitting
• There are several approaches to avoiding overfitting in decision
tree learning. These can be grouped into two classes:
• Prepruning: Stop growing when data split not statistically
significant.
• Postpruning: Grow full tree then remove nodes
Pre-Pruning (Early Stopping)
• The difficulty with this approach is estimating precisely when to stop growing the tree.
• Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
• More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold
• Stop if the class distribution of the instances is independent of the available features (e.g., using a chi-square test; see the sketch after this list)
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
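A sketch of the chi-square stopping test mentioned in the list above (assumes SciPy; the contingency tables of class counts per feature value are made-up numbers):

from scipy.stats import chi2_contingency

def looks_independent(contingency, alpha=0.05):
    """True if we cannot reject independence between feature values (rows) and
    class labels (columns) at level alpha, i.e. the split is not statistically
    significant and growth at this node can stop."""
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    return p_value > alpha

print(looks_independent([[8, 7], [9, 6]]))    # True: no real association -> stop
print(looks_independent([[14, 1], [2, 13]]))  # False: the split is informative -> keep growing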
Reduced-Error Pruning
• Split the data into a training set and a validation set; the validation set is used to prevent overfitting.
• Grow the tree on the training set, then consider each internal node for pruning: replace the subtree rooted at it by a leaf (labeled with the most common class of the training examples at that node) whenever the pruned tree performs no worse on the validation set (a sketch follows).
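A minimal sketch of reduced-error pruning over the dict-based trees from the earlier build_tree sketch (the helper names and the use of a single global majority label are illustrative simplifications, not the slides' exact method):

def predict(tree, example, default):
    """Walk a dict tree: internal node {attribute: {value: subtree}}, leaf = label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example.get(attr), default)
    return tree

def accuracy(tree, examples, labels, default):
    return sum(predict(tree, ex, default) == y
               for ex, y in zip(examples, labels)) / len(labels)

def reduced_error_prune(full_tree, node, val_ex, val_y, default):
    """Bottom-up: turn an internal node into a leaf predicting `default` (here a
    single global majority label, a simplification) whenever the whole tree's
    accuracy on the validation set does not decrease."""
    if not isinstance(node, dict):
        return node
    attr = next(iter(node))
    for value, child in node[attr].items():
        node[attr][value] = reduced_error_prune(full_tree, child, val_ex, val_y, default)
    before = accuracy(full_tree, val_ex, val_y, default)
    backup = dict(node[attr])
    node[attr] = {v: default for v in backup}   # tentatively prune: every branch predicts default
    if accuracy(full_tree, val_ex, val_y, default) >= before:
        return default                          # keep the pruned node as a leaf
    node[attr] = backup                         # otherwise restore the subtree
    return node

# Usage sketch: majority = most common label in the training data, then
#   tree = reduced_error_prune(tree, tree, val_examples, val_labels, majority)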
Triple Trade-Off
• There is a trade-off between three factors:
• the complexity (capacity) of the hypothesis class H, c(H),
• the training set size, N, and
• the generalization error, E, on new data.
• As N increases, E decreases.
• As c(H) increases, E first decreases and then increases (overfitting).
• As c(H) increases, the training error decreases for some time and then stays constant (frequently at 0).
References:
1. Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997.
2. Introduction to Machine Learning (IIT Kharagpur), Prof. Sudeshna Sarkar.