
Decision Trees

Mehul Motani
Electrical & Computer Engineering
National University of Singapore
Email: [email protected]
Office: E4-05-18
Tel: 6516 6918

© Mehul Motani Decision Trees 1

Motivation for Learning


• Modern systems are complex and may have many
parameters.
• It is impractical and often impossible to encode all the
knowledge a system needs.
• Different types of data may require very different
parameters.
• Instead of trying to hard code all the knowledge, it
makes sense to learn what we need from the data
itself.
• Three broad approaches to learning
– Supervised learning
– Unsupervised learning
– Reinforcement learning
© Mehul Motani Decision Trees 2
Learning from Observations
• Supervised Learning – learn a function from a set of training
examples, which are preclassified feature vectors.

Given a previously unseen feature vector, what is the rule that tells us
whether it is in class 1 or class 2?

Feature vector       Class
(square, red)        1
(square, blue)       1
(circle, red)        2
(circle, blue)       2
(triangle, red)      1
(triangle, green)    1
(circle, green)      ?
(ellipse, blue)      2
(triangle, blue)     ?
(ellipse, red)       2

© Mehul Motani Decision Trees 3

Learning from Observations


• Unsupervised Learning
– No classes are given. The idea is to find patterns in the data.
This generally involves clustering.
• Reinforcement Learning
– Learn from feedback after a decision is made.

© Mehul Motani Decision Trees 4


Decision Trees
• The theory behind decision trees is well-understood.
• Decision trees have the nice property that you can
easily understand the decision rule that was learned.
• Easy to explain and interpret (rule-based)
• There exist fast deterministic algorithms for
computing decision trees.
• Can handle discrete and continuous parameters
• Cautionary notes
– Complexity can grow large.
– Prone to overfitting

© Mehul Motani Decision Trees 5

Decision Trees
• Use training data to build the decision tree.
• Use a decision tree to predict categories for
new events.
[Diagram: Training Events and Categories → Decision Tree; New Events → Decision Tree → Category]

© Mehul Motani Decision Trees 6


Example – Should I play tennis?
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

© Mehul Motani Decision Trees 7

Tennis Example – Decision Tree

© Mehul Motani Decision Trees 8



Decision Tree for Tennis Example

• This is the full decision tree for the Tennis example.


• Compare this to the dataset and notice that the
decision tree captures the dataset precisely.
• Note that the Temperature feature is not relevant.
© Mehul Motani Decision Trees 9

Continuous vs. Discrete Feature Spaces

[Figure: an example continuous feature space and an example discrete feature space (e.g., Strong/Weak)]

• Decision trees can handle both continuous and discrete feature
spaces.

© Mehul Motani Decision Trees 10


Decision Tree Decision Boundaries

© Mehul Motani Decision Trees 11


Example - Word Sense Disambiguation


• Given an occurrence of a word, decide which sense, or
meaning, was intended.
• Example: "run"
– run1: move swiftly (I ran to the store.)
– run2: operate (I run a store.)
– run3: flow (Water runs from the spring.)
– run4: length of torn stitches (Her stockings had a run.)
pos    near(race)   near(river)   near(stockings)   Word sense
noun   no           no            no                run4
verb   no           no            no                run1
verb   no           yes           no                run3
noun   yes          yes           yes               run4
verb   no           no            yes               run1
verb   yes          yes           no                run2
verb   no           yes           yes               run3
© Mehul Motani Decision Trees 12
Example - Word Sense Disambiguation
• Categories
– Use word sense labels (run1, run2, etc.) to name the
possible categories.
• Features
– Features describe the context of the word we want to
disambiguate.
– Possible features include:
• near(w): is the given word near an occurrence of word w?
• pos: the word’s part of speech
• left(w): is the word immediately preceded by the word w?
• etc.

© Mehul Motani Decision Trees 13

Example - Word Sense Disambiguation


• Example decision tree:
pos?
  noun → near(stockings)?
           yes → run4
           no  → run1
  verb → near(race)?
           yes → run2
           no  → near(river)?
                   yes → run3
                   no  → run1

(Note: Decision trees for WSD tend to be quite large)

© Mehul Motani Decision Trees 14



Learning Algorithm for Decision Trees


• The decision tree encodes the optimal sequence of
questions to ask to make the classification decision.
• Finding the optimal decision tree is NP-hard.
• The number of decision trees is huge!
• With 6 binary attributes, there are
18,446,744,073,709,551,616 possible trees!
• The decision tree algorithms used in practice are fast and greedy,
and therefore suboptimal.
• In practice, these greedy algorithms perform competitively.

© Mehul Motani Decision Trees 15

Learning Algorithm for Decision Trees

How do we
choose the best
attribute?

© Mehul Motani Decision Trees 16


Example – Should I play tennis?
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

© Mehul Motani Decision Trees 17


Which attribute to select?


Which is the best attribute?
• The one which will result in the smallest tree
• Heuristic: choose the attribute that produces the “purest” nodes
Need a good measure of purity!

© Mehul Motani Decision Trees 18


Example – Which split is more informative?
Which attribute results in a more impure split?

(a) Split over whether balance exceeds 25K (branches: Less than 25K / Over 25K)
(b) Split over whether applicant is employed (branches: Unemployed / Employed)

© Mehul Motani Decision Trees 19

Impurity is Uncertainty
• The key idea is to think of Impurity as Uncertainty
• We will use the counts at the leaves to define
probability distributions and use them to measure
uncertainty
[Figure: three example groups, from a very impure group (high uncertainty) to a less impure group (less uncertainty) to a group with no impurity (no uncertainty)]

© Mehul Motani Decision Trees 20



Entropy: a common way to measure uncertainty


• Entropy: $H(X) = -\sum_i p_i \log_2 p_i$   (1)
• pi is the probability of class i, i.e., the fraction of the set belonging to class i
• Entropy comes from information theory
– Cover and Thomas, Elements of Information Theory, Wiley & Sons, 2012.
• Entropy = Uncertainty = Impurity
– Higher entropy → more uncertainty, more impure
– Lower entropy → less uncertainty, less impure
[Figure: the binary entropy function]
What does this mean for learning from data?
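The following short Python sketch (ours, not from the slides) shows how this entropy formula behaves on a few class distributions; the function name and the example counts are illustrative.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # empty classes contribute 0 * log(0) = 0
    return sum(-p * math.log2(p) for p in probs)

print(entropy([8, 0]))  # all examples in one class -> 0.0 (no impurity)
print(entropy([4, 4]))  # 50/50 split -> 1.0 (maximum impurity for two classes)
print(entropy([6, 2]))  # mixed group -> about 0.811
```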
© Mehul Motani Decision Trees 21


Example: Two Classes


• What is the entropy of a group in which all examples belong to the
same class? (Minimum impurity)
– $H(X) = -1 \cdot \log_2 1 = 0$
– Not a good training set for learning
• What is the entropy of a group with 50% in either class?
(Maximum impurity)
– $H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
– Good training set for learning

© Mehul Motani Decision Trees 22


Information Gain
• We want to determine which attribute in the training set is
most useful for discriminating between the classes to be
learned.
• Information gain tells us how much information a given attribute
carries about the class label.
– Information Gain: I(X;Y) = H(Y) – H(Y|X), where H(Y|X) is the
conditional entropy of the label Y given the attribute X.
– Information Gain is the decrease in entropy after splitting
(a small worked example follows this list).
• We will use information gain to decide the ordering of
attributes in the nodes of the decision tree.
– The higher the information gain, the more information that
attribute contains about the label
– Attributes with higher information gain are selected before
attributes with lower information gain.
• Information gain is also known as the mutual information
between the features and the label.
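To make this concrete, here is a small Python sketch (ours, not part of the slides) that computes I(Tennis; Outlook) = H(Tennis) − H(Tennis | Outlook) for the tennis dataset on slide 7; the variable names are illustrative.

```python
import math
from collections import Counter

# (Outlook, PlayTennis) pairs from the tennis dataset on slide 7
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in data]
h_y = entropy(labels)  # H(Tennis) is about 0.940

# Conditional entropy H(Tennis | Outlook): entropy of each branch, weighted by its size
h_y_given_x = 0.0
for value in set(x for x, _ in data):
    branch = [y for x, y in data if x == value]
    h_y_given_x += (len(branch) / len(data)) * entropy(branch)

print(h_y - h_y_given_x)  # information gain of Outlook, about 0.247
```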
© Mehul Motani Decision Trees 23

Entropy and Mutual Information


[Figure: Venn diagram of two overlapping circles with regions A, B, and C]
• The area contained by both circles (A+B+C) is the joint entropy H(X,Y).
• The circle on the left (A+C) is the individual entropy H(X).
• The part labeled A is the conditional entropy H(X|Y).
• The circle on the right (B+C) is H(Y).
• The part labeled B is the conditional entropy H(Y|X).
• The part labeled C is the mutual information I(X;Y).

© Mehul Motani Decision Trees 24


Calculating Information Gain – Method 1
Information Gain = Entropy(parent) – [Average Entropy(children)]

Entire population (30 instances): Pr(green circle) = 16/30, Pr(red plus) = 14/30.
The split produces two children with 17 and 13 instances.

1. Parent entropy:
   $-\tfrac{14}{30}\log_2\tfrac{14}{30} - \tfrac{16}{30}\log_2\tfrac{16}{30} = 0.996$

2. Child entropies:
   Child 1 (17 instances): $-\tfrac{13}{17}\log_2\tfrac{13}{17} - \tfrac{4}{17}\log_2\tfrac{4}{17} = 0.787$
   Child 2 (13 instances): $-\tfrac{1}{13}\log_2\tfrac{1}{13} - \tfrac{12}{13}\log_2\tfrac{12}{13} = 0.391$

3. Average entropy of children:
   $\tfrac{17}{30}\times 0.787 + \tfrac{13}{30}\times 0.391 = 0.615$

4. Information Gain = 0.996 – 0.615 = 0.38
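A quick Python check of this calculation (ours, not part of the slide); it reuses the entropy-of-counts idea from the earlier sketch.

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

parent  = [16, 14]  # 16 green circles, 14 red plusses
child_1 = [13, 4]   # first branch, 17 instances
child_2 = [1, 12]   # second branch, 13 instances

n = sum(parent)
avg_children = (sum(child_1) / n) * entropy(child_1) + (sum(child_2) / n) * entropy(child_2)
gain = entropy(parent) - avg_children
# Prints roughly 0.997, 0.616, 0.381 (the slide rounds to 0.996, 0.615, 0.38)
print(round(entropy(parent), 3), round(avg_children, 3), round(gain, 3))
```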
© Mehul Motani Decision Trees 25

Calculating Information Gain – Method 2


• The information gain I(C;F) of the class variable C with
possible values {c1, c2, … cm} with respect to the feature
variable F with possible values {f1, f2, … , fd} is defined by:

$I(C;F) = \sum_{i=1}^{m}\sum_{j=1}^{d} P(C=c_i, F=f_j)\,\log_2\frac{P(C=c_i, F=f_j)}{P(C=c_i)\,P(F=f_j)}$   (1)

• P(C = ci) is the probability of class C having value ci.


• P(F=fj) is the probability of feature F having value fj.
• P(C=ci,F=fj) is the joint probability of class C = ci
and variable F = fj.
These are estimated from frequencies in the training data.

© Mehul Motani Decision Trees 26


Calculating Information Gain – Method 2

X   Y   Z   C
1   1   1   I
1   1   0   I
0   0   1   II
1   0   0   II

How would you distinguish class I from class II?

[The slide works out the information gain of X, Y, and Z using equation (1).]
Which attribute is best? Which is worst? Does it make sense?
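A minimal Python sketch (ours) that applies the equation-(1) formula to this toy table and prints the information gain of each attribute; it shows that Y separates the classes perfectly while Z is useless.

```python
import math
from collections import Counter

# Toy dataset from the slide: attributes X, Y, Z and class C
rows = [
    {"X": 1, "Y": 1, "Z": 1, "C": "I"},
    {"X": 1, "Y": 1, "Z": 0, "C": "I"},
    {"X": 0, "Y": 0, "Z": 1, "C": "II"},
    {"X": 1, "Y": 0, "Z": 0, "C": "II"},
]

def info_gain(rows, feature, target="C"):
    """Mutual information I(C;F) from empirical joint and marginal frequencies."""
    n = len(rows)
    p_c = Counter(r[target] for r in rows)
    p_f = Counter(r[feature] for r in rows)
    p_cf = Counter((r[target], r[feature]) for r in rows)
    gain = 0.0
    for (c, f), count in p_cf.items():
        p_joint = count / n
        gain += p_joint * math.log2(p_joint / ((p_c[c] / n) * (p_f[f] / n)))
    return gain

for feature in ["X", "Y", "Z"]:
    print(feature, round(info_gain(rows, feature), 3))
# X 0.311, Y 1.0, Z 0.0  ->  Y is the best attribute, Z is the worst
```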

© Mehul Motani Decision Trees 27

Using Information Gain to Construct a Decision Tree


1. Start with the root of the decision tree and the whole training set S.
2. Choose the attribute A with the highest information gain for the full
training set at the root of the tree.
3. Construct a child node for each value v1, v2, …, vk of A. Each child has
an associated subset of vectors in which A has that value, e.g.
S1 = {s ∈ S | value(A) = v1}.
4. Repeat recursively (a code sketch follows the reference below).
• Quinlan suggested Information Gain in his ID3 system and later the Gain
Ratio, both based on Entropy.
• Information Gain has the disadvantage that it prefers attributes with a large
number of values, which split the data into small, pure subsets.
• Quinlan suggested the Gain Ratio to improve this by normalization.
• Reference: Quinlan, J. R. (1986). Induction of decision trees. Machine
Learning, 1(1):81-106
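The following Python sketch (ours; simplified, with illustrative function names and no handling of missing values or ties) outlines this recursive, greedy construction for discrete attributes.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    children = {}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, [a for a in attrs if a != best], target)
    return (best, children)
```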
© Mehul Motani Decision Trees 28
Common Measures of impurity
• Entropy – measures uncertainty
  $H = -\sum_i p_i \log_2 p_i$   (1)
• Gini Index – minimizes the probability of misclassification
  $G = 1 - \sum_i p_i^2$   (2)
• Classification Error
  $E = 1 - \max_i p_i$   (3)
© Mehul Motani Decision Trees 29


Dealing with non-binary features in Decision Trees

• Features with multiple discrete values


• Construct a multiway split
• Test for one value versus all of the others
• Group the values into two disjoint subsets
• Real-valued features
• Consider a threshold split at each observed value of the feature
(a code sketch follows this list)
• Whichever method is used, the information gain can be
computed to choose the best split.
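A small Python sketch (ours; function name and example data are illustrative) of the threshold-split idea for a real-valued feature: consider a split between each pair of adjacent observed values and keep the one with the highest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best split of the form value <= t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between adjacent observed values
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        remainder = (len(left) / len(pairs)) * entropy(left) + (len(right) / len(pairs)) * entropy(right)
        if base - remainder > best[1]:
            best = (t, base - remainder)
    return best

print(best_threshold([64, 65, 68, 69, 70, 71, 72, 75],
                     ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes"]))
```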

© Mehul Motani Decision Trees 30


Overfitting in Decision Trees
Look at the performance of the decision tree on the training data and
the test data versus the size of the tree: on the training data it looks
great, but not for the test data.

Consider adding a noisy training sample:
  Sunny, Hot, Normal, Strong, PlayTennis = No
What is the effect on the decision tree?
© Mehul Motani Decision Trees 31

Avoid Overfitting
• Occam’s Razor
– "If two theories explain the facts equally well, then
the simpler theory is to be preferred."
– Fewer short hypotheses than long hypotheses
– A short hypothesis that fits the data is unlikely to be
a coincidence
– A long hypothesis that fits the data might be a
coincidence
• Stop growing when split not statistically significant
• Grow full tree, then post-prune
– Prune tree to reduce errors or improve accuracy
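As one concrete way to grow a full tree and then post-prune it, here is a hedged scikit-learn sketch using cost-complexity pruning (a related but not identical technique to the reduced-error pruning discussed next); the dataset and parameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree, then get candidate pruning strengths from its pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that does best on held-out data
# (in practice you would select alpha on a validation set or by cross-validation).
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```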

© Mehul Motani Decision Trees 32


Effects of Reduced-error Pruning
On the training data the tree looks great, but not on the test data.
The tree is pruned back to the point (the red line in the figure) where
it gives more accurate results on the test data.
© Mehul Motani Decision Trees 33


Ensemble Learning
• An ensemble method is a technique that combines the
predictions from multiple machine learning algorithms together
to make more accurate predictions than any individual model.
• Popular ensemble methods are Bagging, Boosting, and Stacking.
• Ensembling can reduce overfitting without decreasing
performance.

© Mehul Motani Decision Trees 34



Benefits of Ensemble Learning


[Figure: relationship between ozone and temperature; data from
Rousseeuw and Leroy (1986), shown as blue markers.]

• Ensemble average of 100 regression models, each trained on a subset
of the original dataset.
• Individual predictors (gray lines) wiggle a lot and are clearly
overfitting.
• The averaged ensemble predictor (red line) is more stable and
overfits less.
© Mehul Motani Decision Trees 35


Bagging – Bootstrap Aggregating


Bagging is a machine-learning ensemble meta-algorithm.

"Bagging predictors is a method for generating multiple versions of a
predictor and using these to get an aggregated predictor." – Leo Breiman

Bagging improves accuracy, reduces variance, and helps to prevent
overfitting.

• Bootstrapping: random sampling with replacement.
• Train multiple decision trees, searching all features to find the best
feature to split on in each tree.
• Aggregating: combine the multiple predictions via averaging or
majority voting.
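A minimal bagging sketch in Python (ours; dataset, number of trees, and variable names are illustrative): bootstrap samples of the training set, one decision tree per sample, and a majority vote at prediction time.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    # Bootstrapping: sample the training set with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregating: majority vote over the individual trees' predictions
votes = np.array([tree.predict(X_test) for tree in trees])
y_pred = np.array([Counter(column).most_common(1)[0][0] for column in votes.T])
print("bagged accuracy:", (y_pred == y_test).mean())
```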
© Mehul Motani Decision Trees 36
Random Forest

• A Random Forest is an ensemble of decision trees, trained on
different subsets of the data, whose results are aggregated into one
final result.
• Random Forests reduce variance and limit overfitting without
substantially increasing the error due to bias.
• Question: what is the difference between Bagging and Random Forests?
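For comparison with the bagging sketch above, a minimal scikit-learn Random Forest example (ours; dataset and parameters are illustrative). The key difference from plain bagging is that each split considers only a random subset of the features (max_features), which further decorrelates the trees.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Like bagging, each tree is trained on a bootstrap sample; unlike plain bagging,
# each split searches only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```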
© Mehul Motani Decision Trees 37


Performance of Learning Algorithms


• Sources of error
– High Bias – Learning algorithm is not adequate and is
not able to fit the training data (underfitting)
– High Variance – Algorithm is sensitive to small
fluctuations in the training data (overfitting)
– Irreducible Error – Due to inherent noise in the data
• Bias-Variance Tradeoff – Algorithms with a
lower bias have higher variance and vice versa.
• We want a learning algorithm that:
– captures the regularities in its training data, but
also generalizes well to unseen data.
– has low bias and low variance
© Mehul Motani Decision Trees 38

Bias-Variance Tradeoff
[Figure: Precision vs. Accuracy. Columns: Low Variance (High Precision) vs. High Variance (Low Precision). Rows: Low Bias (High Accuracy) vs. High Bias (Low Accuracy).]
© Mehul Motani Decision Trees 39


The Bias squared-Variance Curve

• A curve of squared bias versus variance, showing the inverse correlation
that is typical of the relationship between the two as the model gets more
complex.
• It is not uncommon for the resulting Total Error to follow some variant of
the U-shape shown in the figure above.
© Mehul Motani Decision Trees 40
