Machine Learning
(part 4: DT and RF)
Elisa Fromont
Option “Machine Learning” M2 IL(a)
1
Decision trees: outline
(adapted from Tom Mitchell’s “Machine Learning” book and
Hendrik Blockeel’s lecture, KULeuven, Belgium)
2
What are decision trees?
• Cf. guessing a person using only binary
(yes/no) questions:
– ask some question
– depending on answer, ask a new question
– continue until answer known
• A decision tree
– Tells you which question to ask, depending on
outcome of previous questions
– Gives you the answer (= prediction) in the end
• Usually not used for guessing an individual, but
for predicting some property (e.g., classification)
3
Example decision tree 1
• Mitchell’s example: Play tennis or not?
(depending on weather conditions)
Outlook  Temp.  Hum.  Wind   Play?
Sunny    85     85    False  no
Sunny    80     90    True   no
Overcast 83     86    False  yes
...      ...    ...   ...    ...
[tree figure: root node tests Outlook, with branches Sunny / Overcast / Rainy leading to No / Yes leaves]
4
Example decision tree 2
• tree for predicting whether C-section
necessary
• Leaves are not pure here; the ratio pos/neg is
given
[tree figure: tests on Fetal_Presentation (outcomes 1, 2, 3), Previous_Csection (0, 1) and Primiparous; impure leaves such as [3+, 29-] = .11+ / .89-, [8+, 22-] = .27+ / .73-, [55+, 35-] = .61+ / .39-]
5
Representation power
(e.g. propositional logic)
• Typically:
– examples represented by an array of attributes
– 1 node in the tree tests the value of 1 attribute
– 1 child node for each possible outcome of the test
– leaf nodes assign a classification
• Note:
– a tree can represent any Boolean function
• i.e., also disjunctive concepts (e.g. A ∨ B)
– a tree can allow noise (non-pure leaves)
[tree figure for A ∨ B: test A; if true, predict true; if false, test B (true → true, false → false)]
6
Classification, Regression and
Clustering trees
• Classification trees represent function X -> C with
C discrete (like the decision trees we just saw)
– Hence, can be used for classification
• Regression trees predict numbers in leaves
– could use a constant (e.g., mean), or linear regression
model, or …
• Clustering trees just group examples in leaves
• Most (but not all) decision tree research in
machine learning focuses on classification trees
7
Inducing Decision Trees...
• In general, we like decision trees that give us a
result after as few questions as possible
• We can construct such a tree manually, or we
can try to obtain it automatically (inductively)
from a set of data
[table of weather examples: Outlook, Temp., Hum., Wind, Play? — as on slide 4]
[induced tree: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rainy → Wind (Strong: No, Weak: Yes)]
8
Top-Down Induction of
Decision Trees
• Basic algorithm for TDIDT: (later more formal version)
– start with full data set
– find the test that partitions the examples as well as possible
• “good” = examples with same class, or otherwise similar
examples, should be put together
– for each outcome of test, create child node
– move examples to children according to outcome of test
– repeat procedure for each child that is not “pure”
• Main questions:
– how to decide which test is “best”
– when to stop the procedure
9
Example problem
10
Data set: 8 classified instances
11
Obs 1: Shape is important
[figure: the 8 instances partitioned by SHAPE]
12
Obs 2: for some glasses,
colour is important
[figure: within one SHAPE group, the instances further partitioned by COLOUR]
13
The decision tree
[tree: root tests SHAPE; for the glasses, a further test on COLOUR with branches non-orange / orange]
14
A DT creates a decision surface
[figure: 2-D decision surface over attributes A1 and A2; axis-parallel boundaries separate a cluster of + examples from the surrounding - examples]
15
Exercise 1: (Decision Surface)
Consider a dataset with two
numerical attributes a1 and a2
and one nominal target
attribute c with two possible
values: + and –
16
Finding the best test
(for classification trees)
• For classification trees: find test for which
children are as “pure” as possible
• Purity measure borrowed from information
theory: entropy
– is a measure of “missing information”; more
precisely, number of bits needed to represent the
missing information, on average, using optimal
encoding
• Given a set S with instances belonging to class i
with probability pi: Entropy(S) = − Σi pi log2(pi)
17
Entropy
• For 2 classes, if x = the proportion of instances of class 1 in a given node e:
– Entropy(e) = − x log2(x) − (1−x) log2(1−x)
(note that if x = 0, x log2(x) is undefined, but its limit as x → 0 is 0, so it is
taken to be 0)
• Entropy as a function of p (the proportion of examples from class 1), for 2
classes:
[plot: entropy vs. p; 0 at p = 0 and p = 1, maximum of 1 at p = 0.5]
20
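To make the formula concrete, here is a minimal Python sketch (not part of the original slides) of the two-class entropy, treating x log2(x) as 0 when x = 0 as noted above; entropy2 is an illustrative name.

import math

def entropy2(x):
    # Two-class entropy of a node where x is the proportion of class-1 examples.
    if x == 0.0 or x == 1.0:
        return 0.0  # the limit of x*log2(x) as x -> 0 is taken to be 0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

print(entropy2(0.5))    # 1.0: a 50/50 node carries 1 bit of missing information
print(entropy2(9 / 14)) # ~0.940: the [9+,5-] set used in Exercise 2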
Exercise 2
• Assume S has 9 + and 5 - examples; partition (computing the
entropy) according to the Wind or Humidity attribute

(exercise)
S: [9+,5-], E = ?
Humidity: High → S1: [3+,4-], E = ? ; Normal → S2: [6+,1-], E = ?
Wind: Strong → S3: [6+,2-], E = ? ; Weak → S4: [3+,3-], E = ?

(solution)
S: [9+,5-], E = 0.940
Humidity: High → [3+,4-], E = 0.985 ; Normal → [6+,1-], E = 0.592
Wind: Strong → [6+,2-], E = 0.811 ; Weak → [3+,3-], E = 1.0

Gain(S, Humidity) = .940 - (7/14)·.985 - (7/14)·.592 = 0.151
Gain(S, Wind) = .940 - (8/14)·.811 - (6/14)·1.0 = 0.048
22
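The gains above can be checked with a short Python sketch (not from the slides); entropy and gain are illustrative helper names, and each child subset is given as a (positive, negative) pair of counts.

import math

def entropy(pos, neg):
    # Entropy of a set with pos positive and neg negative examples.
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, children):
    # Information gain: parent entropy minus the weighted entropy of the children.
    n = sum(parent)
    return entropy(*parent) - sum(((p + q) / n) * entropy(p, q) for p, q in children)

print(gain((9, 5), [(3, 4), (6, 1)]))  # Humidity: ~0.151
print(gain((9, 5), [(6, 2), (3, 3)]))  # Wind:     ~0.048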
Example
• Assume Outlook was chosen: continue
partitioning using info gain in child nodes
[tree so far: Outlook at the root over [9+,5-]; Sunny → [2+,3-] (?), Overcast → [4+,0-] (Yes), Rainy → [3+,2-] (?)]
23
Exercise 3 (entropy, info gain)
Consider the following table of training examples:
...
25
Which tree? Occam’s Razor
• Preference for simple models over complex
models is quite generally used in machine
learning
• Similar principle in science: Occam’s Razor
– roughly: do not make things more complicated than
necessary
• Reasoning, in the case of decision trees: more
complex trees have higher probability of
overfitting the data set
– Why? Somewhat controversial, see later
26
Avoiding Overfitting
• Phenomenon of overfitting:
– keep improving a model, making it better and better
on training set by making it more complicated …
– increases risk of modeling noise and coincidences in
the data set
– may actually harm predictive power of theory on
unseen cases
• Cf. fitting a curve with too many parameters
27
Overfitting: example
[figure: the decision surface with extra splits fitted around a few isolated examples; the carved-out region is an area with probably wrong predictions]
28
Overfitting:
effect on predictive accuracy
• Typical phenomenon when overfitting:
– training accuracy keeps increasing
– accuracy on unseen validation set starts
decreasing
[plot: accuracy vs. size of tree; accuracy on training data keeps increasing while accuracy on unseen data starts decreasing]
29
How to avoid overfitting when
building classification trees?
• Option 1:
– stop adding nodes to tree when overfitting
starts occurring
– need stopping criterion
• Option 2:
– don’t bother about overfitting when growing
the tree
– after the tree has been built, prune it
back
30
Stopping criteria
• How do we know when overfitting starts?
1. use a validation set: data not considered for choosing the best test
• when accuracy goes down on validation set: stop adding nodes to
this branch
2. use some statistical test
• significance test: is the change in class distribution significant? (χ²-test)
[in other words: does the test yield a clearly better situation?]
• MDL: minimal description length principle
– entirely correct theory = tree + corrections for specific
misclassifications
– minimize size(theory) = size(tree) + size(misclassifications(tree))
– Cf. Occam’s razor
31
Post-pruning trees
• After learning the tree: start pruning
branches away
– For all nodes in tree:
• Estimate effect of pruning tree at this node on
predictive accuracy
– e.g. using accuracy on validation set
– Prune node that gives greatest improvement
– Continue until no improvements
• Note : this pruning constitutes a second
search in the hypothesis space
32
Effect of pruning
[plot: effect of pruning on accuracy on unseen data, as a function of the size of the tree]
33
Comparison
• Pros/Cons of Option 1:
– no superfluous work
– but tests may be misleading
e.g., validation accuracy may go down briefly,
then go up again
34
Handling missing values
• What if the result of a test is unknown for an example?
– e.g. because value of attribute unknown
• Some possible solutions, when training:
– guess value: just take most common value (among
all examples, among examples in this node / class,
…)
– assign example partially to different branches
• e.g. counts for 0.7 in yes subtree, 0.3 in no subtree
• When using tree for prediction:
– assign example partially to different branches
– combine predictions of different branches
35
Alternative heuristics
for choosing tests
• Attributes with continuous domains (numbers)
– cannot have different branch for each possible outcome
– allow, e.g., binary test of the form Temperature < 20
– same evaluation as before, but need to generate value (e.g. 20)
• For instance, just try all reasonable values
• Attributes with many discrete values
– unfair advantage over attributes with few values
• cf. question with many possible answers is more informative than
yes/no question
– To compensate: divide the gain by the “max. potential gain” SI (split information):
SI(S,A) = − Σv (|Sv| / |S|) log2(|Sv| / |S|)
– Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)
36
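A minimal sketch of the gain-ratio correction, assuming the split information SI(S,A) given above (the entropy of the partition sizes); split_info and gain_ratio are illustrative names.

import math

def split_info(subset_sizes):
    # SI(S,A): entropy of the partition itself; large for tests with many outcomes.
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s > 0)

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_info(subset_sizes)

# Humidity splits the 14 examples into two subsets of 7, so SI = 1 bit:
print(gain_ratio(0.151, [7, 7]))  # 0.151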
Heuristic which handles costs
• Tests may have different costs
– e.g. medical diagnosis: blood test, visual
examination, … have different costs
– try to find tree with low expected cost, instead
of low expected number of tests
– alternative heuristics that take cost into account
have been proposed, e.g.:
• replace the gain by: Gain²(S,A) / Cost(A) [Tan &
Schlimmer, 1990]
• (2^Gain(S,A) − 1) / (Cost(A) + 1)^w, with w ∈ [0,1] [Nunez, 1988]
37
Continuous Valued Attributes
Create a discrete attribute (= discretization) to
test a continuous one:
• Temperature = 24.5°C
• (Temperature > 20.0°C) = {true, false}
Where to set the threshold?
38
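One common way to generate candidate thresholds (a sketch, not necessarily the exact procedure used in the slides) is to sort the examples by the continuous attribute and take midpoints between consecutive values where the class changes; the values below are hypothetical.

def candidate_thresholds(values, labels):
    # Midpoints between consecutive sorted values where the class label changes.
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, ca), (b, cb) in zip(pairs, pairs[1:])
            if ca != cb and a != b]

# Hypothetical Temperature / Play? pairs:
print(candidate_thresholds([85, 80, 83, 70, 68], ["no", "no", "yes", "yes", "yes"]))
# -> [75.0, 81.5, 84.0]; each candidate is then scored with the usual gain measure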
Generic TDIDT algorithm
• Many different algorithms for top-down
induction of decision trees exist
• What do they have in common, and where
do they differ?
• We look at a generic algorithm
– General framework for TDIDT algorithms
– Several “parameter procedures”
• instantiating them yields a specific algorithm
• Summarizes previously discussed points
and puts them into perspective
42
Generic TDIDT algorithm
function TDIDT(E: set of examples) returns tree;
T' := grow_tree(E);
T := prune(T');
return T;
then show how ID3 would induce a decision tree for these
5 examples.
46
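The generic algorithm above can be sketched in Python as follows (an illustrative skeleton, not the slides' own code); best_test, stop_criterion and info stand for the "parameter procedures", and the test object is assumed to provide a partition method.

def grow_tree(examples, best_test, stop_criterion, info):
    # Generic TDIDT skeleton: the three parameter procedures decide whether the
    # result is a classification, regression or clustering tree learner.
    if stop_criterion(examples):
        return {"leaf": info(examples)}       # e.g. majority class, or mean value
    test = best_test(examples)                # e.g. the test with maximal info gain
    children = {outcome: grow_tree(subset, best_test, stop_criterion, info)
                for outcome, subset in test.partition(examples).items()}
    return {"test": test, "children": children}

def tdidt(examples, best_test, stop_criterion, info, prune):
    return prune(grow_tree(examples, best_test, stop_criterion, info))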
Reduced Error Pruning (1/3)
• Consider each node for pruning starting from the
leaves
• Pruning = removing the subtree at that node, make it a
leaf and assign the most common class at that node
• A node is removed if the resulting tree performs no
worse than the original on the validation set (removes
coincidences and errors)
• Nodes are removed iteratively choosing the node
whose removal most increases the decision tree
accuracy (on the validation set)
• Pruning continues until further pruning is harmful
• Uses training, validation & test sets - effective
approach if a large amount of data is available
48
Reduced-error pruning (2/3)
• IDEA: prune the tree at a node A if the (weighted) sum of
the errors of the child nodes Cv of A > the error of A
[figure: node A with class counts (10+,3-) and children B (9+,1-) and C (1+,2-); each child's error is weighted by the fraction of examples it receives]
52
For regression... (cf. exo5)
• change
– best_test: e.g. minimize average variance
– info: mean
– stop_criterion: significance test (e.g., F-test)
[figure: the same target values {1,3,4,7,8,12} split by a candidate test on A1 and by a candidate test on A2]
54
Exercise 5: regression tree
Consider the following set S of training
examples (with a numeric target attribute).
Here, Gain(S,A) = Var(S) − Σv (|Sv| / |S|) Var(Sv)
Q: Which attribute (a1 or a2) will be put in the top node of the
regression tree?
55
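A minimal Python sketch of this variance-based gain (illustrative code, with hypothetical target values):

def var(ys):
    # Population variance of a list of numeric target values.
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_gain(parent, subsets):
    # Gain(S,A) = Var(S) - sum_v (|Sv| / |S|) Var(Sv)
    n = len(parent)
    return var(parent) - sum((len(s) / n) * var(s) for s in subsets)

# Hypothetical split: separating low from high targets reduces variance the most.
print(variance_gain([1, 2, 3, 10, 11, 12], [[1, 2, 3], [10, 11, 12]]))  # ~20.25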
n-dimensional target spaces
• Instead of predicting 1 number, predict
vector of numbers
• info: mean vector
• best_test: variance (mean squared
distance) in n-dimensional space
• stop_criterion: F-test
• mixed vectors (numbers and symbols)?
– use appropriate distance measure
• -> "clustering trees"
60
Model trees
• Make predictions using linear regression models
in the leaves
• info: regression model (y = a·x1 + b·x2 + c)
• best_test: ?
– variance: simple, not so good (M5 approach)
– residual variance after model construction: better,
computationally expensive (RETIS approach)
• stop_criterion: significant reduction of variance
61
Random Forests
• Ensemble learning: train a set of classifiers and combine
their predictions
• A random forest is an ensemble
classifier that consists of many decision trees and
outputs the majority class (i.e. the mode of the classes
output by the individual trees).
• The term comes from “random decision forests”,
first proposed by Tin Kam Ho of Bell Labs in 1995.
• Leo Breiman, Random Forests, Machine Learning, 45, 5-
32, 2001
• The method combines Breiman's "bagging" idea with the
random selection of features.
62
[figure: random forest construction overview — bootstrap samples of the training data in the top layer, one tree grown per sample]
63
Training a random forest
• For some number of trees T:
• Sample N examples at random with replacement to
create a bootstrap sample of the data (see the top layer of the previous
figure). This sample contains about 2/3 of the distinct examples.
• At each node:
– m attributes (among M) are selected at random from all the
predictor variables.
– The attribute that provides the best split, according to some
objective function, is used to do a binary split on that node.
– At the next node, choose another m attributes at random from all
attributes and do the same.
– Each tree is grown to the largest extent possible. There is no
pruning.
64
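A minimal sketch of the training procedure above, approximated with scikit-learn's DecisionTreeClassifier (its max_features parameter plays the role of choosing m attributes at random at each node); train_forest is an illustrative helper, not the library's API, and X, y are assumed to be NumPy arrays.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100, m="sqrt", seed=0):
    # Grow n_trees unpruned trees, each on a bootstrap sample of the N examples,
    # each restricted to a random subset of m attributes at every split.
    rng = np.random.default_rng(seed)
    forest = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)               # sample N with replacement
        tree = DecisionTreeClassifier(max_features=m)  # random attribute subset per node
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest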
Choose « m » predictors?
Depending upon the value of m, there are three
slightly different systems:
• Random splitter selection: m =1
• Breiman’s bagger: m = total number of attributes
• Random forest: m << number of attributes.
Breiman suggests three possible values for m:
½√M, √M, and 2√M
65
Prediction using a random forest
• When a new input is entered into the system, it is run down all
of the trees. The result may either be an average or weighted
average of all of the terminal nodes that are reached, or, in
the case of categorical variables, a voting majority.
Note that:
• With a large number of attributes, the eligible attribute set will
be quite different from node to node.
• The greater the inter-tree correlation, the greater the random
forest error rate, so one pressure on the model is to have the
trees as uncorrelated as possible.
• As m goes down, both inter-tree correlation and the strength
of individual trees go down. So some optimal value of m must
be discovered.
66
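A sketch of the combination step for categorical targets, assuming a list of fitted scikit-learn classifiers such as the one built above; for regression trees one would average the per-tree predictions instead.

import numpy as np

def forest_predict(trees, X):
    # Majority vote over the predictions of all trees, one column per example.
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
    predictions = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[counts.argmax()])         # most common class wins
    return np.array(predictions)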
Estimating the test error
• While growing forest, estimate test error from training
samples
• For each tree grown, about 33-36% of the samples are not selected in the
bootstrap; these are called out-of-bootstrap / out-of-bag (OOB) samples
• Using OOB samples as input to the corresponding tree,
predictions are made as if they were novel test samples
• Through book-keeping, majority vote (classification),
average (regression) is computed for all OOB samples from
all trees.
• Such estimated test error (over samples that are OOB for a
given tree) is very accurate in practice, with reasonable N
67
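With scikit-learn, the OOB estimate described above can be requested directly; a minimal sketch on toy data (make_classification only stands in for a real training set):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy estimated only from out-of-bag samples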
Use a RF to evaluate the
importance of each attributes
• Denote by ê the OOB estimate of the loss when
using original training set, D.
• For each predictor (i.e. feature or variable) xp
where p∈{1,..,k}
– Randomly permute the values of the pth predictor to
generate a new set of samples D' = {(y1,x'1),…,(yN,x'N)}
– Compute the OOB estimate êp of the prediction error with the
new samples
• A measure of the importance of predictor xp is
êp – ê, the increase in error due to random
permutation of the pth predictor
68
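A simplified sketch of the permutation idea: here a held-out validation set stands in for the OOB samples (an assumption made to keep the code short); scikit-learn also ships sklearn.inspection.permutation_importance for the same purpose.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
base_error = 1 - rf.score(X_val, y_val)             # validation error, stands in for ê

rng = np.random.default_rng(0)
for p in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, p] = rng.permutation(X_perm[:, p])    # randomly permute predictor p
    importance = (1 - rf.score(X_perm, y_val)) - base_error  # êp - ê
    print(p, round(importance, 4))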
Conclusions on random forests
• Fast fast fast!
– RF is fast to build. Even faster to predict!
– Practically speaking, not needing cross-validation for
model selection alone can speed up training by 10x-100x or
more.
– Fully parallelizable … to go even faster!
• Automatic attribute selection from large number of candidates
• Resistance to overtraining (overfitting)
• Ability to handle data without preprocessing
– data does not need to be rescaled, transformed, or modified
– resistant to outliers
– automatic handling of missing values
• Cluster identification can be used to generate tree-based clusters
through sample proximity
Try it out...
• scikit-learn (Machine Learning library in
Python)
• http://scikit-learn.org/stable/modules/tree.html
• http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• Can parameterize everything to obtain
your favorite decision tree/RF algorithm
73
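A minimal usage sketch of the two classes linked above, on scikit-learn's built-in iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

dt = DecisionTreeClassifier(criterion="entropy", max_depth=3)       # a single, small tree
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")  # an ensemble of trees

print(cross_val_score(dt, X, y, cv=5).mean())  # cross-validated accuracy of the tree
print(cross_val_score(rf, X, y, cv=5).mean())  # typically at least as good for the forest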
To Remember
• Decision trees & their representational power
• Generic TDIDT algorithm and how to instantiate
its parameters
• For classification trees: details on heuristics,
discretization, handling missing values, pruning,
…
• Some general concepts: overfitting, Occam’s
razor applied to DT
• Random Forests, principles ( “random”, where?)
75