
Introduction to

Machine Learning
(part 4 : DT and RF)

Elisa Fromont
Option “Machine Learning” M2 IL(a)

1
Decision trees : outline
(adapted from Tom Mitchell’s “Machine Learning” book and
Hendrik Blockeel’s lecture, KULeuven, Belgium)

• What are decision trees?


• How can they be induced automatically?
– top-down induction of decision trees
– avoiding overfitting
– alternative heuristics
– a generic TDIDT algorithm
• Random Forests

2
What are decision trees?
• Cf. guessing a person using only (binary)
yes/no questions:
– ask some question
– depending on answer, ask a new question
– continue until answer known
• A decision tree
– Tells you which question to ask, depending on
outcome of previous questions
– Gives you the answer (= prediction) in the end
• Usually not used for guessing an individual, but
for predicting some property (e.g., classification)
3
Example decision tree 1
• Mitchell’s example: Play tennis or not?
(depending on weather conditions)
Outlook   Temp.  Hum.  Wind   Play?
Sunny     85     85    False  no
Sunny     80     90    True   no
Overcast  83     86    False  yes
...       ...    ...   ...    ...

Resulting tree:
Outlook
  Sunny    → Humidity:  High → No,   Normal → Yes
  Overcast → Yes
  Rainy    → Wind:      Strong → No, Weak → Yes
4
Example decision tree 2
• tree for predicting whether C-section
necessary
• Leaves are not pure here; ratio pos/neg is
given

Fetal_Presentation
  = 1 → Previous_Csection
          = 0 → Primiparous → … (subtree)
          = 1 → +  [55+, 35-]  (.61+ .39-)
  = 2 → -  [3+, 29-]  (.11+ .89-)
  = 3 → -  [8+, 22-]  (.27+ .73-)
5
Representation power
e.g. propositional logic
• Typically:
– examples are represented by an array of attribute values
– 1 node in the tree tests the value of 1 attribute
– 1 child node for each possible outcome of the test
– leaf nodes assign a classification
• Note:
– a tree can represent any Boolean function
• i.e., also disjunctive concepts (e.g. A ∨ B), such as:
    A
      true  → true
      false → B
                true  → true
                false → false
– a tree can allow noise (non-pure leaves)

6
Classification, Regression and
Clustering trees
• Classification trees represent function X -> C with
C discrete (like the decision trees we just saw)
– Hence, can be used for classification
• Regression trees predict numbers in leaves
– could use a constant (e.g., mean), or linear regression
model, or …
• Clustering trees just group examples in leaves
• Most (but not all) decision tree research in
machine learning focuses on classification trees

7
Inducing Decision Trees...
• In general, we like decision trees that give us a
result after as few questions as possible
• We can construct such a tree manually, or we
can try to obtain it automatically (inductively)
from a set of data

Outlook   Temp.  Hum.  Wind   Play?
Sunny     85     85    False  no
Sunny     80     90    True   no
Overcast  83     86    False  yes
...       ...    ...   ...    ...

→ induced tree:
Outlook
  Sunny    → Humidity:  High → No,   Normal → Yes
  Overcast → Yes
  Rainy    → Wind:      Strong → No, Weak → Yes
8
Top-Down Induction of
Decision Trees
• Basic algorithm for TDIDT: (later more formal version)
– start with full data set
– find the test that partitions the examples as well as possible
• “good” = examples with same class, or otherwise similar
examples, should be put together
– for each outcome of test, create child node
– move examples to children according to outcome of test
– repeat procedure for each child that is not “pure”
• Main questions:
– how to decide which test is “best”
– when to stop the procedure

9
Example problem

Is this drink going to make us ill, or not?

10
Data set: 8 classified instances

11
Obs 1: Shape is important
[Figure: the 8 glasses grouped by SHAPE]

12
Obs 2: for some glasses,
colour is important
[Figure: within one of the SHAPE groups, the glasses are further split by COLOUR]

13
The decision tree

[Figure: the resulting decision tree: SHAPE is tested at the root; for one of the
shapes a COLOUR test (non-orange vs. orange) is added]

14
A DT creates a decision surface
[Figure: training examples labelled + and - plotted in the (A1, A2) plane; the
tree's tests split the plane into rectangular regions, each predicting + or -]

15
Exercise 1: (Decision Surface)
Consider a dataset with two
numerical attributes a1 and a2
and one nominal target
attribute c with two possible
values: + and -

[Figure: the training examples plotted in the (a1, a2) plane]

1. Find (“by hand”) a binary
decision tree that classifies all
training examples correctly.

2. Draw the decision surface of
this tree on the figure.

16
Finding the best test
(for classification trees)
• For classification trees: find test for which
children are as “pure” as possible
• Purity measure borrowed from information
theory: entropy
– is a measure of “missing information”; more
precisely, number of bits needed to represent the
missing information, on average, using optimal
encoding
• Given a set S with instances belonging to class i with
probability p_i:  Entropy(S) = - Σ_i p_i log2(p_i)
17
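As a concrete illustration of the entropy formula above (not part of the original slides), a minimal Python sketch; class_counts is a hypothetical list giving the number of instances of each class in a node:

import math

def entropy(class_counts):
    # Entropy of a node, given the number of instances of each class.
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:                    # 0 * log2(0) is taken to be 0
            p = count / total
            ent -= p * math.log2(p)
    return ent

print(entropy([9, 5]))                   # the [9+,5-] node used later: ~0.940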
Entropy
• For 2 classes, if x = proportion of instances of class 1 in a given node e:
– Entropy(e) = - x log2(x) - (1-x) log2(1-x)
(note that x log2(x) is undefined at x = 0, but its limit as x → 0 is 0, so it is
taken to be 0)
• Entropy as a function of p (the proportion of examples from class 1), for 2
classes:

[Figure: the entropy curve, equal to 0 at p = 0 and p = 1, maximal at p = 0.5]

• Entropy is maximal (1) when the 2 classes are perfectly mixed


18
Information gain
• Heuristic for choosing a test in a node:
– test that on average provides most information about
the class
– test that, on average, reduces class entropy most
• on average: class entropy reduction differs according to
outcome of test
– expected reduction of entropy = information gain

Gain(S,A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v)

• S = set of instances in the node where test A is considered
• |S_v| / |S| = proportion of the instances of S that go to the v-th child of A
19
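A small sketch of the information-gain computation, reusing the entropy helper sketched above; parent_counts and children_counts are hypothetical class-count summaries of a node and of the partition induced by a test:

def information_gain(parent_counts, children_counts):
    # Gain(S,A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# Example: splitting [9+,5-] into [6+,2-] and [3+,3-] (the Wind test of Exercise 2)
print(information_gain([9, 5], [[6, 2], [3, 3]]))   # ~0.048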
Other purity measure/gain
• Gini impurity index (not to be confused with Gini
coefficient) :
– “measure of how often a randomly chosen element
from the set would be incorrectly labeled if it was
randomly labeled according to the distribution of
labels in the subset” (lower the better):
• Gini(S) = Σ_i p_i (1-p_i) = Σ_i p_i - Σ_i p_i² = 1 - Σ_i p_i²
          = 2 p_i p_j (for a 2-class problem with classes i and j)

• Gain(S,A) = Gini(S) - Σ_v (|S_v| / |S|) Gini(S_v)

20
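For comparison, a sketch of the Gini impurity and the corresponding gain, using the same class-count representation as the entropy sketch above (illustration only):

def gini(class_counts):
    # Gini(S) = 1 - sum_i p_i^2
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_gain(parent_counts, children_counts):
    # Gain(S,A) = Gini(S) - sum_v (|S_v|/|S|) * Gini(S_v)
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * gini(child) for child in children_counts)
    return gini(parent_counts) - remainder

print(gini([9, 5]))                              # ~0.459
print(gini_gain([9, 5], [[3, 4], [6, 1]]))       # the Humidity split of Exercise 2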
Exercise 2
• Assume S has 9 + and 5 - examples; partition (with
entropy) according to Wind or Humidity attribute

Split on Humidity: S: [9+,5-], E = ?
  High   → S1: [3+,4-], E = ?
  Normal → S2: [6+,1-], E = ?
Split on Wind: S: [9+,5-], E = ?
  Strong → S3: [6+,2-], E = ?
  Weak   → S4: [3+,3-], E = ?

• Which attribute gives the best gain ?


21
Exercise 2 (solution)
• Assume S has 9 + and 5 - examples; partition (with
entropy) according to Wind or Humidity attribute

Split on Humidity: S: [9+,5-], E = 0.940
  High   → [3+,4-], E = 0.985
  Normal → [6+,1-], E = 0.592
Split on Wind: S: [9+,5-], E = 0.940
  Strong → [6+,2-], E = 0.811
  Weak   → [3+,3-], E = 1.0

Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind)     = 0.940 - (8/14)·0.811 - (6/14)·1.0   = 0.048
22
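The numbers on this slide can be reproduced with the entropy and information_gain helpers sketched earlier (an optional check, not part of the original exercise):

for counts in ([9, 5], [3, 4], [6, 1], [6, 2], [3, 3]):
    print(counts, round(entropy(counts), 3))          # 0.94, 0.985, 0.592, 0.811, 1.0

print(information_gain([9, 5], [[3, 4], [6, 1]]))     # Humidity: ~0.15 (0.151 with rounded entropies)
print(information_gain([9, 5], [[6, 2], [3, 3]]))     # Wind:     ~0.048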
Example
• Assume Outlook was chosen: continue
partitioning using info gain in child nodes
Outlook  [9+,5-]
  Sunny    → ?    [2+,3-]
  Overcast → Yes  [4+,0-]
  Rainy    → ?    [3+,2-]

23
Exercise 3 (entropy, info gain)
Consider the following table of training examples:

1. What is the entropy of the set {1,2,3,4,5,6} with
respect to the “classification” attribute?
2. What is the information gain of a2 relative to these
training examples? What can you conclude?
24
Hypothesis space search in TDIDT
• Hypothesis space H = set of all trees
• H is searched in a hill-climbing fashion,
from simple to complex

... 25
Which tree? Occam’s Razor
• Preference for simple models over complex
models is quite generally used in machine
learning
• Similar principle in science: Occam’s Razor
– roughly: do not make things more complicated than
necessary
• Reasoning, in the case of decision trees: more
complex trees have higher probability of
overfitting the data set
– Why? Somewhat controversial, see later
26
Avoiding Overfitting
• Phenomenon of overfitting:
– keep improving a model, making it better and better
on training set by making it more complicated …
– increases risk of modeling noise and coincidences in
the data set
– may actually harm predictive power of theory on
unseen cases
• Cf. fitting a curve with too many parameters
27
Overfitting: example

[Figure: the (A1, A2) decision surface from the previous example; fitting the
single + that lies among the - examples creates an area with probably wrong
predictions]

28
Overfitting:
effect on predictive accuracy
• Typical phenomenon when overfitting:
– training accuracy keeps increasing
– accuracy on unseen validation set starts
decreasing
[Figure: accuracy as a function of tree size; accuracy on training data keeps
increasing, while accuracy on unseen data starts decreasing at some point;
overfitting starts about there]

29
How to avoid overfitting when
building classification trees?
• Option 1:
– stop adding nodes to tree when overfitting
starts occurring
– need stopping criterion
• Option 2:
– don’t bother about overfitting when growing
the tree
– after the tree has been built, start pruning it
again
30
Stopping criteria
• How do we know when overfitting starts?
1. use a validation set: data not considered for choosing the best test
• when accuracy goes down on validation set: stop adding nodes to
this branch
2. use some statistical test
• significance test: is the change in class distribution significant? (χ²-test)
[in other words: does the test yield a clearly better situation?]
• MDL: minimal description length principle
– entirely correct theory = tree + corrections for specific
misclassifications
– minimize size(theory) = size(tree) + size(misclassifications(tree))
– Cf. Occam’s razor

31
Post-pruning trees
• After learning the tree: start pruning
branches away
– For all nodes in tree:
• Estimate effect of pruning tree at this node on
predictive accuracy
– e.g. using accuracy on validation set
– Prune node that gives greatest improvement
– Continue until no improvements
• Note : this pruning constitutes a second
search in the hypothesis space
32
Effect of pruning

[Figure: accuracy as a function of tree size, for training data and for unseen
data; pruning cuts the tree back toward the size where accuracy on unseen data
was highest]

33
Comparison
• Pros/Cons of Option 1:
– no superfluous work
– but tests may be misleading
e.g., validation accuracy may go down briefly,
then go up again

• Therefore, Option 2 (post-pruning) is


usually preferred (though more work)

34
Handling missing values
• What if result of test is unknown for example?
– e.g. because value of attribute unknown
• Some possible solutions, when training:
– guess value: just take most common value (among
all examples, among examples in this node / class,
…)
– assign example partially to different branches
• e.g. counts for 0.7 in yes subtree, 0.3 in no subtree
• When using tree for prediction:
– assign example partially to different branches
– combine predictions of different branches

35
Alternative heuristics
for choosing tests
• Attributes with continuous domains (numbers)
– cannot have different branch for each possible outcome
– allow, e.g., binary test of the form Temperature < 20
– same evaluation as before, but need to generate value (e.g. 20)
• For instance, just try all reasonable values
• Attributes with many discrete values
– unfair advantage over attributes with few values
• cf. question with many possible answers is more informative than
yes/no question
– To compensate: divide gain by “max. potential gain” SI
– Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)

• Split information: SI(S,A) = - Σ_v (|S_v| / |S|) log2(|S_v| / |S|)

• with v ranging over the different outcomes of test A

36
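A sketch of split information and gain ratio, built on the earlier entropy and information_gain helpers (illustrative only; same class-count representation):

def split_information(children_counts):
    # SI(S,A) = - sum_v (|S_v|/|S|) * log2(|S_v|/|S|)
    sizes = [sum(child) for child in children_counts]
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(parent_counts, children_counts):
    si = split_information(children_counts)
    return information_gain(parent_counts, children_counts) / si if si > 0 else 0.0

print(gain_ratio([9, 5], [[3, 4], [6, 1]]))   # the Humidity split of Exercise 2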
Heuristic which handles costs
• Tests may have different costs
– e.g. medical diagnosis: blood test, visual
examination, … have different costs
– try to find tree with low expected cost, instead
of low expected number of tests
– alternative heuristics, taking cost into account,
have been proposed e.g:
• replace gain by: Gain²(S,A) / Cost(A)  [Tan, Schlimmer 1990]
• or by: (2^Gain(S,A) - 1) / (Cost(A)+1)^w, with w ∈ [0,1]  [Nunez 1988]

37
Continuous Valued Attributes
Create a discrete attribute (= discretization) to test a continuous one:
• Temperature = 24.5 °C
• (Temperature > 20.0 °C) = {true, false}
Where to set the threshold?

Temperature   15 °C   18 °C   19 °C   22 °C   24 °C   27 °C
PlayTennis    No      No      Yes     Yes     Yes     No

38
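One common heuristic for generating candidate thresholds (a sketch, not necessarily exactly what C4.5 does) is to sort the examples by the attribute value and take the midpoints between consecutive examples whose class differs; temps and play restate the small table above:

temps = [15, 18, 19, 22, 24, 27]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, play))
# midpoints between consecutive values where the class changes
candidates = [(v1 + v2) / 2
              for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
              if c1 != c2]
print(candidates)   # [18.5, 25.5] -> evaluate each candidate with information gain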
Generic TDIDT algorithm
• Many different algorithms for top-down
induction of decision trees exist
• What do they have in common, and where
do they differ?
• We look at a generic algorithm
– General framework for TDIDT algorithms
– Several “parameter procedures”
• instantiating them yields a specific algorithm
• Summarizes previously discussed points
and puts them into perspective
42
Generic TDIDT algorithm
function TDIDT(E: set of examples) returns tree;
T' := grow_tree(E);
T := prune(T');
return T;

function grow_tree(E: set of examples) returns tree;
T := generate_tests(E);
t := best_test(T, E);
P := partition induced on E by t;
if stop_criterion(E, P)
then return leaf(info(E))
else
for all Ej in P: tj := grow_tree(Ej);
return node(t, {(j, tj)});
43
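A minimal Python transcription of this generic algorithm, with the parameter procedures passed in as functions (a sketch with assumed interfaces, not a reference implementation; in particular, the chosen test t is assumed to expose a split(E) method returning a dict outcome -> subset):

def tdidt(examples, generate_tests, best_test, stop_criterion, info, prune):
    # Generic TDIDT: grow a tree using the parameter procedures, then prune it.
    def grow_tree(E):
        tests = generate_tests(E)                 # e.g. Attr = val, Attr < val, ...
        t = best_test(tests, E)                   # e.g. maximise Gain or Gain Ratio
        P = t.split(E) if t is not None else {}   # assumed: dict outcome -> subset of E
        if t is None or stop_criterion(E, P):
            return ("leaf", info(E))              # e.g. the most frequent class
        return ("node", t, {o: grow_tree(Ej) for o, Ej in P.items()})
    return prune(grow_tree(examples))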
For classification...
• prune: e.g. reduced-error pruning, ...
• generate_tests : Attr=val, Attr<val, ...
– for numeric attributes: generate val
• best_test : Gain, Gainratio, ...
• stop_criterion : MDL, significance test (e.g. χ²-test), ...
• Info (how do we label the node ?) : most
frequent class ("mode")
• Popular systems: C4.5 (Quinlan 1993), C5.0
(www.rulequest.com)
44
Simple algorithm: ID3
(no pruning, 2 classes, categorical attributes)
ID3 (Examples E, Target_Attribute T, Attributes A)
• Create a root node for the tree
– If (all e ∈ E are of only one class c) or A is empty,
Return the single-node tree Root, with label = majority class in E ( = c)
– Otherwise
Begin
1. A* := the attribute in A that “best” classifies the examples (using InfoGain).
2. Decision-tree attribute for Root := A*.
3. For each possible value, vi, of A* :
• Add a new tree branch below Root, corresponding to the test A* = vi, if
Examples(vi) (the subset of examples that have the value vi for A*) is
not empty.
• Below this new branch add the subtree ID3 (Examples(vi),
Target_Attribute, A – {A*})
End
• Return Root
45
Exercise 4: ID3
1. Show a decision tree that could be learned by ID3
assuming it gets the following examples:

2. Add this example to the 4 above:

then show how ID3 would induce a decision tree for these
5 examples.

46
Reduced Error Pruning (1/3)
• Consider each node for pruning starting from the
leaves
• Pruning = removing the subtree at that node, make it a
leaf and assign the most common class at that node
• A node is removed if the resulting tree performs no
worse than the original on the validation set (removes
coincidences and errors)
• Nodes are removed iteratively choosing the node
whose removal most increases the decision tree
accuracy (on the validation set)
• Pruning continues until further pruning is harmful
• Uses training, validation & test sets - effective
approach if a large amount of data is available

48
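A rough Python sketch of reduced-error pruning; this is a simplified bottom-up variant (the slide describes an iterative version that always removes the most beneficial node first), and the tree representation, with a callable test_fn(x) returning the outcome of the test for example x, is an assumption:

from collections import Counter

def majority_class(examples):
    # examples: list of (x, y) pairs; returns the most frequent label y
    return Counter(y for _, y in examples).most_common(1)[0][0]

def predict(tree, x):
    # trees are ("leaf", label) or ("node", test_fn, {outcome: subtree})
    if tree[0] == "leaf":
        return tree[1]
    _, test_fn, children = tree
    return predict(children[test_fn(x)], x)

def accuracy(tree, examples):
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(tree, validation):
    # Replace a subtree by a majority-class leaf whenever this does not
    # decrease accuracy on the validation examples reaching that subtree.
    if tree[0] == "leaf" or not validation:
        return tree
    _, test_fn, children = tree
    pruned = {o: reduced_error_prune(sub, [(x, y) for x, y in validation if test_fn(x) == o])
              for o, sub in children.items()}
    node = ("node", test_fn, pruned)
    leaf = ("leaf", majority_class(validation))
    return leaf if accuracy(leaf, validation) >= accuracy(node, validation) else node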
Reduced-error pruning (2/3)
• IDEA : prune the tree at a node A if the (weighted) sum of
the errors of the child nodes Cv of A > error of A

• Weight = proportion of instances in a given child node
compared to the total number of instances in A

• Don't use the raw error but the error within a confidence interval:

– The observed proportion p = X/n follows a probability distribution close to a
Normal distribution (= Gaussian) with mean µ = p = f_n and
standard deviation σ = (p(1-p)/n)^(1/2)  (NB: σ² is the variance)
– u : confidence level or confidence coefficient (quantile of the
cumulative distribution function) → need tables
49
Example: confidence interval
• Toss a coin 10 times : 3 heads, 7 tails. Is
there a problem with the coin? (why not 5/5?)
• Probability of giving the wrong answer < 5%,
so u = 1.96
– f_n = p = 1/2
– n = 10
– Compute the confidence interval (CI)…
• If 3/10 and 7/10 are within the CI, the coin is
OK
50
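Worked out numerically (a small sketch of the computation the slide asks for):

import math

p, n, u = 0.5, 10, 1.96                     # assumed fair coin, 10 tosses, 95% level
sigma = math.sqrt(p * (1 - p) / n)          # ~0.158
low, high = p - u * sigma, p + u * sigma    # CI ~ [0.19, 0.81]
print(round(low, 3), round(high, 3))
print(low <= 3 / 10 <= high, low <= 7 / 10 <= high)   # both True -> the coin looks OK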
Reduced-error pruning (3/3)
• To compare the error (proportion of wrongly
classified instances) of the child nodes with the
error of the parent node, we only use the upper
bound of the confidence interval. We prune if:

[formula omitted in this copy; informally, following the previous slide: the
weighted sum of the children's upper-bound errors is not smaller than the
node's own upper-bound error]

GAME: match the notations!
51


Example
• Should we prune A ? (u = 1.96)

A: (10+,3-)
  child B: (9+,1-)
  child C: (1+,2-)
52
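One possible way to work this example in Python, assuming the rule of the previous slides (prune A if the weighted sum of the children's upper-bound errors is at least A's own upper-bound error); the slide's exact formula is not reproduced here, so this is an interpretation:

import math

def upper_error(errors, n, u=1.96):
    f = errors / n                               # observed error rate
    return f + u * math.sqrt(f * (1 - f) / n)    # upper bound of the confidence interval

e_A = upper_error(3, 13)    # A labelled +, so its 3 negatives are errors
e_B = upper_error(1, 10)    # B: (9+,1-) labelled +
e_C = upper_error(1, 3)     # C: (1+,2-) labelled -
weighted_children = (10 / 13) * e_B + (3 / 13) * e_C
print(round(e_A, 3), round(weighted_children, 3))   # compare to decide whether to prune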
For regression... (cf. Exercise 5)
• change
– best_test: e.g. minimize average variance
– info: mean
– stop_criterion: significance test (e.g., F-test)
Candidate splits of the examples {1,3,4,7,8,12}:
  A1 → {1,4,12} and {3,7,8}
  A2 → {1,3,7} and {4,8,12}
53


CART
• Binary classification and regression trees [Breiman et
al., 1984]
• Classification:
– info: mode,
– best_test: Gini
• Regression:
– info: mean,
– best_test: variance
• prune: “cost-complexity" pruning
– penalty α for each node
– the higher α, the smaller the tree will be
– optimal α obtained empirically (cross-validation)

54
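In recent versions of scikit-learn (mentioned at the end of these slides), CART-style cost-complexity pruning is exposed through the ccp_alpha parameter; a small illustrative snippet (dataset and alpha value are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# candidate alpha values along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)

# the higher alpha, the smaller the pruned tree
tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)
print(tree.get_n_leaves(), tree.score(X_test, y_test))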
Exercise 5: regression tree
Consider the following set S of training
examples (with a numeric target attribute).
Here, Gain(S,A) = Var(S) - Σ_v (|S_v| / |S|) Var(S_v)

Find argmin over attributes A of H(A) = Σ_{v ∈ Values(A)} (|S_v| / |S|) · Var(S_v)

Variance: Var(S_v) = Σ_{i ∈ S_v} p_i (x_i - µ)²
• p_i : probability of i in S_v
• x_i : value of the target for example i in S_v
• µ : mean of the target in S_v

Q: Which attribute (a1 or a2) will be put in the top node of the
regression tree?
55
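Since the exercise's table is not reproduced in this copy, here is a generic sketch of the quantity to minimise, on hypothetical data (targets_by_value maps each value of an attribute to the target values of the examples in that branch; p_i is taken uniform, 1/|S_v|):

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def h(targets_by_value):
    # H(A) = sum_v (|S_v|/|S|) * Var(S_v); choose the attribute with the smallest H
    n = sum(len(xs) for xs in targets_by_value.values())
    return sum(len(xs) / n * variance(xs) for xs in targets_by_value.values())

# hypothetical attribute with two values splitting six target values
print(h({"v1": [1.0, 1.2, 0.9], "v2": [4.8, 5.1, 5.3]}))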
n-dimensional target spaces
• Instead of predicting 1 number, predict
vector of numbers
• info: mean vector
• best_test: variance (mean squared
distance) in n-dimensional space
• stop_criterion: F-test
• mixed vectors (numbers and symbols)?
– use appropriate distance measure
• -> "clustering trees"
60
Model trees
• Make predictions using linear regression models
in the leaves
• info: regression model (y=ax1+bx2+c)
• best_test: ?
– variance: simple, not so good (M5 approach)
– residual variance after model construction: better,
computationally expensive (RETIS approach)
• stop_criterion: significant reduction of variance

61
Random Forests
• Ensemble learning: train a set of classifiers and combine
their predictions
• A random forest is an ensemble classifier that
consists of many decision trees and outputs the majority class
(i.e. the mode of the classes output by the individual trees).
• The term comes from “random decision forests”, first
proposed by Tin Kam Ho of Bell Labs in 1995.
• Leo Breiman, Random Forests, Machine Learning, 45, 5-
32, 2001
• The method combines Breiman's "bagging" idea and the
random selection of features.

62
63
Training a random forest
• For some number of trees T:
• Sample N examples at random with replacement to
create a subset of the data (see top layer of previous
figure). The subset should be about 66% of the total set.
• At each node:
– m attributes (among M) are selected at random from all the
predictor variables.
– The attribute that provides the best split, according to some
objective function, is used to do a binary split on that node.
– At the next node, choose another m attributes at random from all
attributes and do the same.
– Each tree is grown to the largest extent possible. There is no
pruning.

64
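This procedure corresponds closely to what scikit-learn's RandomForestClassifier does, with bootstrap samples and max_features playing the role of m; a small illustrative example (dataset and parameters are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees; at each node, m = sqrt(M) randomly selected attributes are considered
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))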
Choose « m » predictors?
Depending upon the value of m, there are three
slightly different systems:
• Random splitter selection: m =1
• Breiman’s bagger: m = total number of attributes
• Random forest: m << number of attributes.
Breiman suggests three possible values for m:
½√M, √M, and 2√M

65
Prediction using a random forest
• When a new input is entered into the system, it is run down all
of the trees. The result may either be an average or weighted
average of all of the terminal nodes that are reached, or, in
the case of categorical variables, a voting majority.

Note that:
• With a large number of attributes, the eligible attribute set will
be quite different from node to node.
• The greater the inter-tree correlation, the greater the random
forest error rate, so one pressure on the model is to have the
trees as uncorrelated as possible.
• As m goes down, both inter-tree correlation and the strength
of individual trees go down. So some optimal value of m must
be discovered.

66
Estimating the test error
• While growing forest, estimate test error from training
samples
• For each tree grown, 33-36% of samples are not selected in
bootstrap, called out of bootstrap/out of bag (OOB) samples
• Using OOB samples as input to the corresponding tree,
predictions are made as if they were novel test samples
• Through book-keeping, majority vote (classification),
average (regression) is computed for all OOB samples from
all trees.
• Such estimated test error (over samples that are OOB for a
given tree) is very accurate in practice, with reasonable N

67
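In scikit-learn this OOB estimate is available through oob_score=True (illustrative snippet):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # accuracy estimated only on the out-of-bag samples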
Use a RF to evaluate the
importance of each attributes
• Denote by ê the OOB estimate of the loss when
using the original training set D.
• For each predictor (i.e. feature or variable) xp
where p∈{1,..,k}
– Randomly permute the values of the p-th predictor to
generate a new set of samples D' = {(y1,x'1),…,(yN,x'N)}
– Compute the OOB estimate ê_p of the prediction error with the
new samples
• A measure of importance of predictor x_p is
ê_p – ê, the increase in error due to the random
permutation of the p-th predictor

68
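This permutation-based importance is available in scikit-learn as permutation_importance (shown here on the training data for brevity; in practice one would use held-out or OOB data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# mean drop in accuracy when each feature is randomly permuted
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)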
Conclusions on random forests
• Fast fast fast!
– RF is fast to build. Even faster to predict!
– Practically speaking, not needing cross-validation for model
selection alone speeds up training by 10x-100x or
more.
– Fully parallelizable … to go even faster!
• Automatic attribute selection from a large number of candidates
• Resistance to overtraining (overfitting)
• Ability to handle data without preprocessing
– data does not need to be rescaled, transformed, or modified
– resistant to outliers
– automatic handling of missing values
• Cluster identification can be used to generate tree-based clusters
through sample proximity
Try it out...
• scikit-learn (Machine Learning library in
Python)
• http://scikit-learn.org/stable/modules/tree.html
• http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• Can parameterize everything to obtain
your favorite decision tree/RF algorithm
73
To Remember
• Decision trees & their representational power
• Generic TDIDT algorithm and how to instantiate
its parameters
• For classification trees: details on heuristics,
discretization, handling missing values, pruning, …
• Some general concepts: overfitting, Occam’s
razor applied to DT
• Random Forests, principles ( “random”, where?)

75
