Lecture 5
Ensemble methods
• A single decision tree does not perform well
• But, it is super fast
• What if we learn multiple trees?
We need to make sure they do not all just learn the same thing
Bagging
If we split the data in different random ways, decision
trees give different results: they have high variance.
How?
Bootstrap
Construct B (hundreds of) trees (no pruning)
Learn a classifier for each bootstrap sample and
average them
Very effective
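A minimal from-scratch sketch of this procedure (an illustration, not the lecture's code; the toy dataset, B, and all names below are placeholder choices, using scikit-learn's decision trees): draw B bootstrap samples, grow an unpruned tree on each, and aggregate the predictions (majority vote for classification, averaging for regression).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Toy data, just for illustration; any classification dataset would do.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    B = 200                                   # number of bootstrap samples / trees
    n = X_train.shape[0]
    rng = np.random.default_rng(0)
    trees = []

    for b in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap: sample n indices with replacement
        tree = DecisionTreeClassifier()       # grown deep, no pruning
        tree.fit(X_train[idx], y_train[idx])
        trees.append(tree)

    # Aggregate: majority vote over the B trees (for regression, average instead).
    all_preds = np.stack([t.predict(X_test) for t in trees]).astype(int)   # (B, n_test)
    majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
    print("bagged test accuracy:", (majority == y_test).mean())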
Bagging for classification: Majority vote
[Figure: bagging decision trees. Feature space (X1, X2) and the test error curve; the test error levels off, no overfitting.]
Hastie et al., "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer (2009)
Out-of-Bag Error Estimation
• No need for cross-validation?
• Remember, in bootstrapping we sample with
replacement, and therefore not all observations are
used in each bootstrap sample. Each observation is left out
with probability (1 - 1/n)^n ≈ 1/e, so on average about 1/3 of
them are not used!
• We call them out-of-bag samples (OOB)
• We can predict the response for the i-th observation
using each of the trees in which that observation was
OOB and do this for n observations
• Calculate overall OOB MSE or classification error
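A sketch of this OOB bookkeeping in code (again an illustration on placeholder data, not the lecture's code): for each tree, record which observations were left out of its bootstrap sample, collect their votes from exactly those trees, and compute the OOB classification error.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    n, B = X.shape[0], 200
    rng = np.random.default_rng(0)

    votes = np.zeros((n, 2), dtype=int)       # OOB votes per observation (2 classes here)
    oob_fractions = []

    for b in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), idx)         # observations NOT used by this tree
        oob_fractions.append(len(oob) / n)
        tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob])] += 1         # this tree votes only on its OOB samples

    covered = votes.sum(axis=1) > 0                   # OOB at least once (virtually all, for large B)
    oob_pred = votes.argmax(axis=1)
    print("average OOB fraction per tree:", np.mean(oob_fractions))   # ≈ (1 - 1/n)^n ≈ 1/e ≈ 0.37
    print("OOB classification error:", (oob_pred[covered] != y[covered]).mean())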
Bagging
• Reduces overfitting (variance)
• Normally uses one type of classifier
• Decision trees are popular
• Easy to parallelize
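Because each tree is fit independently on its own bootstrap sample, training parallelizes trivially. A small illustration (placeholder data; scikit-learn's BaggingClassifier bags decision trees by default):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier   # base estimator: a decision tree

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # n_jobs=-1: grow the 200 trees on all available cores, since they are independent.
    bag = BaggingClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)
    print(bag.score(X, y))   # training accuracy, as a quick sanity check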
Variable Importance Measures
• Bagging results in improved accuracy over prediction
using a single tree
• Unfortunately, the resulting model is difficult to interpret:
bagging improves prediction accuracy at the expense of
interpretability.
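Variable importance measures recover some of that interpretability, for example the total decrease in node impurity attributable to each variable (averaged over the trees), or the drop in accuracy when a variable is randomly permuted. A sketch with scikit-learn on placeholder data (an illustration, not the lecture's code):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Impurity-based importance: total impurity decrease from splits on each variable,
    # averaged over all trees in the ensemble.
    print("impurity-based importances:", rf.feature_importances_.round(3))

    # Permutation importance: how much the score drops when one variable is shuffled.
    perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
    print("permutation importances:", perm.importances_mean.round(3))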
http://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests Algorithm
For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree to the bootstrapped data, by
recursively repeating the following steps for each terminal node of the
tree, until the minimum node size nmin is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
Output the ensemble of trees.
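As a rough mapping of these quantities onto library parameters (a sketch assuming scikit-learn; the values are placeholders, not recommendations): B corresponds to n_estimators, m to max_features, and nmin to min_samples_leaf.

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=500,       # B: number of trees
        max_features="sqrt",    # m: variables considered at each split (≈ sqrt(p) for classification)
        min_samples_leaf=1,     # nmin: minimum node size, i.e. deep trees, no pruning
        bootstrap=True,         # draw Z* of size N with replacement
        random_state=0,
    )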
In practice the best values for these parameters (B, m, and the minimum
node size nmin) will depend on the problem, and they should be treated
as tuning parameters.
As with bagging, we can use the OOB error, and therefore a random forest
can be fit in one sequence, with cross-validation effectively being performed
along the way. Once the OOB error stabilizes, training can be terminated.
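One way to sketch this "fit in one sequence and stop once the OOB error flattens" idea is scikit-learn's warm_start mechanism: keep the trees already grown, add more, and record the OOB error after each increment (placeholder data and tree counts; an illustration, not the lecture's code).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    rf = RandomForestClassifier(warm_start=True, oob_score=True,
                                max_features="sqrt", random_state=0)

    oob_error = []
    for n_trees in range(25, 301, 25):
        rf.set_params(n_estimators=n_trees)   # keeps the existing trees, grows the new ones
        rf.fit(X, y)
        oob_error.append((n_trees, 1 - rf.oob_score_))

    # Training can be stopped once this curve stops decreasing.
    for n_trees, err in oob_error:
        print(n_trees, round(err, 4))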
Example
• 4,718 genes measured on tissue samples from 349 patients.
• Each gene's expression level is measured in each sample; these are the predictors.
• Each of the patient samples has a qualitative label with 15
different levels: either normal or one of 14 different types of
cancer.
Why?
Because:
At each split, the chance that the relevant variables will be selected
can be small.
For example, with 3 relevant and 100 noise variables, the probability
that any of the relevant variables is selected at a given split is only
~0.25
[Figure: probability that a relevant variable is selected at a split.]
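The ~0.25 figure can be checked directly, assuming that m ≈ √p ≈ 10 candidate variables are drawn at random at each split (the usual classification default; this value of m is an assumption, the slide does not state it):

    from math import comb, isqrt

    relevant, noise = 3, 100
    p = relevant + noise              # 103 variables in total
    m = isqrt(p)                      # assumed m ≈ sqrt(p) = 10 candidates per split

    # P(none of the 3 relevant variables is among the m candidates)
    p_none = comb(noise, m) / comb(p, m)
    print("P(a relevant variable is available at a split) ≈", round(1 - p_none, 3))
    # prints ≈ 0.266, in line with the ~0.25 quoted above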
Can RF overfit?
Random forests “cannot overfit” the data with respect to the
number of trees.
Why?