Statsp 6
Statistical Methods for Bioinformatics
Today
1 Regression tree
2 Classification tree
3 Ensemble methods:
  1 Bagging
  2 Random Forests
  3 Boosting
Even more flexible models
A default GAM does not inherently incorporate interactions
between variables, though they can be included.
Another way to gain flexibility is to model interactions between
variables directly.
Candidates include decision trees, Random Forests, and SVMs,
among others.
Trees are very broadly used
Tree-Based Methods
Building a tree
Procedure
1 Split the predictor space where a split achieves the biggest
drop in RSS.
2 Then split one of the two new regions using the same criterion.
3 Continue until a stopping criterion is reached.
This process can overfit the data if splitting continues until the
data in each region become scarce.
Smaller trees tend to have lower variance at the cost of a bit
more bias.
A hard limit on the growth of the tree is often sub-optimal,
however: stopping early may prevent finding very good splits
deeper in the tree.
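To make the splitting criterion concrete, here is a minimal Python sketch (not from the slides; the function name best_split is hypothetical) that searches for the single cut point of one numeric predictor giving the largest drop in RSS:

import numpy as np

def best_split(x, y):
    # Greedy search over all midpoints of one numeric predictor x,
    # keeping the cut that minimizes the summed RSS of the two regions.
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    best_rss, best_cut = np.inf, None
    for i in range(1, len(x_s)):
        left, right = y_s[:i], y_s[i:]
        # RSS of a region: squared deviations from the region mean
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_cut = rss, (x_s[i - 1] + x_s[i]) / 2
    return best_cut, best_rss

A full regression tree applies this search over all predictors and then recursively to each new region.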
Pruning a Tree
Example: Baseball Players’ Salaries
Trees vs Linear Model: classification example
Classification tree
Tree recap
Ensemble methods
Definition
Ensemble methods combine multiple instances of learning
algorithms into a single prediction.
Goal
Improve predictive performance over any of the constituent
instances.
Especially useful when the individual models have high variance
and overfit.
Averaging stabilizes the predictions of variable models.
In statistics and machine learning, ensemble methods use multiple
learning algorithms to obtain better predictive performance than
could be obtained from any of the constituent learning algorithms.
e.g. Bagging: a general-purpose procedure for reducing the
variance of a statistical learning method; we introduce it here
because it is particularly useful and frequently used in the
context of decision trees.
An important player was Leo Breiman (who proposed, among
other things, Random Forests), a man who remained highly
creative to an advanced age.
Bagging
Procedure
1 Produce several identically sized training datasets by sampling
with replacement (bootstrap samples).
2 Fit the model (e.g. a tree) to each bootstrap sample.
3 Average the predictions (regression) or take the majority vote
(classification).
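A minimal sketch of the procedure in Python, assuming scikit-learn is available (names such as bagged_trees are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, seed=0):
    # Steps 1-2: fit one deep tree per bootstrap sample of the data
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Step 3: average the B predictions (majority vote for classification)
    return np.mean([t.predict(X) for t in trees], axis=0)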
Bagging Performance Measurement
Cross-Validation! or...
Out-of-Bag Error Estimation
On average, each bagged tree uses about two-thirds of the
observations.
The remaining one-third can be used to evaluate performance:
the out-of-bag (OOB) observations.
The response for an observation can be estimated with the
trees for which it was not selected during training.
Average those predictions, or take a majority vote, then estimate
the RSS or the classification error.
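A sketch of the OOB idea for bagged regression trees (illustrative Python, not the slides' code):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_mse(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum = np.zeros(n)   # accumulated OOB predictions per observation
    n_preds = np.zeros(n)    # how many trees left each observation out
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)     # ~1/3 left out on average
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        pred_sum[oob] += tree.predict(X[oob])
        n_preds[oob] += 1
    seen = n_preds > 0
    oob_pred = pred_sum[seen] / n_preds[seen]     # average over OOB trees
    return np.mean((y[seen] - oob_pred) ** 2)     # OOB error estimate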
Random Forests
As in bagging, we build several decision trees on bootstrapped
training samples
Random forests improve over bagged trees by decorrelating
the trees
This makes the trees differ, exploring variables beyond the
strongest predictors.
Averaging highly correlated quantities reduces variance less
than averaging many uncorrelated quantities
Each time a split in a tree is considered, a random selection of
m predictors is chosen as split candidates from the full set of
p predictors. The split is allowed to use only one of those m
predictors
A new selection of m predictors is taken at each split, and
typically we choose m ≈ √p: the number of predictors
considered at each split is approximately equal to the square
root of the total number of predictors (4 out of the 13 for the
Heart data).
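In scikit-learn the essential difference shows up in max_features; a sketch, with X_train and y_train as hypothetical training data:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: every split may consider all p predictors
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500)

# Random forest: each split draws a fresh random subset of m ≈ √p predictors
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")

# bag.fit(X_train, y_train); rf.fit(X_train, y_train)  # hypothetical data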
Random Forests in the context of Bioinformatics
The heart data
Boosting
Tuning features:
Number of trees B in the ensemble (select with CV; overfitting
can occur with too many trees)
Shrinkage parameter λ (the speed of learning; its value interacts
with the required number of trees)
Depth of the individual trees (often a depth of 1 or 2)
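A sketch of tuning these three knobs jointly by cross-validation in scikit-learn (X_train and y_train are hypothetical):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={
        "n_estimators": [100, 500, 1000],     # number of trees B
        "learning_rate": [0.001, 0.01, 0.1],  # shrinkage parameter λ
        "max_depth": [1, 2],                  # depth of the individual trees
    },
    cv=5,
)
# grid.fit(X_train, y_train)  # a small λ typically needs a larger B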
AdaBoost: the adaptive booster
Algorithm tweaks
Use of a weighted error function:
Weights are given to the data points: at every iteration t, a weak
classifier is trained with weights according to the distribution Dt.
Dt,i is proportional to the error for sample i at the current
boosting iteration; a high weight corresponds to a high error.
Instead of f̂(x) = ∑_{b=1}^{B} λ f̂ᵇ(x), λ is replaced by adaptive
weights αb, which are inversely related to the error rate of the
b-th weak learner.
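A minimal AdaBoost sketch for labels y ∈ {−1, +1} with decision stumps (illustrative Python; not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)            # D_1: start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = max(D[pred != y].sum(), 1e-10)    # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)   # large when the error is small
        D *= np.exp(-alpha * y * pred)          # up-weight misclassified points
        D /= D.sum()                            # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))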
AdaBoost
αt is given by the negative logit of the error rate:
αt = ½ ln((1 − εt)/εt), which is large when the weighted error εt
is small.
Dt(i) scales the impact of a point during learning: the standard
update is Dt+1(i) ∝ Dt(i) · exp(−αt yi ht(xi)), which raises the
weight of points the current weak learner ht misclassifies.
AdaBoost tends to resist overfitting
Often a well-performing classifier. The individual models can all
be poor/weak, but as long as each is better than random, the
final model will converge to a "strong" learner.
Can be sensitive to noisy data and outliers.
In particular cases it can resist overfitting.
[Figure: training and test percent error rates obtained using
boosting on an OCR dataset with C4.5 as the base learner; the top
and bottom curves are test and training error, respectively. From
"Explaining AdaBoost" by R. E. Schapire.]
What you should learn from this chapter
To do:
Labs of chapter 8
Work through the provided walk-through for the SA heart data.
Make an artificial dataset where you explicitly add an
interaction between variables. The number of observations
and the nature and strength of the interaction are important
variables. Compare a boosting model to a linear model with
an interaction term and measure the performance. Study and
describe how the interaction is modelled by the set of trees
(a starting sketch follows after this list).
For the vd Vijver dataset of class 3: can you improve
predictive performance with trees?
Evaluate performance for a classification tree, bagged
classification trees, a random forest, and classification trees
with boosting.
Compare the variable importance plots for simple bagging and
for Random Forests.
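As a starting point for the interaction exercise, a sketch in Python with scikit-learn (the sample size, noise level, and interaction strength are the knobs the exercise asks you to vary):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500                                  # number of observations: worth varying
X = rng.normal(size=(n, 2))
# The 2.0 coefficient sets the strength of the x1*x2 interaction
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear model WITH the explicit interaction term x1*x2
Xi_tr = np.column_stack([X_tr, X_tr[:, 0] * X_tr[:, 1]])
Xi_te = np.column_stack([X_te, X_te[:, 0] * X_te[:, 1]])
lin = LinearRegression().fit(Xi_tr, y_tr)

# Boosted trees must discover the interaction; depth >= 2 is needed to model it
boost = GradientBoostingRegressor(max_depth=2, n_estimators=500, learning_rate=0.05)
boost.fit(X_tr, y_tr)

print("linear + interaction MSE:", np.mean((lin.predict(Xi_te) - y_te) ** 2))
print("boosting MSE:", np.mean((boost.predict(X_te) - y_te) ** 2))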