
Statistical Methods for Bioinformatics

II-5: Trees, Bagging and Boosting

Today

1 Regression tree
2 Classification tree
3 Ensemble methods:
  1 Bagging
  2 Random Forests
  3 Boosting

Even more flexible models
A default GAM does not inherently incorporate interactions
between variables, though they can be added.
Another form of flexibility is to focus on interactions between
variables.
One can consider decision trees, Random Forests, SVMs, and so on.

Trees are very broadly used

Systematically structuring knowledge (Gene Ontology)
Phylogenetic trees
Many data structures, e.g. a directory structure
Decision trees as a procedure, e.g. in clinical practice

Tree-Based Methods

Basic tree approaches are simple and useful for interpretation.
They progressively stratify or segment the predictor space into
regions.
They readily exploit interactions between variables.

Building a tree

1 We divide the predictor space — that is, the set of possible
values for X1, X2, ..., Xp — into J distinct and non-overlapping
regions R1, R2, ..., RJ.
2 For every observation that falls into region Rj, we make the
same prediction, which is simply the mean of the response
values of the training observations in Rj.
3 The regions are high-dimensional rectangles (boxes).
4 The goal is to minimize the RSS:
$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$
(a search for the best single split is sketched below).
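To make the criterion concrete, here is a minimal sketch (not from the
course material) of the greedy search for a single best split, i.e. the
(predictor, cutpoint) pair minimizing the RSS above; the synthetic data
and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two predictors X1, X2 (assumed data)
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(0, 1, 200)

def rss(values):
    # RSS of a region when predicting its mean; an empty region contributes 0
    return ((values - values.mean()) ** 2).sum() if len(values) else 0.0

def best_split(X, y):
    # Try every predictor j and every observed cutpoint s; keep the split
    # R1 = {X_j < s}, R2 = {X_j >= s} with the smallest total RSS.
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            total = rss(y[X[:, j] < s]) + rss(y[X[:, j] >= s])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

j, s, value = best_split(X, y)
print(f"split on X{j + 1} at {s:.2f} (RSS after split: {value:.1f})")
```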

Building a tree

Procedure
1 Split the predictor space where the biggest drop in RSS is
achieved.
2 Then split one of the two new regions using the same criterion.
3 Continue until some stopping criterion is reached (see the
sketch below).

This process can overfit the data if divisions continue until the
data become scarce.
Smaller trees tend to have less variance at the cost of a bit
more bias.
A strong limit on the growth of the tree is often sub-optimal,
however: stopping early may prevent finding very good splits
deeper in the tree.
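A minimal sketch of such stopping criteria using scikit-learn's regression
tree; the synthetic dataset and the particular thresholds are assumptions
chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Unrestricted growth: splitting continues until leaves are (nearly) pure,
# which tends to overfit.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)

# Growth limited by stopping criteria: less variance, a bit more bias.
shallow = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X, y)

print("depths:", deep.get_depth(), "vs", shallow.get_depth())
```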

Pruning a Tree

The strategy of choice is to grow a large tree and then prune it
back.
The branches that give the smallest drop in RSS for their
number of splits are removed first. This is formalized as
minimizing:

$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$

|T| represents the terminal node count.
α is a non-negative tuning parameter chosen with
cross-validation (a sketch follows below).
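A sketch of this cost-complexity pruning with scikit-learn, which exposes
an α-penalty of this form as `ccp_alpha`; the synthetic data and the
5-fold CV choice are assumptions, not the lecture's baseball example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow a large tree and compute the alphas along its pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Choose alpha by cross-validation, as on the slide.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X_tr, y_tr, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)
print("chosen alpha:", best_alpha, "| terminal nodes |T|:", pruned.get_n_leaves())
```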

Example: Baseball Players’ Salaries

The minimum cross-validation error occurs at a tree size of 3.

Trees vs Linear Model: classification example

Classification tree

Same principle as the regression tree.
An intuitive optimization function is to take, for every box, the
most common class and count all examples not of this class as
errors: $E = 1 - \max_k(\hat{p}_{mk})$, with $\hat{p}_{mk}$ the proportion of
observations in the m-th box that belong to the k-th class.
The classification error above is not very sensitive (many models
have very similar scores), so we need something else:
a different cost function that measures the purity of the nodes
(both are computed in the sketch below).
Gini index: $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
Cross-entropy: $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log(\hat{p}_{mk})$
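A small worked example computing the three node measures for a single
region m with K = 3 classes; the class proportions are assumed numbers,
not course data.

```python
import numpy as np

p_mk = np.array([0.7, 0.2, 0.1])              # assumed class proportions in region m

class_error = 1 - p_mk.max()                   # E = 1 - max_k p_mk        -> 0.30
gini = np.sum(p_mk * (1 - p_mk))               # G = sum_k p_mk (1 - p_mk) -> 0.46
cross_entropy = -np.sum(p_mk * np.log(p_mk))   # D = -sum_k p_mk log p_mk  -> ~0.80

print(class_error, gini, cross_entropy)
```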

Tree recap

A transparent and easy-to-understand method.
Plotting and interpretation are easy.
Naturally incorporates interactions between variables.
Naturally incorporates qualitative predictors.
But single trees tend not to perform very well for most datasets.

Ensemble methods

From weak to strong


Can we combine multiple “weak” learning models to make one
“strong” learning model?

Definition
Ensemble methods combine multiple instances of learning
algorithms for predictions.

Goal
Improve predictive performance over any of the constituent
instances.
Especially useful when individual models have high variance and
overlearn.
Averaging stabilizes prediction performance for such variable
models.
Ensemble methods

In this class, two methods represent the family:

1 Bagging, with Random Forests as a variant
2 Boosting

Both are general-purpose procedures for reducing the variance of a
statistical learning method, and both are particularly useful in the
context of decision trees.

Ensemble methods
In statistics and machine learning, ensemble methods use multiple
learning algorithms to obtain better predictive performance than
could be obtained from any of the constituent learning algorithms.
e.g. Bagging: a general-purpose procedure for reducing the
variance of a statistical learning method; we introduce it here
because it is particularly useful and frequently used in the
context of decision trees.
An important player was Leo Breiman (who proposed, among
other things, Random Forests) and who remained very creative
into advanced age.

Bagging

Bagging stands for “Bootstrap aggregating”

Procedure
1 Produce several identically sized training datasets by sampling
with replacement (bootstrap)
2 Train a model on each training set using the same technique
3 Combine the individual models into a single predictor
(sketched below):
average the predictions for regression
take a majority vote for classification
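A minimal sketch of the three steps written out by hand for a regression
setting; the synthetic data and B = 100 bootstrap rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=1)
rng = np.random.default_rng(1)

B = 100
all_preds = np.zeros((B, len(y)))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))          # 1. bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # 2. same technique on each set
    all_preds[b] = tree.predict(X)

y_bagged = all_preds.mean(axis=0)                       # 3. average the predictions
print("bagged training RSS:", np.sum((y - y_bagged) ** 2))
```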

Bagging

From He, Chaney, Schleiss, Sheffield (2016), Spatial downscaling
of precipitation using adaptable random forest.

Bagging Performance Measurement

Cross-validation! Or...
Out-of-bag error estimation:
On average, each bagged tree uses about two-thirds of the
observations.
The remaining one-third (the out-of-bag (OOB) observations)
can be used to evaluate performance.
The response for an observation can be estimated with the
trees for which it was not selected for learning.
Average those predictions, or take a majority vote, then estimate
the RSS or classification error (see the sketch below).
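A sketch of OOB evaluation using scikit-learn's built-in `oob_score` for
bagged classification trees (the default base learner of `BaggingClassifier`
is a decision tree); the synthetic dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=200,
    oob_score=True,      # score each observation with the trees that did not train on it
    random_state=0,
).fit(X, y)

print("OOB accuracy:", bag.oob_score_)
```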

Random Forests
As in bagging, we build several decision trees on bootstrapped
training samples.
Random forests improve over bagged trees by decorrelating
the trees.
This makes the trees differ, exploring variables beyond the
strongest predictors.
Averaging highly correlated quantities reduces variance less
than averaging many uncorrelated quantities.
Each time a split in a tree is considered, a random selection of
m predictors is chosen as split candidates from the full set of
p predictors. The split is allowed to use only one of those m
predictors.
A new selection of m predictors is taken at each split, and
typically we choose $m \approx \sqrt{p}$ — the number of predictors
considered at each split is approximately equal to the square
root of the total number of predictors (4 out of the 13 for the
Heart data; see the sketch below).
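A minimal sketch of the m ≈ √p rule with scikit-learn's random forest; the
synthetic 13-predictor dataset merely stands in for the Heart data
mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=13, n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # consider m = sqrt(p) randomly chosen predictors per split
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```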
Random Forests in the context of Bioinformatics

A good and popular predictor.
Works well with multiple correlated variables.
Suitable for high-dimensional datasets.
It can yield an increase in predictive power at the cost of
transparency.

Note
Bagging and Random Forests don’t overlearn with more trained
trees! But the stabilizing effect is normally achieved quickly,
leaving hardly any benefit from adding trees beyond a certain
level.

The heart data

Dotted line shows the test error for a single tree.


Variable Importance to interpret Tree Ensembles
Variable importance measures the drop in the performance
measure accumulated over that variable’s tree splits:
Defined per tree, then averaged over the ensemble (see the
sketch below)
Regression trees: RSS drop
Classification trees: Gini index / cross-entropy drop
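A sketch of impurity-based variable importance averaged over a random
forest; the dataset and feature names are assumptions (scikit-learn's
`feature_importances_` reports the normalized Gini-impurity drop per
variable).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity drop per variable, averaged over the trees of the ensemble,
# printed from most to least important.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```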

Boosting

A popular ensemble method
Progressive (or slow) learning
Successively learn and then combine multiple “weak” learning
models to make one “strong” learning model
Later models focus on the unexplained variation by weighting
the data
Again a very general meta-procedure that works beyond just
trees

Boosting

Tuning features (see the sketch below):
Number of trees in the ensemble (select with CV;
overlearning can occur)
Shrinkage parameter λ (speed of learning; its value interacts
with the required number of trees)
Depth of the individual trees (often a depth of 1 or 2)
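A sketch of tuning these three features with cross-validation, using
scikit-learn's gradient boosting as a stand-in for the boosting procedure
of the lecture; the grid values and synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 500, 1000],  # number of trees (CV guards against overlearning)
        "learning_rate": [0.01, 0.1],      # shrinkage parameter lambda
        "max_depth": [1, 2],               # depth of the individual trees
    },
    cv=5,
).fit(X, y)

print(grid.best_params_)
```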

AdaBoost: the adaptive booster

There exist many variants and flavors of boosting; AdaBoost
is a popular choice.
Published by Freund and Schapire in 1997, Gödel Prize 2003.
An algorithm for classification, slightly different from the general
“Boosting” above.

Algorithm tweaks
Use of a weighted error function:
Weights are given to the datapoints, and at every iteration t a
weak classifier is trained with weights according to Dt.
Dt,i is proportional to the error for sample i at the current
boosting iteration: a high weight corresponds to a high error.
Instead of $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$, λ is replaced by adaptive
weights $\alpha_b$, which are inversely proportional to the error rate.
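A minimal sketch of AdaBoost in scikit-learn (its default weak learner is
a depth-1 tree, a "stump"); the synthetic dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=1.0, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```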

AdaBoost

αt is given by the (halved) negative logit of the error rate εt:
$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

1 The weight is 0 when the error rate is 0.5
2 It increases/decreases exponentially as the error rate
approaches the bounds 0 (strong predictor) and 1 (inverse
predictor)
Dt(i) scales the impact of a point during learning:

$D_{t+1}(i) = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

Zt normalizes the weights so they behave like probabilities
(range 0-1, summing to 1).
$y_i h_t(x_i)$ is 1 when sample i is correctly classified, -1 when it is
misclassified.
Note how the exponential factor scales the weight of misclassified
samples by more than 1, and by less than 1 for correct answers
(a worked round is sketched below).
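A worked sketch of one boosting round with the two formulas above (αt
from the weighted error, then the Dt update); the five labels and
predictions are assumed numbers for illustration.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])            # true labels (assumed)
h = np.array([1, -1, -1, -1, 1])           # weak learner h_t: one mistake (sample 2)
D = np.full(len(y), 1 / len(y))            # current weights D_t (uniform at t = 1)

eps = D[y != h].sum()                       # weighted error rate of h_t
alpha = 0.5 * np.log((1 - eps) / eps)       # alpha_t: half the negative logit of eps

D_next = D * np.exp(-alpha * y * h)         # misclassified points are scaled up
D_next /= D_next.sum()                      # Z_t: renormalize so the weights sum to 1

print("eps:", eps, "alpha:", round(alpha, 3), "D_{t+1}:", D_next.round(3))
```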

AdaBoost tends to resist over-learning
Often a well-performing classifier. The individual models can all
be poor/weak, but as long as they are better than random, the
final model will converge to a “strong” learner.
Can be sensitive to noisy data and outliers.
In particular cases it can resist over-learning.

The training and test percent error rates obtained using boosting on an OCR
dataset with C4.5 as the base learner. The top and bottom curves are test and
training error, respectively. From Explaining AdaBoost by R. E. Schapire.
What you should learn from this chapter

Basic principles of Regression and Classification Trees


learning and pruning
performance measures
Bagging (incl. definitions, rationale)
Random Forests
Boosting (incl. definitions, rationale)
Variable Importance for trees

To do:

Labs of chapter 8
Use the provided walk-through for the SA heart data
Make an artificial dataset where you explicitly add an
interaction between variables. The number of observations
and the nature and strength of the interaction are important
variables. Compare a boosting model to a linear model with
an interaction term and measure the performance. Study and
describe how the interaction is modelled by the set of trees.
For the vd Vijver dataset of class 3: can you improve
predictive performance with trees?
Evaluate performance for a classification tree, bagging of
classification trees, a random forest, and classification trees
with boosting.
Compare the variable importance plots for simple bagging and
for Random Forests.