
Statistical Methods for Bioinformatics

II-5: Trees, Bagging and Boosting

Today

1 Regression tree
2 Classification tree
3 Ensemble methods:
  1 Bagging
  2 Random Forests
  3 Boosting

Even more flexible models
A default GAM does not inherently incorporate interactions
between variables, though they can be added.
Another form of flexibility is to focus on interactions between
variables.
One can consider decision trees, Random Forests, SVMs, and so on.

Trees are very broadly used

Systematically structuring knowledge (Gene Ontology)
Phylogenetic trees
Many data structures, e.g. a directory structure
Decision trees as a procedure, e.g. in clinical practice

Tree-Based Methods

Basic tree approaches are simple and useful for interpretation.
They progressively stratify or segment the predictor space into
regions.
They readily exploit interactions between variables.

Building a tree

1 We divide the predictor space — that is, the set of possible
values for X1, X2, ..., Xp — into J distinct and non-overlapping
regions R1, R2, ..., RJ.
2 For every observation that falls into region Rj, we make the
same prediction, which is simply the mean of the response
values of the training observations in Rj.
3 The regions are high-dimensional rectangles (boxes).
4 The goal is to minimize the RSS:
$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$
(a search for the best single split is sketched below).
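To make the criterion concrete, here is a minimal sketch (not from the
course material) of the greedy search for a single best split, i.e. the
(predictor, cutpoint) pair minimizing the RSS above; the synthetic data
and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two predictors X1, X2 (assumed data)
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(0, 1, 200)

def rss(values):
    # RSS of a region when predicting its mean; an empty region contributes 0
    return ((values - values.mean()) ** 2).sum() if len(values) else 0.0

def best_split(X, y):
    # Try every predictor j and every observed cutpoint s; keep the split
    # R1 = {X_j < s}, R2 = {X_j >= s} with the smallest total RSS.
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            total = rss(y[X[:, j] < s]) + rss(y[X[:, j] >= s])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

j, s, value = best_split(X, y)
print(f"split on X{j + 1} at {s:.2f} (RSS after split: {value:.1f})")
```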

Building a tree

Procedure
1 Split the predictor space where the biggest drop in RSS is
achieved.
2 Then split one of the two new regions using the same criterion.
3 Continue until some stopping criterion is reached (see the
sketch below).

This process can overfit the data if divisions continue until the
data become scarce.
Smaller trees tend to have less variance at the cost of a bit
more bias.
A strong limit on the growth of the tree is often sub-optimal,
however: stopping early may prevent finding very good splits
deeper in the tree.
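A minimal sketch of such stopping criteria using scikit-learn's regression
tree; the synthetic dataset and the particular thresholds are assumptions
chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Unrestricted growth: splitting continues until leaves are (nearly) pure,
# which tends to overfit.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)

# Growth limited by stopping criteria: less variance, a bit more bias.
shallow = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X, y)

print("depths:", deep.get_depth(), "vs", shallow.get_depth())
```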

Pruning a Tree

The strategy of choice is to grow a large tree and then prune it
back.
The branches that give the smallest drop in RSS for their
number of splits are removed first. This is formalized as
minimizing:

$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$

|T| represents the terminal node count.
α is a non-negative tuning parameter chosen with
cross-validation (a sketch follows below).
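A sketch of this cost-complexity pruning with scikit-learn, which exposes
an α-penalty of this form as `ccp_alpha`; the synthetic data and the
5-fold CV choice are assumptions, not the lecture's baseball example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow a large tree and compute the alphas along its pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Choose alpha by cross-validation, as on the slide.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X_tr, y_tr, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)
print("chosen alpha:", best_alpha, "| terminal nodes |T|:", pruned.get_n_leaves())
```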

Example: Baseball Players’ Salaries

The minimum cross-validation error occurs at a tree size of 3.

Trees vs Linear Model: classification example

Classification tree

Same principle as the regression tree.
An intuitive optimization function is to take, for every box, the
most common class and count all examples not of this class as
errors: $E = 1 - \max_k(\hat{p}_{mk})$, with $\hat{p}_{mk}$ the proportion of
observations in the m-th box that belong to the k-th class.
The classification error above is not very sensitive (many models
have very similar scores), so we need something else:
a different cost function that measures the purity of the nodes
(both are computed in the sketch below).
Gini index: $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
Cross-entropy: $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log(\hat{p}_{mk})$
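A small worked example computing the three node measures for a single
region m with K = 3 classes; the class proportions are assumed numbers,
not course data.

```python
import numpy as np

p_mk = np.array([0.7, 0.2, 0.1])              # assumed class proportions in region m

class_error = 1 - p_mk.max()                   # E = 1 - max_k p_mk        -> 0.30
gini = np.sum(p_mk * (1 - p_mk))               # G = sum_k p_mk (1 - p_mk) -> 0.46
cross_entropy = -np.sum(p_mk * np.log(p_mk))   # D = -sum_k p_mk log p_mk  -> ~0.80

print(class_error, gini, cross_entropy)
```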

Tree recap

A transparent and easy-to-understand method.
Plotting and interpretation are easy.
Naturally incorporates interactions between variables.
Naturally incorporates qualitative predictors.
But single trees tend not to perform very well for most datasets.

Ensemble methods

From weak to strong


Can we combine multiple “weak” learning models to make one
“strong” learning model?

Definition
Ensemble methods combine multiple instances of learning
algorithms for predictions.

Goal
Improve predictive performance over any of the constituent
instances.
Especially useful when individual models have high variance and
overlearn.
Averaging stabilizes prediction performance for such variable
models.
Ensemble methods

In this class, two methods represent the family:

1 Bagging, with Random Forests as a variant
2 Boosting

Both are general-purpose procedures for reducing the variance of a
statistical learning method, and both are particularly useful in the
context of decision trees.

Ensemble methods
In statistics and machine learning, ensemble methods use multiple
learning algorithms to obtain better predictive performance than
could be obtained from any of the constituent learning algorithms.
e.g. Bagging: a general-purpose procedure for reducing the
variance of a statistical learning method; we introduce it here
because it is particularly useful and frequently used in the
context of decision trees.
An important player was Leo Breiman (who proposed, among
other things, Random Forests) and who remained very creative
into advanced age.

Bagging

Bagging stands for “Bootstrap aggregating”

Procedure
1 Produce several identically sized training datasets by sampling
with replacement (bootstrap)
2 Train a model on each training set using the same technique
3 Combine the individual models into a single predictor
(sketched below):
average the predictions for regression
take a majority vote for classification
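A minimal sketch of the three steps written out by hand for a regression
setting; the synthetic data and B = 100 bootstrap rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=1)
rng = np.random.default_rng(1)

B = 100
all_preds = np.zeros((B, len(y)))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))          # 1. bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # 2. same technique on each set
    all_preds[b] = tree.predict(X)

y_bagged = all_preds.mean(axis=0)                       # 3. average the predictions
print("bagged training RSS:", np.sum((y - y_bagged) ** 2))
```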

Bagging

From He, Chaney, Schleiss, Sheffield (2016), Spatial downscaling
of precipitation using adaptable random forest.

Bagging Performance Measurement

Cross-validation! Or...
Out-of-bag error estimation:
On average, each bagged tree uses about two-thirds of the
observations.
The remaining one-third (the out-of-bag (OOB) observations)
can be used to evaluate performance.
The response for an observation can be estimated with the
trees for which it was not selected for learning.
Average those predictions, or take a majority vote, then estimate
the RSS or classification error (see the sketch below).
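A sketch of OOB evaluation using scikit-learn's built-in `oob_score` for
bagged classification trees (the default base learner of `BaggingClassifier`
is a decision tree); the synthetic dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=200,
    oob_score=True,      # score each observation with the trees that did not train on it
    random_state=0,
).fit(X, y)

print("OOB accuracy:", bag.oob_score_)
```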

Random Forests
As in bagging, we build several decision trees on bootstrapped
training samples.
Random forests improve over bagged trees by decorrelating
the trees.
This makes the trees differ, exploring variables beyond the
strongest predictors.
Averaging highly correlated quantities reduces variance less
than averaging many uncorrelated quantities.
Each time a split in a tree is considered, a random selection of
m predictors is chosen as split candidates from the full set of
p predictors. The split is allowed to use only one of those m
predictors.
A new selection of m predictors is taken at each split, and
typically we choose $m \approx \sqrt{p}$ — the number of predictors
considered at each split is approximately equal to the square
root of the total number of predictors (4 out of the 13 for the
Heart data; see the sketch below).
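A minimal sketch of the m ≈ √p rule with scikit-learn's random forest; the
synthetic 13-predictor dataset merely stands in for the Heart data
mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=13, n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # consider m = sqrt(p) randomly chosen predictors per split
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```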
Random Forests in the context of Bioinformatics

A good and popular predictor.
Works well with multiple correlated variables.
Suitable for high-dimensional datasets.
It can yield an increase in predictive power at the cost of
transparency.

Note
Bagging and Random Forests don’t overlearn with more trained
trees! But the stabilizing effect is normally achieved quickly,
leaving hardly any benefit from adding trees beyond a certain
level.

The heart data

Dotted line shows the test error for a single tree.


Variable Importance to interpret Tree Ensembles
Variable importance measures the drop in the performance
measure accumulated over that variable’s tree splits:
Defined per tree, then averaged over the ensemble (see the
sketch below)
Regression trees: RSS drop
Classification trees: Gini index / cross-entropy drop
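A sketch of impurity-based variable importance averaged over a random
forest; the dataset and feature names are assumptions (scikit-learn's
`feature_importances_` reports the normalized Gini-impurity drop per
variable).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity drop per variable, averaged over the trees of the ensemble,
# printed from most to least important.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```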

Boosting

A popular ensemble method
Progressive (or slow) learning
Successively learn and then combine multiple “weak” learning
models to make one “strong” learning model
Later models focus on the unexplained variation by weighting
the data
Again a very general meta-procedure that works beyond just
trees

Boosting

Tuning features (see the sketch below):
Number of trees in the ensemble (select with CV;
overlearning can occur)
Shrinkage parameter λ (speed of learning; its value interacts
with the required number of trees)
Depth of the individual trees (often a depth of 1 or 2)
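A sketch of tuning these three features with cross-validation, using
scikit-learn's gradient boosting as a stand-in for the boosting procedure
of the lecture; the grid values and synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 500, 1000],  # number of trees (CV guards against overlearning)
        "learning_rate": [0.01, 0.1],      # shrinkage parameter lambda
        "max_depth": [1, 2],               # depth of the individual trees
    },
    cv=5,
).fit(X, y)

print(grid.best_params_)
```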

AdaBoost: the adaptive booster

There exist many variants and flavors of boosting; AdaBoost
is a popular choice.
Published by Freund and Schapire in 1997, Gödel Prize 2003.
An algorithm for classification, slightly different from the general
“Boosting” above.

Algorithm tweaks
Use of a weighted error function:
Weights are given to the datapoints, and at every iteration t a
weak classifier is trained with weights according to Dt.
Dt,i is proportional to the error for sample i at the current
boosting iteration: a high weight corresponds to a high error.
Instead of $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$, λ is replaced by adaptive
weights $\alpha_b$, which are inversely proportional to the error rate.
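A minimal sketch of AdaBoost in scikit-learn (its default weak learner is
a depth-1 tree, a "stump"); the synthetic dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=1.0, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```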

AdaBoost

αt is given by the (halved) negative logit of the error rate εt:
$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

1 The weight is 0 when the error rate is 0.5
2 It increases/decreases exponentially as the error rate
approaches the bounds 0 (strong predictor) and 1 (inverse
predictor)
Dt(i) scales the impact of a point during learning:

$D_{t+1}(i) = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

Zt normalizes the weights so they behave like probabilities
(range 0-1, summing to 1).
$y_i h_t(x_i)$ is 1 when sample i is correctly classified, -1 when it is
misclassified.
Note how the exponential factor scales the weight of misclassified
samples by more than 1, and by less than 1 for correct answers
(a worked round is sketched below).
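A worked sketch of one boosting round with the two formulas above (αt
from the weighted error, then the Dt update); the five labels and
predictions are assumed numbers for illustration.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])            # true labels (assumed)
h = np.array([1, -1, -1, -1, 1])           # weak learner h_t: one mistake (sample 2)
D = np.full(len(y), 1 / len(y))            # current weights D_t (uniform at t = 1)

eps = D[y != h].sum()                       # weighted error rate of h_t
alpha = 0.5 * np.log((1 - eps) / eps)       # alpha_t: half the negative logit of eps

D_next = D * np.exp(-alpha * y * h)         # misclassified points are scaled up
D_next /= D_next.sum()                      # Z_t: renormalize so the weights sum to 1

print("eps:", eps, "alpha:", round(alpha, 3), "D_{t+1}:", D_next.round(3))
```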

AdaBoost tends to resist over-learning
Often a well-performing classifier. The individual models can all
be poor/weak, but as long as they are better than random, the
final model will converge to a “strong” learner.
Can be sensitive to noisy data and outliers.
In particular cases it can resist over-learning.

The training and test percent error rates obtained using boosting on an OCR
dataset with C4.5 as the base learner. The top and bottom curves are test and
training error, respectively. From Explaining AdaBoost by R. E. Schapire.
What you should learn from this chapter

Basic principles of Regression and Classification Trees


learning and pruning
performance measures
Bagging (incl. definitions, rationale)
Random Forests
Boosting (incl. definitions, rationale)
Variable Importance for trees

To do:

Labs of chapter 8
Use the provided walk-through for the SA heart data
Make an artificial dataset where you explicitly add an
interaction between variables. The number of observations
and the nature and strength of the interaction are important
variables. Compare a boosting model to a linear model with
an interaction term and measure the performance. Study and
describe how the interaction is modelled by the set of trees.
For the vd Vijver dataset of class 3: can you improve
predictive performance with trees?
Evaluate performance for a classification tree, bagging of
classification trees, a random forest, and classification trees
with boosting.
Compare the variable importance plots for simple bagging and
for Random Forests.