
Cornell CS578: Bagging and Boosting

The document discusses the bias-variance tradeoff in machine learning models. It explains that model loss can be decomposed into noise, bias, and variance, and that models can exhibit high bias (underfitting) or high variance (overfitting). Bagging and boosting are ensemble methods that combat these errors: bagging averages predictions from models trained on bootstrap samples of a dataset to reduce variance without increasing bias, while boosting iteratively reweights training examples to focus on those misclassified by previous models and can reduce bias as well. Both methods can improve performance over a single model.


Bias/Variance Tradeoff

Model Loss (Error)

• Squared loss of model on test case i:
  (Learn(x_i, D) − Truth(x_i))^2
• Expected prediction error:
  E_D[ (Learn(x, D) − Truth(x))^2 ]

Bias/Variance Decomposition

E_D[ (L(x, D) − T(x))^2 ] = Noise^2 + Bias^2 + Variance

• Noise^2 = lower bound on performance
• Bias^2 = (expected error due to model mismatch)^2
• Variance = variation due to train sample and randomization
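The decomposition can be estimated empirically by retraining the same learner on many fresh training samples and examining its predictions at a fixed test point. Below is a minimal Python sketch, not from the lecture; the target function, noise level, and polynomial degrees are illustrative assumptions.

# Sketch: estimate bias^2 and variance of a learner at one test point
# by retraining it on many independent training sets. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.3          # noise in the targets
x_test = 0.5            # fixed test point

def truth(x):
    return np.sin(2 * np.pi * x)

def learn_and_predict(degree, n_train=30):
    """Train one polynomial model on a fresh sample D, predict at x_test."""
    x = rng.uniform(0, 1, n_train)
    y = truth(x) + rng.normal(0, noise_sd, n_train)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x_test)

for degree in (1, 9):   # low-capacity vs. high-capacity model
    preds = np.array([learn_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - truth(x_test)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}  variance = {variance:.4f}")

The degree-1 fit typically shows the larger bias^2 and the degree-9 fit the larger variance, matching the tradeoff described above.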

Bias^2

• Low bias
  – linear regression applied to linear data
  – 2nd degree polynomial applied to quadratic data
  – ANN with many hidden units trained to completion
• High bias
  – constant function
  – linear regression applied to non-linear data
  – ANN with few hidden units applied to non-linear data

Variance

• Low variance
  – constant function
  – model independent of training data
  – model depends on stable measures of data
    • mean
    • median
• High variance
  – high degree polynomial
  – ANN with many hidden units trained to completion

Sources of Variance in Supervised Learning

• noise in targets or input attributes
• bias (model mismatch)
• training sample
• randomness in learning algorithm
  – neural net weight initialization
• randomized subsetting of train set:
  – cross validation, train and early stopping set

Bias/Variance Tradeoff

• (bias^2 + variance) is what counts for prediction
• Often:
  – low bias => high variance
  – low variance => high bias
• Tradeoff:
  – bias^2 vs. variance

Bias/Variance Tradeoff

[Figure: Duda, Hart, Stork, "Pattern Classification", 2nd edition, 2001]

[Figure: Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", 2001]

Reduce Variance Without Increasing Bias

• Averaging reduces variance:
  Var(X̄) = Var(X) / N   (X̄ = average of N independent draws of X)
• Average models to reduce model variance
• One problem:
  – only one train set
  – where do multiple models come from?

Bagging: Bootstrap Aggregation

• Leo Breiman (1994)
• Bootstrap Sample:
  – draw sample of size |D| with replacement from D
• Train L_i on BootstrapSample_i(D)
• Regression: L_bagging = average of the L_i
• Classification: L_bagging = Plurality(L_i)
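A minimal sketch of this recipe in Python, assuming scikit-learn decision trees as the base learner L_i (the slides do not prescribe a particular learner) and NumPy arrays for X and y:

# Sketch: bagging by hand -- bootstrap samples, one model per sample,
# average (regression) or plurality vote (classification) to predict.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=100, seed=0):
    """Train one model per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # draw |D| examples with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Regression: average the base models. (Classification: plurality vote.)"""
    return np.mean([m.predict(X) for m in models], axis=0)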

Bagging

• Best case:
  Var(Bagging(L(x, D))) = Variance(L(x, D)) / N
• In practice:
  – models are correlated, so reduction is smaller than 1/N
  – variance of models trained on fewer training cases is usually somewhat larger
  – stable learning methods have low variance to begin with, so bagging may not help much
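The first "In practice" caveat can be quantified with a standard identity (not on the slide): for N identically distributed models with variance sigma^2 and average pairwise correlation rho,

Var(average of N models) = rho · sigma^2 + (1 − rho) · sigma^2 / N

so the variance only falls toward sigma^2 / N when the models are nearly uncorrelated, and no amount of averaging removes the rho · sigma^2 floor.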
Bagging Results

[Figure: Breiman, "Bagging Predictors", Berkeley Statistics Department TR#421, 1994]

How Many Bootstrap Samples?

More bagging results

[Figures: Breiman, "Bagging Predictors", Berkeley Statistics Department TR#421, 1994]

Bagging with cross validation

• Train neural networks using 4-fold CV
  – Train on 3 folds, earlystop on the fourth
  – At the end you have 4 neural nets
• How to make predictions on new examples?
  – Train a neural network until the mean earlystopping point
  – Average the predictions from the four neural networks

Can Bagging Hurt?

• Each base classifier is trained on less data
  – Only about 63.2% of the data points are in any bootstrap sample
• However the final model has seen all the data
  – On average a point will be in >50% of the bootstrap samples

Reduce Bias^2 and Decrease Variance?

• Bagging reduces variance by averaging
• Bagging has little effect on bias
• Can we average and reduce bias?
• Yes: Boosting
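The 63.2% figure under "Can Bagging Hurt?" comes from the bootstrap itself: each of the |D| draws misses a fixed point x_i with probability 1 − 1/|D|, so

P(x_i appears in a bootstrap sample) = 1 − (1 − 1/|D|)^|D| ≈ 1 − e^(−1) ≈ 0.632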

Boosting

• Freund & Schapire:
  – theory for "weak learners" in late 80's
• Weak Learner: performance on any train set is slightly better than chance prediction
• intended to answer a theoretical question, not as a practical way to improve learning
• tested in mid 90's using not-so-weak learners
• works anyway!

Boosting

• Weight all training samples equally
• Train model on train set
• Compute error of model on train set
• Increase weights on train cases model gets wrong!
• Train new model on re-weighted train set
• Re-compute errors on weighted train set
• Increase weights again on cases model gets wrong
• Repeat until tired (100+ iterations)
• Final model: weighted prediction of each model
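A minimal sketch of this loop in Python, using decision stumps as the base model and the standard AdaBoost weight and vote formulas (the slides describe the loop but not these exact formulas, so treat them as an assumption):

# Sketch: boosting by reweighting (AdaBoost-style).
# Assumes X, y are NumPy arrays with labels y in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(X, y, n_rounds=100):
    n = len(X)
    w = np.full(n, 1.0 / n)                    # weight all training samples equally
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()               # weighted error on the train set
        if err == 0 or err >= 0.5:             # perfect fit, or weak-learner assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight for this round's model
        w *= np.exp(-alpha * y * pred)         # increase weights on cases the model gets wrong
        w /= w.sum()                           # renormalize the weights
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X):
    # Final model: sign of the weighted vote of the individual models
    votes = sum(a * m.predict(X) for a, m in zip(models, alphas))
    return np.sign(votes)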

Boosting

[Equations not reproduced: Initialization, Iteration, Final Model]

Boosting: Initialization

Boosting: Iteration

Boosting: Prediction

Weight updates

• Weights for incorrect instances are multiplied by 1/(2·Error_i)
  – Small train set errors cause weights to grow by several orders of magnitude
• Total weight of misclassified examples is 0.5
• Total weight of correctly classified examples is 0.5

Reweighting vs Resampling

• Example weights might be harder to deal with
  – Some learning methods can't use weights on examples
  – Many common packages don't support weights on the train set
• We can resample instead:
  – Draw a bootstrap sample from the data with the probability of drawing each example proportional to its weight
• Reweighting usually works better but resampling is easier to implement
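To see why both totals on the Weight updates slide are 0.5: if the weighted error at round i is Error_i, the misclassified examples hold total weight Error_i and each is multiplied by 1/(2·Error_i), while (by the usual boosting normalization, implied rather than stated on the slide) each correctly classified weight is multiplied by 1/(2·(1 − Error_i)). Then

misclassified total:         Error_i · 1/(2·Error_i) = 0.5
correctly classified total:  (1 − Error_i) · 1/(2·(1 − Error_i)) = 0.5

This also explains the note about small errors: when Error_i is tiny, the factor 1/(2·Error_i) is huge, so weights can grow by orders of magnitude.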

Boosting Performance

Boosting vs. Bagging

• Bagging doesn't work so well with stable models. Boosting might still help.
• Boosting might hurt performance on noisy datasets. Bagging doesn't have this problem.
• In practice bagging almost always helps.

Boosting vs. Bagging

• On average, boosting helps more than bagging, but it is also more common for boosting to hurt performance.
• The weights grow exponentially. Code must be written carefully (store log of weights, …).
• Bagging is easier to parallelize.

Bagged Decision Trees

 Draw 100 bootstrap samples of data
 Train trees on each sample -> 100 trees
 Un-weighted average prediction of trees

Average prediction:
(0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / # Trees = 0.24
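For comparison, the same recipe is available as an off-the-shelf library call; a minimal sketch assuming scikit-learn, with a synthetic placeholder dataset:

# Sketch: 100 bagged decision trees, un-weighted average of their predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)  # placeholder data
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)    # trees are the default base model
print(bag.predict(X[:1]))                                             # average over the 100 trees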

Random Forests (Bagged Trees++)

 Draw 1000+ bootstrap samples of data
 Draw sample of available attributes at each split
 Train trees on each sample/attribute set -> 1000+ trees
 Un-weighted average prediction of trees
 Marriage made in heaven. Highly under-rated!

Average prediction:
(0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / # Trees = 0.24

Model Averaging

• Almost always helps
• Often easy to do
• Models shouldn't be too similar
• Models should all have pretty good performance (not too many lemons)
• When averaging, favor low bias, high variance
• Models can individually overfit
• Not just in ML
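The Random Forests recipe above (bootstrap samples plus a random subset of attributes at each split) maps directly onto a library implementation; a minimal sketch assuming scikit-learn, with placeholder data and a common max_features choice:

# Sketch: random forest = bagged trees + a random subset of features at each split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)  # placeholder data
forest = RandomForestRegressor(
    n_estimators=1000,     # 1000+ bootstrap samples / trees
    max_features="sqrt",   # attributes sampled at each split
    random_state=0,
).fit(X, y)
print(forest.predict(X[:1]))   # un-weighted average prediction of the trees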

Out of Bag Samples

• With bagging, each model trained on about 63% of training sample
• That means each model does not use 37% of data
• Treat these as test points!
– Backfitting in trees
– Pseudo cross validation
– Early stopping sets
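A minimal Python sketch of the idea, building on the hand-rolled bagging above: each point is scored only by the models whose bootstrap sample left it out, giving a test-set-like error estimate for free (the use of decision trees and squared error here is an illustrative assumption):

# Sketch: out-of-bag (OOB) error -- each point is predicted only by the
# trees whose bootstrap sample did not contain it. Assumes NumPy arrays.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_error(X, y, n_models=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    oob_sum = np.zeros(n)      # running sum of OOB predictions per point
    oob_cnt = np.zeros(n)      # how many models had this point out of bag
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (~63% of points)
        out = np.setdiff1d(np.arange(n), idx)     # the other ~37%: free test points
        m = DecisionTreeRegressor().fit(X[idx], y[idx])
        oob_sum[out] += m.predict(X[out])
        oob_cnt[out] += 1
    seen = oob_cnt > 0
    return np.mean((oob_sum[seen] / oob_cnt[seen] - y[seen]) ** 2)   # OOB mean squared error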
