
Model Ensembles

Instructor: Saravanan Thirumuruganathan


ML Paradigms

1. Build one GREAT model


• Traditional approach
• Logistic regression, KNN, Naïve Bayes, SVM, Decision Trees, ….

2. Build MANY decent models and combine them smartly


• Have become popular recently due to their great empirical performance and
interesting theoretical results
No Free Lunch Theorem
• There is no single machine learning algorithm that performs best for
all possible problems.

• Universal performance: If we average an algorithm's performance across all possible problems, every algorithm will have the same average performance.

• The effectiveness of an algorithm depends on how well it matches the specific problem at hand. This is why domain knowledge is important!
Ensembles and Netflix Prize
• One of the winning teams, BellKor, used an ensemble of 107 models!

“Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a simple technique.”

“We strongly believe that the success of an ensemble approach depends on the ability of its various predictors to expose different complementing aspects of the data. Experience shows that this is very different than optimizing the accuracy of each individual predictor.”

Quotes via Rich Zemel


Strong and Weak Learners
• Strong Learners
• Produce a classifier that is very accurate
• Most of ML is focused on this
• A challenging problem

• Weak Learners
• Produce a classifier that is more accurate than random guessing
• Not hard to build weak learners
Ensemble Learning

1. Build strong learners from weak learners

2. Given a set of base classifiers, build an ensemble such that its accuracy is higher than that of the base learners
Ensemble Learning Design Space

Image from Raymond Mooney


Ensemble Learning
When will ensemble learning work?

Any thoughts on how to combine the classifiers?


Ensemble Learning
• Necessary and sufficient conditions for ensemble learning to work
• Accuracy
• Diversity

• A classifier is accurate if it is better than random guessing

• A set of classifiers is diverse if they make uncorrelated errors


Condorcet's jury theorem
• A theorem from 1785!

• A group of jurors wants to reach a decision by majority vote. Each voter has an independent probability p of being correct

• If p > 1/2, then adding more voters increases the probability that the
majority voting is correct. At the limit, this probability approaches 1

• If p < 1/2, then adding more voters makes things worse. It is better to rely on a single juror
Majority Vote Classifier

Sebastian Raschka STAT 479: FS 2019


Majority Voting Classifier
Why does majority voting work?

Assumptions
• n classifiers
• Each classifier has an accuracy > 0.5
• Errors are uncorrelated

Sebastian Raschka STAT 479: FS 2019


Majority Voting Classifier
The ensemble makes a wrong prediction when k of the n classifiers predict the same wrong class label, where k > n/2

Sebastian Raschka STAT 479: FS 2019
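
As a concrete illustration (my own sketch, not from the slides), the ensemble error under these assumptions is a binomial tail probability. The short Python snippet below computes it for a given base error rate and an odd number of classifiers, and numerically illustrates Condorcet's jury theorem:

from math import comb

def ensemble_error(n_classifiers, base_error):
    """P(majority is wrong) = P(at least n//2 + 1 of n independent
    classifiers, each with error rate base_error, err together)."""
    k_start = n_classifiers // 2 + 1          # strict majority (odd n)
    return sum(
        comb(n_classifiers, k)
        * base_error**k
        * (1 - base_error) ** (n_classifiers - k)
        for k in range(k_start, n_classifiers + 1)
    )

# 11 uncorrelated classifiers, each wrong 25% of the time:
# the ensemble is wrong only ~3.4% of the time.
print(ensemble_error(11, 0.25))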


Base Error vs Ensemble Error

Sebastian Raschka STAT 479: FS 2019


Extensions to Majority Voting
• Majority voting works very well
• Even with weak learners, as the number of classifiers increases, provided their errors are uncorrelated

• What can you do to improve this simple approach?

Sebastian Raschka STAT 479: FS 2019


Extensions to Majority Voting
• Weighted majority voting
• Majority voting gives a weight of 1/n to each classifier
• Give a different weight based on held-out/validation dataset accuracy

• Soft voting
• Also take the output probabilities into account
• Classifiers have to be well calibrated

• Learn the weights using a ML model

Sebastian Raschka STAT 479: FS 2019


Soft Voting

Sebastian Raschka STAT 479: FS 2019
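
A minimal sketch of weighted and soft voting with scikit-learn's VotingClassifier; the dataset, base models, and weights below are illustrative choices, not the setup from the slides:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three reasonably diverse base classifiers.
clf1 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf2 = DecisionTreeClassifier(max_depth=3, random_state=0)
clf3 = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# voting="soft" averages predicted class probabilities;
# `weights` lets more accurate / better-calibrated models count for more.
ensemble = VotingClassifier(
    estimators=[("lr", clf1), ("dt", clf2), ("knn", clf3)],
    voting="soft",
    weights=[2, 1, 1],
)

print(cross_val_score(ensemble, X, y, cv=5).mean())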


Stacking Algorithm

Sebastian Raschka STAT 479: FS 2019


Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.
Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.
Stacking Algorithm

Sebastian Raschka STAT 479: FS 2019


Stacking Algorithm
• What is the problem with this simple algorithm?
Stacking Algorithm

Sebastian Raschka STAT 479: FS 2019


Stacking Algorithm

Sebastian Raschka STAT 479: FS 2019


Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.
Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.
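
A hedged sketch of the stacking idea using scikit-learn's StackingClassifier, which addresses the problem raised above by building the meta-learner's training features from cross-validated (out-of-fold) base-model predictions rather than predictions on the same data the base models were fit on; models and parameters here are illustrative, not the exact setup from the slides:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

# cv=5: the meta-features are out-of-fold predictions, so the
# logistic-regression meta-learner never sees base-model predictions
# made on data those base models were trained on.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

print(cross_val_score(stack, X, y, cv=5).mean())
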
Finding Base Classifiers for Ensembles
• For a good ensemble, the base classifiers should be
• Accurate : have accuracy > 50%
• Diverse: have uncorrelated errors

• Building accurate classifiers is not hard (at least for binary classification)

• How to get diverse base classifiers?


Bagging
• Bootstrap Aggregating : Breiman, L. (1996). Bagging predictors.
Machine learning, 24(2), 123-140.

Sebastian Raschka STAT 479: FS 2019


Bootstrap Sampling

Sebastian Raschka STAT 479: FS 2019
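
A small numpy illustration (my own sketch, not Raschka's code) of bootstrap sampling: draw n row indices with replacement; on average only about 63.2% of the original rows appear in a bootstrap sample, since (1 - 1/n)^n approaches 1/e ≈ 0.368 as n grows:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))          # toy dataset

# Bootstrap sample: n draws with replacement from the n original rows.
idx = rng.integers(0, n, size=n)
X_boot = X[idx]

# Fraction of distinct original rows in the bootstrap sample (~0.632);
# the remaining ~0.368 are the "out-of-bag" rows for this sample.
in_bag = np.unique(idx).size / n
print(in_bag, 1 - (1 - 1 / n) ** n)   # both close to 1 - 1/e ≈ 0.632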


Bagging Classifier

Sebastian Raschka STAT 479: FS 2019


Asymptotic Behavior of Bagging

Sebastian Raschka STAT 479: FS 2019


Bagging and Correlated Trees
• Suppose you have a feature f that is a great discriminator. Other features are good but not as good as f

• So all bagged trees will select f at the top of the tree. The only difference between trees will be in the rest of the sub-tree, which might not differ that much

• Solution?
Random Subspace Method

Training data

Md. Abu Sayed, University of Nevada Reno


Random Subspace Method

[Figure: a test sample classified by the subspace trees; the majority vote gives 66% confidence]

Md. Abu Sayed, University of Nevada Reno
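
One way to apply the random subspace method in practice is scikit-learn's BaggingClassifier with bootstrapping disabled and a feature fraction per model; this is an assumed illustration, not the code behind the figures above:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random subspaces: every model sees all rows (bootstrap=False) but only
# a random half of the features; BaggingClassifier's default base
# estimator is a decision tree.
rsm = BaggingClassifier(
    n_estimators=100,
    bootstrap=False,       # use the full training set for each model
    max_features=0.5,      # random feature subset per model
    random_state=0,
)

print(cross_val_score(rsm, X, y, cv=5).mean())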


Random Forests

Random Forests = bagging with trees + random feature subsets, i.e., the random subspace method, where each tree gets a random subset of the features.

Sebastian Raschka STAT 479: FS 2019


Random Forests

[Figure: Tree 1, Tree 2, …, Tree N vote; the Random Forest prediction is the majority vote]

Md. Abu Sayed, University of Nevada Reno


Simple Random Forest Algorithm
Differences from the standard decision tree algorithm
• Train each tree on bootstrapped sample (not on entire data)

• For each split, consider only m random features

• Does not prune.


Out of Bag (OOB) Error

https://en.wikipedia.org/wiki/Out-of-bag_error
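
A hedged scikit-learn sketch tying the last two slides together: each tree gets a bootstrap sample, each split considers only a random subset of features, trees are left unpruned, and the out-of-bag samples give a built-in generalization estimate (hyperparameters are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_features="sqrt",     # m random features considered per split
    bootstrap=True,          # train each tree on a bootstrap sample
    oob_score=True,          # evaluate each tree on its out-of-bag rows
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)

# OOB accuracy is a cheap stand-in for cross-validated accuracy.
print(rf.oob_score_)
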
Random Forest: Bias and Variance
• Increasing the number of models (trees) decreases variance (less
overfitting)
• Bias is mostly unaffected, but will increase if the forest becomes too
large (oversmoothing)

Joaquin Vanschoren; ML for Engineers


Random Forest Tips
• Rule of thumb: start with #features × 10 trees and adjust
• Sklearn’s default values for the rest of the parameters are fine

Illustration by Bradley Boehmke


Random Forest Pros
• Gives competitive performance
• Can give great performance with little tuning
• Individual trees can overfit, random forest does not (usually)
• Has a built-in validation dataset using OOB data
• OOB error is a good estimate for generalization error
• Usually, you do not need to do cross validation for random forests
Random Forest Cons
• Can be slow for large datasets
• Not very interpretable
• Can be beaten by advanced boosting based ensembles
ExtraTrees / Extremely Randomized Trees
• Takes the randomness one step further
• By default, decision trees are built on the entire dataset (no bootstrap)
• When growing a decision tree, it randomly selects m out of M features
• For each of those features, it selects a split point at random
• E.g., attr_k = v1 (for categorical) or attr_k <= v1 (for continuous)
• Then uses a metric like gini/entropy to pick the best of the m random splits (a short sketch follows)
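
For comparison, a brief scikit-learn sketch of Extremely Randomized Trees; by default each tree is grown on the full training set (no bootstrap) and candidate split thresholds are drawn at random before the best of the m candidates is kept (parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

et = ExtraTreesClassifier(
    n_estimators=300,
    max_features="sqrt",   # m random features per split
    bootstrap=False,       # default: each tree sees the full dataset
    n_jobs=-1,
    random_state=0,
)

print(cross_val_score(et, X, y, cv=5).mean())
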
ExtraTrees
• Robust to noise and irrelevant features
• Very efficient: trees constructed in parallel, feature selection is fast
due to random subset and random splits
• Low variance (even when compared to RF and much lower than DT)
• Bias reduction: random subset/splitting makes the bias lower

• Performance comparable to RF and does better on noisy datasets


• Not widely used due to lack of awareness
Bagging Summary
• In Bagging, the models can be trained in parallel

• Take different K bootstrap samples and train K models

• Errors of one base model do not influence another: Why?

• Less susceptible to overfitting on noisy data as models do not focus on particular instances of data.
Boosting

Sebastian Raschka STAT 479: FS 2019


General Boosting Algorithm

• Initialize a weight vector with uniform weights


• Loop
• Apply weak learner to weighted training examples
• Increase weight for misclassified examples
• (Weighted) majority voting on trained classifiers

• Intuition: force classifier Ci+1 to focus on mistakes of Ci

Sebastian Raschka STAT 479: FS 2019
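
To make the loop concrete, here is my own minimal AdaBoost-style sketch of it with decision stumps and ±1 labels (a toy dataset and round count, not the slide's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = np.where(y01 == 1, 1, -1)               # work with +/-1 labels

n, n_rounds = len(y), 50
w = np.full(n, 1.0 / n)                     # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)        # weak learner on weighted data
    miss = stump.predict(X) != y
    err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
    if err >= 0.5:                          # no better than chance: stop
        break
    alpha = 0.5 * np.log((1 - err) / err)   # weight of this classifier
    w = w * np.exp(alpha * miss)            # boost weights of mistakes
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# (Weighted) majority vote over the trained stumps.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))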


Decision Tree Stumps
• Decision tree with depth 1

• Categorical attribute: attr = v


• Numerical attribute: attr < v

• Simple classifier and weak learner

Sebastian Raschka STAT 479: FS 2019


Boosting with Decision Stumps

Sebastian Raschka STAT 479: FS 2019


AdaBoost: Bias and Variance
• AdaBoost reduces bias (and a little variance)
• Boosting too much will eventually increase variance

Joaquin Vanschoren; ML for Engineers


AdaBoost

Sebastian Raschka STAT 479: FS 2019


AdaBoost Pros
• High accuracy: Generally outperforms single models, especially on
complex datasets. Possible to get training error of 0

• Possible to get feature importance (e.g. using decision stumps)

• Can work with diverse base learners


AdaBoost Cons
• Sensitivity to noisy data and outliers
• Computationally expensive: Especially for large datasets or many
iterations
• Can overfit, as the weights on hard or noisy examples keep increasing
• Harder to interpret than simpler models
• Sequential nature: Difficult to parallelize, which can slow down
training
Gradient Boosting
• Ensemble of models, each fixing the remaining mistakes of the previous
ones
• Base models are regression trees that predict the probability of the positive class p
• Each iteration, the task is to predict the residual error of the ensemble

• Additive model: predictions at iteration i are the sum of the base-model predictions (a sketch follows this list)
• Base models should be low variance, but flexible enough to predict
residuals accurately (e.g. decision trees of depth 2-5)
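
A minimal sketch of this idea for squared-error regression (my own illustration; classification with log loss follows the same pattern on pseudo-residuals): start from a constant prediction and repeatedly fit a shallow regression tree to the current residuals, adding its scaled output to the running ensemble.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

n_rounds, lr = 100, 0.1
pred = np.full_like(y, y.mean())               # F_0: constant prediction
trees = []

for _ in range(n_rounds):
    residual = y - pred                        # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3)  # low-variance base model
    tree.fit(X, residual)
    pred += lr * tree.predict(X)               # additive update
    trees.append(tree)

print("training MSE:", np.mean((y - pred) ** 2))
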
Gradient Boosting: Bias and Variance
Very effective at reducing bias but too much boosting increases variance

Joaquin Vanschoren; ML for Engineers


Gradient Boosting: Pros and Cons
• Among the most powerful and widely used models
• Works well on heterogeneous features and different scales
• Typically better than random forests, but requires more tuning, longer
training
• Does not work well on high-dimensional sparse data

Joaquin Vanschoren; ML for Engineers


XGBoost
• Faster version of Gradient Boosting models.

• Empirically, one of the best performing models

• RandomForest, XGBoost, LightGBM are the first approaches that you should try

• Not very easy to explain.

Sebastian Raschka STAT 479: FS 2019
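
A hedged usage sketch with the xgboost Python package (parameter values are placeholders; check the library documentation for current defaults and options):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # pip install xgboost

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))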


Boosting and Bagging

Rich Zemel CSC411 Fall 2014


Mixture of Experts (MoE)

Rich Zemel CSC411 Fall 2014


Cooperation vs Specialization
• Boosting and Bagging
• base classifiers cooperate to produce a prediction
• Each classifier has a fixed weight that is used for weighted majority voting

• MoE
• Weight of expert depends on input x
• Gating network forces experts to “specialize” instead of cooperate (a small sketch follows)
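
A tiny numpy sketch (my own illustration) of the MoE forward pass only: a softmax gating network produces input-dependent weights over the experts, so which expert dominates depends on x, unlike the fixed weights used in bagging and boosting.

import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3

# Toy parameters: linear experts and a linear gating network.
W_experts = rng.normal(size=(n_experts, d))   # one weight vector per expert
W_gate = rng.normal(size=(n_experts, d))      # gating network weights

def moe_predict(x):
    expert_outputs = W_experts @ x            # each expert's prediction
    logits = W_gate @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                        # softmax: weights depend on x
    return gate @ expert_outputs              # input-dependent mixture

x = rng.normal(size=d)
print(moe_predict(x))
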
Ensemble Learning Limitations
• If classifiers are accurate and diverse, we can push the accuracy of the
ensemble arbitrarily high by combining classifiers

• Typically, it is challenging for classifiers to make uncorrelated errors

• A more realistic claim: for data points where the classifiers predict with > 50% accuracy, we can push accuracy arbitrarily high (some data points are just too hard)

From: Neural network ensembles. Hansen and Salamon. TPAMI 1990


Why Decision Stumps as Base Learners
• Use the max depth of the tree as a hyperparameter

• Shallow trees
• High bias but very low variance (underfitting)
• Keep low variance, reduce bias with Boosting

• Deep trees
• High variance but low bias (overfitting)
• Keep low bias, reduce variance with Bagging

Observation by Joaquin Vanschoren


Which ML Models to Combine?
• If model underfits (high bias, low variance): combine with other low-variance models
• Need to be different: 'experts' on different parts of the data
• Bias reduction. Can be done with Boosting

• If model overfits (low bias, high variance): combine with other low-bias models
• Need to be different: individual mistakes must be different
• Variance reduction. Can be done with Bagging

Observation by Joaquin Vanschoren


Bagging Summary
• Bagging is a variance-reduction technique
• Build many high-variance (overfitting) models on random data
samples
• Aggregation (soft voting) over many models reduces variance
• Diminishing returns, over-smoothing may increase bias error
• Parallelizes easily, doesn't require much tuning

Observation by Joaquin Vanschoren


Boosting Summary
• Boosting is a bias-reduction technique
• Build low-variance models that correct each other's mistakes
• By reweighting misclassified samples: AdaBoost
• By predicting the residual error: Gradient Boosting
• Additive models: predictions are sum of base-model predictions
• Can drive the error to zero, but risk overfitting
• Doesn't parallelize easily. Slower to train, much faster to predict.
• XGBoost, LightGBM, ... are fast and offer some parallelization

Observation by Joaquin Vanschoren
