Lasso and Ridge Regression

The document discusses various methods for improving linear regression models, including feature selection techniques like best subset selection, stepwise selection, ridge regression, and the lasso. It aims to reduce overfitting by selecting a subset of important predictor variables or by shrinking coefficient estimates. Cross-validation is recommended for estimating test error and selecting tuning parameters. These methods help develop models that have good predictive performance and interpretability.


Objectives

 Understand best subset selection and stepwise selection methods for reducing the number of predictor variables in regression.
 Indirectly estimate test error by adjusting training error to account for bias due to overfitting (AIC, BIC, adjusted R2).
 Directly estimate test error using the validation set approach and the cross-validation approach.
 Understand and know how to perform ridge regression and the lasso as shrinkage (regularization) methods.
Improving the Linear Model
 We may want to improve the simple linear model by
replacing OLS estimation with some alternative fitting
procedure.

 Why use an alternative fitting procedure?


 Prediction Accuracy
 Model Interpretability
Model Interpretability
 When we have a large number of predictors in the model,
there will generally be many that have little or no effect on the
response.

 Including such irrelevant variables leads to unnecessary complexity.

 Leaving these variables in the model makes it harder to see the effect of the important variables.

 The model would be easier to interpret by removing (i.e. setting the coefficients to zero) the unimportant variables.
Feature/Variable Selection
 Carefully selected features
can improve model accuracy,
but adding too many can lead
to overfitting.
 Overfitted models describe random
error or noise instead of any underlying
relationship.
 They generally have poor predictive
performance on test data.

 For instance, we can use a 15th-degree polynomial function to fit the data shown so that the fitted curve passes nicely through the data points.
 However, a brand new dataset collected from the same population may not fit this particular curve well at all; a small simulation of this idea is sketched below.
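As a rough illustration of this point, the following sketch (Python with NumPy; the data are simulated, not taken from the slides) fits a 15th-degree polynomial to one noisy sample and then evaluates it on a second sample from the same population.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy training sample

# A 15th-degree polynomial threads (almost) every training point...
coef_high = np.polyfit(x, y, deg=15)

# ...but on a fresh sample from the same population it predicts poorly.
x_new = rng.uniform(0, 1, 20)
y_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.3, size=x_new.size)

def test_mse(coef):
    return np.mean((y_new - np.polyval(coef, x_new)) ** 2)

print("test MSE, degree 15:", test_mse(coef_high))
print("test MSE, degree 3: ", test_mse(np.polyfit(x, y, deg=3)))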
Feature/Variable Selection
 Subset Selection
 Identify a subset of the p predictors that we believe to be related to the
response; then, fit a model using OLS on the reduced set.
 Methods: best subset selection, stepwise selection

 Shrinkage (Regularization)
 Involves shrinking the estimated coefficients toward zero relative to the OLS
estimates; has the effect of reducing variance and performs variable selection.
 Methods: ridge regression, lasso

 Dimension Reduction
 Involves projecting the p predictors into an M-dimensional subspace, where M < p, and fitting the linear regression model using the M projections as predictors.
 Methods: principal components regression, partial least squares
Best Subset Selection
 The RSS always declines, and R2 always increases, as the number of predictors included in the model increases, so neither is a very useful statistic for selecting the best model.

 In the accompanying figure, the red line tracks the best model for each number of predictors, according to RSS and R2. A sketch of the exhaustive search is given below.
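A minimal sketch of the exhaustive search (Python with scikit-learn). The data are simulated and the helper name best_subset is hypothetical; the nested loop makes this practical only for small p.

import itertools
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each model size k, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, cols
        best_per_size[k] = (best_vars, best_rss)  # the best RSS always falls as k grows
    return best_per_size

X, y = make_regression(n_samples=100, n_features=8, n_informative=3, noise=5.0, random_state=0)
print(best_subset(X, y)[3][0])  # indices of the best 3-variable model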
Best Subset Selection
 While best subset selection is a simple and conceptually
appealing approach, it suffers from computational
limitations.

 The number of possible models that must be considered grows rapidly as p increases: with p predictors there are 2^p candidate models.

 Best subset selection becomes computationally infeasible for values of p greater than around 40.
Stepwise Selection
 For computational reasons, best subset selection cannot be
applied with very large p.

 The larger the search space, the higher the chance of finding
models that look good on the training data, even though
they might not have any predictive power on future data.

 An enormous search space can lead to overfitting and high variance of the coefficient estimates.
Stepwise Selection
More attractive methods include:

 Forward Stepwise Selection
 Begins with a null OLS model containing no predictors, and then adds the one predictor that improves the model the most, one at a time, until no further improvement is possible.

 Backward Stepwise Selection
 Begins with a full OLS model containing all predictors, and then deletes the one predictor whose removal improves the model the most, one at a time, until no further improvement is possible.

A sketch of both procedures, using scikit-learn, follows.
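As a hedged sketch, scikit-learn's SequentialFeatureSelector performs the greedy forward or backward search; note that it scores candidate models by cross-validation rather than by training-set improvement, so it approximates rather than reproduces the classical procedures above. The data are simulated.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)
ols = LinearRegression()

# Forward: start from the null model and add one predictor at a time.
forward = SequentialFeatureSelector(ols, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)

# Backward: start from the full model and delete one predictor at a time.
backward = SequentialFeatureSelector(ols, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))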
Choosing the Optimal Model
 The model containing all the predictors will always have the
smallest RSS and the largest R2, since these quantities are
related to the training error.

 We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.

 Thus, RSS and R2 are not suitable for selecting the best
model among a collection of models with different numbers
of predictors.
Estimating Test Error
1. We can indirectly estimate test error by making an
adjustment to the training error to account for the
bias due to overfitting.

2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach.
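A minimal sketch of both direct approaches on simulated data, using scikit-learn.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# 1) Validation set approach: hold out part of the data to estimate test error.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
print("validation-set MSE:", np.mean((y_val - ols.predict(X_val)) ** 2))

# 2) Cross-validation approach: average the held-out error over 5 folds.
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5).mean()
print("5-fold CV MSE:", cv_mse)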
Other Measures of Comparison
 To compare different models, we can use other approaches:
 Adjusted R2
 AIC (Akaike information criterion)
 BIC (Bayesian information criterion)

 These techniques adjust the training error for the model size,
and can be used to select among a set of models with
different numbers of variables.

 These methods add a penalty to the RSS for the number of predictors in the model.
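As a hedged sketch of how these criteria penalize model size, the Gaussian-model forms AIC ≈ n·log(RSS/n) + 2d and BIC ≈ n·log(RSS/n) + d·log(n) are used below (additive constants dropped), with d the number of predictors; the helper name fit_criteria is hypothetical.

import numpy as np

def fit_criteria(y, y_hat, d):
    """Training-error adjustments for a model with d predictors plus an intercept."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    aic = n * np.log(rss / n) + 2 * d           # penalty of 2 per predictor
    bic = n * np.log(rss / n) + d * np.log(n)   # heavier penalty when n is large
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return aic, bic, adj_r2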
Shrinkage (Regularization) Methods
 The subset selection methods use OLS to fit a linear model
that contains a subset of the predictors.

 As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates (i.e. shrinks the coefficient estimates towards zero).

 It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Shrinkage (Regularization) Methods
 Regularization is our first weapon to combat overfitting.

 It constrains the machine learning algorithm to improve out-of-sample error, especially when noise is present.

 Even a little regularization can make a dramatic difference to an overfit curve.
Ridge Regression
 Ridge regression adds to the RSS a shrinkage penalty of the form λ Σj βj², where the tuning parameter λ is a positive value.

 This has the effect of shrinking the estimated beta coefficients towards zero. It turns out that such a constraint can improve the fit, because shrinking the coefficients can significantly reduce their variance.

 Note that when λ = 0, the penalty term has no effect, and ridge regression will produce the OLS estimates. Thus, selecting a good value for λ is critical (cross-validation can be used for this). A scikit-learn sketch follows.
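A minimal scikit-learn sketch on simulated data. Note that scikit-learn calls the tuning parameter alpha rather than λ, and the predictors are standardized first so the penalty treats them symmetrically.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=15.0, random_state=1)

# As lambda (alpha) approaches 0 the ridge fit approaches OLS; larger values shrink more.
for lam in [0.01, 1.0, 100.0]:
    fit = make_pipeline(StandardScaler(), Ridge(alpha=lam)).fit(X, y)
    largest = np.abs(fit.named_steps["ridge"].coef_).max()
    print("lambda =", lam, "-> largest |beta| =", round(float(largest), 3))

# Choosing lambda by cross-validation, as suggested above.
cv_fit = make_pipeline(StandardScaler(),
                       RidgeCV(alphas=np.logspace(-3, 3, 50))).fit(X, y)
print("CV-selected lambda:", cv_fit.named_steps["ridgecv"].alpha_)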
Ridge Regression
 As λ increases, the standardized ridge regression coefficients shrink towards zero.

 Thus, when λ is extremely large, all of the ridge coefficient estimates are essentially zero; this corresponds to the null model that contains no predictors. The sketch below traces this path.
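A small sketch tracing how the standardized ridge coefficients collapse towards zero as λ grows (simulated data again).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=10.0, random_state=2)
Xs = StandardScaler().fit_transform(X)

for lam in [0.1, 10, 1_000, 100_000]:
    beta = Ridge(alpha=lam).fit(Xs, y).coef_
    # As lambda grows without bound, every coefficient approaches zero (the null model).
    print("lambda =", lam, "-> sum of |beta_j| =", round(float(np.abs(beta).sum()), 3))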
Ridge Regression
 In the accompanying plot: black = bias, green = variance, purple = MSE.

 Increased λ leads to increased bias but decreased variance.
Ridge Regression
 In general, the ridge
regression estimates will be
more biased than the OLS
ones but have lower
variance.

 Ridge regression will work best in situations where the OLS estimates have high variance.
Ridge Regression
Computational Advantages of Ridge Regression
 If p is large, then using the best subset selection approach
requires searching through enormous numbers of possible
models.

 With ridge regression, for any given λ we only need to fit one
model and the computations turn out to be very simple.

 Ridge regression can even be used when p > n, a situation where OLS fails completely (i.e. the OLS estimates do not even have a unique solution).
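A hedged demonstration of the p > n point on simulated data: the OLS normal equations are singular, while the ridge fit remains well defined.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# 50 observations but 200 predictors, so p > n.
X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                       noise=5.0, random_state=3)

# X'X is rank deficient, so OLS has no unique solution...
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X), "out of", X.shape[1])

# ...but X'X + lambda*I is invertible for any lambda > 0, so ridge fits cleanly.
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients estimated:", ridge.coef_.size)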
The Lasso
 One significant problem of ridge regression is that the
penalty term will never force any of the coefficients to be
exactly zero.

 Thus, the final model will include all p predictors, which creates a challenge in model interpretation.

 A more modern machine learning alternative is the lasso.

 The lasso works in a similar way to ridge regression, except it uses a different penalty term that shrinks some of the coefficients exactly to zero.
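A minimal sketch contrasting the two penalties on simulated data: the lasso's ℓ1 penalty drives some coefficients exactly to zero, while ridge only shrinks them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=4)
Xs = StandardScaler().fit_transform(X)

ridge_coef = Ridge(alpha=10.0).fit(Xs, y).coef_
lasso_coef = Lasso(alpha=1.0).fit(Xs, y).coef_

print("ridge coefficients exactly zero:", int(np.sum(ridge_coef == 0)))  # typically none
print("lasso coefficients exactly zero:", int(np.sum(lasso_coef == 0)))  # typically several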
The Lasso
 The lasso and ridge regression coefficient estimates are given
by the first point at which an ellipse contacts the constraint
region.
[Figure: contours of the RSS around the OLS solution meeting the lasso (diamond-shaped) and ridge (circular) constraint regions.]
Lasso vs. Ridge Regression
 The lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors.

 The lasso leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases.

 The lasso can generate more accurate predictions than ridge regression when only a small number of predictors have substantial effects; ridge tends to do better when many predictors matter.

 Cross-validation can be used in order to determine which approach is better on a particular data set.
Selecting the Tuning Parameter λ
 As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best; thus, we need a method for selecting a value of the tuning parameter λ or, equivalently, a value of the constraint s.

 Select a grid of potential values; use cross-validation to estimate the error rate on test data (for each value of λ) and select the value that gives the smallest error rate.

 Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter λ. A sketch of this procedure is given below.
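A hedged sketch of this procedure using the lasso and scikit-learn's LassoCV, which builds the λ grid and runs k-fold cross-validation internally; the data are simulated.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

X, y = make_regression(n_samples=150, n_features=30, n_informative=6,
                       noise=10.0, random_state=5)

# Cross-validate over a grid of candidate lambda (alpha) values.
cv_fit = LassoCV(alphas=np.logspace(-3, 2, 100), cv=10).fit(X, y)
best_lambda = cv_fit.alpha_
print("lambda with the smallest CV error:", best_lambda)

# Final step described above: refit on all observations at the selected lambda.
final_model = Lasso(alpha=best_lambda).fit(X, y)
print("nonzero coefficients in the final model:", int(np.sum(final_model.coef_ != 0)))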
Considerations in High Dimensions
 While p can be extremely large, the number of observations
n is often limited due to cost, sample availability, etc.

 Data sets containing more features than observations are often referred to as high-dimensional.

 When the number of features p is as large as, or larger than, the number of observations n, OLS should not be performed.
 It is too flexible and hence overfits the data.

 Forward stepwise selection, ridge regression, the lasso, and PCR are particularly useful for performing regression in the high-dimensional setting.
Considerations in High Dimensions
 Regularization or shrinkage plays a key role in high-
dimensional problems.

 Appropriate tuning parameter selection is crucial for good predictive performance.

 The test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.
 Known as the curse of dimensionality
Considerations in High Dimensions
 Curse of dimensionality
 Adding additional signal features that are truly associated with the
response will improve the fitted model, in the sense of leading to a
reduction in test set error.
 Adding noise features that are not truly associated with the response
will lead to a deterioration in the fitted model, and consequently an
increased test set error.

 Noise features increase the dimensionality of the problem, exacerbating the risk of overfitting without any potential upside in terms of improved test set error.
Considerations in High Dimensions
 In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model.

 It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting.

 One should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of good model fit in the high-dimensional setting.

 It is important to report results on an independent test set, or cross-validation errors.
Summary
 Best subset selection and stepwise selection methods.
 Estimate test error by adjusting training error to account for
bias due to overfitting.
 Estimate test error using validation set approach and cross-
validation approach.
 Ridge regression and the lasso as shrinkage (regularization)
methods.
 Principal components regression and partial least squares.
 Considerations for high-dimensional settings.
THANK YOU
