Lasso and Ridge Regression

The document discusses various methods for improving linear regression models, including feature selection techniques like best subset selection, stepwise selection, ridge regression, and the lasso. It aims to reduce overfitting by selecting a subset of important predictor variables or by shrinking coefficient estimates. Cross-validation is recommended for estimating test error and selecting tuning parameters. These methods help develop models that have good predictive performance and interpretability.


Objectives

 Understand best subset selection and stepwise selection methods for reducing the number of predictor variables in regression.
 Indirectly estimate test error by adjusting training error to account for bias due to overfitting (AIC, BIC, adjusted R2).
 Directly estimate test error using the validation set approach and the cross-validation approach.
 Understand and know how to perform ridge regression and the lasso as shrinkage (regularization) methods.
Improving the Linear Model
 We may want to improve the simple linear model by
replacing OLS estimation with some alternative fitting
procedure.

 Why use an alternative fitting procedure?


 Prediction Accuracy
 Model Interpretability
Model Interpretability
 When we have a large number of predictors in the model,
there will generally be many that have little or no effect on the
response.

 Including such irrelevant variables leads to unnecessary complexity.

 Leaving these variables in the model makes it harder to see the effect of the important variables.

 The model would be easier to interpret by removing (i.e. setting the coefficients to zero) the unimportant variables.
Feature/Variable Selection
 Carefully selected features
can improve model accuracy,
but adding too many can lead
to overfitting.
 Overfitted models describe random
error or noise instead of any underlying
relationship.
 They generally have poor predictive
performance on test data.

 For instance, we can use a 15th-degree polynomial function to fit the data shown so that the fitted curve passes nicely through the data points.
 However, a brand new dataset collected from the same population may not fit this particular curve well at all; a small simulation of this idea is sketched below.
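As a rough illustration of this point, the following sketch (Python with NumPy; the data are simulated, not taken from the slides) fits a 15th-degree polynomial to one noisy sample and then evaluates it on a second sample from the same population.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy training sample

# A 15th-degree polynomial threads (almost) every training point...
coef_high = np.polyfit(x, y, deg=15)

# ...but on a fresh sample from the same population it predicts poorly.
x_new = rng.uniform(0, 1, 20)
y_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.3, size=x_new.size)

def test_mse(coef):
    return np.mean((y_new - np.polyval(coef, x_new)) ** 2)

print("test MSE, degree 15:", test_mse(coef_high))
print("test MSE, degree 3: ", test_mse(np.polyfit(x, y, deg=3)))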
Feature/Variable Selection
 Subset Selection
 Identify a subset of the p predictors that we believe to be related to the
response; then, fit a model using OLS on the reduced set.
 Methods: best subset selection, stepwise selection

 Shrinkage (Regularization)
 Involves shrinking the estimated coefficients toward zero relative to the OLS
estimates; has the effect of reducing variance and performs variable selection.
 Methods: ridge regression, lasso

 Dimension Reduction
 Involves projecting the p predictors into an M-dimensional subspace, where M < p, and fitting the linear regression model using the M projections as predictors.
 Methods: principal components regression, partial least squares
Best Subset Selection
 The RSS always declines, and R2 always increases, as the number of predictors included in the model increases, so neither is a very useful statistic for selecting the best model.

 In the accompanying figure, the red line tracks the best model for each number of predictors, according to RSS and R2. A sketch of the exhaustive search is given below.
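A minimal sketch of the exhaustive search (Python with scikit-learn). The data are simulated and the helper name best_subset is hypothetical; the nested loop makes this practical only for small p.

import itertools
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each model size k, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, cols
        best_per_size[k] = (best_vars, best_rss)  # the best RSS always falls as k grows
    return best_per_size

X, y = make_regression(n_samples=100, n_features=8, n_informative=3, noise=5.0, random_state=0)
print(best_subset(X, y)[3][0])  # indices of the best 3-variable model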
Best Subset Selection
 While best subset selection is a simple and conceptually
appealing approach, it suffers from computational
limitations.

 The number of possible models that must be considered grows rapidly as p increases: with p predictors there are 2^p candidate models.

 Best subset selection becomes computationally infeasible for values of p greater than around 40.
Stepwise Selection
 For computational reasons, best subset selection cannot be
applied with very large p.

 The larger the search space, the higher the chance of finding
models that look good on the training data, even though
they might not have any predictive power on future data.

 An enormous search space can lead to overfitting and high variance of the coefficient estimates.
Stepwise Selection
More attractive methods include:

 Forward Stepwise Selection
 Begins with a null OLS model containing no predictors, and then adds the one predictor that improves the model the most, one at a time, until no further improvement is possible.

 Backward Stepwise Selection
 Begins with a full OLS model containing all predictors, and then deletes the one predictor whose removal improves the model the most, one at a time, until no further improvement is possible.

A sketch of both procedures, using scikit-learn, follows.
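As a hedged sketch, scikit-learn's SequentialFeatureSelector performs the greedy forward or backward search; note that it scores candidate models by cross-validation rather than by training-set improvement, so it approximates rather than reproduces the classical procedures above. The data are simulated.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)
ols = LinearRegression()

# Forward: start from the null model and add one predictor at a time.
forward = SequentialFeatureSelector(ols, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)

# Backward: start from the full model and delete one predictor at a time.
backward = SequentialFeatureSelector(ols, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))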
Choosing the Optimal Model
 The model containing all the predictors will always have the
smallest RSS and the largest R2, since these quantities are
related to the training error.

 We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.

 Thus, RSS and R2 are not suitable for selecting the best
model among a collection of models with different numbers
of predictors.
Estimating Test Error
1. We can indirectly estimate test error by making an
adjustment to the training error to account for the
bias due to overfitting.

2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach.
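A minimal sketch of both direct approaches on simulated data, using scikit-learn.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# 1) Validation set approach: hold out part of the data to estimate test error.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
print("validation-set MSE:", np.mean((y_val - ols.predict(X_val)) ** 2))

# 2) Cross-validation approach: average the held-out error over 5 folds.
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5).mean()
print("5-fold CV MSE:", cv_mse)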
Other Measures of Comparison
 To compare different models, we can use other approaches:
 Adjusted R2
 AIC (Akaike information criterion)
 BIC (Bayesian information criterion)

 These techniques adjust the training error for the model size,
and can be used to select among a set of models with
different numbers of variables.

 These methods add a penalty to the RSS for the number of predictors in the model.
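As a hedged sketch of how these criteria penalize model size, the Gaussian-model forms AIC ≈ n·log(RSS/n) + 2d and BIC ≈ n·log(RSS/n) + d·log(n) are used below (additive constants dropped), with d the number of predictors; the helper name fit_criteria is hypothetical.

import numpy as np

def fit_criteria(y, y_hat, d):
    """Training-error adjustments for a model with d predictors plus an intercept."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    aic = n * np.log(rss / n) + 2 * d           # penalty of 2 per predictor
    bic = n * np.log(rss / n) + d * np.log(n)   # heavier penalty when n is large
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return aic, bic, adj_r2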
Shrinkage (Regularization) Methods
 The subset selection methods use OLS to fit a linear model
that contains a subset of the predictors.

 As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates (i.e. shrinks the coefficient estimates towards zero).

 It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Shrinkage (Regularization) Methods
 Regularization is our first weapon to combat overfitting.

 It constrains the machine learning algorithm to improve out-of-sample error, especially when noise is present.

 Even a little regularization can make a dramatic difference to an overfit curve.
Ridge Regression
 Ridge regression adds to the RSS a shrinkage penalty of the form λ Σj βj², where the tuning parameter λ is a positive value.

 This has the effect of shrinking the estimated beta coefficients towards zero. It turns out that such a constraint can improve the fit, because shrinking the coefficients can significantly reduce their variance.

 Note that when λ = 0, the penalty term has no effect, and ridge regression will produce the OLS estimates. Thus, selecting a good value for λ is critical (cross-validation can be used for this). A scikit-learn sketch follows.
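A minimal scikit-learn sketch on simulated data. Note that scikit-learn calls the tuning parameter alpha rather than λ, and the predictors are standardized first so the penalty treats them symmetrically.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=15.0, random_state=1)

# As lambda (alpha) approaches 0 the ridge fit approaches OLS; larger values shrink more.
for lam in [0.01, 1.0, 100.0]:
    fit = make_pipeline(StandardScaler(), Ridge(alpha=lam)).fit(X, y)
    largest = np.abs(fit.named_steps["ridge"].coef_).max()
    print("lambda =", lam, "-> largest |beta| =", round(float(largest), 3))

# Choosing lambda by cross-validation, as suggested above.
cv_fit = make_pipeline(StandardScaler(),
                       RidgeCV(alphas=np.logspace(-3, 3, 50))).fit(X, y)
print("CV-selected lambda:", cv_fit.named_steps["ridgecv"].alpha_)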
Ridge Regression
 As λ increases, the standardized ridge regression coefficients shrink towards zero.

 Thus, when λ is extremely large, all of the ridge coefficient estimates are essentially zero; this corresponds to the null model that contains no predictors. The sketch below traces this path.
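A small sketch tracing how the standardized ridge coefficients collapse towards zero as λ grows (simulated data again).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=10.0, random_state=2)
Xs = StandardScaler().fit_transform(X)

for lam in [0.1, 10, 1_000, 100_000]:
    beta = Ridge(alpha=lam).fit(Xs, y).coef_
    # As lambda grows without bound, every coefficient approaches zero (the null model).
    print("lambda =", lam, "-> sum of |beta_j| =", round(float(np.abs(beta).sum()), 3))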
Ridge Regression
 In the accompanying plot: black = bias, green = variance, purple = MSE.

 Increased λ leads to increased bias but decreased variance.
Ridge Regression
 In general, the ridge
regression estimates will be
more biased than the OLS
ones but have lower
variance.

 Ridge regression will work best in situations where the OLS estimates have high variance.
Ridge Regression
Computational Advantages of Ridge Regression
 If p is large, then using the best subset selection approach
requires searching through enormous numbers of possible
models.

 With ridge regression, for any given λ we only need to fit one
model and the computations turn out to be very simple.

 Ridge regression can even be used when p > n, a situation where OLS fails completely (i.e. the OLS estimates do not even have a unique solution).
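A hedged demonstration of the p > n point on simulated data: the OLS normal equations are singular, while the ridge fit remains well defined.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# 50 observations but 200 predictors, so p > n.
X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                       noise=5.0, random_state=3)

# X'X is rank deficient, so OLS has no unique solution...
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X), "out of", X.shape[1])

# ...but X'X + lambda*I is invertible for any lambda > 0, so ridge fits cleanly.
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients estimated:", ridge.coef_.size)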
The Lasso
 One significant problem of ridge regression is that the
penalty term will never force any of the coefficients to be
exactly zero.

 Thus, the final model will include all p predictors, which creates a challenge in model interpretation.

 A more modern machine learning alternative is the lasso.

 The lasso works in a similar way to ridge regression, except it uses a different penalty term that shrinks some of the coefficients exactly to zero.
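A minimal sketch contrasting the two penalties on simulated data: the lasso's ℓ1 penalty drives some coefficients exactly to zero, while ridge only shrinks them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=4)
Xs = StandardScaler().fit_transform(X)

ridge_coef = Ridge(alpha=10.0).fit(Xs, y).coef_
lasso_coef = Lasso(alpha=1.0).fit(Xs, y).coef_

print("ridge coefficients exactly zero:", int(np.sum(ridge_coef == 0)))  # typically none
print("lasso coefficients exactly zero:", int(np.sum(lasso_coef == 0)))  # typically several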
The Lasso
 The lasso and ridge regression coefficient estimates are given
by the first point at which an ellipse contacts the constraint
region.
[Figure: contours of the RSS around the OLS solution meeting the lasso (diamond-shaped) and ridge (circular) constraint regions.]
Lasso vs. Ridge Regression
 The lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors.

 The lasso leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases.

 The lasso can generate more accurate predictions than ridge regression when only a small number of predictors have substantial effects; ridge tends to do better when many predictors matter.

 Cross-validation can be used in order to determine which approach is better on a particular data set.
Selecting the Tuning Parameter λ
 As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best; thus, we need a method for selecting a value of the tuning parameter λ or, equivalently, a value of the constraint s.

 Select a grid of potential values; use cross-validation to estimate the error rate on test data (for each value of λ) and select the value that gives the smallest error rate.

 Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter λ. A sketch of this procedure is given below.
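A hedged sketch of this procedure using the lasso and scikit-learn's LassoCV, which builds the λ grid and runs k-fold cross-validation internally; the data are simulated.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

X, y = make_regression(n_samples=150, n_features=30, n_informative=6,
                       noise=10.0, random_state=5)

# Cross-validate over a grid of candidate lambda (alpha) values.
cv_fit = LassoCV(alphas=np.logspace(-3, 2, 100), cv=10).fit(X, y)
best_lambda = cv_fit.alpha_
print("lambda with the smallest CV error:", best_lambda)

# Final step described above: refit on all observations at the selected lambda.
final_model = Lasso(alpha=best_lambda).fit(X, y)
print("nonzero coefficients in the final model:", int(np.sum(final_model.coef_ != 0)))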
Considerations in High Dimensions
 While p can be extremely large, the number of observations
n is often limited due to cost, sample availability, etc.

 Data sets containing more features than observations are often referred to as high-dimensional.

 When the number of features p is as large as, or larger than, the number of observations n, OLS should not be performed.
 It is too flexible and hence overfits the data.

 Forward stepwise selection, ridge regression, the lasso, and PCR are particularly useful for performing regression in the high-dimensional setting.
Considerations in High Dimensions
 Regularization or shrinkage plays a key role in high-
dimensional problems.

 Appropriate tuning parameter selection is crucial for good predictive performance.

 The test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.
 Known as the curse of dimensionality
Considerations in High Dimensions
 Curse of dimensionality
 Adding additional signal features that are truly associated with the
response will improve the fitted model, in the sense of leading to a
reduction in test set error.
 Adding noise features that are not truly associated with the response
will lead to a deterioration in the fitted model, and consequently an
increased test set error.

 Noise features increase the dimensionality of the problem, exacerbating the risk of overfitting without any potential upside in terms of improved test set error.
Considerations in High Dimensions
 In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model.

 It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting.

 One should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of good model fit in the high-dimensional setting.

 It is important to report results on an independent test set, or cross-validation errors.
Summary
 Best subset selection and stepwise selection methods.
 Estimate test error by adjusting training error to account for
bias due to overfitting.
 Estimate test error using validation set approach and cross-
validation approach.
 Ridge regression and the lasso as shrinkage (regularization)
methods.
 Principal components regression and partial least squares.
 Considerations for high-dimensional settings.
THANK YOU
