
MODULE 4: REGRESSION SHRINKAGE METHODS

In statistics, shrinkage is the reduction in the effects of sampling variation. In regression analysis, a fitted relationship typically performs less well on a new data set than on the data set used for fitting; in particular, the value of the coefficient of determination 'shrinks'. This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the anticipated effects of further sampling, such as controlling for the possibility that new explanatory terms improve the model purely by chance: that is, the adjustment formula itself provides "shrinkage." But the adjustment formula yields only an artificial shrinkage.

A shrinkage estimator is an estimator that, either explicitly or implicitly, incorporates the effects of shrinkage. In loose terms, this means that a naive or raw estimate is improved by combining it with other information.

Description

Many standard estimators can be improved, in terms of mean squared error (MSE), by shrinking them towards zero (or towards any other finite constant value). In other words, the improvement from the reduced variance of the shrunken estimate (a narrower confidence interval) can outweigh the worsening introduced by biasing the estimate towards zero (see the bias-variance trade-off).

Assume that the expected value of the raw estimate is not zero and consider other estimators
obtained by multiplying the raw estimate by a certain parameter. A value for this parameter can
be specified so as to minimize the MSE of the new estimate. For this value of the parameter, the
new estimate will have a smaller MSE than the raw one. Thus it has been improved. An effect
here may be to convert an unbiased raw estimate to an improved biased one.
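To make this concrete, here is a minimal simulation sketch (not part of the original notes; the true mean, noise level, sample size, and grid of multipliers are arbitrary choices for illustration). It compares the MSE of a raw sample mean with the MSE of the same estimate multiplied by a shrinkage factor between 0 and 1:

```r
# Minimal simulation sketch: shrinking a raw estimate towards zero can reduce
# its mean squared error (MSE). All settings below are illustrative choices.
set.seed(1)
mu    <- 2        # true (nonzero) quantity being estimated
sigma <- 4        # noise standard deviation
n     <- 5        # small sample, so the raw estimate is noisy
reps  <- 10000    # Monte Carlo replications

# Raw (unbiased) estimates: sample means of small samples
raw <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Multiply the raw estimate by a shrinkage factor in [0, 1] and track the MSE
c_grid <- seq(0, 1, by = 0.05)
mse    <- sapply(c_grid, function(cc) mean((cc * raw - mu)^2))

c_grid[which.min(mse)]   # MSE-minimizing multiplier is below 1
mean((raw - mu)^2)       # MSE of the raw, unbiased estimate
min(mse)                 # smaller MSE of the shrunken, biased estimate
```

With these settings the MSE-minimizing multiplier is well below 1, so the biased, shrunken estimate beats the unbiased raw one in MSE terms.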

Prediction:

 Linear regression: $\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j$

 Or, for a more general regression function: $\hat{Y} = \hat{f}(X)$

 In a prediction context, there is less concern about the values of the individual components on the right-hand side; rather, interest is in their total contribution.
Variable Selection:

 The driving force behind variable selection:

o The desire for a parsimonious regression model (one that is simpler and easier to
interpret);

o The need for greater accuracy in prediction.

 The notion of what makes a variable "important" is still not well understood, but one
interpretation (Breiman, 2001) is that a variable is important if dropping it seriously
affects prediction accuracy.

 Selecting variables in regression models is a complicated problem, and there are many
conflicting views on which type of variable selection procedure is best, e.g. LRT, F-test,
AIC, and BIC.

There are two main types of stepwise procedures in regression:

 Backward elimination: eliminate the least important variable from the selected ones.

 Forward selection: add the most important variable from the remaining ones.

 A hybrid version that incorporates ideas from both main types: alternates backward and
forward steps, and stops when all variables have either been retained for inclusion or
removed.

Criticisms of Stepwise Methods:

 There is no guarantee that the subsets obtained from stepwise procedures will contain the
same variables or even be the "best" subset.

 When there are more variables than observations (p > n), backward elimination is
typically not a feasible procedure.

 The maximum or minimum of a set of correlated F statistics is not itself an F statistic.

 It produces a single answer (a very specific subset) to the variable selection problem,
although several different subsets may be equally good for regression purposes.

 The computation itself is easy using the R functions step() or regsubsets() (see the sketch below). However, to arrive at a practically good answer, you must know the practical context in which your inference will be used.
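As a rough sketch of how this looks in practice, the following uses R's built-in step() on the mtcars data; the dataset and the AIC-based default criterion are illustrative assumptions, not something prescribed in these notes:

```r
# Illustrative sketch of stepwise selection with step(); mtcars and the
# AIC-based default criterion are assumptions made for demonstration only.
data(mtcars)

full <- lm(mpg ~ ., data = mtcars)   # full model with all predictors
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# Backward elimination: start from the full model, drop the least useful term
back <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the null model, add the most useful term
forw <- step(null, direction = "forward", scope = formula(full), trace = 0)

# Hybrid version: alternates forward and backward steps
both <- step(null, direction = "both", scope = formula(full), trace = 0)

# The selected subsets need not agree, which illustrates the criticism above
formula(back); formula(forw); formula(both)
```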

Scott Zeger on "how to pick the wrong model": turn your scientific problem over to a computer that, knowing nothing about your science or your question, is very good at optimizing AIC or BIC.
Objectives

Upon successful completion of this lesson, you should be able to:

 Introduce biased regression methods to reduce variance.

 Implement ridge and lasso regression.

Ridge Regression
Ridge regression addresses some of the shortcomings of linear regression. It is an extension of the OLS method with an additional constraint. The OLS estimates are unconstrained and may exhibit a large magnitude, and therefore large variance. In ridge regression, a penalty is applied to the coefficients so that they are shrunk towards zero, which also has the effect of reducing the variance and hence the prediction error. As in the OLS approach, we choose the ridge coefficients to minimize a residual sum of squares (RSS), but now a penalized one. In contrast to OLS, ridge regression provides biased estimators that have low variance.

Motivation: too many predictors

 It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. in microarray data analysis or environmental pollution studies.

 With many predictors, fitting the full model without penalization results in large prediction intervals, and the LS regression estimator may not even exist uniquely.

One way out of this situation is to abandon the requirement of an unbiased estimator.
We assume only that X's and Y have been centered so that we have no need for a constant term
in the regression:
 X is an n by p matrix with centered columns,
 Y is a centered n-vector.
When initially developing predictive models, we often need to estimate the coefficients, since they are not explicitly given in the training data. To estimate them, we can use the standard ordinary least squares (OLS) matrix estimator:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$

Understanding this formula requires some familiarity with matrix notation. Suffice it to say, it finds the best-fitting line for a given dataset by calculating, for each independent variable, the coefficient that together with the others yields the smallest residual sum of squares (also called the sum of squared errors). Hoerl and Kennard (1970) proposed that the potential instability of this LS estimator can be remedied by adding a small constant to the diagonal of $X^{T}X$ before inverting it, which leads to the ridge regression estimator developed below.
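Before turning to that ridge modification, here is a small sketch of the OLS matrix computation itself, on simulated, centered data (the dimensions and true coefficients are made-up values for illustration), checked against R's built-in least squares fit:

```r
# Sketch: evaluate beta_hat = (X'X)^{-1} X'Y directly on simulated, centered data
set.seed(42)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1.5, -2, 0.5)              # made-up coefficients for illustration
y <- drop(X %*% beta_true) + rnorm(n)

# Center X and y so that no intercept term is needed (matching the assumption above)
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# beta_hat = (X'X)^{-1} X'y, computed via solve() rather than an explicit inverse
beta_ols <- solve(t(Xc) %*% Xc, t(Xc) %*% yc)
beta_ols

# The same result from R's built-in least squares fit without an intercept
coef(lm(yc ~ Xc - 1))
```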

Residual sum of squares (RSS) measures how well a linear regression model matches the training data. With centered data and no intercept, it is represented by the formulation:

$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2}$

This quantity measures the model's prediction accuracy against the ground-truth values in the training data. If RSS = 0, the model predicts the dependent variable perfectly. A score of zero is not always desirable, however, as it can indicate overfitting of the training data, particularly when the training dataset is small. Multicollinearity may be one cause of such overfitting.
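For concreteness, a quick sketch of computing RSS for a fitted linear model in R (the mtcars model here is an arbitrary illustration, and it includes an intercept, unlike the centered setting above):

```r
# Quick sketch: RSS of a fitted linear model (mtcars used purely as an example)
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)        # RSS = sum of squared residuals
rss
sum((mtcars$mpg - fitted(fit))^2)   # identical: sum of (y_i - yhat_i)^2
```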

• High coefficient estimates can often be symptomatic of overfitting. If two or more variables share a high linear correlation, OLS may return erroneously large coefficients. When one or more coefficients are too large, the model's output becomes sensitive to minor alterations in the input data. In other words, the model has overfitted a specific training set and fails to generalize accurately to new test sets. Such a model is considered unstable.

• Ridge regression modifies OLS by calculating coefficients that account for potentially correlated predictors. Specifically, ridge regression corrects for high-value coefficients by introducing a regularization term (often called the penalty term) into the RSS function. This penalty term is the sum of the squares of the model's coefficients. It is represented in the formulation:

$\sum_{j=1}^{p} \beta_j^{2}$
The L2 penalty term is inserted at the end of the RSS function, resulting in a new formulation, the ridge regression estimator, in which the penalty's effect on the model is controlled by the hyperparameter lambda (λ):

$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2} \Big\}$

Remember that a coefficient marks a given predictor's (i.e. independent variable's) effect on the predicted value (i.e. the dependent variable). Once added into the RSS formula, the L2 penalty term counteracts especially high coefficients by reducing all coefficient values. In statistics, this is called coefficient shrinkage. The ridge estimator above thus calculates new regression coefficients that minimize this penalized RSS. This shrinks every predictor's effect and reduces overfitting to the training data.

Note that ridge regression does not shrink every coefficient by the same amount. Rather, coefficients are shrunk in proportion to their initial size. As λ increases, high-value coefficients shrink at a greater rate than low-value coefficients; high-value coefficients are thus penalized more heavily than low-value ones.
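The following sketch applies the closed-form ridge estimator $(X^{T}X + \lambda I)^{-1}X^{T}Y$ to centered, simulated data and shows the coefficients shrinking towards zero as λ grows. The data, true coefficients, and λ grid are illustrative assumptions; in practice one would typically use a package such as glmnet and choose λ by cross-validation:

```r
# Sketch: beta_ridge = (X'X + lambda * I)^{-1} X'Y on centered, simulated data.
# Data, coefficients, and the lambda grid are illustrative assumptions.
set.seed(7)
n <- 60; p <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
beta_true <- c(3, -2, 1, 0)
y <- drop(X %*% beta_true) + rnorm(n)
y <- y - mean(y)

# Closed-form ridge coefficients for a given lambda
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

lambdas <- c(0, 1, 10, 100, 1000)
path <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(path) <- paste0("lambda=", lambdas)
rownames(path) <- paste0("beta", 1:p)

# lambda = 0 recovers the OLS fit; all coefficients shrink towards 0 as lambda grows
round(path, 3)
```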
