Module 4: Regression Shrinkage Methods
Description
Many standard estimators can be improved, in terms of mean squared error (MSE), by shrinking
them towards zero (or any other finite constant value). In other words, the improvement in the
estimate from the corresponding reduction in the width of the confidence interval can outweigh
the worsening of the estimate introduced by biasing the estimate towards zero (see bias-variance
trade-off).
Assume that the expected value of the raw estimate is not zero and consider other estimators
obtained by multiplying the raw estimate by a certain parameter. A value for this parameter can
be specified so as to minimize the MSE of the new estimate. For this value of the parameter, the
new estimate will have a smaller MSE than the raw one. Thus it has been improved. An effect
here may be to convert an unbiased raw estimate to an improved biased one.
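As a small worked sketch of this claim (for a single parameter under squared-error loss): suppose the raw estimate \hat{\theta} is unbiased for \theta \neq 0 with variance \sigma^2, and consider the scaled estimator c\hat{\theta}. Then

MSE(c) = E\left[(c\hat{\theta} - \theta)^2\right] = c^2 \sigma^2 + (c - 1)^2 \theta^2,

which is minimized at

c^* = \frac{\theta^2}{\theta^2 + \sigma^2} < 1, \qquad MSE(c^*) = \frac{\sigma^2 \theta^2}{\theta^2 + \sigma^2} < \sigma^2 = MSE(1).

So shrinking towards zero by the factor c^* trades a small bias for a larger reduction in variance. (In practice c^* depends on the unknown \theta and \sigma^2, so it must itself be estimated.)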
Prediction:
Linear regression:
In a prediction context, there is less concern about the values of the individual components on the right-hand side; rather, interest is in their total contribution to the prediction.
Variable Selection:
o The desire for a parsimonious regression model (one that is simpler and easier to
interpret);
The notion of what makes a variable "important" is still not well understood, but one
interpretation (Breiman, 2001) is that a variable is important if dropping it seriously
affects prediction accuracy.
Selecting variables in regression models is a complicated problem, and there are many
conflicting views on which type of variable selection procedure is best, e.g. LRT, F-test,
AIC, and BIC.
Backward elimination: eliminate the least important variable from those currently selected.
Forward selection: add the most important variable from those remaining (a minimal sketch of this idea is given after these notes).
A hybrid version incorporates ideas from both main types: it alternates backward and forward steps, and stops when all variables have either been retained for inclusion or removed.
There is no guarantee that the subsets obtained from stepwise procedures will contain the
same variables or even be the "best" subset.
When there are more variables than observations (p > n), backward elimination is
typically not a feasible procedure.
A stepwise procedure produces a single answer (one specific subset) to the variable selection problem, although several different subsets may be equally good for regression purposes.
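As a concrete illustration of forward selection, here is a minimal NumPy sketch that greedily adds the predictor giving the best Gaussian AIC and stops when no addition improves it. The scoring rule, function names, and use of plain least squares are illustrative assumptions, not the course's prescribed procedure.

import numpy as np

def gaussian_aic(rss, n, k):
    # AIC for a Gaussian linear model, up to an additive constant:
    # n * log(RSS / n) + 2 * (number of fitted coefficients)
    return n * np.log(rss / n) + 2 * k

def forward_selection(X, Y):
    # X: n x p centered design matrix, Y: centered response (NumPy arrays).
    n, p = X.shape
    selected, remaining = [], list(range(p))
    best = gaussian_aic(np.sum(Y ** 2), n, 0)       # null model: no predictors
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            beta = np.linalg.lstsq(X[:, cols], Y, rcond=None)[0]
            rss = np.sum((Y - X[:, cols] @ beta) ** 2)
            scores.append((gaussian_aic(rss, n, len(cols)), j))
        cand_score, cand_j = min(scores)
        if cand_score >= best:                      # no candidate improves AIC: stop
            break
        best = cand_score
        selected.append(cand_j)
        remaining.remove(cand_j)
    return selected

Backward elimination follows the same pattern starting from the full model, and the hybrid scheme alternates add and drop steps with the same stopping rule.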
Scott Zeger on "how to pick the wrong model": turn your scientific problem over to a computer that, knowing nothing about your science or your question, is very good at optimizing AIC, BIC, ...
Objectives
Ridge Regression
Ridge regression addresses some of the shortcomings of linear regression. It is an extension of the OLS method with an additional constraint. The OLS estimates are unconstrained and may have large magnitude, and therefore large variance. In ridge regression, a penalty is applied to the coefficients so that they are shrunk towards zero, which also has the effect of reducing the variance and hence the prediction error. As in the OLS approach, we choose the ridge coefficients to minimize a penalized residual sum of squares (RSS). In contrast to OLS, ridge regression provides biased estimators that have low variance.
One way out of this high-variance situation is to abandon the requirement of an unbiased estimator.
We assume only that X's and Y have been centered so that we have no need for a constant term
in the regression:
X is an n by p matrix with centered columns,
Y is a centered n-vector.
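As a minimal sketch of this preprocessing step in NumPy (the data here is simulated purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # illustrative n x p design matrix
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

Xc = X - X.mean(axis=0)                       # n x p matrix with centered columns
Yc = Y - Y.mean()                             # centered n-vector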
When initially developing predictive models, we often need to compute the coefficients, as they are not explicitly stated in the training data. To estimate them, we can use the standard ordinary least squares (OLS) matrix coefficient estimator:

\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y
Hoerl and Kennard (1970) proposed that potential instability in the LS estimator could be remedied by adding a small constant value λ to the diagonal entries of the matrix X^T X before taking its inverse; the result is the ridge regression estimator given below.
Understanding this formula's operations requires familiarity with matrix notation. Suffice it to say that the formula finds the best-fitting line for a given dataset by calculating coefficients for each independent variable that collectively result in the smallest residual sum of squares (also called the sum of squared errors).
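A minimal sketch of this estimator in NumPy, assuming centered data as above (simulated here for illustration; np.linalg.solve is used instead of an explicit matrix inverse):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                                 # centered columns
beta_true = np.array([2.0, 0.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)
Y -= Y.mean()                                       # centered response

# OLS estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)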
Residual sum of squares (RSS) measures how well a linear regression model matches the training data. It is represented by the formulation

RSS = \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2

This formula measures model prediction accuracy against the ground-truth values in the training data. If RSS = 0, the model perfectly predicts the dependent variable. A score of zero is not always desirable, however, as it can indicate overfitting to the training data, particularly if the training dataset is small. Multicollinearity among the predictors may be one cause of this.
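Continuing in the same vein, here is a short self-contained sketch of computing RSS for a fitted coefficient vector (the small arrays are illustrative):

import numpy as np

# Toy data; in practice beta_hat would come from the OLS fit above.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, -1.0], [4.0, 1.5]])
Y = np.array([2.0, 1.0, 4.0, 5.0])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

residuals = Y - X @ beta_hat                  # y_i - x_i^T beta
rss = float(np.sum(residuals ** 2))           # residual sum of squares
print(rss)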
• Ridge regression modifies OLS by calculating coefficients that account for potentially correlated predictors. Specifically, ridge regression corrects for high-value coefficients by introducing a regularization term (often called the penalty term) into the RSS function. This penalty term is the sum of the squares of the model’s coefficients. It is represented in the formulation

\sum_{j=1}^{p} \beta_j^2
The L2 penalty term is appended to the RSS function, resulting in a new formulation, the ridge regression estimator. Therein, its effect on the model is controlled by the hyperparameter lambda (λ):

\hat{\beta}_{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} = (X^T X + \lambda I)^{-1} X^T Y
Remember that coefficients mark a given predictor’s (i.e. independent variable’s) effect on the predicted value (i.e. dependent variable). Once added to the RSS formula, the L2 penalty term counteracts especially high coefficients by pulling all coefficient values towards zero; in statistics, this is called coefficient shrinkage. The ridge estimator above thus calculates new regression coefficients that minimize the penalized RSS, moderating every predictor’s effect and reducing overfitting to the training data.
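Here is a minimal sketch of the closed-form ridge estimator in NumPy, with two nearly collinear predictors simulated to show the stabilizing effect; the value λ = 10 is arbitrary:

import numpy as np

def ridge_estimator(X, Y, lam):
    # Closed-form ridge coefficients: (X^T X + lambda * I)^{-1} X^T Y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n),   # two nearly collinear columns
                     z + 0.05 * rng.normal(size=n),
                     rng.normal(size=n)])
X -= X.mean(axis=0)
Y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)
Y -= Y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)          # tends to be large and unstable here
beta_ridge = ridge_estimator(X, Y, lam=10.0)          # shrunk towards zero, lower variance

With nearly collinear columns, the OLS coefficients on the correlated pair tend to be large and of offsetting sign, while the ridge coefficients stay small and similar.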
Note that ridge regression does not shrink every coefficient by the same amount. Rather, coefficients are shrunk in proportion to their initial size: as λ increases, high-value coefficients lose more (in absolute terms) than low-value coefficients, and are thus penalized more heavily.
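A small sketch of this behaviour, sweeping λ over an arbitrary grid and printing the ridge coefficients for a model with one large, one moderate, and one small true effect:

import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
Y = X @ np.array([5.0, 1.0, 0.2]) + rng.normal(size=n)
Y -= Y.mean()

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    print(f"lambda = {lam:7.1f}  coefficients = {np.round(beta, 3)}")

# As lambda grows, every coefficient moves towards zero; the largest one loses
# the most in absolute terms, but none is set exactly to zero.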