Chapter 6 - 1 Handout: Machine Learning
Session 17-18
Statistics, QAU
Shrinkage Methods
The subset selection methods involve using least squares to fit a linear
model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using
a technique that constrains or regularizes the coefficient estimates, or
equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
The two best-known techniques for shrinking the regression coefficients
towards zero are ridge regression and the lasso.
Ridge Regression
In the least squares fitting procedure, we estimate $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2.$$
Ridge regression is very similar to OLS, except that the coefficients are
estimated by minimizing a slightly different quantity.
The ridge regression coefficient estimates $\hat\beta^R$ are the values that minimize
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2,$$
where λ ≥ 0 is a tuning parameter, to be determined separately.
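As a concrete illustration of the criterion above (a minimal sketch, not part of the original slides; the data, function name, and centering convention are illustrative assumptions), the ridge estimates can be computed directly from the closed-form normal equations:

```python
# Minimal ridge regression sketch: beta_hat = (X'X + lambda*I)^{-1} X'y on
# centered data, so the intercept beta_0 is not penalized (as in the formula above).
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) minimizing RSS + lam * sum(beta_j^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes beta_0 from the penalty
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Example on simulated data: lam = 0 reproduces ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)
print(ridge_fit(X, y, lam=0.0))
print(ridge_fit(X, y, lam=10.0))   # coefficients shrink towards zero
```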
Ridge Regression...
The above equation represents a trade-off between two different criteria:
1. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
2. The second term, $\lambda\sum_{j=1}^{p}\beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.
The tuning parameter λ serves to control the relative impact of these
two terms on the regression coefficient estimates.
When λ = 0 , the penalty term has no effect, and ridge regression will
produce the least squares estimates.
However, as λ → ∞ , the impact of the shrinkage penalty grows, and
the ridge regression coefficient estimates will approach zero.
Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, $\hat\beta^R_\lambda$, for each value of $\lambda$.
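A short sketch of this point, assuming scikit-learn is available (its Ridge estimator uses the same penalty, with the `alpha` argument playing the role of $\lambda$; the simulated data are illustrative):

```python
# One coefficient vector per value of lambda: the ridge "coefficient path".
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=60)

for lam in [0.0, 1.0, 10.0, 100.0, 1e4]:
    model = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>8}: coefficients = {np.round(model.coef_, 3)}")
# lambda = 0 recovers least squares; large lambda drives all coefficients towards zero.
```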
Ridge Regression...
Ridge regression predictions on a simulated data set
Figure: Squared bias (black), variance (green), and test mean squared error (purple), as a function of $\lambda$ (left) and $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).
The right-hand panel of the Figure displays the same curves as the left-hand panel, this time plotted against the $\ell_2$ norm of the ridge regression coefficient estimates divided by the $\ell_2$ norm of the least squares estimates.
Now as we move from left to right, the fits become more flexible, and
so the bias decreases and the variance increases.
Why Does Ridge Regression Improve Over Least Squares?
Recall that the test mean squared error (MSE), plotted in purple, is
closely related to the variance plus the squared bias.
For values of λ up to about 10, the variance decreases rapidly, with
very little increase in bias, plotted in black.
Consequently, the MSE drops considerably as λ increases from 0 to 10.
Beyond this point, the decrease in variance due to increasing λ slows,
and the shrinkage on the coefficients causes them to be significantly
underestimated, resulting in a large increase in the bias.
The minimum MSE is achieved at approximately λ = 30.
Interestingly, because of its high variance, the MSE associated with
the least squares fit, when λ = 0, is almost as high as that of the null
model for which all coefficient estimates are zero, when λ = ∞.
However, for an intermediate value of λ, the MSE is considerably lower.
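The U-shaped behaviour described above can be reproduced on simulated data. The following is a hedged sketch; the data set, noise level, and $\lambda$ grid are illustrative assumptions, not the data behind the figure:

```python
# Estimate test MSE of ridge regression over a grid of lambda values:
# small lambda behaves like high-variance least squares, large lambda
# approaches the null model, and an intermediate lambda gives the lowest MSE.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n, p = 50, 45
beta_true = rng.normal(size=p)
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y_train = X_train @ beta_true + rng.normal(scale=3.0, size=n)
y_test = X_test @ beta_true + rng.normal(scale=3.0, size=1000)

for lam in [1e-4, 1e-2, 1, 10, 30, 100, 1e4]:
    fit = Ridge(alpha=lam).fit(X_train, y_train)
    mse = mean_squared_error(y_test, fit.predict(X_test))
    print(f"lambda={lam:>8}: test MSE = {mse:.2f}")
```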
Computational Advantages of Ridge Regression
The Lasso Regression
Unlike best subset, forward stepwise, and backward stepwise selection,
which will generally select models that involve just a subset of the
variables, ridge regression includes all p predictors in the final model.
The penalty term $\lambda\sum_{j=1}^{p}\beta_j^2$ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless $\lambda = \infty$).
This may not be a problem for prediction accuracy, but it can create
a challenge in model interpretation in settings in which the number of
variables p is quite large.
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage.
The lasso coefficient estimates $\hat\beta^L_\lambda$ minimize the quantity
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|.$$
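A minimal sketch of fitting this criterion with scikit-learn's Lasso (an assumption about its parameterization: scikit-learn minimizes $(1/2n)\,\mathrm{RSS} + \alpha\sum_j|\beta_j|$, so its `alpha` corresponds to $\lambda$ only up to a scaling factor; the data are simulated):

```python
# Lasso fit on simulated data where only two predictors truly matter;
# note that several estimated coefficients come out exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 4.0 - X[:, 3] * 2.0 + rng.normal(size=100)   # only 2 of 10 predictors matter

fit = Lasso(alpha=0.5).fit(X, y)
print(np.round(fit.coef_, 3))   # many coefficients are exactly zero
```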
The Lasso Regression....
Hence, much like best subset selection, the lasso performs variable
selection.
As a result, models generated from the lasso are generally much easier
to interpret than those produced by ridge regression.
We say that the lasso yields sparse models—that is, models that involve
only a subset of the variables.
As in ridge regression, selecting a good value of λ for the lasso is
critical; it can be determined by using cross-validation.
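A hedged sketch of choosing $\lambda$ by cross-validation, as suggested above (LassoCV is a scikit-learn convenience; the grid and data are illustrative assumptions):

```python
# Select the lasso tuning parameter by 10-fold cross-validation over a grid.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 + X[:, 1] * 1.5 + rng.normal(size=200)

cv_fit = LassoCV(alphas=np.logspace(-3, 1, 50), cv=10).fit(X, y)
print("chosen alpha (lambda up to scaling):", cv_fit.alpha_)
print("indices of nonzero coefficients:", np.flatnonzero(cv_fit.coef_))
```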
Example Lasso
When λ = 0, then the lasso simply gives the least squares fit, and when
λ becomes sufficiently large, the lasso gives the null model in which all
coefficient estimates equal zero.
However, in between these two extremes, the ridge regression and lasso models can give quite different coefficient estimates.
Comparing the Lasso and Ridge Regression
It is clear that the lasso has a major advantage over ridge regression,
in that it produces simpler and more interpretable models that involve
only a subset of the predictors.
However, which method leads to better prediction accuracy? The answer cannot be generalized; it depends on the setting.
The lasso implicitly assumes that a number of the coefficients truly
equal zero.
If the response is a function of a large number of the predictors, then ridge regression will tend to outperform the lasso in terms of prediction error.
If the response is a function of only a few variables, e.g. 2 out of 40, then the lasso will tend to outperform ridge regression in terms of bias, variance, and MSE.
It can be concluded that neither ridge regression nor the lasso will
universally dominate the other.
In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or equal to zero.
Comparing the Lasso and Ridge Regression
Ridge regression will perform better when the response is a function of
many predictors, all with coefficients of roughly equal size.
However, the number of predictors that is related to the response is
never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine
which approach is better on a particular data set.
As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.
Unlike ridge regression, the lasso performs variable selection, and hence
results in models that are easier to interpret.
There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit.
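The cross-validation comparison mentioned above can be sketched as follows (illustrative data with a sparse truth, so the lasso would be expected to win here; the grids and scoring choice are assumptions):

```python
# Compare ridge and lasso on one data set by 10-fold cross-validated MSE.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 30))
y = X[:, 0] * 2.0 - X[:, 5] * 3.0 + rng.normal(size=150)   # sparse truth favours the lasso

ridge_cv = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 30)},
                        scoring="neg_mean_squared_error", cv=10).fit(X, y)
lasso_cv = GridSearchCV(Lasso(max_iter=10000), {"alpha": np.logspace(-3, 1, 30)},
                        scoring="neg_mean_squared_error", cv=10).fit(X, y)

print("ridge CV MSE:", -ridge_cv.best_score_)
print("lasso CV MSE:", -lasso_cv.best_score_)
```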
Comparison of Ridge Regression and Lasso Regression
Figure: Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge regression (dotted).
A Simple Special Case for Ridge Regression and the Lasso
In order to obtain a better understanding about the behavior of ridge
regression and the lasso, consider a simple special case with n = p, and
X a diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal
elements.
To simplify the problem further, assume also that we are performing
regression without an intercept.
With these assumptions, the usual least squares problem simplifies to finding $\beta_1, \ldots, \beta_p$ that minimize
$$\sum_{j=1}^{p}(y_j - \beta_j)^2.$$
One can show that in this setting, the ridge regression estimates take the form
$$\hat\beta^R_j = \frac{y_j}{1 + \lambda},$$
and the lasso estimates take the form
$$\hat\beta^L_j = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2,\\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2,\\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}$$
Example: The ridge regression and lasso
We can see from these formulas that ridge regression and the lasso perform two very different types of shrinkage.
In ridge regression, each least squares coefficient estimate is shrunken
by the same proportion.
In contrast, the lasso shrinks each least squares coefficient towards zero
by a constant amount, λ/2; the least squares coefficients that are less
than λ/2 in absolute value are shrunken entirely to zero.
The type of shrinkage performed by the lasso in this simple setting is known as soft-thresholding.
The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.
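A small numerical sketch of the two shrinkage rules in this special case (the example values are illustrative):

```python
# In the n = p, identity-X setting: ridge shrinks each least squares
# coefficient y_j proportionally, while the lasso soft-thresholds it by
# lambda/2 (values below lambda/2 in absolute value become exactly zero).
import numpy as np

def ridge_shrink(y, lam):
    return y / (1.0 + lam)

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -0.4, 0.2, 1.0, 4.0])   # least squares estimates in this setting
print(ridge_shrink(y, lam=2.0))      # every entry scaled by 1/3
print(soft_threshold(y, lam=2.0))    # entries with |y_j| <= 1 set exactly to zero
```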
Example: The ridge regression and lasso...
In the case of a more general data matrix X, the story is a little more complicated than what is depicted above, but the main ideas still hold approximately.
Ridge regression more or less shrinks every dimension of the data by
the same proportion.
The lasso, in contrast, more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
Bayesian Interpretation for Ridge Regression and the Lasso
We now show that one can view ridge regression and the lasso through
a Bayesian lens.
A Bayesian viewpoint for regression assumes that the coefficient vector
β has some prior distribution, say p(β), where β = (β0 , β1 , ..., βp )T .
The likelihood of the data can be written as f (Y |X , β), where X =
(X1 , ..., Xp ).
Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form
$$p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta \mid X) = f(Y \mid X, \beta)\, p(\beta),$$
where the proportionality above follows from Bayes' theorem, and the equality above follows from the assumption that X is fixed.
Bayesian Interpretation for Ridge Regression and the Lasso
We assume the usual linear model,
Y = β0 + X1 β1 + .... + Xp βp + ϵ,
and suppose that the errors are independent and drawn from a normal
distribution.
Furthermore, assume that $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$ for some density function g. It turns out that ridge regression and the lasso follow naturally from two special cases of g:
1. If g is a Gaussian distribution with mean zero and standard deviation
a function of λ, then it follows that the posterior mode for β—that
is, the most likely value for β, given the data—is given by the ridge
regression solution. (In fact, the ridge regression solution is also the
posterior mean.)
2. If g is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, then it follows that the posterior mode for β is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector.)
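As a brief sketch of why case 1 holds (a standard argument, stated here under the added assumptions that $\epsilon \sim N(0, \sigma^2)$ and that g is the $N(0, \tau^2)$ density), the log-posterior is, up to an additive constant,
$$\log p(\beta \mid X, Y) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 - \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2 + \text{const},$$
so finding the posterior mode is the same as minimizing $\mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$ with $\lambda = \sigma^2/\tau^2$, i.e. the ridge criterion. Replacing the Gaussian prior with a Laplace prior replaces the quadratic term with $\lambda\sum_{j=1}^{p}|\beta_j|$, which gives case 2 (the lasso).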
Posterior of Ridge regression and Lasso Regression
References
The material used in these slides is borrowed from the following books. These slides can be used only for academic purposes.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.