
Machine Learning

Session 17-18

Prof. Dr.Ijaz Hussain

Statistics, QAU

May 13, 2024

Shrinkage Methods

The subset selection methods involve using least squares to fit a linear
model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using
a technique that constrains or regularizes the coefficient estimates, or
equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should im-
prove the fit, but it turns out that shrinking the coefficient estimates
can significantly reduce their variance.
The two best-known techniques for shrinking the regression coefficients
towards zero are ridge regression and the lasso.

Ridge Regression
The least squares fitting procedure estimates β0, β1, ..., βp using the values that minimize

RSS = \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2

Ridge regression is very similar to OLS, except that the coefficients are
estimated by minimizing a slightly different quantity.
Ridge regression coefficient estimates β̂^R are the values that minimize

RSS^R = \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2

where λ ≥ 0 is a tuning parameter, to be determined separately.
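A minimal sketch of these two objectives in Python (assuming NumPy and scikit-learn are available; the simulated data and the value of λ are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulated data (hypothetical): 100 observations, 5 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 1.0 + X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

# Least squares: minimize RSS over (beta_0, beta_1, ..., beta_p)
X1 = np.column_stack([np.ones(len(y)), X])        # prepend an intercept column
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ beta_ols) ** 2)

# Ridge: minimize RSS + lambda * sum(beta_j^2); scikit-learn calls lambda "alpha"
lam = 10.0
ridge = Ridge(alpha=lam).fit(X, y)
ridge_obj = np.sum((y - ridge.predict(X)) ** 2) + lam * np.sum(ridge.coef_ ** 2)

print(f"RSS at the least squares fit: {rss:.2f}")
print(f"Ridge objective at lambda={lam}: {ridge_obj:.2f}")
```

Note that scikit-learn's Ridge penalizes only the slope coefficients and not the intercept, which matches the convention discussed below.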
Ridge Regression...
The above equation represents a trade-off between two different criteria.
1. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
2. The second term, λ\sum_{j=1}^{p}\beta_j^2, called a shrinkage penalty, is small when β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
The tuning parameter λ serves to control the relative impact of these
two terms on the regression coefficient estimates.
When λ = 0 , the penalty term has no effect, and ridge regression will
produce the least squares estimates.
However, as λ → ∞ , the impact of the shrinkage penalty grows, and
the ridge regression coefficient estimates will approach zero.
Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, β̂_λ^R, for each value of λ.
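For instance, a small sketch (scikit-learn assumed; the λ grid and data are illustrative) that produces one coefficient vector per value of λ:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

# One distinct coefficient vector beta_hat^R_lambda for every value of lambda
for lam in [0.01, 0.1, 1.0, 10.0, 100.0, 1e4]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda = {lam:>8}: {np.round(coefs, 3)}")
# Small lambda reproduces (almost) the least squares fit;
# large lambda shrinks every coefficient toward zero.
```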
Ridge Regression...

Selecting a good value for λ is critical; it can be determined using cross-validation methods.
Note that in the equation above, the shrinkage penalty is applied to β1, ..., βp, but not to the intercept β0.
We want to shrink the estimated association of each variable with the
response.
However, we do not want to shrink the intercept, which is simply a
measure of the mean value of the response when xi1 = xi2 = ... =
xip = 0.
If we assume that the variables—that is, the columns of the data matrix X—have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form β̂_0 = ȳ = \sum_{i=1}^{n} y_i / n.
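A quick numerical check of this fact (a sketch on simulated data; scikit-learn's Ridge is assumed, which does not penalize the intercept):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, size=(200, 3))       # columns deliberately not centered at zero
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=200)

Xc = X - X.mean(axis=0)                      # center each column to have mean zero
fit = Ridge(alpha=50.0).fit(Xc, y)           # the intercept itself is not penalized

print(fit.intercept_, y.mean())              # these two values agree (up to rounding)
```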

Ridge regression predictions on a simulated data set

Figure: Squared bias (black), variance (green), and test mean squared error (purple), as a function of λ (left) and of ∥β̂_λ^R∥_2 / ∥β̂∥_2 (right).

The horizontal dashed lines indicate the minimum possible MSE.
The purple crosses indicate the ridge regression models for which the MSE is smallest.
Ridge regression predictions on a simulated data set

The right-hand panel of the Figure displays the same curves as the left-
hand panel, this time plotted against the l2 norm of the ridge regression
coefficient estimates divided by the l2 norm of the least squares esti-
mates.
Now as we move from left to right, the fits become more flexible, and
so the bias decreases and the variance increases.

Why Does Ridge Regression Improve Over Least Squares?

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off.
As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
This is illustrated in the left-hand panel of the previously displayed
Figure , using a simulated data set containing p = 45 predictors and
n = 50 observations.
The green curve in the left-hand panel of the Figure displays the vari-
ance of the ridge regression predictions as a function of λ.
At the least squares coefficient estimates, which correspond to ridge
regression with λ = 0, the variance is high but there is no bias.
But as λ increases, the shrinkage of the ridge coefficient estimates
leads to a substantial reduction in the variance of the predictions, at
the expense of a slight increase in bias.

Why Does Ridge Regression Improve Over Least Squares?

Recall that the test mean squared error (MSE), plotted in purple, is
closely related to the variance plus the squared bias.
For values of λ up to about 10, the variance decreases rapidly, with
very little increase in bias, plotted in black.
Consequently, the MSE drops considerably as λ increases from 0 to 10.
Beyond this point, the decrease in variance due to increasing λ slows,
and the shrinkage on the coefficients causes them to be significantly
underestimated, resulting in a large increase in the bias.
The minimum MSE is achieved at approximately λ = 30.
Interestingly, because of its high variance, the MSE associated with
the least squares fit, when λ = 0, is almost as high as that of the null
model for which all coefficient estimates are zero, when λ = ∞.
However, for an intermediate value of λ, the MSE is considerably lower.

Computational Advantages of Ridge Regression

Ridge regression also has substantial computational advantages over best subset selection, which requires searching through 2^p models.
As we discussed previously, even for moderate values of p, such a search
can be computationally infeasible.
In contrast, for any fixed value of λ, ridge regression only fits a single
model, and the model-fitting procedure can be performed quite quickly.
In fact, one can show that the computations required to solve the ridge regression equation, simultaneously for all values of λ, are almost identical to those for fitting a model using least squares.
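One way to see this in code (a sketch of the standard SVD trick, not a derivation from the slides): after centering, a single SVD X = U D Vᵀ gives the ridge solution for every λ as β̂_λ^R = V diag(d_j / (d_j² + λ)) Uᵀ y, so an entire grid of λ values costs little more than one least squares fit.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

# Center X and y so that no intercept is needed, then factor X once
Xc, yc = X - X.mean(axis=0), y - y.mean()
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Uty = U.T @ yc

# Ridge coefficient vectors for a whole grid of lambdas, reusing the same SVD
lambdas = np.logspace(-2, 4, 50)
betas = np.array([Vt.T @ (d / (d**2 + lam) * Uty) for lam in lambdas])
print(betas.shape)   # (50, 5): one coefficient vector per value of lambda
```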

The Lasso Regression
Unlike best subset, forward stepwise, and backward stepwise selection,
which will generally select models that involve just a subset of the
variables, ridge regression includes all p predictors in the final model.
The penalty term λ\sum_{j=1}^{p}\beta_j^2 will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞).
This may not be a problem for prediction accuracy, but it can create
a challenge in model interpretation in settings in which the number of
variables p is quite large.
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage.
The lasso coefficients, β̂_λ^L, minimize the quantity

RSS^L = \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|
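A minimal lasso fit with scikit-learn (a sketch; note that sklearn's Lasso minimizes (1/(2n))·RSS + α∥β∥₁, a rescaling of the objective above, so its α plays the role of λ/(2n)):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]             # only 3 of the 10 predictors matter
y = X @ beta_true + rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)           # alpha corresponds to lambda/(2n) here
print(np.round(lasso.coef_, 3))
print("coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
```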

The Lasso Regression....

Comparing the ridge objective with the lasso objective, it can be observed that the lasso and ridge regression have similar formulations.
The only difference is that the βj2 term in the ridge regression penalty
has been replaced by |βj | in the lasso penalty.
In statistical parlance, the lasso uses an l1 (pronounced “ell 1”) penalty
instead of an l2 penalty.
The l1 norm of a coefficient vector β is given by ∥β∥_1 = \sum_{j=1}^{p}|\beta_j|.
As with ridge regression, the lasso shrinks the coefficient estimates
towards zero.
However, in the case of the lasso, the l1 penalty has the effect of forcing
some of the coefficient estimates to be exactly equal to zero when the
tuning parameter λ is sufficiently large.

The Lasso Regression....

Hence, much like best subset selection, the lasso performs variable
selection.
As a result, models generated from the lasso are generally much easier
to interpret than those produced by ridge regression.
We say that the lasso yields sparse models—that is, models that involve
only a subset of the variables.
As in ridge regression, selecting a good value of λ for the lasso is
critical; it can be determined by using cross-validation.

Example Lasso

Figure:

When λ = 0, the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.
However, in between these two extremes, the ridge regression and lasso models can give quite different coefficient estimates.
Example Lasso....

Moving from left to right in the right-hand panel, we observe that at first the lasso results in a model that contains only the rating predictor.
Then student and limit enter the model almost simultaneously, shortly
followed by income.
Eventually, the remaining variables enter the model.
Hence, depending on the value of λ, the lasso can produce a model
involving any number of variables.
In contrast, ridge regression will always include all of the variables in
the model, although the magnitude of the coefficient estimates will
depend on λ.

Comparing the Lasso and Ridge Regression
It is clear that the lasso has a major advantage over ridge regression,
in that it produces simpler and more interpretable models that involve
only a subset of the predictors.
However, which method leads to better prediction accuracy? The answer cannot be generalized.
The lasso implicitly assumes that a number of the coefficients truly
equal zero.
If the response is a function of many of the predictors, then ridge regression will outperform the lasso in terms of prediction error.
If the response is a function of only a few variables, e.g. 2 out of 40, then the lasso will tend to outperform ridge regression in terms of bias, variance, and MSE.
It can be concluded that neither ridge regression nor the lasso will
universally dominate the other.
In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or equal to zero.
Comparing the Lasso and Ridge Regression
Ridge regression will perform better when the response is a function of
many predictors, all with coefficients of roughly equal size.
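This contrast can be illustrated with a small simulation (a sketch with hypothetical settings; which method wins in each scenario will vary somewhat with the random seed, but the sparse case typically favors the lasso and the dense case ridge):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

def test_mse(beta, n=200, p=40, seed=0):
    """Fit cross-validated ridge and lasso, return their held-out test MSEs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=seed)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(Xtr, ytr)
    lasso = LassoCV(cv=5).fit(Xtr, ytr)
    return (np.mean((yte - ridge.predict(Xte)) ** 2),
            np.mean((yte - lasso.predict(Xte)) ** 2))

p = 40
sparse_truth = np.zeros(p); sparse_truth[:2] = 3.0   # response depends on 2 of 40 predictors
dense_truth = np.full(p, 0.5)                        # all 40 predictors matter equally

print("sparse truth (ridge MSE, lasso MSE):", test_mse(sparse_truth))
print("dense truth  (ridge MSE, lasso MSE):", test_mse(dense_truth))
```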
However, the number of predictors that is related to the response is
never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine
which approach is better on a particular data set.
As with ridge regression, when the least squares estimates have exces-
sively high variance, the lasso solution can yield a reduction in variance
at the expense of a small increase in bias, and consequently can gen-
erate more accurate predictions.
Unlike ridge regression, the lasso performs variable selection, and hence
results in models that are easier to interpret.
There are very efficient algorithms for fitting both ridge and lasso mod-
els; in both cases the entire coefficient paths can be computed with
about the same amount of work as a single least squares fit.
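For example, scikit-learn's lasso_path computes the entire lasso coefficient path in one call (a sketch; the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8))
beta = np.zeros(8); beta[:3] = [3.0, -2.0, 1.0]
y = X @ beta + rng.normal(size=100)

# A single call returns coefficients along an entire decreasing grid of penalties
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(alphas.shape, coefs.shape)   # (50,), (8, 50): one column of coefficients per penalty
```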
Comparison of Ridge Regression and Lasso Regression

Figure:

Left: Plots of squared bias (black), variance (green), and test MSE
(purple) for the lasso on a simulated data set.
Right: Comparison of squared bias, variance, and test MSE between
lasso (solid) and ridge (dotted).
A Simple Special Case for Ridge Regression and the Lasso
In order to obtain a better understanding about the behavior of ridge
regression and the lasso, consider a simple special case with n = p, and
X a diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal
elements.
To simplify the problem further, assume also that we are performing
regression without an intercept.
With these assumptions, the usual least squares problem simplifies to finding β1, ..., βp that minimize

\sum_{j=1}^{p}(y_j - \beta_j)^2

In this case, the least squares solution is β̂_j = y_j.
And in this setting, ridge regression amounts to finding β1, ..., βp such that

\sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2

is minimized.
A Simple Special Case for Ridge Regression and the
Lasso...

The lasso amounts to finding the coefficients such that

\sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda\sum_{j=1}^{p}|\beta_j|

is minimized.
One can show that in this setting, the ridge regression estimates take the form

\hat{\beta}_j^R = \frac{y_j}{1 + \lambda}

and the lasso estimates take the form

\hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{otherwise} \end{cases}
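These two formulas translate directly into code (a small NumPy sketch, not tied to any library routine):

```python
import numpy as np

def ridge_shrink(y, lam):
    """Ridge estimate in this orthonormal special case: shrink every y_j proportionally."""
    return y / (1.0 + lam)

def soft_threshold(y, lam):
    """Lasso estimate in the same case: move y_j toward zero by lam/2, clipping at zero."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -0.4, 0.2, 1.0, 4.0])
print(ridge_shrink(y, lam=2.0))      # every entry shrunk by the same factor 1/(1 + lambda)
print(soft_threshold(y, lam=2.0))    # entries with |y_j| <= lambda/2 become exactly zero
```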

Example: The ridge regression and lasso

Figure:

Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates.
Right: The lasso coefficient estimates are soft-thresholded towards
zero.
Example: The ridge regression and lasso...

We can see from above figure that ridge regression and the lasso per-
form two very different types of shrinkage.
In ridge regression, each least squares coefficient estimate is shrunken
by the same proportion.
In contrast, the lasso shrinks each least squares coefficient towards zero
by a constant amount, λ/2; the least squares coefficients that are less
than λ/2 in absolute value are shrunken entirely to zero.
The type of shrinkage performed by the lasso in this simple setting is known as soft-thresholding.
The fact that some lasso coefficients are shrunken entirely to zero ex-
plains why the lasso performs feature selection.

Example: The ridge regression and lasso...

In the case of a more general data matrix X, the story is a little more complicated than what is depicted in the Figure above, but the main ideas still hold approximately.
Ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.

Bayesian Interpretation for Ridge Regression and the Lasso

We now show that one can view ridge regression and the lasso through
a Bayesian lens.
A Bayesian viewpoint for regression assumes that the coefficient vector
β has some prior distribution, say p(β), where β = (β0 , β1 , ..., βp )T .
The likelihood of the data can be written as f (Y |X , β), where X =
(X1 , ..., Xp ).
Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form

p (β|X , Y ) ∝ f (Y |X , β) p (β|X ) = f (Y |X , β) p (β)

where the proportionality above follows from Bayes’ theorem, and the
equality above follows from the assumption that X is fixed.

Bayesian Interpretation for Ridge Regression and the Lasso
We assume the usual linear model,
Y = β0 + X1 β1 + · · · + Xp βp + ϵ,
and suppose that the errors are independent and drawn from a normal
distribution.
Furthermore, assume that p(β) = \prod_{j=1}^{p} g(\beta_j), for some density function g. It turns out that ridge regression and the lasso follow naturally from two special cases of g:
1. If g is a Gaussian distribution with mean zero and standard deviation
a function of λ, then it follows that the posterior mode for β—that
is, the most likely value for β, given the data—is given by the ridge
regression solution. (In fact, the ridge regression solution is also the
posterior mean.)
2. If g is a double-exponential (Laplace) distribution with mean zero and
scale parameter a function of λ, then it follows that the posterior mode
for β is the lasso solution. (However, the lasso solution is not the pos-
terior mean, and in fact, the posterior mean does not yield a sparse
coefficient vector.)
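To make case 1 above concrete, here is the standard calculation (a sketch under the stated assumptions: normal errors with variance σ² and a Gaussian prior with standard deviation τ; the identification λ = σ²/τ² is the usual one and is not stated explicitly in the slides):

-\log p(\beta \mid X, Y) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2 + \text{const}

Multiplying by 2σ² (which does not change the minimizer) gives RSS + (σ²/τ²)\sum_{j=1}^{p}\beta_j^2, so the posterior mode is exactly the ridge solution with λ = σ²/τ². Replacing the Gaussian prior by a Laplace prior, whose log-density is proportional to −|β_j|, turns the penalty into λ\sum_{j=1}^{p}|\beta_j|, i.e. the lasso, as in case 2.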
Posterior of Ridge regression and Lasso Regression

Figure:

Left: Ridge regression is the posterior mode for β under a Gaussian prior.
Right: The lasso is the posterior mode for β under a double-exponential
prior.
Selecting the Tuning Parameter

Just as the subset selection approaches require a method to determine which of the models under consideration is best, implementing ridge regression and the lasso requires a method for selecting a value of the tuning parameter λ or, equivalently, the value of the constraint.
Cross-validation provides a simple way to tackle this problem.
We choose a grid of λ values, and compute the cross-validation error
for each value of λ.
We then select the tuning parameter value for which the cross-validation
error is smallest.
Finally, the model is re-fit using all of the available observations and
the selected value of the tuning parameter.
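A sketch of this procedure with scikit-learn (the grid, data, and fold count are illustrative; RidgeCV and LassoCV wrap exactly this choose-by-CV-then-refit logic):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 20))
beta = np.zeros(20); beta[:4] = [2.0, -1.5, 1.0, 0.5]
y = X @ beta + rng.normal(size=150)

lambdas = np.logspace(-3, 3, 100)                  # grid of candidate tuning parameters

ridge = RidgeCV(alphas=lambdas).fit(X, y)          # leave-one-out CV over the grid, then refit
lasso = LassoCV(alphas=lambdas, cv=10).fit(X, y)   # 10-fold CV over the grid, then refit

print("selected lambda (ridge):", ridge.alpha_)
print("selected lambda (lasso):", lasso.alpha_)
```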

References

The material used in these slides is borrowed from the following books. These slides may be used only for academic purposes.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer.

