for the corresponding y. From Ch 5.2, we learned that when the number of predictors is high (and especially when the observation-to-predictor ratio is low), regression models tend to overfit the training set. We face the same problem with smoothing splines, because each data point is effectively a "predictor". Graphically, an overfitted regression curve shows high overall slope variation, i.e. the first derivative of the curve changes frequently and by large amounts, which corresponds to large second derivatives. To moderate this, we add a penalty to our optimization goal (the RSS) so that we also minimize the slope variation, which is done by penalizing the second derivative:

\sum_{i=1}^{n} \big(y_i - g(x_i)\big)^2 + \lambda \int g''(t)^2 \, dt    (6.10)

Here λ is the tuning parameter: g becomes perfectly smooth (i.e. a straight line) as λ → ∞, and the term \int g''(t)^2 \, dt measures the overall slope variation. Although it seems that we would still have an overwhelmingly large df when the predictors are numerous, λ serves to reduce the effective df, which is a measure of the flexibility of the smoothing spline. The effective df of a smoothing spline, df_λ, is defined as follows:

\hat{g}_\lambda = S_\lambda y    (6.11)

df_\lambda = \sum_{i=1}^{n} \{S_\lambda\}_{ii}    (6.12)

Here \hat{g}_\lambda is an n-vector containing the fitted values of the smoothing spline at the training points x_1, ..., x_n. According to Eq. 6.11, this vector can be written as the product of an n × n matrix S_λ and the response vector y; in other words, there exists a linear transformation from y to its estimate \hat{g}_\lambda. The transformation matrix S_λ is then used to compute df_λ. To choose λ, we may use LOOCV: compute the following RSS_cv(λ) and select the λ for which its value is minimized. The computation makes use of the transformation matrix S_λ and is remarkably efficient, since only a single fit on the full data set is needed:

RSS_{cv}(\lambda) = \sum_{i=1}^{n} \big(y_i - \hat{g}_\lambda^{(-i)}(x_i)\big)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right]^2    (6.13)

In general, we would like a model with fewer degrees of freedom (i.e. fewer free parameters, and hence a simpler model).
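As a quick illustration (not part of the lab code in 6.7), a smoothing spline can be fit in R with the base function smooth.spline() on the ISLR Wage data: one fit prescribes the effective df directly, the other lets leave-one-out CV pick λ. The object names fit.ss and fit.ss2 are just placeholders.

library(ISLR)                                          # Wage data set (also used in the lab below)
fit.ss  = smooth.spline(Wage$age, Wage$wage, df=16)    # prescribe the effective df; the matching lambda is solved for
fit.ss2 = smooth.spline(Wage$age, Wage$wage, cv=TRUE)  # choose lambda by leave-one-out CV (cf. Eq. 6.13)
fit.ss2$df                                             # effective df of the CV-selected fit, i.e. trace of S_lambda (Eq. 6.12)
fit.ss2$lambda                                         # the selected tuning parameter
plot(Wage$age, Wage$wage, cex=.5, col='darkgrey')
lines(fit.ss,  col='red',  lwd=2)                      # df = 16
lines(fit.ss2, col='blue', lwd=2)                      # CV-chosen smoothness

Note that smooth.spline() reports df_λ alongside λ; the effective df is usually the easier of the two to interpret as a measure of flexibility.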
6.5 Local Regression

An alternative to regression splines is local regression. The idea goes as follows: for each target point x_0, find a set of x_i from a defined vicinity of x_0. The x_i are weighted (to indicate their "relative importance") by their distance from x_0, and the estimate \hat{y}_0 comes from the regression fitted on this set of x_i. This is illustrated graphically in Fig 6.5. The procedure of local regression is described in the following algorithm:

• Gather the fraction s = k/n of training points whose x_i are closest to x_0.
• Assign a weight K_{i0} = K(x_i, x_0) to each point in this neighborhood, so that the point furthest from x_0 has weight zero and the closest has the highest weight. All but these k nearest neighbors get weight zero.
• Fit a weighted least squares regression of the y_i on the x_i using the aforementioned weights, i.e. find \hat{\beta}_0 and \hat{\beta}_1 that minimize

\sum_{i=1}^{n} K_{i0} \big(y_i - \beta_0 - \beta_1 x_i\big)^2

• The fitted value at x_0 is given by \hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0.

Finally, note that local regression suffers from the same "neighbor sparsity" problem as the K-nearest neighbors approach in high dimensions. Recall that, in high dimensions, it is very difficult to find a set of neighbors close to the target data point.
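For a concrete sketch (again separate from the lab code below), base R's loess() implements local regression on the Wage data: degree=1 corresponds to the local linear fit in the algorithm above, and the span argument plays the role of s = k/n.

library(ISLR)                                               # Wage data
fit.lo  = loess(wage ~ age, span=.2, degree=1, data=Wage)   # each local fit uses ~20% of the points
fit.lo2 = loess(wage ~ age, span=.5, degree=1, data=Wage)   # wider neighborhood, smoother curve
age.grid = seq(from=min(Wage$age), to=max(Wage$age))
plot(Wage$age, Wage$wage, cex=.5, col='darkgrey')
lines(age.grid, predict(fit.lo,  data.frame(age=age.grid)), col='red',  lwd=2)
lines(age.grid, predict(fit.lo2, data.frame(age=age.grid)), col='blue', lwd=2)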
6.6 Generalized Additive Model

A Generalized Additive Model (GAM) is a compromise between linear models and non-parametric models. Specifically, its formulation allows individual predictors to be associated with the response non-linearly, but at the same time imposes a global structure on the model. Instead of giving each predictor a coefficient, as a linear model does (Eq. 6.15), a GAM replaces each term in the linear model with a non-linear function, β_j x_{ij} → f_j(x_{ij}) (Eq. 6.16):

y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i    (6.15)

y_i = \beta_0 + f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \epsilon_i    (6.16)

The pros and cons of GAMs are listed as follows:

• Pros of GAM
  – Incorporates non-linear relationships between predictors and the response that linear models miss.
  – Potentially more accurate fit.
  – The relationship between an individual predictor and the response can be studied while holding the other predictors fixed.
• Cons of GAM
  – Additive restriction: important interactions among variables, if there are any, can be missed.

The extension of GAMs to the qualitative setting is simple. This is demonstrated by Eq. 6.17 and Eq. 6.18:

\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p    (6.17)

\log \frac{p(X)}{1 - p(X)} = \beta_0 + f_1(X_1) + \cdots + f_p(X_p)    (6.18)
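A minimal sketch of fitting such models in R, assuming the gam package is installed (this is separate from the lab code below): s() specifies smoothing-spline terms as in Eq. 6.16, and family=binomial gives the logistic GAM of Eq. 6.18.

library(ISLR)
library(gam)                                   # assumed installed; provides gam() and s()
# Additive model with smoothing-spline terms (cf. Eq. 6.16)
gam.fit = gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage)
par(mfrow=c(1,3))
plot(gam.fit, se=TRUE, col='blue')             # one panel per fitted f_j, other predictors held fixed
# Logistic GAM for a binary response (cf. Eq. 6.18)
gam.logit = gam(I(wage > 250) ~ year + s(age, 5) + education, family=binomial, data=Wage)
summary(gam.logit)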
6.7 Lab Code

# Polynomial Regression I (linear)
library(ISLR)
attach(Wage)
fit = lm(wage ~ poly(age, 4), data=Wage)       # orthogonal polynomials*
coef(summary(fit))                             # print out
fit2  = lm(wage ~ poly(age, 4, raw=T), data=Wage)
fit2a = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data=Wage)
fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data=Wage)   # original/raw polynomial
                                               # fitting the same, coefs change
agelims = range(age)
age.grid = seq(from=agelims[1], to=agelims[2]) # integer grid over the range of age
preds = predict(fit, newdata=list(age=age.grid), se=TRUE)       # make prediction
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)  # standard error band at 2 se
par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0))  # 1 row 2 col grid
                                               # margin: (bottom,left,top,right); oma: outer margin
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
title('D-4 Poly', outer=T)
lines(age.grid, preds$fit, lwd=2, col='blue')          # add fit curve
matlines(age.grid, se.bands, lwd=1, col='blue', lty=3) # add standard error band
fit.1 = lm(wage ~ age, data=Wage)
fit.2 = lm(wage ~ poly(age, 2), data=Wage)
fit.3 = lm(wage ~ poly(age, 3), data=Wage)
fit.4 = lm(wage ~ poly(age, 4), data=Wage)
fit.5 = lm(wage ~ poly(age, 5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)       # model comparison for choosing the degree of the polynomial;
                                               # the cutting point is where the p-value becomes insignificant