CP 3
lines(age.grid, pred2$fit, col='red', lwd=2)
# Splines III (smoothing splines)
fit = smooth.spline(age, wage, df=16)
fit2 = smooth.spline(age, wage, cv=T)   # fit2$df: 6.8
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(fit, col='red', lwd=2)
lines(fit2, col='blue', lwd=2)
# Local Regression
fit = loess(wage~age, span=.2, data=Wage)   # span=.2: neighborhood consists of 20% of the observations
fit2 = loess(wage~age, span=.5, data=Wage)
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(age.grid, predict(fit, data.frame(age=age.grid)), col='red', lwd=2)
lines(age.grid, predict(fit2, data.frame(age=age.grid)), col='blue', lwd=2)
# GAM
library(splines)   # for ns(), if not already loaded
gam1 = lm(wage~ns(year,4)+ns(age,5)+education, data=Wage)   # ns(data, df, ...) for year & age; education as a regular qualitative predictor
library(gam)
gam.m3 = gam(wage~s(year,4)+s(age,5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=T, col='blue')   # 3 plots for 3 predictors; each shows the respective predictor's fit to the response
gam.m1 = gam(wage~s(age,5)+education, data=Wage)
gam.m2 = gam(wage~year+s(age,5)+education, data=Wage)
anova(gam.m1, gam.m2, gam.m3, test='F')   # model comparison
gam.lo = gam(wage~s(year,df=4)+lo(age,span=.7)+education, data=Wage)
gam.lo.i = gam(wage~lo(year,age,span=.5)+education, data=Wage)   # make use of local regression
gam.lr = gam(I(wage>250)~year+s(age,df=5)+education, family=binomial, data=Wage)
par(mfrow=c(1,3))
plot(gam.lr, se=T, col='green')   # logistic GAM
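As a brief follow-up to the listing above (an illustrative addition, not part of the original lab), the fitted GAMs can be summarized and used for prediction in the usual way; gam.m2 is chosen here only as an example.

summary(gam.m3)                          # significance tests for parametric and smooth terms
preds = predict(gam.m2, newdata = Wage)  # predictions on the training data
mean((Wage$wage - preds)^2)              # training MSE, as a rough sanity check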
7 Tree-Based Models

7.1 Decision Trees

7.1.1 Model of DT

In a typical Decision Tree task, we have n observations $x_1, ..., x_n$ and p predictors/parameters, and we would like to compute an estimate $\hat{y}_i$ for each response $y_i$. Graphically, the following example illustrates how the predictors Years and Hits are used to predict a baseball player's Salary [36]. In the example, each of the two predictors is split into two regions at an artificial cutting point chosen to minimize the RSS (defined below). The tree can also be represented as a graph of decision regions, as in Fig 7.2.

Having the basic setup of a decision tree task in mind, we now formulate the prediction rule and the optimization goal of a decision tree:

• Prediction
  – Given the set of possible values of the predictors $X_1, ..., X_p$, partition the values into J distinct and non-overlapping regions $R_1, ..., R_J$.
  – For every observation $x_i$ in region $R_j$, the prediction/estimate $\hat{y}_i$ is the mean of the training responses $y_i$ that fall in $R_j$.

• Optimization Goal
  $\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$   (7.1)

Essentially, in constructing a decision tree, we make two decisions:

• The cutting points $s_1, ..., s_k$ for each predictor $X_j$, by which each predictor gives two decision regions (see the sketch after this list):
  $R_1(j, s_j) = \{X \mid X_j < s_j\}$ and $R_2(j, s_j) = \{X \mid X_j \geq s_j\}$   (7.2)

• The sequence of predictors $X_1, ..., X_k$, where $k \leq p$, by which the partitioning of the decision space is done. The sequence should minimize the combined RSS over all decision regions:
  $\sum_{j \in J} \sum_{i:\, x_i \in R_j(j, s_j)} (y_i - \hat{y}_{R_j})^2$   (7.3)
In practice, it is clearly inefficient to scan through all possible sequences (i.e. all possible tree structures). Further, for the simplicity of the model, we would like to involve as few predictors (and thus decision regions) as possible, at a reasonable cost in RSS [37]. Therefore, the optimization goal in Eq. 7.1 is modified to include a penalty term so that the number of terminal nodes in the tree is also minimized (Eq. 7.4, where |T| is the number of terminal nodes and m indexes the decision regions).

$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$   (7.4)

[37] It is clear that the more predictors we use, the lower the RSS will be on the training set. This, however, risks overfitting our model.
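In R, this cost-complexity criterion is typically applied by growing a large tree and then pruning it. The following sketch uses the tree package in the same spirit as the labs above; the Hitters data and the choice of best=4 terminal nodes are illustrative assumptions, with cv.tree() supplying the cross-validated error used to pick the subtree size.

# Minimal sketch: grow a regression tree, then use cost-complexity pruning
# (Eq. 7.4) with cross-validation to choose the subtree size.
library(ISLR)
library(tree)

hit = na.omit(Hitters)
fit = tree(log(Salary) ~ Years + Hits, data = hit)   # grow the full tree
cv_fit = cv.tree(fit)                                # CV deviance for each subtree size
plot(cv_fit$size, cv_fit$dev, type = 'b')            # pick the size with the lowest deviance
pruned = prune.tree(fit, best = 4)                   # illustrative choice of 4 terminal nodes
plot(pruned); text(pruned, pretty = 0)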
The sequence selection can be carried out with some variation of a forward/backward/hybrid selection procedure (cf. Ch 5.1), which will not be elaborated here. To guard against overfitting, each tree is also subjected to cross-validation, where the MSE is computed to evaluate a particular tree's performance.

A tree used for a classification task differs from a regression tree in both the way in which a prediction is made and the optimization goal.

• Prediction
  Each observation is assigned to the most commonly occurring class among the training observations in its decision region.

• Optimization Goals
  – Classification Error Rate [38]
    $E = 1 - \max_k(\hat{p}_{mk})$   (7.5)
  – Gini Index [39]
    $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$   (7.6)
  – Cross-Entropy [40]
    $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$   (7.7)

[38] $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that are from the kth class.
[39] Gini is a measure of node purity, in the sense that a small Gini indicates that a node contains predominantly observations from a single class.
[40] Cross-Entropy also measures node purity.

In building a classification tree, the Gini Index or Cross-Entropy is used to evaluate the quality of a particular split. Although Gini and Cross-Entropy can also be used when pruning the tree, the Classification Error Rate is preferable when prediction accuracy of the final tree is the objective. Finally, note that node purity is important because it reduces the uncertainty in a decision when information is incomplete.
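To make Eqs. 7.5-7.7 concrete, here is a small illustrative helper (not from the text) that computes the three measures for a single node from its vector of class proportions $\hat{p}_{m1}, ..., \hat{p}_{mK}$.

# Minimal sketch: classification error, Gini index, and cross-entropy
# for one node, given the class proportions p_mk (Eqs. 7.5-7.7).
node_impurity = function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)                  # proportions must sum to 1
  c(error   = 1 - max(p),                            # Eq. 7.5
    gini    = sum(p * (1 - p)),                      # Eq. 7.6
    entropy = -sum(ifelse(p > 0, p * log(p), 0)))    # Eq. 7.7, with 0*log(0) treated as 0
}

node_impurity(c(0.7, 0.2, 0.1))   # a fairly pure node
node_impurity(c(1/3, 1/3, 1/3))   # the most impure node for K = 3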
7.1.2 DT: Pros & Cons

Many tasks can be approached with either a DT or a linear model, so we need to decide which one is better suited to a particular data set and task. A general rule of thumb is as follows: a linear model works better if the relationship between the predictors and the response is close to linear; if the relationship is highly non-linear and complex, a DT is the better bet. More generally, the pros and cons of DT are listed as follows: