CP 3
lines(age.grid, pred2$fit, col='red', lwd=2)
# Splines III (smoothing splines)
fit = smooth.spline(age, wage, df=16)
fit2 = smooth.spline(age, wage, cv=T)   # fit2$df: 6.8
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(fit, col='red', lwd=2)
lines(fit2, col='blue', lwd=2)
# Local Regression
fit = loess(wage~age, span=.2, data=Wage)   # span=.2: neighborhood consists of 20% of the observations
fit2 = loess(wage~age, span=.5, data=Wage)
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(age.grid, predict(fit, data.frame(age=age.grid)), col='red', lwd=2)
lines(age.grid, predict(fit2, data.frame(age=age.grid)), col='blue', lwd=2)
# GAM
library(splines)   # for ns(), if not already loaded
gam1 = lm(wage~ns(year,4)+ns(age,5)+education, data=Wage)   # ns(data, df, ...) for year & age; education as a regular qualitative predictor
library(gam)
gam.m3 = gam(wage~s(year,4)+s(age,5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=T, col='blue')   # 3 plots for 3 predictors; each shows the respective predictor's fit to the response
gam.m1 = gam(wage~s(age,5)+education, data=Wage)
gam.m2 = gam(wage~year+s(age,5)+education, data=Wage)
anova(gam.m1, gam.m2, gam.m3, test='F')   # model comparison
gam.lo = gam(wage~s(year,df=4)+lo(age,span=.7)+education, data=Wage)
gam.lo.i = gam(wage~lo(year,age,span=.5)+education, data=Wage)   # make use of local regression
gam.lr = gam(I(wage>250)~year+s(age,df=5)+education, family=binomial, data=Wage)
par(mfrow=c(1,3))
plot(gam.lr, se=T, col='green')   # logistic GAM
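As a brief follow-up to the listing above (an illustrative addition, not part of the original lab), the fitted GAMs can be summarized and used for prediction in the usual way; gam.m2 is chosen here only as an example.

summary(gam.m3)                          # significance tests for parametric and smooth terms
preds = predict(gam.m2, newdata = Wage)  # predictions on the training data
mean((Wage$wage - preds)^2)              # training MSE, as a rough sanity check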
7 Tree-Based Models

7.1 Decision Trees

7.1.1 Model of DT

In a typical Decision Tree task, we have n observations $x_1, ..., x_n$ and p predictors/parameters, and we would like to compute an estimate $\hat{y}_i$ for each response $y_i$. Graphically, the following example illustrates how the predictors Years and Hits are used to predict a baseball player's Salary [36]. In the example, each of the two predictors is split into two regions at an artificial cutting point chosen to minimize the RSS (defined below). The tree can also be represented as a graph of decision regions, as in Fig 7.2.

Having the basic setup of a decision tree task in mind, we now formulate the prediction rule and the optimization goal of a decision tree:

• Prediction
  – Given the set of possible values of the predictors $X_1, ..., X_p$, partition the values into J distinct and non-overlapping regions $R_1, ..., R_J$.
  – For every observation $x_i$ in region $R_j$, the prediction/estimate $\hat{y}_i$ is the mean of the training responses $y_i$ that fall in $R_j$.

• Optimization Goal
  $\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$   (7.1)

Essentially, in constructing a decision tree, we make two decisions:

• The cutting points $s_1, ..., s_k$ for each predictor $X_j$, by which each predictor gives two decision regions (see the sketch after this list):
  $R_1(j, s_j) = \{X \mid X_j < s_j\}$ and $R_2(j, s_j) = \{X \mid X_j \geq s_j\}$   (7.2)

• The sequence of predictors $X_1, ..., X_k$, where $k \leq p$, by which the partitioning of the decision space is done. The sequence should minimize the combined RSS over all decision regions:
  $\sum_{j \in J} \sum_{i:\, x_i \in R_j(j, s_j)} (y_i - \hat{y}_{R_j})^2$   (7.3)
In practice, it is clearly inefficient to scan through all possible sequences (i.e. all possible tree structures). Further, for the simplicity of the model, we would like to involve as few predictors (and thus decision regions) as possible, at a reasonable cost in RSS [37]. Therefore, the optimization goal in Eq. 7.1 is modified to include a penalty term so that the number of terminal nodes in the tree is also minimized (Eq. 7.4, where |T| is the number of terminal nodes and m indexes the decision regions).

$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$   (7.4)

[37] It is clear that the more predictors we use, the lower the RSS will be on the training set. This, however, risks overfitting our model.
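In R, this cost-complexity criterion is typically applied by growing a large tree and then pruning it. The following sketch uses the tree package in the same spirit as the labs above; the Hitters data and the choice of best=4 terminal nodes are illustrative assumptions, with cv.tree() supplying the cross-validated error used to pick the subtree size.

# Minimal sketch: grow a regression tree, then use cost-complexity pruning
# (Eq. 7.4) with cross-validation to choose the subtree size.
library(ISLR)
library(tree)

hit = na.omit(Hitters)
fit = tree(log(Salary) ~ Years + Hits, data = hit)   # grow the full tree
cv_fit = cv.tree(fit)                                # CV deviance for each subtree size
plot(cv_fit$size, cv_fit$dev, type = 'b')            # pick the size with the lowest deviance
pruned = prune.tree(fit, best = 4)                   # illustrative choice of 4 terminal nodes
plot(pruned); text(pruned, pretty = 0)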
The sequence selection can be carried out with some variation of a forward/backward/hybrid selection procedure (cf. Ch 5.1), which will not be elaborated here. To guard against overfitting, each tree is also subjected to cross-validation, where the MSE is computed to evaluate a particular tree's performance.

A tree used for a classification task differs from a regression tree in both the way in which a prediction is made and the optimization goal.

• Prediction
  Each observation is assigned to the most commonly occurring class among the training observations in its decision region.

• Optimization Goals
  – Classification Error Rate [38]
    $E = 1 - \max_k(\hat{p}_{mk})$   (7.5)
  – Gini Index [39]
    $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$   (7.6)
  – Cross-Entropy [40]
    $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$   (7.7)

[38] $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that are from the kth class.
[39] Gini is a measure of node purity, in the sense that a small Gini indicates that a node contains predominantly observations from a single class.
[40] Cross-Entropy also measures node purity.

In building a classification tree, the Gini Index or Cross-Entropy is used to evaluate the quality of a particular split. Although Gini and Cross-Entropy can also be used when pruning the tree, the Classification Error Rate is preferable when prediction accuracy of the final tree is the objective. Finally, note that node purity is important because it reduces the uncertainty in a decision when information is incomplete.
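To make Eqs. 7.5-7.7 concrete, here is a small illustrative helper (not from the text) that computes the three measures for a single node from its vector of class proportions $\hat{p}_{m1}, ..., \hat{p}_{mK}$.

# Minimal sketch: classification error, Gini index, and cross-entropy
# for one node, given the class proportions p_mk (Eqs. 7.5-7.7).
node_impurity = function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)                  # proportions must sum to 1
  c(error   = 1 - max(p),                            # Eq. 7.5
    gini    = sum(p * (1 - p)),                      # Eq. 7.6
    entropy = -sum(ifelse(p > 0, p * log(p), 0)))    # Eq. 7.7, with 0*log(0) treated as 0
}

node_impurity(c(0.7, 0.2, 0.1))   # a fairly pure node
node_impurity(c(1/3, 1/3, 1/3))   # the most impure node for K = 3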
7.1.2 DT: Pros & Cons

Many tasks can be approached with either a DT or a linear model, so we need to decide which one is better suited to a particular data set and task. A general rule of thumb is as follows: a linear model works better if the relationship between the predictors and the response is close to linear; if the relationship is highly non-linear and complex, a DT is the better bet. More generally, the pros and cons of DT are listed as follows: