STA302 Week12 Full

The document provides notes from a lecture on methods of data analysis. It discusses detecting and addressing multicollinearity, various model selection techniques including AIC, BIC, Mallow's Cp, cross-validation, all-subset selection and forward selection. All-subset selection considers all possible subsets of predictors and chooses the best based on a selection criterion. Forward selection starts with an intercept-only model and sequentially adds predictors, performing an F-test at each step.


STA302/1001 - Methods of Data Analysis I

(Week 12 lecture notes)

Wei (Becky) Lin

Nov. 28, 2016

Last Week

• Type I and Type III SS


• Use of Extra Sum of Squares in Tests for Regression coefficients
• Coefficient of Partial Determination
• Summary of Tests Concerning Regression Coefficients
• Multicollinearity and Its Effects

Week 12- Learning objectives & Outcomes

• More on multicollinearity
• Model selection
• Final review.

How to detect multicollinearity

Some of the common methods used for detecting multicollinearity include:


• The analysis shows typical signs of multicollinearity, for example
coefficient estimates that vary substantially from model to model.
• The t-tests for the individual slopes are all non-significant
(P > 0.05), but the overall F-test of whether all slopes are
simultaneously 0 is significant (P < 0.05).
• The correlations among pairs of predictor variables are large.
(Looking only at pairwise correlations is limiting: with
X1 = 1 + 2X2 + 5X5, a linear dependence exists among three or more
variables and need not show up in any single pair.)
• Variance Inflation Factor:
$$\mathrm{VIF}_k = \frac{\mathrm{Var}(b_k \mid X_1, \ldots, X_k, \ldots)}{\mathrm{Var}(b_k \mid X_k)}$$

Reference: https://onlinecourses.science.psu.edu/stat501/node/347
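As a rough illustration of the VIF, the sketch below computes it two ways in R. It assumes the built-in mtcars data and the predictors disp, hp and wt as stand-ins (they are not the lecture's data), and it relies on the standard identity VIF_k = 1/(1 − R_k²), where R_k² comes from regressing X_k on the other predictors; that identity is not stated on the slide.

```r
# Assumption: mtcars and these three predictors are illustrative stand-ins.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
X <- model.matrix(fit)[, -1]            # predictor columns, intercept removed

# Route 1: VIF_k = 1 / (1 - R_k^2), regressing X_k on the other predictors
vif_by_hand <- sapply(colnames(X), function(k) {
  others <- setdiff(colnames(X), k)
  r2 <- summary(lm(X[, k] ~ X[, others]))$r.squared
  1 / (1 - r2)
})

# Route 2: the variance ratio from the slide. With a common sigma^2,
# Var(b_k | all predictors) = sigma^2 * [(X'X)^{-1}]_kk and
# Var(b_k | X_k alone)      = sigma^2 / sum((X_k - mean(X_k))^2)
sxx <- colSums(scale(X, scale = FALSE)^2)
xtx_inv <- solve(crossprod(model.matrix(fit)))
vif_ratio <- diag(xtx_inv)[-1] * sxx

print(round(vif_by_hand, 3))
print(round(vif_ratio, 3))
```

Both routes should print the same numbers; values well above 1 suggest that the predictor is largely explained by the others.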

Solutions to multicollinearity

• If interest is only in mean response estimation and prediction,
multicollinearity can be ignored since it does not affect Ŷ or its
standard error (based on either $\widehat{\mathrm{Var}}(\hat{Y})$ or $\widehat{\mathrm{Var}}(\hat{Y} - Y)$).
• True only if the x_h at which we want estimation or prediction is within
the range of the data.
• If the goal is to establish association patterns between y and the
predictors, then the analyst can:
• Eliminate some predictors from the model.
• Design an experiment in which the pattern of correlation is broken.
• Use centered predictor variables in polynomial regression.
• x = 2, 3, 4, 5, 6 and x² = 4, 9, 16, 25, 36, cor(x, x²) ≈ 0.99.
• E(Y) = β0 + β1 x + β2 x².
• z = x − x̄ = −2, −1, 0, 1, 2; z² = 4, 1, 0, 1, 4, cor(z, z²) = 0
• E(Y) = γ0 + γ1 z + γ2 z².
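The centering example can be checked directly in R. This is a minimal sketch; the response y is simulated purely for illustration (an assumption, not lecture data) to show that centering changes the parameterization but not the fitted values.

```r
x <- c(2, 3, 4, 5, 6)
z <- x - mean(x)                 # centered predictor: -2, -1, 0, 1, 2

cor_raw <- cor(x, x^2)           # close to 1: x and x^2 are nearly collinear
cor_centered <- cor(z, z^2)      # 0: centering removes the linear association

# Hypothetical response, simulated only to compare the two fits
set.seed(1)
y <- 1 + 0.5 * x + 0.2 * x^2 + rnorm(length(x), sd = 0.1)

fit_raw <- lm(y ~ x + I(x^2))
fit_centered <- lm(y ~ z + I(z^2))

print(c(raw = cor_raw, centered = cor_centered))
print(all.equal(fitted(fit_raw), fitted(fit_centered)))
```

The two models span the same column space, so their fitted values agree; only the coefficients and their standard errors differ.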

Model Selection

In general

How to compare two non-nested models?


• Bias - variance trade-off
• Cp
• AIC
• Cross-validation
• BIC
How to search the space of possible models?
• Step-wise search
• Best subsets
• LASSO
• Bayesian methods

Model Selection - AIC
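Definition of AIC: The Akaike Information Criterion (AIC) for a model M
is defined as
$$\mathrm{AIC}(M) = n \log(\mathrm{SSE}_M / n) + 2(p_M + 1)$$
where p_M is the number of predictors in the model.
• We want to minimize the Kullback-Leibler distance between the true
density p(y) and the fitted density of model j:
$$K(p, \hat{p}_j) = \int p(y) \log \frac{p(y)}{\hat{p}_j(y; \theta)} \, dy = \int p(y) \log p(y) \, dy - \int p(y) \log \hat{p}_j(y; \theta) \, dy$$
• The first term does not depend on the model, so this is the same as
maximizing $K_j = \int p(y) \log \hat{p}_j(y; \theta) \, dy$.
• A good estimate of K_j is
$$\bar{K}_j = \frac{1}{n} \sum_{i=1}^{n} \log \hat{p}_j(y_i; \hat{\theta}_j) = \frac{1}{n} \ell_j(\hat{\theta}_j)$$
• Akaike showed that the bias of K̄_j is approximately d_j/n, where
d_j = dim(parameters); therefore
$$\hat{K}_j = \frac{\ell_j(\hat{\theta}_j)}{n} - \frac{d_j}{n}$$
and AIC_j = −2n K̂_j = −2 ℓ_j(θ̂_j) + 2 d_j. For a linear model with
normal errors this matches the definition above up to an additive
constant that is the same for every model.
• In R, use the functions AIC() and extractAIC(). Setting k = log(n)
in either gives the BIC value.
• AIC is the most commonly used criterion. It is used by the R
command step().
• The model with the smaller AIC is considered better.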
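The definition of AIC(M) can be reproduced by hand and compared with R's output. The sketch below assumes the built-in mtcars data and the model mpg ~ wt + hp as stand-ins for the lecture's data.

```r
# Assumption: mtcars and this model are illustrative stand-ins.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
sse <- sum(resid(fit)^2)
p <- length(coef(fit)) - 1        # number of predictors, intercept excluded

# AIC(M) = n * log(SSE_M / n) + 2 * (p_M + 1)
aic_by_hand <- n * log(sse / n) + 2 * (p + 1)

# extractAIC() returns c(edf, AIC) and uses the same formula for lm fits
aic_extract <- extractAIC(fit)[2]

# AIC() is based on the full normal log-likelihood and also counts sigma^2,
# so it differs from extractAIC() by a constant that does not depend on
# which predictors are in the model
aic_offset <- AIC(fit) - aic_extract

print(c(by_hand = aic_by_hand, extractAIC = aic_extract, offset = aic_offset))
```

Because the offset is the same for every model fitted to the same data, AIC() and extractAIC() rank models identically.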
Model Selection - BIC
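Definition of BIC: The Bayesian Information Criterion (BIC) for a model
M is defined as
$$\mathrm{BIC}(M) = n \log(\mathrm{SSE}_M / n) + \log(n)(p_M + 1)$$
• We put a prior π_j(θ_j) on the parameter θ_j, and a prior probability
p_j that M_j is the true model. Then
$$P(M_j \mid Y_1, \ldots, Y_n) \propto P(Y_1, \ldots, Y_n \mid M_j)\, p_j = p_j \int L(\theta_j) \pi_j(\theta_j) \, d\theta_j$$
• We choose j to maximize
$$\log \int L(\theta_j) \pi_j(\theta_j) \, d\theta_j + \log p_j$$
• Taylor series approximations show that
$$\log \int L(\theta_j) \pi_j(\theta_j) \, d\theta_j + \log p_j \approx \ell_j(\hat{\theta}_j) - \frac{d_j}{2} \log n = \mathrm{BIC}_j$$
(In this form larger is better; multiplying by −2 gives the form in
the definition above, for which smaller is better.)
• In contrast to AIC, BIC_M ≥ AIC_M when n > e² ≈ 7.39, because then
log(n) > 2.
• BIC penalizes additional predictors more heavily and so tends to
choose a simpler model than AIC.
• With the definition above, the model with the smaller BIC is
considered better.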
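A small check of the BIC definition and of the comparison with AIC, again assuming mtcars and the model mpg ~ wt + hp purely as stand-ins:

```r
# Assumption: mtcars and this model are illustrative stand-ins.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)                 # n = 32, which exceeds e^2
sse <- sum(resid(fit)^2)
p <- length(coef(fit)) - 1

# BIC(M) = n * log(SSE_M / n) + log(n) * (p_M + 1)
bic_by_hand <- n * log(sse / n) + log(n) * (p + 1)

# extractAIC() with k = log(n) applies the BIC penalty
bic_extract <- extractAIC(fit, k = log(n))[2]
aic_extract <- extractAIC(fit)[2]

# The two criteria differ only in the penalty: (log(n) - 2) * (p + 1)
penalty_gap <- bic_extract - aic_extract

print(c(BIC = bic_extract, AIC = aic_extract, gap = penalty_gap))
```

The gap is positive whenever log(n) > 2, which is why BIC favors smaller models for all but tiny samples.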
Model selection - Mallows’ Cp

If p predictor variables are selected from a set of K > p, the Mallows' Cp
statistic for that particular set of X's is defined as:

$$C_p = \mathrm{SSE}_p + 2p\hat{\sigma}^2$$

• SSE_p is the residual sum of squares on a training set of data.
• p is the number of predictor variables.
• σ̂² is the estimate of σ² (the error variance) obtained from the model
using all K predictor variables.
• Usual practice: plot C_p versus p and choose the model with the minimum.
• The model with the smaller C_p is considered better.
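The Cp formula above can be sketched as a small R helper. The function name mallows_cp, the mtcars data and the candidate models are all assumptions made for illustration; the formula is the one given on the slide, with σ̂² taken from the model using all K predictors.

```r
# Assumption: mtcars with K = 4 candidate predictors stands in for real data.
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
sigma2_hat <- summary(full)$sigma^2     # sigma^2 estimated from the full model

# Hypothetical helper: Cp = SSE_p + 2 * p * sigma2_hat
mallows_cp <- function(fit, sigma2) {
  p <- length(coef(fit)) - 1            # predictors, intercept excluded
  sse <- sum(resid(fit)^2)
  sse + 2 * p * sigma2
}

candidates <- list(
  "wt" = lm(mpg ~ wt, data = mtcars),
  "wt + hp" = lm(mpg ~ wt + hp, data = mtcars),
  "wt + hp + disp + qsec" = full
)

cp_values <- sapply(candidates, mallows_cp, sigma2 = sigma2_hat)
print(cp_values)
print(names(which.min(cp_values)))      # candidate with the smallest Cp
```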

Model selection - Cross-validation

Idea: we divide the data into a training sample and a testing sample. The
training sample is used to fit the regression model, while the testing sample
is used to assess how accurately the fitted model predicts.
K-fold CV
• Randomly divide your observations into K parts. Each part should
have roughly n/K observations.
• For each part
• Define this part to be your testing sample.
• Define all other parts to be your training sample.
• Fit the model using only the training sample
• Compute the prediction MSE, denoted PMSE:
$$\mathrm{PMSE} = \frac{1}{\text{size of testing sample}} \sum_{i \in \text{testing sample}} (Y_i - \hat{Y}_i)^2$$

• Take the average of the K PMSE computed in the loop.
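The K-fold steps above can be sketched in base R as follows. The function name kfold_pmse, its seed argument and the use of mtcars are assumptions for illustration; the loop itself follows the procedure listed on the slide.

```r
# Hypothetical helper: average prediction MSE over K folds for an lm formula
kfold_pmse <- function(formula, data, K = 5, seed = 1) {
  set.seed(seed)                                   # reproducible fold assignment
  n <- nrow(data)
  fold <- sample(rep(seq_len(K), length.out = n))  # roughly n/K rows per fold
  pmse <- numeric(K)
  for (k in seq_len(K)) {
    train <- data[fold != k, , drop = FALSE]       # all other parts
    test <- data[fold == k, , drop = FALSE]        # this part is the testing sample
    fit <- lm(formula, data = train)               # fit on the training sample only
    observed <- model.response(model.frame(formula, data = test))
    predicted <- predict(fit, newdata = test)
    pmse[k] <- mean((observed - predicted)^2)      # PMSE for this fold
  }
  mean(pmse)                                       # average of the K PMSE values
}

# Assumption: mtcars and these two models are illustrative stand-ins.
cv_small <- kfold_pmse(mpg ~ wt, mtcars)
cv_large <- kfold_pmse(mpg ~ wt + hp, mtcars)
print(c(small = cv_small, large = cv_large))
```

Using the same seed for every candidate model keeps the folds identical, so the PMSE values are compared on the same splits.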

Model selection - Cross-validation (contd.)

• 5-fold CV

• For possible models in consideration, we compute the K-fold CV and


obtain the PMSE for each model. We generally choose the model
with the smallest PMSE . In practice, K is chosen to be 5, although it
can depend on the initial sample size.
Model selection procedure - All-subset selection

All-subset Selection
• Suppose we have p X's, X1, ..., Xp, and we want to choose the best
subset of them by a criterion.
• Try every subset of X1, ..., Xp. There are 2^p subsets in total
because each X can be either in or out of the model.
• Use one of the information criteria outlined earlier and choose the
subset with the lowest value.
• Usual practice: this procedure is only practical for small p.
• The R command regsubsets() in the leaps package implements this
procedure.
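To make the 2^p count concrete, here is a brute-force sketch in base R that fits every subset and ranks them by AIC. It is not how regsubsets() works internally; mtcars and the three predictors are assumed stand-ins, and AIC is just one possible criterion.

```r
# Assumption: mtcars with p = 3 candidate predictors is an illustrative stand-in.
predictors <- c("wt", "hp", "disp")

# Each row is one subset: TRUE means the predictor is in the model (2^p rows)
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
names(inclusion) <- predictors

results <- data.frame(model = character(nrow(inclusion)),
                      aic = numeric(nrow(inclusion)),
                      stringsAsFactors = FALSE)

for (i in seq_len(nrow(inclusion))) {
  chosen <- predictors[unlist(inclusion[i, ])]
  rhs <- if (length(chosen) == 0) "1" else paste(chosen, collapse = " + ")
  model_text <- paste("mpg ~", rhs)
  fit <- lm(as.formula(model_text), data = mtcars)
  results$model[i] <- model_text
  results$aic[i] <- extractAIC(fit)[2]     # n*log(SSE/n) + 2*(p + 1)
}

best <- results[which.min(results$aic), ]
print(results[order(results$aic), ])
print(best$model)
```

With p = 3 there are only 8 fits, but the count doubles with each added predictor, which is why the procedure is limited to small p.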

Example: all-subset selection

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
n <- dim(data)[1]
head(data)

## CRIM ZN INDUSTRY CHAR NOX NROOMS AGE DIS RAD TAX PTRATIO B
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## LSAT MEDV
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7

### Run All-Subsets ###


library(leaps)
model.allsubsets = regsubsets(log(MEDV) ~ INDUSTRY + NROOMS + AGE + TAX + CRIM,data=data)

Example: all-subset selection (contd.)

Model selection procedure - Forward Selection

1. Start with the most parsimonious model Yi = β0 + εi


2. For the current model
• For each Xk that is left, add it to the model and perform an F test
comparing the current model (i.e. the reduced model) with the model
that includes Xk (i.e. the full model)
• Choose the Xj with the lowest p-value (or largest F observed value)
• If this p-value is lower than a pre-specified significance level (e.g. α =
0.05), include it into the model, declare this as your current model,
and repeat the procedure. Otherwise, declare the current model as
your final model.
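The forward-selection steps above can be sketched as a loop around add1(). This is an illustration under assumptions: mtcars and the candidate names stand in for the lecture's data, and α = 0.05 is the example level from the slide.

```r
# Assumption: mtcars and these candidate predictors are illustrative stand-ins.
alpha <- 0.05
candidates <- c("wt", "hp", "disp", "qsec")
current <- lm(mpg ~ 1, data = mtcars)          # most parsimonious model

repeat {
  in_model <- attr(terms(current), "term.labels")
  remaining <- setdiff(candidates, in_model)
  if (length(remaining) == 0) break            # nothing left to add

  # F test for adding each remaining predictor to the current model
  tab <- add1(current, scope = reformulate(remaining), test = "F")
  tab <- tab[-1, , drop = FALSE]               # drop the "<none>" row

  best <- which.min(tab[["Pr(>F)"]])           # lowest p-value
  if (tab[["Pr(>F)"]][best] >= alpha) break    # no significant addition: stop

  add_term <- rownames(tab)[best]
  current <- update(current, as.formula(paste(". ~ . +", add_term)))
}

print(formula(current))
```

The explicit check on remaining matters because add1() stops with an error when the scope contains no new terms.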

Example: Forward Selection

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run forward-selection ###


currentmod = lm(log(MEDV)~1,data=data )
add1(currentmod,~INDUSTRY + NROOMS + AGE + TAX + CRIM,test="F",data=data)

## Single term additions


##
## Model:
## log(MEDV) ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 84.376 -904.37
## INDUSTRY 1 24.746 59.630 -1078.02 209.16 < 2.2e-16 ***
## NROOMS 1 33.704 50.672 -1160.39 335.23 < 2.2e-16 ***
## AGE 1 17.347 67.029 -1018.83 130.43 < 2.2e-16 ***
## TAX 1 26.599 57.777 -1093.99 232.03 < 2.2e-16 ***
## CRIM 1 23.518 60.858 -1067.70 194.76 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# and repeat the procedure


# currentmod = lm(log(MEDV) ~ NROOMS, data=data)
# add1( currentmod,~INDUSTRY + AGE + TAX + CRIM,test="F",data=data)
#...

Example: Forward Selection (contd.)

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
### Run automatic forward-selection ###
# no predictor in the model
nullmod= lm (log(MEDV)~1, data=data)
# largest model considered (upper end of the scope)
fullmod = lm( log(MEDV)~INDUSTRY + AGE + TAX,data=data)
# forward selection method
forwards = step(nullmod ,scope=list(lower=formula(nullmod),
upper=formula(fullmod)), direction="forward")
formula(forwards)
# may NOT give the same result as the manual F-test procedure, since step() uses AIC by default, not F tests!

Model selection procedure - Backward Elimination

1. Start with the least parsimonious model


Yi = β0 + β1 X1 + . . . + βp Xp + εi
2. For the current model
• For each Xk that's in the current model, drop it from the model and
perform an F test comparing the current model (i.e. the full model)
with the model that excludes Xk (i.e. the reduced model)
• Choose the Xj with the largest p-value (or smallest F observed value)
• If this p-value is larger than a pre-specified significance level (e.g. α =
0.05), remove Xj from the model, declare this as your current model,
and repeat the procedure. Otherwise, declare the current model as
your final model.
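The backward-elimination steps above can be sketched as a loop around drop1(). As before, mtcars and the starting model are assumed stand-ins, and α = 0.05 is the example level from the slide.

```r
# Assumption: mtcars and this starting model are illustrative stand-ins.
alpha <- 0.05
current <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)   # least parsimonious

repeat {
  in_model <- attr(terms(current), "term.labels")
  if (length(in_model) == 0) break             # only the intercept is left

  # F test for dropping each predictor from the current model
  tab <- drop1(current, test = "F")
  tab <- tab[-1, , drop = FALSE]               # drop the "<none>" row

  worst <- which.max(tab[["Pr(>F)"]])          # largest p-value
  if (tab[["Pr(>F)"]][worst] <= alpha) break   # everything is significant: stop

  drop_term <- rownames(tab)[worst]
  current <- update(current, as.formula(paste(". ~ . -", drop_term)))
}

print(formula(current))
```

At exit, every predictor still in the model has a drop1() p-value at or below α, or no predictors remain.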

Example: Backward selection

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run backward-selection ###


currentmod =lm(log(MEDV) ~ INDUSTRY + NROOMS + AGE + TAX,data=data )
drop1(currentmod,test="F",data=data)

## Single term deletions


##
## Model:
## log(MEDV) ~ INDUSTRY + NROOMS + AGE + TAX
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 35.864 -1329.3
## INDUSTRY 1 0.0001 35.864 -1331.3 0.0021 0.9636
## NROOMS 1 17.4624 53.327 -1130.5 243.9374 < 2.2e-16 ***
## AGE 1 1.3351 37.199 -1312.8 18.6507 1.893e-05 ***
## TAX 1 4.4342 40.299 -1272.3 61.9429 2.200e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Choose the X with the highest p-value (lowest F value): here INDUSTRY, with p = 0.9636 > 0.05.
# Refit without INDUSTRY and repeat the procedure
# model.current = lm(log(MEDV) ~ NROOMS + AGE + TAX,data=data)
# drop1( model.current,test="F",data=data)
# ...

Example: Backward Elimination (contd.)
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run automatic backward elimination ###


# no predictor in the model
nullmod= lm (log(MEDV)~1, data=data)
# with all predictors in the model
fullmod = lm( log(MEDV)~INDUSTRY + AGE + TAX + CRIM,data=data)
# backward elimination method
backward = step(fullmod ,scope=list(lower=formula(nullmod),
upper=formula(fullmod)), direction="backward")

## Start: AIC=-1177.23
## log(MEDV) ~ INDUSTRY + AGE + TAX + CRIM
##
## Df Sum of Sq RSS AIC
## <none> 48.436 -1177.2
## - AGE 1 0.7017 49.137 -1172.0
## - TAX 1 0.8831 49.319 -1170.1
## - INDUSTRY 1 1.5685 50.004 -1163.1
## - CRIM 1 4.8912 53.327 -1130.5

formula(backward)

## log(MEDV) ~ INDUSTRY + AGE + TAX + CRIM

# may NOT give the same result as the manual F-test procedure, since step() uses AIC by default, not F tests!
Model selection procedure - Stepwise Selection

1. Start with some model. In R, this usually is the least
parsimonious model Yi = β0 + β1 X1 + . . . + βp Xp + εi
2. For the current model, compute the information criterion.
• Consider all the variables that can be removed. For each of the
variables removed, compute the information criterion.
• Consider all the variables that can be added. For each of the variables
added, compute the information criterion.
• From the models formulated by removing or adding predictors, choose
the model with the smallest information criterion.
• If the information criterion associated with this model has a smaller
value than current model, replace the current model with this one.
Otherwise, declare the current model as the final model.
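Stepwise selection can start from any model between the two ends of the scope. The sketch below starts from a deliberately arbitrary middle model so that both additions and removals are possible; mtcars, the candidate predictors and the starting model are assumptions for illustration.

```r
# Assumption: mtcars and these models are illustrative stand-ins.
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
start_model <- lm(mpg ~ disp + qsec, data = mtcars)   # arbitrary starting point

# At each step, step() compares the AIC of every single-term addition and
# removal allowed by the scope and moves only if the AIC decreases
chosen <- step(start_model,
               scope = list(lower = formula(null_model),
                            upper = formula(full_model)),
               direction = "both", trace = 0)

print(formula(chosen))
print(c(start = extractAIC(start_model)[2], chosen = extractAIC(chosen)[2]))
```

Setting trace = 0 only suppresses the step-by-step tables; leave it at the default to see each add/drop comparison.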

Example: Stepwise Selection
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run automatic stepwise selection ###


# no predictor in the model
nullmod= lm (log(MEDV)~1, data=data)
# with all predictors in the model
fullmod = lm( log(MEDV)~INDUSTRY + AGE + TAX,data=data)
# stepwise selection method using AIC
stepwisemod = step(fullmod ,scope=list(lower=formula(nullmod),
upper=formula(fullmod)), direction="both")

## Start: AIC=-1130.55
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC
## <none> 53.327 -1130.5
## - AGE 1 1.1608 54.488 -1121.7
## - INDUSTRY 1 1.2069 54.534 -1121.2
## - TAX 1 4.7344 58.061 -1089.5

formula(stepwisemod)

## log(MEDV) ~ INDUSTRY + AGE + TAX

# may NOT give the same result as the manual F-test procedure, since step() uses AIC by default, not F tests!
Example: Stepwise Selection (contd.)

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run automatic Stepwise-Selection ###


nullmod= lm (log(MEDV)~1, data=data)
fullmod = lm( log(MEDV)~INDUSTRY + AGE + TAX,data=data)
# Stepwise selection with F tests printed: test="F" is passed on to add1()/drop1()
# for display only; step() still selects by AIC
stepwisemod = step(fullmod ,scope=list(lower=formula(nullmod),
upper=formula(fullmod)), direction="both",test="F")

## Start: AIC=-1130.55
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 53.327 -1130.5
## - AGE 1 1.1608 54.488 -1121.7 10.928 0.0010151 **
## - INDUSTRY 1 1.2069 54.534 -1121.2 11.361 0.0008077 ***
## - TAX 1 4.7344 58.061 -1089.5 44.568 6.522e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

formula(stepwisemod)

## log(MEDV) ~ INDUSTRY + AGE + TAX

Example: Stepwise Selection (contd.)

data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)

### Run automatic Stepwise-Selection ###


nullmod= lm (log(MEDV)~1, data=data)
fullmod = lm( log(MEDV)~INDUSTRY + AGE + TAX,data=data)
# Stepwise selection method using BIC (penalty k = log(n))
n <- nrow(data)
stepwisemod = step(fullmod ,scope=list(lower=formula(nullmod),
upper=formula(fullmod)), direction="both",k=log(n))

## Start: AIC=-1113.64
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC
## <none> 53.327 -1113.6
## - AGE 1 1.1608 54.488 -1109.0
## - INDUSTRY 1 1.2069 54.534 -1108.5
## - TAX 1 4.7344 58.061 -1076.8

formula(stepwisemod)

## log(MEDV) ~ INDUSTRY + AGE + TAX

That’s all ! You are..

Final Exam

• Cover page and Formula page (check out portal)


• Coverage on entire term from lecture 1 to 12 (very little on lecture 12)
• Question types: multiple choice, short answer, proofs, data analysis
• 25% on proofs ( more on MLR )

Final Exam

Suggestions:
• Practise all proofs in slides and extra assigned questions: All of them.
• Know how to read and interpret R output
• Summary, anova output
• Diagnostics plots
• Other output that you have seen from slides.
• Review 3 assignments and midterm paper
• Doing some old exams might help you see which topics are important

Final Exam: topics that you could skip

• Week 2:
• Different criteria for regression: reverse/orthogonal/reduced major
axis regression.
• How to find MLE
• Review on inference
• Week 4: Normal correlation model
• Week 8: Review on matrices (slide 6-16): skip only if you have good
knowledge of matrices
• Week 10: Geometric perspective of LS regression(slides 4-6).
• Week 11: Type III SS.
• Week 12: Model selection.

GOOD LUCK!
