STA302 Week12 Full
Last Week
Week 12 - Learning objectives & Outcomes
• More on multicollinearity
• Model selection
• Final review.
How to detect multicollinearity
Reference: https://onlinecourses.science.psu.edu/stat501/node/347
Solutions to multicollinearity
Model Selection
In general
Model Selection - AIC
Definition of AIC: Akaike Information Criterion (or AIC) for a model M
is defined as
AIC (M) = n log(SSEM /n) + 2(pM + 1)
where pM is the number of predictors in the model.
• Want to minimize the Kullback-Leibler distance (p(y) is the true model for y):
K(p, p̂j) = ∫ p(y) log [ p(y) / p̂j(y; θ) ] dy = ∫ p(y) log p(y) dy − ∫ p(y) log p̂j(y; θ) dy
• Since the first term does not depend on the model, this is the same as maximizing Kj = ∫ p(y) log p̂j(y; θ) dy
• a good estimate of Kj is K̄j = (1/n) Σi log p̂j(yi; θ̂j) = (1/n) ℓj(θ̂j), the scaled log-likelihood
• Akaike showed that the bias of K̄j is approximately dj/n, where dj = dim(parameters); the bias-corrected estimate is therefore
K̂j = ℓj(θ̂j)/n − dj/n
and maximizing K̂j is equivalent to minimizing −2n K̂j = −2 ℓj(θ̂j) + 2dj, i.e. the AIC.
• In R, the functions AIC() and extractAIC() compute the AIC; setting k = log(n) gives the BIC value.
• AIC is the most commonly used evaluator. It is used in the R command step().
• The model with the smaller AIC is considered better.
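A minimal sketch of these functions on a candidate model (the formula is illustrative; `data` and `n` are as defined in the examples later in the deck):
mod <- lm(log(MEDV) ~ INDUSTRY + AGE + TAX, data = data)  # illustrative model, an assumption
AIC(mod)                     # likelihood-based AIC
extractAIC(mod)              # (edf, AIC) with AIC = n*log(SSE/n) + 2*edf, matching the formula above
extractAIC(mod, k = log(n))  # k = log(n) gives the BIC value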
Model Selection - BIC
Definition of BIC: Bayesian Information Criterion (or BIC) for a model
M is defined as
BIC (M) = n log(SSEM /n) + log(n)(pM + 1)
• We put a prior πj(θj) on the parameter θj, and a prior probability pj that Mj is the true model. Then
P(Mj | Y1, . . . , Yn) ∝ P(Y1, . . . , Yn | Mj) pj = pj ∫ L(θj) πj(θj) dθj
• We choose j to maximize
log ∫ L(θj) πj(θj) dθj + log pj
• A Laplace approximation of the integral gives ℓj(θ̂j) − (dj/2) log n up to bounded terms, so maximizing this posterior criterion is approximately the same as minimizing BIC.
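A sketch of comparing two candidate models by BIC in R (both formulas are illustrative, not from the slides):
mod1 <- lm(log(MEDV) ~ INDUSTRY + AGE + TAX, data = data)  # illustrative candidates
mod2 <- lm(log(MEDV) ~ INDUSTRY + TAX, data = data)
BIC(mod1)
BIC(mod2)  # the model with the smaller BIC is preferred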
Model Selection - Cp
The Cp criterion (a version of Mallows' Cp), where σ̂² is the error variance estimate from the full model and p is the number of predictors in the candidate model:
Cp = SSEp + 2pσ̂²
Model selection - Cross-validation
Idea: we divide the data into a training sample and a testing sample. The training sample is used to fit the regression model, while the testing sample is used to assess how accurately the model predicts.
K-fold CV
• Randomly divide your observations into K parts. Each part should
have roughly n/K observations.
• For each part:
• Define this part to be your testing sample.
• Define all other parts to be your training sample.
• Fit the model using only the training sample.
• Compute the prediction MSE, denoted PMSE:
PMSE = (1 / size of testing sample) Σ_{i ∈ testing sample} (Yi − Ŷi)²
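A minimal sketch of K-fold CV for a linear model (assumes `data` is loaded as in the examples below; the model formula is illustrative):
set.seed(302)
K <- 5
n <- nrow(data)
fold <- sample(rep(1:K, length.out = n))  # random fold labels, roughly n/K observations each
pmse <- numeric(K)
for (k in 1:K) {
  test  <- data[fold == k, ]   # testing sample
  train <- data[fold != k, ]   # training sample
  fit   <- lm(log(MEDV) ~ INDUSTRY + AGE + TAX, data = train)  # illustrative model
  pred  <- predict(fit, newdata = test)
  pmse[k] <- mean((log(test$MEDV) - pred)^2)  # prediction MSE on the held-out fold
}
mean(pmse)  # overall CV estimate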
Model selection - Cross-validation (contd.)
• 5-fold CV
All-subset Selection
• Suppose we have p X's and we want to choose the best subset of X1, . . . , Xp by some criterion.
• Try every subset of X1, . . . , Xp. There will be a total of 2^p subsets, because each X can be either in or out of the model.
• Using one of the information criteria outlined before, choose the subset with the lowest value.
• Usual practice: this procedure is only practical for small p.
• The R command regsubsets() in the leaps package implements this
procedure.
Example: all-subset selection
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
n <- dim(data)[1]
head(data)
## CRIM ZN INDUSTRY CHAR NOX NROOMS AGE DIS RAD TAX PTRATIO B
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## LSAT MEDV
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Example: all-subset selection (contd.)
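A minimal sketch of the all-subset search with regsubsets() (the candidate predictor set shown is an assumption):
library(leaps)
allsub <- regsubsets(log(MEDV) ~ CRIM + INDUSTRY + NROOMS + AGE + TAX,  # assumed candidates
                     data = data)
summary(allsub)$outmat  # which X's enter the best model of each size
summary(allsub)$bic     # BIC of the best model of each size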
Model selection procedure - Forward Selection
Example: Forward Selection
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
Example: Forward Selection (contd.)
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
### Run automatic forward selection ###
# no predictors in the model
nullmod <- lm(log(MEDV) ~ 1, data = data)
# all candidate predictors in the model
fullmod <- lm(log(MEDV) ~ INDUSTRY + AGE + TAX, data = data)
# forward selection from nullmod toward fullmod
forwards <- step(nullmod, scope = list(lower = formula(nullmod),
                 upper = formula(fullmod)), direction = "forward")
formula(forwards)
# will NOT give the same results as the manual F-test approach, since step() uses AIC, not F tests!
Model selection procedure - Backward Elimination
Example: Backward selection
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
# Choose the X (here INDUSTRY) with the highest p-value, i.e. the lowest F value.
# Refit the model without INDUSTRY and repeat the procedure:
# model.current <- lm(log(MEDV) ~ NROOMS + AGE + TAX, data = data)
# drop1(model.current, test = "F")
# ...
Example: Backward Elimination (contd.)
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
## Start: AIC=-1177.23
## log(MEDV) ~ INDUSTRY + AGE + TAX + CRIM
##
## Df Sum of Sq RSS AIC
## <none> 48.436 -1177.2
## - AGE 1 0.7017 49.137 -1172.0
## - TAX 1 0.8831 49.319 -1170.1
## - INDUSTRY 1 1.5685 50.004 -1163.1
## - CRIM 1 4.8912 53.327 -1130.5
formula(backward)
# will NOT give the same results as the manual F-test approach, since step() uses AIC, not F tests!
Model selection procedure - Stepwise Selection
Example: Stepwise Selection
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
## Start: AIC=-1130.55
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC
## <none> 53.327 -1130.5
## - AGE 1 1.1608 54.488 -1121.7
## - INDUSTRY 1 1.2069 54.534 -1121.2
## - TAX 1 4.7344 58.061 -1089.5
formula(stepwisemod)
# will NOT give the same results as the manual F-test approach, since step() uses AIC, not F tests!
Example: Stepwise Selection (contd.)
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
## Start: AIC=-1130.55
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 53.327 -1130.5
## - AGE 1 1.1608 54.488 -1121.7 10.928 0.0010151 **
## - INDUSTRY 1 1.2069 54.534 -1121.2 11.361 0.0008077 ***
## - TAX 1 4.7344 58.061 -1089.5 44.568 6.522e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
formula(stepwisemod)
Example: Stepwise Selection (contd.)
data=read.table("/Users/Wei/TA/Teaching/0-STA302-2016F/Week12-Nov28/BostonHousing.txt",
sep=",",header=T)
## Start: AIC=-1113.64
## log(MEDV) ~ INDUSTRY + AGE + TAX
##
## Df Sum of Sq RSS AIC
## <none> 53.327 -1113.6
## - AGE 1 1.1608 54.488 -1109.0
## - INDUSTRY 1 1.2069 54.534 -1108.5
## - TAX 1 4.7344 58.061 -1076.8
formula(stepwisemod)
That’s all! You are..
Final Exam
Suggestions:
• Practise all proofs in the slides and the extra assigned questions: all of them.
• Know how to read and interpret R output:
• summary and anova output
• diagnostic plots
• other output you have seen in the slides.
• Review the 3 assignments and the midterm paper.
• Doing some old exams might help you see which topics are important.
Final Exam: topics that you could skip
• Week 2:
• different criteria for regression: reverse/orthogonal/reduced major axis regression
• how to find the MLE
• review of inference
• Week 4: Normal correlation model
• Week 8: Review of matrices (slides 6-16): skip only if you have good knowledge of matrices
• Week 10: Geometric perspective of LS regression (slides 4-6)
• Week 11: Type III SS
• Week 12: Model selection
GOOD LUCK!