STAT2 2e R Markdown Files Sec4.2
July 3, 2018
The HH package has a nicer function for displaying the regsubsets results.
summaryHH(all)
Regression output for the six predictor model (the model with the best adjusted R2).
mod6=lm(GPA~HSGPA+SATV+Male+HU+SS+White,data=FirstYearGPA)
summary(mod6)
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06228 -0.26731 0.05287 0.27230 0.85843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5466634 0.2835072 1.928 0.0552 .
## HSGPA 0.4829491 0.0714659 6.758 1.33e-10 ***
## SATV 0.0006945 0.0003449 2.013 0.0453 *
## Male 0.0541049 0.0526937 1.027 0.3057
## HU 0.0167958 0.0038181 4.399 1.72e-05 ***
## SS 0.0075702 0.0054421 1.391 0.1657
## White 0.2045215 0.0685954 2.982 0.0032 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3814 on 212 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.3285
## F-statistic: 18.78 on 6 and 212 DF, p-value: < 2.2e-16
In the output above, the smallest Cp is for the ninth model (3.89), which is the first
five-predictor model.
You can also modify the plot of the regsubsets results to sort by Cp.
plot(all,scale="Cp")
The model with the smallest Cp has HSGPA, SATV, HU, SS, and White as predictors.
To calculate Cp for any particular model, we can use extractAIC with scale set to the MSE
of the full model; with that scale, the value extractAIC reports for a regression model is
Cp. Here is Cp for the six-predictor model (4.85).
full=lm(GPA~.,data=FirstYearGPA) #model with all predictors in the pool
MSE=(summary(full)$sigma)^2 #Get out the MSE for full model
extractAIC(lm(GPA~HSGPA+SATV+Male+HU+SS+White,data=FirstYearGPA),scale=MSE)
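The value 4.85 can also be checked by hand. With a scale supplied, extractAIC computes RSS/scale - n + 2(k+1) for a model with k predictors, which is the Cp formula. The sketch below uses only numbers read from the printed output in this section (n = 219, and the residual sums of squares 30.719 and 30.844 from the step output); the variable names are made up for illustration.

```r
# Values copied from the printed output in this section (rounded as printed)
n        <- 219       # 209 residual df + 10 coefficients in the full model
rss_full <- 30.719    # RSS of the full model (the <none> row at Start: AIC=10)
mse_full <- rss_full / 209   # MSE of the full model
rss_six  <- 30.844    # RSS of GPA ~ HSGPA + SATV + Male + HU + SS + White
k        <- 6         # number of predictors in that model

# Cp = RSS_model / MSE_full - n + 2 * (k + 1)
cp_six <- rss_six / mse_full - n + 2 * (k + 1)
round(cp_six, 2)      # about 4.85, matching the step output up to rounding
```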
##
## Call:
## lm(formula = GPA ~ ., data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07412 -0.25827 0.05384 0.27675 0.85761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5268983 0.3487584 1.511 0.13235
## HSGPA 0.4932945 0.0745553 6.616 3.03e-10 ***
## SATV 0.0005919 0.0003945 1.501 0.13498
## SATM 0.0000847 0.0004447 0.190 0.84912
## Male 0.0482478 0.0570277 0.846 0.39850
## HU 0.0161874 0.0039723 4.075 6.53e-05 ***
## SS 0.0073370 0.0055635 1.319 0.18869
## FirstGen -0.0743417 0.0887490 -0.838 0.40318
## White 0.1962316 0.0700182 2.803 0.00555 **
## CollegeBound 0.0214530 0.1003350 0.214 0.83090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3834 on 209 degrees of freedom
## Multiple R-squared: 0.3496, Adjusted R-squared: 0.3216
## F-statistic: 12.48 on 9 and 209 DF, p-value: 8.674e-16
The weakest (highest) P-value is for SATM, so we drop it using the update function.
mod8=update(mod9,.~.-SATM)
summary(mod8)
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen +
## White + CollegeBound, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07160 -0.26357 0.05167 0.27469 0.85550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5551540 0.3149111 1.763 0.07937 .
## HSGPA 0.4950161 0.0738354 6.704 1.84e-10 ***
## SATV 0.0006245 0.0003548 1.760 0.07988 .
## Male 0.0522103 0.0529758 0.986 0.32549
## HU 0.0160823 0.0039247 4.098 5.96e-05 ***
## SS 0.0071772 0.0054873 1.308 0.19231
## FirstGen -0.0755918 0.0883026 -0.856 0.39294
## White 0.1974200 0.0695794 2.837 0.00499 **
## CollegeBound 0.0211753 0.1000939 0.212 0.83266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3825 on 210 degrees of freedom
## Multiple R-squared: 0.3495, Adjusted R-squared: 0.3247
## F-statistic: 14.11 on 8 and 210 DF, p-value: 2.253e-16
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen +
## White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06911 -0.26259 0.05236 0.26954 0.84134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5824756 0.2865599 2.033 0.04334 *
## HSGPA 0.4919452 0.0722304 6.811 9.94e-11 ***
## SATV 0.0006315 0.0003524 1.792 0.07458 .
## Male 0.0529590 0.0527377 1.004 0.31643
## HU 0.0160503 0.0039129 4.102 5.85e-05 ***
## SS 0.0071224 0.0054687 1.302 0.19420
## FirstGen -0.0772533 0.0877533 -0.880 0.37967
## White 0.1963878 0.0692509 2.836 0.00501 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3816 on 211 degrees of freedom
## Multiple R-squared: 0.3494, Adjusted R-squared: 0.3278
## F-statistic: 16.19 on 7 and 211 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06228 -0.26731 0.05287 0.27230 0.85843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5466634 0.2835072 1.928 0.0552 .
## HSGPA 0.4829491 0.0714659 6.758 1.33e-10 ***
## SATV 0.0006945 0.0003449 2.013 0.0453 *
## Male 0.0541049 0.0526937 1.027 0.3057
## HU 0.0167958 0.0038181 4.399 1.72e-05 ***
## SS 0.0075702 0.0054421 1.391 0.1657
## White 0.2045215 0.0685954 2.982 0.0032 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3814 on 212 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.3285
## F-statistic: 18.78 on 6 and 212 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08660 -0.25827 0.04326 0.25822 0.87954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5684876 0.2827454 2.011 0.04563 *
## HSGPA 0.4739983 0.0709413 6.682 2.03e-10 ***
## SATV 0.0007481 0.0003410 2.194 0.02932 *
## HU 0.0167447 0.0038183 4.385 1.82e-05 ***
## SS 0.0077474 0.0054401 1.424 0.15587
## White 0.2060408 0.0685881 3.004 0.00298 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3815 on 213 degrees of freedom
## Multiple R-squared: 0.3437, Adjusted R-squared: 0.3283
## F-statistic: 22.31 on 5 and 213 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06370 -0.26286 0.02436 0.27338 0.87190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6409767 0.2787933 2.299 0.02246 *
## HSGPA 0.4761952 0.0710947 6.698 1.83e-10 ***
## SATV 0.0007372 0.0003417 2.157 0.03209 *
## HU 0.0150566 0.0036383 4.138 5.03e-05 ***
## White 0.2121164 0.0686196 3.091 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3824 on 214 degrees of freedom
## Multiple R-squared: 0.3375, Adjusted R-squared: 0.3251
## F-statistic: 27.25 on 4 and 214 DF, p-value: < 2.2e-16
The weakest predictor now is SATV, but its P-value is less than 0.05, so we stop, and the
backward elimination model is GPA~HSGPA+SATV+HU+White.
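The manual procedure above (refit, find the largest P-value, drop that term, repeat until every remaining P-value is below 0.05) can be sketched as a loop. The helper name backward_by_p and the simulated data frame sim are assumptions made for illustration, and the sketch assumes every predictor is numeric or a 0/1 indicator, so that coefficient names match term names.

```r
# Hypothetical helper: backward elimination driven by P-values
backward_by_p <- function(formula, data, alpha = 0.05) {
  fit <- lm(formula, data = data)
  repeat {
    tab <- summary(fit)$coefficients
    tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]
    if (nrow(tab) == 0) break                      # no predictors left
    worst <- which.max(tab[, "Pr(>|t|)"])          # weakest predictor
    if (tab[worst, "Pr(>|t|)"] < alpha) break      # all remaining terms pass
    drop_term <- as.formula(paste(". ~ . -", rownames(tab)[worst]))
    fit <- lm(update(formula(fit), drop_term), data = data)
  }
  fit
}

# Simulated stand-in data (not FirstYearGPA): only x1 truly affects y
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

final <- backward_by_p(y ~ ., sim)
formula(final)
```

Assuming FirstYearGPA is loaded, backward_by_p(GPA ~ ., FirstYearGPA) should follow the same sequence of drops as the summaries shown above.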
## Start: AIC=104.36
## GPA ~ 1
##
## Df Sum of Sq RSS Cp
## + HSGPA 1 9.4329 37.801 42.181
## + HU 1 4.6765 42.557 74.541
## + SATV 1 4.3741 42.859 76.599
## + White 1 3.7501 43.483 80.844
## + SATM 1 1.7840 45.450 94.221
## + FirstGen 1 1.1580 46.076 98.480
## <none> 47.234 104.359
## + CollegeBound 1 0.1876 47.046 105.082
## + Male 1 0.1319 47.102 105.461
## + SS 1 0.0006 47.233 106.354
##
## Step: AIC=42.18
## GPA ~ HSGPA
##
## Df Sum of Sq RSS Cp
## + HU 1 3.3067 34.494 21.683
## + White 1 3.2292 34.571 22.210
## + SATV 1 2.1861 35.615 29.307
## + FirstGen 1 1.6278 36.173 33.105
## + SATM 1 0.7683 37.032 38.953
## + Male 1 0.4138 37.387 41.365
## <none> 37.801 42.181
## + CollegeBound 1 0.0342 37.766 43.948
## + SS 1 0.0008 37.800 44.175
##
## Step: AIC=21.68
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS Cp
## + White 1 2.52100 31.973 6.531
## + SATV 1 1.80435 32.690 11.407
## + SATM 1 0.86034 33.634 17.829
## + FirstGen 1 0.80022 33.694 18.238
## + Male 1 0.43380 34.060 20.732
## + SS 1 0.37935 34.115 21.102
## <none> 34.494 21.683
## + CollegeBound 1 0.03905 34.455 23.417
##
## Step: AIC=6.53
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS Cp
## + SATV 1 0.68060 31.292 3.9005
## + FirstGen 1 0.30945 31.663 6.4256
## <none> 31.973 6.5310
## + SATM 1 0.28236 31.691 6.6099
## + Male 1 0.27919 31.694 6.6315
## + SS 1 0.27526 31.698 6.6582
## + CollegeBound 1 0.04854 31.924 8.2008
##
## Step: AIC=3.9
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS Cp
## + SS 1 0.295150 30.997 3.8924
## <none> 31.292 3.9005
## + Male 1 0.167015 31.125 4.7642
## + FirstGen 1 0.156003 31.136 4.8391
## + SATM 1 0.026915 31.265 5.7173
## + CollegeBound 1 0.013720 31.279 5.8071
##
## Step: AIC=3.89
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## + Male 1 0.153387 30.844 4.8488
## + FirstGen 1 0.119394 30.878 5.0801
## + SATM 1 0.054109 30.943 5.5242
## + CollegeBound 1 0.018808 30.978 5.7644
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)       HSGPA          HU       White        SATV          SS
##   0.5684876   0.4739983   0.0167447   0.2060408   0.0007481   0.0077474
We see that forward selection (using Cp) takes us to a different model than the
four-variable model provided on p. 163 of the textbook. The primary reason for this is that
R is using Cp, rather than P-values, to make decisions. It is important to remember that
different criteria may result in different models. As you will see below in the alternative
solutions section, this five-predictor model is the same as the one we obtain using backward
elimination with Cp.
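The call that produces a forward-selection trace like the one above is not shown in this section. A minimal sketch of such a call, assuming the same step function used elsewhere in this handout with direction = "forward", is given here on simulated data so that it runs on its own; the data frame sim and its variables are made up for illustration.

```r
# Simulated stand-in data (hypothetical; not FirstYearGPA)
set.seed(42)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

full <- lm(y ~ ., data = sim)     # model with all predictors in the pool
none <- lm(y ~ 1, data = sim)     # intercept-only starting model
MSE  <- summary(full)$sigma^2     # full-model MSE, so step() compares models by Cp

# direction = "forward" only ever adds terms; trace = 0 hides the individual steps
fwd <- step(none, scope = formula(full), scale = MSE,
            direction = "forward", trace = 0)
formula(fwd)
```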
STEPWISE REGRESSION
This automated option combines forward and backward, having the same syntax as
forward without needing a direction specified. The formula(full) option specifies the pool
of predictors. Leaving out a direction gives the default direction = “both.”
full=lm(GPA~.,data=FirstYearGPA)
none=lm(GPA~1,data=FirstYearGPA)
MSE=(summary(full)$sigma^2)
step(none,scope=formula(full),scale=MSE)
## Start: AIC=104.36
## GPA ~ 1
##
## Df Sum of Sq RSS Cp
## + HSGPA 1 9.4329 37.801 42.181
## + HU 1 4.6765 42.557 74.541
## + SATV 1 4.3741 42.859 76.599
## + White 1 3.7501 43.483 80.844
## + SATM 1 1.7840 45.450 94.221
## + FirstGen 1 1.1580 46.076 98.480
## <none> 47.234 104.359
## + CollegeBound 1 0.1876 47.046 105.082
## + Male 1 0.1319 47.102 105.461
## + SS 1 0.0006 47.233 106.354
##
## Step: AIC=42.18
## GPA ~ HSGPA
##
## Df Sum of Sq RSS Cp
## + HU 1 3.3067 34.494 21.683
## + White 1 3.2292 34.571 22.210
## + SATV 1 2.1861 35.615 29.307
## + FirstGen 1 1.6278 36.173 33.105
## + SATM 1 0.7683 37.032 38.953
## + Male 1 0.4138 37.387 41.365
## <none> 37.801 42.181
## + CollegeBound 1 0.0342 37.766 43.948
## + SS 1 0.0008 37.800 44.175
## - HSGPA 1 9.4329 47.234 104.359
##
## Step: AIC=21.68
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS Cp
## + White 1 2.5210 31.973 6.531
## + SATV 1 1.8043 32.690 11.407
## + SATM 1 0.8603 33.634 17.829
## + FirstGen 1 0.8002 33.694 18.238
## + Male 1 0.4338 34.060 20.732
## + SS 1 0.3793 34.115 21.102
## <none> 34.494 21.683
## + CollegeBound 1 0.0390 34.455 23.417
## - HU 1 3.3067 37.801 42.181
## - HSGPA 1 8.0631 42.557 74.541
##
## Step: AIC=6.53
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS Cp
## + SATV 1 0.6806 31.292 3.9005
## + FirstGen 1 0.3095 31.663 6.4256
## <none> 31.973 6.5310
## + SATM 1 0.2824 31.691 6.6099
## + Male 1 0.2792 31.694 6.6315
## + SS 1 0.2753 31.698 6.6582
## + CollegeBound 1 0.0485 31.924 8.2008
## - White 1 2.5210 34.494 21.6829
## - HU 1 2.5985 34.571 22.2100
## - HSGPA 1 7.7700 39.743 57.3949
##
## Step: AIC=3.9
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS Cp
## + SS 1 0.2951 30.997 3.8924
## <none> 31.292 3.9005
## + Male 1 0.1670 31.125 4.7642
## + FirstGen 1 0.1560 31.136 4.8391
## + SATM 1 0.0269 31.265 5.7173
## + CollegeBound 1 0.0137 31.279 5.8071
## - SATV 1 0.6806 31.973 6.5310
## - White 1 1.3973 32.690 11.4068
## - HU 1 2.5042 33.797 18.9383
## - HSGPA 1 6.5602 37.853 46.5337
##
## Step: AIC=3.89
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## - SS 1 0.2951 31.292 3.9005
## + Male 1 0.1534 30.844 4.8488
## + FirstGen 1 0.1194 30.878 5.0801
## + SATM 1 0.0541 30.943 5.5242
## + CollegeBound 1 0.0188 30.978 5.7644
## - SATV 1 0.7005 31.698 6.6582
## - White 1 1.3133 32.310 10.8273
## - HU 1 2.7987 33.796 20.9339
## - HSGPA 1 6.4968 37.494 46.0938
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)       HSGPA          HU       White        SATV          SS
##   0.5684876   0.4739983   0.0167447   0.2060408   0.0007481   0.0077474
In this case, stepwise proceeds the same as forward since no variables are dropped at any
step.
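Rather than reading through the trace to see whether anything was dropped, one option is to inspect the anova component that step attaches to the model it returns, which records each step taken. The sketch below uses simulated data (the data frame sim and its variables are made up for illustration) and checks for any step that removes a term.

```r
# Simulated stand-in data (hypothetical; not FirstYearGPA)
set.seed(7)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

full <- lm(y ~ ., data = sim)
none <- lm(y ~ 1, data = sim)
MSE  <- summary(full)$sigma^2

# Default direction is "both"; trace = 0 suppresses the printed steps
both <- step(none, scope = formula(full), scale = MSE, trace = 0)

# One entry per step: added terms start with "+", dropped terms with "-"
steps <- as.character(both$anova$Step)
steps
dropped_any <- any(grepl("^-", steps))   # TRUE only if some variable was dropped
dropped_any
```

If dropped_any is FALSE, the stepwise search added terms only, so it followed the same path a forward search would.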
ShowSubsets=function(regout){
z=summary(regout)
q=as.data.frame(z$outmat)
q$Rsq=round(z$rsq*100,2)
q$adjRsq=round(z$adjr2*100,2)
q$Cp=round(z$cp,2)
return(q)
}
Once the ShowSubsets( ) function has been defined, it can be used on the result of
regsubsets( ). You may need to scroll to see the full width and all models.
ShowSubsets(all)
BACKWARD ELIMINATION
Here’s code to do the full backward elimination in R automatically.
Note: The scale = MSE uses Cp as the criterion for comparing models.
full=lm(GPA~.,data=FirstYearGPA)
MSE=(summary(full)$sigma^2)
step(full,scale=MSE,direction="backward")
## Start: AIC=10
## GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White +
## CollegeBound
##
## Df Sum of Sq RSS Cp
## - SATM 1 0.0053 30.724 8.0363
## - CollegeBound 1 0.0067 30.726 8.0457
## - FirstGen 1 0.1031 30.822 8.7017
## - Male 1 0.1052 30.824 8.7158
## - SS 1 0.2556 30.975 9.7392
## <none> 30.719 10.0000
## - SATV 1 0.3309 31.050 10.2516
## - White 1 1.1545 31.873 15.8545
## - HU 1 2.4409 33.160 24.6066
## - HSGPA 1 6.4345 37.154 51.7779
##
## Step: AIC=8.04
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White + CollegeBound
##
## Df Sum of Sq RSS Cp
## - CollegeBound 1 0.0065 30.731 6.0808
## - FirstGen 1 0.1072 30.832 6.7657
## - Male 1 0.1421 30.866 7.0031
## - SS 1 0.2503 30.975 7.7392
## <none> 30.724 8.0363
## - SATV 1 0.4532 31.178 9.1194
## - White 1 1.1778 31.902 14.0498
## - HU 1 2.4567 33.181 22.7506
## - HSGPA 1 6.5762 37.301 50.7779
##
## Step: AIC=6.08
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White
##
## Df Sum of Sq RSS Cp
## - FirstGen 1 0.1129 30.844 4.8488
## - Male 1 0.1469 30.878 5.0801
## - SS 1 0.2470 30.978 5.7616
## <none> 30.731 6.0808
## - SATV 1 0.4677 31.199 7.2626
## - White 1 1.1713 31.902 12.0499
## - HU 1 2.4506 33.181 20.7534
## - HSGPA 1 6.7560 37.487 50.0456
##
## Step: AIC=4.85
## GPA ~ HSGPA + SATV + Male + HU + SS + White
##
## Df Sum of Sq RSS Cp
## - Male 1 0.1534 30.997 3.8924
## - SS 1 0.2815 31.125 4.7642
## <none> 30.844 4.8488
## - SATV 1 0.5898 31.434 6.8613
## - White 1 1.2934 32.137 11.6483
## - HU 1 2.8154 33.659 22.0034
## - HSGPA 1 6.6441 37.488 48.0526
##
## Step: AIC=3.89
## GPA ~ HSGPA + SATV + HU + SS + White
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## - SS 1 0.2951 31.292 3.9005
## - SATV 1 0.7005 31.698 6.6582
## - White 1 1.3133 32.310 10.8273
## - HU 1 2.7987 33.796 20.9339
## - HSGPA 1 6.4968 37.494 46.0938
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)       HSGPA        SATV          HU          SS       White
##   0.5684876   0.4739983   0.0007481   0.0167447   0.0077474   0.2060408
Notice that using Cp as the criterion gives a slightly different final model with five
predictors, including the SS term that was the last one eliminated when we used P-values
to make the decision to drop terms. The Cp when including SS (3.8924) is just barely less
than when it is dropped (3.9005). This example illustrates a case where different criteria
for helping you decide on the “best model” lead to different answers.
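The near tie for SS can be seen directly from the Cp formula: adding one predictor changes Cp by 2 - (Sum of Sq)/MSE, so a term lowers Cp only when its extra sum of squares exceeds twice the full-model MSE. The sketch below uses numbers copied from the printed output in this section; the variable names are made up for illustration.

```r
# Values copied from the printed output in this section (rounded as printed)
mse_full <- 30.719 / 209   # full-model RSS / residual df
ss_gain  <- 0.2951         # "Sum of Sq" for SS in the final step

threshold <- 2 * mse_full               # extra SS needed for Cp to decrease
cp_change <- 2 - ss_gain / mse_full     # Cp(with SS) - Cp(without SS)

threshold    # just under 0.2951, so SS barely clears the bar
cp_change    # slightly negative, in line with 3.8924 - 3.9005
```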
STEPWISE REGRESSION
If you want to suppress the individual steps and show just the final model, add trace = 0,
but note that there is often valuable information in the individual steps.
step(none,scope=formula(full),scale=MSE,trace=0)
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)       HSGPA          HU       White        SATV          SS
##   0.5684876   0.4739983   0.0167447   0.2060408   0.0007481   0.0077474