STAT2 2e R Markdown Files Sec4.2

Topic 4.2 Techniques for Choosing Predictors

Stat2

July 3, 2018

Load needed packages.

library(Stat2Data)
library(leaps)
library(HH)

EXAMPLE 4.2 First-year GPA

Load the FirstYearGPA data from the Stat2Data package and look at the structure of the data.
data(FirstYearGPA)
str(FirstYearGPA)

## 'data.frame': 219 obs. of 10 variables:
## $ GPA : num 3.06 4.15 3.41 3.21 3.48 2.95 3.6 2.87 3.67 3.49 ...
## $ HSGPA : num 3.83 4 3.7 3.51 3.83 3.25 3.79 3.6 3.36 3.7 ...
## $ SATV : int 680 740 640 740 610 600 710 390 630 680 ...
## $ SATM : int 770 720 570 700 610 570 630 570 560 670 ...
## $ Male : int 1 0 0 0 0 0 0 0 0 0 ...
## $ HU : num 3 9 16 22 30.5 18 5 10 8.5 16 ...
## $ SS : num 9 3 13 0 1.5 3 19 0 15.5 12 ...
## $ FirstGen : int 1 0 0 0 0 0 0 0 0 0 ...
## $ White : int 1 1 0 1 1 1 1 0 1 1 ...
## $ CollegeBound: int 1 1 1 1 1 1 1 0 1 1 ...

FIGURE 4.4 Scatterplot matrix for first-year GPA data

For the quantitative variables
pairs(FirstYearGPA[,c(1,2,3,4,6,7)], pch=16)
FIGURE 4.5 GPA versus categorical predictors
For each categorical predictor 0 = no, 1 = yes
par(mfrow=c(1,4)) #puts all four plots side-by-side
boxplot(GPA~Male,data=FirstYearGPA,xlab="Male")
boxplot(GPA~FirstGen,data=FirstYearGPA,xlab="FirstGen")
boxplot(GPA~White,data=FirstYearGPA,xlab="White")
boxplot(GPA~CollegeBound,data=FirstYearGPA,xlab="CollegeBound")
BEST SUBSETS
Using the leaps package, run best subsets for predicting GPA. The regsubsets( ) function
finds the best model of each size (or use nbest = ___ to show more models of each size). The
command below shows the best two models of each size.
all=regsubsets(GPA~.,nbest=2,data=FirstYearGPA)
summary(all)

## Subset selection object
## Call: regsubsets.formula(GPA ~ ., nbest = 2, data = FirstYearGPA)
## 9 Variables (and intercept)
## Forced in Forced out
## HSGPA FALSE FALSE
## SATV FALSE FALSE
## SATM FALSE FALSE
## Male FALSE FALSE
## HU FALSE FALSE
## SS FALSE FALSE
## FirstGen FALSE FALSE
## White FALSE FALSE
## CollegeBound FALSE FALSE
## 2 subsets of each size up to 8
## Selection Algorithm: exhaustive
## HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*" " " " " " " " "
## 2 ( 2 ) "*" " " " " " " " " " " " " "*" " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " "*" " "
## 3 ( 2 ) "*" "*" " " " " "*" " " " " " " " "
## 4 ( 1 ) "*" "*" " " " " "*" " " " " "*" " "
## 4 ( 2 ) "*" " " " " " " "*" " " "*" "*" " "
## 5 ( 1 ) "*" "*" " " " " "*" "*" " " "*" " "
## 5 ( 2 ) "*" "*" " " "*" "*" " " " " "*" " "
## 6 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" " "
## 6 ( 2 ) "*" "*" " " " " "*" "*" "*" "*" " "
## 7 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " "
## 7 ( 2 ) "*" "*" " " "*" "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*"
## 8 ( 2 ) "*" "*" "*" "*" "*" "*" "*" "*" " "
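
The line "2 subsets of each size up to 8" reflects the default nvmax = 8 in regsubsets( ), which is why the full nine-predictor model does not appear in the list. As a minimal sketch (the object names all9, s9, and model_size are ours, not from the original file), raising nvmax to 9 should add that model:

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

# nvmax sets the largest model size considered (the default is 8)
all9 <- regsubsets(GPA ~ ., nbest = 2, nvmax = 9, data = FirstYearGPA)
s9 <- summary(all9)

# Number of predictors in each listed model (drop the intercept column)
model_size <- rowSums(s9$which[, -1])
table(model_size)
```

Only one model can use all nine predictors, so the largest size contributes a single row even with nbest = 2.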

Here’s an option to create a plot showing the variables in each model

plot(all,scale="adjr2")

The HH package has a nicer function for displaying the regsubsets results.
summaryHH(all)

## model p rsq rss adjr2 cp bic stderr
## 1 HS 2 0.200 37.8 0.1960 42.18 -38.0 0.417
## 2 HU 2 0.099 42.6 0.0949 74.54 -12.1 0.443
## 3 HS-HU 3 0.270 34.5 0.2630 21.68 -52.7 0.400
## 4 HS-W 3 0.268 34.6 0.2613 22.21 -52.2 0.400
## 5 HS-HU-W 4 0.323 32.0 0.3136 6.53 -63.9 0.386
## 6 HS-SATV-HU 4 0.308 32.7 0.2983 11.41 -59.0 0.390
## 7 HS-SATV-HU-W 5 0.337 31.3 0.3251 3.90 -63.2 0.382
## 8 HS-HU-F-W 5 0.330 31.7 0.3171 6.43 -60.6 0.385
## 9 HS-SATV-HU-SS-W 6 0.344 31.0 0.3283 3.89 -59.9 0.381
## 10 HS-SATV-M-HU-W 6 0.341 31.1 0.3256 4.76 -59.0 0.382
## 11 HS-SATV-M-HU-SS-W 7 0.347 30.8 0.3285 4.85 -55.6 0.381
## 12 HS-SATV-HU-SS-F-W 7 0.346 30.9 0.3278 5.08 -55.4 0.382
## 13 HS-SATV-M-HU-SS-F-W 8 0.349 30.7 0.3278 6.08 -51.0 0.382
## 14 HS-SATV-M-HU-SS-W-C 8 0.347 30.8 0.3256 6.77 -50.3 0.382
## 15 HS-SATV-M-HU-SS-F-W-C 9 0.350 30.7 0.3247 8.04 -45.7 0.383
## 16 HS-SATV-SATM-M-HU-SS-F-W 9 0.349 30.7 0.3247 8.05 -45.7 0.383
##
## Model variables with abbreviations
## model
## HS HSGPA
## HU HU
## HS-HU HSGPA-HU
## HS-W HSGPA-White
## HS-HU-W HSGPA-HU-White
## HS-SATV-HU HSGPA-SATV-HU
## HS-SATV-HU-W HSGPA-SATV-HU-White
## HS-HU-F-W HSGPA-HU-FirstGen-White
## HS-SATV-HU-SS-W HSGPA-SATV-HU-SS-White
## HS-SATV-M-HU-W HSGPA-SATV-Male-HU-White
## HS-SATV-M-HU-SS-W HSGPA-SATV-Male-HU-SS-White
## HS-SATV-HU-SS-F-W HSGPA-SATV-HU-SS-FirstGen-White
## HS-SATV-M-HU-SS-F-W HSGPA-SATV-Male-HU-SS-FirstGen-White
## HS-SATV-M-HU-SS-W-C HSGPA-SATV-Male-HU-SS-White-CollegeBound
## HS-SATV-M-HU-SS-F-W-C HSGPA-SATV-Male-HU-SS-FirstGen-White-CollegeBound
## HS-SATV-SATM-M-HU-SS-F-W HSGPA-SATV-SATM-Male-HU-SS-FirstGen-White
##
## model with largest adjr2
## 11
##
## Number of observations
## 219

Regression output for the six-predictor model (the model with the best adjusted R-squared).
mod6=lm(GPA~HSGPA+SATV+Male+HU+SS+White,data=FirstYearGPA)
summary(mod6)

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06228 -0.26731 0.05287 0.27230 0.85843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5466634 0.2835072 1.928 0.0552 .
## HSGPA 0.4829491 0.0714659 6.758 1.33e-10 ***
## SATV 0.0006945 0.0003449 2.013 0.0453 *
## Male 0.0541049 0.0526937 1.027 0.3057
## HU 0.0167958 0.0038181 4.399 1.72e-05 ***
## SS 0.0075702 0.0054421 1.391 0.1657
## White 0.2045215 0.0685954 2.982 0.0032 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3814 on 212 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.3285
## F-statistic: 18.78 on 6 and 212 DF, p-value: < 2.2e-16

Perhaps we don’t need Male or SS in this model?

Mallows's Cp is given in the summaryHH( ) and ShowSubsets( ) output (see the alternative
solutions below). It’s also stored in the summary of the regsubsets object, but you need to
be careful to track which model corresponds to each Cp value.
summary(all)$cp

## [1] 42.180513 74.541330 21.682884 22.210037 6.530981 11.406836 3.900456
## [8] 6.425605 3.892373 4.764152 4.848791 5.080066 6.080830 6.765744
## [15] 8.036280 8.045716

In the output above, the smallest Cp is for the ninth model (3.89), which is the first
five-predictor model.
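
One way to avoid counting positions by hand is to ask R for the row index directly. The sketch below is our own illustration (the names best_row and predictors_in are hypothetical helpers, not part of leaps); it uses the cp, adjr2, bic, and which components of the regsubsets summary.

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

all <- regsubsets(GPA ~ ., nbest = 2, data = FirstYearGPA)
s <- summary(all)

# Row index of the best model under each criterion
best_row <- c(Cp = which.min(s$cp),
              adjR2 = which.max(s$adjr2),
              BIC = which.min(s$bic))
best_row

# Helper (our own, hypothetical name): predictors included in a given row
predictors_in <- function(row) {
  included <- s$which[row, ]
  setdiff(names(included)[included], "(Intercept)")
}
lapply(best_row, predictors_in)
```

The three criteria need not agree, which is worth keeping in mind when reading the rest of this section.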
You can also modify the plot of the regsubsets results to sort by Cp.
plot(all,scale="Cp")
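The model with the smallest Cp has HSGPA, SATV, HU, SS, and White as predictors.
To calculate Cp for any particular model, we can use extractAIC. When scale is set to the
MSE of the full model, extractAIC returns Mallows's Cp in place of AIC for a linear model.
The output gives the number of coefficients (7) followed by Cp for the six-predictor model (4.85).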
full=lm(GPA~.,data=FirstYearGPA) #model with all predictors in the pool
MSE=(summary(full)$sigma)^2 #Get out the MSE for full model
extractAIC(lm(GPA~HSGPA+SATV+Male+HU+SS+White,data=FirstYearGPA),scale=MSE)

## [1] 7.000000 4.848791
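
To see where this number comes from, Cp for a model with m predictors can be computed directly as Cp = SSE_m / MSE_full + 2(m + 1) - n. The sketch below uses our own variable names and should reproduce the extractAIC value for the six-predictor model.

```r
library(Stat2Data)
data(FirstYearGPA)

full <- lm(GPA ~ ., data = FirstYearGPA)
mod6 <- lm(GPA ~ HSGPA + SATV + Male + HU + SS + White, data = FirstYearGPA)

n <- nrow(FirstYearGPA)        # number of cases
MSE <- summary(full)$sigma^2   # MSE from the full model
SSE <- sum(resid(mod6)^2)      # SSE for the candidate model
p <- length(coef(mod6))        # number of coefficients, including the intercept

Cp <- SSE / MSE + 2 * p - n
Cp
```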

BACKWARD ELIMINATION (with the GPA data)

First, we’ll use “brute force” by fitting each model. Code to do this more automatically using
the step() function appears later.
Full model with nine predictors
mod9=lm(GPA~.,data=FirstYearGPA)
summary(mod9)

##
## Call:
## lm(formula = GPA ~ ., data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07412 -0.25827 0.05384 0.27675 0.85761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5268983 0.3487584 1.511 0.13235
## HSGPA 0.4932945 0.0745553 6.616 3.03e-10 ***
## SATV 0.0005919 0.0003945 1.501 0.13498
## SATM 0.0000847 0.0004447 0.190 0.84912
## Male 0.0482478 0.0570277 0.846 0.39850
## HU 0.0161874 0.0039723 4.075 6.53e-05 ***
## SS 0.0073370 0.0055635 1.319 0.18869
## FirstGen -0.0743417 0.0887490 -0.838 0.40318
## White 0.1962316 0.0700182 2.803 0.00555 **
## CollegeBound 0.0214530 0.1003350 0.214 0.83090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3834 on 209 degrees of freedom
## Multiple R-squared: 0.3496, Adjusted R-squared: 0.3216
## F-statistic: 12.48 on 9 and 209 DF, p-value: 8.674e-16

The weakest predictor (highest P-value) is SATM, so drop it using the update function.
mod8=update(mod9,.~.-SATM)
summary(mod8)

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen +
## White + CollegeBound, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07160 -0.26357 0.05167 0.27469 0.85550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5551540 0.3149111 1.763 0.07937 .
## HSGPA 0.4950161 0.0738354 6.704 1.84e-10 ***
## SATV 0.0006245 0.0003548 1.760 0.07988 .
## Male 0.0522103 0.0529758 0.986 0.32549
## HU 0.0160823 0.0039247 4.098 5.96e-05 ***
## SS 0.0071772 0.0054873 1.308 0.19231
## FirstGen -0.0755918 0.0883026 -0.856 0.39294
## White 0.1974200 0.0695794 2.837 0.00499 **
## CollegeBound 0.0211753 0.1000939 0.212 0.83266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3825 on 210 degrees of freedom
## Multiple R-squared: 0.3495, Adjusted R-squared: 0.3247
## F-statistic: 14.11 on 8 and 210 DF, p-value: 2.253e-16

CollegeBound should be dropped next, so use the update function again.

mod7=update(mod8,.~.-CollegeBound)
summary(mod7)

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen +
## White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06911 -0.26259 0.05236 0.26954 0.84134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5824756 0.2865599 2.033 0.04334 *
## HSGPA 0.4919452 0.0722304 6.811 9.94e-11 ***
## SATV 0.0006315 0.0003524 1.792 0.07458 .
## Male 0.0529590 0.0527377 1.004 0.31643
## HU 0.0160503 0.0039129 4.102 5.85e-05 ***
## SS 0.0071224 0.0054687 1.302 0.19420
## FirstGen -0.0772533 0.0877533 -0.880 0.37967
## White 0.1963878 0.0692509 2.836 0.00501 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3816 on 211 degrees of freedom
## Multiple R-squared: 0.3494, Adjusted R-squared: 0.3278
## F-statistic: 16.19 on 7 and 211 DF, p-value: < 2.2e-16

Now, drop FirstGen.

mod6=update(mod7,.~.-FirstGen)
summary(mod6)

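Next, drop Male, which has the highest P-value (0.3057) in this model; the output is the same as for the six-predictor model fit earlier.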

mod5=update(mod6,.~.-Male)
summary(mod5)

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08660 -0.25827 0.04326 0.25822 0.87954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5684876 0.2827454 2.011 0.04563 *
## HSGPA 0.4739983 0.0709413 6.682 2.03e-10 ***
## SATV 0.0007481 0.0003410 2.194 0.02932 *
## HU 0.0167447 0.0038183 4.385 1.82e-05 ***
## SS 0.0077474 0.0054401 1.424 0.15587
## White 0.2060408 0.0685881 3.004 0.00298 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3815 on 213 degrees of freedom
## Multiple R-squared: 0.3437, Adjusted R-squared: 0.3283
## F-statistic: 22.31 on 5 and 213 DF, p-value: < 2.2e-16

Next, drop SS.

mod4=update(mod5,.~.-SS)
summary(mod4)

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06370 -0.26286 0.02436 0.27338 0.87190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6409767 0.2787933 2.299 0.02246 *
## HSGPA 0.4761952 0.0710947 6.698 1.83e-10 ***
## SATV 0.0007372 0.0003417 2.157 0.03209 *
## HU 0.0150566 0.0036383 4.138 5.03e-05 ***
## White 0.2121164 0.0686196 3.091 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3824 on 214 degrees of freedom
## Multiple R-squared: 0.3375, Adjusted R-squared: 0.3251
## F-statistic: 27.25 on 4 and 214 DF, p-value: < 2.2e-16

The weakest predictor now is SATV, but its P-value (0.032) is less than 0.05, so we stop. The
backward elimination model is GPA~HSGPA+SATV+HU+White.
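
The brute-force steps above can be wrapped in a loop. The sketch below is our own illustration, not code from the original file: the helper name backward_by_pvalue and its alpha argument are hypothetical, and it assumes every predictor is numeric or a 0/1 indicator so that coefficient names match term names.

```r
library(Stat2Data)
data(FirstYearGPA)

# Repeatedly drop the predictor with the largest P-value until all are below alpha
backward_by_pvalue <- function(model, alpha = 0.05) {
  repeat {
    coefs <- coef(summary(model))
    keep <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(pvals)[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

final_bw <- backward_by_pvalue(lm(GPA ~ ., data = FirstYearGPA))
formula(final_bw)
```

The loop hides the intermediate summaries, so it is best used as a check on the step-by-step fits rather than a replacement for them.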

FORWARD SELECTION (using the GPA data)

We’ll do this with the step function. We need to specify the full model to give a pool of
predictors and a starting point (none) that has just the intercept.
Note: Setting scale = MSE makes step( ) use Cp as the criterion for comparing models, although the output still labels each step "AIC".
full=lm(GPA~.,data=FirstYearGPA)
none=lm(GPA~1,data=FirstYearGPA)
MSE=(summary(full)$sigma^2)
step(none,scope=list(upper=full),scale=MSE,direction="forward")

## Start: AIC=104.36
## GPA ~ 1
##
## Df Sum of Sq RSS Cp
## + HSGPA 1 9.4329 37.801 42.181
## + HU 1 4.6765 42.557 74.541
## + SATV 1 4.3741 42.859 76.599
## + White 1 3.7501 43.483 80.844
## + SATM 1 1.7840 45.450 94.221
## + FirstGen 1 1.1580 46.076 98.480
## <none> 47.234 104.359
## + CollegeBound 1 0.1876 47.046 105.082
## + Male 1 0.1319 47.102 105.461
## + SS 1 0.0006 47.233 106.354
##
## Step: AIC=42.18
## GPA ~ HSGPA
##
## Df Sum of Sq RSS Cp
## + HU 1 3.3067 34.494 21.683
## + White 1 3.2292 34.571 22.210
## + SATV 1 2.1861 35.615 29.307
## + FirstGen 1 1.6278 36.173 33.105
## + SATM 1 0.7683 37.032 38.953
## + Male 1 0.4138 37.387 41.365
## <none> 37.801 42.181
## + CollegeBound 1 0.0342 37.766 43.948
## + SS 1 0.0008 37.800 44.175
##
## Step: AIC=21.68
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS Cp
## + White 1 2.52100 31.973 6.531
## + SATV 1 1.80435 32.690 11.407
## + SATM 1 0.86034 33.634 17.829
## + FirstGen 1 0.80022 33.694 18.238
## + Male 1 0.43380 34.060 20.732
## + SS 1 0.37935 34.115 21.102
## <none> 34.494 21.683
## + CollegeBound 1 0.03905 34.455 23.417
##
## Step: AIC=6.53
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS Cp
## + SATV 1 0.68060 31.292 3.9005
## + FirstGen 1 0.30945 31.663 6.4256
## <none> 31.973 6.5310
## + SATM 1 0.28236 31.691 6.6099
## + Male 1 0.27919 31.694 6.6315
## + SS 1 0.27526 31.698 6.6582
## + CollegeBound 1 0.04854 31.924 8.2008
##
## Step: AIC=3.9
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS Cp
## + SS 1 0.295150 30.997 3.8924
## <none> 31.292 3.9005
## + Male 1 0.167015 31.125 4.7642
## + FirstGen 1 0.156003 31.136 4.8391
## + SATM 1 0.026915 31.265 5.7173
## + CollegeBound 1 0.013720 31.279 5.8071
##
## Step: AIC=3.89
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## + Male 1 0.153387 30.844 4.8488
## + FirstGen 1 0.119394 30.878 5.0801
## + SATM 1 0.054109 30.943 5.5242
## + CollegeBound 1 0.018808 30.978 5.7644

##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)        HSGPA           HU        White         SATV           SS
##   0.5684876    0.4739983    0.0167447    0.2060408    0.0007481    0.0077474

We see that forward selection (using Cp) takes us to a different model than the four-
variable model provided on p. 163 of the textbook. The primary reason for this is that R is
using Cp, rather than P-values to make decisions. It is important to remember that different
criteria may result in different models. As you will see below in the alternative solutions
section, this five-predictor model is the same as the one we obtain using backward
elimination with Cp.
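
For comparison, here is a minimal sketch of forward selection that uses P-values rather than Cp. It is our own illustration built on add1( ) with an F-test; the helper name forward_by_pvalue and its 0.05 cutoff are assumptions, not the textbook's code. On these data it stops after four predictors, before SS is added.

```r
library(Stat2Data)
data(FirstYearGPA)

# Repeatedly add the candidate with the smallest P-value while it is below alpha
forward_by_pvalue <- function(start, full, alpha = 0.05) {
  model <- start
  pool <- attr(terms(full), "term.labels")
  repeat {
    remaining <- setdiff(pool, attr(terms(model), "term.labels"))
    if (length(remaining) == 0) break
    cand <- add1(model, scope = formula(full), test = "F")
    pvals <- setNames(cand[["Pr(>F)"]], rownames(cand))  # "<none>" row is NA
    best <- names(which.min(pvals))                      # which.min skips NA
    if (pvals[[best]] >= alpha) break
    model <- update(model, as.formula(paste(". ~ . +", best)))
  }
  model
}

full <- lm(GPA ~ ., data = FirstYearGPA)
none <- lm(GPA ~ 1, data = FirstYearGPA)
final_fw <- forward_by_pvalue(none, full)
formula(final_fw)
```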

STEPWISE REGRESSION
This automated option combines forward and backward, having the same syntax as
forward without needing a direction specified. The formula(full) option specifies the pool
of predictors. Leaving out a direction gives the default direction = “both.”
full=lm(GPA~.,data=FirstYearGPA)
none=lm(GPA~1,data=FirstYearGPA)
MSE=(summary(full)$sigma^2)
step(none,scope=formula(full),scale=MSE)

## Start: AIC=104.36
## GPA ~ 1
##
## Df Sum of Sq RSS Cp
## + HSGPA 1 9.4329 37.801 42.181
## + HU 1 4.6765 42.557 74.541
## + SATV 1 4.3741 42.859 76.599
## + White 1 3.7501 43.483 80.844
## + SATM 1 1.7840 45.450 94.221
## + FirstGen 1 1.1580 46.076 98.480
## <none> 47.234 104.359
## + CollegeBound 1 0.1876 47.046 105.082
## + Male 1 0.1319 47.102 105.461
## + SS 1 0.0006 47.233 106.354
##
## Step: AIC=42.18
## GPA ~ HSGPA
##
## Df Sum of Sq RSS Cp
## + HU 1 3.3067 34.494 21.683
## + White 1 3.2292 34.571 22.210
## + SATV 1 2.1861 35.615 29.307
## + FirstGen 1 1.6278 36.173 33.105
## + SATM 1 0.7683 37.032 38.953
## + Male 1 0.4138 37.387 41.365
## <none> 37.801 42.181
## + CollegeBound 1 0.0342 37.766 43.948
## + SS 1 0.0008 37.800 44.175
## - HSGPA 1 9.4329 47.234 104.359
##
## Step: AIC=21.68
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS Cp
## + White 1 2.5210 31.973 6.531
## + SATV 1 1.8043 32.690 11.407
## + SATM 1 0.8603 33.634 17.829
## + FirstGen 1 0.8002 33.694 18.238
## + Male 1 0.4338 34.060 20.732
## + SS 1 0.3793 34.115 21.102
## <none> 34.494 21.683
## + CollegeBound 1 0.0390 34.455 23.417
## - HU 1 3.3067 37.801 42.181
## - HSGPA 1 8.0631 42.557 74.541
##
## Step: AIC=6.53
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS Cp
## + SATV 1 0.6806 31.292 3.9005
## + FirstGen 1 0.3095 31.663 6.4256
## <none> 31.973 6.5310
## + SATM 1 0.2824 31.691 6.6099
## + Male 1 0.2792 31.694 6.6315
## + SS 1 0.2753 31.698 6.6582
## + CollegeBound 1 0.0485 31.924 8.2008
## - White 1 2.5210 34.494 21.6829
## - HU 1 2.5985 34.571 22.2100
## - HSGPA 1 7.7700 39.743 57.3949
##
## Step: AIC=3.9
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS Cp
## + SS 1 0.2951 30.997 3.8924
## <none> 31.292 3.9005
## + Male 1 0.1670 31.125 4.7642
## + FirstGen 1 0.1560 31.136 4.8391
## + SATM 1 0.0269 31.265 5.7173
## + CollegeBound 1 0.0137 31.279 5.8071
## - SATV 1 0.6806 31.973 6.5310
## - White 1 1.3973 32.690 11.4068
## - HU 1 2.5042 33.797 18.9383
## - HSGPA 1 6.5602 37.853 46.5337
##
## Step: AIC=3.89
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## - SS 1 0.2951 31.292 3.9005
## + Male 1 0.1534 30.844 4.8488
## + FirstGen 1 0.1194 30.878 5.0801
## + SATM 1 0.0541 30.943 5.5242
## + CollegeBound 1 0.0188 30.978 5.7644
## - SATV 1 0.7005 31.698 6.6582
## - White 1 1.3133 32.310 10.8273
## - HU 1 2.7987 33.796 20.9339
## - HSGPA 1 6.4968 37.494 46.0938

In this case, stepwise proceeds the same as forward selection since no variables are dropped at any
step.

Alternative solutions to the approaches above.

Another option for improving the display of the regsubsets results is the function below.
#Function to show results for regsubsets
# input should be the result from regsubsets

ShowSubsets=function(regout){
z=summary(regout)
q=as.data.frame(z$outmat)
q$Rsq=round(z$rsq*100,2)
q$adjRsq=round(z$adjr2*100,2)
q$Cp=round(z$cp,2)
return(q)
}

Once the ShowSubsets( ) function has been defined, it can be used on the result of
regsubsets( ). You may need to scroll to see the full width and all models.
ShowSubsets(all)

##          HSGPA SATV SATM Male HU SS FirstGen White CollegeBound   Rsq adjRsq    Cp
## 1  ( 1 )     *                                                  19.97  19.60 42.18
## 1  ( 2 )                       *                                 9.90   9.49 74.54
## 2  ( 1 )     *                 *                                26.97  26.30 21.68
## 2  ( 2 )     *                                   *              26.81  26.13 22.21
## 3  ( 1 )     *                 *                 *              32.31  31.36  6.53
## 3  ( 2 )     *    *            *                                30.79  29.83 11.41
## 4  ( 1 )     *    *            *                 *              33.75  32.51  3.90
## 4  ( 2 )     *                 *           *     *              32.96  31.71  6.43
## 5  ( 1 )     *    *            *  *              *              34.37  32.83  3.89
## 5  ( 2 )     *    *         *  *                 *              34.10  32.56  4.76
## 6  ( 1 )     *    *         *  *  *              *              34.70  32.85  4.85
## 6  ( 2 )     *    *            *  *        *     *              34.63  32.78  5.08
## 7  ( 1 )     *    *         *  *  *        *     *              34.94  32.78  6.08
## 7  ( 2 )     *    *         *  *  *              *            * 34.73  32.56  6.77
## 8  ( 1 )     *    *         *  *  *        *     *            * 34.95  32.47  8.04
## 8  ( 2 )     *    *    *    *  *  *        *     *              34.95  32.47  8.05

BACKWARD ELIMINATION
Here’s code to do the full backward elimination in R automatically.
Note: The scale = MSE uses Cp as the criterion for comparing models.
full=lm(GPA~.,data=FirstYearGPA)
MSE=(summary(full)$sigma^2)
step(full,scale=MSE,direction="backward")

## Start: AIC=10
## GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White +
## CollegeBound
##
## Df Sum of Sq RSS Cp
## - SATM 1 0.0053 30.724 8.0363
## - CollegeBound 1 0.0067 30.726 8.0457
## - FirstGen 1 0.1031 30.822 8.7017
## - Male 1 0.1052 30.824 8.7158
## - SS 1 0.2556 30.975 9.7392
## <none> 30.719 10.0000
## - SATV 1 0.3309 31.050 10.2516
## - White 1 1.1545 31.873 15.8545
## - HU 1 2.4409 33.160 24.6066
## - HSGPA 1 6.4345 37.154 51.7779
##
## Step: AIC=8.04
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White + CollegeBound
##
## Df Sum of Sq RSS Cp
## - CollegeBound 1 0.0065 30.731 6.0808
## - FirstGen 1 0.1072 30.832 6.7657
## - Male 1 0.1421 30.866 7.0031
## - SS 1 0.2503 30.975 7.7392
## <none> 30.724 8.0363
## - SATV 1 0.4532 31.178 9.1194
## - White 1 1.1778 31.902 14.0498
## - HU 1 2.4567 33.181 22.7506
## - HSGPA 1 6.5762 37.301 50.7779
##
## Step: AIC=6.08
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White
##
## Df Sum of Sq RSS Cp
## - FirstGen 1 0.1129 30.844 4.8488
## - Male 1 0.1469 30.878 5.0801
## - SS 1 0.2470 30.978 5.7616
## <none> 30.731 6.0808
## - SATV 1 0.4677 31.199 7.2626
## - White 1 1.1713 31.902 12.0499
## - HU 1 2.4506 33.181 20.7534
## - HSGPA 1 6.7560 37.487 50.0456
##
## Step: AIC=4.85
## GPA ~ HSGPA + SATV + Male + HU + SS + White
##
## Df Sum of Sq RSS Cp
## - Male 1 0.1534 30.997 3.8924
## - SS 1 0.2815 31.125 4.7642
## <none> 30.844 4.8488
## - SATV 1 0.5898 31.434 6.8613
## - White 1 1.2934 32.137 11.6483
## - HU 1 2.8154 33.659 22.0034
## - HSGPA 1 6.6441 37.488 48.0526
##
## Step: AIC=3.89
## GPA ~ HSGPA + SATV + HU + SS + White
##
## Df Sum of Sq RSS Cp
## <none> 30.997 3.8924
## - SS 1 0.2951 31.292 3.9005
## - SATV 1 0.7005 31.698 6.6582
## - White 1 1.3133 32.310 10.8273
## - HU 1 2.7987 33.796 20.9339
## - HSGPA 1 6.4968 37.494 46.0938

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = FirstYearGPA)
##
## Coefficients:
## (Intercept)        HSGPA         SATV           HU           SS        White
##   0.5684876    0.4739983    0.0007481    0.0167447    0.0077474    0.2060408

Notice that using Cp as the criterion gives a slightly different final model with five
predictors, including the SS term that was the last one eliminated when we used P-values
to make the decision to drop terms. The Cp when including SS (3.8924) is just barely less
than when it is dropped (3.9005). This example illustrates that different criteria
for choosing the “best model” can lead to different final models.
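
To put the two candidates side by side, the sketch below (our own illustration; the data frame name comparison is hypothetical) refits the four- and five-predictor models, runs a nested F-test for SS, and tabulates adjusted R-squared and Cp for each.

```r
library(Stat2Data)
data(FirstYearGPA)

mod4 <- lm(GPA ~ HSGPA + SATV + HU + White, data = FirstYearGPA)
mod5 <- update(mod4, . ~ . + SS)
full <- lm(GPA ~ ., data = FirstYearGPA)
MSE <- summary(full)$sigma^2

# Nested F-test for adding SS (same P-value as the t-test for SS in mod5)
anova(mod4, mod5)

# Adjusted R-squared and Cp for the two candidates
comparison <- data.frame(
  model = c("mod4", "mod5"),
  adjR2 = c(summary(mod4)$adj.r.squared, summary(mod5)$adj.r.squared),
  Cp = c(extractAIC(mod4, scale = MSE)[2], extractAIC(mod5, scale = MSE)[2])
)
comparison
```

Here the P-value rule and the Cp rule disagree about SS, and the gap between the two models is small on every measure, so either could reasonably be defended.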

STEPWISE REGRESSION
If you want to suppress the individual steps and show only the final model, add trace = 0,
but note that there is often valuable information in the individual steps.
step(none,scope=formula(full),scale=MSE,trace=0)
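
Because step( ) returns the final fitted model, the result can also be stored and reused. A minimal sketch, with final as our own object name:

```r
library(Stat2Data)
data(FirstYearGPA)

full <- lm(GPA ~ ., data = FirstYearGPA)
none <- lm(GPA ~ 1, data = FirstYearGPA)
MSE <- summary(full)$sigma^2

# Store the final lm object instead of only printing it
final <- step(none, scope = formula(full), scale = MSE, trace = 0)
formula(final)
summary(final)$adj.r.squared
```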
