HW6 Solution
Yao Song
12/9/2020
Q 9.2
# import dataset
dat1 <- read.csv("data-table-B21.csv", header = T, sep = ",") # 13 obs. of 6 var.
dat1 <- dat1[, -1]
a From the matrix of correlations between the regressors, would you suspect that multicollinearity is present?
corr <- cor(dat1[, 2:5])
corr
c Find the eigenvalues of X'X.
Using the eigen() function in R to perform the eigendecomposition, we obtain the following eigenvalues.
eigendom <- eigen(corr)
value <- eigendom$values
value
## [1] 2.235704035 1.576066070 0.186606149 0.001623746
d Find the condition number of X'X.
k <- max(value)/min(value)
The condition number of X'X is
κ = λmax / λmin,
where λmax is the largest eigenvalue of X'X and λmin is the smallest eigenvalue of X'X. Here the condition number of X'X is 1376.8806. Because κ is larger than 1000, severe multicollinearity is indicated.
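As a quick cross-check (a minimal sketch, not part of the original solution), base R's kappa() with exact = TRUE returns the 2-norm condition number, which for a symmetric positive-definite matrix such as corr equals λmax/λmin:
kappa(corr, exact = TRUE) # exact 2-norm condition number; should agree with k above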
Q 9.7
# import dataset
dat2 <- read.csv("data-table-B3.csv", header = T, sep = ",") # 32 obs. of 12 var.
dat2 <- dat2[-which(is.na(dat2$x3)), ]
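The correlation matrix of the regressors shown below can be obtained as follows (a sketch of the presumed code; the corr object is reused later for the VIFs and the condition number):
corr <- cor(dat2[, paste0("x", 1:11)]) # correlation matrix of the regressors x1-x11
corr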
## x1 x2 x3 x4 x5 x6
## x1 1.0000000 0.9408473 0.9891628 -0.34697246 -0.6720903 0.64279836
## x2 0.9408473 1.0000000 0.9643592 -0.28989951 -0.5509642 0.76141897
## x3 0.9891628 0.9643592 1.0000000 -0.32599915 -0.6728661 0.65312630
## x4 -0.3469725 -0.2898995 -0.3259992 1.00000000 0.4137808 0.03748643
## x5 -0.6720903 -0.5509642 -0.6728661 0.41378081 1.0000000 -0.21952829
## x6 0.6427984 0.7614190 0.6531263 0.03748643 -0.2195283 1.00000000
## x7 -0.7719151 -0.6259445 -0.7461800 0.55823570 0.8717662 -0.27563863
## x8 0.8623681 0.8027387 0.8641224 -0.30415026 -0.5613315 0.42206800
## x9 0.7974811 0.7105117 0.7881284 -0.37817358 -0.4534470 0.30038618
## x10 0.9515520 0.8878810 0.9434871 -0.35845879 -0.5798617 0.52036693
## x11 0.8244446 0.7086735 0.8012765 -0.44054570 -0.7546650 0.39548928
## x7 x8 x9 x10 x11
## x1 -0.7719151 0.8623681 0.7974811 0.9515520 0.8244446
## x2 -0.6259445 0.8027387 0.7105117 0.8878810 0.7086735
## x3 -0.7461800 0.8641224 0.7881284 0.9434871 0.8012765
## x4 0.5582357 -0.3041503 -0.3781736 -0.3584588 -0.4405457
## x5 0.8717662 -0.5613315 -0.4534470 -0.5798617 -0.7546650
## x6 -0.2756386 0.4220680 0.3003862 0.5203669 0.3954893
## x7 1.0000000 -0.6552065 -0.6551300 -0.7058126 -0.8506963
## x8 -0.6552065 1.0000000 0.8831512 0.9554541 0.6824919
## x9 -0.6551300 0.8831512 1.0000000 0.8994711 0.6326677
## x10 -0.7058126 0.9554541 0.8994711 1.0000000 0.7530353
## x11 -0.8506963 0.6824919 0.6326677 0.7530353 1.0000000
From the correlation matrix of the regressors, we find that many of the off-diagonal elements are close to 1, which indicates that there may be several near-linear dependencies in the gasoline mileage data.
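For example, the strongly correlated regressor pairs can be listed directly from corr (a small sketch, not part of the original solution):
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE) # pairs with |r| > 0.9
data.frame(var1 = rownames(corr)[high[, 1]], var2 = colnames(corr)[high[, 2]], r = corr[high])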
b Calculate the variance inflation factors and the condition number of X'X. Is there any evidence of multicollinearity?
The variance inflation factors are shown below.
diag(solve(corr))
## x1 x2 x3 x4 x5 x6 x7
## 119.487804 42.800811 149.234409 2.060036 7.729187 5.324730 11.761341
## x8 x9 x10 x11
## 20.917632 9.397108 85.744344 5.145052
Since the VIFs for x1, x2, x3, x7, x8 and x10 are greater than 10, there is strong evidence of multicollinearity.
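Equivalently (a sketch, assuming the car package is available), the same VIFs can be obtained from the full model:
library(car)
vif(lm(y ~ ., data = dat2)) # should reproduce the diag(solve(corr)) values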
The condition number of X'X is computed below.
eigendom <- eigen(corr)
value <- eigendom$values
k <- max(value)/min(value)
The condition number κ is 2025.2393, which is larger than 1000, so severe multicollinearity is again indicated.
Q 10.5
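The subset-selection plots below use a sumbest object; a minimal sketch of how it can be built with the leaps package (an assumed choice, consistent with the all-possible-regressions approach referenced in part b) is:
library(leaps)
best <- regsubsets(y ~ ., data = dat2) # exhaustive all-subsets search; default nvmax = 8 subset sizes
sumbest <- summary(best)               # provides $adjr2, $cp, $which, ...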
plot(sumbest$adjr2, main = "AdjR2 vs. p", xlab = "p")
abline(h = max(sumbest$adjr2))
[Figure: AdjR2 vs. p — adjusted R2 of the best subset of each size p, with a horizontal line at the maximum]
From the plot of the adjusted R2 versus p, we can conclude that the model involving x5, x8 and x10 may be an appropriate model, with the largest adjusted R2 = 0.781.
plot(sumbest$cp, main = "Cp vs. p", xlab = "p")
abline(h = min(sumbest$cp))
[Figure: Cp vs. p — Cp of the best subset of each size p, with a horizontal line at the minimum]
The plot of Cp versus p also indicates that it may be appropriate to choose the model involving x5, x8 and x10, because it has the smallest Cp = −0.502.
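As a sketch (assuming sumbest is the regsubsets summary defined above), the chosen subset and its statistics can be read off directly:
p_best <- which.max(sumbest$adjr2)                        # subset size with the largest adjusted R2
sumbest$which[p_best, ]                                   # which regressors are in that subset
c(adjr2 = sumbest$adjr2[p_best], cp = sumbest$cp[p_best]) # its adjusted R2 and Cp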
fit3 <- lm(y ~ x5 + x8 + x10, data = dat2)
summary(fit3)
##
## Call:
## lm(formula = y ~ x5 + x8 + x10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6101 -1.9868 -0.6613 2.0369 5.8811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590404 11.771925 0.390 0.6998
## x5 2.597240 1.264562 2.054 0.0502 .
## x8 0.217814 0.087817 2.480 0.0199 *
## x10 -0.009485 0.001994 -4.757 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 26 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.7808
## F-statistic: 35.44 on 3 and 26 DF, p-value: 2.462e-09
Thus, the appropriate model relating y to x5, x8 and x10 is
ŷ = 4.5904 + 2.5972 x5 + 0.2178 x8 − 0.0095 x10.
b Use stepwise regression to specify a subset regression model. Does this lead to the same model found in
part a?
fit31 <- lm(y ~ ., data = dat2)
select <- step(fit31, direction="both", trace = 0)
summary(select)
##
## Call:
## lm(formula = y ~ x5 + x8 + x10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6101 -1.9868 -0.6613 2.0369 5.8811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590404 11.771925 0.390 0.6998
## x5 2.597240 1.264562 2.054 0.0502 .
## x8 0.217814 0.087817 2.480 0.0199 *
## x10 -0.009485 0.001994 -4.757 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 26 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.7808
## F-statistic: 35.44 on 3 and 26 DF, p-value: 2.462e-09
The model from stepwise variable selection is the same as the one obtained from the all-possible-regressions approach. Therefore, there is strong evidence that the model relating y to x5, x8 and x10 is an appropriate model for the gasoline mileage performance data.
Q 10.8
Use the all-possible-regressions method to select a subset regression model for the Belle Ayr liquefaction data
in Table B.5. Evaluate the subset models using the Cp criterion. Justify your choice of final model using the
standard checks for model adequacy.
# import dataset
dat4 <- read.csv("data-table-B5.csv", header = T, sep = ",") # 27 obs. of 8 var.
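As in Q 10.5, the subset-selection results below use a sumbest object, presumably built with the leaps package along these lines (a sketch of the assumed code):
library(leaps)
best <- regsubsets(y ~ ., data = dat4) # exhaustive search over the seven regressors x1-x7
sumbest <- summary(best)
sumbest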
## x5 FALSE FALSE
## x6 FALSE FALSE
## x7 FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6 x7
## 1 ( 1 ) " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " " " "*" "*"
## 3 ( 1 ) " " " " " " "*" " " "*" "*"
## 4 ( 1 ) " " " " "*" "*" " " "*" "*"
## 5 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
plot(sumbest$adjr2, main = "AdjR2 vs. p", xlab = "p")
abline(h = max(sumbest$adjr2))
[Figure: AdjR2 vs. p — adjusted R2 of the best subset of each size p, with a horizontal line at the maximum]
From the plot of the adjusted R2 versus p, we can conclude that the model involving x6 and x7 may be an appropriate model, with the largest adjusted R2 = 0.675.
plot(sumbest$cp, main = "Cp vs. p", xlab = "p")
abline(h = min(sumbest$cp))
[Figure: Cp vs. p — Cp of the best subset of each size p, with a horizontal line at the minimum]
The plot of Cp versus p also indicates that it may be appropriate to choose the model involving x6 and x7, because it has the smallest Cp = −0.021.
fit4 <- lm(y ~ x6 + x7, data = dat4)
summary(fit4)
##
## Call:
## lm(formula = y ~ x6 + x7, data = dat4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2035 -4.3713 0.2513 4.9339 21.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.526460 3.610055 0.700 0.4908
## x6 0.018522 0.002747 6.742 5.66e-07 ***
## x7 2.185753 0.972696 2.247 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6746
## F-statistic: 27.95 on 2 and 24 DF, p-value: 5.391e-07
The linear model relating y to x6 and x7 is
ŷ = 2.5265 + 0.0185 x6 + 2.1858 x7.
Since the F statistic is 27.95 and the p-value (about 5.4e-07) is less than the significance level (0.05), we reject H0: β6 = β7 = 0 and conclude that there is a linear relationship between y and at least one of the regressors x6, x7.
R2 for the model is 0.6996, indicating that 69.96% of the total variability in y is explained by this model. The adjusted R2 is 0.6746.
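The overall F statistic and its p-value can also be pulled from the summary object directly (a brief sketch):
fstat <- summary(fit4)$fstatistic                    # c(value, numdf, dendf)
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE) # overall F-test p-value (about 5.4e-07)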
fit4_std <- rstandard(fit4)
qqnorm(fit4_std, ylab="Standardized Residuals", xlab="Normal Scores")
qqline(fit4_std, col = 2)
[Figure: Normal Q-Q plot — Standardized Residuals vs. Normal Scores, with reference line]
From the QQ plot, although there are some deviations from normality at the tails, the pattern is largely consistent with the normality assumption.
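An optional numerical check of the normality assumption (not part of the original solution) is the Shapiro-Wilk test:
shapiro.test(fit4_std) # Shapiro-Wilk test on the standardized residuals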
plot(fit4$fitted.values, fit4$residuals, ylab = 'Residuals', xlab = 'Fitted Values',
     main = 'Residuals vs Fitted')
abline(0, 0)
[Figure: Residuals vs Fitted — residuals of fit4 plotted against fitted values]
There is no significant pattern in the residuals versus fitted values plot, which suggests that the model meets the linearity assumption. Therefore, these results show that the model with the two regressors x6 and x7 is adequate.