
HW6_solution

Yao Song

12/9/2020

Q 9.2

# import dataset
dat1 <- read.csv("data-table-B21.csv", header = T, sep = ",") # 13 obs. of 6 var.
dat1 <- dat1[, -1]

a From the matrix of correlations between the regressors, would you suspect that multicollinearity is present?
corr <- cor(dat1[, 2:5])
corr

##            x_1        x_2        x_3        x_4
## x_1 1.0000000 0.2285795 -0.8241338 -0.2454451
## x_2 0.2285795 1.0000000 -0.1392424 -0.9729550
## x_3 -0.8241338 -0.1392424 1.0000000 0.0295370
## x_4 -0.2454451 -0.9729550 0.0295370 1.0000000
From the correlation matrix of the regressors x1, x2, x3, x4, we see high correlations for the pairs (x2, x4) and (x1, x3). Thus the correlation matrix indicates near-linear dependencies in the Hald cement data, i.e., multicollinearity appears to be present.

b Calculate the variance inflation factors.

$$\mathrm{VIF}_j = C_{jj} = (1 - R_j^2)^{-1},$$

where C = (X'X)^{-1} with X'X in correlation form, so the VIFs are the diagonal elements of the inverse of the correlation matrix. The variance inflation factor (VIF) for each regression coefficient is calculated below.
diag(solve(corr))

##       x_1       x_2       x_3       x_4
## 38.49621 254.42317 46.86839 282.51286
Since all four VIFs are greater than 10, this indicates that the associated regression coefficients are poorly estimated because of multicollinearity.
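As a cross-check (a minimal sketch, assuming dat1 keeps the column names x_1 through x_4 shown in the output above), the same values can be recovered directly from the definition VIF_j = (1 − R_j^2)^{−1} by regressing each regressor on the remaining ones:
# Sketch: recompute each VIF from the R^2 of regressing x_j on the other
# regressors; column names are assumed from the correlation output above.
xnames <- c("x_1", "x_2", "x_3", "x_4")
sapply(xnames, function(v) {
  r2 <- summary(lm(reformulate(setdiff(xnames, v), response = v), data = dat1))$r.squared
  1 / (1 - r2)  # should match diag(solve(corr))
})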

c Find the eigenvalues of X'X.
By using eigen function in R to do eigen decomposition, we get eigenvalues as the following.
eigendom <- eigen(corr)
value <- eigendom$values
value

## [1] 2.235704035 1.576066070 0.186606149 0.001623746

d Find the condition number of X'X.
k <- max(value)/min(value)

The condition number of X'X is

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}},$$

where λmax is the largest eigenvalue of X'X and λmin is the smallest eigenvalue of X'X. The condition number of X'X is 1376.8806. Because κ is larger than 1000, severe multicollinearity is indicated.
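As an optional cross-check (a sketch), base R's kappa() with exact = TRUE returns the ratio of the largest to smallest singular value, which for this symmetric positive-definite correlation matrix equals λmax/λmin:
# Sketch: singular values equal eigenvalues for the symmetric correlation
# matrix, so this should reproduce k computed above.
kappa(corr, exact = TRUE)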

Q 9.7

# import dataset
dat2 <- read.csv("data-table-B3.csv", header = T, sep = ",") # 32 obs. of 12 var.
dat2 <- dat2[-which(is.na(dat2$x3)), ]

a Does the correlation matrix give any indication of multicollinearity?


corr <- cor(dat2[, 2:12])
corr

## x1 x2 x3 x4 x5 x6
## x1 1.0000000 0.9408473 0.9891628 -0.34697246 -0.6720903 0.64279836
## x2 0.9408473 1.0000000 0.9643592 -0.28989951 -0.5509642 0.76141897
## x3 0.9891628 0.9643592 1.0000000 -0.32599915 -0.6728661 0.65312630
## x4 -0.3469725 -0.2898995 -0.3259992 1.00000000 0.4137808 0.03748643
## x5 -0.6720903 -0.5509642 -0.6728661 0.41378081 1.0000000 -0.21952829
## x6 0.6427984 0.7614190 0.6531263 0.03748643 -0.2195283 1.00000000
## x7 -0.7719151 -0.6259445 -0.7461800 0.55823570 0.8717662 -0.27563863
## x8 0.8623681 0.8027387 0.8641224 -0.30415026 -0.5613315 0.42206800
## x9 0.7974811 0.7105117 0.7881284 -0.37817358 -0.4534470 0.30038618
## x10 0.9515520 0.8878810 0.9434871 -0.35845879 -0.5798617 0.52036693
## x11 0.8244446 0.7086735 0.8012765 -0.44054570 -0.7546650 0.39548928
## x7 x8 x9 x10 x11
## x1 -0.7719151 0.8623681 0.7974811 0.9515520 0.8244446
## x2 -0.6259445 0.8027387 0.7105117 0.8878810 0.7086735
## x3 -0.7461800 0.8641224 0.7881284 0.9434871 0.8012765
## x4 0.5582357 -0.3041503 -0.3781736 -0.3584588 -0.4405457
## x5 0.8717662 -0.5613315 -0.4534470 -0.5798617 -0.7546650
## x6 -0.2756386 0.4220680 0.3003862 0.5203669 0.3954893
## x7 1.0000000 -0.6552065 -0.6551300 -0.7058126 -0.8506963
## x8 -0.6552065 1.0000000 0.8831512 0.9554541 0.6824919
## x9 -0.6551300 0.8831512 1.0000000 0.8994711 0.6326677
## x10 -0.7058126 0.9554541 0.8994711 1.0000000 0.7530353
## x11 -0.8506963 0.6824919 0.6326677 0.7530353 1.0000000
From the correlation matrix of the regressors, we find that many off-diagonal elements are close to 1 in absolute value, which indicates that there may be several near-linear dependencies in the gasoline mileage data.
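To make these dependencies explicit, here is a small sketch (the 0.9 cutoff is an arbitrary illustrative choice) that lists the regressor pairs with |r| > 0.9:
# Sketch: extract the regressor pairs whose absolute correlation exceeds 0.9.
idx <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
data.frame(pair = paste(rownames(corr)[idx[, 1]], colnames(corr)[idx[, 2]], sep = " & "),
           r = round(corr[idx], 3))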

b Calculate the variance inflation factors and the condition number of X'X. Is there any evidence of multicollinearity?
The variance inflation factors are shown in the following.
diag(solve(corr))

## x1 x2 x3 x4 x5 x6 x7
## 119.487804 42.800811 149.234409 2.060036 7.729187 5.324730 11.761341
## x8 x9 x10 x11
## 20.917632 9.397108 85.744344 5.145052
Since the VIFs for x1, x2, x3, x7, x8, and x10 are greater than 10, there is strong evidence of multicollinearity.
The condition number of X'X is shown below.
eigendom <- eigen(corr)
value <- eigendom$values
k <- max(value)/min(value)

Because the condition number κ is 2025.2393, which is larger than 1000, severe multicollinearity is indicated.
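Beyond the single condition number, the condition indices λmax/λj can also be examined (a sketch; by the same rule of thumb, the number of indices above 1000 suggests how many near-linear dependencies are present):
# Sketch: condition indices lambda_max / lambda_j for the correlation matrix.
round(max(value) / value, 1)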

Q 10.5

a Use the all-possible-regressions approach to find an appropriate regression model.


library(leaps)  # regsubsets() for all-possible-regressions
best <- regsubsets(y ~ ., data = dat2)
sumbest <- summary(best)
sumbest

## Subset selection object
## Call: regsubsets.formula(y ~ ., data = dat2)
## 11 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## x5 FALSE FALSE
## x6 FALSE FALSE
## x7 FALSE FALSE
## x8 FALSE FALSE
## x9 FALSE FALSE
## x10 FALSE FALSE
## x11 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " "*" " " "*" " "
## 4 ( 1 ) " " " " " " " " "*" " " " " "*" "*" "*" " "
## 5 ( 1 ) " " " " " " " " "*" " " "*" "*" "*" "*" " "
## 6 ( 1 ) " " " " " " "*" "*" " " "*" "*" "*" "*" " "
## 7 ( 1 ) "*" " " "*" " " "*" " " "*" "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" "*" " "

plot(sumbest$adjr2, main = "AdjR2 vs. p", xlab = "p")
abline(h = max(sumbest$adjr2))

[Figure: AdjR2 vs. p — plot of sumbest$adjr2 against subset size p, with a horizontal reference line at the maximum adjusted R².]
From the plot of R²_Adj,p versus p, we can conclude that the model involving x5, x8, and x10 may be an appropriate model, with the largest R²_Adj,p = 0.781.
plot(sumbest$cp, main = "Cp vs. p", xlab = "p")
abline(h = min(sumbest$cp))

[Figure: Cp vs. p — plot of sumbest$cp against subset size p, with a horizontal reference line at the minimum Cp.]
The plot of Cp versus p also indicates that it may be appropriate to choose the model involving x5, x8, and x10, because it has the smallest Cp = −0.502.
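For reference, a small sketch tabulating both criteria for the best subset of each size, using the summary object already computed above:
# Sketch: adjusted R^2 and Cp for the best subset of each size p.
data.frame(size = seq_along(sumbest$cp),
           adjr2 = round(sumbest$adjr2, 3),
           cp = round(sumbest$cp, 3))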
fit3 <- lm(y ~ x5 + x8 + x10, data = dat2)
summary(fit3)

##
## Call:
## lm(formula = y ~ x5 + x8 + x10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6101 -1.9868 -0.6613 2.0369 5.8811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590404 11.771925 0.390 0.6998
## x5 2.597240 1.264562 2.054 0.0502 .
## x8 0.217814 0.087817 2.480 0.0199 *
## x10 -0.009485 0.001994 -4.757 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 26 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.7808
## F-statistic: 35.44 on 3 and 26 DF, p-value: 2.462e-09
Thus, the appropriate model relating y to x5 , x8 and x10 is

ŷ = 4.5904 + 2.5972x5 + 0.2178x8 − 0.0095x10 .

b Use stepwise regression to specify a subset regression model. Does this lead to the same model found in
part a?
fit31 <- lm(y ~ ., data = dat2)
select <- step(fit31, direction="both", trace = 0)
summary(select)

##
## Call:
## lm(formula = y ~ x5 + x8 + x10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6101 -1.9868 -0.6613 2.0369 5.8811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590404 11.771925 0.390 0.6998
## x5 2.597240 1.264562 2.054 0.0502 .
## x8 0.217814 0.087817 2.480 0.0199 *
## x10 -0.009485 0.001994 -4.757 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 26 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.7808
## F-statistic: 35.44 on 3 and 26 DF, p-value: 2.462e-09
The model from stepwise variable selection is the same as the one found by the all-possible-regressions approach:

ŷ = 4.5904 + 2.5972x5 + 0.2178x8 − 0.0095x10 .

Therefore, there is strong evidence that the model relating y to x5, x8, and x10 is an appropriate model for the gasoline mileage performance data.
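As an optional robustness check (a sketch, not part of the original solution), forward selection from the intercept-only model can be run over the same scope; with the default AIC criterion it is expected, though not guaranteed, to arrive at a similar subset:
# Sketch: forward selection starting from the intercept-only model.
fit0 <- lm(y ~ 1, data = dat2)
forward <- step(fit0, scope = formula(fit31), direction = "forward", trace = 0)
formula(forward)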

Q 10.8

Use the all-possible-regressions method to select a subset regression model for the Belle Ayr liquefaction data
in Table B.5. Evaluate the subset models using the Cp criterion. Justify your choice of final model using the
standard checks for model adequacy.
# import dataset
dat4 <- read.csv("data-table-B5.csv", header = T, sep = ",") # 27 obs. of 8 var.

best <- regsubsets(y ~ ., data = dat4)


sumbest <- summary(best)
sumbest

## Subset selection object
## Call: regsubsets.formula(y ~ ., data = dat4)
## 7 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE

## x5 FALSE FALSE
## x6 FALSE FALSE
## x7 FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6 x7
## 1 ( 1 ) " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " " " "*" "*"
## 3 ( 1 ) " " " " " " "*" " " "*" "*"
## 4 ( 1 ) " " " " "*" "*" " " "*" "*"
## 5 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
plot(sumbest$adjr2, main = "AdjR2 vs. p", xlab = "p")
abline(h = max(sumbest$adjr2))

[Figure: AdjR2 vs. p — plot of sumbest$adjr2 against subset size p, with a horizontal reference line at the maximum adjusted R².]
From the plot of R²_Adj,p versus p, we can conclude that the model involving x6 and x7 may be an appropriate model, with the largest R²_Adj,p = 0.675.
plot(sumbest$cp, main = "Cp vs. p", xlab = "p")
abline(h = min(sumbest$cp))

[Figure: Cp vs. p — plot of sumbest$cp against subset size p, with a horizontal reference line at the minimum Cp.]
The plot of Cp versus p also indicates that it may be appropriate to choose the model involving x6 and x7, because it has the smallest Cp = −0.021.
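The coefficients of the Cp-best subset can also be pulled directly from the regsubsets object (a sketch using the coef method provided by leaps):
# Sketch: coefficients of the subset with the smallest Cp.
coef(best, id = which.min(sumbest$cp))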
fit4 <- lm(y ~ x6 + x7, data = dat4)
summary(fit4)

##
## Call:
## lm(formula = y ~ x6 + x7, data = dat4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2035 -4.3713 0.2513 4.9339 21.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.526460 3.610055 0.700 0.4908
## x6 0.018522 0.002747 6.742 5.66e-07 ***
## x7 2.185753 0.972696 2.247 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6746
## F-statistic: 27.95 on 2 and 24 DF, p-value: 5.391e-07
The linear model relating y to x6 and x7 is

ŷ = 2.5265 + 0.0185x6 + 2.1858x7 .

Since the F statistic is 27.95 with a p-value of 5.391e-07, which is far below the 0.05 significance level, we reject H0 : β6 = β7 = 0 and conclude that there is a linear relationship between y and at least one of the regressors x6, x7. The R² of the model is 0.6996, indicating that 69.96% of the total variability in y is explained by this model; the adjusted R² is 0.6746.
fit4_std <- rstandard(fit4)
qqnorm(fit4_std, ylab="Standardized Residuals", xlab="Normal Scores")
qqline(fit4_std, col = 2)

[Figure: Normal Q-Q plot of the standardized residuals (Normal Scores on the x-axis, Standardized Residuals on the y-axis) with a reference line.]
From the QQ plot, although there are some deviations from normality in the tails, the standardized residuals are approximately consistent with the normality assumption.
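A formal test can complement the visual check (a sketch; the Shapiro-Wilk test is only one of several options):
# Sketch: Shapiro-Wilk test on the standardized residuals; a large p-value
# would be consistent with the normality assumption.
shapiro.test(fit4_std)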
plot(fit4$fitted.values, fit4$residuals, ylab = 'Residuals',
     xlab = 'Fitted Values', main = 'Residuals vs Fitted')
abline(0, 0)

[Figure: Residuals vs Fitted — residuals of fit4 plotted against fitted values, with a horizontal line at zero.]
There is no obvious pattern in the residuals versus fitted values plot, which suggests that the linearity and constant-variance assumptions are reasonable. Therefore, these results indicate that the model with the two regressors x6 and x7 is adequate.
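As an additional adequacy check (a sketch, not part of the original solution), the PRESS statistic and the corresponding prediction R² can be computed from the hat values:
# Sketch: PRESS statistic and prediction R^2 for the chosen model.
press <- sum((residuals(fit4) / (1 - lm.influence(fit4)$hat))^2)
sst <- sum((dat4$y - mean(dat4$y))^2)
c(PRESS = press, R2_prediction = 1 - press / sst)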

