H-311 Linear Regression Analysis With R
Md. Mostakim
Session: 2018-19
Department Of Statistics
University Of Dhaka
Published Date:
Acknowledgements:
I would like to acknowledge our course teacher Dr. Zillur Rahman Shabuz Sir for teaching
us regression analysis so beautifully. This is just a tiny reflection of Sir's guidance.
N.B. You may share this PDF book as much as you like, but don't use it for any unethical purpose.
For any kind of feedback, please contact [email protected]. Your feedback will be very inspiring for me.
Table of Contents
Problem 01: Simple Linear Regression Model
Problem 02: Simple Linear Regression Model
Problem 03: Multiple Regression Model
Problem 04: Multiple Regression Model
Problem 05: Multiple Regression Model
Problem 06: Polynomial Model
Problem 07: Dummy Variable with Interaction
Problem 08: Dummy Variable
Problem 09: Dummy Variable
Problem 10: Dummy Variable with Interaction Effect
Problem 11: Multicollinearity
Problem 12: Model Selection
Problem 13: Model Selection
Problem 14: Model Selection
Problem 15: Model Selection
Problem 16: Model Adequacy Checking
Problem 17: Model Adequacy Checking
Problem 18: In course Question 2022
Problem 19: In course Question 2019
Problem 20: Final Question 2019
Problem 21: Final Question 2018
References
Problem 01: Simple Linear Regression Model
Use the following data to answer the questions:
a. Write down the fitted model and interpret the regression coefficients.
b. Construct the analysis of variance table and test for significance of regression. Also,
calculate and interpret R-square.
c. Find the fitted values and check whether the sum of the observed values equals the
sum of the fitted values.
d. Find the residuals and check whether the sum of residuals equals zero.
e. Check whether the sum of the residuals weighted by the corresponding value of the
regressor always equals zero. Also, check whether the sum of the residuals weighted
by the corresponding fitted value always equals zero.
f. Construct an equal tailed 90% confidence interval for the parameter 𝞫 and make
conclusion about the hypotheses.
a.
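The code that produced the results below is not shown here. A minimal sketch, assuming the ten (x, y) observations are stored in a data frame named data (the formula matches the Call in the summary output of part (b)):
> RegOut<-lm(y~x,data=data) # fit the simple linear regression of y on x
RegOut # prints the estimated intercept (80) and slope (4)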
Interpretation:
The intercept is 80. It can be interpreted as the predicted value of y when x is zero. This
means that, for x equal to zero, we can expect a value of y = 80, on average. The regression
coefficient for the variable x, also known as the slope, is 4. This means that, for a unit
change in x, we can expect an increase of 4 units in y, on average.
b.
> summary(RegOut)
> ##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.00 -3.25 -1.00 3.75 6.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.0000 3.0753 26.01 5.12e-09 ***
## x 4.0000 0.3868 10.34 6.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.61 on 8 degrees of freedom
## Multiple R-squared: 0.9304, Adjusted R-squared: 0.9217
## F-statistic: 106.9 on 1 and 8 DF, p-value: 6.609e-06
> anova(RegOut)
> ## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 2272 2272.00 106.92 6.609e-06 ***
## Residuals 8 170 21.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The value of R-square = 0.9304, which means that about 93% of the variation in y can be
explained by the regression on x.
c.
> fitted<-predict(RegOut)
fitted
> ## 1 2 3 4 5 6 7 8 9 10
## 84 92 96 96 104 112 120 120 124 132
> RegOut$fitted.values
> ## 1 2 3 4 5 6 7 8 9 10
## 84 92 96 96 104 112 120 120 124 132
> sum(data$y)
> ## [1] 1080
> sum(fitted)
> ## [1] 1080
Therefore, the sum of the observed values equals the sum of the fitted values.
d.
> resi<-resid(RegOut)
RegOut$residuals
> ## 1 2 3 4 5 6 7 8 9 10
## -4 5 -4 6 -1 -1 -1 3 -7 4
> sum(resi)
> ## [1] 1.221245e-15
The sum of the residuals is 1.22e-15, which is zero up to floating-point rounding; hence the residuals sum to zero.
e.
> sum(data$x*resi)
> ## [1] 7.105427e-15
> sum(resi*fitted)
> ## [1] -2.842171e-14
Both weighted sums above are zero up to floating-point rounding. Therefore, the sum of the
residuals weighted by the corresponding value of the regressor equals zero, and the sum of
the residuals weighted by the corresponding fitted value equals zero.
f.
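The interval quoted below can be obtained with confint(); a minimal sketch, assuming the fitted object RegOut from part (a):
> confint(RegOut,level=0.90) # equal-tailed 90% CI for the intercept and slope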
Since the 90% CI for beta is [3.3, 4.7], which does not include the value zero, we reject
Ho: beta = 0 and conclude that beta is highly significant.
g.
> new<-matrix(c(12))
colnames(new)<-'x'
new<-data.frame(new)
predict(RegOut, newdata =new, interval = 'confidence')
> ## fit lwr upr
## 1 128 122.4148 133.5852
h.

Problem 02: Simple Linear Regression Model
a. Fit a simple linear regression model relating games won y to yards gained rushing by opponents x8.
b. Construct the analysis of variance table and test for significance of regression.
c. Find a 95% CI on the slope.
d. Find a 95% CI on the mean number of games won if opponents' yards rushing is limited to 2000 yards.
e. Suppose we would like to use the model to predict the number of games a team will win if it can limit opponents' yards rushing to 1800 yards. Find a point estimate of the number of games won when x8 = 1800. Find a 90% prediction interval on the number of games won.
Answer:
The fitted value is 9.14 and a 90% prediction interval on the number of games won if
opponents’ yards rushing is limited to 1800 yards is (4.935,13.351).
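The intermediate code and output are not shown here. A sketch of one possible workflow, assuming the NFL data are table.b1 from the MPV package (the same data set used again in Problem 16), with y = games won and x8 = opponents' yards rushing:
> library(MPV)
data<-table.b1
model<-lm(y~x8,data=data) # (a) fitted simple linear regression
summary(model)
anova(model) # (b) ANOVA table and F test
confint(model,level=0.95) # (c) 95% CI on the slope
predict(model,newdata=data.frame(x8=2000),interval='confidence') # (d)
predict(model,newdata=data.frame(x8=1800),interval='prediction',level=0.90) # (e)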
Problem 03: Multiple Regression Model
We wish to predict BMI on the basis of AGE (in years), HEIGHT (in inches), WEIGHT (in
lbs), and WAIST (in cm). (Population regression model: BMI = 𝞫o + 𝞫1*AGE + 𝞫2*HEIGHT
+ 𝞫3*WEIGHT + 𝞫4*WAIST + 𝛜)
i. Write down the fitted model and interpret the regression coefficients.
ii. Construct the analysis of variance table and test for significance of regression. Also,
interpret R-square.
iii. Also construct an equal tailed 90% confidence interval for the parameter 𝞫1 and
make conclusion about the hypotheses Ho:𝞫1=0 vs H1:𝞫1≠0
v. Construct a 95% Confidence Interval for the mean response when AGE=20,
HEIGHT=66, WEIGHT=156, and WAIST=89. Also, interpret the result.
vi. Also, construct a 95% prediction interval for the information given in (v) and
interpret the result.
vii. Use forward selection procedure to find the best subset of regressors from the
regressors used in (i). [Write R codes for each steps and write down the final
model.]
Answer:
i.
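The fitting code is not shown here. A sketch consistent with the output used in the later parts, assuming the BMI data are stored in a data frame named data:
> model<-lm(BMI~AGE+HEIGHT+WEIGHT+WAIST,data=data)
summary(model) # coefficient estimates interpreted below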
Interpretation:
The intercept has no practical interpretation since age, height, weight, and waist can’t be
zero.
We found that the age coefficient is not significant in the multiple regression model. This
means that, holding height, weight, and waist fixed, changes in age do not significantly
affect BMI.
The height coefficient suggests that for every 1 unit increase in height, holding all other
predictors constant, we can expect a decrease of BMI by 0.77 unit, on average.
The weight coefficient suggests that for every 1 unit increase in weight, holding all other
predictors constant, we can expect an increase of BMI by 0.14 unit, on average.
We found that the waist coefficient is not significant in the multiple regression model. This
means that, holding height, weight, and age fixed, changes in waist do not significantly
affect BMI.
ii.
> anova(model)
> ## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 8.64 8.64 93.2988 2.084e-11 ***
## HEIGHT 1 8.30 8.30 89.5822 3.505e-11 ***
## WEIGHT 1 438.67 438.67 4736.7937 < 2.2e-16 ***
## WAIST 1 0.18 0.18 1.9244 0.1741
## Residuals 35 3.24 0.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The R-square is 0.99 which means that about 99% of the variation can be explained by our
model.
iii.
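A sketch of how the interval quoted below can be obtained, assuming the fitted object model from part (i):
> confint(model,parm='AGE',level=0.90) # equal-tailed 90% CI for 𝞫1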
Since the 90% CI for AGE (𝞫1) is [-0.009444081, 0.00577342], which includes the value 0, we
do not have enough evidence to reject the null hypothesis. Therefore 𝞫1 is insignificant at
the 10% level of significance.
iv.
To test Ho: 𝞫1 = 𝞫2 = 0 we have to perform a partial F-test. The test statistic is

F0 = [ SSR(β1, β2 | β3, β4) / 2 ] / MSRes,

which under Ho follows an F distribution with 2 and 35 degrees of freedom.
> model1<-lm(BMI~WEIGHT+WAIST,data=data)#Removing 𝞫1,𝞫2 to form a new model
anova(model1)
> ## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## WEIGHT 1 293.528 293.528 90.524 1.75e-11 ***
## WAIST 1 45.527 45.527 14.041 0.0006092 ***
## Residuals 37 119.974 3.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ss1<-8.64+8.30+438.67+0.18
ss1
> ## [1] 455.79
> ss2<-293.528+45.527
ss2
> ## [1] 339.055
> ssr<-ss1-ss2
F<-(ssr/2)/0.09 #F-statistic
F
> ## [1] 648.5278
> pf(F,2,35,lower.tail = FALSE)#P-value
> ## [1] 2.198112e-28
Since the P-value is essentially zero, we reject the null hypothesis and conclude that 𝞫1 and
𝞫2 are not both equal to zero.
v.
> new.data<-matrix(c(20,66,156,89),1,4)
colnames(new.data)<-c('AGE','HEIGHT','WEIGHT','WAIST')
new.data1<-data.frame(new.data)
predict(model,newdata = new.data1,interval = "confidence")
> ## fit lwr upr
## 1 25.40121 25.20286 25.59955
Interpretation:
Therefore, the 95% CI for the mean response is (25.2, 25.6) when AGE=20, HEIGHT=66,
WEIGHT=156, and WAIST=89, which means that we are 95% confident that the mean BMI of
people with AGE=20, HEIGHT=66, WEIGHT=156, and WAIST=89 lies between 25.2 and 25.6.
vi.
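The prediction interval quoted below can be obtained with the same new data frame as in part (v); a minimal sketch:
> predict(model,newdata=new.data1,interval='prediction') # 95% prediction interval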
Interpretation:
The 95% prediction interval is (24.75, 26.05), which means that about 95% of new
individuals with AGE=20, HEIGHT=66, WEIGHT=156, and WAIST=89 are expected to have a
BMI between 24.75 and 26.05.
vii.
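The worked answer for (vii) is not included here. One possible single-command sketch uses p-value based forward selection from the olsrr package (also used in Problem 16); the entry cutoff is left at its default:
> library(olsrr)
full_model<-lm(BMI~AGE+HEIGHT+WEIGHT+WAIST,data=data)
ols_step_forward_p(full_model,details=TRUE) # prints each forward-selection step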
Problem 04: Multiple Regression Model
(a) Fit a multiple regression model relating CO2 product (y) to total solvent (x6) and
hydrogen consumption (x7).
(b) Test for significance of regression. Calculate R-square and adjusted R-square.
(d) Refit the model using only X6 as the regressor. Test for significance of regression
and calculate R-square.
(e) Construct a 95% CI on 𝞫6 using the model you fit in part (d). Compare the length of
this CI to the length of the CI in part (c).
(f) Is it possible to construct an un-equal tailed 95% confidence interval (say 1% in the
lower tail and 4% in the upper tail or any other combination) that has length shorter
than the equal tailed confidence interval [Yes/No]?
Answer:
a.
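The fitting code is not shown here. A sketch, assuming the data are the Belle Ayr liquefaction data (table.b5 in the MPV package), where y is CO2 product, x6 is total solvent, and x7 is hydrogen consumption:
> library(MPV)
data<-table.b5
model<-lm(y~x6+x7,data=data) # (a) fit the two-regressor model
summary(model) # (b) overall F test, R-square and adjusted R-square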
b.
R-square = 0.6996
Adjusted R-square = 0.6746
Interpretation: About 70% of the variation in y (R-square = 0.6996) can be explained by our
model, i.e. by x6 and x7; after adjusting for the number of regressors, the adjusted R-square
is about 67.5%.
c.
> confint(model)
> ## 2.5 % 97.5 %
## (Intercept) -4.92432697 9.97724714
## x6 0.01285196 0.02419204
## x7 0.17820756 4.19329833
Interpretation: Since neither the CI for x6 nor the CI for x7 contains the value 0, we can say
that x6 and x7 are statistically significant at the 5% level of significance.
d.
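A sketch of the refit for parts (d) and (e), assuming the same data frame as above:
> model_d<-lm(y~x6,data=data) # (d) model with x6 only
summary(model_d)
confint(model_d,parm='x6',level=0.95) # (e) 95% CI on 𝞫6 from the reduced model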
Therefore, the CI on 𝞫6 from part (c) is longer than the one based on the model in part (d).
f.
No. Because the sampling distribution of the estimator is a symmetric, unimodal t distribution, the equal-tailed interval is the shortest 95% confidence interval; any unequal split of the tail probabilities (such as 1% and 4%) gives a longer interval.

Problem 05: Multiple Regression Model
a. Fit a multiple linear regression model relating the number of games won to the
team’s passing yardage(x2), the percentage of rushing plays(x7), and the opponents’
yards rushing(x8).
b. Construct the analysis - of - variance table and test for significance of regression.
c. Calculate t statistics for testing the hypotheses H0 : β2 = 0, H0 : β7 = 0, and H0 : β8 =
0. What conclusions can you draw about the roles the variables x2, x7, and x8 play in
the model?
e. Using the partial F test, determine the contribution of x7 to the model. How is this
partial F statistic related to the t test for β7 calculated in part c above?
f. Show numerically that the square of the simple correlation coefficient between the
observed values yi and the fitted values 𝑦̂ equals R square.
g. Find a 95% CI on β7 .
h. Find a 95% CI on the mean number of games won by a team when x2 = 2300, x7 =
56.0, and x8 = 2100.
Answer:
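The code and output are not shown here. A sketch of one possible workflow, assuming the NFL data table.b1 from the MPV package:
> library(MPV)
data<-table.b1
model<-lm(y~x2+x7+x8,data=data) # (a) fitted multiple regression
summary(model) # (c) t statistics for beta2, beta7, beta8
anova(model) # (b) ANOVA table
anova(lm(y~x2+x8,data=data),model) # (e) partial F test for x7
confint(model,parm='x7',level=0.95) # (g) 95% CI on beta7
predict(model,newdata=data.frame(x2=2300,x7=56.0,x8=2100),interval='confidence') # (h)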
b. Regression is significant.
g. A 95% confidence interval on the slope parameter β7 is β̂7 ± 2.064(0.08823) = (0.012, 0.376).
h. A 95% confidence interval on the mean number of games won by a team when x2 = 2300,
x7 = 56.0, and x8 = 2100 can be obtained from the predict() call with interval = 'confidence' shown above.
Problem 06: Polynomial Model
Consider a solid fuel rocket propellant that loses weight after it is produced. The following data
are available: (Data download link: polynomial.txt)
a. Fit a second - order polynomial that expresses weight loss as a function of the
number of months since production.
c. Test the hypothesis H0 : β2 = 0. Comment on the need for the quadratic term in this
model.
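The scatter plot referred to in the next sentence can be drawn, for example, with base R (assuming a data frame data with columns x and y, as in the fit below):
> plot(data$x,data$y,xlab='Months since production',ylab='Weight loss') # scatter plot of the data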
From the above plot we can see that we have to fit a polynomial model for the above data.
a.
> model<-lm(y~x+I(x^2),data=data)
summary(model)
> ##
## Call:
## lm(formula = y ~ x + I(x^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005364 -0.002727 0.001045 0.002409 0.003273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.633000 0.004196 389.2 < 2e-16 ***
## x -1.232182 0.007010 -175.8 5.09e-14 ***
## I(x^2) 1.494545 0.002484 601.6 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003568 on 7 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.859e+06 on 2 and 7 DF, p-value: < 2.2e-16
b.
c.
Since the p-value for β2 is essentially zero (< 2e-16), we may reject the null hypothesis and
conclude that β2 is significant. As the plot of y against x is clearly curved, we can say that
the quadratic term is needed in our model.
Problem 07: Dummy Variable with Interaction
(a) Test whether this peacetime recession significantly influences the relationship
between PS and PDI. [Divide the sample data into two periods, 1970 to 1981 and
1982 to 1995, the pre- and post-1982 recession periods.]
(b) Test for the significance of interaction effect.
Answer:
a.
> data<-matrix(c(1970,61.0,727.1,0, 1971,68.6,790.2,0, 1972,63.6,855.3,0, 1973,89.6,965.0,0,
    1974,97.6,1054.2,0, 1975,104.4,1159.2,0, 1976,96.4,1273.0,0, 1977,92.5,1401.4,0,
    1978,112.6,1580.1,0, 1979,130.1,1769.5,0, 1980,161.8,1973.3,0, 1981,199.1,2200.2,0,
    1982,205.5,2347.3,1, 1983,167.0,2522.4,1, 1984,235.7,2810.0,1, 1985,206.2,3002.0,1,
    1986,196.5,3187.6,1, 1987,168.4,3363.1,1, 1988,189.1,3640.8,1, 1989,187.8,3894.5,1,
    1990,208.7,4166.8,1, 1991,246.4,4343.7,1, 1992,272.6,4613.7,1, 1993,214.4,4790.2,1,
    1994,189.4,5021.7,1, 1995,249.3,5320.8,1),26,4,byrow=T)
colnames(data)<-c('YEAR','PS','PDI','DV')
data<-data.frame(data)
> RegOut<-lm(PS~PDI+DV+PDI*DV,data=data)
summary(RegOut)
> ##
## Call:
## lm(formula = PS ~ PDI + DV + PDI * DV, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -38.729 -14.777 -1.398 11.689 50.535
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01612 20.16483 0.050 0.960266
## PDI 0.08033 0.01450 5.541 1.44e-05 ***
## DV 152.47855 33.08237 4.609 0.000136 ***
## PDI:DV -0.06547 0.01598 -4.096 0.000477 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 23.15 on 22 degrees of freedom
## Multiple R-squared: 0.8819, Adjusted R-squared: 0.8658
## F-statistic: 54.78 on 3 and 22 DF, p-value: 2.268e-10
Since the p-value = 2.268e-10, we can say that the peacetime recession significantly
influences the relationship between PS and PDI.
b.
Both the differential intercept coefficient (beta2) and the differential slope coefficient
(beta3) are individually statistically significant, suggesting that the savings-income
relationship has changed between the two time periods. Since beta3 is statistically
significant, we can say that an interaction effect is present.
Problem 08: Dummy Variable
From the plots above we can see that income level differs by gender. So, we can say that
gender has an effect on income level.
Problem 09: Dummy Variable
Consider the data on average salary (in dollars) of public school teachers in 50 states and
the District of Columbia for the year 1985. These 51 areas are classified into three
geographical regions: (1) Northeast and North Central (21 states in all), (2) South (17
states in all), and (3) West (13 states in all).
Fit a regression model using D1 and D2. Interpret the mean salary of teachers from the
different regions.
Answer:
> library(foreign)
data<-read.spss("D:/Mostakim/Data set/salary.sav",to.data.frame = TRUE)
> ## re-encoding from CP1252
> head(data,3)
> ## salary spending D1 D2 D3
## 1 19583 3346 1 0 0
## 2 20263 3114 1 0 0
## 3 20325 3554 1 0 0
> model<-lm(salary~D1+D2,data = data)
summary(model)
> ##
## Call:
## lm(formula = salary ~ D1 + D2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6329.1 -2592.1 -370.6 2143.4 15321.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26159 1128 23.180 <2e-16 ***
## D1 -1734 1436 -1.208 0.2330
## D2 -3265 1499 -2.178 0.0344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4069 on 48 degrees of freedom
## Multiple R-squared: 0.09008, Adjusted R-squared: 0.05217
## F-statistic: 2.376 on 2 and 48 DF, p-value: 0.1038
Interpretation: Region 3 (West) is the base category, so the intercept (26,159) estimates the mean salary of teachers in the West. The estimated mean salary in the Northeast and North Central region is 26,159 − 1,734 ≈ 24,425, and in the South it is 26,159 − 3,265 ≈ 22,894. Only the South differs significantly from the West at the 5% level (p = 0.0344).
Problem 10: Dummy Variable with Interaction Effect
a) Suppose that we expect the regression lines relating tool life to lathe speed to differ
in both intercept and slope (interaction exists). Now, fit a multiple regression model
(model with interaction effect). Write down the model.
b) Check if two regression models are identical (effect of tool type) also test for
significance of interaction effect and comment on it.
Answer:
a.
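The fitting code is not reproduced here, but the Call in the summary below shows the formula used. A sketch, assuming a data frame data with tool life Y, lathe speed X, and tool-type dummy D; from the output, the fitted model is Ŷ = 32.775 − 0.0210X + 23.971D − 0.0119XD:
> model<-lm(Y~X+D+X*D,data=data) # model with different intercepts and slopes for the two tool types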
b.
We have to test the hypothesis Ho: β2 = β3 = 0 (i.e. the two regression lines are identical).
We know that the test statistic is F0 = [ SSR(β2, β3 | β1, βo) / 2 ] / MSRes, which under Ho
follows an F distribution with 2 and 16 degrees of freedom.
> #Finding SSR(𝜷1,𝜷2,𝜷3|𝜷o) and SSR(𝜷1|𝜷o):
summary(model)
> ##
## Call:
## lm(formula = Y ~ X + D + X * D, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -5.1750 -1.4999 0.4849 1.7830 4.8652
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.774760 4.633472 7.073 2.63e-06 ***
## X -0.020970 0.006074 -3.452 0.00328 **
## D 23.970593 6.768973 3.541 0.00272 **
## X:D -0.011944 0.008842 -1.351 0.19553
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.968 on 16 degrees of freedom
## Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
## F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
> anova(model)
> ## Analysis of Variance Table
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 293.01 293.01 33.2545 2.889e-05 ***
## D 1 1125.03 1125.03 127.6847 4.891e-09 ***
## X:D 1 16.08 16.08 1.8248 0.1955
## Residuals 16 140.98 8.81
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model1<-lm(Y~X,data = data)
summary(model1)
> ##
## Call:
## lm(formula = Y ~ X, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -12.9730 -7.2995 -0.9276 7.2326 12.7766
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.61672 9.60323 4.542 0.000253 ***
## X -0.02545 0.01255 -2.028 0.057600 .
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 8.44 on 18 degrees of freedom
## Multiple R-squared: 0.186, Adjusted R-squared: 0.1408
## F-statistic: 4.114 on 1 and 18 DF, p-value: 0.0576
> anova(model1)
> ## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 293.01 293.005 4.1137 0.0576 .
## Residuals 18 1282.08 71.227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ssr1<-293.01+1125.03+16.08 #calculating SSR for model, SSR(𝜷1,𝜷2,𝜷3|𝜷o)
ssr2<-293.005 #calculating SSR for model1, SSR(𝜷1|𝜷o)
ssr<-ssr1-ssr2 #calculating SSR for SSR(𝜷2,𝜷3|𝜷1,𝜷o)
MSres<-8.81#calculating MSres for model
c(ssr1,ssr2,ssr,MSres)
> ## [1] 1434.120 293.005 1141.115 8.810
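The remaining computation of the partial F statistic and its p-value is not shown; it follows the same pattern as in Problem 03:
> F0<-(ssr/2)/MSres # partial F statistic for Ho: beta2 = beta3 = 0
F0
pf(F0,2,16,lower.tail=FALSE) # p-value with 2 and 16 degrees of freedom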
Since for this statistic the P-value = 2.135303e-08, we conclude that the two regression lines
are not identical; that is, we cannot explain the relationship between tool life (hours) and
lathe speed (rpm) with a single regression model.
Testing for the presence of an interaction effect (the two regression lines have a common
slope but possibly different intercepts), i.e. Ho: 𝞫3 = 0:
Since for this statistic P = 0.20, we conclude that the slopes of the two straight lines are the
same (i.e. there is no interaction effect). This can also be determined from the t statistic for
β3 in the first summary table, where the p-value of X:D is 0.1955 ≈ 0.20.
Problem 11: Multicollinearity
Answer:
a.
> data(table.b3)
cor(table.b3[,c(-1,-4)]) # correlations among the regressors, dropping y and x3 (x3 has missing observations)
> ## x1 x2 x4 x5 x6 x7
## x1 1.0000000 0.9452080 -0.33015370 -0.6315968 0.65906008 -0.7814778
## x2 0.9452080 1.0000000 -0.29205832 -0.5170425 0.77190992 -0.6431558
## x4 -0.3301537 -0.2920583 1.00000000 0.3737462 -0.04933889 0.4938104
## x5 -0.6315968 -0.5170425 0.37374620 1.0000000 -0.20535194 0.8428620
## x6 0.6590601 0.7719099 -0.04933889 -0.2053519 1.00000000 -0.3005751
## x7 -0.7814778 -0.6431558 0.49381043 0.8428620 -0.30057509 1.0000000
## x8 0.8551981 0.7973892 -0.25810785 -0.5481227 0.42518809 -0.6630802
## x9 0.8013975 0.7176056 -0.31876434 -0.4343576 0.31567268 -0.6682373
## x10 0.9456621 0.8834004 -0.27721850 -0.5424247 0.52064243 -0.7178265
## x11 0.8354239 0.7266835 -0.36836123 -0.7032485 0.41733783 -0.8549981
## x8 x9 x10 x11
## x1 0.8551981 0.8013975 0.9456621 0.8354239
## x2 0.7973892 0.7176056 0.8834004 0.7266835
## x4 -0.2581079 -0.3187643 -0.2772185 -0.3683612
## x5 -0.5481227 -0.4343576 -0.5424247 -0.7032485
## x6 0.4251881 0.3156727 0.5206424 0.4173378
## x7 -0.6630802 -0.6682373 -0.7178265 -0.8549981
## x8 1.0000000 0.8849771 0.9475859 0.6863079
## x9 0.8849771 1.0000000 0.9015431 0.6507213
## x10 0.9475859 0.9015431 1.0000000 0.7722283
## x11 0.6863079 0.6507213 0.7722283 1.0000000
Some variables are highly correlated with each other, which indicates a potential
multicollinearity problem.
b.
> library(car)
> ## Loading required package: carData
> model<-lm(y~.-x4,data=table.b3)
vif(model)
> ## x1 x2 x3 x5 x6 x7 x8
## 119.355446 36.054178 139.027410 7.726459 4.821291 10.062542 19.719346
## x9 x10 x11
## 9.378732 81.037016 5.141237
Several of the variance inflation factors are much larger than 10, which indicates evidence
of multicollinearity.
Problem 12: Model Selection
(i) Find all possible regressions and their plots, and hence find the best model.
(ii) Find the best possible subset of regressors using the forward stepwise procedure, with R code for each step. Also show the steps with a single R command. Write down the final model.
(iii) Find the best possible subset of regressors using the forward selection procedure, with R code for each step. Also show the steps with a single R command. Write down the final model.
(iv) Find the best possible subset of regressors using the backward elimination procedure, with R code for each step. Also show the steps with a single R command. Write down the final model.
Answer:
Since the variance inflation factors are 38.49, 254.42, 46.86, and 282.51, which are much
higher than 10, they reflect quite serious problems with multicollinearity. Here we will use
model respecification (variable selection) to lessen the impact of multicollinearity.
i.
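The worked answer for (i) is not included here. A sketch using the olsrr package, assuming the data frame data with response y and regressors x1–x4 used in parts (ii)–(iv):
> library(olsrr)
full_model<-lm(y~x1+x2+x3+x4,data=data)
all_subsets<-ols_step_all_possible(full_model) # R-square, adjusted R-square, Cp for every subset
all_subsets
plot(all_subsets) # plots of the criteria against subset size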
ii.
> RegOut1<-lm(y~x4,data=data)
summary(RegOut1)
> ##
## Call:
## lm(formula = y ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762
In the above model x4 significantly affects y. Now checking partial correlation for two
variables given a third variable(x4):
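The pcor.test() function used below comes from the ppcor package, which is assumed to be loaded:
> library(ppcor) # provides pcor.test() for partial correlations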
> pcor.test(data$y,data$x1,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.9567731 1.105281e-06 10.40307 13 1 pearson
> pcor.test(data$y,data$x2,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.1302149 0.6866842 0.4153118 13 1 pearson
> pcor.test(data$y,data$x3,data$x4)
> ## estimate p.value statistic n gp Method
## 1 -0.8950818 8.375467e-05 -6.347801 13 1 pearson
Since the partial correlation between y and x1 is the highest in absolute value and its p-value
is the lowest after controlling for the third variable x4, we add x1 to our model.
> RegOut2<-lm(y~x4+x1,data=data)
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x4 + x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0234 -1.4737 0.1371 1.7305 3.7701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
## x4 -0.61395 0.04864 -12.62 1.81e-07 ***
## x1 1.43996 0.13842 10.40 1.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 10 degrees of freedom
## Multiple R-squared: 0.9725, Adjusted R-squared: 0.967
## F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08
In the above model both x1 and x4 significantly affect y. Therefore we will check the partial
correlation of y with each remaining regressor given the set of variables (x1, x4):
> pcor.test(data$y,data$x2,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 0.5986053 0.05168735 2.241844 13 2 pearson
> pcor.test(data$y,data$x3,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 -0.5657105 0.06969226 -2.058117 13 2 pearson
Since the partial correlation between y and x2 is the highest and its p-value is the lowest
given the set of variables x1 and x4, we add x2 to our model.
> RegOut3<-lm(y~x4+x1+x2,data=data)
summary(RegOut3)
> ##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x4 -0.2365 0.1733 -1.365 0.205395
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
In the above model x4 is not significant, so we can remove x4 from our model. At this point
the only remaining candidate regressor is x3.
> pcor.test(data$y,data$x3,data[,c(3,4)])
> ## estimate p.value statistic n gp Method
## 1 0.4112643 0.2088895 1.353561 13 2 pearson
The partial correlation between y and x3 is weak (0.41) and the p-value is large (0.21), which
is not significant. So the forward stepwise selection procedure terminates.
How the automated stepwise procedure works:
First, it fits a model with each single regressor (x1, x2, x3, x4) and finds which model has the
lowest p-value; here (y~x4) has the lowest p-value, so x4 is selected. Second, it fits the
two-regressor models (x4+x1, x4+x2, x4+x3); (y~x4+x1) has the lowest p-value, so x1 is added.
Third, it fits the three-regressor models (x4+x1+x2, x4+x1+x3); (y~x4+x1+x2) has the lowest
p-value, so x2 is added, but x4 is then no longer significant, so x4 is removed from the model.
Therefore the final model is (y~x1+x2). The relevant code is:
> summary(lm(y~x1,data=data))
summary(lm(y~x2,data=data))
summary(lm(y~x3,data=data))
summary(lm(y~x4,data=data))
summary(lm(y~x4+x1,data=data))
summary(lm(y~x4+x2,data=data))
summary(lm(y~x4+x3,data=data))
summary(lm(y~x4+x1+x2,data=data))
summary(lm(y~x4+x1+x3,data=data))
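A single command that reproduces these steps is available in the olsrr package (a sketch; the entry and removal cutoffs are left at their defaults):
> library(olsrr)
ols_step_both_p(lm(y~x1+x2+x3+x4,data=data),details=TRUE) # p-value based stepwise selection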
iii.
> cor(data)
> ## X y x1 x2 x3 x4
## X 1.0000000 0.4850346 0.1382303 0.5514217 0.2238313 -0.6328061
## y 0.4850346 1.0000000 0.7307175 0.8162526 -0.5346707 -0.8213050
## x1 0.1382303 0.7307175 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.5514217 0.8162526 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 0.2238313 -0.5346707 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.6328061 -0.8213050 -0.2454451 -0.9729550 0.0295370 1.0000000
> RegOut1<-lm(y~x4,data=data)
summary(RegOut1)
> ##
## Call:
## lm(formula = y ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762
Since the correlation between y and x4 is the highest in absolute value and its p-value is the
lowest, we select x4 first.
Now checking the partial correlation between y and each remaining regressor given the third variable (x4):
> pcor.test(data$y,data$x1,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.9567731 1.105281e-06 10.40307 13 1 pearson
> pcor.test(data$y,data$x2,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.1302149 0.6866842 0.4153118 13 1 pearson
> pcor.test(data$y,data$x3,data$x4)
> ## estimate p.value statistic n gp Method
## 1 -0.8950818 8.375467e-05 -6.347801 13 1 pearson
Since the partial correlation between y and x1 is the highest and its p-value is the lowest
given the variable x4, we add x1 to our model.
> RegOut2<-lm(y~x4+x1,data=data)
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x4 + x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0234 -1.4737 0.1371 1.7305 3.7701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
## x4 -0.61395 0.04864 -12.62 1.81e-07 ***
## x1 1.43996 0.13842 10.40 1.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 10 degrees of freedom
## Multiple R-squared: 0.9725, Adjusted R-squared: 0.967
## F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08
Checking partial correlation for two variables given a set of variables(x1, x4):
> pcor.test(data$y,data$x2,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 0.5986053 0.05168735 2.241844 13 2 pearson
> pcor.test(data$y,data$x3,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 -0.5657105 0.06969226 -2.058117 13 2 pearson
Since the partial correlation between y and x2 is the highest and its p-value is the lowest
given the variables x4 and x1, we add x2 to our model.
> RegOut3<-lm(y~x4+x1+x2,data=data)
Checking partial correlation for two variables given a set of variables(x1, x4,x2):
> pcor.test(data$y,data$x3,data[,c(6,3,4)])
> ## estimate p.value statistic n gp Method
## 1 0.04768649 0.8959227 0.1350314 13 3 pearson
The partial correlation between y and x3 is very weak (0.05) and its p-value is large (0.89),
which is not significant, so the forward selection procedure stops here.
> model<-lm(y~x1+x2+x4,data=data)
summary(model) # final forward-selection model (coefficient table shown below)
> ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
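A single-command alternative for forward selection (a sketch using olsrr, defaults left unchanged):
> ols_step_forward_p(lm(y~x1+x2+x3+x4,data=data),details=TRUE) # p-value based forward selection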
iv.
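Backward elimination starts from the full model; its summary (referred to in the next sentence) is not reproduced here. A sketch, using RegOut1 as a hypothetical name:
> RegOut1<-lm(y~x1+x2+x3+x4,data=data) # full model with all four regressors
summary(RegOut1)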
We can see from the full-model summary table that x3 has the highest p-value (0.89) among
all the regressors. So we can remove x3 from the model.
> RegOut2<-lm(y~x1+x2+x4,data=data) #Fitting the model with x1,x2,x4.
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
In the second model we can see that x4 has the highest p-value (0.21) among the remaining
regressors; therefore x4 is removed from the model.
> RegOut3<-lm(y~x1+x2,data=data) #Fitting the model with x1 and x2.
summary(RegOut3)
> ##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
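A single-command alternative for backward elimination (a sketch using olsrr, defaults left unchanged):
> ols_step_backward_p(lm(y~x1+x2+x3+x4,data=data),details=TRUE) # p-value based backward elimination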
Problem 13: Model Selection
Answer:
Problem 14: Model Selection
Consider the solar thermal energy test data in table.b2 of the "MPV" package. Here the
variables denote:
d. Apply all possible regressions to the data. Evaluate R square, Cp , and MSRes for
each model. Which subset model do you recommend?
e. Compare and contrast the models produced by the variable selection strategies in
parts a – d.
Problem 15: Model Selection
b. Use stepwise regression to specify a subset regression model. Does this lead to the
same model found in part a?
Answer:
Problem 16: Model Adequacy Checking
Consider the simple regression model fit to the National Football League team performance
data in Problem 2.
a. Construct a normal probability plot of the residuals. Does there seem to be any
problem with the normality assumption?
b. Construct and interpret a plot of the residuals versus the predicted response.
c. Plot the residuals versus the team passing yardage, x2 . Does this plot indicate that
the model will be improved by adding x2 to the model?
Answer:
> data("table.b1")
data<-table.b1
model<-lm(y~x8,data = data)
summary(model)
> ##
## Call:
## lm(formula = y ~ x8, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.804 -1.591 -0.647 2.032 4.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.788251 2.696233 8.081 1.46e-08 ***
## x8 -0.007025 0.001260 -5.577 7.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.393 on 26 degrees of freedom
## Multiple R-squared: 0.5447, Adjusted R-squared: 0.5272
## F-statistic: 31.1 on 1 and 26 DF, p-value: 7.381e-06
a.
> library(olsrr)
ols_plot_resid_qq(model)
> ols_plot_resid_fit(model)
> ols_plot_resid_regressor(model,"x2")
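For part (c), the same plot can also be drawn with base R (x2 is a regressor that is not in the fitted model):
> plot(data$x2,resid(model),xlab='x2 (team passing yardage)',ylab='Residuals')
abline(h=0,lty=2)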
Problem 17: Model Adequacy Checking
a. Construct a normal probability plot of the residuals. Does there seem to be any
problem with the normality assumption?
b. Construct and interpret a plot of the residuals versus the predicted response.
c. Construct plots of the residuals versus each of the regressor variables. Do these
plots imply that the regressor is correctly specified?
d. Construct the partial regression plots for this model. Compare the plots with the
plots of residuals versus regressors from part c above. Discuss the type of
information provided by these plots.
e. Compute the studentized residuals and the R - student residuals for this model.
What information is conveyed by these scaled residuals?
Answer:
Try yourself!!
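As a starting point for part (e), the scaled residuals can be obtained with base R functions (assuming model is the fitted regression for this problem):
> rstandard(model) # internally studentized residuals
rstudent(model) # R-student (externally studentized) residuals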
Problem 18: In course Question 2022
f. Calculate the residuals and show that the sum of the residuals equals zero.
g. Find a 95% confidence interval on the mean quality when X1 = 1, X2 = 6.8, X3 = 5, X4
= 6, X 5 = 5.2. Also the 95% prediction interval for that observation.
Problem 19: In course Question 2019
(a) Fit a multiple regression model relating gasoline mileage Y (miles per gallon) to
engine displacement X1 and the number of carburetor barrels X6. Interpret the
regression coefficients.
(b) Make comment about the goodness of fit of the above model.
(d) Find a 95% confidence interval for 𝜷1. What does this confidence interval mean?
What conclusion can you draw for testing Ho:𝜷1 = 0 based on the confidence
interval?
(e) Find a 95% confidence interval on the mean gasoline mileage when X1 = 275 in3
and X6 = 2 barrels. Also the 95% prediction interval for a new observation when X1
= 275 in3 and X6 = 2 barrels.
(f) Fit a linear regression model relating gasoline mileage Y to engine displacement X1
and the type of transmission X11. Does the type of transmission significantly affect
the mileage performance?
(g) Modify the model developed in part (f) to include an interaction between engine
displacement and the type of transmission. What conclusion can you draw about the
effect of the type of transmission on gasoline mileage?
Answer:
> library(MPV) #Reading Library
> data<-table.b3
head(data,3)
> ## y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
## 1 18.9 350 165 260 8.00 2.56 4 3 200.3 69.9 3910 1
## 2 17.0 350 170 275 8.50 2.56 4 3 199.6 72.9 3860 1
## 3 20.0 250 105 185 8.25 2.73 1 3 196.7 72.2 3510 1
> model<-lm(y~x1+x6,data=data) #(a)
summary(model) #(b,c)
> ##
## Call:
## lm(formula = y ~ x1 + x6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0623 -1.6687 -0.3628 1.6221 6.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.884551 1.535408 21.417 < 2e-16 ***
## x1 -0.053148 0.006137 -8.660 1.55e-09 ***
## x6 0.959223 0.670277 1.431 0.163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.013 on 29 degrees of freedom
## Multiple R-squared: 0.7873, Adjusted R-squared: 0.7726
## F-statistic: 53.67 on 2 and 29 DF, p-value: 1.79e-10
> confint(model,level=.95) #(d)
> ## 2.5 % 97.5 %
## (Intercept) 29.74428901 36.02481266
## x1 -0.06569892 -0.04059641
## x6 -0.41164739 2.33009349
> new.data<-matrix(c(275,2),1,2) #(e)
colnames(new.data)<-c('x1','x6')
new.data<-data.frame(new.data)
predict(model,newdata=new.data,interval='confidence')
> ## fit lwr upr
## 1 20.18739 18.87221 21.50257
The 95% confidence interval associated with engine displacement x1 = 275 and carburetor
barrels x6 = 2 is (18.87221, 21.50257). This means that, according to our model, the mean
gasoline mileage Y of cars with engine displacement x1 = 275 and carburetor barrels x6 = 2
is estimated to lie between 18.87221 and 21.50257.
> predict(model,newdata=new.data,interval='prediction')
> ## fit lwr upr
## 1 20.18739 13.8867 26.48808
The 95% prediction interval associated with engine displacement x1 = 275 and carburetor
barrels x6 = 2 is (13.8867, 26.48808). This means that, according to our model, about 95% of
cars with engine displacement x1 = 275 and carburetor barrels x6 = 2 will have a gasoline
mileage Y between 13.8867 and 26.48808.
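The code for parts (f) and (g) is not reproduced here. A sketch, using the same data frame (x11 is the type of transmission):
> model_f<-lm(y~x1+x11,data=data) # (f) displacement and transmission type
summary(model_f)
model_g<-lm(y~x1*x11,data=data) # (g) adds the x1:x11 interaction
summary(model_g)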
(g) The p-value of the F statistic is 1.064e-11, so the regression is highly significant. This
means that at least one of the predictor variables is significantly related to the outcome
variable. Both the differential intercept coefficient (beta2) and the differential slope
coefficient (beta3) are individually statistically significant, suggesting that the type of
transmission has an effect on gasoline mileage.
Problem 20: Final Question 2019
Use the ‘mosaicData’ package to run the data ‘SaratogaHouses’ then answer the following
questions:
Consider the house price data in Saratoga Country, New York, USA in 2006. The data set has
1728 observations and 16 variables.
a. Fit a multiple regression model relating selling price to size of lot (square feet), age
of house (years), value of land (1000s of US dollars), living area (square feet),
number of bedrooms, number of fireplaces, number of bathrooms and number of
rooms.
b. Construct the analysis of variance table and test for significance of regression.
c. Find a 95% confidence interval for 𝜷3. What conclusion can you draw about Ho: 𝜷3
= 0 from the above confidence interval?
d. Use forward selection and backward elimination procedure to find the best subset
of regressors from the regressors used in part (a).
Answer:
Try yourself!!
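For reference, a minimal sketch of how one might start, assuming the standard variable names in mosaicData::SaratogaHouses (price, lotSize, age, landValue, livingArea, bedrooms, fireplaces, bathrooms, rooms):
> library(mosaicData)
data("SaratogaHouses")
model<-lm(price~lotSize+age+landValue+livingArea+bedrooms+fireplaces+bathrooms+rooms,
          data=SaratogaHouses) # (a)
summary(model)
anova(model) # (b)
confint(model,parm='landValue',level=0.95) # (c) 95% CI on 𝜷3 (value of land)
step(lm(price~1,data=SaratogaHouses),scope=formula(model),direction='forward') # (d) forward selection
step(model,direction='backward') # (d) backward elimination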
Problem 21: Final Question 2018
(c) At the 5% significance level, does it appear that any of the predictor variables can be
removed from the full model as unnecessary?
(d) Obtain and interpret 95% confidence intervals for the slopes of the population
regression line that relates net revenues and number of branches to profit margin.
(e) Are there any multicollinearity problems (i.e., are net revenues and number of
branches collinear [estimating similar relationships/quantities])?
(f) Obtain a point estimate for the mean profit margin with 3.5 net revenues and 6500
branches.
(g) Test the alternative hypothesis that the mean profit margin with 3.5 net revenues
and 6500 branches is greater than 0.70. Test at the 5% significance level.
(h) Determine a 95% confidence interval for the mean profit margin with 3.5 net
revenues and 6500 branches.
(i) Find the predicted profit margin for my bank with 3.5 net revenues and 6500
branches.
(j) Determine a 95% prediction interval for the profit margin for the bank with 3.5 net
revenues and 6500 branches.
Answer:
Try yourself!!
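As a starting point, a sketch for parts (f)–(j); the data frame bank and the column names margin, revenues, and branches are hypothetical placeholders, not from the original question:
> model<-lm(margin~revenues+branches,data=bank) # hypothetical data frame and column names
new<-data.frame(revenues=3.5,branches=6500)
predict(model,newdata=new) # (f) point estimate of the mean profit margin
predict(model,newdata=new,interval='confidence') # (h) 95% CI for the mean
predict(model,newdata=new,interval='prediction') # (j) 95% prediction interval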
References:
• Theory and Practical Lecture of Dr. Zillur Rahman Shabuz Sir (Problem 01, 03, 07,
08, 09, 10, 12).
• Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012), Introduction to Linear Regression
Analysis, 5th ed., Wiley, New York. (Problems 02, 04-06, 11, 12-17).
For the above-mentioned problems kindly see the relevant chapters in the lectures and
books.
Good Luck