H-311 Linear Regression Analysis With R


LINEAR REGRESSION ANALYSIS WITH R

Md. Mostakim

Session: 2018-19

Department Of Statistics

University Of Dhaka

Published Date:

1st Published: June, 2022

Acknowledgements:

I would like to acknowledge our course teacher Dr. Zillur Rahman Shabuz Sir for teaching
us regression analysis so beautifully. This is just a tiny reflection of Sir's guidance.

N.B. You may share this pdf book as much as you like but don’t use it for any unethical purpose.

For any kind of feedback, please contact [email protected]. Your feedback will be very inspiring for me.
Table of Contents
Problem 01: Simple Linear Regression Model ................................................................................. 4
Problem 02: Simple Linear Regression Model ................................................................................. 9
Problem 03: Multiple Regression Model ......................................................................................... 10
Problem 04: Multiple Regression Model ......................................................................................... 16
Problem 05: Multiple Regression Model ......................................................................................... 19
Problem 06: Polynomial Model ........................................................................................................... 21
Problem 07: Dummy Variable with Interaction ........................................................................... 23
Problem 08: Dummy Variable ............................................................................................................. 25
Problem 09: Dummy Variable ............................................................................................................. 28
Problem 10: Dummy Variable with Interaction Effect .............................................................. 30
Problem 11: Multicollinearity.............................................................................................................. 34
Problem 12: Model Selection ............................................................................................................... 36
Problem 13: Model Selection ............................................................................................................... 55
Problem 14: Model Selection ............................................................................................................... 56
Problem 15: Model Selection ............................................................................................................... 56
Problem 16: Model Adequacy Checking .......................................................................................... 57
Problem 17: Model Adequacy Checking .......................................................................................... 64
Problem 18: In course Question 2022 .............................................................................................. 64
Problem 19: In course Question 2019 .............................................................................................. 65
Problem 20: Final Question 2019 ...................................................................................................... 69
Problem 21: Final Question 2018 ...................................................................................................... 70
References: .................................................................................................................................................. 71
Problem 01: Simple Linear Regression Model
Use the following data to answer the questions:

a. Write down the fitted model and interpret the regression coefficients.

b. Construct the analysis of variance table and test for significance of regression. Also,
calculate and interpret R-square.
c. Find the fitted values and check whether the sum of the observed values equals the
sum of the fitted values.

d. Find the residuals and check whether the sum of residuals equals zero.

e. Check whether the sum of the residuals weighted by the corresponding value of the
regressor always equals zero. Also, check whether the sum of the residuals weighted
by the corresponding fitted value always equals zero.
f. Construct an equal tailed 90% confidence interval for the slope parameter 𝞫 and draw a
conclusion about the hypothesis Ho: 𝞫 = 0.

g. Find a confidence interval for the mean response at x=12.

h. Find a prediction interval for a new observation at x=12.


Answer:

a.

> data<-matrix(c(1,3,4,4,6,8,10,10,11,13,80,97,92,102,103,111,119,123,117,136),10,2) #Reading the data as matrix
colnames(data)<-c('x','y') #Setting the column names
data<-data.frame(data) #Reading the data as data frame
RegOut<-lm(y~x,data=data) #Creating model
RegOut #View model
> ##
## Call:
## lm(formula = y ~ x, data = data)
##
## Coefficients:
## (Intercept) x
## 80 4

So, our estimated model is: 𝑦̂=80+4x

Interpretation:

The intercept is 80. It can be interpreted as the predicted value of y when x equals zero. This
means that, for x equal to zero, we can expect a value of y=80, on average. The regression beta
coefficient for the variable x, also known as the slope, is 4. This means that, for a one-unit
change in x, we can expect an increase of 4 units in y, on average.
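
As a quick cross-check, the two coefficients can be recomputed by hand from the least-squares formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄; a minimal sketch using the same data:

> x<-c(1,3,4,4,6,8,10,10,11,13)
y<-c(80,97,92,102,103,111,119,123,117,136)
Sxy<-sum((x-mean(x))*(y-mean(y))) #Corrected sum of cross products
Sxx<-sum((x-mean(x))^2)           #Corrected sum of squares of x
b1<-Sxy/Sxx                       #Slope estimate (should equal 4)
b0<-mean(y)-b1*mean(x)            #Intercept estimate (should equal 80)
c(b0,b1)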
b.

> summary(RegOut)
> ##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.00 -3.25 -1.00 3.75 6.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.0000 3.0753 26.01 5.12e-09 ***
## x 4.0000 0.3868 10.34 6.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.61 on 8 degrees of freedom
## Multiple R-squared: 0.9304, Adjusted R-squared: 0.9217
## F-statistic: 106.9 on 1 and 8 DF, p-value: 6.609e-06
> anova(RegOut)
> ## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 2272 2272.00 106.92 6.609e-06 ***
## Residuals 8 170 21.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since, the p-value=6.609e-06, we can say that our model is significant.

The value of R-square=0.9304, which means that about 93% of the variation in y can be
explained by the model.
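
As a cross-check (a small sketch based on the ANOVA table above), R-square can be recomputed by hand as SSR/SST:

> SSR<-2272; SSRes<-170 #Sums of squares from the ANOVA table
SST<-SSR+SSRes          #Total sum of squares
SSR/SST                 #R-square = 2272/2442 = 0.9304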
c.

Following two lines will give fitted values

> fitted<-predict(RegOut)
fitted
> ## 1 2 3 4 5 6 7 8 9 10
## 84 92 96 96 104 112 120 120 124 132
> RegOut$fitted.values
> ## 1 2 3 4 5 6 7 8 9 10
## 84 92 96 96 104 112 120 120 124 132
> sum(data$y)
> ## [1] 1080
> sum(fitted)
> ## [1] 1080

Therefore, the sum of the observed values equals the sum of the fitted values.

d.

Following two lines will give residuals

> resi<-resid(RegOut)

RegOut$residuals
> ## 1 2 3 4 5 6 7 8 9 10
## -4 5 -4 6 -1 -1 -1 3 -7 4
> sum(resi)
> ## [1] 1.221245e-15

Therefore, the sum of residuals equals zero.

e.

> sum(data$x*resi)
> ## [1] 7.105427e-15
> sum(resi*fitted)
> ## [1] -2.842171e-14
Both weighted sums are zero apart from floating-point rounding. Therefore, the sum of the
residuals weighted by the corresponding regressor values equals zero, and the sum of the
residuals weighted by the corresponding fitted values also equals zero.

f.

> confint(RegOut, level=0.90) # 90% confidence interval


> ## 5% 95 %
## (Intercept) 74.281248 85.718752
## x 3.280646 4.719354

Since the 90% CI for the slope 𝞫 is [3.28, 4.72], which does not include the value zero, we
reject Ho: 𝞫 = 0 and conclude that 𝞫 is highly significant.
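
The same interval can be built by hand from the summary output (estimate 4, standard error 0.3868, 8 residual degrees of freedom); a minimal sketch:

> 4+c(-1,1)*qt(0.95,df=8)*0.3868 #Estimate +/- t(0.05,8)*SE; matches confint() above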

g.

Confidence interval for mean response:

> new<-matrix(c(12))
colnames(new)<-'x'
new<-data.frame(new)
predict(RegOut, newdata =new, interval = 'confidence')
> ## fit lwr upr
## 1 128 122.4148 133.5852

h.

Prediction interval for a new observation at x=12:

> # Prediction interval:


predict(RegOut, newdata =new, interval = 'prediction')
> ## fit lwr upr
## 1 128 115.9919 140.0081
Problem 02: Simple Linear Regression Model
The data in table.b1 in “MPV” package gives data concerning the performance of the 28
National Football League teams in 1976. It is suspected that the number of yards gained
rushing by opponents ( x8 ) has an effect on the number of games won by a team ( y ).

a. Fit a simple linear regression model relating games won y to yards gained rushing
by opponents x8 .
b. Construct the analysis of variance table and test for significance of regression.
c. Find a 95% CI on the slope.

d. What percent of the total variability in y is explained by this model?

e. Find a 95% CI on the mean number of games won if opponents’ yards rushing is
limited to 2000 yards.
f. Suppose we would like to use the model to predict the number of games a team will
win if it can limit opponents’ yards rushing to 1800 yards. Find a point estimate of
the number of games won when x8 = 1800. Find a 90% prediction interval on the
number of games won.

Answer:
The point estimate of the number of games won when x8 = 1800 is 9.14, and a 90% prediction
interval on the number of games won is (4.935, 13.351).
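
These values, and the answers to the other parts, can be reproduced with a few lines of R; a minimal sketch, assuming the MPV package is installed:

> library(MPV)
nfl<-table.b1
fit<-lm(y~x8,data=nfl)                                                    #a. simple linear regression
anova(fit)                                                                #b. ANOVA table and F-test
confint(fit,'x8',level=0.95)                                              #c. 95% CI on the slope
summary(fit)$r.squared                                                    #d. proportion of variability explained
predict(fit,newdata=data.frame(x8=2000),interval='confidence')            #e. 95% CI on the mean response
predict(fit,newdata=data.frame(x8=1800),interval='prediction',level=0.90) #f. point estimate and 90% PI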

Problem 03: Multiple Regression Model


Suppose you have a data set named reg_data.csv (data download link: reg_data.csv data)
that is stored on the D drive.

We wish to predict BMI on the basis of AGE (in years), HEIGHT (in inches), WEIGHT (in
lbs), and WAIST (in cm). (Population regression model: BMI =𝞫o + 𝞫1*AGE + 𝞫2*HEIGHT
+ 𝞫3*WEIGHT + 𝞫4*WAIST + 𝛜)

i. Write down the fitted model and interpret the regression coefficients.

ii. Construct the analysis of variance table and test for significance of regression. Also,
interpret R-square.

iii. Also construct an equal tailed 90% confidence interval for the parameter 𝞫1 and
make conclusion about the hypotheses Ho:𝞫1=0 vs H1:𝞫1≠0

iv. Perform a hypothesis testing for Ho:𝞫1=𝞫2=0

v. Construct a 95% Confidence Interval for the mean response when AGE=20,
HEIGHT=66, WEIGHT=156, and WAIST=89. Also, interpret the result.

vi. Also, construct a 95% prediction interval for the information given in (v) and
interpret the result.

vii. Use forward selection procedure to find the best subset of regressors from the
regressors used in (i). [Write R codes for each step and write down the final
model.]
Answer:

i.

> data<-read.csv("D:/Mostakim/Data set/reg_data.csv")


model<-lm(BMI~AGE+HEIGHT+WEIGHT+WAIST,data = data)
summary(model)
> ##
## Call:
## lm(formula = BMI ~ AGE + HEIGHT + WEIGHT + WAIST, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92556 -0.12445 0.00732 0.10002 0.92135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.071958 1.808471 28.793 <2e-16 ***
## AGE -0.001835 0.004503 -0.408 0.686
## HEIGHT -0.771534 0.024730 -31.198 <2e-16 ***
## WEIGHT 0.143698 0.006008 23.917 <2e-16 ***
## WAIST 0.021014 0.015148 1.387 0.174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3043 on 35 degrees of freedom
## Multiple R-squared: 0.9929, Adjusted R-squared: 0.9921
## F-statistic: 1230 on 4 and 35 DF, p-value: < 2.2e-16
The fitted model is 𝐵𝑀𝐼̂ = 52.07 - 0.0018AGE - 0.77HEIGHT + 0.14WEIGHT + 0.02WAIST

Interpretation:

The intercept has no practical interpretation since age, height, weight, and waist can’t be
zero.
We found that, the age coefficient is not significant in the multiple regression model. This
means that, for a fixed amount of height, weight, and waist, changes in the age will not
significantly affect BMI.

The height coefficient suggests that for every 1 unit increase in height, holding all other
predictors constant, we can expect a decrease of BMI by 0.77 unit, on average.

The weight coefficient suggests that for every 1 unit increase in weight, holding all other
predictors constant, we can expect an increase of BMI by 0.14 unit, on average.

We found that, the waist coefficient is not significant in the multiple regression model. This
means that, for a fixed amount of height, weight, and age, changes in the waist will not
significantly affect BMI.

ii.

> anova(model)
> ## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 8.64 8.64 93.2988 2.084e-11 ***
## HEIGHT 1 8.30 8.30 89.5822 3.505e-11 ***
## WEIGHT 1 438.67 438.67 4736.7937 < 2.2e-16 ***
## WAIST 1 0.18 0.18 1.9244 0.1741
## Residuals 35 3.24 0.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the overall F-statistic is 1230 on 4 and 35 degrees of freedom with p-value < 2.2e-16
(see the summary output above), the regression is significant. The R-square is 0.99, which
means that about 99% of the variation in BMI can be explained by our model.
iii.

> confint(model,level = .9)


> ## 5% 95 %
## (Intercept) 49.016414944 55.12750050
## AGE -0.009444081 0.00577342
## HEIGHT -0.813317628 -0.72975096
## WEIGHT 0.133547048 0.15384981
## WAIST -0.004579828 0.04660826

Since the 90% CI for AGE (𝞫1) is [-0.009444081, 0.00577342], which includes the value 0, we
conclude that we don’t have enough evidence to reject the null hypothesis. Therefore 𝞫1 is
insignificant at the 10% level of significance.

iv.

To perform the hypothesis test of Ho:𝞫1=𝞫2=0, we use the partial F-test. The test statistic is

F0 = [SSR(β1, β2 | β3, β4) / 2] / MSRes,

which has 2 and 35 degrees of freedom.
> model1<-lm(BMI~WEIGHT+WAIST,data=data) #Reduced model without AGE and HEIGHT (the 𝞫1, 𝞫2 terms)
anova(model1)
> ## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## WEIGHT 1 293.528 293.528 90.524 1.75e-11 ***
## WAIST 1 45.527 45.527 14.041 0.0006092 ***
## Residuals 37 119.974 3.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ss1<-8.64+8.30+438.67+0.18
ss1
> ## [1] 455.79
> ss2<-293.528+45.527
ss2
> ## [1] 339.055
> ssr<-ss1-ss2
F<-(ssr/2)/0.09 #F-statistic
F
> ## [1] 648.5278
> pf(F,2,35,lower.tail = FALSE)#P-value
> ## [1] 2.198112e-28

Since the p-value is essentially zero, we reject the null hypothesis and conclude that 𝞫1 and 𝞫2
are not both equal to zero.
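
The same partial F-test can be obtained directly by comparing the reduced and full models with anova(); any small difference from the hand calculation above is only due to rounding:

> anova(model1,model) #Reduced model (WEIGHT, WAIST) vs full model; tests Ho:𝞫1=𝞫2=0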
v.
> new.data<-matrix(c(20,66,156,89),1,4)
colnames(new.data)<-c('AGE','HEIGHT','WEIGHT','WAIST')
new.data1<-data.frame(new.data)
predict(model,newdata = new.data1,interval = "confidence")
> ## fit lwr upr
## 1 25.40121 25.20286 25.59955
Interpretation:

Therefore, the 95% CI for the mean response is (25.2, 25.6) when AGE=20, HEIGHT=66,
WEIGHT=156, and WAIST=89, which means that we are 95% confident that the mean BMI of
individuals with these values of age, height, weight, and waist lies between 25.2 and 25.6.

vi.

> predict(model, newdata = new.data1, interval = "prediction")


> ## fit lwr upr
## 1 25.40121 24.75235 26.05007

Interpretation:

The 95% prediction interval is (24.75, 26.05), which means that a new individual with AGE=20,
HEIGHT=66, WEIGHT=156, and WAIST=89 is predicted, with 95% confidence, to have a BMI
between 24.75 and 26.05.

vii.

First learn problem 12: model selection then try yourself!


Problem 04: Multiple Regression Model
The data in table.b5 in “MPV” package represents the performance of a chemical process as
a function of several controllable process variables. Here the variables denote:

(a) Fit a multiple regression model relating CO2 product (y) to total solvent (x6) and
hydrogen consumption (x7).

(b) Test for significance of regression. Calculate R-square and adjusted R-square.

(c) Construct 95% CIs on 𝞫6 and 𝞫7 .

(d) Refit the model using only X6 as the regressor. Test for significance of regression
and calculate R-square.
(e) Construct a 95% CI on 𝞫6 using the model you fit in part (d). Compare the length of
this CI to the length of the CI in part (c).
(f) Is it possible to construct an un-equal tailed 95% confidence interval (say 1% in the
lower tail and 4% in the upper tail or any other combination) that has length shorter
than the equal tailed confidence interval [Yes/No]?
Answer:

a.

> #Reading the dataset


library("MPV") #You first might need to install "MPV" package
> data<-table.b5
#Fitting a regression model.
model<-lm(y~x6+x7,data = data)
summary(model)
> ##
## Call:
## lm(formula = y ~ x6 + x7, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2035 -4.3713 0.2513 4.9339 21.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.526460 3.610055 0.700 0.4908
## x6 0.018522 0.002747 6.742 5.66e-07 ***
## x7 2.185753 0.972696 2.247 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6746
## F-statistic: 27.95 on 2 and 24 DF, p-value: 5.391e-07

The model is 𝑦̂=2.526+.018x6+2.185x7

b.

Since p-value=5.391e-07 we can say that the model is significant.

R-square=0.6996

Adjusted R-square=0.6746
Interpretation: About 70% of the variation in y can be explained by our model (that is, by the
x6 and x7 variables), based on R-square; the adjusted R-square of about 67.5% accounts for
the number of regressors in the model.

c.

> confint(model)
> ## 2.5 % 97.5 %
## (Intercept) -4.92432697 9.97724714
## x6 0.01285196 0.02419204
## x7 0.17820756 4.19329833

Interpretation: Since neither the CI for x6 nor the CI for x7 contains the value 0, we can say
that x6 and x7 are statistically significant at the 5% level of significance.

d.

> model1<-lm(y~x6,data = data)


summary(model1)
> ##
## Call:
## lm(formula = y ~ x6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.081 -5.829 -0.839 5.522 26.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.144181 3.483064 1.764 0.0899 .
## x6 0.019395 0.002932 6.616 6.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 25 degrees of freedom
## Multiple R-squared: 0.6365, Adjusted R-squared: 0.6219
## F-statistic: 43.77 on 1 and 25 DF, p-value: 6.238e-07

Interpretation and calculation: Try yourself!!


e.

> confint(model1) #Constructing a 95% on 𝞫6 using the model in part (d)


> ## 2.5 % 97.5 %
## (Intercept) -1.02932458 13.31768586
## x6 0.01335688 0.02543261
> 0.02543261-0.01335688 #length of the CI on B6 for part (d)
> ## [1] 0.01207573
> 0.02419204-0.01285196 #length of the CI on B6 for part (c)
> ## [1] 0.01134008

Therefore, the CI on 𝞫6 from part (d) is slightly longer than the CI from part (c).

f.

No. (Why! Think yourself!! )

Problem 05: Multiple Regression Model


The data located in table.b1 in “MPV” package represents the National Football League
data.

a. Fit a multiple linear regression model relating the number of games won to the
team’s passing yardage(x2), the percentage of rushing plays(x7), and the opponents’
yards rushing(x8).

b. Construct the analysis-of-variance table and test for significance of regression.
c. Calculate t statistics for testing the hypotheses H0 : β2 = 0, H0 : β7 = 0, and H0 : β8 =
0. What conclusions can you draw about the roles the variables x2, x7, and x8 play in
the model?

d. Calculate R square and Adjusted R-square for this model.

e. Using the partial F test, determine the contribution of x7 to the model. How is this
partial F statistic related to the t test for β7 calculated in part c above?
f. Show numerically that the square of the simple correlation coefficient between the
observed values yi and the fitted values 𝑦̂ equals R square.

g. Find a 95% CI on β7 .
h. Find a 95% CI on the mean number of games won by a team when x2 = 2300, x7 =
56.0, and x8 = 2100.
Answer:

a. 𝑦̂= -1.8 + .0036x2 + .194x7 - .0048x8

b. Regression is significant.

c. All three are significant.

d. Rsquare = 78.6% and Adjusted R square= 76.0%

e. F0 = (257.094 - 243.03)/2.911 = 4.84, which is significant at α = 0.05. This partial F
statistic is the square of the t-statistic for β7 in part c.

f. The correlation coefficient between yi and 𝑦̂i is 0.887, so (0.887)^2 = 0.786, which
equals R-square.

g. A 95% confidence interval on the slope parameter β7 is 𝛽̂7 ± 2.064(0.08823) = (0.012, 0.376).

h. A 95% confidence interval on the mean number of games won by a team when x2 =
2300, x7 = 56.0, and x8 = 2100 can be computed as shown below.
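
A minimal R sketch for parts (g) and (h), assuming the model from part (a) is refit from table.b1:

> library(MPV)
nfl<-table.b1
fit<-lm(y~x2+x7+x8,data=nfl)                   #Model from part (a)
confint(fit,'x7',level=0.95)                   #g. 95% CI on beta7
new<-data.frame(x2=2300,x7=56.0,x8=2100)
predict(fit,newdata=new,interval='confidence') #h. 95% CI on the mean number of games won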
Problem 06: Polynomial Model
Consider a solid fuel rocket propellant loses weight after it is produced. The following data
are available: (Data download link: polynomial.txt)

a. Fit a second - order polynomial that expresses weight loss as a function of the
number of months since production.

b. Test for significance of regression.

c. Test the hypothesis H0 : β2 = 0. Comment on the need for the quadratic term in this
model.

d. Are there any potential hazards in extrapolating with this model?


Answer:

> data<-read.table("D:/Mostakim/Data set/polynomial_model.txt",header=TRUE) #Reading the data set


plot(data$x,data$y)
lines(data$x,data$y)

From the above plot we can see that the relationship between y and x is curved, so we have to fit a polynomial model for these data.
a.

Fitting a polynomial model:

> model<-lm(y~x+I(x^2),data=data)
summary(model)
> ##
## Call:
## lm(formula = y ~ x + I(x^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005364 -0.002727 0.001045 0.002409 0.003273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.633000 0.004196 389.2 < 2e-16 ***
## x -1.232182 0.007010 -175.8 5.09e-14 ***
## I(x^2) 1.494545 0.002484 601.6 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003568 on 7 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.859e+06 on 2 and 7 DF, p-value: < 2.2e-16

Therefore, the model is 𝑦̂=1.63-1.23x+1.49x^2

b.

Since F = 1.859e+06 with p-value < 2.2e-16, the regression is significant.

c.

Since the p-value for β2 is essentially zero, we reject the null hypothesis and conclude that β2
is significant. As the plot of y against x is clearly curved (roughly quadratic), the quadratic
term is needed in the model.
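
The need for the quadratic term can also be checked with an extra-sum-of-squares F-test that compares the first-order and second-order fits; a small sketch using the same data object:

> linear<-lm(y~x,data=data) #First-order model
anova(linear,model)         #Partial F-test for the quadratic term (equivalent to the t-test above)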

d. Yes; since it is a quadratic model, extrapolating outside the range of the observed x values can be hazardous, because the fitted curve eventually bends in a way the data cannot confirm.


Problem 07: Dummy Variable with Interaction
The R script reg_dummy.R stored on the desktop has the information about personal
savings (PS) and personal disposable income (PDI), both measured in billions of dollars, in
the United States for the period 1970 to 1995. It is well known that in 1982 the United
States suffered its worst peacetime recession. The unemployment rate that year reached
9.7 percent, the highest since 1948.

(a) Test whether this peacetime recession significantly influence the relationship
between PS and PDI. [Divide the sample data into two periods, 1970 to 1981 and
1982 to 1995, the pre- and post-1982 recession periods.]
(b) Test for the significance of interaction effect.

Answer:
a.

> data<-matrix(c(1970,61.0,727.1,0, 1971,68.6,790.2,0, 1972,63.6,855.3,0, 1973,89.6,965.0,0,
1974,97.6,1054.2,0, 1975,104.4,1159.2,0, 1976,96.4,1273.0,0, 1977,92.5,1401.4,0,
1978,112.6,1580.1,0, 1979,130.1,1769.5,0, 1980,161.8,1973.3,0, 1981,199.1,2200.2,0,
1982,205.5,2347.3,1, 1983,167.0,2522.4,1, 1984,235.7,2810.0,1, 1985,206.2,3002.0,1,
1986,196.5,3187.6,1, 1987,168.4,3363.1,1, 1988,189.1,3640.8,1, 1989,187.8,3894.5,1,
1990,208.7,4166.8,1, 1991,246.4,4343.7,1, 1992,272.6,4613.7,1, 1993,214.4,4790.2,1,
1994,189.4,5021.7,1, 1995,249.3,5320.8,1),26,4,byrow=T)
colnames(data)<-c('YEAR','PS','PDI','DV')
data<-data.frame(data)
> RegOut<-lm(PS~PDI+DV+PDI*DV,data=data)
summary(RegOut)
> ##
## Call:
## lm(formula = PS ~ PDI + DV + PDI * DV, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -38.729 -14.777 -1.398 11.689 50.535
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01612 20.16483 0.050 0.960266
## PDI 0.08033 0.01450 5.541 1.44e-05 ***
## DV 152.47855 33.08237 4.609 0.000136 ***
## PDI:DV -0.06547 0.01598 -4.096 0.000477 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 23.15 on 22 degrees of freedom
## Multiple R-squared: 0.8819, Adjusted R-squared: 0.8658
## F-statistic: 54.78 on 3 and 22 DF, p-value: 2.268e-10

Since both the recession dummy (DV) and the interaction term (PDI:DV) are significant, and the
overall model has p-value 2.268e-10, we can say that the peacetime recession significantly
influenced the relationship between PS and PDI.

b.

Both the differential intercept coefficient (beta2) and the differential slope coefficient
(beta3) are individually statistically significant, suggesting that the savings-income
relationship changed between the two time periods. Since beta3 is statistically significant,
we can say that an interaction effect is present.
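
As a cross-check, the two recession terms can also be tested jointly by comparing the pooled (single-line) regression with the full model; a minimal sketch:

> pooled<-lm(PS~PDI,data=data) #One regression line for both periods
anova(pooled,RegOut)           #Joint F-test of Ho: beta2 = beta3 = 0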

Problem 08: Dummy Variable


Suppose we have the following data-set and we would like to use gender and age to predict
income:
Answer:

> data<-matrix(c(45000,23,0, 48000,25,1, 54000,24,0, 57000,29,1, 65000,38,1, 69000,36,1,
78000,40,0, 83000,59,1, 98000,56,0, 104000,64,0, 107000,53,0),11,3,byrow=T)
colnames(data)<-c("Income","Age","Gender")
par(mfrow=c(1,2))
plot(data[,3], data[,1])
#with(DF, plot(x, y, col=z))
plot(data[,2], data[,1],
pch = 19,
col = factor(data[,3]))

From the above plots we can see that income level differs by gender. So, we can say that
there is an effect of gender on income level.

Now fitting model:


> data<-data.frame(data)
data
> ## Income Age Gender
## 1 45000 23 0
## 2 48000 25 1
## 3 54000 24 0
## 4 57000 29 1
## 5 65000 38 1
## 6 69000 36 1
## 7 78000 40 0
## 8 83000 59 1
## 9 98000 56 0
## 10 104000 64 0
## 11 107000 53 0
> model<-lm(Income~Age+Gender,data = data)
summary(model)
> ##
## Call:
## lm(formula = Income ~ Age + Gender, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9906.6 -2879.8 -35.1 2542.5 13242.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23810.9 7489.2 3.179 0.0130 *
## Age 1319.7 158.2 8.340 3.23e-05 ***
## Gender -8769.5 4563.9 -1.922 0.0909 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7376 on 8 degrees of freedom
## Multiple R-squared: 0.9124, Adjusted R-squared: 0.8906
## F-statistic: 41.69 on 2 and 8 DF, p-value: 5.877e-05

Therefore the model is: 𝑖𝑛𝑐𝑜𝑚𝑒̂ = 23810.9 + 1319.7Age - 8769.5Gender.
Problem 09: Dummy Variable
Read the “salary.sav” (data download link: salary.sav data) data from D drive to Answer the
following questions.

Consider the data on average salary (in dollars) of public school teachers in 50 states and
the District of Columbia for the year 1985. These 51 areas are classified into three
geographical regions: (1) Northeast and North Central (21 states in all), (2) South (17
states in all), and (3) West (13 states in all).

Fit a regression model using D1 and D2. Interpret the mean salary of teachers from different
regions.

Answer:
> library(foreign)
data<-read.spss("D:/Mostakim/Data set/salary.sav",to.data.frame = TRUE)
> ## re-encoding from CP1252
> head(data,3)
> ## salary spending D1 D2 D3
## 1 19583 3346 1 0 0
## 2 20263 3114 1 0 0
## 3 20325 3554 1 0 0
> model<-lm(salary~D1+D2,data = data)
summary(model)
> ##
## Call:
## lm(formula = salary ~ D1 + D2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6329.1 -2592.1 -370.6 2143.4 15321.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26159 1128 23.180 <2e-16 ***
## D1 -1734 1436 -1.208 0.2330
## D2 -3265 1499 -2.178 0.0344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4069 on 48 degrees of freedom
## Multiple R-squared: 0.09008, Adjusted R-squared: 0.05217
## F-statistic: 2.376 on 2 and 48 DF, p-value: 0.1038

The fitted model is 𝑦̂ = 26159 − 1734𝐷1 − 3265𝐷2.

Interpretation: As these regression results show, the mean salary of teachers in the West
(the base category) is about $26159; that of teachers in the Northeast and North Central is
lower by about $1734, and that of teachers in the South is lower by about $3265. The actual
mean salaries in the last two regions are therefore about $24424 and $22894.
Problem 10: Dummy Variable with Interaction Effect
Read the given data(Tool life Data) to Answer the following questions:

Here, Y=Hours, X=RPM, D=Tool type

a) Suppose that we expect the regression lines relating tool life to lathe speed to differ
in both intercept and slope (interaction exists). Now, fit a multiple regression model
(model with interaction effect). Write down the model.

b) Check if two regression models are identical (effect of tool type) also test for
significance of interaction effect and comment on it.
Answer:
a.

> #Reading data


y<-matrix(c(18.73,14.52,17.43,14.54,13.44,24.39,13.34,22.71,12.68,19.32,30.16,27.09,25.40,
26.05,33.49,35.62,26.07,36.78,34.95,43.67),20,1)
x1<-matrix(c(610,950,720,840,980,530,680,540,890,730,670,770,880,1000,760,590,910,650,
810,500),20,1)
x2<-matrix(c(rep(0,10),rep(1,10)),20,1)
data1<-cbind(y,x1,x2)
data<-data.frame(data1)
colnames(data)<-c('Y','X','D')
model<-lm(Y~X+D+X*D,data = data)
model
> ##
## Call:
## lm(formula = Y ~ X + D + X * D, data = data)
##
## Coefficients:
## (Intercept) X D X:D
## 32.77476 -0.02097 23.97059 -0.01194

Therefore the model is, 𝑌̂=32.775-0.021X+23.971D-0.012X.D

b.

Checking if two regression models are identical (effect of tool type):

We have to test the hypothesis Ho: β2 = β3 = 0 (i.e. the two regression lines are identical).

We use the statistic F0 = [SSR(β2, β3 | β1, βo) / 2] / MSRes, which has 2 and 16 degrees of freedom.

The required sums of squares are computed below.
> #Finding SSR(𝜷1,𝜷2,𝜷3|𝜷o) and SSR(𝜷1|𝜷o):
summary(model)
> ##
## Call:
## lm(formula = Y ~ X + D + X * D, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -5.1750 -1.4999 0.4849 1.7830 4.8652
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.774760 4.633472 7.073 2.63e-06 ***
## X -0.020970 0.006074 -3.452 0.00328 **
## D 23.970593 6.768973 3.541 0.00272 **
## X:D -0.011944 0.008842 -1.351 0.19553
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.968 on 16 degrees of freedom
## Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
## F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
> anova(model)
> ## Analysis of Variance Table
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 293.01 293.01 33.2545 2.889e-05 ***
## D 1 1125.03 1125.03 127.6847 4.891e-09 ***
## X:D 1 16.08 16.08 1.8248 0.1955
## Residuals 16 140.98 8.81
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model1<-lm(Y~X,data = data)
summary(model1)
> ##
## Call:
## lm(formula = Y ~ X, data = data)
## Residuals:
## Min 1Q Median 3Q Max
## -12.9730 -7.2995 -0.9276 7.2326 12.7766
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.61672 9.60323 4.542 0.000253 ***
## X -0.02545 0.01255 -2.028 0.057600 .
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 8.44 on 18 degrees of freedom
## Multiple R-squared: 0.186, Adjusted R-squared: 0.1408
## F-statistic: 4.114 on 1 and 18 DF, p-value: 0.0576
> anova(model1)
> ## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 293.01 293.005 4.1137 0.0576 .
## Residuals 18 1282.08 71.227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ssr1<-293.01+1125.03+16.08 #calculating SSR for model, SSR(𝜷1,𝜷2,𝜷3|𝜷o)
ssr2<-293.005 #calculating SSR for model1, SSR(𝜷1|𝜷o)
ssr<-ssr1-ssr2 #calculating SSR for SSR(𝜷2,𝜷3|𝜷1,𝜷o)
MSres<-8.81#calculating MSres for model
c(ssr1,ssr2,ssr,MSres)
> ## [1] 1434.120 293.005 1141.115 8.810

> #Calculating test statistic:


F<-(ssr/2)/MSres
F
> ## [1] 64.76249

> pf(64.76249,2,16,lower.tail = FALSE)#Calculating P-value for F-statistic to test Ho:β2=β3= 0


> ## [1] 2.135303e-08

Since the p-value for this statistic is 2.135303e-08, we conclude that the two regression lines
are not identical; that is, we cannot explain the relationship between hours and RPM with a
single regression model.
Testing for the presence of an interaction effect (the two regression lines have a common slope
but possibly different intercepts), i.e. Ho:𝞫3=0:

We use the statistic F0 = SSR(𝞫3 | 𝞫o, 𝞫1, 𝞫2) / MSRes, which has 1 and 16 degrees of freedom and
equals the square of the t-statistic for 𝞫3.

> pf(1.8248,1,16,lower.tail = FALSE) #Calculating the p-value of the F-statistic to test Ho:𝞫3=0


> ## [1] 0.1955

Since for this statistic P = 0.20, we conclude that the slopes of the two straight lines
are the same (i.e. there is no interaction effect). This can also be determined from the t
statistic for β3 in the first summary table, where the p-value of X:D is 0.1955 ≈ 0.20.

Problem 11: Multicollinearity


Consider the gasoline mileage data in table.b3 in “MPV” package. The variables indicate,
a. Does the correlation matrix give any indication of multicollinearity?
b. Calculate the variance inflation factors and the condition number of X′X . Is there
any evidence of multicollinearity?

Answer: a.

> data(table.b3)
cor(table.b3[,c(-1,-4)]) #There is some missing observation in x3
> ## x1 x2 x4 x5 x6 x7
## x1 1.0000000 0.9452080 -0.33015370 -0.6315968 0.65906008 -0.7814778
## x2 0.9452080 1.0000000 -0.29205832 -0.5170425 0.77190992 -0.6431558
## x4 -0.3301537 -0.2920583 1.00000000 0.3737462 -0.04933889 0.4938104
## x5 -0.6315968 -0.5170425 0.37374620 1.0000000 -0.20535194 0.8428620
## x6 0.6590601 0.7719099 -0.04933889 -0.2053519 1.00000000 -0.3005751
## x7 -0.7814778 -0.6431558 0.49381043 0.8428620 -0.30057509 1.0000000
## x8 0.8551981 0.7973892 -0.25810785 -0.5481227 0.42518809 -0.6630802
## x9 0.8013975 0.7176056 -0.31876434 -0.4343576 0.31567268 -0.6682373
## x10 0.9456621 0.8834004 -0.27721850 -0.5424247 0.52064243 -0.7178265
## x11 0.8354239 0.7266835 -0.36836123 -0.7032485 0.41733783 -0.8549981
## x8 x9 x10 x11
## x1 0.8551981 0.8013975 0.9456621 0.8354239
## x2 0.7973892 0.7176056 0.8834004 0.7266835
## x4 -0.2581079 -0.3187643 -0.2772185 -0.3683612
## x5 -0.5481227 -0.4343576 -0.5424247 -0.7032485
## x6 0.4251881 0.3156727 0.5206424 0.4173378
## x7 -0.6630802 -0.6682373 -0.7178265 -0.8549981
## x8 1.0000000 0.8849771 0.9475859 0.6863079
## x9 0.8849771 1.0000000 0.9015431 0.6507213
## x10 0.9475859 0.9015431 1.0000000 0.7722283
## x11 0.6863079 0.6507213 0.7722283 1.0000000

Several pairs of regressors are highly correlated with each other, which indicates a potential
problem with multicollinearity.
b.

> library(car)
> ## Loading required package: carData
> model<-lm(y~.-x4,data=table.b3)
vif(model)
> ## x1 x2 x3 x5 x6 x7 x8
## 119.355446 36.054178 139.027410 7.726459 4.821291 10.062542 19.719346
## x9 x10 x11
## 9.378732 81.037016 5.141237

Several variance inflation factors are much higher than 10, which is evidence of
multicollinearity.
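
Part (b) also asks for the condition number of X′X; one common sketch computes it from the eigenvalues of the correlation matrix of the regressors (again dropping y and x3 because of the missing observations):

> X<-table.b3[,c(-1,-4)]  #Regressors only, dropping y and x3
e<-eigen(cor(X))$values   #Eigenvalues of the unit-length-scaled X'X
max(e)/min(e)             #Condition number; values above 1000 indicate severe multicollinearity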

Problem 12: Model Selection


Hald (1952) presents data concerning the heat evolved in calories per gram of cement ( y )
as a function of the amount of each of four ingredients in the mix: tricalcium aluminate ( x1
), tricalcium silicate ( x2 ), tetracalcium alumino ferrite ( x3 ), and dicalcium silicate ( x4 ).
The data can be found in MPV package named cement.

(i) Find all possible regression and its plots hence find the best model.

(ii) Find the best possible subset of regressors using the forward stepwise procedure with
R codes for each step. Also show the steps with a single R command. Write down the final
model.

(iii) Find the best possible subset of regressors using the forward selection procedure with
R codes for each step. Also show the steps with a single R command. Write down the
final model.

(iv) Find the best possible subset of regressors using the backward elimination procedure
with R codes for each step. Also show the steps with a single R command. Write
down the final model.
Answer:

> #Loading necessary packages:


library(MPV) #For the 'cement' data set.
library(car) #For variance Inflation Factor.
library(ppcor) #For partial correlation.
library(olsrr) #For all possible regressions.
> data("cement") #Loading the required data set.
data<-cement #Attaching the cement data into data for convinience.
head(data,3)
> ## X y x1 x2 x3 x4
## 1 1 78.5 7 26 6 60
## 2 2 74.3 1 29 15 52
## 3 3 104.3 11 56 8 20
> RegOut<-lm(y~x1+x2+x3+x4,data=data) #Making model using all regressors.
### Checking colinearity by Variance Inflation Factor:
vif(RegOut)
> ## x1 x2 x3 x4
## 38.49621 254.42317 46.86839 282.51286
> da<-data[,3:6]
solve(cor(da))
> ## x1 x2 x3 x4
## x1 38.49621 94.11969 41.88410 99.7858
## x2 94.11969 254.42317 105.09139 267.5394
## x3 41.88410 105.09139 46.86839 111.1451
## x4 99.78580 267.53942 111.14509 282.5129

Since the variance inflation factors are 38.49, 254.42, 46.86, and 282.51, all of which are much
higher than 10, they reflect quite serious problems with multicollinearity. Here, we will use
model respecification (variable selection) to lessen the impact of multicollinearity.
i.

All possible regression model:

> RegOut<-lm(y~x1+x2+x3+x4,data=data) #Making model.


### All possible regression and its plots:
k<-ols_step_all_possible(RegOut)
k
> ## Index N Predictors   R-Square  Adj. R-Square Mallow's Cp
## 4   1  1 x4           0.6745420 0.6449549     138.730833
## 2   2  1 x2           0.6662683 0.6359290     142.486407
## 1   3  1 x1           0.5339480 0.4915797     202.548769
## 3   4  1 x3           0.2858727 0.2209521     315.154284
## 5   5  2 x1 x2        0.9786784 0.9744140       2.678242
## 7   6  2 x1 x4        0.9724710 0.9669653       5.495851
## 10  7  2 x3 x4        0.9352896 0.9223476      22.373112
## 8   8  2 x2 x3        0.8470254 0.8164305      62.437716
## 9   9  2 x2 x4        0.6800604 0.6160725     138.225920
## 6  10  2 x1 x3        0.5481667 0.4578001     198.094653
## 12 11  3 x1 x2 x4     0.9823355 0.9764473       3.018233
## 11 12  3 x1 x2 x3     0.9822847 0.9763796       3.041280
## 13 13  3 x1 x3 x4     0.9812811 0.9750415       3.496824
## 14 14  3 x2 x3 x4     0.9728200 0.9637599       7.337474
## 15 15  4 x1 x2 x3 x4  0.9823756 0.9735634       5.000000
> plot(k) #Plots
From the above plots and table we can see that model 5 (x1, x2) has the lowest Mallow's Cp,
and its R-square and adjusted R-square are nearly as large as those of the bigger models while
using only two regressors. Therefore we can conclude that model 5 may be the best possible
model for this data set.
> model<-lm(y~x1+x2,data=data) #Best model.
summary(model)
> ##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09

Therefore the best model is: 𝑦̂=52.58+1.47x1+0.66x2

ii.

Forward stepwise procedure with R-codes for each step:

> # Correlation coefficient for the data


cor(data)
> ## X y x1 x2 x3 x4
## X 1.0000000 0.4850346 0.1382303 0.5514217 0.2238313 -0.6328061
## y 0.4850346 1.0000000 0.7307175 0.8162526 -0.5346707 -0.8213050
## x1 0.1382303 0.7307175 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.5514217 0.8162526 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 0.2238313 -0.5346707 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.6328061 -0.8213050 -0.2454451 -0.9729550 0.0295370 1.0000000
From the above table we can see that the correlation coefficient between y and x4 is the
largest in absolute value (-0.82) among all the regressors. So, we add x4 to our model first.

> RegOut1<-lm(y~x4,data=data)
summary(RegOut1)
> ##
## Call:
## lm(formula = y ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762

In the above model x4 significantly affects y. Now checking partial correlation for two
variables given a third variable(x4):

> pcor.test(data$y,data$x1,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.9567731 1.105281e-06 10.40307 13 1 pearson
> pcor.test(data$y,data$x2,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.1302149 0.6866842 0.4153118 13 1 pearson
> pcor.test(data$y,data$x3,data$x4)
> ## estimate p.value statistic n gp Method
## 1 -0.8950818 8.375467e-05 -6.347801 13 1 pearson
Since the partial correlation between y and x1 is the largest in absolute value and its p-value
is the lowest after controlling for the effect of x4, we add x1 to our model.

> RegOut2<-lm(y~x4+x1,data=data)
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x4 + x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0234 -1.4737 0.1371 1.7305 3.7701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
## x4 -0.61395 0.04864 -12.62 1.81e-07 ***
## x1 1.43996 0.13842 10.40 1.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 10 degrees of freedom
## Multiple R-squared: 0.9725, Adjusted R-squared: 0.967
## F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08

In the above model both x1 and x4 significantly affect y. Therefore we will check the partial
correlation for two variables given the set of variables (x1, x4):

> pcor.test(data$y,data$x2,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 0.5986053 0.05168735 2.241844 13 2 pearson
> pcor.test(data$y,data$x3,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 -0.5657105 0.06969226 -2.058117 13 2 pearson

Since the partial correlation between y and x2 is the largest and its p-value is the lowest
given the set of variables x1 and x4, we add x2 to our model.
> RegOut3<-lm(y~x4+x1+x2,data=data)
summary(RegOut3)
> ##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x4 -0.2365 0.1733 -1.365 0.205395
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08

In the above model x4 is not significant, so we can remove x4 from our model. At this point
the only remaining candidate regressor is x3.

Checking partial correlation between y and x3 given a set of variables(x1,x2):

> pcor.test(data$y,data$x3,data[,c(3,4)])
> ## estimate p.value statistic n gp Method
## 1 0.4112643 0.2088895 1.353561 13 2 pearson

The partial correlation between y and x3 is weak (0.41) and the p-value is large (0.21), which
is not significant. So, the forward stepwise selection procedure terminates.

Therefore our final model will be:


> model<-lm(y~x1+x2,data = data) #Final model.

So, by the forward stepwise procedure our final model is: 𝑦̂=52.58+1.47x1+0.66x2

Forward stepwise procedure with single R-codes:

> k<-ols_step_both_p(RegOut,pent = 0.1,prem = 0.05)


plot(k)
> ols_step_both_p(RegOut, pent = 0.1,prem = 0.05, progress = TRUE)
> ## Stepwise Selection Method
## Candidate Terms:
## 1. x1
## 2. x2
## 3. x3
## 4. x4
## We are selecting variables based on p value...
## Variables Entered/Removed:
## - x4 added
## - x1 added
## - x2 added
## - x4 removed
## No more variables to be added/removed.
## Final Model Output
## Model Summary
## R 0.989 RMSE 2.406
## R-Squared 0.979 Coef. Var 2.522
## Adj. R-Squared 0.974 MSE 5.790
## Pred R-Squared 0.965 MAE 1.909
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 2657.859 2 1328.929 229.504 0.0000
## Residual 57.904 10 5.790
## Total 2715.763 12
## Parameter Estimates
## model Beta Std. Error Std. Beta t Sig lower upper
## (Intercept) 52.577 2.286 22.998 0.000 47.483 57.671
## x1 1.468 0.121 0.574 12.105 0.000 1.198 1.739
## x2 0.662 0.046 0.685 14.442 0.000 0.560 0.764
>
## Stepwise Selection Summary
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## 1 x4 addition 0.675 0.645 138.7310 97.7440 8.9639
## 2 x1 addition 0.972 0.967 5.4960 67.6341 2.7343
## 3 x2 addition 0.982 0.976 3.0180 63.8663 2.3087
## 4 x4 removal 0.979 0.974 2.6780 64.3124 2.4063
## -------------------------------------------------------------------------------------
Therefore, by the stepwise procedure our final model is: 𝑦̂=52.58+1.47x1+0.66x2

Machine Mechanism:

At first the algorithm fits each single-regressor model (x1, x2, x3, x4) and finds which model
has the lowest p-value. In our case (y~x4) has the lowest p-value, so it selects x4. Secondly it
fits the two-regressor models (x4+x1, x4+x2, x4+x3); (y~x4+x1) has the lowest p-value, so it
adds x1. Thirdly it fits the models (x4+x1+x2, x4+x1+x3); (y~x4+x1+x2) has the lowest p-value,
so it adds x2, but it then finds that x4 is no longer significant, so it removes x4 from the model.
Therefore the final model is (y~x1+x2). The relevant codes are:

> summary(lm(y~x1,data=data))
summary(lm(y~x2,data=data))
summary(lm(y~x3,data=data))
summary(lm(y~x4,data=data))
summary(lm(y~x4+x1,data=data))
summary(lm(y~x4+x2,data=data))
summary(lm(y~x4+x3,data=data))
summary(lm(y~x4+x1+x2,data=data))
summary(lm(y~x4+x1+x3,data=data))
iii.

Forward selection procedure with R codes for each step:

> cor(data)
> ## X y x1 x2 x3 x4
## X 1.0000000 0.4850346 0.1382303 0.5514217 0.2238313 -0.6328061
## y 0.4850346 1.0000000 0.7307175 0.8162526 -0.5346707 -0.8213050
## x1 0.1382303 0.7307175 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.5514217 0.8162526 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 0.2238313 -0.5346707 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.6328061 -0.8213050 -0.2454451 -0.9729550 0.0295370 1.0000000
> RegOut1<-lm(y~x4,data=data)
summary(RegOut1)
> ##
## Call:
## lm(formula = y ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762

Since the correlation between y and x4 is the largest in absolute value and its p-value is the lowest, we select x4 first.

Now checking partial correlation for two variables given a third variable(x4):
> pcor.test(data$y,data$x1,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.9567731 1.105281e-06 10.40307 13 1 pearson
> pcor.test(data$y,data$x2,data$x4)
> ## estimate p.value statistic n gp Method
## 1 0.1302149 0.6866842 0.4153118 13 1 pearson
> pcor.test(data$y,data$x3,data$x4)
> ## estimate p.value statistic n gp Method
## 1 -0.8950818 8.375467e-05 -6.347801 13 1 pearson

Since the partial correlation between y and x1 is the largest in absolute value and its p-value
is the lowest given x4, we add x1 to our model.

> RegOut2<-lm(y~x4+x1,data=data)
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x4 + x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0234 -1.4737 0.1371 1.7305 3.7701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
## x4 -0.61395 0.04864 -12.62 1.81e-07 ***
## x1 1.43996 0.13842 10.40 1.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 10 degrees of freedom
## Multiple R-squared: 0.9725, Adjusted R-squared: 0.967
## F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08
Checking partial correlation for two variables given a set of variables(x1, x4):

> pcor.test(data$y,data$x2,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 0.5986053 0.05168735 2.241844 13 2 pearson
> pcor.test(data$y,data$x3,data[,c(6,3)])
> ## estimate p.value statistic n gp Method
## 1 -0.5657105 0.06969226 -2.058117 13 2 pearson

Since the partial correlation between y and x2 is the largest and its p-value is the lowest
given the variables x4 and x1, we add x2 to our model.

> RegOut3<-lm(y~x4+x1+x2,data=data)

At this point the only remaining candidate regressor is x3.

Checking partial correlation for two variables given a set of variables(x1, x4,x2):

> pcor.test(data$y,data$x3,data[,c(6,3,4)])
> ## estimate p.value statistic n gp Method
## 1 0.04768649 0.8959227 0.1350314 13 3 pearson

The partial correlation between y and x3 is very weak (0.05) and the p-value is large (0.89),
which is not significant.

So, the forward selection procedure terminates.

Final model will be:

> model<-lm(y~x1+x2+x4,data=data)
summary(model)
> ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395

Therefore by the forward selection procedure our final model is: 𝑦̂=71.65+1.45x1+0.42x2-0.24x4
Forward selection procedure with single R-codes:

> ols_step_forward_p(RegOut,progress = TRUE)


> ## Forward Selection Method
## Candidate Terms:
## 1. x1
## 2. x2
## 3. x3
## 4. x4
## We are selecting variables based on p value...
## Variables Entered:
## - x4
## - x1
## - x2
## No more variables to be added.
## Final Model Output
## Model Summary
## R 0.991 RMSE 2.309
## R-Squared 0.982 Coef. Var 2.419
## Adj. R-Squared 0.976 MSE 5.330
## Pred R-Squared 0.969 MAE 1.606
## ANOVA
## Sum of
## Squares DF Mean Square F Sig.
## Regression 2667.790 3 889.263 166.832 0.0000
## Residual 47.973 9 5.330
## Total 2715.763 12
## Parameter Estimates
## model Beta Std. Error Std. Beta t Sig lower upper
## (Intercept) 71.648 14.142 5.066 0.001 39.656 103.641
## x4 -0.237 0.173 -0.263 -1.365 0.205 -0.629 0.155
## x1 1.452 0.117 0.568 12.410 0.000 1.187 1.717
## x2 0.416 0.186 0.430 2.242 0.052 -0.004 0.836
> ## Selection Summary
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## 1 x4 0.6745 0.6450 138.7308 97.7440 8.9639
## 2 x1 0.9725 0.9670 5.4959 67.6341 2.7343
## 3 x2 0.9823 0.9764 3.0182 63.8663 2.3087
Therefore by the forward selection procedure our final model is: 𝑦̂=71.65+1.45x1+0.42x2-0.24x4

iv.

Backward elimination procedure with R-codes for each step:

> RegOut1<-lm(y~x1+x2+x3+x4,data=data) #Fitting the model with all regressors.


summary(RegOut1)
> ##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## x1 1.5511 0.7448 2.083 0.0708 .
## x2 0.5102 0.7238 0.705 0.5009
## x3 0.1019 0.7547 0.135 0.8959
## x4 -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

We can see from the above summary table that x3 has the highest p-value(0.89) among all
other regressors. So, we can remove x3 from the model.
> RegOut2<-lm(y~x1+x2+x4,data=data) #Fitting the model with x1,x2,x4.
summary(RegOut2)
> ##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08

In second model we can see that x4 has the highest p-value(0.21) among all other
regressors, therefore, x4 is removed from the model.
> RegOut3<-lm(y~x1+x2,data=data) #Fitting the model with x1 and x2.
summary(RegOut3)
> ##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09

As x1 and x2 are both significant, no further regressors can be removed from the model.


Therefore backward elimination process terminates.

Hence, by the backward elimination procedure the final model is: 𝑦̂=52.58+1.47x1+0.66x2

Backward elimination procedure with single R-codes:


> ols_step_backward_p(RegOut,pent = 0.1,prem = 0.05,progress=TRUE)
> ## Backward Elimination Method
## Candidate Terms:
## 1 . x1
## 2 . x2
## 3 . x3
## 4 . x4
## We are eliminating variables based on p value...
## Variables Removed:
## - x3
## - x4
## No more variables satisfy the condition of p value = 0.05
## Final Model Output
## Model Summary
## R 0.989 RMSE 2.406
## R-Squared 0.979 Coef. Var 2.522
## Adj. R-Squared 0.974 MSE 5.790
## Pred R-Squared 0.965 MAE 1.909
## ANOVA
## Sum of
## Squares DF Mean Square F Sig.
## Regression 2657.859 2 1328.929 229.504 0.0000
## Residual 57.904 10 5.790
## Total 2715.763 12
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 52.577 2.286 22.998 0.000 47.483 57.671
## x1 1.468 0.121 0.574 12.105 0.000 1.198 1.739
## x2 0.662 0.046 0.685 14.442 0.000 0.560 0.764

> ## Elimination Summary


## -----------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------
## 1 x3 0.9823 0.9764 3.0182 63.8663 2.3087
## 2 x4 0.9787 0.9744 2.6782 64.3124 2.4063
## -----------------------------------------------------------------------
Hence, by the backward elimination procedure the final model is: 𝑦̂=52.58+1.47x1+0.66x2

Problem 13: Model Selection


Consider the National Football League data in table.b1 in “MPV” package. Here the
variables denote,

a. Use the forward selection algorithm to select a subset regression model.

b. Use the backward elimination algorithm to select a subset regression model.

c. Use stepwise regression to select a subset regression model.


d. Comment on the final model chosen by these three procedures.

Answer:
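
No worked solution is written out here; a minimal sketch of the three procedures, using the same olsrr functions as in Problem 12 and assuming x1-x9 are the candidate regressors:

> library(MPV)
library(olsrr)
full<-lm(y~.,data=table.b1)             #Model with all nine candidate regressors
ols_step_forward_p(full,progress=TRUE)  #a. forward selection
ols_step_backward_p(full,progress=TRUE) #b. backward elimination
ols_step_both_p(full,progress=TRUE)     #c. stepwise regression

For part (d), compare the subsets selected by the three procedures and comment on whether they agree.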
Problem 14: Model Selection
Consider the solar thermal energy test data in table.b2 in “MPV” package. Here the
variables denote,

a. Use forward selection to specify a subset regression model.

b. Use backward elimination to specify a subset regression model.


c. Use stepwise regression to specify a subset regression model.

d. Apply all possible regressions to the data. Evaluate R square, Cp , and MSRes for
each model. Which subset model do you recommend?

e. Compare and contrast the models produced by the variable selection strategies in
parts a – d.

Problem 15: Model Selection


Consider the gasoline mileage performance data in table.b3 in “MPV” package .

a. Use the all possible regressions approach to find an appropriate regression model.

b. Use stepwise regression to specify a subset regression model. Does this lead to the
same model found in part a?

Answer:
Problem 16: Model Adequacy Checking
Consider the simple regression model fit to the National Football League team performance
data in Problem 2.

a. Construct a normal probability plot of the residuals. Does there seem to be any
problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Plot the residuals versus the team passing yardage, x2 . Does this plot indicate that
the model will be improved by adding x2 to the model?

Answer:

> data("table.b1")
data<-table.b1
model<-lm(y~x8,data = data)
summary(model)
> ##
## Call:
## lm(formula = y ~ x8, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.804 -1.591 -0.647 2.032 4.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.788251 2.696233 8.081 1.46e-08 ***
## x8 -0.007025 0.001260 -5.577 7.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.393 on 26 degrees of freedom
## Multiple R-squared: 0.5447, Adjusted R-squared: 0.5272
## F-statistic: 31.1 on 1 and 26 DF, p-value: 7.381e-06

Therefore the model is 𝑦̂=21.788-0.007x8


> par(mfrow=c(2,2))
plot(model)

a.

> library(olsrr)
ols_plot_resid_qq(model)

There does not seem to be a problem with the normality assumption.


b.

> ols_plot_resid_fit(model)

The model seems adequate.


c.

> ols_plot_resid_regressor(model,"x2")

It appears that the model may be improved by adding x2.


> model1<-lm(y~x8+x2,data = data)
summary(model1)
> ##
## Call:
## lm(formula = y ~ x8 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4280 -1.3744 -0.0177 1.0010 4.1240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.7126750 2.6175266 5.621 7.55e-06 ***
## x8 -0.0068083 0.0009658 -7.049 2.18e-07 ***
## x2 0.0031111 0.0007074 4.398 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.832 on 25 degrees of freedom
## Multiple R-squared: 0.7433, Adjusted R-squared: 0.7227
## F-statistic: 36.19 on 2 and 25 DF, p-value: 4.152e-08

Let’s see the condition of our new model:

> ols_plot_diagnostics(model1,print_plot = FALSE)


Problem 17: Model Adequacy Checking
Consider the multiple regression model fit to the National Football League team
performance data in Problem 5.

a. Construct a normal probability plot of the residuals. Does there seem to be any
problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Construct plots of the residuals versus each of the regressor variables. Do these
plots imply that the regressor is correctly specified?

d. Construct the partial regression plots for this model. Compare the plots with the
plots of residuals versus regressors from part c above. Discuss the type of
information provided by these plots.

e. Compute the studentized residuals and the R - student residuals for this model.
What information is conveyed by these scaled residuals?

Answer:

Try yourself!!
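As a starting point, here is a minimal sketch assuming the Problem 5 model is the textbook's NFL model y ~ x2 + x7 + x8:

library(MPV)
library(olsrr)
data("table.b1")
fit<-lm(y~x2+x7+x8,data = table.b1)   # assumed Problem 5 model

ols_plot_resid_qq(fit)                # (a) normal probability plot of the residuals
ols_plot_resid_fit(fit)               # (b) residuals versus predicted response
ols_plot_resid_regressor(fit,"x2")    # (c) repeat for "x7" and "x8"
ols_plot_added_variable(fit)          # (d) partial regression (added-variable) plots

rstandard(fit)                        # (e) (internally) studentized residuals
rstudent(fit)                         #     R-student residuals; values beyond about ±2 flag unusual points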

Problem 18: In course Question 2022


The quality of Pinot Noir wine is thought to be related to the properties of clarity, aroma,
body, flavor, and oakiness. Data for 38 wines (winequality.csv) are stored on the desktop.
(Data set link: winequality.csv)

a. Write the fitted model relating wine quality to these regressors.

b. Make comment about the goodness of fit of the above model.

c. Test for significance of regression. What conclusions can you draw?


d. Construct an equal-tailed 90% confidence interval for the parameter β1 (clarity) and
draw a conclusion about the hypotheses H0: β1 = 0 vs H1: β1 ≠ 0.
e. Test the hypothesis H0: β3 = β4 = 0.

f. Calculate the residuals and show that the sum of the residuals equal zero.
g. Find a 95% confidence interval on the mean quality when X1 = 1, X2 = 6.8, X3 = 5, X4
= 6, X5 = 5.2. Also find the 95% prediction interval for that observation.
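Answer:

A minimal sketch, assuming the csv file has columns named Quality, Clarity, Aroma, Body, Flavor and Oakiness (these names and the file path are assumptions; adjust them to the actual file):

wine<-read.csv("winequality.csv")   # adjust the path to where the file is stored
fit<-lm(Quality~Clarity+Aroma+Body+Flavor+Oakiness,data = wine)

summary(fit)                          # (a) fitted model, (b) R-squared, (c) overall F test
confint(fit,"Clarity",level = 0.90)   # (d) equal-tailed 90% CI for beta1

# (e) H0: beta3 = beta4 = 0 via the extra-sum-of-squares (partial) F test
reduced<-lm(Quality~Clarity+Aroma+Oakiness,data = wine)
anova(reduced,fit)

sum(residuals(fit))                   # (f) essentially zero (up to rounding error)

new.obs<-data.frame(Clarity=1,Aroma=6.8,Body=5,Flavor=6,Oakiness=5.2)
predict(fit,newdata = new.obs,interval = "confidence")   # (g) 95% CI for the mean quality
predict(fit,newdata = new.obs,interval = "prediction")   #     95% PI for a new observation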

Problem 19: In course Question 2019


Use the ‘MPV’ package to run the data ‘table.b3’ then answer the following questions:

(a) Fit a multiple regression model relating gasoline mileage Y (miles per gallon) to
engine displacement X1 and the number of carburetor barrels X6. Interpret the
regression coefficients.

(b) Make comment about the goodness of fit of the above model.

(c) Test the significance of regression.

(d) Find a 95% confidence interval for 𝜷1. What does this confidence interval mean?
What conclusion can you draw for testing Ho:𝜷1 = 0 based on the confidence
interval?

(e) Find a 95% confidence interval on the mean gasoline mileage when X1 = 275 in³
and X6 = 2 barrels. Also find the 95% prediction interval for a new observation when
X1 = 275 in³ and X6 = 2 barrels.

(f) Fit a linear regression model relating gasoline mileage Y to engine displacement X1
and the type of transmission X11. Does the type of transmission significantly affect
the mileage performance?

(g) Modify the model developed in part (f) to include an interaction between engine
displacement and the type of transmission. What conclusion can you draw about the
effect of the type of transmission on gasoline mileage?
Answer:
> library(MPV) #Reading Library
> data<-table.b3
head(data,3)
> ## y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
## 1 18.9 350 165 260 8.00 2.56 4 3 200.3 69.9 3910 1
## 2 17.0 350 170 275 8.50 2.56 4 3 199.6 72.9 3860 1
## 3 20.0 250 105 185 8.25 2.73 1 3 196.7 72.2 3510 1
> model<-lm(y~x1+x6,data=data) #(a)
summary(model) #(b,c)
> ##
## Call:
## lm(formula = y ~ x1 + x6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0623 -1.6687 -0.3628 1.6221 6.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.884551 1.535408 21.417 < 2e-16 ***
## x1 -0.053148 0.006137 -8.660 1.55e-09 ***
## x6 0.959223 0.670277 1.431 0.163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.013 on 29 degrees of freedom
## Multiple R-squared: 0.7873, Adjusted R-squared: 0.7726
## F-statistic: 53.67 on 2 and 29 DF, p-value: 1.79e-10
> confint(model,level=.95) #(d)
> ## 2.5 % 97.5 %
## (Intercept) 29.74428901 36.02481266
## x1 -0.06569892 -0.04059641
## x6 -0.41164739 2.33009349

(d) The 95% confidence interval for β1 is (-0.0657, -0.0406). Since this interval does not
contain 0, we reject H0: β1 = 0 at the 5% level: engine displacement has a significant
negative effect on gasoline mileage.
> new.data<-matrix(c(275,2),1,2) #(e)
colnames(new.data)<-c('x1','x6')
new.data<-data.frame(new.data)
predict(model,newdata=new.data,interval='confidence')
> ## fit lwr upr
## 1 20.18739 18.87221 21.50257
The 95% confidence interval for the mean gasoline mileage when engine displacement
x1 = 275 and number of carburetor barrels x6 = 2 is (18.87, 21.50). That is, according to the
model, the average gasoline mileage of cars with x1 = 275 and x6 = 2 is estimated to lie
between 18.87 and 21.50 miles per gallon.

> predict(model,newdata=new.data,interval='prediction')
> ## fit lwr upr
## 1 20.18739 13.8867 26.48808

The 95% prediction interval for a new observation with engine displacement x1 = 275 and
number of carburetor barrels x6 = 2 is (13.89, 26.49). That is, according to the model,
about 95% of individual cars with x1 = 275 and x6 = 2 are expected to have a gasoline
mileage between 13.89 and 26.49 miles per gallon.

> model1<-lm(y~x1+x11,data=data) #(f)


summary(model1)
> ##
## Call:
## lm(formula = y ~ x1 + x11, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9153 -1.8882 0.1106 1.7706 6.7829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.618408 1.539505 21.837 < 2e-16 ***
## x1 -0.045736 0.008682 -5.268 1.2e-05 ***
## x11 -0.498689 2.228198 -0.224 0.824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.115 on 29 degrees of freedom
## Multiple R-squared: 0.7727, Adjusted R-squared: 0.757
## F-statistic: 49.28 on 2 and 29 DF, p-value: 4.696e-10
(f) The overall F statistic is highly significant (p-value = 4.696e-10), so at least one of the
predictors is related to gasoline mileage. However, the t test for x11 has p-value
0.824, so the type of transmission does not significantly affect mileage performance
once engine displacement is in the model.
> model2<-lm(y~x1+x11+x1*x11,data=data) #(g)
summary(model2)
> ##
## Call:
## lm(formula = y ~ x1 + x11 + x1 * x11, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2712 -1.2660 0.1342 1.5181 4.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.91963 2.73493 15.693 2.10e-15 ***
## x1 -0.11677 0.01984 -5.886 2.49e-06 ***
## x11 -13.46371 3.84413 -3.502 0.001567 **
## x1:x11 0.08165 0.02127 3.839 0.000647 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 28 degrees of freedom
## Multiple R-squared: 0.851, Adjusted R-squared: 0.8351
## F-statistic: 53.33 on 3 and 28 DF, p-value: 1.064e-11

(g) The overall F statistic is again highly significant (p-value = 1.064e-11). Both the
differential intercept (β2, p = 0.0016) and the differential slope (β3, p = 0.0006) are
individually statistically significant, so the type of transmission does affect gasoline
mileage: the relationship between mileage and engine displacement differs between
the two transmission types.
Problem 20: Final Question 2019
Use the ‘mosaicData’ package to run the data ‘SaratogaHouses’ then answer the following
questions:

Consider the house price data in Saratoga Country, New York, USA in 2006. The data set has
1728 observations and 16 variables.

a. Fit a multiple regression model relating selling price to size of lot (square feet), age
of house (years), value of land (1000s of US dollars), living area (square feet),
number of bedrooms, number of fireplaces, number of bathrooms and number of
rooms.

b. Construct the analysis of variance table and test for significance of regression.

c. Find a 95% confidence interval for 𝜷3. What conclusion can you draw about Ho: 𝜷3
= 0 from the above confidence interval?

d. Use forward selection and backward elimination procedure to find the best subset
of regressors from the regressors used in part (a).

Answer:

Try yourself!!
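As a starting point, here is a minimal sketch assuming the standard mosaicData variable names (price, lotSize, age, landValue, livingArea, bedrooms, fireplaces, bathrooms, rooms):

library(mosaicData)
library(olsrr)
data("SaratogaHouses")

fit<-lm(price~lotSize+age+landValue+livingArea+bedrooms+fireplaces+bathrooms+rooms,
        data = SaratogaHouses)

summary(fit)                          # (a) fitted model; overall F test for (b)
anova(fit)                            # (b) analysis of variance table
confint(fit,"landValue",level = 0.95) # (c) 95% CI for beta3; reject H0: beta3 = 0 if 0 is excluded

ols_step_forward_p(fit)               # (d) forward selection
ols_step_backward_p(fit)              # (d) backward elimination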
Problem 21: Final Question 2018

(a) Determine the multiple regression equation for the data.

(b) Compute and interpret the coefficient of multiple determination, R-square.

(c) At the 5% significance level, does it appear that any of the predictor variables can be
removed from the full model as unnecessary?
(d) Obtain and interpret 95% confidence intervals for the slopes, βi, of the population
regression line that relates net revenues and number of branches to profit margin.
(e) Are there any multicollinearity problems (i.e., are net revenues and number of
branches collinear [estimating similar relationships/quantities])?
(f) Obtain a point estimate for the mean profit margin with 3.5 net revenues and 6500
branches.

(g) Test the alternative hypothesis that the mean profit margin with 3.5 net revenues
and 6500 branches is greater than 0.70. Test at the 5% significance level.

(h) Determine a 95% confidence interval for the mean profit margin with 3.5 net
revenues and 6500 branches.

(i) Find the predicted profit margin for my bank with 3.5 net revenues and 6500
branches.
(j) Determine a 95% prediction interval for the profit margin for the bank with 3.5 net
revenues and 6500 branches.
Answer:

Try yourself!!
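The exam data set is not reproduced here, but the workflow can be sketched assuming a data frame banks with columns profit (profit margin), revenue (net revenues) and branches (number of branches); these names are hypothetical:

library(olsrr)
# banks: hypothetical data frame with columns profit, revenue, branches
fit<-lm(profit~revenue+branches,data = banks)

summary(fit)                     # (a) fitted equation, (b) R-squared, (c) t tests at the 5% level
confint(fit,level = 0.95)        # (d) 95% confidence intervals for the slopes
ols_vif_tol(fit)                 # (e) VIF/tolerance to check for collinearity

x0<-data.frame(revenue=3.5,branches=6500)
predict(fit,newdata = x0)                            # (f) point estimate of the mean profit margin
predict(fit,newdata = x0,interval = "confidence")    # (h) 95% CI for the mean
predict(fit,newdata = x0,interval = "prediction")    # (i)/(j) predicted value and 95% PI

# (g) one-sided test of H1: mean profit margin at x0 is greater than 0.70
est<-predict(fit,newdata = x0,se.fit = TRUE)
t.stat<-(est$fit-0.70)/est$se.fit
p.value<-pt(t.stat,df = est$df,lower.tail = FALSE)   # reject H0 if p.value < 0.05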
References:
• Theory and Practical Lecture of Dr. Zillur Rahman Shabuz Sir (Problem 01, 03, 07,
08, 09, 10, 12).

• Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012), Introduction to Linear
Regression Analysis, 5th ed., Wiley, New York (Problem 02, 04-06, 11, 12-17).

For the above-mentioned problems kindly see the relevant chapters in the lectures and
books.

Good Luck
