
HW2 Solution

Yao Song

10/12/2020

Q 3.7

# import dataset
dat1 <- read.csv("data-table-B4.csv", header = T, sep = ",") # 24 obs. of 10 var.

(a) Fit a multiple regression model relating selling price to all nine regressors.
fit1 <- lm(y ~ ., data = dat1)
s1 <- summary(fit1)
fit1

##
## Call:
## lm(formula = y ~ ., data = dat1)
##
## Coefficients:
## (Intercept) x1 x2 x3 x4 x5
## 14.92765 1.92472 7.00053 0.14918 2.72281 2.00668
## x6 x7 x8 x9
## -0.41012 -1.40324 -0.03715 1.55945
The linear regression model between the selling price of the house (in thousands of dollars) and all 9 regressors is

ŷ = 14.92765+1.92472x1 +7.00053x2 +0.14918x3 +2.72281x4 +2.00668x5 −0.41012x6 −1.40324x7 −0.03715x8 +1.55945x9 .

The coefficients for x1 , x2 , x3 , x4 , x5 , x9 are greater than 0, indicating possible positive relationships between these regressors and selling price, while x6 , x7 , x8 have negative coefficients, indicating negative relationships. For example, when the number of baths x2 increases by 1 unit (holding the other regressors fixed), the predicted selling price increases by about 7 units; and when the number of bedrooms x7 increases by 1 unit, the predicted selling price decreases by about 1.4 units.

(b) Test for significance of regression. What conclusions can you draw?
s1

##
## Call:
## lm(formula = y ~ ., data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.720 -1.956 -0.045 1.627 4.253
##
## Coefficients:

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.92765 5.91285 2.525 0.0243 *
## x1 1.92472 1.02990 1.869 0.0827 .
## x2 7.00053 4.30037 1.628 0.1258
## x3 0.14918 0.49039 0.304 0.7654
## x4 2.72281 4.35955 0.625 0.5423
## x5 2.00668 1.37351 1.461 0.1661
## x6 -0.41012 2.37854 -0.172 0.8656
## x7 -1.40324 3.39554 -0.413 0.6857
## x8 -0.03715 0.06672 -0.557 0.5865
## x9 1.55945 1.93750 0.805 0.4343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.949 on 14 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.7587
## F-statistic: 9.037 on 9 and 14 DF, p-value: 0.000185
Since the F statistic is 9.037 and its p-value of 1.85 × 10−4 is less than the significance level (0.05), we reject
H0 : β1 = . . . = β9 = 0 and conclude there is a linear relationship between selling price y and at least one of the
regressors x1 , x2 , . . . , x9 .
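As a quick sanity check (a sketch using only the F statistic and degrees of freedom reported above), the p-value of the overall F test can be reproduced directly from the F distribution:

```r
# Overall F test of H0: beta1 = ... = beta9 = 0
# F = 9.037 on 9 and 14 degrees of freedom (from the summary above)
p_overall <- pf(9.037, df1 = 9, df2 = 14, lower.tail = FALSE)
p_overall # approximately 1.85e-04, matching the reported p-value
```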

(c) Use t tests to assess the contribution of each regressor to the model. Discuss your findings.
From the results in part (b), we can see that the p-values for the t statistics of β1 , β2 , . . . , β9 are all greater than
the significance level (0.05). That is, none of the regressors is individually significant in the linear model, even
though the overall F test in part (b) is significant. Therefore, we may face a multicollinearity problem in this
linear regression model.

(d) What is the contribution of lot size (x3 ) and living space (x4 ) to the model given that all of the other
regressors are included?
fit11 <- lm(y ~ x1 + x2 + x5 + x6 + x7 + x8 + x9, data = dat1)
s11 <- summary(fit11)
anov1 <- anova(fit1)
anov11 <- anova(fit11)
F_34 <- round((sum(anov1$`Sum Sq`[1:9])-sum(anov11$`Sum Sq`[1:7]))/2/anov1$`Mean Sq`[10], 4)
p_34 <- round(1-pf(F_34, 2, 14), 4)

The partial F test measures the contribution of the regressors x3 and x4 given that the other regressors are in
the model. The F statistic is 0.3225 and the corresponding p-value is 0.7296. Since the p-value is greater than
the significance level (0.05), lot size and living space contribute essentially nothing given that all the other
regressors are in the model.

(e) Is multicollinearity a potential problem in this model?


library(car) # vif() comes from the car package
vif(fit1)

## x1 x2 x3 x4 x5 x6 x7 x8
## 7.021036 2.835413 2.454907 3.836477 1.823605 11.710101 9.722663 2.320887
## x9
## 1.942494
From parts (b), (c), and (d), we find that all the individual regression coefficients are insignificant while the
overall regression is significant based on the F statistic. In addition, the VIF for x6 is larger than 10 (and the
VIF for x7 is close to 10). This situation indicates multicollinearity in this model.

Q 3.8

# import dataset
dat2 <- read.csv("data-table-B5.csv", header = T, sep = ",") # 27 obs. of 8 var.

(a) Fit a multiple regression model relating CO2 product (y) to total solvent (x6 ) and hydrogen consumption
(x7 ).
fit2 <- lm(y ~ x6 + x7, data = dat2)
s2 <- summary(fit2)
fit2

##
## Call:
## lm(formula = y ~ x6 + x7, data = dat2)
##
## Coefficients:
## (Intercept) x6 x7
## 2.52646 0.01852 2.18575
The linear regression model between CO2 product and total solvent (x6 ) and hydrogen consumption (x7 ) is

ŷ = 2.52646 + 0.01852x6 + 2.18575x7 .

The slopes for x6 and x7 are both greater than 0, which suggests positive relationships with CO2 product. When
total solvent increases by 1 unit (holding x7 fixed), the predicted CO2 product increases by 0.01852 units; when
hydrogen consumption increases by 1 unit, the predicted CO2 product increases by 2.18575 units.

(b) Test for significance of regression. Calculate R2 and R2Adj .
s2

##
## Call:
## lm(formula = y ~ x6 + x7, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2035 -4.3713 0.2513 4.9339 21.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.526460 3.610055 0.700 0.4908
## x6 0.018522 0.002747 6.742 5.66e-07 ***
## x7 2.185753 0.972696 2.247 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6746
## F-statistic: 27.95 on 2 and 24 DF, p-value: 5.391e-07
Since the F statistic is 27.95 and its p-value of 5.39 × 10−7 is less than the significance level (0.05), we reject
H0 : β6 = β7 = 0 and conclude there is a linear relationship between CO2 product y and at least one of the
regressors x6 , x7 .

R2 in this model is 0.6996, indicating that 69.96% of the total variability in CO2 product is explained by this
model. The adjusted R2 is 0.6746.
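The adjusted R2 can be recovered from R2 by hand (a sketch using n = 27 observations and p = 2 regressors from this fit):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- 27   # observations in Table B.5
p <- 2    # regressors x6 and x7
r2 <- 0.6996
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(r2_adj, 4) # 0.6746, matching the summary output
```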

(c) Using t tests determine the contribution of x6 and x7 to the model.


From the results in part (b), we can see that the p-values of the t statistics for β6 (5.66 × 10−7 ) and β7 (0.0341) are
both less than the significance level (0.05), which indicates that both variables contribute significantly to the model.

(d) Construct 95% CIs on β6 and β7 .


CI <- round(confint(fit2, level = 0.95), 4)
CI

## 2.5 % 97.5 %
## (Intercept) -4.9243 9.9772
## x6 0.0129 0.0242
## x7 0.1782 4.1933
The 95% CI on β6 is [0.0129, 0.0242] and the 95% CI on β7 is [0.1782, 4.1933]. In repeated sampling, 95% of
intervals constructed this way would contain the true value of the corresponding slope.
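These intervals can be reconstructed by hand as estimate ± t(0.025, 24) × standard error, using the coefficient table from part (b):

```r
# 95% CI: estimate +/- t_{0.025, n-p-1} * se, with 24 residual df
t_crit <- qt(0.975, df = 24)
ci_b6 <- 0.018522 + c(-1, 1) * t_crit * 0.002747
ci_b7 <- 2.185753 + c(-1, 1) * t_crit * 0.972696
round(ci_b6, 4) # 0.0129 0.0242
round(ci_b7, 4) # 0.1782 4.1933
```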

(e) Refit the model using only x6 as the regressor. Test for significance of regression and calculate R2 and
R2Adj . Discuss your findings. Based on these statistics, are you satisfied with this model?
fit21 <- lm(y ~ x6, data = dat2)
s21 <- summary(fit21)
s21

##
## Call:
## lm(formula = y ~ x6, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.081 -5.829 -0.839 5.522 26.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.144181 3.483064 1.764 0.0899 .
## x6 0.019395 0.002932 6.616 6.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 25 degrees of freedom
## Multiple R-squared: 0.6365, Adjusted R-squared: 0.6219
## F-statistic: 43.77 on 1 and 25 DF, p-value: 6.238e-07
The linear regression model between CO2 product and total solvent (x6 ) is

ŷ = 6.14418 + 0.01939x6 .

Since the F statistic is 43.77 and its p-value of 6.24 × 10−7 is less than the significance level (0.05), we reject
H0 : β6 = 0 and conclude there is a linear relationship between CO2 product y and total solvent (x6 ).

R2 in this model is 0.6365, indicating that 63.65% of the total variability in CO2 product is explained by this
model. The adjusted R2 is 0.6219.
Comparing the adjusted R2 values of the two models, 0.6746 (full model) and 0.6219 (reduced model), the
difference is small, indicating that x6 alone explains CO2 product nearly as well as the full model. At the
same time, the test for significance of the reduced model (with only x6 ) is also significant. Therefore, the
model with only x6 is also adequate.

(f) Construct a 95% CI on β6 using the model you fit in part (e). Compare the length of this CI to the
length of the CI in part (d). Does this tell you anything important about the contribution of x7 to the model?
CI <- round(confint(fit21, level = 0.95), 4)
CI

## 2.5 % 97.5 %
## (Intercept) -1.0293 13.3177
## x6 0.0134 0.0254
The 95% CI on β6 is [0.0134, 0.0254], with length 0.0120, while the 95% CI in part (d) has length 0.0113. The
length of this confidence interval is almost the same as the one from the model including x7 , so x7 adds little
precision to the estimate of β6 and may not contribute much to the model.
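The two lengths can be computed directly from the interval endpoints (a small check using the rounded CIs above):

```r
# Length of the 95% CI on beta6 in each model
len_full <- 0.0242 - 0.0129    # model with x6 and x7 (part (d))
len_reduced <- 0.0254 - 0.0134 # model with x6 only (part (f))
c(full = len_full, reduced = len_reduced) # 0.0113 vs 0.0120
```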

(g) Compare the values of M SRes obtained for the two models you have fit (parts (a) and (e)). How did
the M SRes change when you removed x7 from the model? Does this tell you anything important about the
contribution of x7 to the model?
anov2 <- anova(fit2)
anov21 <- anova(fit21)
anov2

## Analysis of Variance Table


##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x6 1 5008.9 5008.9 50.8557 2.267e-07 ***
## x7 1 497.3 497.3 5.0495 0.0341 *
## Residuals 24 2363.8 98.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anov21

## Analysis of Variance Table


##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x6 1 5008.9 5008.9 43.766 6.238e-07 ***
## Residuals 25 2861.2 114.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MSRes is 98.5 when both x6 and x7 are in the model, while MSRes is 114.4 when only x6 is in the model.
MSRes is lower when x6 and x7 are both in the model, so we can conclude that x7 does make some contribution
to the linear model, but not a large one.
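The two MSRes values follow directly from the residual sums of squares and degrees of freedom in the ANOVA tables above:

```r
# MS_Res = SS_Res / df_Res for each model
ms_full <- 2363.8 / 24    # x6 and x7 in the model
ms_reduced <- 2861.2 / 25 # only x6 in the model
round(c(ms_full, ms_reduced), 1) # 98.5 114.4
```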

Q 3.9

# import dataset
dat3 <- read.csv("data-table-B6.csv", header = T, sep = ",") # 28 obs. of 5 var.

(a) Fit a multiple regression model relating concentration of NbOCl3 (y) to concentration of COCl2 (x1 )
and mole fraction (x4 ).
fit3 <- lm(y ~ x1 + x4, data = dat3)
s3 <- summary(fit3)
fit3

##
## Call:
## lm(formula = y ~ x1 + x4, data = dat3)
##
## Coefficients:
## (Intercept) x1 x4
## 0.004833 -0.344984 -0.000143
The linear regression model relating concentration of NbOCl3 to concentration of COCl2 and mole fraction is

ŷ = 0.004833 − 0.344984x1 − 0.000143x4 .

The slopes for x1 and x4 are both less than 0, which suggests negative relationships with y. When the
concentration of COCl2 increases by 1 g-mol/l, the predicted concentration of NbOCl3 decreases by 0.344984 g-mol/l;
when the mole fraction CO2 increases by 1 unit, the predicted concentration of NbOCl3 decreases by 0.000143 g-mol/l.

(b) Test for significance of regression.


s3

##
## Call:
## lm(formula = y ~ x1 + x4, data = dat3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0009015 -0.0003526 -0.0001538 0.0003847 0.0010874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0048333 0.0008142 5.936 3.39e-06 ***
## x1 -0.3449837 0.0673963 -5.119 2.74e-05 ***
## x4 -0.0001430 0.0078151 -0.018 0.986
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0005804 on 25 degrees of freedom
## Multiple R-squared: 0.6636, Adjusted R-squared: 0.6367
## F-statistic: 24.66 on 2 and 25 DF, p-value: 1.218e-06
Since the F statistic is 24.66 and its p-value of 1.22 × 10−6 is less than the significance level (0.05), we reject
H0 : β1 = β4 = 0 and conclude there is a linear relationship between concentration of NbOCl3 and at least one
of the regressors x1 , x4 .

(c) Calculate R2 and R2Adj for this model.
R2 in this model is 0.6636, indicating that 66.36% of the total variability in concentration of NbOCl3 is
explained by this model. The adjusted R2 is 0.6367.

(d) Using t tests, determine the contribution of x1 and x4 to the model. Are both regressors x1 and x4
necessary?
From the results in part (b), we can see that the p-value of the t statistic for β1 (2.74 × 10−5 ) is less than the
significance level (0.05), while the p-value of the t statistic for β4 (0.986) is greater than it. That is, x1 is
significant and x4 is not, indicating that x4 may not be necessary in the model.

(e) Is multicollinearity a potential concern in this model?


vif(fit3)

## x1 x4
## 1.891525 1.891525
The variance inflation factors for both x1 and x4 are less than 10, which indicates that multicollinearity is
not a potential concern in this model.
