Graded Homework 1 Solutions
Solution Manual
Answer 1. (B) The sum of residuals will always be equal to zero if you include an
intercept term in your model and use OLS to estimate the coefficients.
When an intercept is included in a multiple linear regression with p predictors,
the fitted model is of the form:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

To find all the betas, we differentiate the residual sum of squares with respect
to each of the betas individually, including $\beta_0$. We obtain p+1 equations to
solve for all the p+1 betas. Differentiating with respect to $\beta_0$, we get:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) = 0$$
The above equation is responsible for ensuring that the sum of residuals is 0.
Therefore, C and D are incorrect options.
Thus, B is the correct option. One can think of the intercept as a catch-all
term that shifts the regression line toward or away from the X-axis (say the
response values are on the order of thousands; $\beta_0$ then moves the regression
line by that order, allowing the remaining betas to be tuned more finely).
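As a quick sanity check (a minimal sketch using R's built-in cars data, not
this question's data), the residuals of an OLS fit sum to zero only when the
intercept is included:

> # Residuals sum to zero only when an intercept is estimated by OLS
> fit_with <- lm(dist ~ speed, data = cars)
> fit_without <- lm(dist ~ speed - 1, data = cars) # "- 1" drops the intercept
> sum(residuals(fit_with))    # effectively zero (floating-point noise)
> sum(residuals(fit_without)) # generally non-zero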
Answer 2. (B) These lines will always intersect at the mean (Xmean, Ymean)
These estimates are obtained using the least-squares method, which minimizes
the sum of squared residuals:

$$D(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2$$

The final line is $y = a + b x$, and we find a and b from the equations obtained by setting

$$\frac{\partial D(a, b)}{\partial a} = \frac{\partial D(a, b)}{\partial b} = 0,$$

which gives $a = \bar{y} - b \bar{x}$ and $b = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$.
Since $a = \bar{y} - b \bar{x}$, the fitted line always passes through $(\bar{x}, \bar{y})$;
the same argument applies to the regression of X on Y, so the two lines must
intersect at the mean.
Call:
lm(formula = Salary ~ Experience, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-73.00 -12.82 -1.18 13.32 60.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.4679 2.5673 11.48 <2e-16 ***
Experience 3.0959 0.1113 27.81 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = Experience ~ Salary, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-11.430 -4.867 -1.787 3.489 20.805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.475177 0.841711 -1.753 0.0807 .
Salary 0.233146 0.008385 27.806 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The graph produced is as follows (the two lines meet at the mean of the Xs and the Ys):
> abline(29.4679, 3.0959)
> abline(1.475177/0.233146, 1/0.233146)
Kindly note that while plotting the two lines on the same axes, the line from X
~ Y needs to be transformed to X-Y coordinates, as shown above in the second
line of code. Another point to note is that the difference in slope between
these lines grows as the randomness in the data increases. If the data were
perfectly described by a linear equation, one would obtain the same line in
both cases (i.e., Y ~ X and X ~ Y).
For example, considering the data points {(1,3), (2,5), (3,7), (4,9), (5,11)}, the
lines obtained would be y = 2x + 1 and x = 0.5y − 0.5, which are the same
overlapping line.
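One can verify the intersection point in R (a small sketch on the example
points above): since each fitted line passes through the mean point, predicting
at the mean of the regressor recovers the mean of the response.

> x <- c(1, 2, 3, 4, 5); y <- c(3, 5, 7, 9, 11)
> fit_yx <- lm(y ~ x) # y = 2x + 1
> fit_xy <- lm(x ~ y) # x = 0.5y - 0.5
> predict(fit_yx, data.frame(x = mean(x))) # returns mean(y) = 7
> predict(fit_xy, data.frame(y = mean(y))) # returns mean(x) = 3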
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = new.dist ~ new.speed)
Residuals:
Min 1Q Median 3Q Max
-8.8603 -2.9033 -0.6925 2.8086 13.1678
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.3581 2.0600 -2.601 0.0123 *
new.speed 2.6812 0.2833 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # predict stopping distance at a speed of 7.5 m/s, with an interval
> new.dat <- data.frame(new.speed = 7.5)
>
> predict(ols_reg2, new.dat, interval = "prediction")
fit lwr upr
1 14.7526 5.227095 24.2781
>
> predict(ols_reg2, new.dat, interval = "prediction", level = 0.9)
fit lwr upr
1 14.7526 6.806648 22.69855
Question 7. Now suppose your model is ready and you are asked to
make a prediction for the distance a car will need to stop, in meters,
if it is moving at a speed of 7.5 m/s. You are required to report the
predicted distance and also the lower and upper bounds of the 90%
confidence interval. What will you report?
This does not match any of the given options. Take note of the argument
'interval'. When calculating a range for a prediction, we may come across two
types of intervals, 'confidence' and 'prediction', and it is important to
understand the difference between the two.
Confidence intervals tell you about how well you have determined the mean.
The key point is that the confidence interval tells you about the likely
location of the true population parameter.
Prediction intervals tell you where you can expect to see the next data point
sampled. The key point is that the prediction interval tells you about the
distribution of values, not the uncertainty in determining the population
mean.
Prediction intervals must account for both the uncertainty in estimating the
population mean and the scatter of the data. A prediction interval is therefore
always wider than a confidence interval.
One task might be to forecast the weight of some specific person whose living
circumstances we do not know; here the prediction interval must be used. But if
the task is to predict the average weight of all people sharing the same values
of the explanatory variables at hand, we use the confidence interval. Likewise,
in this problem we ask: given that the vehicle is moving at 7.5 m/s, what is
the likely stopping distance on average?
We observe that the two intervals are centered on the same point, but the
prediction interval is much wider than the confidence interval.
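The contrast can be seen directly by requesting both interval types at the same
level (a sketch assuming the ols_reg2 model and new.dat data frame from above):

> predict(ols_reg2, new.dat, interval = "confidence", level = 0.9)
> predict(ols_reg2, new.dat, interval = "prediction", level = 0.9)
> # Both intervals are centered on the same fitted value;
> # the prediction interval is the wider of the two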
Question 8. Your boss is not happy with the intercept term and asks
you to try dropping it. Let’s drop the intercept as your boss asked.
Which of the following findings will you report back to your boss
concerning the percentage of variation explained by the model?
But when the intercept term is dropped, R uses a modified definition of R²,
which is as follows:

$$R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i y_i^2}$$

That is, the total sum of squares is taken about zero rather than about the mean $\bar{y}$.
R² becomes higher without the intercept not because the model is better, but
because a different definition of R² is used. R² expresses a comparison of the
estimated model with some baseline model, stated as the reduction in sum of
squares relative to the sum of squares of the baseline. In the model with an
intercept, the comparison sum of squares is taken around the mean; without an
intercept, it is taken around zero. The latter is usually much larger, so it is
easier to obtain a large reduction in sum of squares. However, forcing y = 0 at
x = 0 biases the fit, and the resulting coefficients may not be truly meaningful.
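A minimal sketch (using the built-in cars data rather than the homework model)
makes the effect visible:

> with_int <- lm(dist ~ speed, data = cars)
> no_int <- lm(dist ~ speed - 1, data = cars)
> summary(with_int)$r.squared # sum of squares compared around the mean
> summary(no_int)$r.squared   # compared around zero: larger, but not comparable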
Answer 9. (C) 2
We have a categorical variable with three levels; only two dummy variables are
needed because one factor level is taken as the baseline.
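For instance (a hypothetical three-level factor, not the homework data), R's
design matrix contains only two dummy columns, with the first level absorbed
into the intercept:

> region <- factor(c("north", "south", "west", "north", "south"))
> model.matrix(~ region) # columns: (Intercept), regionsouth, regionwest
> # "north" is the baseline level, represented by the intercept alone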
Answer 10. (D) At the same age level, people living in urban areas consume
more calories than people living in rural areas.
The confidence level is 95%, and the p-value for Rural_True (0.012) is less
than 0.05. Therefore, Rural_True has a significant effect on the response
variable, and A is incorrect. B and C fail to account for the effect of age and
are thus also incorrect. Holding age constant, people in rural areas consume 40
fewer calories than those in urban areas. Hence, choice D is the correct
answer.
Answer 11. (A) Interaction effects exist between food and condiment, since the
p-value is much smaller than the significance level (alpha = 5%). We should
include the interaction term in the regression model to explain the variability
in the data.
Since the p-value of the interaction effect is close to zero and less than 5%
(the error rate), we reject the null hypothesis that no interaction effect
exists between food and condiment. Thus, A is the correct answer. If food and
condiment were independent of each other, the interaction effect should not be
significant, making D incorrect. Interaction effects are relevant when
conducting regression analysis, so E is incorrect.
Question 13. When plotting the expected value for food (one line for
chocolate sauce and one line for mustard), which plot are you most
likely to see?
Answer 13. (E) Since the interaction effect between food and condiment is
significant, we should expect the two lines to cross each other. Therefore, A
and C cannot be correct. As for choice B, the statistical results table in the
previous question shows that a hot dog with mustard should be more pleasant
than a hot dog with chocolate sauce, so choice B is also not correct.
As for choice D, if the mustard line were flat, the condiment would have no
significant influence on satisfaction. However, the results table shows that
the condiment does have a significant influence. Therefore, E is the most
likely plot.
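A crossing pattern like choice E can be sketched with base R's
interaction.plot (the satisfaction values below are made up purely to
illustrate the shape and are not taken from the homework data):

> food <- factor(rep(c("hot dog", "ice cream"), each = 2))
> condiment <- factor(rep(c("chocolate", "mustard"), times = 2))
> satisfaction <- c(2, 8, 9, 3) # hypothetical means chosen so the lines cross
> interaction.plot(food, condiment, satisfaction)
> # Non-parallel (crossing) lines are the visual signature of an interaction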
> library(lmtest)
> library(MASS)
>
> mydata = Boston
> OLSModel = lm(medv ~ ., mydata)
>
> summary(OLSModel)
Call:
lm(formula = medv ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(OLSModel)
>
> shapiro.test(residuals(OLSModel))
Shapiro-Wilk normality test
data: residuals(OLSModel)
W = 0.90138, p-value < 2.2e-16
> dwtest(OLSModel)
Durbin-Watson test
data: OLSModel
DW = 1.0784, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Call:
lm(formula = sqrt(medv) ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.35292 -0.26646 -0.04765 0.20693 2.20133
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6171186 0.4830511 13.699 < 2e-16 ***
crim -0.0166326 0.0031107 -5.347 1.37e-07 ***
zn 0.0037856 0.0012993 2.913 0.003737 **
indus 0.0035719 0.0058207 0.614 0.539728
chas 0.2546251 0.0815500 3.122 0.001900 **
nox -1.8316237 0.3615452 -5.066 5.75e-07 ***
rm 0.3045221 0.0395573 7.698 7.62e-14 ***
age 0.0001287 0.0012503 0.103 0.918076
dis -0.1345203 0.0188787 -7.125 3.71e-12 ***
rad 0.0322156 0.0062798 5.130 4.18e-07 ***
tax -0.0013597 0.0003559 -3.820 0.000151 ***
ptratio -0.0941484 0.0123830 -7.603 1.48e-13 ***
black 0.0009775 0.0002542 3.845 0.000136 ***
lstat -0.0603341 0.0048003 -12.569 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(sqrt_OLSModel)
For the most part (fitted values between 5 and 35, where more than 95% of the
data points are observed), the residuals are not aggressively diverging from or
converging toward the fitted line (such behavior is usually seen as a cone
formation, suggesting that the variance of the residuals changes with the
fitted values). Thus, we infer that homoskedasticity holds over most of the
range.
However, there is clearly a pattern in the residuals. We observe that toward
either end of the fitted plot (values less than 5 and greater than 35), there
are no residuals on the negative side of the red line. This indicates that a
non-linear data set is being fitted with a linear model.
Question 15. Let’s check if our residuals are normal by doing visual
inspection of a diagnostic plot. (Normal Q-Q plot). What do you
observe?
Answer 15. (C) There seems to be non-normality, with the distribution being
right-skewed.
As can be seen in the plot, toward the right end of the theoretical quantiles
the standardized residuals rise above the reference line (the sample quantiles
are larger than the corresponding normal quantiles), which in statistical terms
is called being skewed to the right.
One can think of a QQ plot as comparing two distributions by plotting their
quantiles against each other: the theoretical normal quantiles on the X-axis
and the standardized residual quantiles on the Y-axis. If the residuals were
normal, the points would fall along a straight line.
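The same check can be run directly in R (a sketch assuming the OLSModel object
fitted above):

> qqnorm(residuals(OLSModel)) # sample quantiles vs. theoretical normal quantiles
> qqline(residuals(OLSModel)) # reference line for a normal sample
> # Points bending above the line in the right tail indicate right skew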
Question 16. Let’s run a formal test to confirm if there is indeed a
non-normality. This test is called Shapiro-Wilk normality test and
the run command for the same is as follows:
> shapiro.test(residuals(your lm object))
The null hypothesis is that the residuals are normal. Now does the
result from this test match your results from Question 15?
Answer 16. (B) Yes, we get a low p-value in the Shapiro-Wilk test, which means
the residuals are not normally distributed, and the visual inspection in
Question 15 also led to the conclusion that the distribution of the residuals
is non-normal.
The Shapiro-Wilk test produces W = 0.9 with a p-value < 0.05. The null
hypothesis of this test is that the population is normally distributed. Thus,
if the p-value is less than the chosen alpha level (here 0.05, corresponding to
a 95% confidence level), the null hypothesis is rejected and there is evidence
that the data tested are not normally distributed. Therefore, in this case, the
residuals are not normally distributed, which can also be seen in the residual
graphs above.
Question 17. Let’s check for any autocorrelation in the data. The
Durbin-Watson statistic is used for that. The function “dwtest” in the
package “lmtest” can be used for this; dwtest takes your linear
model as input. The NULL hypothesis for this test is that the errors
are uncorrelated. Let’s use that. Type the following code to get
ready: install.packages('lmtest'); require(lmtest). What does the test
tell you?
Answer 17. (B) The small p-value indicates that there might be
autocorrelation.
The Durbin-Watson test produced DW = 1.08 with a p-value < 0.05. The test
evaluates the null hypothesis that the linear regression residuals are
uncorrelated against the alternative hypothesis that autocorrelation exists.
The small p-value thus indicates that there is indeed correlation among the
residuals and that they are not independently distributed.
Answer 18. (D) The non-linear pattern earlier visible in residuals in Question
14 becomes less prominent and the residuals are still not normally
distributed.
We can simply check the residuals-vs-fitted plot using the plot() function,
which shows as below:
The plot is similar in nature to the one in Question 14, but with slightly more
values dropping below the x-axis. The QQ plot further shows that the residuals
are not normally distributed, so we should expect an uneven distribution of
residuals around the fitted line; the non-linear trend is therefore still
expected. But to truly appreciate the effect of the square-root transformation,
we should compare the residuals over the same Y-axis range.
As can be seen, the residuals have dropped significantly closer to 0, and with
this, we can say that the non-linear pattern is less prominent.
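That comparison can be reproduced with a few lines of R (a sketch assuming the
OLSModel and sqrt_OLSModel objects fitted above):

> shared <- range(residuals(OLSModel), residuals(sqrt_OLSModel))
> par(mfrow = c(1, 2)) # two plots side by side
> plot(fitted(OLSModel), residuals(OLSModel), ylim = shared,
+      xlab = "Fitted values", ylab = "Residuals", main = "medv ~ .")
> plot(fitted(sqrt_OLSModel), residuals(sqrt_OLSModel), ylim = shared,
+      xlab = "Fitted values", ylab = "Residuals", main = "sqrt(medv) ~ .")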
> summary(linlin)
Call:
lm(formula = Salary ~ Experience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-73.00 -12.82 -1.18 13.32 60.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.4679 2.5673 11.48 <2e-16 ***
Experience 3.0959 0.1113 27.81 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(linlog)
Call:
lm(formula = Salary ~ LogExperience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-61.700 -21.895 -5.022 16.730 84.879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.991 4.768 -0.418 0.677
LogExperience 34.985 1.704 20.529 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(loglin)
Call:
lm(formula = LogSalary ~ Experience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-1.51651 -0.17318 0.02534 0.19444 0.53280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.640177 0.029106 125.07 <2e-16 ***
Experience 0.037087 0.001262 29.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(loglog)
Call:
lm(formula = LogSalary ~ LogExperience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-0.99692 -0.19914 -0.00272 0.20315 0.72587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.15767 0.04584 68.88 <2e-16 ***
LogExperience 0.45949 0.01638 28.04 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Question 19. Which of the 4 fitted models has the highest R-square
value?
For the log-lin model, $\log(Y) = b_0 + b_1 X$, i.e. $Y = e^{b_0 + b_1 X}$.
Differentiating gives $\frac{dY}{dX} = b_1 Y$, which can be rearranged as

$$\frac{100 \cdot dY}{Y} = 100 \cdot b_1 \, dX$$

The term $\frac{100 \cdot dY}{Y}$ in the equation above represents the
percentage change in Y. However, both dX and dY are differentials
(infinitesimally small) and thus for our purpose need to be changed to
$\Delta X$ and $\Delta Y$.

For a unit change, $\Delta X = 1$. We thus consider Y at X = 0 and X = 1 to
compute $\Delta Y$:

$$\frac{100 \cdot \Delta Y}{Y} = 100 \cdot \frac{e^{b_0 + b_1 (1)} - e^{b_0 + b_1 (0)}}{e^{b_0 + b_1 (0)}} = 100 \cdot \left( e^{b_1} - 1 \right)$$

For the lin-log model,

$$Y = b_0 + b_1 \log(X) \;\Rightarrow\; \frac{dY}{dX} = b_1 \left( \frac{1}{X} \right) \;\Rightarrow\; dY = b_1 \frac{dX}{X} \;\Rightarrow\; dY = \frac{b_1}{100} \left[ \left( \frac{dX}{X} \right) \cdot 100 \right]$$

The term $\left( \frac{dX}{X} \right) \cdot 100$ in the equation above
represents the percentage change in X. Thus, for a 1% change in Experience,
Y changes by

$$\frac{b_1}{100} = \frac{34.985}{100} = 0.01 \cdot 34.985 \approx 0.35$$
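These interpretations can be checked numerically (plugging the fitted
coefficients from the loglin and linlog summaries above into R):

> # log-lin: each extra year of Experience multiplies Salary by exp(b1)
> 100 * (exp(0.037087) - 1) # ~3.78% increase in Salary per year of Experience
> # lin-log: a 1% increase in Experience adds roughly b1/100 to Salary
> 34.985 / 100 # ~0.35 (in Salary units)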