Graded Homework 1 Solutions
Solution Manual
Answer 1. (B) The sum of residuals will always be equal to zero if you include an
intercept term in your model and use OLS to estimate the coefficients.
When an intercept is included in a multiple linear regression with p predictors,
the fitted model is of the form:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

To find all the betas, we differentiate the residual sum of squares with respect
to each of the betas individually, including $\beta_0$. We obtain p+1 equations to
solve for all the p+1 betas. Differentiating with respect to $\beta_0$, we get:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) = 0$$
The above equation is responsible for ensuring that the sum of residuals is 0.
Therefore, C and D are incorrect options.
Thus, B is the correct option. One can think of the intercept as a catch-all
term that shifts the regression line toward or away from the X-axis (say the
response values are on the order of thousands; $\beta_0$ then moves the regression
line by that order, allowing the remaining betas to be tuned more finely).
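As a quick sanity check (a minimal sketch using R's built-in cars data, not
this question's data), the residuals of an OLS fit sum to zero only when the
intercept is included:

> # Residuals sum to zero only when an intercept is estimated by OLS
> fit_with <- lm(dist ~ speed, data = cars)
> fit_without <- lm(dist ~ speed - 1, data = cars) # "- 1" drops the intercept
> sum(residuals(fit_with))    # effectively zero (floating-point noise)
> sum(residuals(fit_without)) # generally non-zero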
Answer 2. (B) These lines will always intersect at the mean (Xmean, Ymean)
These estimates are obtained using the least-squares method, which minimizes
the sum of squared residuals:

$$D(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2$$

The final line is $y = a + b x$, and we find a and b from the equations obtained by setting

$$\frac{\partial D(a, b)}{\partial a} = \frac{\partial D(a, b)}{\partial b} = 0,$$

which gives $a = \bar{y} - b \bar{x}$ and $b = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$.
Since $a = \bar{y} - b \bar{x}$, the fitted line always passes through $(\bar{x}, \bar{y})$;
the same argument applies to the regression of X on Y, so the two lines must
intersect at the mean.
Call:
lm(formula = Salary ~ Experience, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-73.00 -12.82 -1.18 13.32 60.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.4679 2.5673 11.48 <2e-16 ***
Experience 3.0959 0.1113 27.81 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = Experience ~ Salary, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-11.430 -4.867 -1.787 3.489 20.805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.475177 0.841711 -1.753 0.0807 .
Salary 0.233146 0.008385 27.806 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The graph produced is as follows (the two lines meet at the mean of the Xs and the Ys):
> abline(29.4679, 3.0959)
> abline(1.475177/0.233146, 1/0.233146)
Kindly note that while plotting the two lines on the same axes, the line from X
~ Y needs to be transformed to X-Y coordinates, as shown above in the second
line of code. Another point to note is that the difference in slope between
these lines grows as the randomness in the data increases. If the data were
perfectly described by a linear equation, one would obtain the same line in
both cases (i.e., Y ~ X and X ~ Y).
For example, considering the data points {(1,3), (2,5), (3,7), (4,9), (5,11)}, the
lines obtained would be y = 2x + 1 and x = 0.5y − 0.5, which are the same
overlapping line.
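One can verify the intersection point in R (a small sketch on the example
points above): since each fitted line passes through the mean point, predicting
at the mean of the regressor recovers the mean of the response.

> x <- c(1, 2, 3, 4, 5); y <- c(3, 5, 7, 9, 11)
> fit_yx <- lm(y ~ x) # y = 2x + 1
> fit_xy <- lm(x ~ y) # x = 0.5y - 0.5
> predict(fit_yx, data.frame(x = mean(x))) # returns mean(y) = 7
> predict(fit_xy, data.frame(y = mean(y))) # returns mean(x) = 3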
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = new.dist ~ new.speed)
Residuals:
Min 1Q Median 3Q Max
-8.8603 -2.9033 -0.6925 2.8086 13.1678
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.3581 2.0600 -2.601 0.0123 *
new.speed 2.6812 0.2833 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # predict stopping distance at a speed of 7.5 m/s, with an interval
> new.dat <- data.frame(new.speed = 7.5)
>
> predict(ols_reg2, new.dat, interval = "prediction")
fit lwr upr
1 14.7526 5.227095 24.2781
>
> predict(ols_reg2, new.dat, interval = "prediction", level = 0.9)
fit lwr upr
1 14.7526 6.806648 22.69855
Question 7. Now suppose your model is ready and you are asked to
make a prediction for the distance a car will need to stop, in meters,
if it is moving at a speed of 7.5 m/s. You are required to report the
predicted distance and also the lower and upper bounds of the 90%
confidence interval. What will you report?
This does not match any of the given options. Take note of the argument
'interval'. When calculating a range for a prediction, we may come across two
types of intervals, 'confidence' and 'prediction', and it is important to
understand the difference between the two.
Confidence intervals tell you about how well you have determined the mean.
The key point is that the confidence interval tells you about the likely
location of the true population parameter.
Prediction intervals tell you where you can expect to see the next data point
sampled. The key point is that the prediction interval tells you about the
distribution of values, not the uncertainty in determining the population
mean.
Prediction intervals must account for both the uncertainty in estimating the
population mean and the scatter of the data. A prediction interval is therefore
always wider than a confidence interval.
One task might be to forecast the weight of some specific person whose living
circumstances we do not know; here the prediction interval must be used. But if
the task is to predict the average weight of all people sharing the same values
of the explanatory variables at hand, we use the confidence interval. Likewise,
in this problem we ask: given that the vehicle is moving at 7.5 m/s, what is
the likely stopping distance on average?
We observe that the two intervals are centered on the same point, but the
prediction interval is much wider than the confidence interval.
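The contrast can be seen directly by requesting both interval types at the same
level (a sketch assuming the ols_reg2 model and new.dat data frame from above):

> predict(ols_reg2, new.dat, interval = "confidence", level = 0.9)
> predict(ols_reg2, new.dat, interval = "prediction", level = 0.9)
> # Both intervals are centered on the same fitted value;
> # the prediction interval is the wider of the two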
Question 8. Your boss is not happy with the intercept term and asks
you to try dropping it. Let’s drop the intercept as your boss asked.
Which of the following findings will you report back to your boss
concerning the percentage of variation explained by the model?
But when the intercept term is dropped, R uses a modified definition of R²,
which is as follows:

$$R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i y_i^2}$$

That is, the total sum of squares is taken about zero rather than about the mean $\bar{y}$.
R² becomes higher without the intercept not because the model is better, but
because a different definition of R² is used. R² expresses a comparison of the
estimated model with some baseline model, stated as the reduction in sum of
squares relative to the sum of squares of the baseline. In the model with an
intercept, the comparison sum of squares is taken around the mean; without an
intercept, it is taken around zero. The latter is usually much larger, so it is
easier to obtain a large reduction in sum of squares. However, forcing y = 0 at
x = 0 biases the fit, and the resulting coefficients may not be truly meaningful.
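A minimal sketch (using the built-in cars data rather than the homework model)
makes the effect visible:

> with_int <- lm(dist ~ speed, data = cars)
> no_int <- lm(dist ~ speed - 1, data = cars)
> summary(with_int)$r.squared # sum of squares compared around the mean
> summary(no_int)$r.squared   # compared around zero: larger, but not comparable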
Answer 9. (C) 2
We have a categorical variable with three levels; only two dummy variables are
needed because one factor level is taken as the baseline.
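For instance (a hypothetical three-level factor, not the homework data), R's
design matrix contains only two dummy columns, with the first level absorbed
into the intercept:

> region <- factor(c("north", "south", "west", "north", "south"))
> model.matrix(~ region) # columns: (Intercept), regionsouth, regionwest
> # "north" is the baseline level, represented by the intercept alone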
Answer 10. (D) At the same age level, people living in urban areas consume
more calories than people living in rural areas.
The confidence level is 95%, and the p-value for Rural_True (0.012) is less
than 0.05. Therefore, Rural_True has a significant effect on the response
variable, and A is incorrect. B and C fail to account for the effect of age and
are thus also incorrect. Holding age constant, people in rural areas consume 40
fewer calories than those in urban areas. Hence, choice D is the correct
answer.
Answer 11. (A) Interaction effects exist between food and condiment, since the
p-value is much smaller than the significance level (alpha = 5%). We should
include the interaction term in the regression model to explain the variability
in the data.
Since the p-value of the interaction effect is close to zero and less than 5%
(the error rate), we reject the null hypothesis that no interaction effect
exists between food and condiment. Thus, A is the correct answer. If food and
condiment were independent of each other, the interaction effect should not be
significant, making D incorrect. Interaction effects are relevant when
conducting regression analysis, so E is incorrect.
Question 13. When plotting the expected value for food (one line for
chocolate sauce and one line for mustard), which plot are you most
likely to see?
Answer 13. (E) Since the interaction effect between food and condiment is
significant, we should expect the two lines to cross each other. Therefore, A
and C cannot be correct. As for choice B, the statistical results table in the
previous question shows that a hot dog with mustard should be more pleasant
than a hot dog with chocolate sauce, so choice B is also not correct.
As for choice D, if the mustard line were flat, the condiment would have no
significant influence on satisfaction. However, the results table shows that
the condiment does have a significant influence. Therefore, E is the most
likely plot.
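A crossing pattern like choice E can be sketched with base R's
interaction.plot (the satisfaction values below are made up purely to
illustrate the shape and are not taken from the homework data):

> food <- factor(rep(c("hot dog", "ice cream"), each = 2))
> condiment <- factor(rep(c("chocolate", "mustard"), times = 2))
> satisfaction <- c(2, 8, 9, 3) # hypothetical means chosen so the lines cross
> interaction.plot(food, condiment, satisfaction)
> # Non-parallel (crossing) lines are the visual signature of an interaction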
> library(lmtest)
> library(MASS)
>
> mydata = Boston
> OLSModel = lm(medv ~ ., mydata)
>
> summary(OLSModel)
Call:
lm(formula = medv ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(OLSModel)
>
> shapiro.test(residuals(OLSModel))
Shapiro-Wilk normality test
data: residuals(OLSModel)
W = 0.90138, p-value < 2.2e-16
> dwtest(OLSModel)
Durbin-Watson test
data: OLSModel
DW = 1.0784, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Call:
lm(formula = sqrt(medv) ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.35292 -0.26646 -0.04765 0.20693 2.20133
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6171186 0.4830511 13.699 < 2e-16 ***
crim -0.0166326 0.0031107 -5.347 1.37e-07 ***
zn 0.0037856 0.0012993 2.913 0.003737 **
indus 0.0035719 0.0058207 0.614 0.539728
chas 0.2546251 0.0815500 3.122 0.001900 **
nox -1.8316237 0.3615452 -5.066 5.75e-07 ***
rm 0.3045221 0.0395573 7.698 7.62e-14 ***
age 0.0001287 0.0012503 0.103 0.918076
dis -0.1345203 0.0188787 -7.125 3.71e-12 ***
rad 0.0322156 0.0062798 5.130 4.18e-07 ***
tax -0.0013597 0.0003559 -3.820 0.000151 ***
ptratio -0.0941484 0.0123830 -7.603 1.48e-13 ***
black 0.0009775 0.0002542 3.845 0.000136 ***
lstat -0.0603341 0.0048003 -12.569 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(sqrt_OLSModel)
For the most part (fitted values between 5 and 35, where more than 95% of the
data points are observed), the residuals are not aggressively diverging from or
converging toward the fitted line (such behavior is usually seen as a cone
formation, suggesting that the variance of the residuals changes with the
fitted values). Thus, we infer that homoskedasticity holds over most of the
range.
However, there is clearly a pattern in the residuals. We observe that toward
either end of the fitted plot (values less than 5 and greater than 35), there
are no residuals on the negative side of the red line. This indicates that a
non-linear data set is being fitted with a linear model.
Question 15. Let’s check if our residuals are normal by doing visual
inspection of a diagnostic plot. (Normal Q-Q plot). What do you
observe?
Answer 15. (C) There seems to be non-normality, with the distribution being
right-skewed.
As can be seen in the plot, toward the right end of the theoretical quantiles
the standardized residuals rise above the reference line (the sample quantiles
are larger than the corresponding normal quantiles), which in statistical terms
is called being skewed to the right.
One can think of a QQ plot as comparing two distributions by plotting their
quantiles against each other: the theoretical normal quantiles on the X-axis
and the standardized residual quantiles on the Y-axis. If the residuals were
normal, the points would fall along a straight line.
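The same check can be run directly in R (a sketch assuming the OLSModel object
fitted above):

> qqnorm(residuals(OLSModel)) # sample quantiles vs. theoretical normal quantiles
> qqline(residuals(OLSModel)) # reference line for a normal sample
> # Points bending above the line in the right tail indicate right skew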
Question 16. Let’s run a formal test to confirm if there is indeed a
non-normality. This test is called Shapiro-Wilk normality test and
the run command for the same is as follows:
> shapiro.test(residuals(your lm object))
The null hypothesis is that the residuals are normal. Now does the
result from this test match your results from Question 15?
Answer 16. (B) Yes, we get a low p-value in the Shapiro-Wilk test, which means
the residuals are not normally distributed, and the visual inspection in
Question 15 also led to the conclusion that the distribution of the residuals
is non-normal.
The Shapiro-Wilk test produces W = 0.9 with a p-value < 0.05. The null
hypothesis of this test is that the population is normally distributed. Thus,
if the p-value is less than the chosen alpha level (here 0.05, corresponding to
a 95% confidence level), the null hypothesis is rejected and there is evidence
that the data tested are not normally distributed. Therefore, in this case, the
residuals are not normally distributed, which can also be seen in the residual
graphs above.
Question 17. Let’s check for any autocorrelation in the data. The
Durbin-Watson statistic is used for that. The function “dwtest” in the
package “lmtest” can be used for this; dwtest takes your linear
model as input. The NULL hypothesis for this test is that the errors
are uncorrelated. Let’s use that. Type the following code to get
ready: install.packages('lmtest'); require(lmtest). What does the test
tell you?
Answer 17. (B) The small p-value indicates that there might be
autocorrelation.
The Durbin-Watson test produced DW = 1.08 with a p-value < 0.05. The test
evaluates the null hypothesis that the linear regression residuals are
uncorrelated against the alternative hypothesis that autocorrelation exists.
The small p-value thus indicates that there is indeed correlation among the
residuals and that they are not independently distributed.
Answer 18. (D) The non-linear pattern earlier visible in residuals in Question
14 becomes less prominent and the residuals are still not normally
distributed.
We can simply check the residuals-vs-fitted plot using the plot() function,
which shows as below:
The plot is similar in nature to the one in Question 14, but with slightly more
values dropping below the x-axis. The QQ plot further shows that the residuals
are not normally distributed, so we should expect an uneven distribution of
residuals around the fitted line; the non-linear trend is therefore still
expected. But to truly appreciate the effect of the square-root transformation,
we should compare the residuals over the same Y-axis range.
As can be seen, the residuals have dropped significantly closer to 0, and with
this, we can say that the non-linear pattern is less prominent.
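That comparison can be reproduced with a few lines of R (a sketch assuming the
OLSModel and sqrt_OLSModel objects fitted above):

> shared <- range(residuals(OLSModel), residuals(sqrt_OLSModel))
> par(mfrow = c(1, 2)) # two plots side by side
> plot(fitted(OLSModel), residuals(OLSModel), ylim = shared,
+      xlab = "Fitted values", ylab = "Residuals", main = "medv ~ .")
> plot(fitted(sqrt_OLSModel), residuals(sqrt_OLSModel), ylim = shared,
+      xlab = "Fitted values", ylab = "Residuals", main = "sqrt(medv) ~ .")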
> summary(linlin)
Call:
lm(formula = Salary ~ Experience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-73.00 -12.82 -1.18 13.32 60.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.4679 2.5673 11.48 <2e-16 ***
Experience 3.0959 0.1113 27.81 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(linlog)
Call:
lm(formula = Salary ~ LogExperience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-61.700 -21.895 -5.022 16.730 84.879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.991 4.768 -0.418 0.677
LogExperience 34.985 1.704 20.529 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(loglin)
Call:
lm(formula = LogSalary ~ Experience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-1.51651 -0.17318 0.02534 0.19444 0.53280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.640177 0.029106 125.07 <2e-16 ***
Experience 0.037087 0.001262 29.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(loglog)
Call:
lm(formula = LogSalary ~ LogExperience, data = edsal)
Residuals:
Min 1Q Median 3Q Max
-0.99692 -0.19914 -0.00272 0.20315 0.72587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.15767 0.04584 68.88 <2e-16 ***
LogExperience 0.45949 0.01638 28.04 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Question 19. Which of the 4 fitted models has the highest R-square
value?
For the log-lin model, $\log(Y) = b_0 + b_1 X$, i.e. $Y = e^{b_0 + b_1 X}$.
Differentiating gives $\frac{dY}{dX} = b_1 Y$, which can be rearranged as

$$\frac{100 \cdot dY}{Y} = 100 \cdot b_1 \, dX$$

The term $\frac{100 \cdot dY}{Y}$ in the equation above represents the
percentage change in Y. However, both dX and dY are differentials
(infinitesimally small) and thus for our purpose need to be changed to
$\Delta X$ and $\Delta Y$.

For a unit change, $\Delta X = 1$. We thus consider Y at X = 0 and X = 1 to
compute $\Delta Y$:

$$\frac{100 \cdot \Delta Y}{Y} = 100 \cdot \frac{e^{b_0 + b_1 (1)} - e^{b_0 + b_1 (0)}}{e^{b_0 + b_1 (0)}} = 100 \cdot \left( e^{b_1} - 1 \right)$$

For the lin-log model,

$$Y = b_0 + b_1 \log(X) \;\Rightarrow\; \frac{dY}{dX} = b_1 \left( \frac{1}{X} \right) \;\Rightarrow\; dY = b_1 \frac{dX}{X} \;\Rightarrow\; dY = \frac{b_1}{100} \left[ \left( \frac{dX}{X} \right) \cdot 100 \right]$$

The term $\left( \frac{dX}{X} \right) \cdot 100$ in the equation above
represents the percentage change in X. Thus, for a 1% change in Experience,
Y changes by

$$\frac{b_1}{100} = \frac{34.985}{100} = 0.01 \cdot 34.985 \approx 0.35$$
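These interpretations can be checked numerically (plugging the fitted
coefficients from the loglin and linlog summaries above into R):

> # log-lin: each extra year of Experience multiplies Salary by exp(b1)
> 100 * (exp(0.037087) - 1) # ~3.78% increase in Salary per year of Experience
> # lin-log: a 1% increase in Experience adds roughly b1/100 to Salary
> 34.985 / 100 # ~0.35 (in Salary units)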