BANA 7052: Applied Linear Regression
Final exam (60 points)
Instructions: The final is open book/notes. This is NOT a group assignment; absolutely
NO working together is allowed. Violating this policy will result in a zero on the final. There is a bonus
question at the end worth five points.
Note: At the top of your submission, please make sure to include the following lines (no
cover page needed):
Student name: <your full name>
Course name: BANA 7052
Assignment: Final exam
Part I: multiple choice
Please clearly circle (or highlight) your multiple choice answers
Question 1: In a multiple linear regression model, which of the following are considered
random?
a. The continuous response or dependent variable Y
b. The expected value of Y
c. The unknown parameter β
d. The unknown error variance σ²
Question 2: When developing a linear regression model, adding additional predictors to
the model will
a. Always increase R²
b. Always increase adjusted R²
c. Sometimes decrease R²
d. Always decrease adjusted R²
Question 3: What is the best way to identify multicollinearity?
a. Including interaction terms in the linear regression model
b. Variance inflation factors (VIFs)
c. Residual diagnostic plots
d. Adjusted R²
Question 4: The definitions of AIC and BIC (or SBC) are:
AIC = −2 ln(L) + 2p
BIC = −2 ln(L) + p ln(n)
where L is the likelihood, p is the number of parameters, and n is the number of
observations (i.e., the sample size). Which one of the following statements is correct?
a. AIC penalizes complex models more than BIC
b. One can only use AIC as a model selection criterion in stepwise regression
c. For the same model, if the AIC is less than the BIC then it is an indication that you
have found a good model
d. When BIC is used as the model selection criterion, usually fewer independent
variables are included in the model compared with the model selected using AIC
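For reference, both criteria can be computed in R for any fitted model with the built-in AIC() and BIC() functions. A minimal sketch, using the mtcars data from Question 19 with arbitrarily chosen predictors (illustrative only):
# Illustration of the AIC/BIC definitions above; predictors chosen arbitrarily.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + disp, data = mtcars)
AIC(fit_small); AIC(fit_large)  # -2*ln(L) + 2p
BIC(fit_small); BIC(fit_large)  # -2*ln(L) + ln(n)*p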
Question 5: In simple linear regression, a confidence interval for the mean response is
narrower for values of X that are:
a. Closer to the mean of the X’s
b. Further from the mean of the X’s
Question 6: When building a linear regression model of the form Y = β₀ + β₁X₁ + β₂X₂ + ϵ for
a data set, the presence of multicollinearity in the data may result in _________ standard
errors for the coefficient estimates than if the data came from an orthogonal design (i.e.,
when X₁ and X₂ are uncorrelated).
a. Smaller
b. Larger
Question 7: What does the following residual versus fitted value plot suggest about a
model relating a single predictor X to the response Y?
a. Heteroscedasticity of the error term
b. A nonlinear relationship between X and Y
c. Non-normality of the error term
d. Satisfactory residuals
Question 8: When a linear regression model is being developed, adding additional
variables to the model will
a. always decrease model SSE
b. always decrease model AIC
c. always increase model adjusted R²
d. always decrease model MSE
Question 9: The ordinary residuals refer to
a. Ȳ − Ŷ
b. Y − Ŷ
c. Y − Ȳ
d. Ŷ − Y
Question 10: For a fitted simple linear regression model, which one of the following
properties is NOT true?
a. The fitted regression line passes through the point (X̄, Ȳ)
b. The residuals sum to zero: ∑ᵢ₌₁ⁿ eᵢ = 0
c. ∑ᵢ₌₁ⁿ Yᵢ = ∑ᵢ₌₁ⁿ Ŷᵢ
d. The residuals, eᵢ, are always independent
Question 11: The diagonal elements of the hat matrix, also referred to as the hat values or
leverage values, measure the influence of observation i on the regression line when
observation i is removed.
a. True
b. False
Question 12: In a regression study, a 95% confidence interval for β₁ was given as (−5, 2).
What does this confidence interval mean?
a. The interval (−5, 2) contains the true β₁ with 95% probability
b. 95% of all possible β₁’s are in (−5, 2)
c. 95% of the interval (−5, 2) contains the true β₁
d. If we were to repeat the experiment many times (i.e., repeatedly take a new sample
of size n and compute the same confidence interval), roughly 95% of the generated
intervals would contain the true β₁
Question 13: In a regression study, a 95% confidence interval for β₁ was given as (−5, 2).
Which of the following is correct?
a. The 90% interval will be wider than the 95% interval
b. The 99% interval will be wider than the 95% interval
Question 14: In simple linear regression, the confidence interval for the mean response is
always narrowest around X̄
a. True
b. False
Question 15: Point A at the far right is likely to be
a. A highly influential point for the regression line
b. A high-leverage point
Question 16: Suppose you have built a multiple linear regression model according to
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ϵ. If you were to test H₀: β₁ = β₃ = 0 versus H₁: β₁ ≠ 0 or β₃ ≠ 0, you
would use
a. t-test
b. F-test based on an ANOVA
c. Partial F-test
Question 17: In simple linear regression, assume we used a t-test for H₀: β₁ = 0 vs. H₁: β₁ ≠ 0,
and we rejected the null hypothesis at the specified α level. We would conclude that:
a. The linear regression model is useful compared with simply using Ȳ to predict Y
b. There is no linear relationship between X and Y
c. The relationship between X and Y is quadratic
d. There is no relationship between X and Y
Question 18: Which of the following cases lead to multicollinearity?
a. There are indicator variables being used as predictors
b. A predictor can be expressed as a linear combination of other predictors
c. The variances across all predictors are not the same
d. The predictors are not normally distributed and are positively skewed
Question 19: What does the last line of output “F-statistic….p-value: …” indicate in the
following multiple regression output?
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## hp -0.02054 0.01215 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
a. All of the coefficients are significantly different from 0 at the α = 0.05 level
b. None of the independent variables explains any of the variation in Y
c. At least one of the independent variables explains some of the variation in Y
d. The model explained 1.061e-10 of the variability in Y
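For reference, the output above comes from the model shown in its Call line; a minimal R sketch to reproduce it:
# Reproduce the summary shown in Question 19 (built-in mtcars data).
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
summary(fit)  # coefficients, residual standard error, R-squared, overall F-test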
Question 20: If one wishes to incorporate seasonal dummy variables for monthly data into
a regression model, how many dummy variables should be in the model?
a. 12
b. 11
c. 10
d. 1
Part II: short answer response
For the multiple linear regression model Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + βₚ₋₁Xₚ₋₁,ᵢ + ϵᵢ:
Question 21: (2 pts.) Recall that the variance inflation factor for Xⱼ is defined as
VIFⱼ = (1 − Rⱼ²)⁻¹. Explain in one sentence what Rⱼ² is. Explain in one sentence why it is
better to use VIFs than pairwise correlations when trying to detect the presence of
multicollinearity in a regression data set.
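For concreteness, a worked value of the formula (the 0.90 here is illustrative, not part of the question): Rⱼ² = 0.90 gives VIFⱼ = (1 − 0.90)⁻¹ = 10.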
Question 22: (2 pts.) For variable selection criteria based on residuals, among PRESS,
SSE, R², adjusted R², and MSE, which are “good” criteria for variable selection? Explain your
reasoning in two sentences.
Question 23: (4 pts.) Please state one possible violation of standard model assumptions
for each of the residual plots below (one sentence for each plot):
Question 24: (4 pts.) Suppose you want to build a linear regression model for the response
variable weight (Y) using the covariates height (X₁) and gender (X₂ = 0 for female and X₂ = 1 for male).
a. Suppose you want to allow the slope (of height) to be the same for both genders
but the intercepts to differ. How would you build the linear regression model? Please
specify the model in one line.
b. Suppose you want to allow both the slope (of height) and the intercept to differ
between genders. How would you build the linear regression model? Please specify
the model in one line.
Question 25: (2 pts.) An engineer has stated: “Reduction of the number of candidate
explanatory variables should always be done using the objective forward stepwise
regression procedure.” Discuss.
Question 26: (4 pts.) A junior investment analyst used a polynomial regression model of
relatively high order in a research seminar on municipal bonds and obtained an R² of 0.991
in the regression of net interest yield of bond (Y) on industrial diversity index of
municipality (X) for seven bond issues. A classmate, unimpressed, said: “You overfitted.
Your curve follows the random effects in the data.”
a. Comment on the classmate’s criticism.
b. Might adjusted R² be more appropriate than R² as a descriptive measure here?
Question 27: (4 pts.) A student who used a regression model that included indicator
variables was upset when receiving only the following output on the multiple regression
printout: XTRANSPOSE X SINGULAR. What does this mean and what is a likely source of the
error?
Question 28: (4 pts.) In a regression study of factors affecting learning time for a certain
task (measured in minutes), gender of the learner was included as a predictor variable (X₂)
that was coded X₂ = 1 if male and X₂ = 0 if female. It was found that the estimated coefficient
of X₂ was β̂₂ = 22.3 with a standard error of 3.8. An observer questioned whether the coding
scheme for gender is fair because it results in a positive coefficient, leading to longer
learning times for males than females. Comment.
Question 29: (2 pts.) A student stated: “Adding predictor variables to a regression model
can never reduce R², so we should include all available predictor variables in the model.”
Comment.
Question 30: (2 pts.) The members of a health spa pay annual membership dues of $300
plus a charge of $2 for each visit to the spa. Let Y denote the dollar cost for the year for a
member and X the number of visits by the member during the year. Express the relation
between X and Y mathematically. Is it a functional relation or a statistical relation?
Question 31: (2 pts.) Evaluate the following statement: “For the least squares method to be
fully valid, it is required that the distribution of Y be normal.”
Question 32: (2 pts.) A member of a student team playing an interactive marketing game
received the following computer output when studying the relation between advertising
expenditures (X) and sales (Y) for one of the team’s products:
• Estimated regression equation: Ŷ = 350.70 − 0.18X
• Two-sided p-value for estimated slope: 0.91
The student stated: “The message I get here is that the more we spend on advertising this
product, the fewer units we sell!” Comment.
Question 33: (2 pts.) What is a residual? Why are residuals important in regression
analysis?
Question 34: (4 pts.) The following regression model is being considered in a water
resources study:
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₁ᵢX₂ᵢ + β₄√X₃ᵢ + ϵᵢ
State the reduced models for testing whether or not:
a. β₃ = β₄ = 0
b. β₃ = 0
c. β₁ = β₂ = 5
d. β₄ = 7
Bonus: (5 pts.) An analyst wanted to fit the regression model
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + βₚ₋₁Xₚ₋₁,ᵢ + ϵᵢ by the method of least squares when it is known that
β₂ = 4. How can the analyst obtain the desired fit using standard statistical software (e.g., R,
Python, or SAS)? No need to run any code, just describe in general how you could
accomplish this.