Practice Final Part 1
We split the data into a training dataset (442 movies) and a testing dataset (100 movies) and
consider three different linear models. These three models use variables selected by LASSO
for three different values of λ.
Here is Model 1:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,440) = 682.97, p = 0.00
## R2 = 0.61
## Adj. R2 = 0.61
##
## Standard errors: OLS
## --------------------------------------------------
##                       Est.    S.E.   t val.      p
## ----------------- -------- ------- -------- ------
## (Intercept)         -42.29   16.45    -2.57   0.01
## ProdBudget            5.57    0.21    26.13   0.00
## --------------------------------------------------
For Model 1, the average prediction error on training data is 23.1. The standard deviation
of these prediction errors is 18.7. The average prediction error on testing data is 24.2. The
standard deviation of these prediction errors is 19.2. You can assume that a linear model
appears to be appropriate.
Here is Model 2:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(3,438) = 307.83, p = 0.00
## R2 = 0.68
## Adj. R2 = 0.68
##
## Standard errors: OLS
## -------------------------------------------------------
##                            Est.    S.E.   t val.      p
## ---------------------- -------- ------- -------- ------
## (Intercept)              -56.50   17.46    -3.24   0.00
## Sequel                    79.30   36.63     2.16   0.03
## MarBudget                  9.19    0.46    19.96   0.00
## Sequel:MarBudget           3.66    0.85     4.30   0.00
## -------------------------------------------------------
For Model 2, the average prediction error on training data is 17.7. The standard deviation
of these prediction errors is 16.9. The average prediction error on testing data is 18.5. The
standard deviation of these prediction errors is 17.8. You can assume that a linear model
appears to be appropriate.
Here is Model 3:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(6,435) = 160.83, p = 0.00
## R2 = 0.69
## Adj. R2 = 0.68
##
## Standard errors: OLS
## -------------------------------------------------------
##                            Est.    S.E.   t val.      p
## ---------------------- -------- ------- -------- ------
## (Intercept)              -99.00   53.10    -1.86   0.06
## ProdBudget                -3.35    0.88    -3.80   0.00
## Sequel                   101.73   36.71     2.77   0.01
## MarBudget                 14.59    1.49     9.80   0.00
## CriticRating               0.45    0.72     0.63   0.53
## AudienceRating             0.24    1.00     0.24   0.81
## Sequel:MarBudget           5.59    0.98     5.72   0.00
## -------------------------------------------------------
For Model 3, the average prediction error on training data is 13.1. The standard deviation
of these prediction errors is 11.0. The average prediction error on testing data is 22.9. The
standard deviation of these prediction errors is 19.2. You can assume that a linear model
appears to be appropriate.
Questions
1. What percent of variability in BoxOffice can be explained by ProdBudget?
a) 5.6%
b) 61%
c) 69%
d) Not enough information.
2. Suppose a movie that is a sequel (Sequel = 1) has a marketing budget of $20 million.
Using Model 2, what would you predict for box office revenue?
a) $22.8 million
b) $127.3 million
c) $206.6 million
d) $279.8 million
5. Recall that the three models all come from LASSO regressions with different values of
λ. Which of the three models comes from the highest value of λ?
a) Model 1
b) Model 2
c) Model 3
d) Not enough information.
6. Suppose we use forward selection instead of LASSO, and suppose it suggests the same
model as Model 2. Which of the following statements is NOT true?
a) The first variable selected might be ProdBudget, which was also the first variable
selected by LASSO.
b) Adding any other variable to Model 2 would increase AIC.
c) Adding any other variable to Model 2 would increase R2.
8. If you were a studio executive, does the regression output in Model 1 mean that
increasing the production budget will cause an increase in profits? Explain why or why
not in at most 3 sentences.
9. If you could only use one of these models to make predictions about BoxOffice, which
model would you use? Explain your answer in at most 2 sentences.
10. How confident are you that the model you chose in the previous answer will make better
predictions than Model 1 (or Model 2, if you chose Model 1)? The best answers will
use a hypothesis test to compare the two models (or identify a plausible hypothesis test
but describe why its results may not tell the whole story).
11. Is there any other information you would want to have to answer these questions? If so,
feel free to make an assumption in order to answer the question, and explain what you
assumed here. If not, leave this blank.
Solutions
1:b. This is the definition of R2 applied to Model 1.
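The prediction asked for in Question 2 can be checked by plugging the values into Model 2. The following is a minimal sketch, assuming BoxOffice and MarBudget are both measured in millions of dollars (as the answer choices suggest); the function name `predict_box_office` is hypothetical.

```python
# Model 2 coefficients as reported in the regression output.
INTERCEPT = -56.50
B_SEQUEL = 79.30
B_MARBUDGET = 9.19
B_INTERACTION = 3.66  # coefficient on Sequel:MarBudget


def predict_box_office(sequel, mar_budget):
    """Predicted BoxOffice from Model 2 (units assumed: millions of dollars)."""
    return (INTERCEPT
            + B_SEQUEL * sequel
            + B_MARBUDGET * mar_budget
            + B_INTERACTION * sequel * mar_budget)


# A sequel (Sequel = 1) with a $20 million marketing budget.
prediction = predict_box_office(sequel=1, mar_budget=20)
print(round(prediction, 1))  # 279.8
```

Note that the interaction term contributes only when Sequel = 1, which is why the sequel's prediction picks up both the Sequel shift and the steeper MarBudget slope.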
3:a. The relevant coefficient is the one on the interaction term between Sequel and
MarBudget. This p-value is below 0.05, which means it is statistically significant at the 5%
level.
4:a. Yes, definitely. Comparing Models 1 and 3, we see that the coefficient on ProdBudget
changes sign from positive to negative. This is a strong sign that it is collinear with other
predictor variables, though it is not clear which of the variables it is collinear with.
5:a. The largest value of λ imposes the largest penalty on including variables, so we would
select fewer variables in that case. The correct answer is then Model 1, which has the fewest
variables.
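The link between λ and the number of selected variables can be illustrated in a simplified setting. The sketch below assumes orthonormal predictors, where the LASSO estimate of each coefficient is the soft-thresholded least-squares coefficient. The coefficient values are hypothetical and are not taken from the movie data, and with correlated predictors the relationship is less tidy, though the same trend typically holds.

```python
def soft_threshold(beta, lam):
    """LASSO estimate for one coefficient, assuming an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


# Hypothetical unpenalized coefficients (NOT estimated from the movie data).
ols_coefs = {"x1": 5.0, "x2": -2.0, "x3": 0.8, "x4": -0.3}


def n_selected(lam):
    """Number of coefficients that remain nonzero at penalty lam."""
    return sum(1 for b in ols_coefs.values() if soft_threshold(b, lam) != 0.0)


# Larger penalties zero out more coefficients.
counts = [n_selected(lam) for lam in (0.1, 0.5, 1.0, 3.0)]
print(counts)  # [4, 3, 2, 1]
```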
6:a. This cannot be true. Once forward selection picks a variable, it remains selected. So if it
arrives at a model with 3 other variables (and not ProdBudget), it could not have selected
ProdBudget first. This is a big difference between LASSO and forward selection.
7: If the production budget were to increase by $1 million, we would predict box office
revenue to increase by $5.57 million. This coefficient is statistically significant at the 5%
level because the p-value is below 0.05.
8: No! Correlation does not imply causation. For example, it could be that movies that are
expected to do well based on the quality of the scripts are given larger budgets.
9: I would suggest using Model 2, as it has the lowest error on test data. It seems that this
model finds the sweet spot of testing error. It uses enough variables to reduce training and
testing error, but not so many that it overfits. The avoidance of overfitting can be seen by
noting that the training and testing errors are quite similar.
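The comparison can be laid out numerically from the reported errors. This is a small sketch using only the average training and testing errors stated for each model; the dictionary layout and variable names are illustrative.

```python
# (average training error, average testing error) as reported for each model.
errors = {
    "Model 1": (23.1, 24.2),
    "Model 2": (17.7, 18.5),
    "Model 3": (13.1, 22.9),
}

# Gap between testing and training error; a large gap suggests overfitting.
gaps = {name: round(test - train, 1) for name, (train, test) in errors.items()}

# Model with the lowest average error on the test data.
best = min(errors, key=lambda name: errors[name][1])

print(gaps)  # {'Model 1': 1.1, 'Model 2': 0.8, 'Model 3': 9.8}
print(best)  # Model 2
```

Model 3 has the lowest training error but by far the largest gap, which is the pattern one would expect from overfitting.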
10: We want to assess whether the true average prediction errors differ between Models 1
and 2. In other words, we want to compare the test data prediction errors using a two-sample
hypothesis test for means. Our null hypothesis is that the true average prediction errors are
equal.
For Model 1, the test error has an average of x1 = 24.2 with a standard deviation of
s1 = 19.2. For Model 2, the test error has an average of x2 = 18.5 with a standard deviation
of s2 = 17.8. The sample sizes are n1 = n2 = 100.
Using our formula for the appropriate test statistic, we get:
$$\frac{24.2 - 18.5}{\sqrt{19.2^2/100 + 17.8^2/100}} \approx 2.177$$
Because this is slightly over 2, we can reject the null hypothesis at the 5% level. There is a
less than 5% chance that we would see a difference at least this large if the two models had
equal average prediction errors.
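The test statistic can be reproduced with a few lines of code. This is a minimal sketch, assuming the two sets of test errors are treated as independent samples and that a standard normal approximation is adequate for n = 100; the function name `two_sample_z` is hypothetical.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample test statistic for a difference in means (unpooled variances)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Test-data prediction errors: Model 1 versus Model 2.
z = two_sample_z(24.2, 19.2, 100, 18.5, 17.8, 100)

# Two-sided p-value under a standard normal approximation (an assumption).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3))  # 2.177
print(p_value < 0.05)  # True
```

One caveat is that both models are evaluated on the same 100 test movies, so the two sets of errors are not truly independent. A paired comparison of the per-movie errors would arguably be more appropriate, but it requires the individual errors, which are not reported here.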
If you provide a compelling reason why a different model might be better, based on some
contextual knowledge or other rationale, that can be entirely acceptable, but you should
identify this as the relevant hypothesis test to conduct.