Practice Final Part 1

The document outlines a practice final exam for a course on Data and Decisions, focusing on analyzing a dataset of 542 Action/Adventure movies to predict box office success based on various features. It presents three linear regression models with different predictor variables and their respective performance metrics, along with a series of questions and solutions related to the models' outputs and implications. The exam assesses understanding of concepts such as model selection, statistical significance, and the relationship between production budgets and box office revenue.

Uploaded by

oliviadtush
Practice Final Exam: Part 1

MGMTFT 402 - Data and Decisions

Data and Analysis


Think back to the first class, where we discussed which movies make the largest return
on investment. Suppose you want to better understand what features lead to box office
success vs. box office failure. We are going to focus on a subset of movies in the genre
“Action/Adventure” and with a total budget greater than $10 million.
Specifically, we have a dataset of 542 movies, with the following features as predictor variables:
Predictors / features:
• ProdBudget: Production budget (in millions of $)
• Year: Year movie was made
• MarBudget: Marketing budget (in millions of $)
• Runtime: Length of movie (in minutes)
• Sequel: Dummy variable indicating whether movie is a sequel (1) or not (0)
• CriticRating: Average critic rating at the end of the opening weekend (out of 100)
• AudienceRating: Average audience rating at the end of the opening weekend (out of
100)
We will also consider an interaction term between Sequel and MarBudget.
Our goal is to make good predictions for the following variable:
Target / label:
• BoxOffice: Total box office revenue (in millions of $)

We split the data into a training dataset (442 movies) and a testing dataset (100 movies) and
consider three different linear models. These three models use variables selected by LASSO
for three different values of λ.

Here is Model 1:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,440) = 682.97, p = 0.00
## R² = 0.61
## Adj. R² = 0.61
##
## Standard errors: OLS
## --------------------------------------------------
##                       Est.    S.E.   t val.      p
## ----------------- -------- ------- -------- ------
## (Intercept)         -42.29   16.45    -2.57   0.01
## ProdBudget            5.57    0.21    26.13   0.00
## --------------------------------------------------
For Model 1, the average prediction error on training data is 23.1. The standard deviation
of these prediction errors is 18.7. The average prediction error on testing data is 24.2. The
standard deviation of these prediction errors is 19.2. You can assume that a linear model
appears to be appropriate.

Here is Model 2:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(3,438) = 307.83, p = 0.00
## R² = 0.68
## Adj. R² = 0.68
##
## Standard errors: OLS
## -------------------------------------------------------
##                            Est.    S.E.   t val.      p
## ---------------------- -------- ------- -------- ------
## (Intercept)              -56.50   17.46    -3.24   0.00
## Sequel                    79.30   36.63     2.16   0.03
## MarBudget                  9.19    0.46    19.96   0.00
## Sequel:MarBudget           3.66    0.85     4.30   0.00
## -------------------------------------------------------
For Model 2, the average prediction error on training data is 17.7. The standard deviation
of these prediction errors is 16.9. The average prediction error on testing data is 18.5. The
standard deviation of these prediction errors is 17.8. You can assume that a linear model
appears to be appropriate.

Here is Model 3:
## MODEL INFO:
## Observations: 442
## Dependent Variable: BoxOffice
## Type: OLS linear regression
##
## MODEL FIT:
## F(6,435) = 160.83, p = 0.00
## R² = 0.69
## Adj. R² = 0.68
##
## Standard errors: OLS
## -------------------------------------------------------
##                            Est.    S.E.   t val.      p
## ---------------------- -------- ------- -------- ------
## (Intercept)              -99.00   53.10    -1.86   0.06
## ProdBudget                -3.35    0.88    -3.80   0.00
## Sequel                   101.73   36.71     2.77   0.01
## MarBudget                 14.59    1.49     9.80   0.00
## CriticRating               0.45    0.72     0.63   0.53
## AudienceRating             0.24    1.00     0.24   0.81
## Sequel:MarBudget           5.59    0.98     5.72   0.00
## -------------------------------------------------------
For Model 3, the average prediction error on training data is 13.1. The standard deviation
of these prediction errors is 11.0. The average prediction error on testing data is 22.9. The
standard deviation of these prediction errors is 19.2. You can assume that a linear model
appears to be appropriate.

Questions
1. What percent of variability in BoxOffice can be explained by ProdBudget?
a) 5.6%
b) 61%
c) 69%
d) Not enough information.

2. Suppose a movie that is a sequel (Sequel = 1) has a marketing budget of $20 million.
Using Model 2, what would you predict for box office revenue?
a) $22.8 million
b) $127.3 million
c) $206.6 million
d) $279.8 million

3. Based on Model 3 and using a significance level of α = 0.05, is there a statistically
significant difference in the impact of marketing budget on films that are sequels
compared to films that are NOT sequels?
a) Yes, because the relevant p-value is below 0.05
b) Yes, because the relevant p-value is NOT below 0.05
c) No, because the relevant p-value is below 0.05
d) No, because the relevant p-value is NOT below 0.05

4. Does ProdBudget appear to be collinear with other predictor variables?
a) Yes
b) No
c) Not enough information.

5. Recall that the three models all come from LASSO regressions with different values of
λ. Which of the three models comes from the highest value of λ?
a) Model 1
b) Model 2
c) Model 3
d) Not enough information.

6. Suppose we use forward selection instead of LASSO, and suppose it suggests the same
model as Model 2. Which of the following statements is NOT true?
a) The first variable selected might be ProdBudget, which was also the first variable
selected by LASSO.
b) Adding any other variable to Model 2 would increase AIC.
c) Adding any other variable to Model 2 would increase R².

7. Consider Model 1. What is your interpretation of the coefficient on ProdBudget? Is
this coefficient statistically significant at the α = 0.05 level?

8. If you were a studio executive, does the regression output in Model 1 mean that
increasing the production budget will cause an increase in profits? Explain why or why
not in at most 3 sentences.

9. If you could only use one of these models to make predictions about BoxOffice, which
model would you use? Explain your answer in at most 2 sentences.

10. How confident are you that the model you chose in the previous answer will make better
predictions than Model 1 (or Model 2, if you chose Model 1)? The best answers will
use a hypothesis test to compare the two models (or identify a plausible hypothesis test
but describe why its results may not tell the whole story).

11. Is there any other information you would want to have to answer these questions? If so,
feel free to make an assumption in order to answer the question, and explain what you
assumed here. If not, leave this blank.

Solutions
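1:b. This is the definition of R² applied to Model 1. ProdBudget is the only predictor in
Model 1, so that model's R² of 0.61 is the share of variability in BoxOffice it explains.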
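
As a consistency check on the reported output, R² can be recovered from the overall F statistic and its degrees of freedom using the standard OLS identity R² = F·k / (F·k + df_resid), where k is the number of predictors. The sketch below applies that identity to the three F statistics printed above; the function name is illustrative only.

```python
def r_squared_from_f(f_stat: float, df_model: int, df_resid: int) -> float:
    """Recover R^2 from the overall F statistic of an OLS fit."""
    explained = f_stat * df_model
    return explained / (explained + df_resid)


# (F statistic, model df, residual df) copied from the three model summaries.
models = {
    "Model 1": (682.97, 1, 440),
    "Model 2": (307.83, 3, 438),
    "Model 3": (160.83, 6, 435),
}

r2 = {name: r_squared_from_f(*stats) for name, stats in models.items()}
for name, value in r2.items():
    print(f"{name}: R^2 = {value:.2f}")
```

The recovered values round to the R² figures shown in each model summary.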

2:d. This involves plugging values into Model 2:

\[ y = -56.5 + 9.19 \cdot 20 + 79.3 \cdot 1 + 3.66 \cdot 20 \cdot 1 = 279.8 \]
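
The same plug-in calculation can be written as a small function. This is a minimal sketch using the rounded Model 2 coefficients from the output above; the function name `predict_model2` is hypothetical.

```python
def predict_model2(mar_budget: float, sequel: int) -> float:
    """Predicted BoxOffice (in $ millions) from the rounded Model 2 coefficients."""
    intercept = -56.50
    b_sequel = 79.30
    b_mar_budget = 9.19
    b_interaction = 3.66  # coefficient on Sequel:MarBudget
    return (
        intercept
        + b_sequel * sequel
        + b_mar_budget * mar_budget
        + b_interaction * sequel * mar_budget
    )


sequel_prediction = predict_model2(mar_budget=20, sequel=1)
non_sequel_prediction = predict_model2(mar_budget=20, sequel=0)
print(f"Sequel: {sequel_prediction:.1f}, non-sequel: {non_sequel_prediction:.1f}")
```

Setting `sequel=0` drops both the Sequel term and the interaction term, which reproduces distractor (b), $127.3 million.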

3:a. The relevant coefficient is the one on the interaction term between Sequel and
MarBudget. This p-value is below 0.05, which means it is statistically significant at the 5%
level.
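
To see where the p-values in the Model 3 table come from, the sketch below recomputes each t value as estimate divided by standard error and converts it to a two-sided p-value. It assumes a standard normal approximation to the t distribution (reasonable with 435 residual degrees of freedom). Because the printed estimates are rounded, the recomputed values can differ slightly from the table.

```python
import math


def two_sided_p(t_value: float) -> float:
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return math.erfc(abs(t_value) / math.sqrt(2))


# (estimate, standard error) pairs copied from the Model 3 output.
model3 = {
    "(Intercept)": (-99.00, 53.10),
    "ProdBudget": (-3.35, 0.88),
    "Sequel": (101.73, 36.71),
    "MarBudget": (14.59, 1.49),
    "CriticRating": (0.45, 0.72),
    "AudienceRating": (0.24, 1.00),
    "Sequel:MarBudget": (5.59, 0.98),
}

alpha = 0.05
significant = []
for name, (estimate, std_error) in model3.items():
    t_value = estimate / std_error
    p_value = two_sided_p(t_value)
    if p_value < alpha:
        significant.append(name)
    print(f"{name:<18} t = {t_value:6.2f}  p = {p_value:.3f}")
```

The interaction term Sequel:MarBudget is among the coefficients flagged as significant, which matches answer (a).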

4:a. Yes, definitely. Comparing Models 1 and 3, we see that the coefficient on ProdBudget
changes sign from positive (5.57) to negative (−3.35). This is a strong sign that it is collinear
with other predictor variables, though it is not clear which of the variables it is collinear with.

5:a. The largest value of λ imposes the largest penalty on nonzero coefficients, so LASSO
selects fewer variables in that case. The correct answer is then Model 1, which has the fewest
variables.
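
The link between λ and the number of selected variables can be illustrated with soft-thresholding. Under the simplifying assumption of orthonormal predictors, each LASSO coefficient is the OLS coefficient shrunk toward zero by λ and set to exactly zero once its magnitude falls below λ. The coefficient values below are hypothetical and are not estimated from the movie dataset.

```python
def soft_threshold(beta: float, lam: float) -> float:
    """LASSO coefficient for an orthonormal design: shrink by lam, floor at zero."""
    magnitude = max(abs(beta) - lam, 0.0)
    return magnitude if beta >= 0 else -magnitude


# Hypothetical OLS coefficients for five generic predictors (illustration only).
ols_coefficients = {"x1": 9.0, "x2": 4.0, "x3": -2.5, "x4": 0.6, "x5": 0.3}

selected_counts = []
for lam in (0.5, 3.0, 5.0):
    kept = [
        name
        for name, beta in ols_coefficients.items()
        if soft_threshold(beta, lam) != 0.0
    ]
    selected_counts.append(len(kept))
    print(f"lambda = {lam}: kept {kept}")
```

As λ grows, the list of surviving predictors can only shrink, which is why the one-variable model corresponds to the largest λ.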

6:a. This cannot be true. Once forward selection picks a variable, it remains selected. So if it
arrives at a model with 3 other variables (and not ProdBudget), it could not have selected
ProdBudget first. This is a big difference between LASSO and forward selection.

7: If the production budget were to increase by $1 million, we would predict box office
revenue to increase by $5.57 million. This coefficient is statistically significant at the 5%
level because the p-value is below 0.05.

8: No! Correlation does not imply causation. For example, it could be that movies that are
expected to do well based on the quality of the scripts are given larger budgets.

9: I would suggest using Model 2, as it has the lowest error on test data. It seems that this
model finds the sweet spot of testing error. It uses enough variables to reduce training and
testing error, but not so many that it overfits. The avoidance of overfitting can be seen by
noting that the training and testing errors are quite similar.
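
The comparison behind this answer can be laid out explicitly. The sketch below takes the average training and testing errors reported for each model, picks the model with the lowest testing error, and computes the train-to-test gap as a rough overfitting signal.

```python
# (average training error, average testing error) as reported for each model.
errors = {
    "Model 1": (23.1, 24.2),
    "Model 2": (17.7, 18.5),
    "Model 3": (13.1, 22.9),
}

# Choose the model with the lowest error on held-out data.
best_model = min(errors, key=lambda name: errors[name][1])

# A large gap between testing and training error suggests overfitting.
gaps = {name: test - train for name, (train, test) in errors.items()}
most_overfit = max(gaps, key=gaps.get)

print(f"Lowest test error: {best_model}")
for name, gap in gaps.items():
    print(f"{name}: test minus train = {gap:.1f}")
```

Model 3 has the lowest training error but by far the largest gap, which is the pattern expected from an overfit model.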

10: We want to assess whether the true average prediction errors will be different for Models 1
and 2. In other words, we want to compare the test data prediction errors using a two-sample
hypothesis test for means. Our null hypothesis is that the mean prediction errors are equal.

For Model 1, the test error has an average of \(x_1 = 24.2\) with a standard deviation of
\(s_1 = 19.2\). For Model 2, the test error has an average of \(x_2 = 18.5\) with a standard deviation
of \(s_2 = 17.8\). The sample sizes are \(n_1 = n_2 = 100\).

Using our formula for the appropriate test statistic, we get:

\[ \frac{24.2 - 18.5}{\sqrt{19.2^2/100 + 17.8^2/100}} \approx 2.177 \]

Because this is slightly over 2, we can reject the null hypothesis that the mean prediction
errors are equal. If the two models had equal mean prediction errors, there would be a less
than 5% chance of seeing a difference at least this large.
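
The test statistic can be reproduced in a few lines. This is a minimal sketch that, like the solution, treats the two sets of test errors as independent samples and uses a normal approximation for the two-sided p-value. Both are assumptions: the errors come from the same 100 test movies, so a paired test might be more appropriate if the per-movie errors were available.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample test statistic and two-sided p-value (normal approximation)."""
    std_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / std_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Test-set error summaries for Model 1 and Model 2, as reported above.
z_stat, p_value = two_sample_z(24.2, 19.2, 100, 18.5, 17.8, 100)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```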

If you provide a compelling reason why a different model might be better, based on some
contextual knowledge or other rationale, that can be entirely acceptable, but you should
identify this as the relevant hypothesis test to conduct.
