Account Based Analytics Final Spring 2025

The document outlines the final exam for an Advanced Business Analytics course, detailing the structure, points, and time allotted. It includes various questions related to predicting doctor visits and movie screens using statistical models, as well as the use of time-varying covariates in the Cox model. Students are instructed to submit their work on Gradescope and avoid external resources while providing detailed steps and explanations in their Jupyter files.

Uploaded by

kenna.harde

Final Exam

Advanced Business Analytics

Points: 100 Time: 90 minutes

You should be able to access the exam on Gradescope and submit it there.

Read the questions carefully and provide details of the steps in your Jupyter file. If you get stuck,
make sure you provide the details of where and why you got stuck. Most of the time, with some
changes, you should be able to make it work. All problems have clear solutions.

Do not use the Internet, email, messaging apps, or LLMs for answers. Everything was covered in our
lectures.

1. You want to predict the number of doctor visits for a sample of 3,874 users. The data has the
following attributes, which can help with prediction.

Docvis: number of visits to doctor
Hospvis: number of days in hospital
Edlevel: educational level (categorical: 1-4)
Age: age (25-64)
Outwork: out of work=1; working=0
Female: female=1; male=0
Married: married=1; not married=0
Kids: have children=1; no children=0
Hhninc: household yearly income
Educ: years of formal education (7-18)
Self: self-employed=1; not self-employed=0
edlevel1: (1/0) not high school graduate
edlevel2: (1/0) high school graduate
edlevel3: (1/0) university/college
edlevel4: (1/0) graduate school

i. Plot the histogram of doctor visits. What do you observe in the plot? [5]
ii. Why would you want to fit a Zero-Inflated Poisson (ZIP) model? Explain clearly what a ZIP
model is and how the model overcomes excessive zeros. [5]
iii. One challenge is finding a variable that can be used for classification (inflation). Can you
explain why this is a challenge? [5]
iv. You decide to model doctor visits as a function of all covariates except Outwork. You believe
that people who are out of work (and less likely to have insurance) are unlikely to go to the doctor.
Therefore, you use Outwork as the classifier.
Split your data into train (80%) and test (20%). Write the zero-inflated model and estimate
it. Show the results. [10]
v. In your results, is the classifier significant? Explain the magnitude of its estimate. Explain
the magnitude of the estimate on Educ (years of formal education). [5]
vi. You want to make predictions for the “mean” value of doctor visits for the test sample. Make
the prediction. [5]
vii. The mean value already controls for inflated zeros. How will you convert the mean values to
the discrete number of visits? Provide the prediction and plot the histogram comparing
discrete predictions with the actual number of visits. [10]
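
As background for parts ii and vii, here is a minimal sketch of the zero-inflated Poisson probabilities, assuming the standard mixture form: with probability pi an observation is a structural zero, and otherwise it follows a Poisson distribution with rate lam. The helper names (`zip_pmf`, `zip_mean`, `zip_mode`) and the parameter values are hypothetical and are not taken from the exam data. Taking the most probable count is only one possible way to turn a mean into a discrete prediction, not necessarily the one expected here.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing exactly k events with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson probability of observing exactly k events.

    pi  : probability of a structural ("always zero") observation
    lam : Poisson rate for the remaining observations
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    # Zeros come from two sources: structural zeros and Poisson zeros.
    return pi + base if k == 0 else base

def zip_mean(lam: float, pi: float) -> float:
    """Expected count under the ZIP model: (1 - pi) * lam."""
    return (1.0 - pi) * lam

def zip_mode(lam: float, pi: float, k_max: int = 50) -> int:
    """Most probable count in 0..k_max (one hypothetical discrete prediction)."""
    return max(range(k_max + 1), key=lambda k: zip_pmf(k, lam, pi))

# Hypothetical parameter values, chosen only for illustration.
lam, pi = 3.0, 0.4
p_zero_poisson = poisson_pmf(0, lam)   # zeros expected from a plain Poisson
p_zero_zip = zip_pmf(0, lam, pi)       # zeros expected once inflation is added
total = sum(zip_pmf(k, lam, pi) for k in range(51))

print(f"P(0) Poisson: {p_zero_poisson:.4f}, P(0) ZIP: {p_zero_zip:.4f}")
print(f"ZIP mean: {zip_mean(lam, pi):.2f}, ZIP mode: {zip_mode(lam, pi)}")
```

With these illustrative values the mean is a fraction while the most probable count is an integer, which is why the mean and a discrete prediction have to be treated as separate quantities.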

2. You have data on movies released in the last few years. The data contains the attributes (also
described below), which are self-explanatory. Three attributes are numerical, and the rest are
categorical. The numerical attributes are number of screens, budget, and revenues. When a movie
is released in a theater, a certain number of screens are allocated to the movie.

You want to predict the number of screens for a movie. Convert the categorical data into dummies
and take the log of revenues and budget.

You believe that the attributes (Release period, Remake, Franchise, Genre, New Actor, New Director,
and log of budget) can predict the number of screens.

The first 1600 rows are used for training, and the rest for testing. When you use dmatrices, correctly
specify the rows for the training and test sets. Notice that you do not have values for the number of
screens and revenues in the data after the first 1600 rows. You will be making predictions for those.

i. What model will you run? Explain the rationale. Estimate the model and show your
results. [10]
ii. Explain the role of budget. [2]
iii. Predict the number of screens for the test data (all the rows after 1600 - where you do
not have screen number value). Show the predictions for the first 10 observations in the
test set. [5]

You believe that the number of screens and other covariates (Release period, Remake, Franchise,
Genre, New Actor, New director) predict log revenues. To get the parameters, you estimate a linear
regression model (using GLM) in the training data.

iv. What is the estimated magnitude of the coefficient on the number of screens? [10]

With the predicted number of screens from (iii), you can predict revenues in your test data.

v. Predict the log revenues in the test set. Show the predictions for the first 10
observations in the test set. [13]
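
The two-step structure described above can be written out as follows. This is only a sketch under assumptions: the exam does not state which model is used for screens, so a count model with a log link (for example Poisson) is assumed here, and all symbols ($x_i$, $z_i$, $\beta$, $\gamma$, $\alpha$, $\delta$, $\theta$) are introduced for illustration.

```latex
\begin{align}
% Step 1 (assumed count model with log link), fitted on the first 1600 rows
\log \mathbb{E}[\mathrm{screens}_i \mid x_i]
  &= x_i^\top \beta + \gamma \, \log(\mathrm{budget}_i) \\
% Step 2: linear regression for log revenues, fitted on the same rows
\log(\mathrm{revenue}_i)
  &= \alpha + \delta \, \mathrm{screens}_i + z_i^\top \theta + \varepsilon_i \\
% Test rows: plug the predicted screens from step 1 into step 2
\widehat{\log(\mathrm{revenue}_i)}
  &= \hat{\alpha} + \hat{\delta} \, \widehat{\mathrm{screens}}_i + z_i^\top \hat{\theta}
\end{align}
```

Here $x_i$ and $z_i$ stand for the dummy-coded categorical attributes of movie $i$, and $\widehat{\mathrm{screens}}_i$ is the step 1 prediction for a test row where the actual number of screens is missing.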

3. Why do you need time-varying covariates for the Cox model? How does the Cox model maintain
proportionality assumptions when using time-varying covariates? [5]

Suppose you are studying an event for 20 periods. Unfortunately, one of your analysis's covariates
changed value at times 4, 7, and 12. Show the format of the data used by lifelines in Python to
estimate the impact of the covariate on the hazard. [10]
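
A minimal sketch of the long (start, stop] layout for a single subject is given below. It assumes that the covariate takes a new value from each change time onward and that the event is observed at the end of period 20. The covariate values, the column names, and the helper `build_episode_rows` are hypothetical. To keep the sketch dependency-free it uses only the standard library.

```python
def build_episode_rows(subject_id, change_times, values, end_time, event_at_end):
    """Split one subject's follow-up into (start, stop] rows.

    Each row covers a spell during which the covariate is constant.
    `values` needs one entry per spell: len(change_times) + 1 in total.
    """
    if len(values) != len(change_times) + 1:
        raise ValueError("need exactly one covariate value per interval")
    bounds = [0, *change_times, end_time]
    rows = []
    for value, start, stop in zip(values, bounds[:-1], bounds[1:]):
        rows.append({
            "id": subject_id,
            "start": start,
            "stop": stop,
            "covariate": value,
            # The event can only be recorded on the subject's final row.
            "event": int(event_at_end and stop == end_time),
        })
    return rows

# Hypothetical covariate values for the four spells of subject 1.
rows = build_episode_rows(
    subject_id=1,
    change_times=[4, 7, 12],
    values=[0.5, 0.8, 1.1, 0.9],
    end_time=20,
    event_at_end=True,
)
for row in rows:
    print(row)
```

Rows like these can be loaded into a pandas DataFrame and passed to lifelines' `CoxTimeVaryingFitter` via `fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")`. A full data set would stack such rows for every subject.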
