Account Based Analytics Final Spring 2025
Account Based Analytics Final Spring 2025
You should be able to access the exam on Gradescope and submit it there.
Read the questions carefully and provide details of the steps in your Jupyter file. If you get stuck,
make sure you provide the details of where and why you got stuck. Most of the time, with some
changes, you should be able to make it work. All problems have clear solutions.
Do not use the Internet, email, messaging apps, or LLMs for answers. Everything was covered in our
lectures.
1. You want to predict the number of doctor visits by a sample of 3874 users. The data has the
following attributes which can help with prediction.
i. Plot the histogram of doctor visits. What do you observe in the plot? [5]
ii. Why would you want to fit a Zero-Inflated Poisson (ZIP) model? Explain clearly what a ZIP
model is and how the model overcomes excessive zeros. [5]
iii. One challenge is finding a variable that can be used for classification (inflation). Can you
explain why this is a challenge? [5]
iv. You decide to model doctor visits as a function of all covariates except Outwork. You believe
that people out of work (less likely to have insurance) are unlikely to go to the doctor.
Therefore, you use that as a classifier.
Split your data into train (80%) and test (20%). Write the Zero inflation model and estimate
it. Show the results. [10]
v. In your results, is the classifier significant? Explain the magnitude of that variable. Explain
the magnitude of the estimate on educ (years of formal education) [5]
vi. You want to make predictions for the “mean” value of doc visits for the test sample. Make
the prediction. [5]
vii. The mean value already controls for inflated zeros. How will you convert the mean values to
the discrete number of visits? Provide the prediction and plot the histogram comparing
discrete predictions with the actual number of visits. [10]
2. You have the data on movies released in the last few years. The data outlines the attributes (also
described below) which are self-explanatory. Three attributes are numerical, and the rest are
categorical. The numerical attributes are - number of screens, budget and revenues. When a movie
is released in a theater, a certain number of screens are allocated to the movie.
You want to predict the number of screens for a movie. Convert the categorical data into dummies
and take the log of revenues and budget.
You believe that attributes (Release period, Remake, Franchise, Genre, New Actor, New director,
and log of budget) can predict the number of screens.
The first 1600 rows are used for training, and the rest for test. When you use dmatrices, correctly
specify the rows for training and test set. Notice that you do not have values for the number of
screens and revenues in the data after 1600 rows. You will be making predictions for those.
i. What model will you run? Explain the rationale. Estimate the model and show your
results. [10]
ii. Explain the role of budget. [2]
iii. Predict the number of screens for the test data (all the rows after 1600 - where you do
not have screen number value). Show the predictions for the first 10 observations in the
test set. [5]
You believe that the number of screens and other covariates (Release period, Remake, Franchise,
Genre, New Actor, New director) predict log revenues. To get the parameters, you estimate a linear
regression model (using GLM) in the training data.
With the predicted number of screens from (iii), you can predict revenues in your test data.
v. Predict the log revenues in the test set. Show the predictions for the first 10
observations in the test set. [13]
3. Why do you need time-varying covariates for the Cox model? How does the Cox model maintain
proportionality assumptions when using time-varying covariates? [5]
Suppose you are studying an event for 20 periods. Unfortunately, one of your analysis's covariates
changed value at times 4, 7, and 12. Show the format of the data used by lifelines in Python to
estimate the impact of the covariate on the hazard. [10]