
IS5740 W05 Tutorial Note (Regression)

The document describes using linear regression to predict movie box office collections for Netflix. Various regression models are created and compared using a movie dataset containing 506 movies and 18 variables. Models include a full regression, reduced regression removing correlated variables, and backward and forward stepwise regressions. Model performance is evaluated on training and validation data using RMSE and other fit statistics.


IS5740 Mgt. Support & BI Systems

W05 Linear Regression


Business Problem: As an OTT platform, Netflix intends to introduce promotional programs
for new movies in their early stages. In pursuit of more impactful promotions, Netflix aims
to forecast the box office collections for the initial three months following a movie's
release.
I. Data Source: Movie.xlsx
- Suppose you're working at Netflix, where you've been tasked with predicting the box
office collection for the three months following a movie's release. The insights gained
from this prediction will aid your department in deciding on marketing expenses for
promotional campaigns.
- Your department has historical data on 506 movies, with 18 variables for each movie.
Name                   Role    Level     Description
Marketing expense      Input   Interval  Expense for promotions
Production expense     Input   Interval  Expense for movie production
Multiplex coverage     Input   Interval  Percentage of multiplexes showing a movie
Budget                 Input   Interval  Total budget for Production, Meeting, and Casting Fees
Movie_length           Input   Interval  The length of a movie (minutes)
Lead_Actor_Rating      Input   Interval  A lead actor’s rating
Lead_Actress_rating    Input   Interval  A lead actress’s rating
Director_rating        Input   Interval  A director’s rating
Producer_rating        Input   Interval  A producer’s rating
Critic_rating          Input   Interval  Critics’ score
Trailer_views          Input   Interval  Number of movie trailer views
_3D_available          Input   Binary    Whether 3D was used
Twitter_hashtags       Input   Interval  Number of Twitter hashtags
Genre                  Input   Nominal   Genre (i.e., Action, Comedy, Drama, and Thriller)
Avg_age_actors         Input   Interval  Average age of all actors
Time_taken             Input   Interval  The number of days after its release
Num_multiplex          Input   Interval  Number of multiplexes showing a movie
Boxoffice              Target  Interval  Number of tickets sold in Box-Offices

II. Predict the box-office collection by a regression model


1. Create your SAS E-Miner project
2. Download Movie.xlsx from the Canvas.


3. Create your diagram called “W05_Movie”


4. Select the “File Import” node in the Sample tab and add the node to your diagram
5. Rename the “File Import” node to “Movie” by right-clicking it.

6. Click the “Movie” File Import node and click the button next to the “Import File” option.
Find your Movie.xlsx and import it.

7. Right-click the node and select “Edit Variables”.


8. Edit the variables of the movie dataset so that each role and level matches the variable
table in Section I.

9. Click on the 'Movie' data node and navigate to the property window. Change the
'Summarize' option to 'Yes'.

10. Let’s conduct data visualization using the GraphExplore node. Don’t forget to set the Size
option to “Max”.


a. Check the histogram of the target variable.

b. Create the scatter plot matrix of all input variables, and then examine it for correlated
variables.

11. Select the Data Partition node icon in the Sample tab. Drag the node into the Diagram
Workspace. Connect the Movie data node to the Data Partition node. Click the “Data
Partition” node and, in the property window, set 50% for training, 50% for validation,
and 0% for test data. Make sure that the total, training, and validation datasets are all
balanced.
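
For readers who want to see what a 50/50 partition amounts to outside the tool, the following is a minimal Python sketch. It assumes a simple random split with a fixed seed; the function name `partition`, the seed value, and the use of placeholder row ids are illustrative assumptions, not settings taken from the Data Partition node.

```python
import random

def partition(rows, train_fraction=0.5, seed=12345):
    """Randomly split rows into (train, validation) lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 506 placeholder row ids stand in for the 506 movies in Movie.xlsx.
train, validation = partition(range(506))
print(len(train), len(validation))  # 253 253
```

With 506 rows and a 50% training fraction, each partition holds 253 rows and no row appears in both.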

12. Check the result of data partition.

13. The Regression node fits both linear and logistic regression models. Select the Model
tab. Drag a Regression node into the diagram workspace. Connect the Data Partition node to
the Regression node.
a. Rename the Regression node to “Full Regression”.
b. Select the Regression node and examine the Properties panel. By default, the
regression type is logistic, so change it to “Linear Regression”.
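
As a rough illustration of what a linear regression with all inputs computes, the sketch below fits ordinary least squares by solving the normal equations in plain Python. The helper names `solve`, `fit_linear`, and `predict` and the toy data are assumptions made for this example; they are not the Regression node's internal algorithm or the movie data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(X, y):
    """Return [intercept, b1, ..., bk] minimising the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]                # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, x):
    """Predicted target for one row of inputs."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], x))

# Toy data generated from y = 1 + 2*x1 + 3*x2 (not the movie data).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a + 3 * b for a, b in X]
coef = fit_linear(X, y)
```

On this noise-free toy data the fitted coefficients recover the intercept 1 and slopes 2 and 3 up to floating-point error.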


14. Run the Full Regression node.


15. First, go to the property window. Click the small button on the right side of the
“Exported Data” property.

16. In the pop-up window, select “VALIDATE”, and click the “Browse...” button.

17. You can see the columns of ‘Box Office’, and ‘Predicted Box Office’, and ‘Residual Box
Office’
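
The relationship between the three columns can be sketched in a few lines of Python. This assumes the common convention that a residual is the actual value minus the predicted value; the numbers are made-up placeholders, not rows from the exported VALIDATE table.

```python
def residuals(actual, predicted):
    """Residual for each row: actual value minus predicted value (assumed convention)."""
    return [a - p for a, p in zip(actual, predicted)]

# Placeholder values for illustration only.
box_office = [100.0, 80.0, 120.0]
predicted_box_office = [90.0, 85.0, 120.0]
residual_box_office = residuals(box_office, predicted_box_office)
print(residual_box_office)  # [10.0, -5.0, 0.0]
```

A positive residual means the model under-predicted that movie, and a negative residual means it over-predicted.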


18. Right-click the “Full Regression” node and select “Results”.

- Score Rankings Matrix: The data are sorted by the target variable in ascending order. The
Y-axis shows the target variable, and the X-axis shows the percentage of observations used.

- Effects Plot: Displays a bar graph of the absolute values of the coefficients in the
final model. The bars are color coded to indicate the algebraic signs of the coefficients.
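
The ordering behind an effects plot can be imitated with a short Python sketch: rank the coefficients by absolute value and keep the sign as a separate label. The function name `effects` and the coefficient values are hypothetical placeholders, not output from the Full Regression node.

```python
def effects(coefficients):
    """Rank effects by absolute coefficient value, keeping each sign as a label."""
    ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, abs(value), "+" if value >= 0 else "-") for name, value in ranked]

# Hypothetical coefficients for three unnamed inputs.
example = {"var_a": 0.5, "var_b": -2.0, "var_c": 1.25}
for name, size, sign in effects(example):
    print(f"{name:6s} {sign} {'#' * int(size * 4)}")   # crude text bar
```

Here `var_b` is listed first because its coefficient has the largest magnitude, even though its sign is negative.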

a. Maximize the Output window. Check the R-square and model significance. Also check which
variables have significant effects on the target variable.


b. Restore the Output window to its original size by double-clicking its title bar. Maximize the
Fit Statistics window.

If estimate predictions (predicted values of the target) are the focus, model fit can be
assessed by RMSE (root mean squared error). There appears to be some discrepancy between the
RMSE values in the training and validation data.
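
A minimal Python sketch of the RMSE comparison follows. It assumes RMSE is the square root of the mean squared residual with the number of observations as the divisor; the tool's reported value may use a different divisor, and the numbers below are placeholders rather than results from this tutorial.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, dividing by the number of observations (assumption)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder values: the same model predicting two different partitions.
train_rmse = rmse([10, 20, 30, 40], [12, 18, 32, 38])   # every error is 2
valid_rmse = rmse([10, 20, 30, 40], [14, 16, 34, 36])   # every error is 4
print(train_rmse, valid_rmse)  # 2.0 4.0
```

A validation RMSE noticeably above the training RMSE, as in this toy case, is the kind of discrepancy that suggests the model fits the training data better than it generalizes.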

III. Model Selection


1. Reduced variables
Look at the scatter plot matrix. Some variables are highly correlated: the ratings
(Director_rating, Lead_Actor_Rating, Lead_Actress_rating, and Producer_rating).
a. Add another Regression node, select it, and examine the Properties panel. By default,
the regression type is logistic, so change it to “Linear Regression”.
Rename the node to “Reduced Regression”.


b. Right-click the node and open “Edit Variables”. Set “Use” to “No” for the following
variables:
Lead_Actor_Rating, Lead_Actress_rating, and Producer_rating

c. Run the Reduced Regression node, and check the results.
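
The correlation check that motivates dropping the rating variables can be sketched in Python as follows. The 0.9 threshold, the function names, and every data value are assumptions chosen for illustration; only the variable names come from the dataset description.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return pairs of column names whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Made-up values for four movies; NOT taken from Movie.xlsx.
columns = {
    "Director_rating": [7.0, 8.0, 9.0, 6.5],
    "Producer_rating": [7.1, 8.2, 9.1, 6.4],
    "Movie_length": [120, 95, 110, 150],
}
pairs = highly_correlated_pairs(columns)
```

In this toy data only the two rating columns are flagged, which mirrors the reasoning for keeping one rating and setting the others to “No”.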


2. Backward
a. Add a “Regression” node from the Model tab. Rename it to “Backward Regression”.
b. Set Selection Model to “Backward” in the Regression node's Properties panel.
c. Connect the “Data Partition” node to the “Backward Regression” node.
d. Run the node and check the results.
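
The idea of backward selection (start with all inputs, then remove them one at a time) can be sketched in Python. Note the swapped-in criterion: this sketch drops an input whenever doing so does not worsen the validation average squared error, whereas the tool's own stay and stop rules may differ (for example, significance-based tests). All function names, the tolerance, and the toy data are assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(X, y, cols):
    """Least-squares coefficients [intercept, ...] using only the given columns."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def ase(coef, X, y, cols):
    """Average squared error of the fitted model on (X, y)."""
    preds = [coef[0] + sum(b * r[c] for b, c in zip(coef[1:], cols)) for r in X]
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)

def backward(train_X, train_y, valid_X, valid_y, tol=1e-9):
    """Drop one input at a time while validation ASE does not get worse."""
    cols = list(range(len(train_X[0])))
    best = ase(fit(train_X, train_y, cols), valid_X, valid_y, cols)
    while len(cols) > 1:
        trials = []
        for c in cols:
            kept = [k for k in cols if k != c]
            trials.append((ase(fit(train_X, train_y, kept), valid_X, valid_y, kept), kept))
        score, kept = min(trials)
        if score > best + tol:        # every removal hurts: stop
            break
        best, cols = score, kept
    return cols, best

# Toy data: y = 1 + 2*x0, while x1 is irrelevant (not the movie data).
train_X = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 4)]
train_y = [1 + 2 * r[0] for r in train_X]
valid_X = [(6, 2), (7, 9), (8, 5)]
valid_y = [1 + 2 * r[0] for r in valid_X]
kept, best = backward(train_X, train_y, valid_X, valid_y)
```

On this toy data the irrelevant second column is removed and only column 0 is kept.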

3. Forward
Add a “Regression” node from the Model tab. Rename it to “Forward Regression”. Repeat
the steps above, setting Selection Model to “Forward”.

IV. Model Comparisons


1. Add the “Model Comparison” node from the Assess tab, and connect all regression nodes
to the Model Comparison node.


2. Run the model comparison node, and check the results

3. In the Output window, see the “Fit Statistics Model Selection based on Valid: Average Squared
Error”.
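
Selecting a model on validation average squared error reduces to picking the smallest value, as this minimal Python sketch shows. The model labels and error values are hypothetical placeholders, not the results produced in this tutorial, and `select_model` is an illustrative name.

```python
def select_model(valid_ase):
    """Return the name of the model with the lowest validation average squared error."""
    return min(valid_ase, key=valid_ase.get)

# Hypothetical validation errors for four candidate models (placeholders only).
valid_ase = {"model_1": 4.0, "model_2": 3.5, "model_3": 3.0, "model_4": 3.2}
selected = select_model(valid_ase)
print(selected)  # model_3
```

Using the validation partition rather than the training partition for this choice favors the model that generalizes best instead of the one that fits the training data most closely.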


4. Please check the RMSE of all models.


1) Training dataset

2) Validation dataset
