
MNGT 379 - Business Analytics

Module 7 Predictive Data Mining Homework

IMPORTANT: As you complete these problems, save your completed JMP files for each Problem so
that you may submit them with your Solution Word document.

Problem 1 – Salmons Stores

Salmons Stores operates a national chain of women’s apparel stores. Five thousand copies of an
expensive four-color sales catalog have been printed, and each catalog includes a coupon that provides a
$50 discount on purchases of $200 or more. Salmons would like to send the catalogs only to customers
who have the highest probability of using the coupon. The file “Module 7 SalmonsStores Data.xlsx” contains
data from an earlier promotional campaign. For each of 1,000 Salmons customers, three variables are
tracked: last year’s total spending at Salmons, whether they have a Salmons store credit card, and
whether they used the promotional coupon they were sent. Use the data in “Module 7 SalmonsStores
Data.xlsx” to complete the steps below.

Follow the instructions below to partition the data into Training, Validation, and Test Sets, and perform
a Logistic Regression on the data.

Step 1: Let’s start by reviewing our data to ensure it’s configured with the correct data types. After the data is imported, right-click the column header for Customer and choose the first menu item, Column Info, then change the Modeling Type to Nominal. Repeat these steps to change Card and Coupon to Nominal as well.

Step 2: Now, we need to partition the data into Training, Validation, and Test sets. Under Analyze > Predictive Modeling, select Make Validation Column. In this dialog box, you don’t need to select anything; just click OK. In the next window, change the Training Set value to 0.40, the Validation Set value to 0.40, and the Test Set value to 0.20. In the Options, change New Column Name to “SetName” without the quotes, and change the Random Seed to 1. Make sure that your dialog box matches the screenshot to the right before you continue. Once you are satisfied, click Go.
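
If you are curious what the Make Validation Column tool is doing conceptually, here is a minimal Python sketch of a 40%/40%/20% random partition with a fixed seed. This is only an illustration, not part of the JMP workflow; the file name and column layout are assumed from the prompt, and JMP assigns rows to exact proportions rather than probabilistically.

```python
import numpy as np
import pandas as pd

# Load the promotional-campaign data (file name assumed from the prompt).
df = pd.read_excel("Module 7 SalmonsStores Data.xlsx")

# Randomly assign each row to Training (40%), Validation (40%), or Test (20%).
# The fixed seed plays the same role as Random Seed = 1 in the JMP dialog.
rng = np.random.default_rng(seed=1)
df["SetName"] = rng.choice(
    ["Training", "Validation", "Test"], size=len(df), p=[0.40, 0.40, 0.20]
)

# Check that the realized proportions are close to the requested rates.
print(df["SetName"].value_counts(normalize=True))
```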

Step 3: Now we can build the Logistic Regression model. Under the Analyze tab on the menu bar, select Fit Model. Select the Coupon column as the Y variable either by clicking and dragging or by highlighting Coupon and then clicking the Y button. In the upper-right of this dialog box, the Personality should automatically switch to Nominal Logistic; if it does not, go back to Step 1 and re-check your data types. Next, add SetName to the Validation box and add Spending and Card to the Construct Model Effects. Again, make sure that your dialog box matches the screenshot below before you continue. Once you are satisfied, click Run.
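
Conceptually, Nominal Logistic is fitting an ordinary logistic regression of Coupon on Spending and Card. A hedged scikit-learn sketch of the same idea, continuing the hypothetical DataFrame from the sketch above, might look like the following (scikit-learn applies regularization by default, so its coefficients will differ slightly from JMP’s Parameter Estimates):

```python
from sklearn.linear_model import LogisticRegression

# Fit on the Training rows only; Spending and Card predict Coupon use.
train = df[df["SetName"] == "Training"]
valid = df[df["SetName"] == "Validation"]

model = LogisticRegression().fit(train[["Spending", "Card"]], train["Coupon"])

# Misclassification rate on the validation set (compare with JMP's Confusion Matrix).
val_error = 1 - model.score(valid[["Spending", "Card"]], valid["Coupon"])
print(f"Validation misclassification rate: {val_error:.4f}")
```
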
Step 4: First, minimize all the report subsections except for Parameter Estimates. Then, click the red
triangle next to Nominal Logistic Fit for Coupon, and choose Lift Curve, then open the red triangle
again and choose Confusion Matrix (both are near the middle). Take screenshots of the Parameter
Estimates table, the three Lift Curves, and the Confusion Matrix, and paste them into the document
under the appropriate headers below.

Parameter Estimates

Lift Curve on Training Data

Lift Curve on Validation Data

Lift Curve on Test Data

Step 5: Now we can begin working to understand the report. Interpret the output by completing the
sentence, “The smallest classification error on the validation set results from the model… ” in the space
below, rounding parameter values to four decimal places:
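
For reference, the general form of the fitted model looks like the display below; the betas are placeholders to be filled in with your Parameter Estimates, and you should check the note beneath the Parameter Estimates table to see which level of Coupon the log odds refer to.

$$P(\text{Coupon used}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Spending} + \beta_2 \cdot \text{Card})}}$$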

Step 6: Recall that a lift value of 1 indicates that the decile is equally likely to correctly predict observations (customers in this case) compared to choosing randomly, while a value of 1.35 indicates that the decile is 35% more likely to predict customers correctly. Now, with this in mind, consider the Lift Curves we added in Step 4; at what decile should we expect our model to be around twice as good at predicting which customers use a Coupon? Enter your answer in the space below.
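
As a purely hypothetical worked example of how lift is read: if 10% of all customers in the data used the coupon, but 20% of the customers in a given decile (ranked by the model’s predicted probability) used it, then

$$\text{lift for the decile} = \frac{\text{coupon-use rate in the decile}}{\text{overall coupon-use rate}} = \frac{0.20}{0.10} = 2.0,$$

meaning that decile is twice as good as random selection. You are looking for where the Lift Curve is near that value.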

Step 7: Again consider the Lift Curves, and compare the Lift Curve on Training Data to the other two
Lift Curves. Does this suggest that the Regression Equation you defined in Step 5 has good predictive
power, or is there evidence of model overfitting? Justify your answer in the space below.

Problem 2 – BlueOrRed

Suppose that campaign organizers for both the Republican and Democratic parties are interested in
identifying individual undecided voters who would consider voting for their party in an upcoming
election. The file “Module 7 BlueOrRed Data.xlsx” contains data on a sample of voters with tracked
variables, including whether or not they are undecided regarding their candidate preference, age,
whether they own a home, gender, marital status, household size, income, years of education, and
whether they attend church.

Follow the instructions below to partition the data into Training, Validation, and Test Sets, and perform
K Nearest Neighbors on the data.

Step 1: As we did on Problem 1, we’ll need to change the data types for many of our variables. Use the right-click menu > Column Info to change Undecided, HomeOwner, Female, Married, and Church to Nominal (you can hold control or command to select multiple columns at once), and Education to Ordinal.

Step 2: Again like Problem 1, partition the data into Training, Validation, and Test sets using the Make Validation Column tool found under Analyze > Predictive Modeling (if you have a column selected from Step 1, click into the data and push Escape). In the Specify Rates area, set your partition percentages as 0.50, 0.30, and 0.20 respectively, then under Options set the New Column Name to SetName and the Random Seed to 5. Compare your dialog box to the example on the right before continuing. Click Go.

Step 3: Open the K Nearest Neighbors tool found under Analyze > Predictive Modeling (not the one we used in Module 4 under Clustering). Use SetName as the “Validation” Variable and Undecided as the “Y, Response” Variable, and add Age, HomeOwner, Female, HouseholdSize, Income, Education, and Church (i.e. all the remaining variables except for Married) to the list of “X, Factor” Variables. Finally, set the value of Set Random Seed to 10. Click OK.
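
Conceptually, the tool classifies each observation by the majority vote of its k closest training points and reports the misclassification rate for each candidate k. The Python sketch below illustrates the idea only; it is not the exact JMP algorithm or output, the file and column names are assumed from the prompt, and in practice you would typically standardize the X variables before computing distances (omitted here for brevity).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_excel("Module 7 BlueOrRed Data.xlsx")  # file name from the prompt

# 50/30/20 random partition with a fixed seed, analogous to Step 2.
rng = np.random.default_rng(seed=5)
df["SetName"] = rng.choice(["Training", "Validation", "Test"],
                           size=len(df), p=[0.50, 0.30, 0.20])

features = ["Age", "HomeOwner", "Female", "HouseholdSize",
            "Income", "Education", "Church"]  # everything except Married
train = df[df["SetName"] == "Training"]
valid = df[df["SetName"] == "Validation"]

# Try several neighborhood sizes and report validation misclassification rates.
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train[features], train["Undecided"])
    error = 1 - knn.score(valid[features], valid["Undecided"])
    print(f"k = {k:2d}  validation misclassification rate = {error:.3f}")
```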

Step 4: Take screenshots of the Model Selection Chart, Training Table, Validation Table, and Test Table
and include them under the appropriate headings below.

Model Selection Chart

Training Table

Validation Table

Test Table

Step 5: Consider the four screenshots you just took; these report the misclassification percentage for
each value of k, meaning for each number of neighbors used in the k-Nearest Neighbors Classification
procedure. Based on these screenshots, which value for k has the smallest misclassification rate, and
consequently, what is the optimal number of neighbors to use in our analysis? Justify your response in
the space below.

Step 6: Consider the Misclassification Rates reported on the Training, Validation, and Test tables; does the error rate reported on the Training table seem to be optimistic (better than the true performance of the model), conservative (worse than the true performance of the model), or somewhere in the middle? Justify your response in the space below.

Problem 3 – CreditScore

A consumer advocacy agency, Equitable Ernest, is interested in providing a service in which an individual can estimate their own credit score (a continuous measure used by banks, insurance companies, and other businesses when granting loans, quoting premiums, and issuing credit). The file “Module 7 CreditScore Data.xlsx” contains data on an individual’s credit score and other variables. Follow the instructions below to partition the data into Training, Validation, and Test Sets, and create a Classification Tree for the data.

Step 1: Let’s once again start by checking the data types of our variables. Most of them are fine as Continuous, but we need to convert HomeOwner to Nominal. Do that, and move on to Step 2.

Step 2: Once more, partition the data into Training, Validation, and Test sets with a 40%/40%/20% split, name the New Column SetName, and use seed 3. Confirm the settings with the screenshot to the right before continuing.

Step 3: Create a Decision Tree using the Partition tool found under Analyze > Predictive Modeling. Select SetName as the “Validation” Variable, CreditScore as the “Y, Response” Variable, and all the other variables as the “X, Factor” Variables.

Step 4: Click OK to create the initial Partition, then click Go to create the rest of the decision tree. WARNING: The Go button will not disappear after you click it; do not click it again or you will create additional partitions, and you will need to start again. Then compare the RASE values (RASE stands for Root Average Squared Error, but you can ignore the details, it’s just a measure of error) from each Set in the table; do these suggest that the model has been overfit? Justify your response in the space below.
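
If it helps to see the idea in code form, here is a hedged Python sketch of fitting a regression tree and computing RASE (the square root of the average squared prediction error) on each set. JMP’s Partition platform grows and validates the tree differently, so treat this only as an illustration of what the RASE numbers measure; the file name, column names, and the stopping rule below are assumptions, not the JMP settings.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_excel("Module 7 CreditScore Data.xlsx")  # file name from the prompt

# 40/40/20 random partition with a fixed seed, analogous to Step 2.
rng = np.random.default_rng(seed=3)
df["SetName"] = rng.choice(["Training", "Validation", "Test"],
                           size=len(df), p=[0.40, 0.40, 0.20])

X_cols = [c for c in df.columns if c not in ("CreditScore", "SetName")]
train = df[df["SetName"] == "Training"]

# min_samples_leaf is an arbitrary stopping rule for this illustration only.
tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
tree.fit(train[X_cols], train["CreditScore"])

# RASE = sqrt(mean((actual - predicted)^2)); compare it across the three sets.
for name in ("Training", "Validation", "Test"):
    subset = df[df["SetName"] == name]
    pred = tree.predict(subset[X_cols])
    rase = np.sqrt(np.mean((subset["CreditScore"] - pred) ** 2))
    print(f"{name:10s} RASE = {rase:.2f}")
```

A Training RASE that is much smaller than the Validation and Test RASE is the classic sign of overfitting; values that are similar across the three sets suggest the model generalizes reasonably well.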

Step 5: Now, click the red triangle menu and choose Display Options > Show Tree. Don’t be alarmed,
but it won’t fit on your screen very well. Take a screenshot of the top node labeled “All Rows” and paste
it into the space below (again, just the one box that says All Rows at the top). IMPORTANT: If you
click Split, Prune, or Go, you’ll have to Redo the Analysis.

Step 6: There’s a lot to dive into here, but we’ll keep it simple and accept the Decision Tree that JMP
has given us. This might look very intimidating, but this Decision Tree is just like the one we went over
in the Classification Trees lecture video – it just looks a little different! Let me provide a quick
recap/description of how the tree is formatted in JMP Pro:

Each node is giving us the criteria for entry into the node (the bold label text), along with some
key metrics: the number of people in our Training Sample that meet the criteria to sort into this
node (which means they meet all the criteria from the nodes higher in the Tree as well), the
Mean credit score of those qualifying individuals (which doubles as the predicted value at that
node), and the Standard Deviation of credit scores within those qualifying individuals.

So, with that in mind, use the Decision Tree to predict the credit score of an individual who has had 5
credit bureau inquiries, has used 10% of her available credit, has $14,500 of total available credit, has no
collection reports or missed payments, is a homeowner, has an average credit age of 6.5 years (i.e.
CreditAge=6.5), and has worked continuously for the past 5 years (i.e. TimeOnJob=5). Enter your
estimate for the credit score, i.e. the Mean of the final node reached by the individual described above,
into the space below, rounding your answer to two decimal places.

Hint/Reminder: the process for this was described in the Classification Trees video provided on D2L; if
you’re stuck, review that video, and if you’re still stuck, send me (or the Tutoring Office) a quick email
so we can find a time to meet. This is a lot easier to explain “live”.

Final Submission Instructions

Once you have completed all three problems above, submit your completed version of this file to the
Assignment on D2L along with your completed JMP files. As always, let me know if you have any
questions!
