Data Mining Case
Problem 2: k-Means Cluster Analysis with the Football Bowl Subdivision (FBS)
1. Open the FBS file used in Problem 1 and copy the data to a new workbook. Delete the
cluster column from the hierarchical clustering in Problem 1.
2. Apply k-Means clustering with k=10 using football stadium capacity, latitude, longitude,
endowment, and enrollment as variables. Specify 50 iterations and 10 random starts and
normalize the data.
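XLMiner performs these steps internally; for reference, the same procedure (z-score normalization, then k-means with 50 iterations per start, keeping the best of 10 random starts by total within-cluster distance) can be sketched in plain Python. Data loading is omitted and the function names are illustrative, not XLMiner's:

```python
import math
import random

def zscore(data):
    """Normalize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, iters=50, starts=10, seed=0):
    """Best-of-`starts` k-means; returns (labels, centers) with lowest inertia."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):                      # 10 random starts
        centers = [list(c) for c in rng.sample(data, k)]
        for _ in range(iters):                   # up to 50 iterations each
            labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in data]
            new_centers = []
            for j in range(k):
                members = [p for p, lab in zip(data, labels) if lab == j]
                if members:
                    new_centers.append([sum(col) / len(members)
                                        for col in zip(*members)])
                else:
                    new_centers.append(centers[j])  # keep an empty cluster's center
            if new_centers == centers:
                break
            centers = new_centers
        inertia = sum(dist2(p, centers[lab]) for p, lab in zip(data, labels))
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best[1], best[2]
```

For this problem the rows would be the five normalized variables per school and `k=10`.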
3. Analyze the resultant clusters. What is the smallest cluster (the one with the fewest
observations)? Cluster 1 is the smallest.
4. What is the least dense (aka most diverse) cluster, as measured by the largest average
distance in the cluster? What makes the least dense cluster so diverse? Cluster 1, because its
universities have the highest endowments of any in the data set, which spreads its
observations far apart from one another.
5. What problems do you see with the plan of defining the school membership of the 10
conferences directly with these 10 clusters? The problem with assigning schools to conferences
directly from these clusters is that cluster membership is not driven primarily by how close the
schools' latitudes and longitudes are to each other, so the resulting conferences may not be
geographically compact.
Problem 3: Both Types of Cluster Analysis with the Football Bowl Subdivision (FBS)
The NCAA has a preference for conferences consisting of similar schools with respect to their
endowment, enrollment, and football stadium size, but these conferences must be in the same geographic
region to reduce traveling costs. Take the following steps to address this desire.
1. Apply k-means clustering again (in a new worksheet) using latitude and longitude as
variables with k=3. Be sure to normalize the data and specify 50 iterations and 10 random starts. Then
create one distinct data set (one spreadsheet) for each of the three regional clusters (east, west,
and south).
2. For the west cluster, apply hierarchical clustering with Ward’s method and use
normalized data to form two sub-clusters using football stadium capacity, endowment, and
enrollment as variables. Use a PivotTable on the data in HC_Clusters to report the
characteristics of each cluster.
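The hierarchical step can also be sketched. Ward's method repeatedly merges the pair of clusters whose merge increases total within-cluster variance the least; for clusters A and B that increase is |A||B|/(|A|+|B|) times the squared distance between their means. A minimal stdlib sketch, assuming the rows are already normalized (the function name is illustrative):

```python
def ward_cluster(data, k):
    """Agglomerative clustering with Ward's criterion; returns a label per row."""
    # Each cluster is tracked as (size, mean vector, member row indices).
    clusters = [(1, list(p), [i]) for i, p in enumerate(data)]

    def merge_cost(a, b):
        # Increase in total within-cluster variance if a and b were merged.
        na, ma, _ = a
        nb, mb, _ = b
        return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ma, mb))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        n = a[0] + b[0]
        mean = [(a[0] * x + b[0] * y) / n for x, y in zip(a[1], b[1])]
        clusters = ([c for idx, c in enumerate(clusters) if idx not in (i, j)]
                    + [(n, mean, a[2] + b[2])])
    labels = [0] * len(data)
    for cid, (_, _, members) in enumerate(clusters):
        for m in members:
            labels[m] = cid
    return labels
```

For the west cluster, `data` would be the normalized stadium capacity, endowment, and enrollment rows and `k=2`.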
3. Do the same for the east cluster, using three sub-clusters.
4. Do the same for the south cluster, using four sub-clusters.
5. What problems do you see with this plan? How could this approach be tweaked to solve
the problem? Three regions are too coarse: although football stadium capacity, endowment,
and enrollment characterize the different conferences well, the regional clusters still span large
distances. Adding more regions, such as Southeast, Southwest, and Northeast, would divide the
conferences into more geographically compact groups and reduce travel between the schools.
Problem 4: Market Basket Analysis on Cookie Monster, Inc. (Problem 8 in our Textbook)
Cookie Monster Inc. is a company that specializes in the development of software that tracks Web
browsing history of individuals.
1. Open the CookieMonster file and review the binary matrix format. Each entry indicates
whether the user in that row visited the website in that column. Using a minimum
support of 800 transactions and a minimum confidence of 50%, use XLMiner to generate a list
of association rules.
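XLMiner's rule generation can be mimicked for the simplest case. The sketch below (plain Python, single-item antecedents only, unlike XLMiner's full itemset search) keeps rules meeting a support count and a confidence threshold, and computes each rule's lift ratio, i.e. confidence divided by the consequent's baseline visit rate:

```python
from itertools import combinations

def association_rules(transactions, min_support_count, min_conf):
    """Brute-force one-to-one rules lhs -> rhs over a list of transactions."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))
    count = lambda itemset: sum(itemset <= t for t in sets)  # transactions containing itemset
    rules = []
    for a, b in combinations(items, 2):
        pair_count = count({a, b})
        if pair_count < min_support_count:      # support filter (a raw count, per XLMiner)
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = pair_count / count({lhs})    # P(rhs | lhs)
            if conf >= min_conf:
                lift = conf / (count({rhs}) / n)  # confidence vs. baseline rate of rhs
                rules.append((lhs, rhs, pair_count, conf, lift))
    return rules
```

With this problem's settings the call would use `min_support_count=800` and `min_conf=0.5`.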
2. Review the top 14 rules. What information does this analysis provide Cookie Monster
regarding the online behavior of individuals? Be sure to address the lift ratios (and the meaning
of the lift ratios) in common terms that a business user would immediately understand.
Problem 7: Logistic Regression to Predict the Organic Customers using SAS Visual Statistics
Access Visual Analytics from the Teradata University Network (TUN) site. Open the
ORGANICS_VISTAT data set and use this data to create a logistic regression equation to predict which
customers will buy organic foods.
1. Create a boxplot that shows affluence grade and age by organics purchase indicator,
just as you did for one of the mini-cases on the midterm exam.
2. Click the Logistic Regression tool to begin using Visual Statistics.
3. Construct the model using Organics Purchase as the output variable and age, gender,
and recent purchase variables as the input variables.
4. Remove any variables that are not statistically significant. What is the resulting logistic
regression calculation? Age and Affluence Grade were the most significant measures in the
data, so those were the only continuous effects retained. Using age, gender, and recent
purchases as input variables produced an R-square of .1481, indicating a weak relationship:
only 14.81% of the variability in organics purchases is explained by those inputs. According to
the fit summary, the recent-purchase variables had low variable importance, so omitting them
raised the R-square to .2302. After trying various combinations of measures and categories, the
best model used the Age and Affluence Grade measures with the Gender, Loyalty Card Class,
Residential Neighborhood, and Television Region categories. Gender had the largest additional
impact, raising the R-square from .2302 to .2335. The model used 1,498,264 observations, with
190,684 data points unused. Customer ID accounted for a large share of the unused points
(over 5,000), but it did not influence the graphs. Grouping the geographic and demographic
variables would increase the model's explanatory power further.
5. What is the overall R-square for the model? .2335
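Visual Statistics fits the logistic regression by maximum likelihood behind the scenes. The same idea can be sketched in stdlib Python with batch gradient descent on the log-loss (variable names are illustrative, and real data loading is omitted):

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit coefficients by gradient descent on log-loss; returns [intercept, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1 / (1 + math.exp(-z))          # predicted probability of purchase
            err = p - yi                        # gradient of log-loss w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    """Predicted probability for one observation."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 / (1 + math.exp(-z))
```

In the actual model, `X` would hold the retained effects (age, affluence grade, and the encoded categories) and `y` the 0/1 organics-purchase indicator.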
6. Use the assessment plots to determine the effectiveness of the model.
a. Look at the Lift Chart, which measures the model’s effectiveness. A lift chart is a
graphical representation of the advantage (or lift) of using a predictive model to improve on the
target response vs. not using a model. It is a measure of the effectiveness of a predictive model
calculated as the ratio between the results obtained with and without the predictive model.
When the lift in the lower percentiles of the chart is higher, the model is better.
The chart shows two lines: one that represents the model you built; and one that represents the
best, achievable model, or a perfect classifier. When the Model line is closer to the Best line,
especially in the lower percentiles, the model is better.
Restated, lift is the ratio of the percentage of captured events within each percentile bin to the
average percentage of responses for the model. Cumulative lift is calculated by using all the
data up to and including the current percentile bin. In this example, what is cumulative lift at the
20th percentile? Is this value low? If so, a low value indicates additional variables or interaction
effects should be considered to improve the model. This lift value means that if the supermarket
chain sent coupons to the top 20 percent of customers selected by this model, it could expect to
see 1.2056 times more customers purchasing organic products than if the same number of
customers were randomly selected. The cumulative lift of the model is close to the best-line
cumulative lift of 1.25 at the 80th percentile, meaning that this model is an effective predictor for
the upper percentiles of customer spend.
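Cumulative lift at a given percentile is exactly this calculation: the response rate among the top-scored fraction of customers divided by the overall response rate. A stdlib sketch, assuming `scores` are model probabilities and `actuals` are 0/1 purchase flags (illustrative names):

```python
def cumulative_lift(scores, actuals, pct):
    """Response rate in the top `pct` of model-ranked customers,
    divided by the overall response rate."""
    ranked = [a for _, a in sorted(zip(scores, actuals), key=lambda p: -p[0])]
    top_n = max(1, int(len(ranked) * pct))
    top_rate = sum(ranked[:top_n]) / top_n      # response rate among targeted customers
    overall = sum(actuals) / len(actuals)       # baseline response rate
    return top_rate / overall
```

A value of 1.2056 at `pct=0.2` means the targeted top 20 percent responds at 1.2056 times the baseline rate.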
b. What does the ROC chart suggest? The curve has a steep initial slope and then levels
off, suggesting that the predictive accuracy of the model is good, but not excellent. The
maximum separation (the KS statistic) is .4716, located at about the .2 specificity. The model
becomes a stronger predictor beyond the .4 sensitivity region, where sensitivity rises above
.82.
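The maximum separation (KS statistic) can be computed directly from scored data as the largest value of sensitivity minus (1 - specificity), i.e. TPR - FPR, over all thresholds. A stdlib sketch (it ignores tied scores, which a production version would handle):

```python
def ks_statistic(scores, actuals):
    """Largest TPR - FPR gap across thresholds, sweeping scores high to low."""
    pos = sum(actuals)
    neg = len(actuals) - pos
    tp = fp = 0
    best = 0.0
    for _, a in sorted(zip(scores, actuals), key=lambda p: -p[0]):
        if a:
            tp += 1
        else:
            fp += 1
        best = max(best, tp / pos - fp / neg)   # sensitivity - (1 - specificity)
    return best
```

On the model above this calculation yields the .4716 reported by Visual Statistics; a perfect classifier yields 1.0.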
c. Assess the Misclassification Chart, which displays how many observations were
correctly and incorrectly classified for each value of the response variable. In this case, the
misclassification plot displays how many observations were correctly and incorrectly classified
as bought (positive) or did not buy (negative) organic products. How many customers were
classified as false positives? Should this model be refined more in light of this?
Using the variable inputs that produced the R-square of .2335, 63,080 customers who did not
buy organic products were classified as buyers (false positives), and 214,396 customers who
did buy were classified as non-buyers (false negatives). The model should be refined in light of
this, because a larger share of customers were misclassified as false negatives than were
correctly identified as true positives, and grocers care most about correctly identifying the
customers who actually buy.
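The misclassification chart's counts come from a simple confusion-matrix tally; a minimal sketch with illustrative 0/1 lists for predicted and actual purchase:

```python
def confusion_counts(predicted, actual):
    """Binary-classifier tallies: (true pos, false pos, true neg, false neg)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)      # bought predicted, didn't buy
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)      # buyer missed by the model
    return tp, fp, tn, fn
```

The 63,080 false positives and 214,396 false negatives above are the `fp` and `fn` entries of this tally for the fitted model.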
7. Click on the Parameter Estimates tab in the summary table. Click on the z Value
column two times to sort descending. Which variables have a high influence on predicting
whether a customer will buy organic food? The variables with the highest influence include the
two measures, Age and Affluence Grade, along with Gender. The z value for Age was -276.245,
the only value lower than the intercept's -89.5. The Affluence Grade z value is 353.92, and the
scores for female and male are 226.4 and 95.36, respectively. These large magnitudes indicate
a strong influence on the prediction.
Problem 8: Logistic Regression to Predict the PVA Donors using SAS Visual Statistics
Access Visual Analytics from the Teradata University Network (TUN) site. Open the PVA_DATA data
set and use this data to create a logistic regression equation to predict who is likely to donate to the PVA.
1. Create a boxplot that shows last gift amount and age by donor indicator.
2. Click the Logistic Regression tool to begin using Visual Statistics.
3. Construct the logistic regression model using Donation as the response variable. Select
the Advanced link and make sure that the event level is set to Donated.
4. Change pep_star from a measure to a category.
5. One at a time, select from the following continuous effects:
file_avg_gift, file_card_gift, home_value, house_income, last_gift_amount,
lifetime_avg_gift_amt, lifetime_gift_count, lifetime_gift_range, lifetime_max_gift_amt,
lifetime_min_gift_amt, lifetime_prom, months_since_first_gift, months_since_last_gift,
number_prom_12, age, card_prom_12
6. On the Properties tab, select Information missingness and Use variable selection.
7. Remove any variables that are not statistically significant.
The variables that are not statistically significant are file_avg_gift, house_income,
lifetime_min_gift_amt, and home_owner.
8. Remove some of the outliers via the Residual plot.
There were some outliers, on the top half of the plot to the far right near the “1” residual value
and the “.7” predicted probability.
9. View the response profile tab on the Summary table.
10. What is the resulting logistic regression calculation?
The logistic regression calculation is listed on pages 8 through 10 on the PDF associated with
this problem.
11. Check the various assessment charts and comment on the usefulness of the model.
According to the assessment charts, this model is not very useful. Although the ROC chart
shows that the model outperforms having no model at all, the improvement is small, and the
R-square of .034 is not convincing either.