Data Mining Cluster
latitude alone are not good variables to cluster colleges by and single linkage clustering
yielded a poor result.
Problem 2: k-Means Cluster Analysis with the Football Bowl Subdivision (FBS)
1. Open the FBS file used in Problem 1 and copy the data to a new workbook. Delete the cluster
column from the hierarchical clustering in Problem 1.
2. Apply k-Means clustering with k=10 using football stadium capacity, latitude, longitude,
endowment, and enrollment as variables. Specify 50 iterations and 10 random starts and
normalize the data.
3. Analyze the resultant clusters. What is the smallest cluster (the one with the fewest observations)?
a. The smallest cluster is cluster 5, which contains only 1 school.
4. What is the least dense (aka most diverse) cluster, as measured by the largest average distance in
the cluster? What makes the least dense cluster so diverse?
a. Cluster 1 is the least dense.
b. It is so diverse because its observations are more spread out than those of a highly
concentrated cluster. The density is low because the 5 universities grouped together are far
apart from one another and relatively few in number.
5. What problems do you see with the plan of defining the school membership of the 10 conferences
directly with these 10 clusters?
a. Cluster 2 has only 3 schools, which would be far too few for an FBS conference.
b. Cluster 5 is also too small, with only 1 school in the cluster.
c. Cluster 7 is an outlier with 27 schools in the cluster.
d. Overall, the cluster sizes vary widely. They range from 1 to 27 schools, so conferences
built directly from these clusters would be very unbalanced.
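The workflow in Problem 2 (normalize, run k-means from several random starts, then inspect cluster sizes and average distances) can be sketched in plain Python. This is only an illustrative sketch and not XLMiner's implementation: the function names, the z-score normalization, the "average distance to centroid" density measure, and the toy rows are all assumptions made for the example.

```python
import math
import random


def zscore_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Guard against constant columns: fall back to a divisor of 1.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def kmeans_once(points, k, iterations, rng):
    """One run of Lloyd's algorithm starting from k randomly chosen observations."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def best_of_starts(points, k, iterations=50, starts=10, seed=0):
    """Repeat k-means from several random starts; keep the lowest total squared distance."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        labels, centroids = kmeans_once(points, k, iterations, rng)
        sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best[1], best[2]


def cluster_summary(points, labels, centroids):
    """Map cluster number (1-based) to (size, average distance to its centroid)."""
    summary = {}
    for j, c in enumerate(centroids):
        dists = [math.dist(p, c) for p, lab in zip(points, labels) if lab == j]
        if dists:
            summary[j + 1] = (len(dists), sum(dists) / len(dists))
    return summary


if __name__ == "__main__":
    # Hypothetical (capacity, enrollment) rows, not taken from the FBS file.
    toy = [[30000, 15000], [32000, 16000], [31000, 15500],
           [80000, 40000], [82000, 42000], [100000, 20000]]
    z = zscore_columns(toy)
    labels, centroids = best_of_starts(z, k=3)
    for cluster, (size, avg_dist) in sorted(cluster_summary(z, labels, centroids).items()):
        print(cluster, size, round(avg_dist, 3))
```

In a summary like this, the smallest cluster is the one with the lowest size and the least dense cluster is the one with the largest average distance, which mirrors how questions 3 and 4 read the XLMiner output.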
Problem 3: Both Types of Cluster Analysis with the Football Bowl Subdivision (FBS)
The NCAA has a preference for conferences consisting of similar schools with respect to their
endowment, enrollment, and football stadium size, but these conferences must be in the same geographic
region to reduce traveling costs. Take the following steps to address this desire.
1. Apply k-means clustering again (in a new worksheet) using latitude and longitude as variables
with k=3. Be sure to normalize and specify 50 iterations and 10 random starts. Then create one
distinct data set (one spreadsheet) for each of the three regional clusters (east, west, and south).
2. For the west cluster, apply hierarchical clustering with Ward's method and use normalized data to
form two sub-clusters using football stadium capacity, endowment, and enrollment as variables.
Use a PivotTable on the data in HC_Clusters to report the characteristics of each cluster.
Row Labels    Average of Enrollment   Average of StadiumCapacity   Average of Endowment ($000)   Count of SubCluster
1             26589.2381              49088.71429                  842519.4762                   21
2             19945                   50000                        16502606                      1
Grand Total   26287.22727             49130.13636                  1554341.591                   22
Cluster 1 has 21 schools while cluster 2 has only 1 school. The single school in cluster 2 has a
significantly higher endowment than the cluster 1 average.
Row Labels    Average of Stadium Capacity   Average of Endowment ($000)   Average of Enrollment   Count of Sub-Cluster
1             63568.4                       1336091.8                     32963.4                 25
2             63347.66667                   5866583.5                     21313                   6
3             34350.73077                   193019.3462                   24231.80769             26
Grand Total   50217.80702                   1291584.193                   27754.21053             57
Clusters 1 and 3 have a similar number of schools, while cluster 2 is made up of only 6
schools.
a. Cluster 1
4. Do the same for the south cluster, using four sub-clusters.
Row Labels    Count of SubCluster   Average of StadiumCapacity   Average of Endowment ($000)   Average of Enrollment
1             17                    39736.11765                  113253.5882                   25873.17647
2             2                     85812                        3652205.5                     22330.5
3             21                    66754.7619                   547584.5238                   29637.04762
4             8                     66461.125                    1191190.375                   22726
Grand Total   48                    57930.77083                  630385.8333                   26847.72917
Cluster 2 has only 2 schools. The number of schools in each cluster isn't balanced.
5. What problems do you see with this plan? How could this approach be tweaked to solve the
problem?
a. Latitude and longitude don't necessarily capture how close the schools are to each
other. For example, Hawaii was placed in the south region when logically it should be
in the west. It might be necessary to manually adjust some of the clusters because of this.
b. Within each region there is still an uneven number of schools in each sub-cluster.
This problem could be reduced by adding a North region. Creating more geographic
regions beyond East, South, West, and North could extend this solution further. Collecting
more data on each school, such as rankings, would also help cluster them.
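One way to sanity-check the PivotTable output in Problem 3 is to confirm that each Grand Total average is the count-weighted mean of the sub-cluster averages. The sketch below does this for the west-region figures reported in step 2. The helper name `weighted_grand_average` is an assumption made for illustration, not part of Excel or XLMiner.

```python
def weighted_grand_average(counts, averages):
    """Count-weighted mean of per-cluster averages."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    return sum(n * a for n, a in zip(counts, averages)) / total


# West-region figures as reported in the PivotTable (sub-clusters 1 and 2).
west_counts = [21, 1]
west_averages = {
    "Enrollment": [26589.2381, 19945],
    "StadiumCapacity": [49088.71429, 50000],
    "Endowment ($000)": [842519.4762, 16502606],
}

west_grand = {name: weighted_grand_average(west_counts, avgs)
              for name, avgs in west_averages.items()}

for name, value in west_grand.items():
    print(f"{name}: {value:.2f}")
```

The same check also shows why a one-school sub-cluster matters: the single school's endowment pulls the west Grand Total well above the cluster 1 average even though it accounts for only 1 of 22 schools.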
Problem 4: Market Basket Analysis on Cookie Monster, Inc. (Problem 8 in our Textbook)
Cookie Monster Inc. is a company that specializes in the development of software that tracks Web
browsing history of individuals.
1. Open the CookieMonster file and review the binary matrix format. The entry in a given row and column
indicates whether the column's website was visited by the row's user. Using a minimum support of
800 transactions and a minimum confidence of 50%, use XLMiner to generate a list of
association rules.
2. Review the top 14 rules. What information does this analysis provide Cookie Monster regarding
the online behavior of individuals? Be sure to address the lift ratios (and the meaning of the lift
ratios) in common terms that a business user would immediately understand.
a. The lift ratio is a measure of the usefulness of a rule. It is the confidence of the rule
(the support of the antecedent and consequent together divided by the support of the
antecedent) divided by the share of all users who visit the consequent site. In plain terms,
a lift ratio of 2 means that users who visit the antecedent sites are twice as likely to visit
the consequent site as a randomly chosen user. The analysis indicates that visits to Facebook,
Twitter, and YouTube are strongly associated: the highest lift ratios come from rules in which
any two of these sites lead to the third. It also shows which rules have low lift ratios and
are therefore less useful for predicting users' click patterns. If you know users are going to
visit all three of these sites, you could save money by advertising on only one, or saturate
the market by advertising on all three.
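To make the support, confidence, and lift definitions concrete, here is a minimal sketch that computes them for one rule from a binary user-by-website matrix. The eight rows are made-up illustration data, not values from the CookieMonster file, and `rule_metrics` is a hypothetical helper rather than an XLMiner function.

```python
def rule_metrics(rows, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    rows: list of dicts mapping website name to 0/1 (one dict per user).
    antecedent, consequent: lists of website names.
    """
    n = len(rows)
    has_a = [all(r[site] for site in antecedent) for r in rows]
    has_c = [all(r[site] for site in consequent) for r in rows]
    support_a = sum(has_a)
    support_c = sum(has_c)
    support_both = sum(a and c for a, c in zip(has_a, has_c))
    if support_a == 0 or support_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = support_both / support_a
    # Lift compares the rule's confidence with the baseline rate of the consequent.
    lift = confidence / (support_c / n)
    return support_both, confidence, lift


# Hypothetical browsing histories for eight users (1 = visited).
toy_rows = [
    {"Facebook": 1, "Twitter": 1, "YouTube": 1},
    {"Facebook": 1, "Twitter": 1, "YouTube": 1},
    {"Facebook": 1, "Twitter": 1, "YouTube": 1},
    {"Facebook": 1, "Twitter": 1, "YouTube": 0},
    {"Facebook": 1, "Twitter": 0, "YouTube": 0},
    {"Facebook": 0, "Twitter": 0, "YouTube": 1},
    {"Facebook": 0, "Twitter": 0, "YouTube": 0},
    {"Facebook": 0, "Twitter": 0, "YouTube": 0},
]

support, confidence, lift = rule_metrics(toy_rows, ["Facebook", "Twitter"], ["YouTube"])
print(support, confidence, lift)
```

In this toy data half of all users visit YouTube, but three of the four users who visit both Facebook and Twitter do, so the rule's lift is above 1. That is the pattern a business user should look for when reading the XLMiner rule list.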