Data Mining Case
Problem 2: k-Means Cluster Analysis with the Football Bowl Subdivision (FBS)
1. Open the FBS file used in Problem 1 and copy the data to a new workbook. Delete the
cluster column from the hierarchical clustering in Problem 1.
2. Apply k-Means clustering with k=10 using football stadium capacity, latitude, longitude,
endowment, and enrollment as variables. Specify 50 iterations and 10 random starts and
normalize the data.
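XLMiner performs these steps internally; for reference, the same procedure (z-score normalization, then k-means with 50 iterations per start, keeping the best of 10 random starts by total within-cluster distance) can be sketched in plain Python. Data loading is omitted and the function names are illustrative, not XLMiner's:

```python
import math
import random

def zscore(data):
    """Normalize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, iters=50, starts=10, seed=0):
    """Best-of-`starts` k-means; returns (labels, centers) with lowest inertia."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):                      # 10 random starts
        centers = [list(c) for c in rng.sample(data, k)]
        for _ in range(iters):                   # up to 50 iterations each
            labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in data]
            new_centers = []
            for j in range(k):
                members = [p for p, lab in zip(data, labels) if lab == j]
                if members:
                    new_centers.append([sum(col) / len(members)
                                        for col in zip(*members)])
                else:
                    new_centers.append(centers[j])  # keep an empty cluster's center
            if new_centers == centers:
                break
            centers = new_centers
        inertia = sum(dist2(p, centers[lab]) for p, lab in zip(data, labels))
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best[1], best[2]
```

For this problem the rows would be the five normalized variables per school and `k=10`.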
3. Analyze the resultant clusters. What is the smallest cluster (the one with the fewest
observations)? Cluster 1 is the smallest.
4. What is the least dense (aka most diverse) cluster, as measured by the largest average
distance in the cluster? What makes the least dense cluster so diverse? Cluster 1, because its
universities have the highest endowments of any in the data set, which spreads its
observations far apart from one another.
5. What problems do you see with the plan of defining the school membership of the 10
conferences directly with these 10 clusters? The problem with assigning schools to conferences
directly from these clusters is that cluster membership is not driven primarily by how close the
schools' latitudes and longitudes are to each other, so the resulting conferences may not be
geographically compact.
Problem 3: Both Types of Cluster Analysis with the Football Bowl Subdivision (FBS)
The NCAA has a preference for conferences consisting of similar schools with respect to their
endowment, enrollment, and football stadium size, but these conferences must be in the same geographic
region to reduce traveling costs. Take the following steps to address this desire.
1. Apply k-means clustering again (in a new worksheet) using latitude and longitude as
variables with k=3. Be sure to normalize the data and specify 50 iterations and 10 random starts. Then
create one distinct data set (one spreadsheet) for each of the three regional clusters (east, west,
and south).
2. For the west cluster, apply hierarchical clustering with Ward’s method and use
normalized data to form two sub-clusters using football stadium capacity, endowment, and
enrollment as variables. Use a PivotTable on the data in HC_Clusters to report the
characteristics of each cluster.
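The hierarchical step can also be sketched. Ward's method repeatedly merges the pair of clusters whose merge increases total within-cluster variance the least; for clusters A and B that increase is |A||B|/(|A|+|B|) times the squared distance between their means. A minimal stdlib sketch, assuming the rows are already normalized (the function name is illustrative):

```python
def ward_cluster(data, k):
    """Agglomerative clustering with Ward's criterion; returns a label per row."""
    # Each cluster is tracked as (size, mean vector, member row indices).
    clusters = [(1, list(p), [i]) for i, p in enumerate(data)]

    def merge_cost(a, b):
        # Increase in total within-cluster variance if a and b were merged.
        na, ma, _ = a
        nb, mb, _ = b
        return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ma, mb))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        n = a[0] + b[0]
        mean = [(a[0] * x + b[0] * y) / n for x, y in zip(a[1], b[1])]
        clusters = ([c for idx, c in enumerate(clusters) if idx not in (i, j)]
                    + [(n, mean, a[2] + b[2])])
    labels = [0] * len(data)
    for cid, (_, _, members) in enumerate(clusters):
        for m in members:
            labels[m] = cid
    return labels
```

For the west cluster, `data` would be the normalized stadium capacity, endowment, and enrollment rows and `k=2`.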
3. Do the same for the east cluster, using three sub-clusters.
4. Do the same for the south cluster, using four sub-clusters.
5. What problems do you see with this plan? How could this approach be tweaked to solve
the problem? Three regions are too coarse: although football stadium capacity, endowment,
and enrollment characterize the different conferences well, the regional clusters still span large
distances. Adding more regions, such as Southeast, Southwest, and Northeast, would divide the
conferences into more geographically compact groups and reduce travel between the schools.
Problem 4: Market Basket Analysis on Cookie Monster, Inc. (Problem 8 in our Textbook)
Cookie Monster Inc. is a company that specializes in the development of software that tracks Web
browsing history of individuals.
1. Open the CookieMonster file and review the binary matrix format. Each entry indicates
whether the user in that row visited the website in that column. Using a minimum
support of 800 transactions and a minimum confidence of 50%, use XLMiner to generate a list
of association rules.
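XLMiner's rule generation can be mimicked for the simplest case. The sketch below (plain Python, single-item antecedents only, unlike XLMiner's full itemset search) keeps rules meeting a support count and a confidence threshold, and computes each rule's lift ratio, i.e. confidence divided by the consequent's baseline visit rate:

```python
from itertools import combinations

def association_rules(transactions, min_support_count, min_conf):
    """Brute-force one-to-one rules lhs -> rhs over a list of transactions."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))
    count = lambda itemset: sum(itemset <= t for t in sets)  # transactions containing itemset
    rules = []
    for a, b in combinations(items, 2):
        pair_count = count({a, b})
        if pair_count < min_support_count:      # support filter (a raw count, per XLMiner)
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = pair_count / count({lhs})    # P(rhs | lhs)
            if conf >= min_conf:
                lift = conf / (count({rhs}) / n)  # confidence vs. baseline rate of rhs
                rules.append((lhs, rhs, pair_count, conf, lift))
    return rules
```

With this problem's settings the call would use `min_support_count=800` and `min_conf=0.5`.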
2. Review the top 14 rules. What information does this analysis provide Cookie Monster
regarding the online behavior of individuals? Be sure to address the lift ratios (and the meaning
of the lift ratios) in common terms that a business user would immediately understand.
Problem 7: Logistic Regression to Predict the Organic Customers using SAS Visual Statistics
Access Visual Analytics from the Teradata University Network (TUN) site. Open the
ORGANICS_VISTAT data set and use this data to create a logistic regression equation to predict which
customers will buy organic foods.
1. Create a boxplot that shows affluence grade and age by organics purchase indicator,
just as you did for one of the mini-cases on the midterm exam.
2. Click the Logistic Regression tool to begin using Visual Statistics.
3. Construct the model using Organics Purchase as the output variable and age, gender,
and recent purchase variables as the input variables.
4. Remove any variables that are not statistically significant. What is the resulting logistic
regression calculation? Age and Affluence Grade were the most significant measures in the
data, so those were the only continuous effects retained. Using age, gender, and recent
purchases as input variables produced an R-square of .1481, indicating a weak relationship:
only 14.81% of the variability in organics purchases is explained by those inputs. According to
the fit summary, the recent-purchase variables had low variable importance, so omitting them
raised the R-square to .2302. After trying various combinations of measures and categories, the
best model used the Age and Affluence Grade measures with the Gender, Loyalty Card Class,
Residential Neighborhood, and Television Region categories. Gender had the largest additional
impact, raising the R-square from .2302 to .2335. The model used 1,498,264 observations, with
190,684 data points unused. Customer ID accounted for a large share of the unused points
(over 5,000), but it did not influence the graphs. Grouping the geographic and demographic
variables would increase the model's explanatory power further.
5. What is the overall R-square for the model? .2335
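Visual Statistics fits the logistic regression by maximum likelihood behind the scenes. The same idea can be sketched in stdlib Python with batch gradient descent on the log-loss (variable names are illustrative, and real data loading is omitted):

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit coefficients by gradient descent on log-loss; returns [intercept, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1 / (1 + math.exp(-z))          # predicted probability of purchase
            err = p - yi                        # gradient of log-loss w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    """Predicted probability for one observation."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 / (1 + math.exp(-z))
```

In the actual model, `X` would hold the retained effects (age, affluence grade, and the encoded categories) and `y` the 0/1 organics-purchase indicator.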
6. Use the assessment plots to determine the effectiveness of the model.
a. Look at the Lift Chart, which measures the model’s effectiveness. A lift chart is a
graphical representation of the advantage (or lift) of using a predictive model to improve on the
target response vs. not using a model. It is a measure of the effectiveness of a predictive model
calculated as the ratio between the results obtained with and without the predictive model.
When the lift in the lower percentiles of the chart is higher, the model is better.
The chart shows two lines: one that represents the model you built; and one that represents the
best, achievable model, or a perfect classifier. When the Model line is closer to the Best line,
especially in the lower percentiles, the model is better.
Restated, lift is the ratio of the percentage of captured events within each percentile bin to the
average percentage of responses for the model. Cumulative lift is calculated by using all the
data up to and including the current percentile bin. In this example, what is cumulative lift at the
20th percentile? Is this value low? If so, a low value indicates additional variables or interaction
effects should be considered to improve the model. This lift value means that if the supermarket
chain sent coupons to the top 20 percent of customers selected by this model, it could expect to
see 1.2056 times more customers purchasing organic products than if the same number of
customers were randomly selected. The cumulative lift of the model is close to the best-line
cumulative lift of 1.25 at the 80th percentile, meaning that this model is an effective predictor for
the upper percentiles of customer spend.
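Cumulative lift at a given percentile is exactly this calculation: the response rate among the top-scored fraction of customers divided by the overall response rate. A stdlib sketch, assuming `scores` are model probabilities and `actuals` are 0/1 purchase flags (illustrative names):

```python
def cumulative_lift(scores, actuals, pct):
    """Response rate in the top `pct` of model-ranked customers,
    divided by the overall response rate."""
    ranked = [a for _, a in sorted(zip(scores, actuals), key=lambda p: -p[0])]
    top_n = max(1, int(len(ranked) * pct))
    top_rate = sum(ranked[:top_n]) / top_n      # response rate among targeted customers
    overall = sum(actuals) / len(actuals)       # baseline response rate
    return top_rate / overall
```

A value of 1.2056 at `pct=0.2` means the targeted top 20 percent responds at 1.2056 times the baseline rate.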
b. What does the ROC chart suggest? The curve has a steep initial slope and then levels
off, suggesting that the predictive accuracy of the model is good, but not excellent. The
maximum separation (the KS statistic) is .4716, located at about the .2 specificity. The model
becomes a stronger predictor beyond the .4 sensitivity region, where sensitivity rises above
.82.
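The maximum separation (KS statistic) can be computed directly from scored data as the largest value of sensitivity minus (1 - specificity), i.e. TPR - FPR, over all thresholds. A stdlib sketch (it ignores tied scores, which a production version would handle):

```python
def ks_statistic(scores, actuals):
    """Largest TPR - FPR gap across thresholds, sweeping scores high to low."""
    pos = sum(actuals)
    neg = len(actuals) - pos
    tp = fp = 0
    best = 0.0
    for _, a in sorted(zip(scores, actuals), key=lambda p: -p[0]):
        if a:
            tp += 1
        else:
            fp += 1
        best = max(best, tp / pos - fp / neg)   # sensitivity - (1 - specificity)
    return best
```

On the model above this calculation yields the .4716 reported by Visual Statistics; a perfect classifier yields 1.0.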
c. Assess the Misclassification Chart, which displays how many observations were
correctly and incorrectly classified for each value of the response variable. In this case, the
misclassification plot displays how many observations were correctly and incorrectly classified
as bought (positive) or did not buy (negative) organic products. How many customers were
classified as false positives? Should this model be refined more in light of this?
Using the variable inputs that produced the R-square of .2335, 63,080 customers who did not
buy organic products were classified as buyers (false positives), and 214,396 customers who
did buy were classified as non-buyers (false negatives). The model should be refined in light of
this, because a larger share of customers were misclassified as false negatives than were
correctly identified as true positives, and grocers care most about correctly identifying the
customers who actually buy.
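The misclassification chart's counts come from a simple confusion-matrix tally; a minimal sketch with illustrative 0/1 lists for predicted and actual purchase:

```python
def confusion_counts(predicted, actual):
    """Binary-classifier tallies: (true pos, false pos, true neg, false neg)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)      # bought predicted, didn't buy
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)      # buyer missed by the model
    return tp, fp, tn, fn
```

The 63,080 false positives and 214,396 false negatives above are the `fp` and `fn` entries of this tally for the fitted model.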
7. Click on the Parameter Estimates tab in the summary table. Click on the z Value
column two times to sort descending. Which variables have a high influence on predicting
whether a customer will buy organic food? The variables with the highest influence include the
two measures, Age and Affluence Grade, along with Gender. The z value for Age was -276.245,
the only value lower than the intercept's -89.5. The Affluence Grade z value is 353.92, and the
scores for female and male are 226.4 and 95.36, respectively. These large magnitudes indicate
a strong influence on the prediction.
Problem 8: Logistic Regression to Predict the PVA Donors using SAS Visual Statistics
Access Visual Analytics from the Teradata University Network (TUN) site. Open the PVA_DATA data
set and use this data to create a logistic regression equation to predict who is likely to donate to the PVA.
1. Create a boxplot that shows last gift amount and age by donor indicator.
2. Click the Logistic Regression tool to begin using Visual Statistics.
3. Construct the logistic regression model using Donation as the response variable. Select
the Advanced link and make sure that the event level is set to Donated.
4. Change pep_star from a measure to a category.
5. One at a time, select from the following continuous effects:
file_avg_gift, file_card_gift, home_value, house_income, last_gift_amount,
lifetime_avg_gift_amt, lifetime_gift_count, lifetime_gift_range, lifetime_max_gift_amt,
lifetime_min_gift_amt, lifetime_prom, months_since_first_gift, months_since_last_gift,
number_prom_12, age, card_prom_12
6. On the Properties tab, select Information missingness and Use variable selection.
7. Remove any variables that are not statistically significant.
The variables that are not statistically significant are file_avg_gift, house_income,
lifetime_min_gift_amt, and home_owner.
8. Remove some of the outliers via the Residual plot.
There were some outliers, on the top half of the plot to the far right near the “1” residual value
and the “.7” predicted probability.
9. View the response profile tab on the Summary table.
10. What is the resulting logistic regression calculation?
The logistic regression calculation is listed on pages 8 through 10 on the PDF associated with
this problem.
11. Check the various assessment charts and comment on the usefulness of the model.
According to the assessment charts, this model is not very useful. Although the ROC chart
shows that the model outperforms having no model at all, the improvement is small, and the
R-square of .034 is not convincing either.