Data Mining Notes
Chapters 1 & 2
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Supervised and Unsupervised Learning
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Categorical
Naïve Bayes can use categorical variables as-is
Most other algorithms require converting them to binary dummy variables
(number of dummies = number of categories – 1)
XLMiner has a utility to convert categorical variables to binary dummies
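The "number of categories – 1" rule can be sketched in a few lines of Python. This is only an illustration, not XLMiner's utility: the function name `make_dummies`, the `color` variable, and the sample values are all hypothetical. The first category (in sorted order) is assumed to be the reference level, represented by all zeros.

```python
def make_dummies(values, prefix):
    """Return (column_names, rows): one 0/1 column per category except the first."""
    categories = sorted(set(values))
    kept = categories[1:]  # drop the first category; it becomes the reference level
    names = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return names, rows


# Hypothetical categorical variable with three categories.
colors = ["red", "green", "blue", "green"]
names, rows = make_dummies(colors, "color")

print(names)  # two dummy columns for three categories
for value, row in zip(colors, rows):
    print(value, row)
```

A record in the reference category ("blue" here) has 0 in every dummy column, which is why one fewer dummy than categories is enough.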
Detecting Outliers
Distance between records: Problem 2.9 (Chapter 2), contd.
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences. For the first two records in Table 2.7, it is
Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
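One way to explore this question is a small experiment. The sketch below uses three made-up records (income in dollars, age in years), not the records from Table 2.7, and assumes z-score normalization with the sample standard deviation. It computes the farthest pair before and after normalizing.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Hypothetical records (income in dollars, age in years); NOT from Table 2.7.
records = {
    "A": (50000.0, 25.0),
    "B": (51000.0, 65.0),
    "C": (53000.0, 27.0),
}


def euclidean(p, q):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def zscore_normalize(recs):
    """Rescale each variable to mean 0 and sample standard deviation 1."""
    columns = list(zip(*recs.values()))
    stats = [(mean(col), stdev(col)) for col in columns]
    return {
        name: tuple((v - m) / s for v, (m, s) in zip(vals, stats))
        for name, vals in recs.items()
    }


def farthest_pair(recs):
    """Return the pair of record names with the largest Euclidean distance."""
    return max(
        combinations(sorted(recs), 2),
        key=lambda pair: euclidean(recs[pair[0]], recs[pair[1]]),
    )


raw_pair = farthest_pair(records)
norm_pair = farthest_pair(zscore_normalize(records))
print("farthest before normalizing:", raw_pair)
print("farthest after normalizing: ", norm_pair)
```

With these particular values the raw distances are dominated by income, so A and C are farthest apart. After normalization age carries comparable weight and B and C become the farthest pair, so for this example the answer is yes.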
The Problem of Overfitting
[Figure: plot of Revenue (vertical axis, 0 to 1400) against Expenditure (horizontal axis, 0 to 1000)]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
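The "too many parameters" cause can be illustrated with a small sketch. The expenditure and revenue numbers below are invented for illustration, and the two models (a least-squares line and a polynomial forced through every training point) are assumptions chosen to make the contrast visible, not models taken from these notes.

```python
import math

# Hypothetical (expenditure, revenue) pairs, invented for illustration only.
train = [(100, 350), (300, 450), (500, 760), (700, 850), (900, 1150)]
holdout = [(200, 410), (400, 590), (600, 805), (800, 1000)]


def fit_line(points):
    """Ordinary least squares for revenue = a + b * expenditure (2 parameters)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolant(points):
    """Lagrange polynomial through every training point (one parameter per point)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            weight = 1.0
            for j, (xj, _) in enumerate(points):
                if i != j:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def rmse(model, points):
    """Root mean squared prediction error of a model on a set of points."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in points) / len(points))


line = fit_line(train)
wiggly = fit_interpolant(train)
for name, model in [("line", line), ("degree-4 interpolant", wiggly)]:
    print(
        f"{name}: train RMSE = {rmse(model, train):.1f}, "
        f"holdout RMSE = {rmse(model, holdout):.1f}"
    )
```

The interpolant reproduces the training data with zero error because it has as many parameters as training points, yet on this made-up holdout set it predicts worse than the simple line. That gap between training and holdout error is the usual symptom of overfitting.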