GBUS515 – Business Intelligence and Information Systems
Chapters 1 & 2
Introduction and Overview of the Data Mining Process
Instructor – Dr. Sunita Goel
Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e
© Galit Shmueli and Peter Bruce 2010
Let’s get familiar with Terminology - 1
Predictor: A variable, usually denoted by X, used as an input into
a predictive model. Also called a feature, attribute, input
variable, independent variable, or from a database perspective, a
field.
Response: A variable, usually denoted by Y, which is the variable
being predicted in supervised learning; also called dependent
variable, output variable, target variable, or outcome variable.
Observation: The unit of analysis on which the measurements are
taken (a customer, a transaction, etc.); also
called instance, sample, example, case, record, pattern, or row. In
a spreadsheet or database table, each row typically represents a
record; each column, a variable or an attribute.
Terminology (contd.) - 2
Supervised Learning: The process of providing an algorithm
(logistic regression, regression tree, etc.) with records in
which an output variable of interest is known and the
algorithm “learns” how to predict this value with new records
where the output is unknown.
Unsupervised Learning: An analysis in which one attempts to
learn patterns in the data other than predicting an output
value of interest.
Success Class: The class of interest in a binary outcome (e.g.,
purchasers in the outcome purchase/no purchase).
Algorithm: A specific procedure used to implement a
particular data mining technique: classification tree,
discriminant analysis, and the like.
Terminology (contd.) - 3
Model: An algorithm as applied to a dataset, complete with
its settings.
Training Data: The portion of the data used to fit a model.
Validation Data: The portion of the data used to assess how
well the model fits, to adjust models, and to select the best
model from among those that have been tried.
Test Data: The portion of the data used only at the end of the
model building and selection process to assess how well the
final model might perform on new data.
Score: A predicted value or class. Scoring new data means
using a model developed with training data to predict output
values in new data.
Once you have installed the software successfully, you will see an
additional tab when you open Excel – Analytic Solver
Once you have installed the software successfully, you will also see
a second additional tab – Data Mining
Why so many different methods to build a model?
Each method has its advantages and disadvantages.
Usefulness of a method can depend on factors such as
size of the dataset
types of patterns that exist in the data
whether the data meets some underlying assumptions of the method
how noisy the data is
particular goal of the analysis
Different methods can lead to different results, and their performance
can vary.
It is therefore customary in data mining to apply several different
methods and select the one that appears most useful for the goal at hand.
Core Ideas in Data Mining
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Supervised and Unsupervised Learning
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Training data, where target value is known
Score new data where the value is not known (use a
model developed with training data to predict output
values in new data)
Methods: Classification and Prediction
Unsupervised Learning
Goal: Segment data into meaningful segments or
clusters; detect patterns
There is no target (outcome) variable to predict or
classify
Methods: Association rules, data reduction &
exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud,
creditworthy/not creditworthy
Each row is a case (customer, tax return, applicant)
Each column is a variable (age, marital status,
income)
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return,
applicant)
Each column is a variable (age, marital status,
income)
Taken together, classification and prediction
constitute “predictive analytics”
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show
you bought X, you may also like Y”
Also called “affinity analysis”
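As a concrete illustration, here is a minimal sketch of association rules in Python using the mlxtend library (the course software is Analytic Solver/XLMiner; this code is just an alternative for readers who prefer Python). The transaction table is made up for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction; each column flags whether an item was bought.
transactions = pd.DataFrame({
    "milk":   [1, 1, 0, 1, 0],
    "bread":  [1, 1, 1, 1, 0],
    "butter": [0, 1, 0, 1, 1],
}).astype(bool)

frequent = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```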
Unsupervised: Data Reduction
Distillation of complex/large data into
simpler/smaller data
Reducing the number of variables/columns (e.g.,
principal component analysis)
Reducing the number of records/rows (e.g.,
clustering)
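A minimal scikit-learn sketch of both reduction ideas, on made-up data (the slides use XLMiner; this is an illustrative alternative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # 100 records, 10 variables

# Reduce columns: keep 2 principal components instead of 10 variables.
X_reduced = PCA(n_components=2).fit_transform(X)

# Reduce rows: represent the 100 records by 5 cluster centroids.
centroids = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X).cluster_centers_
print(X_reduced.shape, centroids.shape)   # (100, 2) (5, 10)
```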
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
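For readers working in Python rather than XLMiner, a short matplotlib sketch of these plot types, on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
age = rng.normal(45, 15, 200)
income = 1000 * age + rng.normal(0, 10000, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(age)                   # distribution of one variable
axes[1].boxplot(income)             # spread and outliers
axes[2].scatter(age, income, s=10)  # relationship between a pair of variables
axes[2].set_xlabel("age")
axes[2].set_ylabel("income")
plt.tight_layout()
plt.show()
```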
Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of Data Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural
networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
9. Deploy best model
Obtaining Data: Sampling
Data mining typically deals with huge databases
Algorithms and models are typically applied to a
sample from a database, to produce statistically
valid results
XLMiner, e.g., limits the “training” partition to
10,000 records
Once you develop and select a final model, you use
it to “score” the observations in the larger database
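In Python, the same workflow might look like the sketch below; "large_db.csv", the predictor column list, and "final_model" are hypothetical placeholders, and the 10,000-record sample mirrors the XLMiner limit mentioned above.

```python
import pandas as pd

db = pd.read_csv("large_db.csv")                    # the full, large database (hypothetical file)
train_sample = db.sample(n=10_000, random_state=1)  # develop models on this sample
# ...fit and select a final model on train_sample, then score everything:
# db["prediction"] = final_model.predict(db[predictor_columns])
```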
When you are ready to partition the data in Analytic Solver, invoke
the partition utility under the Data Mining tab and specify the
partitioning options
Rare event oversampling
Often the event of interest is rare
Examples: response to a mailing, tax fraud
Sampling may yield too few “interesting” cases to
effectively train a model
A popular solution: oversample the rare cases to
obtain a more balanced training set
Later, need to adjust results for the oversampling
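A minimal pandas sketch of this idea, assuming a hypothetical binary column "response" where 1 is the rare class:

```python
import pandas as pd

def balance_by_oversampling(df: pd.DataFrame, target: str = "response",
                            seed: int = 1) -> pd.DataFrame:
    rare = df[df[target] == 1]       # the rare "interesting" cases
    common = df[df[target] == 0]
    # Sample the rare cases with replacement until the classes are balanced.
    rare_up = rare.sample(n=len(common), replace=True, random_state=seed)
    # Concatenate and shuffle to form the balanced training set.
    return pd.concat([common, rare_up]).sample(frac=1, random_state=seed)
```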
Pre-processing Data
Types of Variables
Variable types determine the type of pre-processing
needed and the algorithms that can be used
Main distinction: Categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered, also called ordinal (low, medium, high)
Unordered, also called nominal (male, female)
Variable handling
Numeric
Most algorithms in XLMiner can handle numeric data
May occasionally need to “bin” into categories
Categorical
Naïve Bayes can use as-is
In most other algorithms, must create binary dummies
(number of dummies = number of categories – 1)
XLMiner has a utility to convert categorical variables to binary dummies
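A minimal pandas sketch of the dummy-coding rule above; drop_first=True yields (number of categories − 1) dummies, with the dropped level acting as the baseline:

```python
import pandas as pd

df = pd.DataFrame({"remodel": ["None", "Recent", "Old", "None"]})
dummies = pd.get_dummies(df["remodel"], prefix="remodel", drop_first=True)
print(dummies)   # columns remodel_Old and remodel_Recent; "None" is the baseline
```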
Detecting Outliers
An outlier is an observation that is “extreme”, being
distant from the rest of the data (definition of
“distant” is deliberately vague)
Outliers can have disproportionate influence on
models (a problem if it is spurious)
An important step in data pre-processing is
detecting outliers
Once detected, domain knowledge is required to
determine whether it is an error or truly extreme.
Detecting Outliers
In some contexts, finding outliers is the purpose of
the data mining exercise (airport security screening).
This is called “anomaly detection”.
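One common screening convention flags records more than 3 standard deviations from the mean. A minimal sketch follows; the cutoff and data are illustrative, and domain knowledge still decides what to do with flagged records:

```python
import pandas as pd

def flag_outliers(series: pd.Series, cutoff: float = 3.0) -> pd.Series:
    z = (series - series.mean()) / series.std()   # z-score of each record
    return z.abs() > cutoff

income = pd.Series([49, 56, 52, 51, 48, 53, 50, 55, 47, 54, 50, 950]) * 1000
print(income[flag_outliers(income)])              # flags only the 950,000 record
```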
Handling Missing Data
Most algorithms will not process records with
missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can
omit them
If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
If many records have missing values, omission is not
practical
Solution 2: Imputation
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-
missing) information
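Both solutions, sketched in pandas with made-up columns; median imputation is one reasonable substitute for a missing numeric value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 56, np.nan, 32],
                   "income": [49000, np.nan, 99000, 192000]})

dropped = df.dropna()                              # Solution 1: omit records
imputed = df.fillna(df.median(numeric_only=True))  # Solution 2: impute column medians
```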
Normalizing (Standardizing) Data
Used in some techniques when variables with the
largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by
standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting
minimum and dividing by the range
Useful when the data contain dummies and numeric
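Both normalizing functions, sketched with scikit-learn as an alternative to XLMiner (note that StandardScaler divides by the population rather than the sample standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25, 49000], [56, 156000], [65, 99000]], dtype=float)

z_scores = StandardScaler().fit_transform(X)  # subtract mean, divide by (population) std
zero_one = MinMaxScaler().fit_transform(X)    # subtract minimum, divide by the range
```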
Normalization - Problem 2.8 (Chapter 2)–Class Exercise
2.8 Normalize the data in Table 2.7, showing calculations.
Age Income ($)
25 49,000
56 156,000
65 99,000
32 192,000
41 39,000
49 57,000
Normalization - Problem 2.8 (Chapter 2)–Class Exercise
2.8 Normalize the data in Table 2.7, showing calculations.
Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

For normalizing age for observation #1, we calculate as below. Here age = 25:

(25 − mean) / std = (25 − 44.66667) / 14.97554 = −1.31325

To Do:
1. Similarly, normalize the other observations of the Age variable.
2. Then normalize all the observations of the Income variable.
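A quick pandas check of the calculation above (using the sample standard deviation, as the slide does):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 56, 65, 32, 41, 49],
                   "income": [49000, 156000, 99000, 192000, 39000, 57000]})
normalized = (df - df.mean()) / df.std()  # pandas .std() is the sample std
print(normalized.round(4))                # normalized age for record 1 is -1.3133
```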
Distance between records -Problem 2.9 (Chapter 2)
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences. For the first two records in Table 2.7, it is

√[(25 − 56)² + (49,000 − 156,000)²]

Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
Distance between records -Problem 2.9 (Chapter 2)–contd.
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences.

Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

Thus the distance between records 1 and 2 is

√[(25 − 56)² + (49,000 − 156,000)²] = √(961 + 11,449,000,000) ≈ 107,000.004

To Do:
1. Similarly, compute the distances between the other pairs of records.
2. Then compute the distances between the normalized values.
3. Indicate which two records are farthest apart in each scenario.

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
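A sketch of the computations for this exercise using scipy; pdist and squareform give the full pairwise distance matrix for the raw and the normalized table:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({"age": [25, 56, 65, 32, 41, 49],
                   "income": [49000, 156000, 99000, 192000, 39000, 57000]})

raw = squareform(pdist(df))                            # income dominates these distances
norm = squareform(pdist((df - df.mean()) / df.std()))  # both variables now carry weight
# Inspect each matrix to see which pair of records is farthest in each scenario;
# normalizing can change the answer because age now carries comparable weight.
```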
The Problem of Overfitting
Statistical models can produce highly complex
explanations of relationships between variables
The “fit” may be excellent
When used with new data, models of great
complexity do not do so well.
100% fit – not useful for new data
[Chart: Revenue (y-axis) vs. Expenditure (x-axis), illustrating a model that fits the training data 100%]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
Consequence: Deployed model will not work as well
as expected with completely new data.
Partitioning the Data
Problem: How well will our model
perform with new data?
Solution: Separate data into two parts
Training partition to develop the
model
Validation partition to implement the
model and evaluate its performance
on “new” data
Addresses the issue of overfitting
Test Partition
When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
Assessing multiple models on same
validation data can overfit validation data
Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
Solution: final selected model is applied
to a test partition to give unbiased
estimate of its performance on new data
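A minimal three-way partition in scikit-learn on made-up data; the 50/30/20 split is illustrative (XLMiner's partition utility plays the same role in the slides' workflow):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 4)),
                  columns=["x1", "x2", "x3", "y"])
train, rest = train_test_split(df, train_size=0.5, random_state=1)    # 50% training
valid, test = train_test_split(rest, train_size=0.6, random_state=1)  # 30% validation, 20% test
# Fit on train, tune and compare models on valid, touch test only once at the end.
```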
Example – Linear Regression
West Roxbury Housing Data
Transform Data – create dummies for the
categorical variable “REMODEL”
After Transforming Data – creating dummies
for the categorical variable “REMODEL”
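For comparison, a hedged Python sketch of the same example instead of Analytic Solver; the file name "WestRoxbury.csv" and the column names TOTAL_VALUE and REMODEL follow the textbook's dataset but should be verified against your copy:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

housing = pd.read_csv("WestRoxbury.csv")  # assumed file name
X = pd.get_dummies(housing.drop(columns=["TOTAL_VALUE"]),  # assumed target column
                   columns=["REMODEL"], drop_first=True)   # dummies for REMODEL
y = housing["TOTAL_VALUE"]
model = LinearRegression().fit(X, y)
print(model.coef_[:3])  # first few fitted coefficients
```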
Using Excel and Analytic Solver for
Data Mining
Excel is limited in data capacity
However, the training and validation of DM models
can be handled within the modest limits of Excel
and Analytic Solver
Models can then be used to score larger databases
Analytic Solver has functions for interacting with
various databases (taking samples from a database,
and scoring a database from a developed model)
Summary
Data Mining consists of supervised methods
(Classification & Prediction) and unsupervised
methods (Association Rules, Data Reduction, Data
Exploration & Visualization)
Before algorithms can be applied, data must be
characterized and pre-processed
To evaluate performance and to avoid overfitting,
data partitioning is used
Data mining methods are usually applied to a
sample from a large database, and then the best
model is used to score the entire database
Chapter Exercises
(Updated in Canvas)