
GBUS515 – Business Intelligence and Information Systems

Chapters 1 & 2

Introduction and Overview of the Data Mining Process

Instructor – Dr. Sunita Goel


Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e

© Galit Shmueli and Peter Bruce 2010


Let’s get familiar with Terminology - 1
— Predictor: A variable, usually denoted by X, used as an input into
a predictive model. Also called a feature, attribute, input
variable, independent variable, or from a database perspective, a
field.
— Response: A variable, usually denoted by Y, which is the variable
being predicted in supervised learning; also called dependent
variable, output variable, target variable, or outcome variable.
— Observation: The unit of analysis on which the measurements are
taken (a customer, a transaction, etc.); also called instance, sample,
example, case, record, pattern, or row. In spreadsheets or database
tables, each row typically represents a record; each column, a variable
or an attribute.
Terminology (contd.) - 2
— Supervised Learning: The process of providing an algorithm
(logistic regression, regression tree, etc.) with records in
which an output variable of interest is known and the
algorithm “learns” how to predict this value with new records
where the output is unknown.
— Unsupervised Learning: An analysis in which one attempts to
learn patterns in the data other than predicting an output
value of interest.
— Success Class: The class of interest in a binary outcome (e.g.,
purchasers in the outcome purchase/no purchase).
— Algorithm: A specific procedure used to implement a
particular data mining technique: classification tree,
discriminant analysis, and the like.
Terminology (contd.) - 3
— Model: An algorithm as applied to a dataset, complete with
its settings.
— Training Data: The portion of the data used to fit a model.
— Validation Data: The portion of the data used to assess how
well the model fits, to adjust models, and to select the best
model from among those that have been tried.
— Test Data: The portion of the data used only at the end of the
model building and selection process to assess how well the
final model might perform on new data.
— Score: A predicted value or class. Scoring new data means
using a model developed with training data to predict output
values in new data.
Once you have installed the software successfully, you will see an
additional tab when you open Excel: Analytic Solver.
You will also see a second additional tab: Data Mining.
Why so many different methods to
build a model?
— Each method has its advantages and disadvantages.
— Usefulness of a method can depend on factors such as
— size of the dataset
— types of patterns that exist in the data
— whether the data meets some underlying assumptions of the method
— how noisy the data is
— particular goal of the analysis
— Different methods can lead to different results, and their performance
can vary.
— It is therefore customary in data mining to apply several different
methods and select the one that appears most useful for the goal at hand.
Core Ideas in Data Mining

— Classification
— Prediction
— Association Rules
— Data Reduction
— Data Exploration
— Visualization
— Supervised and Unsupervised Learning
Supervised Learning
— Goal: Predict a single “target” or “outcome” variable
— Training data, where target value is known
— Score to data where value is not known (use a model developed with training data to predict output values in new data)
— Methods: Classification and Prediction

Unsupervised Learning

— Goal: Segment data into meaningful segments or clusters; detect patterns
— There is no target (outcome) variable to predict or classify
— Methods: Association rules, data reduction & exploration, visualization
Supervised: Classification

— Goal: Predict categorical target (outcome) variable
— Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
— Each row is a case (customer, tax return, applicant)
— Each column is a variable (age, marital status,
income)
— Target variable is often binary (yes/no)
Supervised: Prediction
— Goal: Predict numerical target (outcome) variable
— Examples: sales, revenue, performance
— As in classification:
— Each row is a case (customer, tax return,
applicant)
— Each column is a variable (age, marital status,
income)
— Taken together, classification and prediction
constitute “predictive analytics”
Unsupervised: Association Rules
— Goal: Produce rules that define “what goes with what”
— Example: “If X was purchased, Y was also purchased”
— Rows are transactions
— Used in recommender systems – “Our records show
you bought X, you may also like Y”
— Also called “affinity analysis”
Unsupervised: Data Reduction
— Distillation of complex/large data into
simpler/smaller data
— Reducing the number of variables/columns (e.g.,
principal component analysis)
— Reducing the number of records/rows (e.g.,
clustering)
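
A minimal sketch of these two ideas in Python (not part of the course materials, which use Excel/Analytic Solver); the small two-variable table is invented for illustration, and scikit-learn supplies PCA and k-means clustering.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Invented two-variable data set for illustration
    df = pd.DataFrame({"age": [25, 56, 65, 32, 41, 49],
                       "income": [49000, 156000, 99000, 192000, 39000, 57000]})
    X = StandardScaler().fit_transform(df)     # put both variables on the same scale

    # Fewer columns: keep only the first principal component
    pc1 = PCA(n_components=1).fit_transform(X)

    # Fewer rows: summarize the six records by two cluster centers
    centers = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X).cluster_centers_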
Unsupervised: Data Visualization
— Graphs and plots of data
— Histograms, boxplots, bar charts, scatterplots
— Especially useful to examine relationships between pairs of variables
Data Exploration

— Data sets are typically large, complex & messy
— Need to review the data to help refine the task
— Use techniques of Data Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural
networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
9. Deploy best model
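
As a rough illustration only, the nine steps might be sketched as the following skeleton in Python rather than Analytic Solver; the file name housing.csv and the target column PRICE are hypothetical, and the predictors are assumed to be numeric.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    data = pd.read_csv("housing.csv")                      # 2. obtain data (hypothetical file)
    data = data.dropna()                                   # 3. explore, clean, pre-process
    sample = data.sample(n=min(10000, len(data)), random_state=1)  # sample a large database

    X = sample.drop(columns="PRICE")                       # predictors (assumed all numeric)
    y = sample["PRICE"]                                    # hypothetical target column
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.4, random_state=1)               # 4. partition (60% train / 40% valid)

    model = LinearRegression().fit(X_train, y_train)       # 6-7. choose a technique and fit it
    rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5  # 8. assess on validation
    scores = model.predict(data.drop(columns="PRICE"))     # 9. deploy: score the full database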
Obtaining Data: Sampling

— Data mining typically deals with huge databases
— Algorithms and models are typically applied to a sample from a database, to produce statistically valid results
— XLMiner, e.g., limits the “training” partition to 10,000 records
— Once you develop and select a final model, you use it to “score” the observations in the larger database
When you are ready to partition the data in Analytic Solver, invoke partitioning under the Data Mining tab and specify the partition settings.
Rare event oversampling

— Often the event of interest is rare
— Examples: response to a mailing, fraud in tax returns
— Sampling may yield too few “interesting” cases to effectively train a model
— A popular solution: oversample the rare cases to obtain a more balanced training set
— Later, need to adjust results for the oversampling
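
A minimal sketch of oversampling with pandas, assuming a toy training set with a 0/1 FRAUD column (the column name and values are invented for illustration).

    import pandas as pd

    # Toy training set: FRAUD = 1 is the rare event of interest
    train = pd.DataFrame({"amount": [20, 35, 15, 500, 40, 25, 30, 610],
                          "FRAUD":  [0,  0,  0,  1,   0,  0,  0,  1]})

    rare   = train[train["FRAUD"] == 1]
    common = train[train["FRAUD"] == 0]

    # Sample the rare cases with replacement until the two classes are balanced
    rare_up  = rare.sample(n=len(common), replace=True, random_state=1)
    balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)
    # Results on validation/test data must later be adjusted back to the true class ratio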
Pre-processing Data
Types of Variables
— Variable types determine the pre-processing needed and the
algorithms that can be used
— Main distinction: Categorical vs. numeric
— Numeric
— Continuous
— Integer
— Categorical
— Ordered, also called ordinal (low, medium, high)
— Unordered, also called nominal (male, female)
Variable handling
— Numeric
— Most algorithms in XLMiner can handle numeric data
— May occasionally need to “bin” into categories

— Categorical
— Naïve Bayes can use as-is
— In most other algorithms, must create binary dummies (number of dummies = number of categories – 1)
— XLMiner has a utility to convert categorical variables to binary dummies
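
For illustration, the same conversion can be sketched with pandas (a stand-in for XLMiner's utility, not the utility itself); the education variable and its three categories are invented.

    import pandas as pd

    df = pd.DataFrame({"education": ["low", "medium", "high", "medium", "low"]})

    # drop_first=True yields (number of categories - 1) = 2 binary dummies
    dummies = pd.get_dummies(df["education"], prefix="education", drop_first=True)
    df = pd.concat([df.drop(columns="education"), dummies], axis=1)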
Detecting Outliers

— An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
— Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
— An important step in data pre-processing is detecting outliers
— Once detected, domain knowledge is required to determine if it is an error, or truly extreme
Detecting Outliers
— In some contexts, finding outliers is the purpose of
the data mining exercise (airport security screening).
This is called “anomaly detection”.
Handling Missing Data
— Most algorithms will not process records with
missing values. Default is to drop those records.
— Solution 1: Omission
— If a small number of records have missing values, can
omit them
— If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
— If many records have missing values, omission is not
practical
— Solution 2: Imputation
— Replace missing values with reasonable substitutes
— Lets you keep the record and use the rest of its (non-missing) information
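
A minimal sketch of both options with pandas; the small data set with missing values is invented for illustration.

    import pandas as pd

    df = pd.DataFrame({"age":    [25, 56, None, 32, 41, 49],
                       "income": [49000, None, 99000, 192000, 39000, 57000]})

    # Solution 1: omission - drop any record with a missing value
    dropped = df.dropna()

    # Solution 2: imputation - substitute a reasonable value (here, each column's
    # median) so the rest of the record's information can still be used
    imputed = df.fillna(df.median())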
Normalizing (Standardizing) Data
— Used in some techniques when variables with the
largest scales would dominate and skew results
— Puts all variables on same scale
— Normalizing function: Subtract mean and divide by
standard deviation (used in XLMiner)
— Alternative function: scale to 0-1 by subtracting
minimum and dividing by the range
— Useful when the data contain dummies and numeric
Normalization - Problem 2.8 (Chapter 2)–Class Exercise

2.8 Normalize the data in Table 2.7, showing calculations.

Age   Income ($)
25    49,000
56    156,000
65    99,000
32    192,000
41    39,000
49    57,000
Normalization - Problem 2.8 (Chapter 2)–Class Exercise

2.8 Normalize the data in Table 2.7, showing calculations.

Age   Income ($)
25    49,000
56    156,000
65    99,000
32    192,000
41    39,000
49    57,000

For normalizing Age for observation #1 (Age = 25), we calculate:

(25 - mean) / std = (25 - 44.66667) / 14.97554 = -1.31325

To Do:
1. Similarly normalize the other observations of the Age variable.
2. Then normalize all the observations of the Income variable.
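
The same arithmetic can be checked with a short pandas sketch (pandas' .std() is the sample standard deviation, matching the 14.97554 used above); this is only a check on the class exercise, not part of the assigned solution.

    import pandas as pd

    t = pd.DataFrame({"Age":    [25, 56, 65, 32, 41, 49],
                      "Income": [49000, 156000, 99000, 192000, 39000, 57000]})

    # Subtract each column's mean and divide by its standard deviation
    z = (t - t.mean()) / t.std()
    print(z.loc[0, "Age"])        # about -1.313, as in the calculation above

    # Alternative: scale to 0-1 by subtracting the minimum and dividing by the range
    minmax = (t - t.min()) / (t.max() - t.min())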
Distance between records -Problem 2.9 (Chapter 2)
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences. For the first two records in Table 2.7, it is

√[(25 - 56)² + (49,000 - 156,000)²]

Age   Income ($)
25    49,000
56    156,000
65    99,000
32    192,000
41    39,000
49    57,000

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
Distance between records -Problem 2.9 (Chapter 2)–contd.
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences. For the first two records in Table 2.7, it is

Age   Income ($)
25    49,000
56    156,000
65    99,000
32    192,000
41    39,000
49    57,000

Thus the distance between records 1 and 2 is
√[(25 - 56)² + (49,000 - 156,000)²] = √(961 + 11,449,000,000) ≈ 107,000.00
To Do:
1. Similarly compute the distances between the other pairs of records.
2. Then compute the distances between the normalized values.
3. Indicate which two records are farthest apart in each scenario.

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
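
A short sketch of how the comparison could be carried out (Python with SciPy is assumed here purely for illustration; the exercise itself can be done in Excel).

    import pandas as pd
    from scipy.spatial.distance import pdist, squareform

    t = pd.DataFrame({"Age":    [25, 56, 65, 32, 41, 49],
                      "Income": [49000, 156000, 99000, 192000, 39000, 57000]})

    raw_dist = squareform(pdist(t))            # raw scale: Income dominates the distances
    z = (t - t.mean()) / t.std()               # z-score normalization
    norm_dist = squareform(pdist(z))           # both variables now weighted equally

    print(round(raw_dist[0, 1], 2))            # records 1 and 2: about 107000.0
    # Compare raw_dist and norm_dist to see whether the farthest pair changes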
The Problem of Overfitting

— Statistical models can produce highly complex explanations of relationships between variables
— The “fit” may be excellent
— When used with new data, models of great complexity do not do so well
100% fit – not useful for new data
[Figure: plot of Revenue (0–1,600) against Expenditure (0–1,000), illustrating a model that fits the training data 100%.]
Overfitting (cont.)
Causes:
— Too many predictors
— A model with too many parameters
— Trying many different models

Consequence: Deployed model will not work as well as expected with completely new data.
Partitioning the Data
Problem: How well will our model perform with new data?

Solution: Separate data into two parts
— Training partition to develop the model
— Validation partition to implement the model and evaluate its performance on “new” data

Addresses the issue of overfitting


Test Partition
— When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
— Assessing multiple models on same
validation data can overfit validation data
— Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
— Solution: final selected model is applied
to a test partition to give unbiased
estimate of its performance on new data
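
A minimal sketch of a three-way partition (the 50/30/20 split sizes are arbitrary and only for illustration), using scikit-learn's train_test_split on a made-up data frame.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Made-up data frame with two predictors and a target
    df = pd.DataFrame({"x1": range(100), "x2": range(100, 200), "y": range(200, 300)})

    # First split off 50% for training, then divide the rest into validation and test
    train, rest = train_test_split(df, test_size=0.5, random_state=1)
    valid, test = train_test_split(rest, test_size=0.4, random_state=1)
    # Result: roughly 50% training, 30% validation, 20% test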
Example – Linear Regression
West Roxbury Housing Data
Transform Data – create dummies for the
categorical variable “REMODEL”
After Transforming Data – creating dummies
for the categorical variable “REMODEL”
Using Excel and Analytic Solver for
Data Mining
— Excel is limited in data capacity
— However, the training and validation of DM models
can be handled within the modest limits of Excel
and Analytic Solver
— Models can then be used to score larger databases
— Analytic Solver has functions for interacting with
various databases (taking samples from a database,
and scoring a database from a developed model)
Summary
— Data Mining consists of supervised methods
(Classification & Prediction) and unsupervised
methods (Association Rules, Data Reduction, Data
Exploration & Visualization)
— Before algorithms can be applied, data must be
characterized and pre-processed
— To evaluate performance and to avoid overfitting,
data partitioning is used
— Data mining methods are usually applied to a
sample from a large database, and then the best
model is used to score the entire database
Chapter Exercises
(Updated in Canvas)
