GBUS515 – Business Intelligence and Information Systems
Chapters 1 & 2
Introduction and Overview of the Data Mining Process
Instructor – Dr. Sunita Goel
Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e
© Galit Shmueli and Peter Bruce 2010
Let’s get familiar with Terminology - 1
Predictor: A variable, usually denoted by X, used as an input into
a predictive model. Also called a feature, attribute, input
variable, independent variable, or from a database perspective, a
field.
Response: A variable, usually denoted by Y, which is the variable
being predicted in supervised learning; also called dependent
variable, output variable, target variable, or outcome variable.
Observation: The unit of analysis on which the measurements are
taken (a customer, a transaction, etc.); also
called instance, sample, example, case, record, pattern, or row. In
a spreadsheet or database table, each row typically represents a
record; each column, a variable or an attribute.
Terminology (contd.) - 2
Supervised Learning: The process of providing an algorithm
(logistic regression, regression tree, etc.) with records in
which an output variable of interest is known and the
algorithm “learns” how to predict this value with new records
where the output is unknown.
Unsupervised Learning: An analysis in which one attempts to
learn patterns in the data other than predicting an output
value of interest.
Success Class: The class of interest in a binary outcome (e.g.,
purchasers in the outcome purchase/no purchase).
Algorithm: A specific procedure used to implement a
particular data mining technique: classification tree,
discriminant analysis, and the like.
Terminology (contd.) - 3
Model: An algorithm as applied to a dataset, complete with
its settings.
Training Data: The portion of the data used to fit a model.
Validation Data: The portion of the data used to assess how
well the model fits, to adjust models, and to select the best
model from among those that have been tried.
Test Data: The portion of the data used only at the end of the
model building and selection process to assess how well the
final model might perform on new data.
Score: A predicted value or class. Scoring new data means
using a model developed with training data to predict output
values in new data.
Once you have installed the software successfully, you will see an
additional tab when you open Excel – Analytic Solver
Once you have installed the software successfully, you will also see
a second additional tab – Data Mining
Why so many different methods to build a model?
Each method has its advantages and disadvantages.
Usefulness of a method can depend on factors such as
size of the dataset
types of patterns that exist in the data
whether the data meets some underlying assumptions of the method
how noisy the data is
particular goal of the analysis
Different methods can lead to different results, and their performance
can vary.
It is therefore customary in data mining to apply several different
methods and select the one that appears most useful for the goal at hand.
Core Ideas in Data Mining
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Supervised and Unsupervised Learning
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Training data, where target value is known
Score new data where the value is not known (use a
model developed with training data to predict output
values in new data)
Methods: Classification and Prediction
Unsupervised Learning
Goal: Segment data into meaningful segments or
clusters; detect patterns
There is no target (outcome) variable to predict or
classify
Methods: Association rules, data reduction &
exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud,
creditworthy/not creditworthy
Each row is a case (customer, tax return, applicant)
Each column is a variable (age, marital status,
income)
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return,
applicant)
Each column is a variable (age, marital status,
income)
Taken together, classification and prediction
constitute “predictive analytics”
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show
you bought X, you may also like Y”
Also called “affinity analysis”
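As a concrete illustration, here is a minimal sketch of association rules in Python using the mlxtend library (the course software is Analytic Solver/XLMiner; this code is just an alternative for readers who prefer Python). The transaction table is made up for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction; each column flags whether an item was bought.
transactions = pd.DataFrame({
    "milk":   [1, 1, 0, 1, 0],
    "bread":  [1, 1, 1, 1, 0],
    "butter": [0, 1, 0, 1, 1],
}).astype(bool)

frequent = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```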
Unsupervised: Data Reduction
Distillation of complex/large data into
simpler/smaller data
Reducing the number of variables/columns (e.g.,
principal component analysis)
Reducing the number of records/rows (e.g.,
clustering)
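A minimal scikit-learn sketch of both reduction ideas, on made-up data (the slides use XLMiner; this is an illustrative alternative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # 100 records, 10 variables

# Reduce columns: keep 2 principal components instead of 10 variables.
X_reduced = PCA(n_components=2).fit_transform(X)

# Reduce rows: represent the 100 records by 5 cluster centroids.
centroids = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X).cluster_centers_
print(X_reduced.shape, centroids.shape)   # (100, 2) (5, 10)
```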
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
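For readers working in Python rather than XLMiner, a short matplotlib sketch of these plot types, on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
age = rng.normal(45, 15, 200)
income = 1000 * age + rng.normal(0, 10000, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(age)                   # distribution of one variable
axes[1].boxplot(income)             # spread and outliers
axes[2].scatter(age, income, s=10)  # relationship between a pair of variables
axes[2].set_xlabel("age")
axes[2].set_ylabel("income")
plt.tight_layout()
plt.show()
```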
Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of Data Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural
networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
9. Deploy best model
Obtaining Data: Sampling
Data mining typically deals with huge databases
Algorithms and models are typically applied to a
sample from a database, to produce statistically
valid results
XLMiner, e.g., limits the “training” partition to
10,000 records
Once you develop and select a final model, you use
it to “score” the observations in the larger database
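In Python, the same workflow might look like the sketch below; "large_db.csv", the predictor column list, and "final_model" are hypothetical placeholders, and the 10,000-record sample mirrors the XLMiner limit mentioned above.

```python
import pandas as pd

db = pd.read_csv("large_db.csv")                    # the full, large database (hypothetical file)
train_sample = db.sample(n=10_000, random_state=1)  # develop models on this sample
# ...fit and select a final model on train_sample, then score everything:
# db["prediction"] = final_model.predict(db[predictor_columns])
```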
When you are ready to partition the data in Analytic Solver, invoke
the partition utility under the Data Mining tab and specify the
partitioning options
Rare event oversampling
Often the event of interest is rare
Examples: response to a mailing, tax fraud
Sampling may yield too few “interesting” cases to
effectively train a model
A popular solution: oversample the rare cases to
obtain a more balanced training set
Later, need to adjust results for the oversampling
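A minimal pandas sketch of this idea, assuming a hypothetical binary column "response" where 1 is the rare class:

```python
import pandas as pd

def balance_by_oversampling(df: pd.DataFrame, target: str = "response",
                            seed: int = 1) -> pd.DataFrame:
    rare = df[df[target] == 1]       # the rare "interesting" cases
    common = df[df[target] == 0]
    # Sample the rare cases with replacement until the classes are balanced.
    rare_up = rare.sample(n=len(common), replace=True, random_state=seed)
    # Concatenate and shuffle to form the balanced training set.
    return pd.concat([common, rare_up]).sample(frac=1, random_state=seed)
```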
Pre-processing Data
Types of Variables
Variable types determine the type of pre-processing
needed and the algorithms that can be used
Main distinction: Categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered, also called ordinal (low, medium, high)
Unordered, also called nominal (male, female)
Variable handling
Numeric
Most algorithms in XLMiner can handle numeric data
May occasionally need to “bin” into categories
Categorical
Naïve Bayes can use as-is
In most other algorithms, must create binary dummies
(number of dummies = number of categories – 1)
XLMiner has a utility to convert categorical variables to binary dummies
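A minimal pandas sketch of the dummy-coding rule above; drop_first=True yields (number of categories − 1) dummies, with the dropped level acting as the baseline:

```python
import pandas as pd

df = pd.DataFrame({"remodel": ["None", "Recent", "Old", "None"]})
dummies = pd.get_dummies(df["remodel"], prefix="remodel", drop_first=True)
print(dummies)   # columns remodel_Old and remodel_Recent; "None" is the baseline
```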
Detecting Outliers
An outlier is an observation that is “extreme”, being
distant from the rest of the data (definition of
“distant” is deliberately vague)
Outliers can have disproportionate influence on
models (a problem if it is spurious)
An important step in data pre-processing is
detecting outliers
Once detected, domain knowledge is required to
determine whether it is an error or truly extreme.
Detecting Outliers
In some contexts, finding outliers is the purpose of
the data mining exercise (airport security screening).
This is called “anomaly detection”.
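One common screening convention flags records more than 3 standard deviations from the mean. A minimal sketch follows; the cutoff and data are illustrative, and domain knowledge still decides what to do with flagged records:

```python
import pandas as pd

def flag_outliers(series: pd.Series, cutoff: float = 3.0) -> pd.Series:
    z = (series - series.mean()) / series.std()   # z-score of each record
    return z.abs() > cutoff

income = pd.Series([49, 56, 52, 51, 48, 53, 50, 55, 47, 54, 50, 950]) * 1000
print(income[flag_outliers(income)])              # flags only the 950,000 record
```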
Handling Missing Data
Most algorithms will not process records with
missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can
omit them
If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
If many records have missing values, omission is not
practical
Solution 2: Imputation
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-
missing) information
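Both solutions, sketched in pandas with made-up columns; median imputation is one reasonable substitute for a missing numeric value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 56, np.nan, 32],
                   "income": [49000, np.nan, 99000, 192000]})

dropped = df.dropna()                              # Solution 1: omit records
imputed = df.fillna(df.median(numeric_only=True))  # Solution 2: impute column medians
```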
Normalizing (Standardizing) Data
Used in some techniques when variables with the
largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by
standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting
minimum and dividing by the range
Useful when the data contain dummies and numeric
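Both normalizing functions, sketched with scikit-learn as an alternative to XLMiner (note that StandardScaler divides by the population rather than the sample standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25, 49000], [56, 156000], [65, 99000]], dtype=float)

z_scores = StandardScaler().fit_transform(X)  # subtract mean, divide by (population) std
zero_one = MinMaxScaler().fit_transform(X)    # subtract minimum, divide by the range
```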
Normalization - Problem 2.8 (Chapter 2)–Class Exercise
2.8 Normalize the data in Table 2.7, showing calculations.
Age Income ($)
25 49,000
56 156,000
65 99,000
32 192,000
41 39,000
49 57,000
Normalization - Problem 2.8 (Chapter 2)–Class Exercise
2.8 Normalize the data in Table 2.7, showing calculations.
Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

For normalizing age for observation #1, we calculate as below. Here age = 25:

(25 − mean) / std = (25 − 44.66667) / 14.97554 = −1.31325

To Do:
1. Similarly, normalize the other observations of the Age variable.
2. Then normalize all the observations of the Income variable.
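A quick pandas check of the calculation above (using the sample standard deviation, as the slide does):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 56, 65, 32, 41, 49],
                   "income": [49000, 156000, 99000, 192000, 39000, 57000]})
normalized = (df - df.mean()) / df.std()  # pandas .std() is the sample std
print(normalized.round(4))                # normalized age for record 1 is -1.3133
```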
Distance between records -Problem 2.9 (Chapter 2)
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences. For the first two records in Table 2.7, it is

√[(25 − 56)² + (49,000 − 156,000)²]

Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
Distance between records -Problem 2.9 (Chapter 2)–contd.
2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the
square root of the sum of the squared differences.

Age  Income ($)
25   49,000
56   156,000
65   99,000
32   192,000
41   39,000
49   57,000

Thus the distance between records 1 and 2 is

√[(25 − 56)² + (49,000 − 156,000)²] = √(961 + 11,449,000,000) ≈ 107,000.004

To Do:
1. Similarly, compute the distances between the other pairs of records.
2. Then compute the distances between the normalized values.
3. Indicate which two records are farthest apart in each scenario.

Can normalizing the data change which two records are farthest from each
other in terms of Euclidean distance?
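A sketch of the computations for this exercise using scipy; pdist and squareform give the full pairwise distance matrix for the raw and the normalized table:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({"age": [25, 56, 65, 32, 41, 49],
                   "income": [49000, 156000, 99000, 192000, 39000, 57000]})

raw = squareform(pdist(df))                            # income dominates these distances
norm = squareform(pdist((df - df.mean()) / df.std()))  # both variables now carry weight
# Inspect each matrix to see which pair of records is farthest in each scenario;
# normalizing can change the answer because age now carries comparable weight.
```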
The Problem of Overfitting
Statistical models can produce highly complex
explanations of relationships between variables
The “fit” may be excellent
When used with new data, models of great
complexity do not do so well.
100% fit – not useful for new data
[Chart: Revenue (y-axis) vs. Expenditure (x-axis), illustrating a model that fits the training data 100%]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
Consequence: Deployed model will not work as well
as expected with completely new data.
Partitioning the Data
Problem: How well will our model
perform with new data?
Solution: Separate data into two parts
Training partition to develop the
model
Validation partition to implement the
model and evaluate its performance
on “new” data
Addresses the issue of overfitting
Test Partition
When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
Assessing multiple models on same
validation data can overfit validation data
Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
Solution: final selected model is applied
to a test partition to give unbiased
estimate of its performance on new data
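A minimal three-way partition in scikit-learn on made-up data; the 50/30/20 split is illustrative (XLMiner's partition utility plays the same role in the slides' workflow):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 4)),
                  columns=["x1", "x2", "x3", "y"])
train, rest = train_test_split(df, train_size=0.5, random_state=1)    # 50% training
valid, test = train_test_split(rest, train_size=0.6, random_state=1)  # 30% validation, 20% test
# Fit on train, tune and compare models on valid, touch test only once at the end.
```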
Example – Linear Regression
West Roxbury Housing Data
Transform Data – create dummies for the
categorical variable “REMODEL”
After Transforming Data – creating dummies
for the categorical variable “REMODEL”
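For comparison, a hedged Python sketch of the same example instead of Analytic Solver; the file name "WestRoxbury.csv" and the column names TOTAL_VALUE and REMODEL follow the textbook's dataset but should be verified against your copy:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

housing = pd.read_csv("WestRoxbury.csv")  # assumed file name
X = pd.get_dummies(housing.drop(columns=["TOTAL_VALUE"]),  # assumed target column
                   columns=["REMODEL"], drop_first=True)   # dummies for REMODEL
y = housing["TOTAL_VALUE"]
model = LinearRegression().fit(X, y)
print(model.coef_[:3])  # first few fitted coefficients
```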
Using Excel and Analytic Solver for
Data Mining
Excel is limited in data capacity
However, the training and validation of DM models
can be handled within the modest limits of Excel
and Analytic Solver
Models can then be used to score larger databases
Analytic Solver has functions for interacting with
various databases (taking samples from a database,
and scoring a database from a developed model)
Summary
Data Mining consists of supervised methods
(Classification & Prediction) and unsupervised
methods (Association Rules, Data Reduction, Data
Exploration & Visualization)
Before algorithms can be applied, data must be
characterized and pre-processed
To evaluate performance and to avoid overfitting,
data partitioning is used
Data mining methods are usually applied to a
sample from a large database, and then the best
model is used to score the entire database
Chapter Exercises
(Updated in Canvas)