Data Mining Process: Dr. Gaurav Dixit

The document outlines the typical phases in a data mining process: 1) discovery, 2) data preparation, 3) data exploration and conditioning, 4) model planning, 5) model building, 6) results interpretation, and 7) model deployment. It also discusses key concepts in data mining including supervised vs unsupervised learning, sampling techniques, handling outliers and missing values, and issues like overfitting.

Uploaded by

Parmarth Khanna


DATA MINING PROCESS

LECTURE 02
DR. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES


• Phases in a typical Data Mining effort:

1. Discovery
Frame business problem
Identify analytics component
Formulate initial hypotheses

2. Data Preparation
Obtain dataset from internal and external sources
Data consistency checks on field definitions, units of measurement, time
periods, etc.
Sample

3. Data Exploration and Conditioning
Missing data handling, Range reasonability, Outliers
Graphical or Visual Analysis
Transformation, Creation of new variables, and Normalization
Partitioning into Training, Validation, and Test datasets
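
The partitioning step can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the 60/20/20 proportions, the fixed seed, and the helper name `partition` are all assumptions made for the example.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle records and split them into training, validation, and test lists.

    The 60/20/20 proportions and the seed are illustrative assumptions.
    """
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps each partition a random subset, so no partition inherits an ordering present in the source data.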

4. Model Planning
Determine data mining task such as prediction, classification etc.
Select appropriate data mining methods and techniques such as regression, neural
networks, clustering etc.
5. Model Building
Build different candidate models on the training data using the selected
techniques and their variants
Refine and select the final model using validation data
Evaluate the final model on test data
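
The three roles of the partitions in model building can be sketched as below. Everything specific here is an assumption for illustration: the synthetic data (y = 2x plus noise), the choice of k-nearest-neighbour regression as the technique, and the candidate values of k.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    """Mean squared error over data of the k-NN model built on train."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

# Synthetic data (an assumption for illustration): y = 2x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
points = [(x, 2 * x + rng.gauss(0, 1)) for x in xs]
train, valid, test = points[:180], points[180:240], points[240:]

# Candidate models: variants of one technique with different settings of k
candidates = [1, 5, 15]
valid_errors = {k: mse(train, valid, k) for k in candidates}
best_k = min(valid_errors, key=valid_errors.get)   # select on validation data
test_error = mse(train, test, best_k)              # evaluate once on test data
print(valid_errors, best_k, test_error)
```

The test partition is touched only once, after the final model is chosen, so its error is not biased by the selection step.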

6. Results Interpretation
Model evaluation using key performance metrics
7. Model Deployment
Pilot project to integrate and run the model on operational systems

• Similar data mining methodologies include SEMMA, developed by SAS, and
CRISP-DM, associated with IBM SPSS Modeler (formerly SPSS Clementine)


• Data mining techniques can be divided into Supervised Learning Methods
and Unsupervised Learning Methods
• Supervised Learning
– In supervised learning, algorithms are used to learn the function ‘f’
that can map input variables (X) into output variables (Y)
Y = f(X)
– The idea is to approximate ‘f’ such that new data on input variables (X) can
predict the output variables (Y) with minimum possible error (ε)
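
As a small illustration of approximating ‘f’, the sketch below fits a straight line by ordinary least squares. The linear form, the five data points, and the names `fit_line` and `f_hat` are assumptions chosen for the example; the lecture does not prescribe a particular method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

# Hypothetical observations of input X and output Y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)

def f_hat(x):
    """Learned approximation of f."""
    return a + b * x

errors = [y - f_hat(x) for x, y in zip(xs, ys)]   # residuals on the known data
print(a, b, f_hat(6.0))                           # prediction for a new input
```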


• Supervised Learning problems can be grouped into prediction and
classification problems
• Unsupervised Learning
– In Unsupervised Learning, algorithms are used to learn the underlying
structure or patterns hidden in the data
• Unsupervised Learning problems can be grouped into
clustering and association rule learning problems


• Target Population
– Subset of the population under study
– Results are generalized to the target population
• Sample
– Subset of the target population
• Simple Random Sampling
– A sampling method wherein each observation has an equal chance of
being selected


• Random Sampling
– A sampling method wherein each observation has a chance of being
selected, but not necessarily an equal chance
• Sampling with Replacement
– Sample values are independent
• Sampling without Replacement
– Sample values aren’t independent
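
The sampling schemes above can be sketched with Python's standard `random` module. The population of 20 observation IDs, the seed, and the weighting scheme are illustrative assumptions.

```python
import random

rng = random.Random(7)           # fixed seed, an illustrative choice
population = list(range(1, 21))  # 20 hypothetical observation IDs

# Simple random sampling without replacement: no ID can repeat
without_repl = rng.sample(population, k=5)

# Sampling with replacement: each draw is independent, so IDs may repeat
with_repl = rng.choices(population, k=5)

# Unequal selection chances: higher IDs are weighted more heavily
weights = list(population)
weighted = rng.choices(population, weights=weights, k=5)

print(without_repl, with_repl, weighted)
```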


• Sampling results in fewer observations than the total number of
observations in the dataset
• Data Mining algorithms
– Varying limitations on number of observations and variables
• Limitations due to computing power and storage capacity
• Limitations due to statistical software being used
• How many observations to build accurate models?


• Rare Event, e.g., low response rate in advertising by traditional mail
or email
– Oversampling of ‘success’ cases
– Arises mainly in classification tasks
– Costs of misclassification
• Asymmetric costs due to more importance of ‘success’ class
– Costs of failing to identify ‘success’ cases are generally more than costs
of detailed review of all cases
– Prediction of ‘success’ cases is likely to come at cost of misclassifying
more ‘failure’ cases as ‘success’ cases than usual
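
Oversampling of ‘success’ cases can be sketched as below. The 2% response rate, the field name `response`, the helper name `oversample`, and the choice to balance the classes 50/50 are all assumptions for illustration; other target ratios are possible.

```python
import random

def oversample(records, label_key="response", success=1, seed=0):
    """Duplicate randomly chosen 'success' records until the classes are balanced.

    Balancing to a 50/50 mix is an illustrative assumption.
    """
    rng = random.Random(seed)
    successes = [r for r in records if r[label_key] == success]
    failures = [r for r in records if r[label_key] != success]
    if not successes or len(successes) >= len(failures):
        return list(records)      # nothing to oversample
    extra = rng.choices(successes, k=len(failures) - len(successes))
    return failures + successes + extra

# Hypothetical mailing data with a 2% response rate
records = [{"id": i, "response": 1 if i % 50 == 0 else 0} for i in range(1000)]
balanced = oversample(records)
rate = sum(r["response"] for r in balanced) / len(balanced)
print(len(balanced), rate)
```

Oversampling of this kind is usually applied to the training partition only, so that performance is still assessed at the original response rate.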


• Dummy coding for categorical variables

– Some statistical software cannot use categorical variables expressed in
the label format
– Dummy binary variables (having 0’s and 1’s: 0 indicating ‘absence’ and
1 indicating ‘presence’) for different classes of categorical variables are
created
– For example, if ‘activity status’ of individuals can be put into four
mutually exclusive and jointly exhaustive classes as {student,
unemployed, employed, retired}, only three dummy variables would
be required
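
The ‘activity status’ example can be sketched as follows. Treating the first class (student) as the reference level, and the helper name `dummy_code` with its `is_` column prefix, are assumptions for illustration.

```python
def dummy_code(values, categories):
    """Create 0/1 dummy columns for every category except the first (the reference)."""
    reference, *others = categories
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        # The reference class is represented by all dummies being 0
        rows.append({f"is_{c}": int(value == c) for c in others})
    return rows

categories = ["student", "unemployed", "employed", "retired"]
statuses = ["employed", "student", "retired"]
for status, row in zip(statuses, dummy_code(statuses, categories)):
    print(status, row)
```

A fourth dummy would be redundant: once the three others are known, its value is fully determined.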

• Principle of Parsimony
– A model or theory with fewer assumptions and variables but with
high explanatory power is generally desirable
• A larger number of variables also increases the sample size
required to obtain reliable estimates
• Overfitting
– A model built using a complex function that fits the data perfectly
– Model ends up fitting the noise and explaining the chance variation


• Overfitting
– Too many iterations can result in excessive learning of the training data
– A larger number of variables in the model may lead to fitting spurious
relationships
• Sample Size
– Domain Knowledge
– General rule of thumb: 10 × p observations, where p is the number of
predictors
– For classification tasks: 6 × m × p observations, where m is the number of
classes in the outcome variable (Delmaster & Hancock, 2001)
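
The two rules of thumb reduce to simple arithmetic, sketched below. The function name `min_observations` and the example values of p and m are hypothetical.

```python
def min_observations(p, m=None):
    """Rule-of-thumb minimum sample size.

    p: number of predictors; m: number of outcome classes (classification only).
    """
    if m is None:
        return 10 * p          # general rule: 10 x p
    return 6 * m * p           # classification rule: 6 x m x p

print(min_observations(p=8))        # 8 predictors, prediction task: 80
print(min_observations(p=8, m=3))   # 8 predictors, 3 outcome classes: 144
```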


• Outliers
– A distant data point
– Valid point or erroneous value?
– Further review
• Manual Inspection (Sorting, minimum and maximum values, clustering etc.)
• Domain Knowledge
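
A first pass at outlier review can be sketched as below: check the minimum and maximum, look at the sorted extremes, and flag distant points for inspection. The interquartile-range rule with k = 1.5 is a common convention assumed here, not something the lecture specifies, and the `ages` data are made up.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] for further review."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 230]   # 230 may be a data-entry error
print(min(ages), max(ages))      # quick range check
print(sorted(ages)[-3:])         # largest values after sorting
print(flag_outliers(ages))       # candidates for manual inspection
```

Flagged points are only candidates: whether a value is valid or erroneous still depends on domain knowledge.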

• Missing Values
– If only a few records have missing values, they can be removed
– Imputation


• Missing Values
– Drop the variables having missing values
– Replace with proxy variable
• Normalization
– Standardization using z-score
– Min-max normalization
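
The two normalization methods can be sketched as below. Using the sample standard deviation for the z-score is an assumption (some tools use the population version), and the `incomes` values are made up. Both functions assume the values are not all identical.

```python
import statistics

def z_scores(values):
    """Standardize: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)     # sample standard deviation
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, 45000, 52000, 61000, 98000]
print(z_scores(incomes))
print(min_max(incomes))
```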

Key References

• Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data by EMC Education Services (2015)
• Data Mining for Business Intelligence: Concepts, Techniques,
and Applications in Microsoft Office Excel with XLMiner by
Shmueli, G., Patel, N. R., & Bruce, P. C. (2010)

Thanks…
