Data Mining Process: Dr. Gaurav Dixit
LECTURE 02
DR. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES
DATA MINING PROCESS
2. Data Preparation
• Obtain the dataset from internal and external sources
• Check data consistency: definitions of fields, units of measurement, time periods, etc.
• Sample
• Target Population
– Subset of the population under study
– Results are generalized to the target population
• Sample
– Subset of the target population
• Simple Random Sampling
– A sampling method wherein each observation has an equal chance of
being selected
• Random Sampling
– A sampling method wherein each observation does not necessarily
have an equal chance of being selected
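• Sampling with Replacement
– Each selected observation is returned to the pool before the next draw, so sample values are independent
• Sampling without Replacement
– Selected observations are not returned to the pool, so sample values aren't independent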
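The sampling schemes above can be sketched in a few lines of Python. This is only an illustration: the population of ten observation IDs, the seed, and the weights are made-up values, and the standard-library `random` module is just one convenient way to draw the samples.

```python
import random

population = list(range(1, 11))  # hypothetical population of 10 observation IDs
rng = random.Random(42)          # fixed seed so the sketch is repeatable

# Simple random sampling without replacement:
# every observation has an equal chance and can appear at most once.
without_replacement = rng.sample(population, k=5)

# Sampling with replacement:
# an observation can be drawn more than once; draws are independent.
with_replacement = rng.choices(population, k=5)

# Random sampling with unequal selection chances
# (assumed weights: observations 1 and 2 are five times as likely).
weights = [5 if obs <= 2 else 1 for obs in population]
weighted = rng.choices(population, weights=weights, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
print("unequal chances:    ", weighted)
```

Because `rng.sample` never repeats an observation, what remains in the pool depends on earlier draws, which is why values sampled without replacement are not independent.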
• Principle of Parsimony
– A model or theory with fewer assumptions and variables but with
high explanatory power is generally desirable
• More variables also increase the sample size required to obtain
reliable estimates
• Overfitting
– A model built using a complex function that fits the data perfectly
– Model ends up fitting the noise and explaining the chance variation
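A small experiment can make the idea of fitting the noise concrete. The sketch below is an assumption-laden illustration, not the lecture's own example: the data are simulated from a made-up linear relationship, and a 1-nearest-neighbour lookup stands in for "a complex function that fits the data perfectly".

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Hypothetical data: y = 2x + noise, so the true relationship is linear."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

train, test = make_data(50), make_data(50)

def fit_line(data):
    """Least-squares line with one predictor (the parsimonious model)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(data):
    """Overly flexible model: return the y of the nearest training x."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line, memorizer = fit_line(train), fit_memorizer(train)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name, "train MSE:", round(mse(model, train), 3),
          "test MSE:", round(mse(model, test), 3))
```

The memorizer reproduces every training value, so its training error is zero, yet that perfect fit includes the noise. On fresh data its error is typically larger than that of the simple line, which is the behaviour the term overfitting describes.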
• Overfitting (continued)
– Too many training iterations can result in excessive learning of the data
– Too many variables in the model may lead to fitting spurious
relationships
• Sample Size
– Domain Knowledge
– General rule of thumb: at least 10 × p observations, where p is the number of
predictors
– For classification tasks: at least 6 × m × p observations, where m is the number of
classes in the outcome variable (Delmaster & Hancock, 2001)
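The two rules of thumb are simple arithmetic, sketched below. The function names and the example values (8 predictors, 3 classes) are hypothetical and chosen only for illustration.

```python
def min_obs_general(p):
    """General rule of thumb: 10 observations per predictor."""
    return 10 * p

def min_obs_classification(m, p):
    """Classification rule of thumb: 6 x m x p observations
    for m outcome classes and p predictors."""
    return 6 * m * p

# Hypothetical study: 8 predictors, outcome with 3 classes.
print(min_obs_general(8))            # 80
print(min_obs_classification(3, 8))  # 144
```

These figures are lower bounds suggested by the rules, to be weighed against domain knowledge rather than treated as exact requirements.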
• Outliers
– A data point that lies far from the rest of the data
– Valid point or erroneous value?
– Further review
• Manual Inspection (sorting, minimum and maximum values, clustering, etc.)
• Domain Knowledge
• Missing Values
– If only a few records have missing values, they can be removed
– Imputation
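The inspection and missing-value steps above might look like the following in Python. Everything here is illustrative: the `ages` list is made up, `None` is assumed to mark a missing value, and median imputation is only one of several possible imputation choices (the slide does not name a specific one).

```python
from statistics import median

# Hypothetical records: None marks a missing value, 420 is a suspicious age.
ages = [34, 29, None, 41, 420, 38, None, 25]

observed = [a for a in ages if a is not None]

# Manual inspection for outliers: sort and look at the extremes.
print(sorted(observed))              # [25, 29, 34, 38, 41, 420]
print(min(observed), max(observed))  # 25 420

# Option 1: remove the records that have missing values.
dropped = [a for a in ages if a is not None]

# Option 2: imputation, here with the median of the observed values
# (assumed choice; the median is less affected by the extreme value 420).
fill = median(observed)
imputed = [fill if a is None else a for a in ages]

print(dropped)
print(imputed)
```

Whether 420 is a data-entry error or a valid value cannot be decided by code alone; that is where further review and domain knowledge come in.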
• Missing Values (continued)
– Drop the variables that have missing values
– Replace the variable with a proxy variable
• Normalization
– Standardization using z-score
– Min-max normalization
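Both normalization methods can be written as short functions. This is a minimal sketch: the `incomes` values are made up, and the z-score here uses the sample standard deviation (`statistics.stdev`), which is an assumption since the slide does not say which version is intended.

```python
from statistics import mean, stdev

def z_score(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), stdev(values)  # sample standard deviation (assumed)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [10, 20, 30]   # hypothetical variable
print(z_score(incomes))  # [-1.0, 0.0, 1.0]
print(min_max(incomes))  # [0.0, 0.5, 1.0]
```

Both functions assume the values are not all identical; otherwise the denominator would be zero.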
Thanks…