Chapter 3: Data Pre-processing
Outline
▪ Why preprocess data?
▪ Mean, median, mode & range
▪ Attribute types
▪ Data preprocessing tasks
• Data cleaning
• Data integration
• Data transformation
• Data reduction
▪ Data mining task primitives
Why preprocess data?
▪ Real world data are generally “dirty”
• Incomplete: Missing attribute values, lack of certain attributes of interest,
or containing only aggregate data.
o E.g. Occupation=“ ”
• Noisy: Containing errors or outliers.
o E.g. Salary=“abcxy”
• Inconsistent: Containing discrepancies in codes or names.
o E.g. “Gujarat” & “Gujrat” (Common mistakes like spelling, grammar, articles)
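The three kinds of dirty data above can be flagged with a few simple checks. The following is a minimal sketch, not a complete validator: the sample records, the field names, the `KNOWN_STATES` reference list, and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical records; field names and values are illustrative only.
records = [
    {"occupation": "", "salary": "52000", "state": "Gujarat"},
    {"occupation": "Engineer", "salary": "abcxy", "state": "Gujrat"},
    {"occupation": "Teacher", "salary": "41000", "state": "Gujarat"},
]

# Assumed reference list of valid names.
KNOWN_STATES = ["Gujarat", "Maharashtra", "Rajasthan"]


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_issues(record):
    """Return a list of (kind, field) problems found in one record."""
    issues = []
    # Incomplete: the attribute value is missing or blank.
    if not record["occupation"].strip():
        issues.append(("incomplete", "occupation"))
    # Noisy: a numeric attribute holds a non-numeric value.
    if not is_number(record["salary"]):
        issues.append(("noisy", "salary"))
    # Inconsistent: the name is unknown but close to a known spelling.
    state = record["state"]
    if state not in KNOWN_STATES:
        if difflib.get_close_matches(state, KNOWN_STATES, n=1, cutoff=0.8):
            issues.append(("inconsistent", "state"))
    return issues


for index, record in enumerate(records):
    print(index, find_issues(record))
```

In practice the checks would depend on the schema of the data set; the point here is only that each category of dirtiness calls for a different kind of test.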
Why is data preprocessing important?
“No quality data, no quality results”
▪ This is the Garbage In, Garbage Out (GIGO) principle.
Find mode.
12, 15, 11, 11, 7, 12, 13
Mode = 11, 12 (Bimodal)
Mode (Cont..)
▪ Example
Find mode.
12, 12, 15, 11, 11, 7, 13, 7
Mode = 7, 11, 12 (Trimodal)
Find mode.
12, 15, 11, 10, 7, 14, 13
No Mode
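The three mode examples above can be checked with a short function. This is a minimal sketch that follows the convention used in these examples: when every value occurs exactly once, the data set is reported as having no mode. The function name `find_modes` is chosen here for illustration.

```python
from collections import Counter


def find_modes(values):
    """Return the sorted list of modes, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # If nothing occurs more than once, treat the data as having no mode.
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(find_modes([12, 15, 11, 11, 7, 12, 13]))      # [11, 12]
print(find_modes([12, 12, 15, 11, 11, 7, 13, 7]))   # [7, 11, 12]
print(find_modes([12, 15, 11, 10, 7, 14, 13]))      # []
```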
Range
▪ The range of a set of data is the difference between the largest
and the smallest number in the set.
▪ Example
✔ Find range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
Range = 55 – 26 = 29
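The same calculation as a short sketch in Python (the function name `data_range` is an illustrative choice):

```python
def data_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
print(data_range(data))  # 29
```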
Standard deviation
▪ The standard deviation (σ) is the square root of the variance.
Standard deviation (Cont..)
▪ The Variance is defined as:
The average of the squared differences from the Mean.
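The definition above translates directly into code. The sketch below computes the population variance (dividing by the number of values, as "average" implies) and takes its square root to obtain the standard deviation. The sample data is an assumption chosen so the arithmetic is easy to follow; it does not come from the slides.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


# Illustrative data (not from the source): the mean is 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(sample))  # 4.0
print(population_std(sample))       # 2.0
```

Note that a sample variance would divide by n - 1 instead of n; the version here follows the "average" wording of the definition.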
[Figure: data preprocessing tasks; labels include Data Transformation, Data Integration, Data Reduction]
1) Data Cleaning
1. Fill in missing values
   1. Ignore the tuple
   2. Fill in the missing value manually
   3. Fill in the missing value automatically
   4. Use a global constant to fill in the missing value
2. Identify outliers and smooth out noisy data
   1. Binning method
   2. Clustering
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
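Three of the strategies for missing values listed above can be sketched in a few lines. This is an illustration under stated assumptions: missing values are represented as `None`, the rows and field names are hypothetical, and "automatic" filling is shown using the attribute mean, which is only one of several possible choices.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"name": "A", "age": 20},
    {"name": "B", "age": None},
    {"name": "C", "age": 40},
]


def ignore_tuples(rows, field):
    """Drop every row whose value for field is missing."""
    return [r for r in rows if r[field] is not None]


def fill_constant(rows, field, constant):
    """Replace missing values with a global constant."""
    return [
        {**r, field: constant if r[field] is None else r[field]}
        for r in rows
    ]


def fill_mean(rows, field):
    """Replace missing values with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    return fill_constant(rows, field, mean)


print(ignore_tuples(rows, "age"))
print(fill_constant(rows, "age", -1))
print(fill_mean(rows, "age"))
```

Each function returns new rows rather than modifying the input, so the strategies can be compared side by side on the same data.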
1) Fill missing values
2) Identify outliers and smooth out noisy data
1. Binning method
2. Clustering
1) Binning method
▪ Data binning or bucketing is a data pre-processing technique used
to reduce the effects of minor observation errors.
▪ The original data values that fall in a given small interval, called
a bin, are replaced by a value that represents that interval,
often the central value.
▪ Steps of Binning method
1. Sort the attribute values and partition them into bins.
2. Then smooth by bin means, bin medians, or bin boundaries.
Binning method - Example
▪ Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
▪ Step: 1
▪ Partition into equal-depth bins (depth n = 4):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
▪ Step: 2
• Smoothing by bin means:
Bin 1: 9, 9, 9, 9          (4 + 8 + 9 + 15)/4 = 9
Bin 2: 23, 23, 23, 23      (21 + 21 + 24 + 25)/4 = 22.75 ≈ 23
Bin 3: 29, 29, 29, 29      (26 + 28 + 29 + 34)/4 = 29.25 ≈ 29
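The worked example can be reproduced with a short sketch of equal-depth binning. The function names are illustrative, and two details are assumptions rather than part of the method as stated: bin means are rounded to the nearest integer for display, and in smoothing by bin boundaries a value equally close to both boundaries is moved to the lower one.

```python
def equal_depth_bins(values, depth):
    """Sort the values and split them into bins of `depth` items each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(data, depth=4)

for b in smooth_by_means(bins):
    print([round(v) for v in b])
# [9, 9, 9, 9]
# [23, 23, 23, 23]
# [29, 29, 29, 29]

for b in smooth_by_boundaries(bins):
    print(b)
# [4, 4, 4, 15]
# [21, 21, 25, 25]
# [26, 26, 26, 34]
```

Smoothing by bin medians follows the same pattern, with the median of each bin in place of the mean.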
For Age 16 :
For Age 20 :
For Age 40 :