Unit 4
Data Pre-processing
Outline
Why preprocess data?
Mean, median, mode & range
Attribute types
Data preprocessing tasks
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data mining task primitives
Why preprocess data?
Real-world data are generally “dirty”:
• Incomplete: Missing attribute values, lack of certain attributes of interest,
or containing only aggregate data.
o E.g. Occupation=“ ”
• Noisy: Containing errors or outliers.
o E.g. Salary=“abcxy”
• Inconsistent: Containing discrepancies in codes or names.
o E.g. “Gujarat” & “Gujrat” (Common mistakes like spelling, grammar, articles)
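The three kinds of dirty data above can be illustrated with a minimal Python sketch. The records, the field names, and the `KNOWN_STATES` reference set are hypothetical and exist only to mirror the slide's examples (empty occupation, non-numeric salary, misspelled state); real checks would depend on the data set.

```python
# Hypothetical records mirroring the examples above.
records = [
    {"occupation": "", "salary": "52000", "state": "Gujarat"},
    {"occupation": "Teacher", "salary": "abcxy", "state": "Gujrat"},
    {"occupation": "Engineer", "salary": "61000", "state": "Gujarat"},
]

# Assumed reference list of valid spellings (not from the slides).
KNOWN_STATES = {"Gujarat"}


def find_issues(record):
    """Return the kinds of dirtiness found in one record."""
    issues = []
    # Incomplete: the attribute value is missing or blank.
    if not record["occupation"].strip():
        issues.append("incomplete")
    # Noisy: the salary cannot be read as a number.
    try:
        float(record["salary"])
    except ValueError:
        issues.append("noisy")
    # Inconsistent: the name does not match the reference spelling.
    if record["state"] not in KNOWN_STATES:
        issues.append("inconsistent")
    return issues


report = [find_issues(r) for r in records]
print(report)
```

The first record is flagged as incomplete, the second as both noisy and inconsistent, and the third passes all three checks.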
Why is data preprocessing important?
“No quality data, no quality results.”
This is the Garbage In, Garbage Out (GIGO) principle: poor-quality input data lead to poor-quality results.
Find mode.
12, 15, 11, 11, 7, 12, 13
Mode: 11, 12 (bimodal)
Mode (Cont..)
Example
Find mode.
12, 12, 15, 11, 11, 7, 13, 7
Mode: 7, 11, 12 (trimodal)
Find mode.
12, 15, 11, 10, 7, 14, 13
No Mode
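The three mode examples can be checked with a short Python sketch. The helper name `modes` is an illustrative choice, and the function follows the slide's convention that a data set in which every value occurs only once has no mode. It assumes a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([12, 15, 11, 11, 7, 12, 13]))      # bimodal
print(modes([12, 12, 15, 11, 11, 7, 13, 7]))   # trimodal
print(modes([12, 15, 11, 10, 7, 14, 13]))      # no mode
```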
Range
The range of a set of data is the difference between the largest
and the smallest number in the set.
Example
Find range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
Range = 55 – 26 = 29
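The same calculation as a minimal Python sketch, using the data from the example (the variable names are illustrative):

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

# Range = largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)  # 29
```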
Standard deviation
The standard deviation is a measure of how spread out the data are.
Its symbol is σ (the Greek letter sigma).
x       x − mean    (x − mean)²
…       …           …
46      -3.2        10.24
50      0.8         0.64
Total               2600.4
Standard deviation – example (Cont..)
The standard deviation can be thought of as measuring how far the data
values lie from the mean. Take the mean and move one
standard deviation in either direction.
The mean for this example is 49.2 and the standard deviation is
17.
Now, 49.2 - 17 = 32.2 and 49.2 + 17 = 66.2
This means that most of the data probably lie between 32.2
and 66.2.
If all data values are the same, then the variance and standard deviation are 0 (zero).
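The figures in this example can be reproduced with a small Python sketch. The mean (49.2) and the total of squared deviations (2600.4) come from the example. The number of values, `n = 10`, is not stated in the text and is an assumption here, chosen because dividing by n − 1 = 9 gives a standard deviation of about 17.

```python
import math
import statistics

mean = 49.2            # from the example
sum_sq_dev = 2600.4    # total of squared deviations from the example
n = 10                 # ASSUMPTION: number of values is not stated in the text

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

# One standard deviation on either side of the mean (rounded as on the slide).
low = mean - round(std_dev)
high = mean + round(std_dev)
print(round(variance), round(std_dev), round(low, 1), round(high, 1))

# If all values are the same, variance and standard deviation are zero.
same = [5, 5, 5, 5]
print(statistics.pvariance(same), statistics.pstdev(same))
```

Under that assumption the variance rounds to 289 and the standard deviation to 17, giving the interval from 32.2 to 66.2.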
Example (Try it)
Calculate Mean, Median, Mode, Range, Variance &
Standard deviation.
13, 18, 13, 14, 13, 16, 14, 21, 13
Mean is 15.
Median is 14.
Mode is 13 (13 occurs four times; 14 occurs only twice).
Range is 8.
Variance is 8 (sample variance: sum of squared deviations 64 divided by n − 1 = 8).
Standard deviation is √8 ≈ 2.83.
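One way to check answers to an exercise like this is Python's standard `statistics` module. The sketch below assumes the sample convention (dividing by n − 1) for variance and standard deviation; `statistics.pvariance` and `statistics.pstdev` would give the population versions instead.

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.multimode(data)      # list of all most-frequent values
data_range = max(data) - min(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # sample standard deviation

print(mean, median, mode, data_range, variance, round(std_dev, 2))
```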
Attribute Types
An attribute is a property of the object.
It also represents different features of the object.
E.g. Person Name, Age, Qualification etc.
Attribute types can be divided into four categories.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
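The four categories can be sketched as a small Python enumeration. The mapping of measures of central tendency to each type follows the usual textbook convention. The example assignments use the attributes named above, plus "Temperature (°C)", which is added here only to illustrate the interval type and is not from the slides.

```python
from enum import Enum


class AttributeType(Enum):
    NOMINAL = 1    # categories with no order
    ORDINAL = 2    # categories with a meaningful order
    INTERVAL = 3   # numeric, differences meaningful, no true zero
    RATIO = 4      # numeric with a true zero, ratios meaningful


# Measures of central tendency that are meaningful for each type.
MEANINGFUL_STATS = {
    AttributeType.NOMINAL: {"mode"},
    AttributeType.ORDINAL: {"mode", "median"},
    AttributeType.INTERVAL: {"mode", "median", "mean"},
    AttributeType.RATIO: {"mode", "median", "mean"},
}

# Illustrative assignments (the temperature entry is an added example).
EXAMPLES = {
    "Person Name": AttributeType.NOMINAL,
    "Qualification": AttributeType.ORDINAL,
    "Temperature (°C)": AttributeType.INTERVAL,
    "Age": AttributeType.RATIO,
}

for name, kind in EXAMPLES.items():
    print(name, kind.name, sorted(MEANINGFUL_STATS[kind]))
```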
1) Nominal Attribute
[Figure: data preprocessing tasks (data cleaning, data integration, data transformation, data reduction)]
1) Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
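Two of these routines, filling in missing values and correcting inconsistencies, can be sketched in a few lines of Python. The rows, the field names, and the `SPELLING_FIXES` table are hypothetical. Filling with the attribute mean is only one common strategy and is assumed here for illustration. Noise smoothing is not covered by this sketch.

```python
import statistics

# Hypothetical rows: one missing age, inconsistent spellings of one state.
rows = [
    {"age": 25, "state": "Gujarat"},
    {"age": None, "state": "Gujrat"},
    {"age": 35, "state": "gujarat"},
    {"age": 30, "state": "Gujarat"},
]

# Assumed lookup table mapping known variants to one canonical spelling.
SPELLING_FIXES = {"gujrat": "Gujarat", "gujarat": "Gujarat"}


def clean(rows):
    """Fill missing ages with the mean of known ages and fix state spellings."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fill_value = statistics.mean(known_ages)
    cleaned = []
    for r in rows:
        age = r["age"] if r["age"] is not None else fill_value
        state = SPELLING_FIXES.get(r["state"].strip().lower(), r["state"])
        cleaned.append({"age": age, "state": state})
    return cleaned


cleaned = clean(rows)
print(cleaned)
```

The function returns new rows rather than modifying the input, so the original data stay available for comparison.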
1) Fill missing values
For Age 16 :
For Age 20 :
For Age 40 :