Lecture 1
COMPUTER ENGINEERING
Data Cleansing
Reasons for Data Cleansing
Resolving Inconsistency
Typical Example
Sample Table
Typical Example (Incomplete Data)
Sample Table
Typical Example (Data Values Errors)
Sample Table
Reasons:
• The Zip code consists of five digits and cannot contain any letters
• Income must be a positive number
• Age must be a positive number
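The three rules above can be checked mechanically. The following is a minimal sketch, assuming each record is a dictionary with the hypothetical keys `zip`, `income`, and `age`; the function name and the sample record are illustrative and do not come from the sample table.

```python
import re

def find_value_errors(record):
    """Return the names of the fields in `record` that break a rule."""
    errors = []
    # Zip code: exactly five digits, no letters.
    if not re.fullmatch(r"[0-9]{5}", str(record.get("zip", ""))):
        errors.append("zip")
    # Income and age must both be positive numbers.
    for field in ("income", "age"):
        value = record.get(field)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(field)
    return errors

# Hypothetical record, not taken from the sample table.
print(find_value_errors({"zip": "9021A", "income": -40000, "age": 30}))
# prints ['zip', 'income']
```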
Typical Example (Outlier Values)
Sample Table
Reasons:
• Outliers are data values that deviate from the expected values of the rest of
the data set
• The values 10000000 and -40000 lie far from the rest of the values
Typical Example (Ambiguity)
Sample Table
Reasons:
• “S” in Marital Status could refer to “Single” or “Separated”
• The value is therefore ambiguous
Reasons for Incomplete Data
Defaults
• Field1 Default: 0
• Field2 Default: N
• Field3 Default: 240
• Field4 Default: 50
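Filling missing values from these defaults can be sketched as follows. This is only an illustration: it assumes records are dictionaries and that a missing value is stored as `None`; the function name and the sample record are hypothetical.

```python
# Defaults as listed above.
DEFAULTS = {"Field1": 0, "Field2": "N", "Field3": 240, "Field4": 50}

def fill_with_defaults(record, defaults=DEFAULTS):
    """Return a copy of `record` with each missing (None) field set to its default."""
    return {
        field: defaults[field] if value is None and field in defaults else value
        for field, value in record.items()
    }

# Hypothetical record with two missing fields.
row = {"Field1": None, "Field2": "A", "Field3": None, "Field4": 75}
print(fill_with_defaults(row))
# prints {'Field1': 0, 'Field2': 'A', 'Field3': 240, 'Field4': 75}
```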
Handling Missing Values (Using Means and Modes)
Use the mean for numeric fields and the mode (if it
exists) for categorical fields
If no mode exists, fall back on either a default
value or a random value
Numeric Fields: Field1, Field3, and Field4
Field1 Mean = (21+24+22+12+11+16+16+17+18)/9 = 17.44
Field3 Mean = 334.44
Field4 Mean = 81.78
If a field doesn’t accept decimal values, round the
mean to the nearest whole number
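The Field1 calculation can be reproduced with a short sketch. The nine known values are the ones listed above; the single missing entry (stored as `None`) and its position in the list are assumptions made only for illustration.

```python
from statistics import mean

# Nine known Field1 values from the slide, plus one assumed missing entry.
field1 = [21, 24, 22, 12, 11, 16, 16, 17, 18, None]

known = [v for v in field1 if v is not None]
field1_mean = mean(known)  # 157 / 9

# Replace every missing entry with the mean of the known values.
filled = [field1_mean if v is None else v for v in field1]

print(round(field1_mean, 2))  # prints 17.44
```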
Handling Missing Values (Using Means and Modes) (Cont.)
Category Occurrence
A 3
B 1
W 2
C 2
Assumptions
• Assume Field1 and Field4 don’t accept decimal numbers, hence we round
their means
• Field3 accepts decimal numbers, hence we keep its mean value as is
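A small sketch of both steps follows: picking the mode from the category counts above and rounding the means of the integer-only fields. It assumes that "the mode doesn't exist" means a tie for the most frequent value; that interpretation and the function name are assumptions, not something the slides state.

```python
from collections import Counter

def mode_or_none(values):
    """Return the single most frequent value, or None if the top count is tied."""
    counts = Counter(values).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: treated here as "no mode"
    return counts[0][0]

# Category occurrences from the table: A 3, B 1, W 2, C 2.
categories = ["A"] * 3 + ["B"] + ["W"] * 2 + ["C"] * 2
print(mode_or_none(categories))  # prints A

# Field1 and Field4 are assumed to be integer-only, so their means are rounded.
print(round(17.44), round(81.78))  # prints 17 82
```

When `mode_or_none` returns `None`, the caller would fall back on a default or a random value.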
Handling Missing Values (Using Random Values)
Handling Outliers
Sample Table
Data Set Possible Outlier Values
• Outliers are data values that deviate from the expected values of the rest of
the data set
• Outliers are extreme values that lie near the limits of the data range or go
against the trend of the remaining data
• Outliers normally need further investigation to check whether they were caused
by mistakes during data entry
Handling Outliers Using Inter-quartile Range
• A quartile is any of the three values that divide the sorted data set
into four equal parts
• The first quartile (Q1) cuts off the lowest 25% of the data
• The second quartile (Q2) cuts the data set in half (it is the median of the
data set)
• The third quartile (Q3) cuts off the highest 25% of the data, or the lowest 75%
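Several conventions exist for computing quartiles, and library routines that interpolate may return slightly different values. The sketch below assumes the median-of-halves convention: Q2 is the median, and Q1 and Q3 are the medians of the lower and upper halves, with the middle value left out of both halves when the count is odd. The function name and the sample values are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Needs at least two values; the middle value is excluded from both
    halves when the number of values is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return median(lower), median(data), median(upper)

# Hypothetical data set of eight values.
print(quartiles([8, 3, 1, 6, 2, 7, 5, 4]))  # prints (2.5, 4.5, 6.5)
```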
Computing Q1, Q2, and Q3
Sample Table
Data Set that might contain outliers
Data Set
75000, -40000, 10000000, 50000, 99999
Example of Detecting Outliers using Inter-quartile Range
Data Set:
75000, -40000, 10000000, 50000, 99999
Ordered Data Set:
-40000, 50000, 75000, 99999, 10000000
Q2 = 75000
Q1 = (–40000+50000)/2 = 5000
Q3 = (99999+10000000)/2 = 5049999.5
IQR = Q3 – Q1
= 5049999.5 – 5000
= 5044999.5
Q1 – 1.5*IQR = 5000 – 1.5*5044999.5 = –7562499.25
Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
All values in the data set lie within this range, hence there are no outliers in
this example
Example of Detecting Outliers using Inter-quartile Range
Data Set:
75000, 40000, 10000000, 50000, 99999, 75000
Ordered Data Set:
40000, 50000, 75000, 75000, 99999, 10000000
Q2 = (75000+ 75000)/2 = 75000
Q1 = 50000
Q3 = 99999
IQR = Q3 – Q1
= 99999 – 50000
= 49999
Q1 – 1.5*IQR = 50000 – 1.5*49999 = –24998.5
Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
Hence the value 10000000 lies above the upper limit; it is an outlier and should
be checked for data-entry errors
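The whole procedure can be sketched in a few lines. The function below uses the same quartile convention as the worked examples (medians of the lower and upper halves, with the middle value excluded when the count is odd) and the 1.5*IQR limits; the function name and the `k` parameter are illustrative choices.

```python
from statistics import median

def iqr_outliers(values, k=1.5):
    """Return (lower_limit, upper_limit, outliers) using the k*IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])              # median of the lower half
    q3 = median(data[len(data) - half:])  # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [v for v in data if v < low or v > high]

print(iqr_outliers([75000, 40000, 10000000, 50000, 99999, 75000]))
# prints (-24998.5, 174997.5, [10000000])
```

Running it on the five-value data set 75000, -40000, 10000000, 50000, 99999 returns an empty outlier list, because the two extreme values inflate the IQR and widen the limits. This is a known weakness of the rule on very small data sets.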
Noisy Data
Check the validation rules and make sure field values follow
them; for example:
Age is not less than a certain amount, and age is a positive
number.
Example: if a rule governing your data says that age
must be between 20 and 60, then ages of 18, 15, and 68
are detected as errors
Each categorical value belongs to one of the known
categories.
Example: if the only categories you have are A, B, C, and D,
then any occurrence of W or N is declared
an error
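Both rules can be expressed as simple filters. This sketch uses the age limits and category list from the examples above as defaults; the function name and the valid sample values (25, 40) are hypothetical.

```python
def find_noisy_values(ages, categories, min_age=20, max_age=60,
                      allowed=("A", "B", "C", "D")):
    """Return (bad_ages, bad_categories) that violate the validation rules."""
    bad_ages = [a for a in ages if not (min_age <= a <= max_age)]
    bad_categories = [c for c in categories if c not in allowed]
    return bad_ages, bad_categories

# Hypothetical columns mixing valid and invalid values.
print(find_noisy_values([18, 25, 15, 68, 40], ["A", "W", "C", "N", "D"]))
# prints ([18, 15, 68], ['W', 'N'])
```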
Validation and Correction of Noisy Data (Cont.)
Incorrect Area-Code
Total_Price doesn’t equal (Quantity*Unit_Price)
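A cross-field rule such as the Total_Price check can be tested row by row. The sketch below assumes rows are dictionaries keyed by the column names `Quantity`, `Unit_Price`, and `Total_Price`; the function name, the tolerance, and the sample rows are hypothetical. A small tolerance is used because prices are often stored as floating-point numbers.

```python
import math

def price_mismatches(rows, tol=1e-9):
    """Return the rows where Total_Price differs from Quantity * Unit_Price."""
    return [
        row for row in rows
        if not math.isclose(row["Total_Price"],
                            row["Quantity"] * row["Unit_Price"],
                            rel_tol=tol, abs_tol=tol)
    ]

# Hypothetical rows: the second one is inconsistent (2 * 10.0 is 20.0, not 25.0).
rows = [
    {"Quantity": 3, "Unit_Price": 2.5, "Total_Price": 7.5},
    {"Quantity": 2, "Unit_Price": 10.0, "Total_Price": 25.0},
]
print(price_mismatches(rows))
# prints [{'Quantity': 2, 'Unit_Price': 10.0, 'Total_Price': 25.0}]
```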