CH2 Data Cleaning
TOPIC 2: Data Preparation - Cleaning
Where are we? Introduction to KDD
KDD steps:
• Understanding dataset & domain
• Data preparation
• Do data preprocessing
• Evaluate result
• Utilize knowledge
Chapter Outline (Data Cleaning)
• Why preprocess data?
• Real-world data problems
• Cleaning for missing data, noisy data, and outliers
• Techniques to identify and overcome missing data, noisy data, and outliers
Major Tasks in Data Preprocessing
[Figure: process flow with stages labelled Data Sources, Data Preprocessing, Data Mining, Evaluation & Presentation, and Identifying knowledge]
Why Data Preprocessing is Important?
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“”
– Noisy: containing errors or outliers
• e.g., Salary=“-10”
– Inconsistent: containing inconsistencies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., inconsistency between duplicate records
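The three kinds of dirty data listed above can be checked for mechanically. The sketch below is only an illustration: the record layout, the function name `find_issues`, the `reference_year` parameter, and the one-year tolerance are all assumptions, not part of the slides.

```python
def find_issues(record, reference_year=2024):
    """Return a list of data-quality problems found in one record.

    reference_year is an assumed "current year" used only to compare
    the stated age with the year of birth.
    """
    issues = []
    # Incomplete: an attribute value is missing or empty.
    if not record.get("occupation"):
        issues.append("incomplete: occupation is missing")
    # Noisy: a value that cannot be right, such as a negative salary.
    if record.get("salary") is not None and record["salary"] < 0:
        issues.append("noisy: salary is negative")
    # Inconsistent: the stated age disagrees with the year of birth.
    birth_year = int(record["birthday"].split("/")[-1])
    if abs((reference_year - birth_year) - record["age"]) > 1:
        issues.append("inconsistent: age does not match birthday")
    return issues


# The example values from the slide, combined into one record.
dirty = {"occupation": "", "salary": -10, "age": 42, "birthday": "03/07/1997"}
clean = {"occupation": "teacher", "salary": 3000, "age": 27, "birthday": "03/07/1997"}

for issue in find_issues(dirty):
    print(issue)
```

Real checks depend on the domain: which attributes may be empty and which value ranges are plausible has to come from understanding the dataset.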
Why Data is Dirty?
How to identify problems in data?
• Histogram analysis
• Regression analysis
Data Cleaning Tasks
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing values/data ?
How to Handle Missing values/data ?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification). Not effective when the percentage of missing values per attribute varies considerably.
• Eliminate/ignore data objects or attributes
Example data (X marks a missing value):
8     M  PAKISTANI  4  37
X     M  CHINESE    5  X
4     M  CHINESE    6  37.5
20    F  PAKISTANI  3  39
23    F  PAKISTANI  7  39.5
6     M  X          7  38.2
X     F  X          X  39
X     F  PAKISTANI  2  X
20    F  PAKISTANI  2  37
41    M  BANGLA     5  37
24    F  CHINESE    7  37
15    M  CHINESE    X  38.5
27    M  PAKISTANI  2  38.5
7     F  PAKISTANI  2  37
22    M  PAKISTANI  3  41.2
33    M  PAKISTANI  3  39
67    M  PAKISTANI  5  39
6     M  PAKISTANI  X  39
19    F  CHINESE    5  38
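As a minimal sketch of the "ignore the tuple" strategy, the rows of the example table can be filtered so that only complete rows remain. The column meanings are not given on the slide, so the code treats each row as a plain list of fields; the names `RAW`, `rows`, and `complete` are illustrative.

```python
# Rows copied from the example table; "X" marks a missing value.
RAW = """\
8 M PAKISTANI 4 37
X M CHINESE 5 X
4 M CHINESE 6 37.5
20 F PAKISTANI 3 39
23 F PAKISTANI 7 39.5
6 M X 7 38.2
X F X X 39
X F PAKISTANI 2 X
20 F PAKISTANI 2 37
41 M BANGLA 5 37
24 F CHINESE 7 37
15 M CHINESE X 38.5
27 M PAKISTANI 2 38.5
7 F PAKISTANI 2 37
22 M PAKISTANI 3 41.2
33 M PAKISTANI 3 39
67 M PAKISTANI 5 39
6 M PAKISTANI X 39
19 F CHINESE 5 38
"""

rows = [line.split() for line in RAW.splitlines()]

# Ignore the tuple: keep only rows in which no field is missing.
complete = [row for row in rows if "X" not in row]

print(len(rows), "rows in total,", len(complete), "complete rows kept")
```

Dropping rows is simple, but here it discards roughly a third of the table, which is why estimating the missing values is often preferred.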
Missing values/data ?
• Example:
– Estimate the missing value
  Mean value (numerical)
  Mode value (categorical/ordinal/nominal)
Data after the missing values have been estimated:
59    M  CHINESE    3  50
8     M  PAKISTANI  4  37
23.4  M  CHINESE    5  38.9
4     M  CHINESE    6  37.5
20    F  PAKISTANI  3  39
23    F  PAKISTANI  7  39.5
6     M  PAKISTANI  7  38.2
23.4  F  PAKISTANI  4  39
21    F  PAKISTANI  5  38.5
23.4  F  PAKISTANI  2  38.9
20    F  PAKISTANI  2  37
41    M  BANGLA     5  37
24    F  CHINESE    7  37
15    M  CHINESE    4  38.5
27    M  PAKISTANI  2  38.5
7     F  PAKISTANI  2  37
22    M  PAKISTANI  3  41.2
33    M  PAKISTANI  3  39
67    M  PAKISTANI  5  39
6     M  PAKISTANI  4  39
19    F  CHINESE    5  38
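The estimation step can be sketched as follows, using the first (numerical) and third (categorical) columns of the table above, with `None` standing in for the originally missing entries. The helper `impute` and the variable names are illustrative assumptions; the slides do not prescribe an implementation.

```python
from statistics import mean, mode


def impute(values, estimate):
    """Fill every None in values with estimate(known values)."""
    known = [v for v in values if v is not None]
    fill = estimate(known)
    return [fill if v is None else v for v in values], fill


# First column of the table; None marks the entries that were missing.
first_column = [59, 8, None, 4, 20, 23, 6, None, 21, None, 20,
                41, 24, 15, 27, 7, 22, 33, 67, 6, 19]

# Third column of the table; None marks the entries that were missing.
third_column = ["CHINESE", "PAKISTANI", "CHINESE", "CHINESE", "PAKISTANI",
                "PAKISTANI", None, None, "PAKISTANI", "PAKISTANI",
                "PAKISTANI", "BANGLA", "CHINESE", "CHINESE", "PAKISTANI",
                "PAKISTANI", "PAKISTANI", "PAKISTANI", "PAKISTANI",
                "PAKISTANI", "CHINESE"]

# Numerical attribute: use the mean of the known values.
filled_numbers, number_fill = impute(first_column, mean)
# Categorical attribute: use the most frequent known value (the mode).
filled_labels, label_fill = impute(third_column, mode)

print(round(number_fill, 1), label_fill)
```

The mean of the 18 known values in the first column is 422 / 18, which rounds to the 23.4 shown in the table, and the most frequent label is PAKISTANI.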
• Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):
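  $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
  where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.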
• Mode
– Value that occurs most frequently in the data
Symmetric vs. Skewed Data
positively skewed
Recap
• We have covered
– Types of data problems, why data problems occur, and the importance of data preprocessing
– Data preprocessing tasks: data cleaning, data reduction, data integration, data transformation
– Data Cleaning
• Missing data - handling missing values
• NEXT
– Data Cleaning
• Noisy & incomplete data – identify & clean
Noisy Data
• What is noise?
• Random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
– outlier values
• Techniques to detect: box plot, histogram analysis, clustering
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
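The two forms of the sample variance (the definition and the "scalable" form that needs only running sums) give the same result. A quick check on a short list of prices, written as a sketch with illustrative variable names:

```python
from statistics import variance

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n = len(prices)
sample_mean = sum(prices) / n

# Definition: squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((x - sample_mean) ** 2 for x in prices) / (n - 1)

# Computational form: needs only the sum of x and the sum of x squared,
# so it can be accumulated in a single pass over the data.
s2_shortcut = (sum(x * x for x in prices) - sum(prices) ** 2 / n) / (n - 1)

print(s2_definition, s2_shortcut, variance(prices))
```

The standard library's `statistics.variance` also uses the n - 1 divisor, so it serves as a cross-check for both forms.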
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to Minimum and Maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
How to Handle Noisy Data?
BINNING
– Sort data
– Perform binning
• Binning Method (smooth by bin means, smooth by bin median, smooth by bin boundaries)
• Discretization Methods (Equal width distance, Equal frequency distance)
CLUSTERING
– detect suspicious values and check by human (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
*Sorted data for price (RM): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Bin means:
• Divide data into k bins (e.g. 3)
• 12 / 3 = 4 values per bin
• Calculate the mean value for each bin (rounded to the nearest whole number)
- Bin 1: 4, 8, 9, 15; (4+8+9+15) / 4 = 9, so Bin 1: 9, 9, 9, 9
- Bin 2: 21, 21, 24, 25; mean 22.75, so Bin 2: 23, 23, 23, 23
- Bin 3: 26, 28, 29, 34; mean 29.25, so Bin 3: 29, 29, 29, 29
• Bin boundaries:
• Divide data into k bins (e.g. 3)
• Replace each value with the nearest bin boundary (the bin's minimum or maximum)
- Bin 1: 4, 8, 9, 15, so Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
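The two smoothing methods above can be reproduced with a short sketch. The function names are illustrative, the bin split assumes the number of values is divisible by k, and sending a value that is equally close to both boundaries to the lower one is an assumption (no such tie occurs in this data).

```python
def equal_frequency_bins(sorted_values, k):
    """Split sorted values into k bins holding the same number of values."""
    size = len(sorted_values) // k  # assumes len(sorted_values) % k == 0
    return [sorted_values[i * size:(i + 1) * size] for i in range(k)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)

print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means hides more of the within-bin variation, while smoothing by boundaries keeps the extreme values of each bin unchanged.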
Discretization Methods
• Equal-width (distance) partitioning:
o Divides the range into N intervals of equal size: uniform
grid
o if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
o The most straightforward, but outliers may dominate
presentation
o Skewed data is not handled well.
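A minimal sketch of equal-width partitioning with W = (B – A)/N, applied to the sorted price list from the binning example. The function name `equal_width_bins` is illustrative, and placing the highest value in the last interval is an assumed convention for closing the final interval.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins  # W = (B - A) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The highest value would land in bin n_bins, so clamp it
        # into the last interval.
        index = min(int((v - lowest) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)

print("width:", width)
for i, b in enumerate(bins, start=1):
    print("Bin", i, b)
```

With N = 3 the width is (34 – 4) / 3 = 10, and the bins end up with 3, 3, and 6 values. Unlike the equal-frequency split, the bin sizes follow the shape of the data, which is why skewed data and outliers can leave some intervals nearly empty.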
Example of Discretization Methods: Binning
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar).
Price (Dollar)
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
Histogram analysis
(Data is badly distributed; too many classes!)
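The raw frequencies behind such a histogram can be counted directly. This is a sketch using the price list above; the text-based bar output is only an illustration of what the histogram shows.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# One class per distinct price, as in the raw histogram.
freq = Counter(prices)

for price in sorted(freq):
    print(f"{price:>3} {'#' * freq[price]}")

print(len(freq), "distinct prices for", len(prices), "values")
```

Thirteen separate classes for 52 values is what makes the raw histogram hard to read, and it motivates grouping the prices into a few wider bins.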
Equal-width (distance) Binning
38.1 – 38.5   c
38.6 – 39.0   d
39.1 – 39.5   e
39.6 – 40.0   f
40.1 – 40.5   g
40.5 – 41.5   h
[Figure: chart of raw frequencies]
• https://www.youtube.com/watch?v=P--UFzlNGeA
Outlier Removal
• Outlier: an observation that lies far outside the range of the other data values.
Box Plot Analysis
Box Plot: a technique for displaying one-dimensional data and its characteristics (applies quartile analysis).
IQR: distance between Q1 and Q3 (IQR = Q3 – Q1)
• 3, 3, 7, 8, 7, 4, 4, 10, 1, 5, 1, 7, 2, 7, 9
• Isabel now needs to analyze this data. She can use a box plot to
visualize the data.
• 1. Sort the data: 1, 1, 2, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 9, 10
• 2. Find the median: median = Q2 = 5
• 3. Find quartile 1: Q1 = 3
• 4. Find quartile 3: Q3 = 7
• 5. Calculate IQR: IQR = Q3 – Q1 = 7 – 3 = 4
• 6. Calculate r, the range outside which a value counts as an outlier:
  r = (Q1 - 1.5(IQR)) to (Q3 + 1.5(IQR))
  r = (3 - 1.5(4)) to (7 + 1.5(4))
  r = (-3 to 13)
  All 15 values lie between -3 and 13, so this data set has no outliers.
[Figure: box plot on an axis marked -3 (lower extreme), 1, 3, 5, 7, 10, 13]
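The six steps can be sketched in a few lines. The quartiles are computed here as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which is the convention the worked example follows; other tools may use different quartile conventions and give slightly different results. The function name `box_plot_summary` is illustrative.

```python
from statistics import median


def box_plot_summary(values):
    """Return Q1, median, Q3, IQR, and the (lower, upper) outlier limits."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # Skip the middle value when the number of values is odd.
    upper_half = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    iqr = q3 - q1
    return q1, q2, q3, iqr, (q1 - 1.5 * iqr, q3 + 1.5 * iqr)


data = [3, 3, 7, 8, 7, 4, 4, 10, 1, 5, 1, 7, 2, 7, 9]
q1, q2, q3, iqr, (low, high) = box_plot_summary(data)
outliers = [v for v in data if v < low or v > high]

print("Q1 =", q1, "median =", q2, "Q3 =", q3, "IQR =", iqr)
print("range:", low, "to", high, "outliers:", outliers)
```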
Regression Analysis
[Figure: two scatter plots of y against x, each with a fitted model line; a point lying far from the line is labelled as an outlier]
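One way to turn the regression idea into a check is to fit a straight line and flag points whose residual (distance from the line) is unusually large. The sketch below is an assumption-laden illustration: the data is made up, the function names are hypothetical, and the cut-off of 2 standard deviations of the residuals is a common rule of thumb rather than something the slides specify.

```python
from statistics import pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def flag_outliers(xs, ys, k=2.0):
    """Return the points whose residual exceeds k residual std deviations."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    limit = k * pstdev(residuals)  # k = 2 is an assumed threshold
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > limit]


# Made-up data: y follows 2x + 1 except for the point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 5, 7, 9, 30, 13, 15, 17]

flagged = flag_outliers(xs, ys)
print(flagged)
```

Because the outlier itself pulls the fitted line towards it, flagged points are best treated as suspicious values to inspect rather than as values to delete automatically.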
Case Study : Diabetes Dataset
• Total Records: 768
• Total Attributes: 9 with a target class (+ diabetes, - diabetes)
• Attributes Information
Frequency Analysis
• Identify missing values, forecast data distribution
Histogram Analysis
• Investigate data distribution
SPSS: Box plot analysis
• To identify outliers
Find:
Videos to watch
• http://www.youtube.com/watch?v=Ys3x2E9WZXQ
• http://www.youtube.com/watch?v=P--UFzlNGeA