CH2 Data Cleaning

Uploaded by

Hunzila Nisar
Copyright
© All Rights Reserved

ARIN-2317

KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 2: Data Preparation - Cleaning

1
Where are we? Introduction to KDD

KDD steps
• Understanding dataset & domain
• Data preparation (do data preprocessing)
• Run data mining
• Evaluate result
• Utilize knowledge

2
Chapter Outline (Data Cleaning)
• Why preprocess data?
• Real-world data problems
• Cleaning for missing data, noisy data, outliers
• Techniques to identify & overcome missing data, noisy data, outliers

3
Major Tasks in Data Preprocessing

KDD flow: Data Sources → Data Preprocessing → Data Mining → Evaluation & Presentation → Identifying knowledge

• Data Cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation
– Normalization & aggregation
• Data Reduction
– Obtains reduced representation in volume but produces the same or similar analytical results
• Data Discretization
– Part of data reduction but with particular importance, especially for numerical data

4
Why Data Preprocessing is Important?

“Quality of Mining Result”


• Quality decisions must be based on quality data
– e.g., duplicate or missing data may cause incorrect or even
misleading statistics.

• Main idea – to ensure that the data is clean (high-quality data). This is a key element in KDD → 80% of KDD effort is spent in this part.
• Certain data mining algorithms require pre-processing for better performance
– Data from various sources varies in shape and needs normalization

5
Why Data Preprocessing is Important?
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
• e.g., occupation=“”
– Noisy: containing errors or outliers
• e.g., Salary=“-10”
– Inconsistent: containing inconsistency in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., inconsistency between duplicate records
6
Why is Data Dirty?

• Incomplete data comes from
– n/a data value when collected
– different consideration between the time when the data was collected and when it is analyzed
– human/hardware/software problems
• Noisy data comes from the process of data
– Collection
– Entry
– Transmission
• Inconsistent data comes from
– Different data sources
– Functional dependency violation

7
How to identify problems in data?

1. Manual checking + Tools
2. Statistical Techniques
• Descriptive Analysis: Min, Mode, Median, Std Deviation
• Histogram Analysis
• Box Plot Analysis
• Regression Analysis
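
As a rough illustration of the descriptive analysis listed above, the sketch below summarises one numeric attribute with Python's standard `statistics` module. The `describe` helper and the `salaries` list are hypothetical and only meant to show how a summary can expose a suspicious value such as the Salary="-10" example from the earlier slide.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for one numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical salary column containing one impossible (negative) value.
salaries = [3000, 3200, 3100, 3000, -10, 3300]
summary = describe(salaries)

for name, value in summary.items():
    print(f"{name}: {value}")
```

A negative minimum for a salary attribute is a hint that the column needs cleaning before mining; the mean and standard deviation are also pulled away from the typical values by that single entry.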

8
Data Cleaning TASKS

• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

9
Missing values/data ?

• Data is not always available


– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data

• Missing data may be due to :


– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data were not registered

10
How to Handle Missing values/data ?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean (for numerical attribute)
– the attribute mean for all samples belonging to the same class
– the most probable value: inference-based such as Bayesian formula or decision tree
– the mode value of the attribute (categorical/ordinal)

11
Missing values/data ?

• Eliminate/Ignore data objects or attributes
• But it's not the best way!

age  sex  nationality  fever  temperature
59   M    PAKISTANI    3      50
8    M    PAKISTANI    4      37
X    M    CHINESE      5      X
4    M    CHINESE      6      37.5
20   F    PAKISTANI    3      39
23   F    PAKISTANI    7      39.5
6    M    X            7      38.2
X    F    X            X      39
21   F    PAKISTANI    5      38.5
X    F    PAKISTANI    2      X
20   F    PAKISTANI    2      37
41   M    BANGLA       5      37
24   F    CHINESE      7      37
15   M    CHINESE      X      38.5
27   M    PAKISTANI    2      38.5
7    F    PAKISTANI    2      37
22   M    PAKISTANI    3      41.2
33   M    PAKISTANI    3      39
67   M    PAKISTANI    5      39
6    M    PAKISTANI    X      39
19   F    CHINESE      5      38
Missing values/data ?

• Example:
– Estimate the missing value
• Mean value (numerical)
• Mode value (categorical/ordinal/nominal)

age   sex  nationality  fever  temperature
59    M    CHINESE      3      50
8     M    PAKISTANI    4      37
23.4  M    CHINESE      5      38.9
4     M    CHINESE      6      37.5
20    F    PAKISTANI    3      39
23    F    PAKISTANI    7      39.5
6     M    PAKISTANI    7      38.2
23.4  F    PAKISTANI    4      39
21    F    PAKISTANI    5      38.5
23.4  F    PAKISTANI    2      38.9
20    F    PAKISTANI    2      37
41    M    BANGLA       5      37
24    F    CHINESE      7      37
15    M    CHINESE      4      38.5
27    M    PAKISTANI    2      38.5
7     F    PAKISTANI    2      37
22    M    PAKISTANI    3      41.2
33    M    PAKISTANI    3      39
67    M    PAKISTANI    5      39
6     M    PAKISTANI    4      39
19    F    CHINESE      5      38
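
The mean/mode filling shown in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing entries are marked with "X" as in the first table, the `impute` helper is a hypothetical name, and the mean is rounded to one decimal place to match the slide.

```python
import statistics

MISSING = "X"  # marker used for missing entries in the slide's table

def impute(column, strategy):
    """Fill missing entries with the column mean (numeric) or mode (categorical)."""
    known = [v for v in column if v != MISSING]
    if strategy == "mean":
        fill = round(statistics.mean(known), 1)
    elif strategy == "mode":
        fill = statistics.mode(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v == MISSING else v for v in column]

# Age and nationality columns copied from the table with missing values.
ages = [59, 8, MISSING, 4, 20, 23, 6, MISSING, 21, MISSING, 20,
        41, 24, 15, 27, 7, 22, 33, 67, 6, 19]
nationalities = ["PAKISTANI", "PAKISTANI", "CHINESE", "CHINESE", "PAKISTANI",
                 "PAKISTANI", MISSING, MISSING, "PAKISTANI", "PAKISTANI",
                 "PAKISTANI", "BANGLA", "CHINESE", "CHINESE", "PAKISTANI",
                 "PAKISTANI", "PAKISTANI", "PAKISTANI", "PAKISTANI",
                 "PAKISTANI", "CHINESE"]

filled_ages = impute(ages, "mean")
filled_nationalities = impute(nationalities, "mode")
print(filled_ages[2], filled_nationalities[6])
```

The 18 known ages sum to 422, so the mean is about 23.4, which is the value the slide fills in; PAKISTANI is the most frequent nationality.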
• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$
Note: n is sample size and N is population size.

– Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
• Mode
– Value that occurs most frequently in the data

14
Symmetric vs. Skewed Data

• Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: distribution curves; panels: symmetric, positively skewed]

15
Recap
• We have covered
– Types of data problems, why data problems occur, the importance of data preprocessing
– Data Preprocessing tasks – Data Cleaning, Data Reduction, Data Integration, Data Transformation
– Data Cleaning
• Missing data - handling missing values

• NEXT
– Data Cleaning
• Noisy & Incomplete data – identify & clean
16
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
– outlier values
• Techniques to detect?
– Box plot
– Histogram analysis
– Clustering

17
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)

– Inter-quartile range: IQR = Q3 – Q1

– Five number summary: min, Q1, median, Q3, max

– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance s² (or σ²)
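
The two forms of the sample variance above (the definition and the "scalable" form built from running sums) should give the same number. The sketch below checks this on a small price list; the function names are hypothetical, and the running-sum form is shown only as an illustration, since it can lose precision on data with very large values.

```python
import math
import statistics

def sample_variance_definition(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_running_sums(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n], needs only two sums."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total ** 2 / n) / (n - 1)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
v1 = sample_variance_definition(prices)
v2 = sample_variance_running_sums(prices)
std_dev = math.sqrt(v1)  # standard deviation is the square root of the variance

print(v1, v2, std_dev)
```

The running-sum form only needs the count, the sum and the sum of squares, which is why it can be updated incrementally as new records arrive.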


18
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum

• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to Minimum and Maximum
– Outliers: points beyond a specified outlier threshold, plotted individually

19
How to Handle Noisy Data?
BINNING

– Sort data
– Perform binning
• Binning Method (smooth by bin means, smooth by bin median, smooth by
bin boundaries)
• Discretization Methods (Equal width distance, Equal frequency distance)
CLUSTERING

– detect and remove outliers

COMBINED HUMAN AND COMPUTER INSPECTION

– detect suspicious values and have a human check them (e.g., deal with possible outliers)

20
Binning Methods for Data Smoothing
*Sorted data for price (RM): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Bin means:
• Divide data into k bins (e.g. 3)
• 12 / 3 = 4 values per bin
• Calculate the mean value for each bin (rounded to the nearest whole number)
- Bin 1 → 4, 8, 9, 15 → (4+8+9+15) / 4 = 9, so Bin 1: 9, 9, 9, 9
- Bin 2 → 21, 21, 24, 25 → mean 22.75 ≈ 23, so Bin 2: 23, 23, 23, 23
- Bin 3 → 26, 28, 29, 34 → mean 29.25 ≈ 29, so Bin 3: 29, 29, 29, 29
• Bin boundaries:
• Divide data into k bins (e.g. 3)
• Represent the data based on the nearest boundary
- Bin 1 → 4, 8, 9, 15 so Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

21
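
The smoothing steps above can be reproduced with a short sketch. It assumes equal-frequency bins, a data size divisible by the number of bins, bin means rounded to the nearest whole number as on the slide, and ties in boundary smoothing resolved towards the lower boundary. All function names are hypothetical.

```python
def equal_depth_bins(values, k):
    """Sort the values and split them into k bins of equal size."""
    data = sorted(values)
    if len(data) % k != 0:
        raise ValueError("this sketch assumes len(values) is divisible by k")
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(by_means)
print(by_boundaries)
```

Smoothing by means removes all variation inside a bin, while smoothing by boundaries keeps the extremes of each bin and pulls the interior values towards them.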
Discretization Methods

• Equal-width (distance) partitioning:
o Divides the range into N intervals of equal size: uniform grid
o If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B – A)/N.
o The most straightforward, but outliers may dominate presentation
o Skewed data is not handled well.

• Equal-depth (frequency) partitioning:
o Divides the range into N intervals, each containing approximately the same number of samples
o Good data scaling
o Managing categorical attributes can be tricky.
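
The difference between the two partitioning methods above is easier to see side by side. The sketch below is one possible reading: equal-width bins use W = (B – A)/N with the maximum value placed in the last interval, and equal-depth bins split the sorted values into N nearly equal groups. Function names are hypothetical.

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    if width == 0:
        raise ValueError("all values are identical")
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would land in bin n, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins holding about the same number of items."""
    data = sorted(values)
    return [data[i * len(data) // n:(i + 1) * len(data) // n] for i in range(n)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)

print(width_bins)
print(depth_bins)
```

Here W = (34 – 4)/3 = 10, so the equal-width intervals start at 4, 14 and 24 and end up unevenly filled, while the equal-depth bins always hold four values each.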

22
Example of Discretization Methods: Binning
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar).

Price (Dollar)
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

Histogram analysis
(Data is badly distributed, too many classes!)
Equal-width (distance) Binning

[Figure: histograms before binning and after equal-width binning]


24
Equal-width (distance) partitioning
→ Find width of intervals
W = (B – A)/N
A = 37 (min value), B = 41.2 (max value), N = 8
W = (41.2 – 37)/8
= 0.525 ≈ 0.5

→ Create new intervals

Interval      Label
37.0 – 37.5   a
37.6 – 38.0   b
38.1 – 38.5   c
38.6 – 39.0   d
39.1 – 39.5   e
39.6 – 40.0   f
40.1 – 40.5   g
40.6 – 41.5   h
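
Once the intervals are fixed, each temperature can be replaced by its interval label. The sketch below is an assumption-laden illustration: it treats the upper bound of each interval as inclusive, uses the standard-library `bisect` module for the lookup, and returns `None` for values outside 37.0 to 41.5. The helper name `temperature_label` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the intervals a..h from the table above.
UPPER_BOUNDS = [37.5, 38.0, 38.5, 39.0, 39.5, 40.0, 40.5, 41.5]
LABELS = "abcdefgh"
LOWEST = 37.0

def temperature_label(value):
    """Map a temperature to its interval label, or None if it is out of range."""
    if value < LOWEST:
        return None
    index = bisect_left(UPPER_BOUNDS, value)  # first interval whose upper bound >= value
    if index == len(UPPER_BOUNDS):
        return None
    return LABELS[index]

readings = [37, 37.5, 38.2, 39, 41.2, 50]
labels = [temperature_label(t) for t in readings]
print(labels)
```

The reading of 50 from the sample table falls outside every interval, which is a useful reminder that noisy values should be handled before or during discretization.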
Equal-depth (frequency)

• Raw data → sort & check frequency → create new interval

[Figure: "Show me!" worked example of equal-depth partitioning]

26
• DISCRETIZATION IN WEKA:

• https://www.youtube.com/watch?v=P--UFzlNGeA

27
Outlier Removal
• Outlier – an observation that lies outside the overall range of the other data values.

• Typically caused by: incorrect measurements, measurements from different populations, measurement of a rare event.
• How to detect: (1) using box plot analysis (also called box-and-whisker analysis or quartile analysis) (2) using clustering / regression line

28
Box Plot Analysis

Box Plot: a technique for displaying one-dimensional data and its characteristics (applies quartile analysis).

• Q1: 1st quartile / 25th percentile
• Median: 2nd quartile / 50th percentile
• Q3: 3rd quartile / 75th percentile
• IQR: distance between Q1 and Q3
• Outlier limits: 1.5 times IQR below Q1 and 1.5 times IQR above Q3

29
Quartile analysis

• Quartile analysis → an outlier is defined as “any value that is more than 1.5 times the interquartile range above the upper quartile (Q3) or below the lower quartile (Q1)”.
• r = (Q1 - 1.5(IQR)) to (Q3 + 1.5(IQR))
• Example: given the sorted dataset
72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94
From the dataset, Q1 = 77 and Q3 = 86, so IQR = 86 - 77 = 9. The range is therefore [77 - 1.5(9) to 86 + 1.5(9)] → {63.5 to 99.5}
Thus the given dataset contains no outlier.

30
BOX PLOT ANALYSIS

• Isabel is a news director at a local television station. She just


hired two new anchors for her noon newscast. Two weeks later,
Isabel's boss wants to meet with her about the new anchors. He's
not sure if the audience likes the change. Isabel conducts a survey
of a random group of 15 people, asking them to rank the anchors on
a scale of 1 to 10, with 10 being the best and 1 being the worst.
These are the ranks she got back on the survey:

• 3, 3, 7, 8, 7, 4, 4, 10, 1, 5, 1, 7, 2, 7, 9

• Isabel now needs to analyze this data. She can use a box plot to
visualize the data.

31
• 1. Sort the data: 1, 1, 2, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 9, 10

• 2. Find the median: median (Q2) = 5

• 3. Find quartile 1: Q1 = 3

• 4. Find quartile 3: Q3 = 7

• 5. Calculate IQR
IQR = Q3 – Q1 → 7 – 3 = 4

• 6. Calculate r
r = (Q1 - 1.5(IQR)) to (Q3 + 1.5(IQR))
r = (3 - 1.5(4)) to (7 + 1.5(4))
r = (-3 to 13), i.e. lower extreme = -3, upper extreme = 13

• 7. Are all of the data located within the range?
YES, the values (1 to 10) are within the range (-3 to 13). So… NO OUTLIER

[Figure: box plot on an axis marked -3, 1, 3, 5, 7, 10, 13, with Q1 = 3, Q2 = 5, Q3 = 7 and the lower and upper extremes labelled]
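
The seven steps above can be checked with a short sketch. Quartile conventions differ between textbooks and tools; this version assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which reproduces the numbers in the worked example. The function names and the extra rating of 25 are hypothetical.

```python
import statistics

def quartile_range(values):
    """Return (Q1, Q3, lower limit, upper limit) using the median-of-halves rule."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    q1 = statistics.median(data[:half])   # lower half, middle value excluded
    q3 = statistics.median(data[-half:])  # upper half, middle value excluded
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values lying outside (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR)."""
    _, _, low, high = quartile_range(values)
    return [v for v in values if v < low or v > high]

ranks = [3, 3, 7, 8, 7, 4, 4, 10, 1, 5, 1, 7, 2, 7, 9]
print(quartile_range(ranks))
print(find_outliers(ranks))

# Hypothetical extra response added only to show an outlier being flagged.
print(find_outliers(ranks + [25]))
```

For the survey data this gives Q1 = 3, Q3 = 7 and the range -3 to 13, so no value is flagged; the made-up rating of 25 falls above the upper limit and is reported.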
32
Cluster Analysis

33
Regression Analysis
[Figure: two scatter plots on x–y axes, each with a fitted model line: the model without the outlier and the model with the outlier]
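
To illustrate how a single outlier can pull the model line, the sketch below fits an ordinary least-squares line with and without one extreme point. The slide does not say which regression method it uses, so least squares is an assumption here, and the data points and the `fit_line` helper are made up for the example.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept) of y = slope*x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data lying exactly on y = 2x, plus one extreme point.
clean = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
with_outlier = clean + [(6, 40)]

slope_clean, intercept_clean = fit_line(clean)
slope_outlier, intercept_outlier = fit_line(with_outlier)

print(slope_clean, intercept_clean)
print(slope_outlier, intercept_outlier)
```

One extreme point triples the fitted slope in this toy data, which is why points far from the model line are worth inspecting before the model is trusted.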
34
Experiment
|Sample of Dengue Data Set|

35
Case Study : Diabetes Dataset
• Total Records: 768
• Total Attributes: 9 with a target class (+ diabetes, - diabetes)
• Attributes Information

36
Case Study : Diabetes Dataset

37
Frequency Analysis
• Identify missing values, forecast data distribution

38
Histogram Analysis
• Investigate data
distribution

39
SPSS: Box plot analysis
• To identify outliers

Find: Q1, Q2, Q3, IQR, allowable range value

40
Videos to watch

• http://www.youtube.com/watch?v=Ys3x2E9WZXQ

• http://www.youtube.com/watch?v=P--UFzlNGeA

41
