Lecture 7 - Data Cleaning
Data Preprocessing
- Data Cleaning
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is smaller in volume but produces the same
or similar analytical results
Forms of Data Preprocessing
Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., due to
faulty instruments, human or computer error, or transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
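The three kinds of dirty data listed above can be checked mechanically. The following is a minimal sketch, not a complete validator: the function name `find_issues`, the record layout, and the reference date are hypothetical, and the birthday "03/07/2010" is read as 7 March 2010, which is an assumption about the date format.

```python
from datetime import date

def find_issues(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    issues = []
    # Incomplete: an empty or whitespace-only value counts as missing.
    if not str(record.get("occupation", "")).strip():
        issues.append("missing occupation")
    # Noisy: a negative salary is an obvious error.
    if record["salary"] < 0:
        issues.append("negative salary")
    # Inconsistent: the stated age should equal the age implied by the birthday.
    b = record["birthday"]
    implied_age = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
    if implied_age != record["age"]:
        issues.append("age does not match birthday")
    return issues

# Record built from the slide's examples (layout and reference date are assumed).
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
print(find_issues(record, today=date(2024, 1, 1)))
```

Disguised missing data (such as Jan. 1 as a default birthday) would pass all three checks, which is why it is listed separately.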
Sampling and Quantization
• If a band-limited signal with maximum frequency fmax is sampled with a sampling
period less than Ts = 1/(2 · fmax), or equivalently with a sampling frequency larger
than fs = 2 · fmax, then the original signal can be completely reconstructed from the
(infinite) time series.
• This is called Shannon’s sampling theorem.
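To see why the condition matters, here is a small sketch of what happens when it is violated. The function names and the chosen frequencies (3 Hz, 1 Hz, 4 Hz) are illustrative assumptions, not taken from the lecture.

```python
import math

def sample_cosine(freq_hz, fs_hz, n_samples):
    """Sample cos(2*pi*freq*t) at sampling frequency fs_hz."""
    return [math.cos(2 * math.pi * freq_hz * k / fs_hz) for k in range(n_samples)]

def satisfies_sampling_theorem(f_max_hz, fs_hz):
    """True if the sampling frequency is larger than 2 * f_max."""
    return fs_hz > 2 * f_max_hz

# A 3 Hz cosine sampled at only 4 Hz violates the condition ...
undersampled = sample_cosine(3.0, 4.0, 8)
# ... and produces the same samples as a 1 Hz cosine (aliasing).
alias = sample_cosine(1.0, 4.0, 8)
indistinguishable = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(undersampled, alias)
)
print(satisfies_sampling_theorem(3.0, 4.0), indistinguishable)
```

Because the two sampled series coincide, the original 3 Hz signal cannot be recovered from its samples at this rate.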
Data Cleaning
Outliers:
Outlier detection:
• Inliers cannot be identified by outlier detection methods.
• In time series data, inliers may be detected when they significantly
deviate from the adjacent values, so a value may be classified as
an inlier if the difference from its neighbors is larger than a
threshold.
• A more common approach to remove inliers from time series is
filtering.
• Constant features (features that take the same value in every record) do not
contain useful information, but they may cause problems with some data analysis
methods and may therefore be removed from the data set.
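The neighbor-difference rule for time series mentioned above can be sketched as follows. This is one possible reading, assuming a value is flagged only when it differs from both its left and right neighbor by more than the threshold. The function name and the sample series are hypothetical.

```python
def flag_by_neighbor_difference(x, threshold):
    """Return indices k where x[k] differs from both neighbors by more than threshold."""
    flagged = []
    for k in range(1, len(x) - 1):  # the first and last values have only one neighbor
        left = abs(x[k] - x[k - 1])
        right = abs(x[k] - x[k + 1])
        if left > threshold and right > threshold:
            flagged.append(k)
    return flagged

# The value 1.5 lies inside the overall range [0.0, 1.6], so a range-based
# outlier test would not catch it, but it clearly breaks the local trend.
series = [0.0, 0.2, 0.4, 1.5, 0.8, 1.0, 1.2, 1.4, 1.6]
print(flag_by_neighbor_difference(series, threshold=0.5))
```

Requiring both differences to exceed the threshold avoids flagging the neighbors of the erroneous value as well.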
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing
classification); not effective when the percentage of missing values per attribute
varies considerably
Fill in the missing value manually: tedious and often infeasible
Fill it in automatically with
a global constant, e.g., “unknown” (which may act as a new class)
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based, such as a Bayesian
formula or a decision tree
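Two of the automatic strategies above, the attribute mean and the class-wise attribute mean, can be sketched in a few lines. The function names, the use of `None` as the missing-value marker, and the toy income data are assumptions made for illustration.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None by the mean of the observed values of the attribute."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None by the mean of the observed values with the same class label."""
    observed = {}
    for v, c in zip(values, labels):
        if v is not None:
            observed.setdefault(c, []).append(v)
    # Assumes every class has at least one observed value.
    class_mean = {c: mean(vs) for c, vs in observed.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

income = [30, None, 50, 100, None, 120]
label = ["low", "low", "low", "high", "high", "high"]
print(fill_with_mean(income))               # both gaps get the overall mean
print(fill_with_class_mean(income, label))  # each gap gets its own class mean
```

In this toy data the overall mean falls between the two classes and fits neither, which illustrates why the class-wise mean is called "smarter" above.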
Data Preprocessing
Inliers, outliers, or missing data can be handled in various ways:
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
technology limitations
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Filtering
Clustering
detect and remove outliers
Binning Methods - Example
* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
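A minimal sketch of equal-frequency binning on the sorted price data. The choice of three bins and the smoothing step (replacing each value by its bin mean) are assumptions: the slide states only that the data are sorted and partitioned into equal-frequency bins.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins that hold the same number of values.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_bin_means(bins))
```

Other smoothing choices, such as bin medians or bin boundaries, would reuse the same partition.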
Filtering
Data: x_k, k = 1, 2, 3, ..., n
Symmetric windows of order q ∈ {3, 5, 7, ...} contain x_k, the (q−1)/2 previous
values, and the (q−1)/2 following values.
Symmetric windows are only suitable for offline filtering, when the
future values of the series are already known.
Filtering
The goal is not only to remove inliers and outliers but also to remove
noise.
Symmetric windows
Asymmetric windows
Asymmetric windows contain only x_k and previous values, so they are also suitable
for online filtering and can provide each filter output y_k as soon as x_k is known.
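The difference between the two window types is easiest to see from the indices they cover. In this sketch the asymmetric window of order q is assumed to hold x_k and the q − 1 previous values. The function names are hypothetical and boundary handling is left out.

```python
def symmetric_window(k, q):
    """Indices of the symmetric window of odd order q centered at k."""
    half = (q - 1) // 2
    return list(range(k - half, k + half + 1))

def asymmetric_window(k, q):
    """Indices of the asymmetric window: k and the q - 1 previous indices."""
    return list(range(k - q + 1, k + 1))

print(symmetric_window(10, 5))   # reaches past k, so future values are needed
print(asymmetric_window(10, 5))  # ends at k, so it can be evaluated online
```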
Filtering
The mean value is often used as the statistical measure for the data
in the window.
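The moving mean or moving average of order q is defined as
symmetric window: y_k = \frac{1}{q} \sum_{i=k-(q-1)/2}^{k+(q-1)/2} x_i
asymmetric window: y_k = \frac{1}{q} \sum_{i=k-q+1}^{k} x_i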
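A minimal sketch of the moving average for both window types, assuming the usual definition as the arithmetic mean of the q values in the window. Lists are 0-indexed here, outputs are produced only where the full window fits, and the function names and sample data are illustrative.

```python
def moving_average_symmetric(x, q):
    """Offline filter: mean of x[k] and (q - 1) / 2 values on each side (q odd)."""
    half = (q - 1) // 2
    return [sum(x[k - half:k + half + 1]) / q for k in range(half, len(x) - half)]

def moving_average_asymmetric(x, q):
    """Online filter: mean of x[k] and the q - 1 previous values."""
    return [sum(x[k - q + 1:k + 1]) / q for k in range(q - 1, len(x))]

x = [1, 2, 3, 10, 3, 2, 1]
print(moving_average_symmetric(x, 3))   # first output belongs to index 1
print(moving_average_asymmetric(x, 3))  # first output belongs to index 2
```

Both calls return the same numbers for this input. The asymmetric filter assigns each mean to the last index of its window, so its output lags the symmetric one by (q − 1)/2 samples.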
Asymmetric moving average filter
[Figure: moving average filtered data, q = 20 and q = 200]
• The amplitude of the single peak is reduced from 2 to 0.5 (q = 20) and to 0.1 (q = 200).
• Larger window sizes q give stronger smoothing.
• The window size should be much smaller than the length of the time series to be
filtered, q << n.
Exponential filter
• The exponential filter works best when the filtered data change slowly.
• Each filter output y_k equals the previous filter output y_{k−1} plus a correction
term, computed as a fraction η ∈ [0, 1] of the previous filter error x_{k−1} − y_{k−1}:
y_k = y_{k-1} + \eta (x_{k-1} - y_{k-1}), k ≥ 3
• The current filter output y_k is affected by each past filter output y_{k−i},
i = 1, ..., k−1, with the multiplier (1−η)^i, so the filter exponentially forgets previous
filter outputs, hence the name exponential filter.
Exponential filter
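• For η = 0 the exponential filter maintains the initial value, y_k = y_0 = 0.
• For η = 1 the filter output merely repeats the previous input, y_k = x_{k−1}, so
nothing is smoothed.
• So, for nontrivial filter behavior, η should be chosen larger than zero but
smaller than one.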
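The update rule described in the bullets can be sketched as below, assuming the recursion y_k = y_{k−1} + η (x_{k−1} − y_{k−1}) with initial output y_0 = 0. The function name and the step input are illustrative.

```python
def exponential_filter(x, eta, y0=0.0):
    """Return [y_0, y_1, ..., y_n] with y_k = y_{k-1} + eta * (x_{k-1} - y_{k-1})."""
    y = [y0]
    for x_prev in x:
        # Correct the previous output by a fraction eta of the previous error.
        y.append(y[-1] + eta * (x_prev - y[-1]))
    return y

step = [1.0, 1.0, 1.0, 1.0]
print(exponential_filter(step, eta=0.0))  # stays at the initial value
print(exponential_filter(step, eta=0.5))  # approaches the input gradually
print(exponential_filter(step, eta=1.0))  # repeats the previous input
```

The two extreme settings show why η is chosen strictly between zero and one: one ignores the data entirely and the other does no smoothing.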
Exponential filter
Time   y_t   S (α = 0.1)   Error   Error squared
1      71
2      70    71            -1.00    1.00
3      69    70.9          -1.90    3.61
4      68    70.71         -2.71    7.34
5      64    70.44         -6.44   41.47
6      65    69.80         -4.80   23.04
7      72    69.32          2.68    7.18
8      78    69.58          8.42   70.90
9      75    70.43          4.57   20.88
10     75    70.88          4.12   16.97
11     75    71.29          3.71   13.76
12     70    71.67         -1.67    2.79
The sum of the squared errors (SSE) is 208.94. The mean of the
squared errors (MSE) is SSE/11 = 19.0.
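The table can be recomputed with a short script. This sketch assumes what the table's first rows suggest: the first smoothed value is S_2 = y_1, and each later value is S_t = S_{t−1} + α (y_{t−1} − S_{t−1}). The function name is hypothetical. Because the script does not round intermediate values, its SSE comes out at roughly 208.8, slightly below the 208.94 obtained by summing the rounded table entries.

```python
def smooth(y, alpha):
    """Smoothed values for times 2..n: S_2 = y_1, S_t = S_{t-1} + alpha * (y_{t-1} - S_{t-1})."""
    s = [y[0]]
    for prev in y[1:-1]:
        s.append(s[-1] + alpha * (prev - s[-1]))
    return s

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
s = smooth(y, alpha=0.1)

# Error at time t is the observation minus the smoothed value, for t = 2..12.
errors = [obs - est for obs, est in zip(y[1:], s)]
sse = sum(e * e for e in errors)
mse = sse / len(errors)  # 11 errors, as on the slide
print(round(sse, 2), round(mse, 2))
```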
Exponential filter
• The MSE was again calculated for α = 0.5 and turned out to be 16.29,
so in this case we would prefer an α of 0.5. Can we do better?
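One way to approach the question is to try several values of α and keep the one with the smallest MSE. The sketch below assumes the same initialization as the worked table (S_2 = y_1) and a coarse grid of candidates from 0.1 to 0.9. Both the grid and the function name are illustrative choices. The numbers it prints need not match the quoted MSE values exactly, since they depend on rounding and on how the first smoothed value is initialized.

```python
def mse_for_alpha(y, alpha):
    """Mean squared one-step error of exponential smoothing with S_2 = y_1."""
    s = y[0]
    total = 0.0
    for t in range(1, len(y)):
        total += (y[t] - s) ** 2   # error of the current estimate
        s += alpha * (y[t] - s)    # estimate for the next time step
    return total / (len(y) - 1)

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best = min(candidates, key=lambda a: mse_for_alpha(y, a))
print(best, round(mse_for_alpha(y, best), 2))
```

A finer grid or a numerical optimizer could refine the result. With only twelve observations, though, the best α is itself quite uncertain.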
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems
in data warehousing”
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration