Lecture 7 - Data Cleaning

clean

Uploaded by

raoseshu

Transfer Functions

Data Preprocessing
- Data Cleaning
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results

Forms of Data Preprocessing

Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary

Data Cleaning
 Data in the real world is dirty: lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
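
The kinds of dirty data listed above can be sketched as simple checks on a single record. This is only an illustration: the record layout, the field names, the reference date, and the helper find_problems are assumptions, not part of the lecture.

```python
from datetime import date

def find_problems(record, today):
    """Return the data-quality problems found in one record (hypothetical layout)."""
    problems = []
    # Incomplete: an empty or whitespace-only attribute value.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation missing")
    # Noisy: a salary cannot be negative.
    if record.get("salary", 0) < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: the stated age should match the age implied by the birthday.
    birthday = record.get("birthday")
    if birthday is not None:
        had_birthday_yet = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday_yet else 1)
        if implied_age != record.get("age"):
            problems.append("inconsistent: age does not match birthday")
    return problems

record = {
    "occupation": " ",
    "salary": -10,
    "age": 42,
    # "03/07/2010" from the slide, read here as day/month/year (an assumption).
    "birthday": date(2010, 7, 3),
}
problems = find_problems(record, today=date(2024, 1, 1))
print(problems)
```

Disguised missing data (such as Jan. 1 as a default birthday) cannot be caught by a per-record check like this; it usually shows up only as a suspicious spike when the values of an attribute are counted.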

Sampling and Quantization


• If a band-limited signal whose highest frequency is fmax is sampled with a
sampling period less than Ts = 1/(2 · fmax), or equivalently with a sampling
frequency larger than fs = 2 · fmax, then the original signal can be completely
reconstructed from the (infinite) time series of samples.
• This is called Shannon’s sampling theorem.
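
A small numerical sketch of what goes wrong when the condition is violated. The frequencies (a 3 Hz sine sampled at 4 Hz) and the helper names are assumptions chosen for illustration only.

```python
import math

def min_sampling_frequency(f_max):
    """The sampling frequency must be larger than this value (2 * f_max)."""
    return 2.0 * f_max

def sample_sine(freq, fs, n_samples):
    """Sample sin(2*pi*freq*t) at the sampling frequency fs."""
    return [math.sin(2.0 * math.pi * freq * k / fs) for k in range(n_samples)]

f_max = 3.0        # highest frequency in the signal (Hz)
fs_too_low = 4.0   # below 2 * f_max = 6 Hz

undersampled = sample_sine(f_max, fs_too_low, 8)
# At 4 Hz the 3 Hz sine yields exactly the samples of a sign-flipped 1 Hz sine,
# so the two signals cannot be told apart from the samples alone (aliasing).
alias = [-v for v in sample_sine(1.0, fs_too_low, 8)]
indistinguishable = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(undersampled, alias)
)
print(min_sampling_frequency(f_max), indistinguishable)
```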

Data Cleaning
Outliers:

[Figure: three panels titled Original data, Outliers, and Drift]

Data Cleaning
Outliers detection:
• Inliers cannot be identified by outlier detection methods.
• In time series data, inliers may be detected when they significantly
deviate from the adjacent values, so a value may be classified as
an inlier if the difference from its neighbors is larger than a
threshold.
• A more common approach to remove inliers from time series is
filtering.
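
The neighbor-difference rule for time series can be sketched as follows. Requiring a large difference to both adjacent values, the threshold of 0.5, and the example series are assumptions made for this illustration.

```python
def flag_neighbor_deviations(x, threshold):
    """Flag x[k] when it differs from both adjacent values by more than threshold."""
    flags = [False] * len(x)
    for k in range(1, len(x) - 1):
        left = abs(x[k] - x[k - 1])
        right = abs(x[k] - x[k + 1])
        if left > threshold and right > threshold:
            flags[k] = True
    return flags

# A slowly rising series; the value 1.8 at index 3 lies inside the overall
# range [0, 2], so a check on the global value range would not find it.
series = [0.0, 0.2, 0.4, 1.8, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
flags = flag_neighbor_deviations(series, threshold=0.5)
suspicious = [k for k, flagged in enumerate(flags) if flagged]
print(suspicious)
```

The first and last values have only one neighbor and are never flagged in this sketch.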

 Constant data features may be erroneous or correct.

• Such constant features do not contain useful information, but they may
cause problems with some data analysis methods and may therefore
be removed from the data set
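
Removing constant features can be sketched as below, assuming the data set is stored as a list of rows with one column per feature. The helper name drop_constant_features is hypothetical.

```python
def drop_constant_features(rows):
    """Remove every column whose value is identical in all rows."""
    if not rows:
        return rows, []
    n_cols = len(rows[0])
    # A column is kept only if it takes at least two distinct values.
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    dropped = [j for j in range(n_cols) if j not in keep]
    return [[row[j] for j in keep] for row in rows], dropped

data = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
]
cleaned, dropped = drop_constant_features(data)
print(cleaned, dropped)
```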
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data that were inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not being considered important at the time of entry
 history or changes of the data not being registered
 Missing data may need to be inferred

How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: inference-based such as Bayesian
formula or decision tree
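
Two of the automatic strategies above, the attribute mean and the attribute mean per class, can be sketched as follows. Missing values are represented by None, and the income values and class labels are made up for illustration.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None by the mean of the observed values of the attribute."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None by the mean of the observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

income = [30, None, 50, 100, None, 120]
label = ["low", "low", "low", "high", "high", "high"]

filled_global = fill_with_mean(income)                # overall mean is 75
filled_by_class = fill_with_class_mean(income, label)  # class means are 40 and 110
print(filled_global)
print(filled_by_class)
```

The class-wise version keeps the filled values closer to their group, which is why the slide calls it smarter. It fails in this sketch if a class has no observed value at all.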

Data Preprocessing
 Inliers, outliers, or missing data can be handled in various ways:

• It is often worth the effort to estimate missing data and to correct invalid data.

• If sufficient data are available and data quality is important, then suspicious data should be completely removed.

Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data to regression functions
 Filtering
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)
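
As one instance of the regression approach listed above, noisy values can be replaced by the values of a fitted straight line. This is a minimal sketch that assumes a linear relationship; the data points and the helper fit_line are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]   # roughly y = 1 + 2x plus noise

a, b = fit_line(xs, ys)
smoothed = [a + b * x for x in xs]   # noisy values replaced by the fitted line
print(round(a, 2), round(b, 2))
```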
Binning Methods - Example

* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
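
The example above can be reproduced with a few lines of Python. The helper names are hypothetical, the bin means are rounded to integers as in the example, and a value that is equally far from both boundaries is assigned to the lower one (an assumption; no such tie occurs in this data).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins of equal size (length must divide evenly)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```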
Filtering
 The goal is not only to remove inliers and outliers but also to remove
noise.
 Symmetric windows
 Asymmetric windows

 Data: x_k, k = 1, 2, 3, . . . , n
 Symmetric windows of the order q ∈ {3, 5, 7, . . .} consider the window
w_kq = {x_i | i = k−(q−1)/2, . . . , k+(q−1)/2},
which contains x_k, the (q−1)/2 previous values, and the (q−1)/2 following
values.
 Symmetric windows are only suitable for offline filtering, when the
future values of the series are already known.

 Asymmetric windows of the order q ∈ {2, 3, 4, . . .} consider the window
w_kq = {x_i | i = k−(q−1), . . . , k}, which contains x_k and the q−1 previous
values.

 Asymmetric windows are also suitable for online filtering and are
able to provide each filter output y_k as soon as x_k is known.

Filtering
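 The mean value is often used as the statistical measure for the data
in the window.
 The moving mean or moving average of order q is the mean of the values in the window.

Symmetric moving average filter:

$$y_k = \frac{1}{q} \sum_{i=k-(q-1)/2}^{k+(q-1)/2} x_i$$

Asymmetric moving average filter:

$$y_k = \frac{1}{q} \sum_{i=k-(q-1)}^{k} x_i$$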
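
Both filters take the mean of the values in the window around (symmetric) or up to (asymmetric) x_k. A minimal sketch follows; it produces outputs only where the full window fits inside the series, which is an assumption about edge handling that the slides do not specify.

```python
def symmetric_moving_average(x, q):
    """Mean over x[k] and its (q-1)/2 neighbours on each side; q must be odd."""
    if q % 2 == 0:
        raise ValueError("q must be odd for a symmetric window")
    h = (q - 1) // 2
    # Needs future values x[k+1..k+h], so it is only usable offline.
    return [sum(x[k - h:k + h + 1]) / q for k in range(h, len(x) - h)]

def asymmetric_moving_average(x, q):
    """Mean over x[k] and its q-1 previous values; usable online."""
    return [sum(x[k - q + 1:k + 1]) / q for k in range(q - 1, len(x))]

x = [1.0, 2.0, 3.0, 10.0, 5.0, 6.0, 7.0]
sym = symmetric_moving_average(x, 3)    # outputs for the 2nd to 6th value
asym = asymmetric_moving_average(x, 3)  # outputs for the 3rd to 7th value
print(sym)
print(asym)
```

For the same q, both filters compute the same window means; the asymmetric filter simply delivers each of them (q−1)/2 steps later, which is the price of being usable online.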

Asymmetric moving average filter

[Figure: moving average filtered data, q = 20 and q = 200]

• The amplitude of the single peak is reduced from 2 to 0.5 (q = 20) and 0.1 (q = 200).
• Larger values of the window size q give a stronger filter effect.
• The window size should be much smaller than the length of the time series to be
filtered, q << n.

Exponential filter
• The exponential filter works best with slow changes of the filtered data

• Each value of the filter output y_k is similar to the previous value of the
filter output y_{k−1}, except for a correction term that is computed as a
fraction η ∈ [0, 1] of the previous filter error x_{k−1} − y_{k−1}.

• The exponential filter is

$$y_k = y_{k-1} + \eta \, (x_{k-1} - y_{k-1})$$

• The current filter output y_k is affected by each past filter output y_{k−i}, i = 1, . . . ,
k−1, with the multiplier (1−η)^i, so the filter exponentially forgets previous filter
outputs, hence the name exponential filter.

• k >= 3

Exponential filter
• For η = 0 the exponential filter maintains the initial value, y_k = y_0 = 0.

• For η = 1 it yields the previous filter input, y_k = x_{k−1}.

• So, for nontrivial filter behavior, η should be chosen larger than zero but
smaller than one.

• η has to be chosen carefully. It has to be small enough to achieve a sufficient
filter effect but large enough to maintain the essential characteristics of the
original data.
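
The description of the exponential filter (the previous output plus a fraction η of the previous filter error, starting from y_0 = 0) can be sketched as below. The constant input series is made up to show the two limiting cases and one value in between.

```python
def exponential_filter(x, eta, y0=0.0):
    """Apply y[k] = y[k-1] + eta * (x[k-1] - y[k-1]), starting from y0.

    Returns y[0], ..., y[len(x)], i.e. one more value than the input has.
    """
    y = [y0]
    for x_prev in x:
        y.append(y[-1] + eta * (x_prev - y[-1]))
    return y

x = [1.0, 1.0, 1.0, 1.0]
frozen = exponential_filter(x, 0.0)    # eta = 0: stays at the initial value
delayed = exponential_filter(x, 1.0)   # eta = 1: the input, one step late
gradual = exponential_filter(x, 0.5)   # in between: approaches the input
print(frozen)
print(delayed)
print(gradual)
```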
Exponential filter
Consider the time series with nine periods of data:
34, 38, 46, 41, 43, 48, 51, 50, 56

Exponential filter
Time   y_t   S (α = 0.1)   Error   Error squared

1      71
2      70    71            -1.00    1.00
3      69    70.9          -1.90    3.61
4      68    70.71         -2.71    7.34
5      64    70.44         -6.44   41.47
6      65    69.80         -4.80   23.04
7      72    69.32          2.68    7.18
8      78    69.58          8.42   70.90
9      75    70.43          4.57   20.88
10     75    70.88          4.12   16.97
11     75    71.29          3.71   13.76
12     70    71.67         -1.67    2.79

The sum of the squared errors (SSE) = 208.94. The mean of the
squared errors (MSE) is the SSE /11 = 19.0.
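
The table can be recomputed with a short script. The rows are consistent with S_2 = y_1 and S_t = α · y_{t−1} + (1 − α) · S_{t−1}, where α plays the role of η; this recurrence is inferred from the numbers in the table rather than stated on the slide. The script works with unrounded values, so its SSE agrees with the table only up to rounding.

```python
def smooth(y, alpha):
    """Smoothed series: s[0] is undefined (None), s[1] = y[0], and
    s[t] = alpha * y[t-1] + (1 - alpha) * s[t-1] for later t (0-based)."""
    s = [None, y[0]]
    for t in range(2, len(y)):
        s.append(alpha * y[t - 1] + (1 - alpha) * s[t - 1])
    return s

def sse_and_mse(y, alpha):
    """Sum and mean of the squared errors y[t] - s[t] over all t with a defined s[t]."""
    s = smooth(y, alpha)
    errors = [y[t] - s[t] for t in range(1, len(y))]
    sse = sum(e * e for e in errors)
    return sse, sse / len(errors)

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
s = smooth(y, 0.1)
sse, mse = sse_and_mse(y, 0.1)
print(round(sse, 2), round(mse, 1))
```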
Exponential filter

• The MSE was again calculated for α = 0.5 and turned out to be 16.29,
so in this case we would prefer an α of 0.5. Can we do better?

• We could apply the proven trial-and-error method.

• This is an iterative procedure beginning with a range of α between 0.1 and 0.9.

• We determine the best initial choice for α and then search between α−Δ and α+Δ.
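
The trial-and-error procedure can be sketched as a grid search that is refined around the best value found so far. The data are the twelve values from the table above; taking Δ equal to the current grid step, shrinking the step by a factor of 10 per round, and running three rounds are all assumptions made for this sketch.

```python
def mse(y, alpha):
    """Mean squared one-step error of exponential smoothing with S_2 = y_1."""
    s = y[0]
    total = 0.0
    for t in range(1, len(y)):
        total += (y[t] - s) ** 2
        s = alpha * y[t] + (1 - alpha) * s   # smoothed value for the next step
    return total / (len(y) - 1)

def search_alpha(y, lo=0.1, hi=0.9, step=0.1, rounds=3):
    """Coarse grid search, then repeated refinement around the best alpha."""
    best = lo
    for _ in range(rounds):
        n = int(round((hi - lo) / step))
        grid = [lo + i * step for i in range(n + 1)]
        best = min(grid, key=lambda a: mse(y, a))
        # Next round: search between best - step and best + step on a finer grid.
        lo, hi = max(best - step, 0.0), min(best + step, 1.0)
        step /= 10.0
    return best

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
best = search_alpha(y)
print(round(best, 3), round(mse(y, best), 2))
```

A grid search like this only finds a local improvement around the best coarse value; it is cheap here because each evaluation of the MSE is a single pass over the series.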

Data Cleaning
 Importance
 “Data cleaning is one of the three biggest problems
in data warehousing”
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

Reference

 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
