Lecture 7 - Data Cleaning

clean

Uploaded by

raoseshu

Transfer Functions

Data Preprocessing
- Data Cleaning
Major Tasks in Data Preprocessing

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results

Forms of Data Preprocessing

Data Preprocessing

• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., due to faulty instruments, human or computer error, or transmission error
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., Occupation=“ ” (missing data)
  - Noisy: containing noise, errors, or outliers
    e.g., Salary=“−10” (an error)
  - Inconsistent: containing discrepancies in codes or names, e.g.,
    Age=“42”, Birthday=“03/07/2010”
    Was rating “1, 2, 3”, now rating “A, B, C”
    Discrepancy between duplicate records
  - Intentional (e.g., disguised missing data)
    Jan. 1 as everyone’s birthday?

Sampling and Quantization

• If a signal whose highest frequency component is fmax is sampled with a sampling period less than Ts = 1/(2 · fmax), or equivalently with a sampling frequency larger than fs = 2 · fmax, then the original signal can be completely reconstructed from the (infinite) time series.
• This is called Shannon’s sampling theorem.
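
A small sketch of what goes wrong when the sampling condition is violated. The frequencies (a 3 Hz sine, sampled at 4 Hz and at 10 Hz) and the helper name `sample_sine` are illustrative assumptions, not taken from the lecture.

```python
import math

def sample_sine(freq_hz, fs_hz, n_samples):
    """Sample sin(2*pi*f*t) at the times t = k / fs, k = 0..n_samples-1."""
    return [math.sin(2 * math.pi * freq_hz * k / fs_hz) for k in range(n_samples)]

f_max = 3.0  # highest frequency of the (hypothetical) signal, in Hz

# fs = 4 Hz is below 2 * f_max: the 3 Hz sine yields the same samples
# as a sign-flipped 1 Hz sine, so the two cannot be told apart (aliasing).
undersampled = sample_sine(f_max, 4.0, 8)
alias = [-v for v in sample_sine(1.0, 4.0, 8)]
aliased = all(abs(a - b) < 1e-9 for a, b in zip(undersampled, alias))

# fs = 10 Hz is above 2 * f_max: the same two sines now give different samples.
well_sampled = sample_sine(f_max, 10.0, 8)
other = [-v for v in sample_sine(1.0, 10.0, 8)]
distinguishable = any(abs(a - b) > 1e-3 for a, b in zip(well_sampled, other))

print(aliased, distinguishable)
```

Below the limit, two different signals map to the same time series, which is why reconstruction is no longer possible.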

Data Cleaning
Outliers:

[Figure: panels titled “Original data”, “Outliers”, and “Drift”]

Data Cleaning
Outlier detection:
• Inliers cannot be identified by outlier detection methods.
• In time series data, inliers may be detected when they deviate significantly from the adjacent values: a value may be classified as an inlier if the difference from its neighbors is larger than a threshold.
• A more common approach to removing inliers from time series is filtering.
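
The neighbor-difference rule can be sketched as follows. This is only one possible reading: it assumes a value is flagged when it differs from both its left and its right neighbor by more than the threshold, and the function name, example series, and threshold are hypothetical.

```python
def flag_neighbor_deviations(xs, threshold):
    """Return indices whose value differs from BOTH neighbors by more than threshold.

    Assumption: the first and last values are never flagged because they
    have only one neighbor.
    """
    flagged = []
    for k in range(1, len(xs) - 1):
        left_gap = abs(xs[k] - xs[k - 1])
        right_gap = abs(xs[k] - xs[k + 1])
        if left_gap > threshold and right_gap > threshold:
            flagged.append(k)
    return flagged

series = [1.0, 1.1, 1.2, 3.0, 1.3, 1.4, 1.5]
suspects = flag_neighbor_deviations(series, threshold=1.0)
print(suspects)
```

Requiring both gaps to exceed the threshold avoids flagging the regular points on either side of a spike.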

• Constant data features may be erroneous or correct.
• Such constant features do not contain useful information, but they may cause problems with some data analysis methods and may therefore be removed from the data set.
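
Removing constant features is straightforward to sketch. The row-of-lists data layout and the function name below are assumptions made for illustration.

```python
def drop_constant_features(rows):
    """Remove columns whose value is identical in every row.

    Returns the reduced rows and the indices of the dropped columns.
    """
    if not rows:
        return rows, []
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    dropped = [j for j in range(n_cols) if j not in keep]
    return [[row[j] for j in keep] for row in rows], dropped

data = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.1],
]
cleaned, dropped = drop_constant_features(data)
print(cleaned, dropped)
```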
Incomplete (Missing) Data
• Data are not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - data that were inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and possibly infeasible
• Fill it in automatically with
  - a global constant, e.g., “unknown” (effectively a new class?)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class (smarter)
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree
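
A minimal sketch of the two mean-based fill strategies. It assumes missing values are encoded as `None`; the income values, class labels, and function names are made up for illustration.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None by the mean of all observed values of the attribute."""
    attribute_mean = mean(v for v in values if v is not None)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None by the mean of the observed values in the same class.

    Assumption: every class has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    class_means = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels) if l == label and v is not None]
        class_means[label] = mean(observed)
    return [class_means[l] if v is None else v for v, l in zip(values, labels)]

income = [30.0, None, 50.0, 90.0, None, 110.0]
label = ["low", "low", "low", "high", "high", "high"]

by_mean = fill_with_mean(income)
by_class = fill_with_class_mean(income, label)
print(by_mean)
print(by_class)
```

In this toy data the global mean (70) sits between the two classes, while the class-wise means (40 and 100) stay close to the values of each group, which is the sense in which the class-wise variant is "smarter".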

Data Preprocessing
• Inliers, outliers, or missing data can be handled in various ways:


• It is often worth the effort to estimate missing data and to correct invalid data.
• If sufficient data are available and data quality is important, then suspicious data should be completely removed.

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistencies in naming conventions
• Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
• Binning
  - first sort the data and partition them into (equal-frequency) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  - smooth by fitting the data to regression functions
• Filtering
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., to deal with possible outliers)
Binning Methods - Example

* Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (each value is replaced by the bin mean, rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
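
The example above can be reproduced with a short sketch. The function names are hypothetical, the bin count is assumed to divide the number of values evenly, and ties in boundary smoothing are assumed to go to the lower boundary (no tie occurs in this data).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins of equal size (length must divide evenly)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and bin maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```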
Filtering
• The goal is not only to remove inliers and outliers but also to remove noise.
• Symmetric windows
• Asymmetric windows

• Data: $x_k$, $k = 1, 2, \ldots, n$
• Symmetric windows of order $q \in \{3, 5, 7, \ldots\}$ consider the window
  $w_{kq} = \{x_i \mid i = k-(q-1)/2, \ldots, k+(q-1)/2\}$,
  which contains $x_k$, the $(q-1)/2$ previous values, and the $(q-1)/2$ following values.
• Symmetric windows are only suitable for offline filtering, when the future values of the series are already known.

• Asymmetric windows of order $q \in \{2, 3, 4, \ldots\}$ consider the window
  $w_{kq} = \{x_i \mid i = k-(q-1), \ldots, k\}$,
  which contains $x_k$ and the $q-1$ previous values.
• Asymmetric windows are also suitable for online filtering and can provide each filter output $y_k$ as soon as $x_k$ is known.

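Filtering
• The mean value is often used as the statistical measure for the data in the window.
• The moving mean or moving average of order $q$ is defined as
  $y_k = \frac{1}{q} \sum_{x \in w_{kq}} x$

Symmetric moving average filter
  $y_k = \frac{1}{q} \sum_{i=k-(q-1)/2}^{k+(q-1)/2} x_i$

Asymmetric moving average filter
  $y_k = \frac{1}{q} \sum_{i=k-q+1}^{k} x_i$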
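
A minimal sketch of both moving average filters, taking the mean over the symmetric or asymmetric window around each position. It assumes outputs are produced only where the full window fits inside the series (the lecture does not say how the edges are treated); the function names and test series are hypothetical.

```python
def moving_average_symmetric(xs, q):
    """Mean over x_k, the (q-1)/2 previous and the (q-1)/2 following values."""
    if q < 3 or q % 2 == 0:
        raise ValueError("symmetric windows need an odd order q >= 3")
    half = (q - 1) // 2
    return [sum(xs[k - half:k + half + 1]) / q for k in range(half, len(xs) - half)]

def moving_average_asymmetric(xs, q):
    """Mean over x_k and the q-1 previous values (usable online)."""
    if q < 2:
        raise ValueError("asymmetric windows need an order q >= 2")
    return [sum(xs[k - q + 1:k + 1]) / q for k in range(q - 1, len(xs))]

xs = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]  # a single peak at index 3
sym = moving_average_symmetric(xs, 3)     # outputs for indices 1..5
asym = moving_average_asymmetric(xs, 3)   # outputs for indices 2..6
print(sym)
print(asym)
```

Both filters spread the peak of height 3 over three outputs of height 1. The symmetric output is centered on the peak, while the asymmetric output lags behind it because it only looks backward.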

Asymmetric moving average filter

[Figure: moving average filtered data for q = 20 and q = 200]

• The amplitude of the single peak is reduced from 2 to 0.5 (q = 20) and to 0.1 (q = 200).
• Larger values of the window size q give a stronger filter effect.
• The window size should nevertheless be much smaller than the length of the time series to be filtered, q << n.

Exponential filter
• The exponential filter works best with slow changes of the filtered data.
• Each value of the filter output $y_k$ is similar to the previous filter output $y_{k-1}$, except for a correction term computed as a fraction $\eta \in [0, 1]$ of the previous filter error $x_{k-1} - y_{k-1}$.
• The exponential filter is
  $y_k = y_{k-1} + \eta \, (x_{k-1} - y_{k-1}), \quad k \geq 3$
• The current filter output $y_k$ is affected by each past filter output $y_{k-i}$, $i = 1, \ldots, k-1$, with the multiplier $(1-\eta)^i$, so the filter exponentially forgets previous filter outputs, hence the name exponential filter.
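
To see where the multiplier $(1-\eta)^i$ comes from, the recursion can be unrolled step by step. This is a short derivation sketch that assumes the recursion $y_k = y_{k-1} + \eta\,(x_{k-1} - y_{k-1})$ implied by the description above.

```latex
\begin{align*}
y_k &= (1-\eta)\, y_{k-1} + \eta\, x_{k-1} \\
    &= (1-\eta)^2\, y_{k-2} + \eta \left[ x_{k-1} + (1-\eta)\, x_{k-2} \right] \\
    &= (1-\eta)^i\, y_{k-i} + \eta \sum_{j=0}^{i-1} (1-\eta)^j\, x_{k-1-j}
\end{align*}
```

Because $0 \le 1-\eta \le 1$, the weight of $y_{k-i}$ shrinks geometrically with $i$, and older inputs are discounted in the same way.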

Exponential filter
• For $\eta = 0$ the exponential filter maintains its initial value, $y_k = y_0 = 0$.
• For $\eta = 1$ it yields the previous filter input, $y_k = x_{k-1}$.
• So, for nontrivial filter behavior, $\eta$ should be chosen larger than zero but smaller than one.
• $\eta$ has to be chosen carefully: small enough to achieve a sufficient filter effect, but large enough to maintain the essential characteristics of the original data.
Exponential filter
Consider the time series with nine periods of data:
34, 38, 46, 41, 43, 48, 51, 50, 56
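
A minimal sketch of the exponential filter applied to this series. It assumes 0-based indexing, with `ys[0]` set to an initial value `y0` and each later output computed from the previous output and the previous input; the function name, the default `y0 = 0`, and the choice `eta = 0.3` are illustrative assumptions.

```python
def exponential_filter(xs, eta, y0=0.0):
    """ys[0] = y0, then ys[k] = ys[k-1] + eta * (xs[k-1] - ys[k-1])."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    ys = [y0]
    for k in range(1, len(xs)):
        ys.append(ys[-1] + eta * (xs[k - 1] - ys[-1]))
    return ys

series = [34, 38, 46, 41, 43, 48, 51, 50, 56]

frozen = exponential_filter(series, eta=0.0)    # stays at the initial value
delayed = exponential_filter(series, eta=1.0)   # reproduces the previous input
smoothed = exponential_filter(series, eta=0.3, y0=series[0])
print(frozen)
print(delayed)
print([round(y, 2) for y in smoothed])
```

The two extreme settings show why only values of eta strictly between 0 and 1 give a useful filter.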

Exponential filter
Here the smoothing constant α plays the role of η, y_t is the observed value, and S is the filter output.

Time   y_t   S (α = 0.1)   Error   Error squared
  1     71
  2     70      71.00       -1.00       1.00
  3     69      70.90       -1.90       3.61
  4     68      70.71       -2.71       7.34
  5     64      70.44       -6.44      41.47
  6     65      69.80       -4.80      23.04
  7     72      69.32        2.68       7.18
  8     78      69.58        8.42      70.90
  9     75      70.43        4.57      20.88
 10     75      70.88        4.12      16.97
 11     75      71.29        3.71      13.76
 12     70      71.67       -1.67       2.79

The sum of the squared errors (SSE) is 208.94. The mean of the squared errors (MSE) is SSE/11 = 19.0.
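
The table can be recomputed with a short sketch. It assumes the initialization S_2 = y_1 and the update S_t = S_{t-1} + α (y_{t-1} − S_{t-1}), which is what the first rows of the table imply; the function and variable names are hypothetical.

```python
def smooth(ys, alpha):
    """Return [S_2, S_3, ..., S_n] with S_2 = y_1."""
    s = [float(ys[0])]
    for t in range(1, len(ys) - 1):
        s.append(s[-1] + alpha * (ys[t] - s[-1]))
    return s

observed = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
smoothed = smooth(observed, alpha=0.1)

# Error in period t is y_t - S_t, for t = 2..12 (11 errors in total).
errors = [y - s for y, s in zip(observed[1:], smoothed)]
sse = sum(e * e for e in errors)
mse = sse / len(errors)
print(round(sse, 2), round(mse, 2))
```

Because the table squares errors that were already rounded to two decimals, a full-precision computation gives an SSE slightly below 208.94; the MSE still rounds to 19.0.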
Exponential filter

• The MSE was calculated again for α = 0.5 and turned out to be 16.29, so in this case we would prefer α = 0.5. Can we do better?
• We could apply the proven trial-and-error method.
• This is an iterative procedure beginning with a range of α between 0.1 and 0.9.
• We determine the best initial choice for α and then search between α−Δ and α+Δ.
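
The trial-and-error search can be sketched as a coarse grid followed by one refinement pass. The values Δ = 0.1 and a fine step of 0.01 are assumptions (the lecture does not specify them), as are the function names; the data are the twelve observations from the table above, and the error convention is S_2 = y_1.

```python
def mse(ys, alpha):
    """Mean squared one-step error of exponential smoothing with S_2 = y_1."""
    s = float(ys[0])
    total = 0.0
    for t in range(1, len(ys)):
        err = ys[t] - s
        total += err * err
        s = s + alpha * (ys[t] - s)  # smoothed value for the next period
    return total / (len(ys) - 1)

def search_alpha(ys, delta=0.1, step=0.01):
    """Coarse search over 0.1..0.9, then a finer search in [best-delta, best+delta]."""
    coarse = [i / 10 for i in range(1, 10)]
    best = min(coarse, key=lambda a: mse(ys, a))
    n_steps = int(round(2 * delta / step))
    fine = [best] + [best - delta + i * step for i in range(n_steps + 1)]
    fine = [a for a in fine if 0.0 < a <= 1.0]  # keep alpha in a valid range
    refined = min(fine, key=lambda a: mse(ys, a))
    return best, refined

observed = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
coarse_alpha, refined_alpha = search_alpha(observed)
print(coarse_alpha, round(refined_alpha, 2), round(mse(observed, refined_alpha), 2))
```

Further passes with a smaller Δ and step could follow the same pattern. Full-precision MSE values may differ slightly from the rounded figures quoted in the slides.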

Data Cleaning
• Importance
  - “Data cleaning is one of the three biggest problems in data warehousing”
• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Reference

• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
