
DATA PREPROCESSING

Why preprocess the data?

What are the methods of Data Preprocessing?
 Descriptive Data Summarization
 Data Cleaning
 Data Integration and Transformation
 Data Reduction
 Discretization and Concept Hierarchy Generation
Why Preprocess the Data?
 Data is collected from multiple, heterogeneous sources.
 The data size is typically huge.
 The data may be inconsistent and noisy.
 Some values may be missing.
 To handle a high volume of data, the quality of the data needs to be checked.
 Low-quality data will affect the mining results.
 High-quality data used in DM systems leads to optimal DM results.
Data Preprocessing Techniques
Data preprocessing techniques are applied to improve the quality of the patterns mined and to reduce the time required for mining.

They are applied before the mining process is carried out.

Preprocessing Techniques:
 Descriptive Data Summarization
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
Descriptive Data Summarization
 Used to identify the typical properties of the data.
 Helps identify which data values in the dataset can be treated as noise or outliers.
 The central tendency and dispersion of the data are measured.
 Central Tendency: Mean, median and mode are the measures of central tendency. Measures are also categorized in DM systems by how efficiently they can be computed.
 Distributive Measure: The dataset is partitioned into smaller subsets and the measure is computed for each subset. The per-subset results are merged to obtain the measure for the whole dataset. The functions sum() and count() are distributive measures.
 Algebraic Measure: A measure that can be computed by applying an algebraic function to distributive measures, e.g. mean = sum() / count().
 Holistic Measure: A measure that must be computed on the whole dataset and cannot be obtained by merging partition results. Ex: median.
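A minimal Python sketch (with made-up values) of the distinction: sum() and count() can be computed per partition and merged, the mean follows algebraically from them, while the median must be computed on the whole dataset:

# Illustration: distributive/algebraic vs. holistic measures (values are made up).
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Split the dataset into two arbitrary partitions.
partitions = [values[:4], values[4:]]

# sum() and count (len) are distributive: partition results can be merged.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

# The mean is algebraic: a function (sum / count) of distributive measures.
mean = total / count
print("mean =", mean)

# The median is holistic: it needs the full, sorted dataset; partition medians
# cannot, in general, be merged into the true median.
ordered = sorted(values)
mid = len(ordered) // 2
median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
print("median =", median)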
Descriptive Data Summarization

Dispersion of Data: The degree to which numerical data tend to spread is known as the dispersion of the data.

Measures of Data Dispersion: Range, quartiles and standard deviation are the measures of data dispersion.
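A small illustrative sketch, assuming NumPy and made-up values, computing these dispersion measures:

import numpy as np

# Illustrative data (assumed for the example).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

value_range = data.max() - data.min()      # range
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
iqr = q3 - q1                              # interquartile range
std = data.std(ddof=1)                     # sample standard deviation

print(f"range={value_range}, Q1={q1}, Q3={q3}, IQR={iqr}, std={std:.2f}")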
Types of attributes
 Nominal:
 The values of a nominal attribute are just different names, i.e. nominal attributes provide only enough information to distinguish one object from another (=, ≠).
 Examples: zip codes, employee ID numbers.

 Ordinal:
 The values of an ordinal attribute provide enough information to order objects (<, >).
 Examples: hardness of minerals, street numbers.

 Interval:
 For interval attributes, the differences between values are meaningful, i.e. a unit of measurement exists (+, -).
 Examples: calendar dates, temperature in Celsius or Fahrenheit.

 Ratio:
 For ratio attributes, both differences and ratios are meaningful (*, /).
 Examples: temperature in Kelvin, counts, age.
DATA CLEANING
Data cleaning is the process of cleaning the data in the following ways:
 Fill in missing values
 Remove outliers
 Resolve data inconsistencies
 Smooth noisy data

Missing Values:
The missing values in a dataset need to be filled. Options include:
 Ignore the entire tuple in the dataset.
 Manually fill in the missing value.
 Use a global constant to fill in the missing value.
 Use the attribute mean to fill in the missing value.
 Use the most appropriate/probable value to fill in the missing value.
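A hedged sketch of some of these options using pandas; the data frame and its 'income' column are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data frame with missing values in the (made-up) 'income' column.
df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [50000, np.nan, 64000, np.nan, 58000]})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill with a global constant.
filled_const = df.fillna({"income": 0})

# Option 3: fill with the attribute mean.
filled_mean = df.fillna({"income": df["income"].mean()})

print(filled_mean)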
DATA CLEANING
Remove Outliers:
 Outliers are extreme values that fall a long way outside of the other observations.

 There can be many reasons for the presence of outliers in the data. Sometimes the outliers may be genuine, while in other cases they could exist because of data entry errors.

 It is important to understand the reasons for the outliers before cleaning them.
DATA CLEANING
Outliers can be found by running summary statistics on the variables. This can be done using the describe() function, which provides a statistical summary of all the quantitative variables.

Outlier identification methods:
 Identifying outliers with the interquartile range (IQR)
 Identifying outliers with skewness
 Identifying outliers with visualization

Outlier treatment:
 Quantile-based flooring and capping
 Trimming
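A possible sketch of these steps with pandas (the 'price' series and the 5th/95th-percentile caps are assumptions for illustration): describe() for the summary, the 1.5×IQR rule for identification, and flooring/capping or trimming for treatment:

import pandas as pd

# Hypothetical numeric attribute with two obvious extreme values.
s = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 250, -90], name="price")

print(s.describe())                 # summary statistics hint at extreme values

# Identify outliers with the interquartile range (IQR) rule.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print("outliers:\n", outliers)

# Treatment 1: quantile-based flooring and capping (here at the 5th/95th percentiles).
floor, cap = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=floor, upper=cap)

# Treatment 2: trimming, i.e. dropping the outlying observations.
trimmed = s[(s >= lower) & (s <= upper)]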
DATA CLEANING
Smoothing the Noisy Data:
Noise is a random error or variance in a measured variable. There are two ways of smoothing noisy data:

Binning Method:
Smooths a sorted data value by consulting the values around it. The sorted values are distributed into a number of buckets, or bins.

Regression Method:
Data can be smoothed by fitting the data to a function, such as a regression. Linear regression involves finding the best line to fit two attributes, so that one attribute can be used to predict the other.
Binning method: Example
 For example, consider a numerical attribute price; how can its noise be removed?
 Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34

Step 1: Partition the data into (equal-frequency) bins.
 Bin 1: 4, 8, 15   Bin 2: 21, 21, 24   Bin 3: 25, 28, 34

Step 2: Smooth each bin in one of 3 ways (means, medians or boundaries).

Step 3: Smoothing by bin means
 Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29

Step 4: Smoothing by bin boundaries
 Bin 1: 4, 4, 15   Bin 2: 21, 21, 24   Bin 3: 25, 25, 34
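The same example can be reproduced with a short, plain-Python sketch (equal-frequency bins of size 3 are assumed, as above):

# Equal-frequency binning of the price data from the example above.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]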
Data Integration and Transformation
 Data Integration: the merging of data from multiple data sources. These sources may include multiple databases, data cubes or flat files.

Issues in Data Integration:
 Schema integration and object matching are difficult. Ex: the entity identification problem.

 Redundancy is another issue. An attribute can be redundant if it is derived from some other attribute or set of attributes. Some redundancies can be detected by correlation analysis.

 Detection and resolution of data value conflicts is another important issue.
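As an illustration of correlation analysis for redundancy detection, here is a hedged pandas sketch in which the made-up attribute price_eur is derived from price_usd and is therefore redundant:

import pandas as pd

# Hypothetical merged data where 'price_eur' is just a rescaled copy of 'price_usd'.
df = pd.DataFrame({"price_usd": [10.0, 12.0, 9.0, 11.0, 13.0],
                   "qty":       [2, 3, 5, 4, 1]})
df["price_eur"] = df["price_usd"] * 0.92   # derived, hence redundant

# Pearson correlation between attribute pairs; a value near +/-1 flags redundancy.
corr = df.corr()
print(corr)   # price_usd vs. price_eur correlates at 1.0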
Data Transformation
 The data are transformed or consolidated into forms appropriate for mining.

Smoothing: Removes noise from the data. Binning, regression and clustering are used.

Aggregation: Summary or aggregation operations are applied to the data, e.g. when constructing a data cube for analysis of the data at multiple granularities.

Generalization: Low-level data are replaced by higher-level concepts through the use of concept hierarchies.

Normalization: The attribute values are scaled so as to fall within a small specified range (see the sketch after this list).

Attribute Construction: New attributes are constructed and added from the given set of attributes to help the mining process.


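A minimal sketch of the normalization step mentioned above, using min-max normalization (one common choice; the values are illustrative) to scale an attribute into [0, 1]:

import numpy as np

# Illustrative attribute values (assumed for the example).
x = np.array([4.0, 8.0, 15.0, 21.0, 24.0, 34.0])

# Min-max normalization: rescale the values to fall within [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)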
Data Reduction
 This technique is used to obtain a reduced representation of the data set.

 Mining on the reduced data set should still give effective results.

 Data Reduction Strategies:

1. Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube.

2. Attribute Subset Selection: Irrelevant, weakly relevant or redundant attributes or dimensions are detected and removed.

3. Dimensionality Reduction: Encoding mechanisms are used to reduce the data set size.
Data Reduction
4. Numerosity Reduction: The data are replaced or estimated by alternative, smaller data representations such as parametric or nonparametric models.

5. Discretization & Concept Hierarchy Generation: The raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is useful for the automatic generation of concept hierarchies.
Data Cube Aggregations
Attribute Subset Selection
The main goal of attribute subset selection is to find a minimum set of attributes.

Heuristic methods are used to select the best attributes and obtain near-optimal solutions.

Heuristic techniques:
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
Attribute Subset Selection
 Stepwise forward selection: Starts with an empty set of attributes as the reduced set.

 At each step, the best remaining attribute is found and added to the reduced set.

 The iteration is repeated until no remaining attribute improves the reduced set.

Example:
Initial attribute set: {a1, a2, a3, a4, a5, a6}
Initial reduced set: {}
 => {a1}
 => {a1, a4}
 => Reduced attribute set: {a1, a4, a6}
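A greedy forward-selection sketch in plain Python; the score_subset function and its per-attribute relevance scores are hypothetical stand-ins for a real model-based evaluation:

# A minimal greedy forward-selection sketch. The scoring function below is a
# hypothetical stand-in; in practice it would evaluate a model (e.g. by
# cross-validated accuracy) on the candidate attribute subset.
def score_subset(subset):
    # Hypothetical relevance scores for attributes a1..a6.
    relevance = {"a1": 0.9, "a2": 0.2, "a3": 0.1, "a4": 0.7, "a5": 0.3, "a6": 0.6}
    return sum(relevance[a] for a in subset)

attributes = ["a1", "a2", "a3", "a4", "a5", "a6"]
reduced = []

for _ in range(3):                       # stop after selecting 3 attributes
    candidates = [a for a in attributes if a not in reduced]
    best = max(candidates, key=lambda a: score_subset(reduced + [a]))
    reduced.append(best)

print(reduced)   # ['a1', 'a4', 'a6'] with these hypothetical scores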
Attribute Subset Selection
 Stepwise Backward Elimination: Starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

Example:
Initial attribute set: {a1, a2, a3, a4, a5, a6}
 => {a1, a3, a4, a5, a6}
 => {a1, a4, a5, a6}
 => Reduced attribute set: {a1, a4, a6}
Decision Tree induction
 It constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute.

 Each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.

 At each node, the algorithm chooses the best attribute to partition the data into individual classes; attributes that never appear in the tree are treated as irrelevant and excluded from the reduced set.
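One way to sketch this idea is with scikit-learn's DecisionTreeClassifier on synthetic data: attributes that the induced tree never tests on (zero importance) are dropped from the reduced set. The data, attribute names and depth limit below are assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # six synthetic attributes a1..a6
y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)   # class depends only on a1 and a4

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes that the tree actually tests on (nonzero importance) are kept.
names = ["a1", "a2", "a3", "a4", "a5", "a6"]
selected = [n for n, imp in zip(names, tree.feature_importances_) if imp > 0]
print(selected)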
Dimensionality Reduction
It is applied to obtain a compressed representation of the original data. There are two types of compression techniques:
 Lossy data compression
 Lossless data compression

Lossy method: The data can be reconstructed only as an approximation of the original data. Ex: PCA and wavelet transforms.

Lossless method: The data can be reconstructed from the compressed data without any loss of information.
Lossy Dimensionality Reduction
Discrete Wavelet Transform (DWT):
• The discrete wavelet transform (DWT) is a linear signal processing technique.

• It transforms a vector D into a numerically different vector D' of wavelet coefficients.

• The two vectors are of the same length. However, the DWT is useful for compression in the sense that the wavelet-transformed data can be truncated.

• A small, compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients, e.g., retaining all wavelet coefficients larger than some particular threshold and setting the remaining coefficients to 0.
DWT

• The resulting data representation is sparse. Computations that can take advantage of sparsity are very fast if performed in wavelet space.

• Given a set of coefficients, an approximation of the original data can be obtained by applying the inverse DWT.

• The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines.

• The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.
DWT
 The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.

2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

3. The two functions are applied to pairs of input data, resulting in two sets of data of length L/2. In general, these represent a smoothed, low-frequency version of the input data and its high-frequency content.

4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of length 2.

5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
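A simplified Haar-style pyramid transform in NumPy, following the procedure above (pairwise smoothing plus pairwise differencing, recursing on the smoothed half); this is an illustrative sketch, not a production DWT:

import numpy as np

def haar_dwt(data):
    # A simple Haar-style pyramid transform (illustrative sketch only).
    x = np.asarray(data, dtype=float)
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with zeros)"
    coeffs = []
    while len(x) > 2:
        smooth = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency (smoothed) half
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency (detail) half
        coeffs.append(detail)
        x = smooth                                  # recurse on the smoothed half
    coeffs.append(x)                                # final length-2 approximation
    return coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))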
Principal Components Analysis
Principal component analysis (PCA) reduces the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset to the maximum extent.

 This is done by transforming the variables to a new set of variables, known as the principal components (or simply, the PCs), which are orthogonal and ordered such that the retention of variation present in the original variables decreases as we move down the order.

 In this way, the 1st principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
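A hedged NumPy sketch of PCA via eigendecomposition of the covariance matrix (synthetic correlated data; in practice a library routine such as scikit-learn's PCA would normally be used):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.3, 1.0, 0.2],
                                          [0.1, 0.2, 0.5]])   # correlated attributes

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors = principal components

order = np.argsort(eigvals)[::-1]       # sort PCs by decreasing variance
components = eigvecs[:, order[:2]]      # keep the top 2 components
X_reduced = Xc @ components             # project onto the reduced space
print(X_reduced.shape)                  # (100, 2)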
Numerosity Reduction
Two kinds of methods can be used in this technique.
1. Parametric methods:
A model is used to estimate the data, so only the model parameters need to be stored instead of the actual data.
Ex: regression and log-linear models.

2. Non-parametric methods:
These store a reduced representation of the data.
Ex: histograms, clustering and sampling.
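A small sketch of the parametric idea: fit a linear regression to an attribute pair (synthetic data assumed) and store only the two fitted parameters instead of the raw values:

import numpy as np

# Synthetic attribute pair with a roughly linear relationship (assumed data).
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + np.random.default_rng(0).normal(scale=2.0, size=100)

# Fit a linear regression y ~ w*x + b and keep only the two parameters.
w, b = np.polyfit(x, y, deg=1)

# The 100 original y-values can now be approximated from just (w, b).
y_estimate = w * x + b
print(f"stored parameters: w={w:.2f}, b={b:.2f}")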
Data Discretization
This technique is used to reduce the number of values for a given continuous attribute.

The reduction is accomplished by dividing the range of the attribute into intervals.

The interval labels are then used to replace the actual data values.

i.e., replacing the numerous values of a continuous attribute by a small number of interval labels reduces the original data.
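A brief pandas sketch of discretization by interval labels; the ages, the three equal-width bins and the label names are assumptions for illustration:

import pandas as pd

ages = pd.Series([5, 13, 22, 29, 37, 45, 58, 63, 71])

# Divide the attribute range into equal-width intervals and replace each
# value with its interval label.
labels = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(labels.tolist())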
Discretization Technique
Supervised Discretization:
 The process is carried out using class information.

Unsupervised Discretization:
 The process is carried out without using class information.

Discretization can also be classified by the direction in which it proceeds (top-down or bottom-up):
 Top-Down: The process begins from one or a few split points, splits the entire attribute range, and repeats this recursively on the resulting intervals.

 Bottom-Up: It begins by considering all of the continuous values, then removes some by merging neighbouring values to form intervals. This process is repeated recursively on the resulting intervals.
Concept Hierarchy
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.

It is very useful for mining at multiple levels of abstraction.

It can be used to reduce the data by collecting and replacing low-level concepts with high-level concepts.

It can be applied to both numerical data and categorical data.