
DATA PREPROCESSING

Why preprocess the data?

What are the methods of Data Preprocessing?
 Descriptive Data Summarization
 Data Cleaning
 Data Integration and Transformation
 Data Reduction
 Discretization and Concept Hierarchy Generation
Why Preprocess the Data?
 Data is collected from multiple, heterogeneous sources.
 The data size is typically huge.
 The data may be inconsistent and noisy.
 Some values may be missing.
 To handle a high volume of data, the quality of the data needs to be checked.
 Low-quality data will affect the mining results.
 High-quality data used in DM systems leads to optimal DM results.
Data Preprocessing Techniques
Data preprocessing techniques are applied to improve the quality of the patterns mined and to reduce the time required for mining.

They are applied before the mining process is carried out.

Preprocessing Techniques:
 Descriptive Data Summarization
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
Descriptive Data Summarization
 Used to identify the typical properties of the data.
 Helps identify which data values in the dataset can be treated as noise or outliers.
 The central tendency and dispersion of the data are measured.
 Central Tendency: Mean, median and mode are the measures of central tendency. Measures are also categorized in DM systems by how efficiently they can be computed.
 Distributive Measure: The dataset is partitioned into smaller subsets and the measure is computed for each subset. The per-subset results are merged to obtain the measure for the whole dataset. The functions sum() and count() are distributive measures.
 Algebraic Measure: A measure that can be computed by applying an algebraic function to distributive measures, e.g. mean = sum() / count().
 Holistic Measure: A measure that must be computed on the whole dataset and cannot be obtained by merging partition results. Ex: median.
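A minimal Python sketch (with made-up values) of the distinction: sum() and count() can be computed per partition and merged, the mean follows algebraically from them, while the median must be computed on the whole dataset:

# Illustration: distributive/algebraic vs. holistic measures (values are made up).
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Split the dataset into two arbitrary partitions.
partitions = [values[:4], values[4:]]

# sum() and count (len) are distributive: partition results can be merged.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

# The mean is algebraic: a function (sum / count) of distributive measures.
mean = total / count
print("mean =", mean)

# The median is holistic: it needs the full, sorted dataset; partition medians
# cannot, in general, be merged into the true median.
ordered = sorted(values)
mid = len(ordered) // 2
median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
print("median =", median)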
Descriptive Data Summarization

Dispersion of Data: The degree to which numerical data tend to spread is known as the dispersion of the data.

Measures of Data Dispersion: Range, quartiles and standard deviation are the measures of data dispersion.
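A small illustrative sketch, assuming NumPy and made-up values, computing these dispersion measures:

import numpy as np

# Illustrative data (assumed for the example).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

value_range = data.max() - data.min()      # range
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
iqr = q3 - q1                              # interquartile range
std = data.std(ddof=1)                     # sample standard deviation

print(f"range={value_range}, Q1={q1}, Q3={q3}, IQR={iqr}, std={std:.2f}")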
Types of attributes
 Nominal:
 The values of a nominal attribute are just different names, i.e. nominal attributes provide only enough information to distinguish one object from another (=, ≠).
 Examples: zip codes, employee ID numbers.

 Ordinal:
 The values of an ordinal attribute provide enough information to order objects (<, >).
 Examples: hardness of minerals, street numbers.

 Interval:
 For interval attributes, the differences between values are meaningful, i.e. a unit of measurement exists (+, -).
 Examples: calendar dates, temperature in Celsius or Fahrenheit.

 Ratio:
 For ratio attributes, both differences and ratios are meaningful (*, /).
 Examples: temperature in Kelvin, counts, age.
DATA CLEANING
Data cleaning is the process of cleaning the data in the following ways:
 Fill in missing values
 Remove outliers
 Resolve data inconsistencies
 Smooth noisy data

Missing Values:
The missing values in a dataset need to be filled. Options include:
 Ignore the entire tuple in the dataset.
 Manually fill in the missing value.
 Use a global constant to fill in the missing value.
 Use the attribute mean to fill in the missing value.
 Use the most appropriate/probable value to fill in the missing value.
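A hedged sketch of some of these options using pandas; the data frame and its 'income' column are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data frame with missing values in the (made-up) 'income' column.
df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [50000, np.nan, 64000, np.nan, 58000]})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill with a global constant.
filled_const = df.fillna({"income": 0})

# Option 3: fill with the attribute mean.
filled_mean = df.fillna({"income": df["income"].mean()})

print(filled_mean)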
DATA CLEANING
Remove Outliers:
 Outliers are extreme values that fall a long way outside of the other observations.

 There can be many reasons for the presence of outliers in the data. Sometimes the outliers may be genuine, while in other cases they could exist because of data entry errors.

 It is important to understand the reasons for the outliers before cleaning them.
DATA CLEANING
Outliers can be found by running summary statistics on the variables. This can be done using the describe() function, which provides a statistical summary of all the quantitative variables.

Outlier identification methods:
 Identifying outliers with the interquartile range (IQR)
 Identifying outliers with skewness
 Identifying outliers with visualization

Outlier treatment:
 Quantile-based flooring and capping
 Trimming
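A possible sketch of these steps with pandas (the 'price' series and the 5th/95th-percentile caps are assumptions for illustration): describe() for the summary, the 1.5×IQR rule for identification, and flooring/capping or trimming for treatment:

import pandas as pd

# Hypothetical numeric attribute with two obvious extreme values.
s = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 250, -90], name="price")

print(s.describe())                 # summary statistics hint at extreme values

# Identify outliers with the interquartile range (IQR) rule.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print("outliers:\n", outliers)

# Treatment 1: quantile-based flooring and capping (here at the 5th/95th percentiles).
floor, cap = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=floor, upper=cap)

# Treatment 2: trimming, i.e. dropping the outlying observations.
trimmed = s[(s >= lower) & (s <= upper)]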
DATA CLEANING
Smoothing the Noisy Data:
Noise is a random error or variance in a measured variable. There are two ways of smoothing noisy data:

Binning Method:
Smooths a sorted data value by consulting the values around it. The sorted values are distributed into a number of buckets, or bins.

Regression Method:
Data can be smoothed by fitting the data to a function, such as a regression. Linear regression involves finding the best line to fit two attributes, so that one attribute can be used to predict the other.
Binning method: Example
 For example, consider a numerical attribute price; how can its noise be removed?
 Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34

Step 1: Partition the data into (equal-frequency) bins.
 Bin 1: 4, 8, 15   Bin 2: 21, 21, 24   Bin 3: 25, 28, 34

Step 2: Smooth each bin in one of 3 ways (means, medians or boundaries).

Step 3: Smoothing by bin means
 Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29

Step 4: Smoothing by bin boundaries
 Bin 1: 4, 4, 15   Bin 2: 21, 21, 24   Bin 3: 25, 25, 34
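The same example can be reproduced with a short, plain-Python sketch (equal-frequency bins of size 3 are assumed, as above):

# Equal-frequency binning of the price data from the example above.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]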
Data Integration and Transformation
 Data Integration: the merging of data from multiple data sources. These sources may include multiple databases, data cubes or flat files.

Issues in Data Integration:
 Schema integration and object matching are difficult. Ex: the entity identification problem.

 Redundancy is another issue. An attribute can be redundant if it is derived from some other attribute or set of attributes. Some redundancies can be detected by correlation analysis.

 Detection and resolution of data value conflicts is another important issue.
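As an illustration of correlation analysis for redundancy detection, here is a hedged pandas sketch in which the made-up attribute price_eur is derived from price_usd and is therefore redundant:

import pandas as pd

# Hypothetical merged data where 'price_eur' is just a rescaled copy of 'price_usd'.
df = pd.DataFrame({"price_usd": [10.0, 12.0, 9.0, 11.0, 13.0],
                   "qty":       [2, 3, 5, 4, 1]})
df["price_eur"] = df["price_usd"] * 0.92   # derived, hence redundant

# Pearson correlation between attribute pairs; a value near +/-1 flags redundancy.
corr = df.corr()
print(corr)   # price_usd vs. price_eur correlates at 1.0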
Data Transformation
 The data are transformed or consolidated into forms appropriate for mining.

Smoothing: Removes noise from the data. Binning, regression and clustering are used.

Aggregation: Summary or aggregation operations are applied to the data, e.g. when constructing a data cube for analysis of the data at multiple granularities.

Generalization: Low-level data are replaced by higher-level concepts through the use of concept hierarchies.

Normalization: The attribute values are scaled so as to fall within a small specified range (see the sketch after this list).

Attribute Construction: New attributes are constructed and added from the given set of attributes to help the mining process.


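A minimal sketch of the normalization step mentioned above, using min-max normalization (one common choice; the values are illustrative) to scale an attribute into [0, 1]:

import numpy as np

# Illustrative attribute values (assumed for the example).
x = np.array([4.0, 8.0, 15.0, 21.0, 24.0, 34.0])

# Min-max normalization: rescale the values to fall within [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)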
Data Reduction
 This technique is used to obtain a reduced representation of the data set.

 Mining on the reduced data set should still give effective results.

 Data Reduction Strategies:

1. Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube.

2. Attribute Subset Selection: Irrelevant, weakly relevant or redundant attributes or dimensions are detected and removed.

3. Dimensionality Reduction: Encoding mechanisms are used to reduce the data set size.
Data Reduction
4. Numerosity Reduction: The data are replaced or estimated by alternative, smaller data representations such as parametric or nonparametric models.

5. Discretization & Concept Hierarchy Generation: The raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is useful for the automatic generation of concept hierarchies.
Data Cube Aggregations
Attribute Subset Selection
The main goal of attribute subset selection is to find a minimum set of attributes.

Heuristic methods are used to select the best attributes and obtain near-optimal solutions.

Heuristic techniques:
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
Attribute Subset Selection
 Stepwise forward selection: Starts with an empty set of attributes as the reduced set.

 At each step, the best remaining attribute is found and added to the reduced set.

 The iteration is repeated until no remaining attribute improves the reduced set.

Example:
Initial attribute set: {a1, a2, a3, a4, a5, a6}
Initial reduced set: {}
 => {a1}
 => {a1, a4}
 => Reduced attribute set: {a1, a4, a6}
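A greedy forward-selection sketch in plain Python; the score_subset function and its per-attribute relevance scores are hypothetical stand-ins for a real model-based evaluation:

# A minimal greedy forward-selection sketch. The scoring function below is a
# hypothetical stand-in; in practice it would evaluate a model (e.g. by
# cross-validated accuracy) on the candidate attribute subset.
def score_subset(subset):
    # Hypothetical relevance scores for attributes a1..a6.
    relevance = {"a1": 0.9, "a2": 0.2, "a3": 0.1, "a4": 0.7, "a5": 0.3, "a6": 0.6}
    return sum(relevance[a] for a in subset)

attributes = ["a1", "a2", "a3", "a4", "a5", "a6"]
reduced = []

for _ in range(3):                       # stop after selecting 3 attributes
    candidates = [a for a in attributes if a not in reduced]
    best = max(candidates, key=lambda a: score_subset(reduced + [a]))
    reduced.append(best)

print(reduced)   # ['a1', 'a4', 'a6'] with these hypothetical scores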
Attribute Subset Selection
 Stepwise Backward Elimination: Starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

Example:
Initial attribute set: {a1, a2, a3, a4, a5, a6}
 => {a1, a3, a4, a5, a6}
 => {a1, a4, a5, a6}
 => Reduced attribute set: {a1, a4, a6}
Decision Tree induction
 It constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute.

 Each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.

 At each node, the algorithm chooses the best attribute to partition the data into individual classes; attributes that never appear in the tree are treated as irrelevant and excluded from the reduced set.
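One way to sketch this idea is with scikit-learn's DecisionTreeClassifier on synthetic data: attributes that the induced tree never tests on (zero importance) are dropped from the reduced set. The data, attribute names and depth limit below are assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # six synthetic attributes a1..a6
y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)   # class depends only on a1 and a4

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes that the tree actually tests on (nonzero importance) are kept.
names = ["a1", "a2", "a3", "a4", "a5", "a6"]
selected = [n for n, imp in zip(names, tree.feature_importances_) if imp > 0]
print(selected)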
Dimensionality Reduction
It is applied to obtain a compressed representation of the original data. There are two types of compression techniques:
 Lossy data compression
 Lossless data compression

Lossy method: The data can be reconstructed only as an approximation of the original data. Ex: PCA and wavelet transforms.

Lossless method: The data can be reconstructed from the compressed data without any loss of information.
Lossy Dimensionality Reduction
Discrete Wavelet Transform (DWT):
• The discrete wavelet transform (DWT) is a linear signal processing technique.

• It transforms a vector D into a numerically different vector D' of wavelet coefficients.

• The two vectors are of the same length. However, the DWT is useful for compression in the sense that the wavelet-transformed data can be truncated.

• A small, compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients, e.g., retaining all wavelet coefficients larger than some particular threshold and setting the remaining coefficients to 0.
DWT

• The resulting data representation is sparse. Computations that can take advantage of sparsity are very fast if performed in wavelet space.

• Given a set of coefficients, an approximation of the original data can be obtained by applying the inverse DWT.

• The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines.

• The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.
DWT
 The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.

2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

3. The two functions are applied to pairs of input data, resulting in two sets of data of length L/2. In general, these represent a smoothed, low-frequency version of the input data and its high-frequency content.

4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of length 2.

5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
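A simplified Haar-style pyramid transform in NumPy, following the procedure above (pairwise smoothing plus pairwise differencing, recursing on the smoothed half); this is an illustrative sketch, not a production DWT:

import numpy as np

def haar_dwt(data):
    # A simple Haar-style pyramid transform (illustrative sketch only).
    x = np.asarray(data, dtype=float)
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with zeros)"
    coeffs = []
    while len(x) > 2:
        smooth = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency (smoothed) half
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency (detail) half
        coeffs.append(detail)
        x = smooth                                  # recurse on the smoothed half
    coeffs.append(x)                                # final length-2 approximation
    return coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))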
Principal Components Analysis
Principal component analysis (PCA) reduces the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset to the maximum extent.

 This is done by transforming the variables to a new set of variables, known as the principal components (or simply, the PCs), which are orthogonal and ordered such that the retention of variation present in the original variables decreases as we move down the order.

 In this way, the 1st principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
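A hedged NumPy sketch of PCA via eigendecomposition of the covariance matrix (synthetic correlated data; in practice a library routine such as scikit-learn's PCA would normally be used):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.3, 1.0, 0.2],
                                          [0.1, 0.2, 0.5]])   # correlated attributes

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors = principal components

order = np.argsort(eigvals)[::-1]       # sort PCs by decreasing variance
components = eigvecs[:, order[:2]]      # keep the top 2 components
X_reduced = Xc @ components             # project onto the reduced space
print(X_reduced.shape)                  # (100, 2)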
Numerosity Reduction
Two kinds of methods can be used in this technique.
1. Parametric methods:
A model is used to estimate the data, so only the model parameters need to be stored instead of the actual data.
Ex: regression and log-linear models.

2. Non-parametric methods:
These store a reduced representation of the data.
Ex: histograms, clustering and sampling.
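A small sketch of the parametric idea: fit a linear regression to an attribute pair (synthetic data assumed) and store only the two fitted parameters instead of the raw values:

import numpy as np

# Synthetic attribute pair with a roughly linear relationship (assumed data).
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + np.random.default_rng(0).normal(scale=2.0, size=100)

# Fit a linear regression y ~ w*x + b and keep only the two parameters.
w, b = np.polyfit(x, y, deg=1)

# The 100 original y-values can now be approximated from just (w, b).
y_estimate = w * x + b
print(f"stored parameters: w={w:.2f}, b={b:.2f}")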
Data Discretization
This technique is used to reduce the number of values for a given continuous attribute.

The reduction is accomplished by dividing the range of the attribute into intervals.

The interval labels are then used to replace the actual data values.

i.e., replacing the numerous values of a continuous attribute by a small number of interval labels reduces the original data.
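A brief pandas sketch of discretization by interval labels; the ages, the three equal-width bins and the label names are assumptions for illustration:

import pandas as pd

ages = pd.Series([5, 13, 22, 29, 37, 45, 58, 63, 71])

# Divide the attribute range into equal-width intervals and replace each
# value with its interval label.
labels = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(labels.tolist())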
Discretization Technique
Supervised Discretization:
 The process is carried out using class information.

Unsupervised Discretization:
 The process is carried out without using class information.

Discretization can also be classified by the direction in which it proceeds (top-down or bottom-up):
 Top-Down: The process begins from one or a few split points, splits the entire attribute range, and repeats this recursively on the resulting intervals.

 Bottom-Up: It begins by considering all of the continuous values, then removes some by merging neighbouring values to form intervals. This process is repeated recursively on the resulting intervals.
Concept Hierarchy
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.

It is very useful for mining at multiple levels of abstraction.

It can be used to reduce the data by collecting and replacing low-level concepts with high-level concepts.

It can be applied to both numerical data and categorical data.