Data Preprocessing
Preprocessing Techniques:
Descriptive Data Summarization
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Descriptive Data Summarization
To identify the typical properties of the data.
To identify which data values in a dataset can be treated as noise or
outliers.
Both the central tendency and the dispersion of the data are measured.
Central Tendency: Mean, median, and mode are the measures of
central tendency. Data mining systems categorize these measures by
how efficiently they can be computed.
Distributive Measure: The dataset is partitioned into smaller
subsets and the measure is computed for each subset. The partial
results are then merged to obtain the measure for the whole
dataset.
The functions sum() and count() are distributive measures.
Algebraic Measure: Computed by applying an algebraic function to
one or more distributive measures, e.g. average = sum()/count().
Holistic Measure: Must be computed on the entire dataset at once
and cannot be assembled from partial results. Ex: Median.
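As a sketch of the distinction, merging partial sum()/count() results (and the need for the whole dataset when computing the median) can be shown in a few lines of Python; the data values and partition sizes here are illustrative:

```python
# Sketch: distributive measures (sum, count) can be merged across
# partitions, so the mean (an algebraic measure) follows from them;
# the median (holistic) needs the whole dataset at once.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Partition the data and compute sum()/count() per subset.
subsets = [data[0:3], data[3:6], data[6:9]]
partial = [(sum(s), len(s)) for s in subsets]

# Merging the partial results reproduces the global measures.
total_sum = sum(p[0] for p in partial)
total_count = sum(p[1] for p in partial)
mean = total_sum / total_count          # algebraic: sum / count

# The median cannot be merged from subset medians; it needs all values.
ordered = sorted(data)
median = ordered[len(ordered) // 2]
```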
Descriptive Data Summarization
Ordinal:
The values of an ordinal attribute provide enough information to order
objects (<, >).
Examples: Hardness of minerals, street numbers.
Interval:
For interval attributes, the differences between values are meaningful, i.e.
a unit of measurement exists (+, -).
Examples: Calendar dates, Temperature in Celsius or Fahrenheit.
Ratio:
For ratio variables, both differences and ratios are meaningful (*, /).
Examples: Temperature in Kelvin, counts, age.
DATA CLEANING
Data cleaning is the process of cleaning data in the following
ways:
Fill in missing values
Remove outliers
Resolve data inconsistencies
Smooth noisy data
Missing Values:
Missing values in a dataset can be handled in several ways:
Ignore the entire tuple.
Fill in the missing value manually.
Use a global constant to fill in the missing value.
Use the attribute mean to fill in the missing value.
Use the most probable value to fill in the missing value.
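For instance, filling missing values with the attribute mean can be sketched as follows (the price values and the use of None to mark missing entries are illustrative):

```python
# Sketch: filling missing values (None) with the attribute mean.
# The attribute "price" and its values are illustrative.
prices = [4, 8, None, 21, 21, None, 25, 28, 34]

observed = [v for v in prices if v is not None]
attr_mean = sum(observed) / len(observed)   # mean of the known values

# Fill with the attribute mean (a global constant or the most
# probable value could be substituted here instead).
filled = [v if v is not None else attr_mean for v in prices]
```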
Remove Outliers:
Outliers are extreme values that fall well outside the range of
the other observations.
Outlier treatment:
Quantile-based Flooring and Capping
Trimming
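A minimal sketch of both treatments, assuming illustrative 10th/90th-percentile thresholds and a simple nearest-rank percentile:

```python
# Sketch: quantile-based flooring/capping and trimming. The 10th/90th
# percentile thresholds are illustrative choices, not from the source.
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy of the data."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p * (len(s) - 1))))
    return s[k]

data = [2, 4, 8, 15, 21, 21, 24, 25, 28, 34, 120]  # 120 is an outlier

floor = percentile(data, 0.10)
cap = percentile(data, 0.90)

# Capping/flooring: clamp extreme values to the chosen quantiles.
capped = [min(max(v, floor), cap) for v in data]

# Trimming: drop values outside the quantile range instead.
trimmed = [v for v in data if floor <= v <= cap]
```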
Smoothing the Noisy Data:
Noise is a random error or variance in a measured variable.
There are two ways of smoothing noisy data.
Binning Method:
It smooths a sorted data value by consulting the values around it.
The sorted values are distributed into a number of buckets, or
bins.
Regression Method:
Data can be smoothed by fitting the data to a function, such as a
regression.
Linear regression involves finding the best line to fit two
attributes, so that one attribute can be used to predict the other.
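The least-squares line fit behind linear-regression smoothing can be sketched as follows (the sample points are illustrative):

```python
# Sketch: smoothing by regression -- fit the least-squares line
# y = a + b*x between two attributes, then use the fitted values.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # illustrative, roughly y = 2x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Closed-form least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Replace each observed y by the fitted (smoothed) value on the line.
smoothed = [a + b * x for x in xs]
```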
Binning method: Example
For example, take the numerical attribute price and see how the
noise can be removed.
Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
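The example above can be sketched in Python, using equal-frequency bins of depth 3 and showing smoothing by bin means and by bin boundaries:

```python
# Sketch: binning the price values from the example into
# equal-frequency (equi-depth) bins of size 3.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
# bins -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: replace each value by its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer
# boundary (min or max) of its bin.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```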
Example: reducing an attribute set step by step
=> {a1, a3, a4, a5, a6}
=> {a1, a4, a5, a6}
1. The length, L, of the input data vector must be an integer power
of 2 (the vector can be padded with zeros when necessary).
2. Each transform involves applying two functions. The first applies some
data smoothing, such as a sum or weighted average. The second
performs a weighted difference, which acts to bring out the detailed
features of the data.
3. The two functions are applied to pairs of input data, resulting in two
sets of data of length L/2. In general, these represent a smoothed or low-
frequency version of the input data and its high-frequency content.
4. The two functions are recursively applied to the sets of data obtained in
the previous loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the above iterations
are designated the wavelet coefficients of the transformed data.
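The steps above can be sketched with a Haar-style transform; the choice of plain pairwise averages and differences (with normalisation factors omitted) is a simplifying assumption:

```python
# Sketch: Haar-style wavelet transform following the steps above.
# Pairwise averages give the smoothed low-frequency half, pairwise
# differences the high-frequency detail, applied recursively until
# the smoothed part has length 2. Normalisation is omitted.
def haar(data):
    # The length of data is assumed to be an integer power of 2 (step 1).
    details = []
    while len(data) > 2:
        pairs = list(zip(data[0::2], data[1::2]))
        smooth = [(a + b) / 2 for a, b in pairs]   # low-frequency half
        detail = [(a - b) / 2 for a, b in pairs]   # high-frequency half
        details = detail + details                 # collect coefficients
        data = smooth                              # recurse on smoothed half
    return data + details                          # wavelet coefficients

coeffs = haar([4, 8, 15, 21, 21, 24, 25, 28])
```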
Principal Components Analysis
Principal component analysis (PCA) reduces the dimensionality of a
data set consisting of many variables that are correlated with each
other, either heavily or lightly, while retaining as much of the
variation present in the dataset as possible.
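A minimal PCA sketch via eigen-decomposition of the covariance matrix, run on illustrative random data with one deliberately correlated column:

```python
# Sketch: PCA by eigen-decomposition of the covariance matrix,
# projecting 3-dimensional points onto the top-k components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # correlated column

Xc = X - X.mean(axis=0)                  # centre each attribute
cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues

order = np.argsort(eigvals)[::-1]        # sort components by variance
components = eigvecs[:, order]

k = 2                                    # keep the top-k components
reduced = Xc @ components[:, :k]         # (100, 2) reduced data
```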
2. Non-Parametric methods:
These methods are used for storing a reduced representation of the
data, e.g. histograms, clustering, and sampling.
Data Discretization
These techniques are used to reduce the number of
values for a given continuous attribute.
Unsupervised Discretization:
The process is classified by the direction in which it proceeds
(i.e. top-down or bottom-up).
Top-Down: The process begins with one or a few split points,
splits the entire attribute range, and repeats this recursively
on the resulting intervals.
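A toy sketch of top-down splitting, assuming a simple midpoint split and an illustrative maximum interval width as the stopping criterion (real methods choose split points by a quality measure):

```python
# Sketch: top-down (splitting) discretization. Recursively split the
# attribute range at its midpoint until intervals are narrow enough.
# The midpoint rule and width threshold are illustrative assumptions.
def split(lo, hi, max_width):
    if hi - lo <= max_width:
        return [(lo, hi)]            # interval is narrow enough
    mid = (lo + hi) / 2              # choose a split point
    return split(lo, mid, max_width) + split(mid, hi, max_width)

intervals = split(0.0, 100.0, 25.0)
```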