Data Preprocessing
G.A.Putri Saptawati
The need for data preprocessing
Problems with huge real-world databases
Incomplete data: missing values
Noisy data
Inconsistent data
These problems influence the data mining process, especially the patterns mined
Techniques
Data cleaning
Data integration
Data transformation
Data reduction
Improve the quality of the patterns mined
and/or reduce the time required for the actual mining
Data Cleaning: Missing values
Tuples have no recorded value for several attributes
Ignore the tuple
Fill in the missing value
Using a global constant
Using values derived from the data: the attribute mean or the most probable value
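The options above can be sketched in a few lines of Python. This is only an illustration: the function names, the None marker for a missing value, and the sample income and customer records are assumptions made for the example, not part of the slides. Filling with the "most probable value" normally requires an inference model (for example regression or a decision tree) and is not shown here.

```python
from statistics import mean

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing (None) attribute."""
    return [row for row in rows if None not in row.values()]

def fill_with_constant(values, constant="Unknown"):
    """Fill every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Fill every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

# Hypothetical sample data, used only to exercise the functions.
incomes = [30, None, 50, 40]
customers = [
    {"name": "A", "income": 30},
    {"name": "B", "income": None},
]

filled_incomes = fill_with_mean(incomes)
complete_customers = drop_incomplete(customers)
```

Ignoring tuples is the simplest choice but discards the other attribute values in the row, so it becomes wasteful when many tuples have gaps.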
Data Cleaning: Noisy data
Noise: random error or variance in a measured variable
Binning
Smooth sorted data values by consulting their neighborhood (local smoothing)
Clustering
Detect outliers by grouping similar values
Regression
Smooth data by fitting them to a function
Linear regression, multiple linear regression
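As an illustration of binning, the sketch below performs smoothing by bin means: the values are sorted, split into equal-depth bins, and each value is replaced by the mean of its bin. The function name, the bin depth, and the sample values are assumptions chosen for the example; other variants (smoothing by bin medians or bin boundaries) follow the same pattern.

```python
def smooth_by_bin_means(values, depth):
    """Sort the values, partition them into bins of `depth` items,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample data: nine values, three per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, depth=3)
# Bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] with means 9, 22, 29
```

Because each value is smoothed using only the other values in its own bin, the effect is local: an extreme value distorts one bin, not the whole attribute.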
Data Integration
Combine data from multiple sources into a coherent data store
Schema integration: entity identification problem
Redundancy: detected by correlation analysis
Detection and resolution of data value conflicts:
semantic heterogeneity and different representations
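Correlation analysis for redundancy can be sketched with the Pearson correlation coefficient: a value near +1 or -1 between two numeric attributes suggests that one may be derivable from the other. The function name, the two attributes (a price stored in dollars in one source and in cents in another), and the 0.95 threshold are assumptions for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical attributes from two sources holding the same information.
price_usd = [1.0, 2.5, 4.0, 10.0]
price_cents = [100, 250, 400, 1000]

r = pearson(price_usd, price_cents)
is_redundant = abs(r) > 0.95  # the threshold is an arbitrary choice
```

A high correlation is only a hint of redundancy; whether one attribute can be dropped still depends on what each one means in its source.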
Data Transformation
Data are transformed or consolidated into forms appropriate for mining
Involves:
Smoothing
Aggregation
Generalisation
Normalisation
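Of the operations listed, normalisation is the easiest to show concretely. The sketch below gives two common variants, min-max scaling and z-score scaling; the function names, the default target range, and the sample values are assumptions for the example.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max].
    Assumes the values are not all equal."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)
    return [(v - low) * scale + new_min for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation.
    Assumes the values are not all equal."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data.
ages = [10, 20, 30]
scaled_ages = min_max(ages)
standardised_ages = z_score(ages)
```

Min-max scaling preserves the original ordering and relative distances but is sensitive to outliers, since a single extreme value sets the range; z-score scaling is often preferred when the minimum and maximum are unknown or unreliable.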
Data Reduction
Obtain a reduced representation of the data set that is
much smaller in volume, while maintaining the
integrity of the original data.
Strategies:
Data cube aggregation
Dimension reduction
Data compression
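As a small illustration of the first strategy, data cube aggregation, the sketch below rolls quarterly sales up to yearly totals, so that eight rows are represented by two. The function name, the field names, and the figures are hypothetical and serve only to show the idea.

```python
from collections import defaultdict

def roll_up(rows, group_key, measure):
    """Aggregate `measure` over all rows that share the same `group_key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[measure]
    return dict(totals)

# Hypothetical quarterly sales records.
quarterly_sales = [
    {"year": 2022, "quarter": "Q1", "sales": 100.0},
    {"year": 2022, "quarter": "Q2", "sales": 150.0},
    {"year": 2022, "quarter": "Q3", "sales": 200.0},
    {"year": 2022, "quarter": "Q4", "sales": 250.0},
    {"year": 2023, "quarter": "Q1", "sales": 120.0},
    {"year": 2023, "quarter": "Q2", "sales": 180.0},
    {"year": 2023, "quarter": "Q3", "sales": 220.0},
    {"year": 2023, "quarter": "Q4", "sales": 280.0},
]

yearly_sales = roll_up(quarterly_sales, group_key="year", measure="sales")
```

The aggregated data are smaller yet still sufficient for any analysis that only needs yearly figures; the quarterly detail is lost, so the level of aggregation should match the mining task.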