Normalization
Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtain a reduced representation of the data that is much smaller in volume
yet produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
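Data Transformation by Normalization
Min-max normalization maps a value v of attribute A to v' in the range
[new_min_A, new_max_A]:
$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$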
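The min-max formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `min_max_normalize`, the default target range [0, 1], and the sample income values are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min_A maps to new_min and max_A to new_max."""
    min_a, max_a = min(values), max(values)
    if max_a == min_a:
        # The formula divides by (max_A - min_A), so a constant attribute has no valid scaling.
        raise ValueError("min-max normalization is undefined when all values are equal")
    scale = (new_max - new_min) / (max_a - min_a)
    return [(v - min_a) * scale + new_min for v in values]


# Hypothetical income values, chosen only for illustration.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print([round(x, 3) for x in normalized])
```

Because the mapping depends on the observed minimum and maximum, a later value outside that range would fall outside [new_min_A, new_max_A].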
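Data Transformation by Normalization
Z-score normalization rescales a value v of attribute A using the mean and
standard deviation of A:
$$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$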
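A small Python sketch of z-score normalization follows. It assumes the population standard deviation (`statistics.pstdev`); the slides do not say whether the population or sample form is intended, and the function name and sample values are illustrative only.

```python
import statistics


def z_score_normalize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean_a = statistics.fmean(values)
    # Assumption: population standard deviation; use statistics.stdev for the sample form.
    stand_dev_a = statistics.pstdev(values)
    if stand_dev_a == 0:
        raise ValueError("z-score normalization is undefined when all values are equal")
    return [(v - mean_a) / stand_dev_a for v in values]


# Hypothetical values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_score_normalize(values)
print(z_scores)
```

Unlike min-max normalization, this does not need the minimum and maximum of A, which is convenient when they are unknown or distorted by outliers.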
Discretization
The raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
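As one simple way to divide a continuous range into intervals, the sketch below uses equal-width binning and replaces each value with an interval label. Equal-width binning is an assumption chosen for illustration (the slides do not prescribe a method), and the function name and age values are hypothetical.

```python
def equal_width_labels(values, n_bins):
    """Replace each numeric value with the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a non-constant attribute")
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        left, right = lo + idx * width, lo + (idx + 1) * width
        closing = "]" if idx == n_bins - 1 else ")"  # last interval includes its upper end
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels


# Hypothetical ages, split into three intervals of width 20.
ages = [3, 15, 22, 37, 48, 63]
age_labels = equal_width_labels(ages, n_bins=3)
print(age_labels)
```

Six distinct ages collapse into three labels, which is the data-size reduction mentioned above, and the result is a categorical attribute that a classifier restricted to categorical inputs can consume.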
Discretization
Discretization techniques can be categorized based on
how the discretization is performed, such as whether it
uses class information or which direction it proceeds
(i.e., top-down vs. bottom-up).
If the discretization process uses class information, then
we say it is supervised discretization.
If the process starts by first finding one or a few points
to split the entire attribute range, and then repeats this
recursively on the resulting intervals, it is called
top-down discretization or splitting.
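A supervised, top-down procedure can be sketched as a recursive binary split. The sketch below picks each split point by minimizing the class entropy of the two resulting intervals, which is one common criterion but an assumption here. The fixed `max_depth` limit stands in for a proper stopping rule, and all names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return the cut point with the lowest weighted class entropy, or None."""
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or cost < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, cost)
    return None if best is None else best[0]


def split_top_down(pairs, max_depth=2):
    """Recursively split (value, class) pairs; returns the sorted cut points."""
    pairs = sorted(pairs)
    if max_depth == 0 or entropy([c for _, c in pairs]) == 0:
        return []  # depth limit reached, or the interval is already pure
    cut = best_split(pairs)
    if cut is None:
        return []
    left = [p for p in pairs if p[0] < cut]
    right = [p for p in pairs if p[0] >= cut]
    return split_top_down(left, max_depth - 1) + [cut] + split_top_down(right, max_depth - 1)


# Hypothetical (age, class) data.
data = [(18, "no"), (22, "no"), (25, "no"), (31, "yes"),
        (35, "yes"), (40, "yes"), (52, "no"), (60, "no")]
cuts = split_top_down(data)
print(cuts)
```

The first call splits the whole range once, and each recursive call repeats the search inside one of the resulting intervals, which is the "split, then recurse" pattern described above.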
Discretization
This contrasts with bottom-up discretization or
merging, which starts by considering all of the
continuous values as potential split-points, removes
some by merging neighboring values to form intervals,
and then recursively applies this process to the resulting
intervals.
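Merging can be illustrated with a deliberately simple, unsupervised sketch: every distinct value starts as its own interval, and the two adjacent intervals separated by the smallest gap are merged until a target count is reached. The smallest-gap criterion is an assumption made for brevity; supervised methods such as ChiMerge instead decide which neighbors to merge with a chi-square test on class counts. Names and data are hypothetical.

```python
def merge_bottom_up(values, n_intervals):
    """Merge adjacent intervals, closest pair first, until n_intervals remain."""
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    # Start with one interval [v, v] per distinct value: every gap is a potential split point.
    intervals = [[v, v] for v in sorted(set(values))]
    while len(intervals) > n_intervals:
        # Gap between the end of each interval and the start of the next one.
        gaps = [intervals[i + 1][0] - intervals[i][1] for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        # Removing the split point between i and i + 1 merges the two intervals.
        intervals[i][1] = intervals[i + 1][1]
        del intervals[i + 1]
    return [tuple(iv) for iv in intervals]


# Hypothetical values forming three visible clusters.
values = [1, 2, 3, 10, 11, 12, 30, 31]
merged = merge_bottom_up(values, n_intervals=3)
print(merged)
```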
Data Reduction Strategies
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
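Replacing low-level values with higher-level concepts amounts to a lookup from value ranges to labels. The sketch below uses the concept names from the example above; the cut-off ages (30 and 60) are hypothetical and would normally come from domain knowledge.

```python
from bisect import bisect_right
from collections import Counter

# Assumption: illustrative boundaries, not taken from the source.
AGE_CUTS = [30, 60]
AGE_CONCEPTS = ["young", "middle-aged", "senior"]


def age_to_concept(age):
    """Map a numeric age to its higher-level concept."""
    # bisect_right counts how many cut points are <= age, which is the concept's index.
    return AGE_CONCEPTS[bisect_right(AGE_CUTS, age)]


ages = [23, 30, 45, 71]
concepts = [age_to_concept(a) for a in ages]
print(concepts)
print(Counter(concepts))  # the data now takes at most three distinct values
```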
Discretization for numeric data