DATA TRANSFORMATION & DATA DISCRETIZATION
DATA TRANSFORMATION STRATEGIES
– Smoothing :
• To remove noise.
• Techniques : binning, regression, and clustering.
– Attribute Construction :
• Also known as feature construction.
• New attributes are constructed from given attributes.
– Aggregation
• Summary or aggregation operations applied to data.
• Typically used in constructing data cube for data analysis at multiple abstraction levels.
– Normalization :
• Attribute data are scaled so as to fall within a smaller range.
– Discretization :
• Raw values of a numeric attribute are replaced by interval labels (e.g., 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior).
– Concept hierarchy generation for nominal data :
• Attributes can be generalized to higher-level concepts.
DATA DISCRETIZATION
Based on which direction it proceeds :
– Top-down (splitting) - first finds one or a few points to split the entire attribute range,
and then repeats this recursively on the resulting intervals.
– Bottom-up (merging) - starts by considering all of the continuous values as potential
split points, removes some by merging neighboring values to form intervals, and then
recursively applies this process to the resulting intervals.
Whether class information is used :
– Supervised - The discretization process uses class information
– Unsupervised - The discretization process does not use class information
Data Transformation by Normalization
– To help avoid dependence on the choice of measurement units, the data should
be normalized or standardized.
– The data should fall within a smaller or common range such as [−1,1] or [0.0,
1.0].
– It gives all attributes an equal weight.
Methods of Data Normalization
1. Min-max normalization
2. Z-score normalization
3. Normalization by decimal scaling.
Min-max normalization
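– Suppose that min_A and max_A are the minimum and maximum values of an
attribute, A. Min-max normalization maps a value, v_i, of A to v_i' in the range
[new_min_A, new_max_A] by computing:
\[ v_i' = \frac{v_i - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]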
Example
Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively. We would like to map income to the range
[0.0,1.0].
By min-max normalization, a value of $73,600 for income is transformed to:
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.716 \]
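The income example can be checked with a short Python sketch. The function name `min_max_normalize` and its parameter names are illustrative choices, not part of the slides; the guard against a zero-width range is an added assumption.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # Assumption: a constant attribute cannot be rescaled this way.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: min = 12,000, max = 98,000, target range [0.0, 1.0].
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Note that a later value outside [min_A, max_A] would map outside the target range, so the minimum and maximum should cover all data that will be normalized.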
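Z-score normalization (or zero-mean normalization)
– The values for an attribute, A, are normalized based on the mean (i.e., average)
and standard deviation of A. A value, v_i, of A is normalized to v_i' by computing:
\[ v_i' = \frac{v_i - \bar{A}}{\sigma_A} \]
where \bar{A} and \sigma_A are the mean and standard deviation of A, respectively.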
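– A variation of z-score normalization replaces the standard deviation with the
mean absolute deviation of A, s_A, where \bar{A} is the mean of A:
\[ s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right), \qquad v_i' = \frac{v_i - \bar{A}}{s_A} \]
– Because the deviations from the mean are not squared, the effect of outliers is
reduced compared with using the standard deviation.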
Example
Suppose that the mean and standard deviation of the values for the attribute
income are $54,000 and $16,000, respectively.
With z-score normalization, a value of $73,600 for income is transformed to:
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
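A minimal Python sketch of both z-score variants follows. The function names are hypothetical, and the helper for the mean absolute deviation assumes a non-empty list whose values are not all identical.

```python
def z_score(v, mean_a, std_a):
    """Standard z-score: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def mean_absolute_deviation(values):
    """s_A = (1/n) * sum(|v_i - mean|)."""
    mean_a = sum(values) / len(values)
    return sum(abs(v - mean_a) for v in values) / len(values)


def z_score_mad(values):
    """Z-score variant that divides by the mean absolute deviation."""
    mean_a = sum(values) / len(values)
    s_a = mean_absolute_deviation(values)
    return [(v - mean_a) / s_a for v in values]


# Income example: mean = 54,000, standard deviation = 16,000.
print(z_score(73_600, 54_000, 16_000))  # 1.225
```

Z-score normalization is useful when the actual minimum and maximum of A are unknown, or when outliers would dominate min-max normalization.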
Decimal Scaling
– Normalizes by moving the decimal point of the values of attribute A.
– The number of decimal places moved depends on the maximum absolute value
of A. A value, v_i, of A is normalized to v_i' by computing:
\[ v_i' = \frac{v_i}{10^j} \]
where j is the smallest integer such that \max(|v_i'|) < 1.
Example
Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986.
To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3)
so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
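The search for j can be sketched in Python as below. The function name `decimal_scaling` is illustrative, and the sketch assumes j is non-negative, so values that already lie within (−1, 1) are left unscaled.

```python
def decimal_scaling(values):
    """Return (scaled_values, j) with j the smallest non-negative integer
    such that max(|v / 10**j|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Example from the text: values range from -986 to 917.
scaled, j = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```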
Discretization By Binning
– It is a top-down splitting technique based on a specified number of bins.
– It is an unsupervised discretization technique, because it does not use class information.
– Used for data reduction and concept hierarchy generation.
– Attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively.
– Sensitive to the user-specified number of bins and to the presence of outliers.
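The binning steps above can be sketched in Python. The function names and the sample `prices` list are illustrative and not taken from the slides; the equal-frequency version assumes the values divide roughly evenly into k bins.

```python
def equal_width_bins(values, k):
    """Partition values into k bins that each span the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # Assumption: all values identical, so a single bin holds them all.
        return [sorted(values)]
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)  # maximum goes in last bin
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Partition sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
freq_bins = equal_frequency_bins(prices, 3)
print(freq_bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(freq_bins))  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(equal_width_bins(prices, 3))     # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
```

Equal-width binning puts four of the nine values in the last bin here, which illustrates how the choice of method and the spread of the data affect the resulting intervals.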