Data Mining - Discretization
Discretization:
Types of attributes:
Nominal: values from an unordered set, e.g.,
color, profession
Ordinal: values from an ordered set, e.g.,
military or academic rank
Continuous: numeric values, e.g., integer or real
numbers
Discretization:
Divide the range of a continuous attribute into
intervals
Reduce data size, since many distinct values
collapse into a few intervals
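As a minimal sketch of dividing a range into intervals, the snippet below uses equal-width binning, which is only one possible unsupervised scheme and is chosen here as an assumption. The function name `equal_width_bins` and the sample `ages` list are hypothetical, made up for illustration.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] of `values` into k equal-width intervals.

    Returns (edges, labels): the k+1 interval boundaries and, for each
    input value, the index (0..k-1) of the interval it falls into.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    labels = []
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        # If all values are equal (width == 0), everything goes into bin 0.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        labels.append(idx)
    return edges, labels


# Hypothetical sample data: 14 ages reduced to 3 interval indices.
ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70]
edges, labels = equal_width_bins(ages, 3)
print(edges)   # [13.0, 32.0, 51.0, 70.0]
print(labels)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

Fourteen distinct values are reduced to three interval indices, which is the data-size reduction described above.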
Discretization and Concept Hierarchy:
Discretization
Reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals
Interval labels can then be used to replace actual
data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an
attribute
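The split (top-down), recursive variant can be sketched as follows. Splitting each range at its midpoint is an assumption made for illustration; a real method would pick boundaries with some unsupervised or supervised criterion. The names `split_recursively` and `interval_label` and the sample data are hypothetical.

```python
import bisect


def split_recursively(values, depth):
    """Top-down discretization: split the value range, then recurse on each half.

    Returns a sorted list of cut points. The midpoint rule is an
    illustrative assumption, not a prescribed splitting criterion.
    """
    lo, hi = min(values), max(values)
    if depth == 0 or lo == hi:
        return []
    mid = (lo + hi) / 2
    left = [v for v in values if v < mid]
    right = [v for v in values if v >= mid]
    # Recurse on both sides so each interval is refined independently.
    return split_recursively(left, depth - 1) + [mid] + split_recursively(right, depth - 1)


def interval_label(value, cuts):
    """Replace an actual data value by the label of its half-open interval."""
    i = bisect.bisect_right(cuts, value)
    lower = "-inf" if i == 0 else cuts[i - 1]
    upper = "+inf" if i == len(cuts) else cuts[i]
    return f"[{lower}, {upper})"


data = [1, 2, 3, 4, 10, 20, 30, 40]  # hypothetical sample values
cuts = split_recursively(data, depth=2)
print(cuts)                                    # [10.5, 20.5, 35.0]
print([interval_label(v, cuts) for v in data])
```

A merge (bottom-up) method would go the other way: start from many small intervals and repeatedly combine neighbours.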
Concept hierarchy:
Concept hierarchy formation
Recursively reduce the data by collecting and
replacing low level concepts (such as numeric
values for age) by higher level concepts (such as …)
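Cluster analysis (unsupervised):
Clusters form the nodes of a concept hierarchy
A cluster can be decomposed into sub-clusters to
form a lower level of the hierarchy
Clusters can be combined into larger clusters to
form a higher level of the hierarchy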
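To illustrate how clusters can become the levels of a hierarchy, here is a small bottom-up sketch for one-dimensional values. It merges the two neighbouring clusters with the smallest gap at each step (a single-linkage style rule, chosen here as an assumption); the function name `merge_hierarchy` and the sample values are hypothetical.

```python
def merge_hierarchy(values):
    """Bottom-up clustering of 1-D values.

    Starts with one cluster per distinct value and repeatedly merges the two
    adjacent clusters whose gap is smallest. Returns the list of levels,
    from the lowest (singletons) to the highest (a single cluster).
    """
    clusters = [[v] for v in sorted(set(values))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Gap between neighbours: first value of the right cluster
        # minus last value of the left cluster.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters = clusters[:i] + [clusters[i] + clusters[i + 1]] + clusters[i + 2:]
        levels.append(list(clusters))
    return levels


for level in merge_hierarchy([1, 2, 10, 11, 30]):
    print(level)
# [[1], [2], [10], [11], [30]]
# [[1, 2], [10], [11], [30]]
# [[1, 2], [10, 11], [30]]
# [[1, 2, 10, 11], [30]]
# [[1, 2, 10, 11, 30]]
```

Reading the output top to bottom moves from lower to higher levels of the hierarchy; reading it bottom to top corresponds to decomposing clusters.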
Entropy-Based Discretization:
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the expected
information requirement after partitioning is
$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
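A minimal sketch of choosing the boundary T, assuming the expected information is the size-weighted sum of the class entropies of S1 and S2 and that the best T is the one minimizing it. Candidate boundaries are taken as midpoints between consecutive distinct values, which is a common convention rather than something stated above. The names `entropy` and `best_boundary` and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, info): the boundary minimizing the expected information.

    info = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary fits between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        s1 = [c for _, c in pairs[:i]]  # samples with value < T
        s2 = [c for _, c in pairs[i:]]  # samples with value > T
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info


# Hypothetical samples: the classes separate cleanly between 3 and 10.
print(best_boundary([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
# (6.5, 0.0)
```

Because this criterion uses class labels, it is a supervised, top-down method; applying `best_boundary` again inside each resulting interval would give the recursive form.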
ordering
Automatically inferring the hierarchy
Heuristic rule