Data Discretization and Concept Hierarchy Generation - PPT
• Typical methods:
– Binning
– Entropy-based discretization
– Interval merging by χ² analysis
– Clustering analysis
Example:
• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• Equal-depth (frequency) partitioning:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
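The partitioning above can be reproduced with a short Python sketch. The function name equal_depth_bins and the rule for distributing leftover values (earlier bins get one extra) are assumptions made for illustration; the slides do not specify how to handle a count that does not divide evenly.

```python
def equal_depth_bins(values, n_bins):
    """Split values into n_bins bins holding (nearly) the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Assumption: the first `extra` bins absorb one leftover value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")
```

With nine prices and three bins, each bin holds exactly three values, matching Bin 1 to Bin 3 above.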
Entropy-Based Discretization
• Entropy-based discretization is a supervised, top-down
splitting technique.
• It uses class distribution information to calculate and
determine split-points.
• Let D consist of data instances defined by a set of
attributes and a class-label attribute.
• The class-label attribute provides the class information per
instance.
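As a rough illustration of the setting, the sketch below represents D as a list of (attribute value, class label) pairs and computes the Shannon entropy of its class distribution, which is the quantity an entropy-based method works with. The class labels ("budget", "premium"), the variable names, and the function class_entropy are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical data set D: each instance is (price, class label).
# The labels are invented for illustration and do not come from the slides.
D = [(4, "budget"), (8, "budget"), (25, "premium"), (34, "premium")]

labels = [label for _, label in D]
entropy_D = class_entropy(labels)
print(entropy_D)  # 1.0 for an even two-class split
```

A set whose instances all share one class has entropy 0, so lower entropy means a purer class distribution.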
Entropy-Based Discretization
• The basic method for entropy-based discretization of an attribute A
within the set D is as follows:
1. Each value of A can be considered as a potential interval boundary or split-point
(denoted split_point) to partition the range of A.
– That is, a split-point for A can partition the instances in D into two subsets satisfying
the conditions A ≤ split_point and A > split_point, respectively,
– thereby creating a binary discretization.
2. The information gain after partitioning is