4 - Discretization and Concept Hierarchy Generation
6th Semester
Department of Computer Science & Engineering
Jorhat Engineering College
Introduction
• Data Discretization
• Dividing the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce the number of values for a given continuous attribute
• This leads to a concise, easy-to-use, knowledge-level representation of
mining results
• Some classification algorithms accept only categorical attributes,
which makes discretization of continuous attributes necessary
• Can be divided into
• Discretization and concept hierarchy generation for numerical data
• Concept hierarchy generation for categorical data
• Assumptions
• All the methods can be applied recursively
• Each method assumes that the values to be discretized are sorted
in ascending order
Binning
• The sorted values are distributed into a number of buckets, or bins,
and each bin value is then replaced by the bin mean or median
• Binning is
• An unsupervised discretization technique, because it does not
use class information
• A top-down splitting technique based on a specified number of
bins
• Binning methods
• Equal-width (distance) partitioning
• Equal-depth (frequency) partitioning
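The idea of distributing sorted values into bins and replacing each value by its bin mean can be sketched as follows. This is a minimal illustration of equal-depth (frequency) partitioning, not a definitive implementation; the function names `equal_depth_bins` and `smooth_by_means` and the sample values are assumptions made for the example.

```python
from statistics import mean

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one more value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

data = [21, 28, 34, 24, 21, 15, 25, 4, 8]  # illustrative values
bins = equal_depth_bins(data, 3)
smoothed = smooth_by_means(bins)
print(bins)
print(smoothed)
```

Replacing `mean` with `statistics.median` would give smoothing by bin medians instead.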
Binning :: Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B – A) / N
• The most straightforward method, but outliers may dominate the
presentation
• Skewed data is not handled well
• Example: Original data: 21, 28, 34, 24, 21, 15, 25, 4, 8
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Width of intervals for N = 3, W = (B – A) / N = (34 – 4) / 3 = 10
            Bin 1    Bin 2             Bin 3
Interval    4-14     15-24             25-34
Elements    4, 8     15, 21, 21, 24    25, 28, 34
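The equal-width example can be reproduced with a short sketch. It assumes right-closed intervals, so a value that falls exactly on a boundary (such as 24) goes to the lower bin, which matches the table. The function name `equal_width_bins` is hypothetical, and the sketch assumes the highest value is greater than the lowest.

```python
import math

def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N."""
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in ordered:
        # Right-closed intervals: a boundary value belongs to the lower bin.
        index = math.ceil((v - low) / width) - 1
        # Clamp so the lowest value lands in bin 0 and rounding cannot overflow.
        index = min(max(index, 0), n_bins - 1)
        bins[index].append(v)
    return width, bins

width, bins = equal_width_bins([21, 28, 34, 24, 21, 15, 25, 4, 8], 3)
print(width)
print(bins)
```

A left-closed convention would be equally valid but would move 24 into the third bin.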
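Entropy-Based Data Discretization
• A split point on attribute A partitions D into two subsets, D1 and D2;
the expected information requirement after the split is:
$$\mathrm{Info}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2)$$
where
• D1 and D2 correspond to the instances in D on either side of the
split point
• |D| is the number of instances in D, and so on
• The entropy function for a given set is calculated based on the
class distribution of the tuples in the set
• For example, given m classes, C1, C2, …, Cm, the entropy of D1 is:
$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where
• pi is the probability of class Ci in D1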
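A minimal sketch of choosing a split point by entropy is shown below. It assumes that candidate split points are the midpoints between consecutive distinct sorted values and that the best split is the one with the lowest size-weighted entropy of the two resulting subsets. The names `entropy` and `best_split`, the sample values, and the class labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a set, computed from its class distribution."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate (assumption)
        d1 = [c for _, c in pairs[:i]]   # instances below the split point
        d2 = [c for _, c in pairs[i:]]   # instances above the split point
        info = (len(d1) / len(pairs)) * entropy(d1) + (len(d2) / len(pairs)) * entropy(d2)
        if best is None or info < best[1]:
            best = (split, info)
    return best  # None when all values are equal

values = [4, 8, 15, 21, 24, 25, 28, 34]  # illustrative values
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
split, info = best_split(values, labels)
print(split, info)
```

Applying `best_split` again to each resulting subset would give the recursive, top-down behavior mentioned in the assumptions.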
Cluster Analysis
• Clustering considers
• The distribution of a numerical attribute A as well as the closeness
of data points
• It is therefore able to produce high-quality discretization results
• Clustering can be used
• To generate a concept hierarchy for A
• By following either
• A top-down splitting strategy or
• A bottom-up merging strategy
• where each cluster forms a node of the concept hierarchy
• In the top-down splitting strategy
• Each initial cluster or partition may be further partitioned into
sub-clusters, forming a lower level of the concept hierarchy
• In the bottom-up merging strategy
• Clusters are formed by repeatedly grouping neighboring
clusters in order to form higher-level concepts
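The bottom-up merging strategy can be sketched for a single numerical attribute as follows. The slides name no specific clustering algorithm, so this sketch assumes a simple rule: repeatedly merge the two neighboring clusters separated by the smallest gap. The function name `bottom_up_hierarchy` and the sample values are hypothetical.

```python
def bottom_up_hierarchy(values):
    """Merge the two closest neighboring clusters until one cluster remains.

    Returns the levels of the hierarchy, from the finest partition
    (one cluster per distinct value) up to the single root cluster.
    """
    clusters = [[v] for v in sorted(set(values))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Gap between neighbors: smallest value of the next cluster minus
        # the largest value of the current one.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        levels.append([list(c) for c in clusters])
    return levels

levels = bottom_up_hierarchy([4, 8, 15, 21, 24, 25, 28, 34])  # illustrative values
for level in levels:
    print(level)
```

Each entry of `levels` corresponds to one level of the concept hierarchy, and each cluster in it to one node.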
Concept Hierarchy Generation for Categorical Data
• Generalization is
• The generation of concept hierarchies for categorical data
• Categorical attributes have
• A finite (but possibly large) number of distinct values with no
ordering among the values
• Examples
• Geographic location
• Job category
• Item type
• Etc.
• Several methods exist for the generation of concept hierarchies for
categorical data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Specification of a portion of a hierarchy by explicit data
grouping
• Specification of a set of attributes but not of their partial
ordering
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Example: A relational database may contain the following group of
attributes: street, city, state, and country
• A user or expert can easily define a concept hierarchy by specifying
the ordering of the attributes at the schema level
• A hierarchy can be defined by specifying the total ordering among
these attributes at the schema level such as:
street < city < state < country
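One way to picture a schema-level ordering is as an ordered list of attribute names that is then used to generalize records. The sketch below is only an illustration: `SCHEMA_ORDER`, `roll_up`, and the sample record are hypothetical and not part of any particular database system.

```python
# Attribute order from the most specific to the most general level,
# following street < city < state < country.
SCHEMA_ORDER = ["street", "city", "state", "country"]

def roll_up(record, level):
    """Generalize a record by dropping every attribute below `level`."""
    start = SCHEMA_ORDER.index(level)
    return {attr: record[attr] for attr in SCHEMA_ORDER[start:]}

# Hypothetical record, used only for illustration.
record = {"street": "Main St", "city": "Urbana", "state": "Illinois", "country": "USA"}
generalized = roll_up(record, "state")
print(generalized)
```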
• Specification of a portion of a hierarchy by explicit data grouping
• We can easily specify explicit groupings for a small portion of
intermediate-level data
• Example
• After specifying that state and country form a hierarchy at the
schema level
• A user could define some intermediate levels manually such
as:
{Urbana, Chicago} < Illinois
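An explicit grouping such as the one above can be stored as a simple lookup table. The sketch assumes that values without a user-defined grouping are left unchanged, since only a portion of the hierarchy is specified; `CITY_TO_STATE` and `generalize_city` are hypothetical names.

```python
# Explicit grouping supplied by a user: {Urbana, Chicago} < Illinois.
CITY_TO_STATE = {"Urbana": "Illinois", "Chicago": "Illinois"}

def generalize_city(city):
    """Return the parent concept of a city, or the city itself if none was given."""
    return CITY_TO_STATE.get(city, city)

print(generalize_city("Chicago"))
```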
• Specification of a set of attributes but not of their partial ordering
• A user may specify a set of attributes forming a concept
hierarchy without their partial ordering
• The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy
• Example
• Suppose a user selects a set of location-oriented attributes
such as street, country, state and city from a database D,
but does not specify the hierarchical ordering among the
attributes
• Automatic generation of a schema concept hierarchy
• Based on the number of distinct attribute values
• The attribute with the most distinct values is
placed at the lowest level of the hierarchy
• Example
Year
Month
Quarter
Weekday
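The distinct-value heuristic can be sketched for the location attributes mentioned earlier. The rows below are made-up sample data and `order_by_distinct_values` is a hypothetical helper; the sketch simply counts distinct values per attribute and sorts the attributes so that the one with the most distinct values comes first, at the lowest level.

```python
def order_by_distinct_values(rows, attributes):
    """Order attributes from the lowest hierarchy level to the highest.

    The attribute with the most distinct values is placed first.
    """
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a], reverse=True)
    return ordered, counts

# Made-up rows, used only for illustration.
rows = [
    {"street": "1 Oak St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "2 Elm St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "3 Pine St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "4 Lake St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "5 Bay St", "city": "Madison", "state": "Wisconsin", "country": "USA"},
]
order, counts = order_by_distinct_values(rows, ["country", "street", "state", "city"])
print(" < ".join(order))
```

The heuristic is not foolproof: for time attributes such as year, month, quarter, and weekday, the distinct-value counts need not match the intended hierarchy, so the generated order may need manual adjustment.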