4 - Discretization and Concept Hierarchy


Data Discretization

and
Concept Hierarchy Generation

6th Semester
Department of Computer Science & Engineering
Jorhat Engineering College
Introduction
• Data Discretization
• Dividing the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce the number of values for a given continuous attribute
• This leads to a concise, easy-to-use, knowledge-level representation of
mining results
• Also, some classification algorithms accept only categorical attributes
• Can be divided into
• Discretization and Concept Hierarchy Generation
– For Numerical Data
• Discretization and Concept Hierarchy Generation
– For Categorical Data
Introduction
• Formation of a concept hierarchy
• Recursively reduce the data
– By collecting low-level concepts, such as numeric values (for
example, age)
– And replacing them with higher-level concepts, such as
• Young, middle-aged, or senior
Discretization and Concept Hierarchy Generation
for
Numerical Data
Data Discretization Techniques :: Categories
• Discretization techniques can be categorized based on whether they
use class information:
• Supervised discretization
• A process that uses class information
• Unsupervised discretization
• A process that does not use class information
Data Discretization
Discretization techniques can also be categorized based on the direction
in which they proceed:
• Top-down
• If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range
• Then repeats this recursively on the resulting intervals
• Bottom-up
• Starts by considering all of the continuous values as potential split
points
• Removes some by merging neighborhood values to form intervals
• Then recursively applies this process to the resulting intervals
Data Discretization Methods
• Typical methods
• Binning
• Entropy-based Discretization
• Interval merging by χ2 (chi-Square) Analysis
• Clustering analysis

• Assumptions
• All the methods can be applied recursively
• Each method assumes that the values to be discretized are sorted
in ascending order
Binning
• The sorted values are distributed into a number of buckets, or bins,
and each value is then replaced by the mean or median of its bin
• Binning is
• An unsupervised discretization technique, because it does not
use class information
• A top-down splitting technique based on a specified number of
bins

• Binning methods
• Equal-width (distance) partitioning
• Equal-depth (frequency) partitioning
Binning :: Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B – A) / N
• The most straightforward method, but outliers may dominate the presentation
• Skewed data is not handled well
• Example: Original data: 21, 28, 34, 24, 21, 15, 25, 4, 8
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Width of Intervals, W = (B – A) / N = (34 – 4) / 3 = 10
Bin 1: interval 4–14, elements 4, 8
Bin 2: interval 15–24, elements 15, 21, 21, 24
Bin 3: interval 25–34, elements 25, 28, 34

• Replace each value with mean or median of the bin
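The equal-width procedure above can be sketched in Python (a minimal illustration, not library code; the function name `equal_width_bins` is invented here, and the boundary handling is chosen to reproduce the inclusive intervals 4–14 / 15–24 / 25–34 from the example):

```python
import math

def equal_width_bins(values, n_bins):
    """Equal-width (distance) partitioning: split the range [A, B] into
    n_bins intervals of width W = (B - A) / n_bins, then replace each
    value with the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Upper-inclusive boundaries, so 24 falls in the 15-24 bin
        # as in the example above.
        idx = 0 if v == lo else math.ceil((v - lo) / width) - 1
        bins[idx].append(v)
    means = [sum(b) / len(b) for b in bins if b]
    return bins, means

data = [21, 28, 34, 24, 21, 15, 25, 4, 8]
bins, means = equal_width_bins(data, 3)
# bins  -> [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
# means -> [6.0, 20.25, 29.0]
```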


Binning :: Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately
the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
• Example:
Original data: 21, 28, 34, 24, 21, 15, 25, 4, 8
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Bin 1: elements 4, 8, 15
Bin 2: elements 21, 21, 24
Bin 3: elements 25, 28, 34

• Replace each value with mean or median of the bin
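A corresponding minimal sketch for equal-depth partitioning (again an illustration under the same assumptions, not library code):

```python
def equal_depth_bins(values, n_bins):
    """Equal-depth (frequency) partitioning: each bin receives
    approximately the same number of values."""
    s = sorted(values)
    size = len(s) / n_bins  # target number of values per bin
    return [s[round(i * size):round((i + 1) * size)] for i in range(n_bins)]

data = [21, 28, 34, 24, 21, 15, 25, 4, 8]
bins = equal_depth_bins(data, 3)
# bins -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```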


Entropy-Based Data Discretization
• A supervised, top-down splitting technique
• Explores class distribution information in its calculation and
determination of split-points
• Let D consist of data instances defined by a set of attributes and a
class-label attribute
• The class-label attribute provides the class information per instance
Entropy-Based Data Discretization
• The basic method for entropy-based discretization of an attribute A
within the set D is
• Each value of A can be considered as a potential split-point to
partition the range of A
• That is, a split-point for A can partition the instances in D into
two subsets satisfying the conditions
A ≤ split_point and A > split_point,
respectively
• Creates a binary discretization
• Entropy
• A concept from information theory (as well as a measurable physical
property) most commonly used as a measure of disorder; here it
measures the impurity of the class distribution in a set
Entropy-Based Data Discretization
• The expected information requirement after partitioning on a
candidate split-point is

Info_A(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)

where
• D1 and D2 correspond to the instances in D satisfying
A ≤ split_point and A > split_point, respectively
• |D| is the number of instances in D, and so on
• The entropy function for a given set is calculated based on the
class distribution of the tuples in the set
Entropy-Based Data Discretization
• For example, given m classes, C1, C2, …, Cm, the entropy of D1 is:

Entropy(D1) = − Σ (i = 1 to m) pi log2(pi)

where
• pi is the probability of class Ci in D1
• Determined by dividing the number of tuples of class Ci in D1
by |D1|, the total number of tuples in D1

• When selecting a split-point for attribute A
• Pick the value of A that gives the minimum expected information
requirement, i.e., min(Info_A(D))
Entropy-Based Data Discretization
• The process of determining a split-point is recursively applied to
each partition obtained, until some stopping criterion is met, such as:
• when the minimum information requirement on all candidate
split-points is less than a small threshold, t, or
• when the number of intervals is greater than a threshold,
max_interval
• The interval boundaries (split-points) defined may help improve
classification accuracy
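The split-point selection described above can be sketched as follows (a minimal illustration; the age values and yes/no class labels in the usage example are invented):

```python
import math

def entropy(labels):
    """Entropy(S) = - sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(values, labels):
    """Try each value of A as a candidate split-point and return the one
    minimising the expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        split = pairs[i - 1][0]  # partition: A <= split vs A > split
        left = [l for v, l in pairs if v <= split]
        right = [l for v, l in pairs if v > split]
        if not right:
            continue
        info = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_point, best_info = split, info
    return best_point, best_info

ages = [23, 25, 30, 35, 40, 45]            # invented example data
labels = ["no", "no", "no", "yes", "yes", "yes"]
point, info = best_split(ages, labels)
# point -> 30, info -> 0.0 (a perfect binary split)
```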
Interval Merge by χ2 (Chi square) Analysis
Chi Merge
• A supervised, bottom-up method, as it uses class information
• Find the best neighboring intervals and merge them to form larger
intervals recursively
• Treats intervals as discrete categories
• The basic notion is that
• For accurate discretization
• The relative class frequencies should be fairly consistent
within an interval
• Therefore
• If two adjacent intervals have a very similar distribution of
classes, then the intervals can be merged
• Otherwise, they should remain separate
Interval Merge by χ2 (Chi square) Analysis
• The Chi Merge method
• Initially, each distinct value of a numerical attribute, A, is
considered to be one interval
• χ2 tests are performed for every pair of adjacent intervals
• Adjacent intervals with the least χ2 values are merged together
• Since low χ2 values for a pair indicate similar class
distributions
• This merge process proceeds recursively until
• A predefined stopping criterion is met such as
• significance level
• max_interval
• max inconsistency
• etc.
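A compact sketch of ChiMerge (illustrative only: it implements just the max_interval stopping criterion and omits the significance-level test; the sample values and class labels are invented):

```python
def chi2(a, b, classes):
    """Pearson chi-square statistic comparing the class distributions of
    two adjacent intervals (a, b: dicts mapping class -> count)."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for interval in (a, b):
        n_i = sum(interval.values())
        for c in classes:
            expected = n_i * (a.get(c, 0) + b.get(c, 0)) / total
            if expected > 0:
                stat += (interval.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Start with one interval per distinct value, then repeatedly merge
    the adjacent pair with the lowest chi-square value until only
    max_intervals remain."""
    classes = sorted(set(labels))
    intervals = []  # list of [lower_bound, {class: count}]
    for v, l in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][l] = intervals[-1][1].get(l, 0) + 1
        else:
            intervals.append([v, {l: 1}])
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        for c, n in intervals[i + 1][1].items():  # merge intervals i, i+1
            intervals[i][1][c] = intervals[i][1].get(c, 0) + n
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower bound of each interval

cuts = chimerge([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"], 2)
# cuts -> [1, 7]: same-class neighbours merge first (chi-square = 0)
```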
Cluster Analysis
• A popular data discretization method
• A clustering algorithm can be applied
• To discretize a numerical attribute, A
• By partitioning the values of A into clusters or groups

• Clustering considers
• The distribution as well as the closeness of data points
• Therefore, it is able to produce high-quality discretization results
Cluster Analysis
• Clustering can be used
• To generate a concept hierarchy for A
• By following either
• A top-down splitting strategy or
• A bottom-up merging strategy
• where each cluster forms a node of the concept hierarchy
• In Top-down splitting strategy
• Each initial cluster or partition may be further partitioned into
sub-clusters, forming a lower level of the concept hierarchy
• In bottom-up merging strategy
• Clusters are formed by repeatedly grouping neighboring
clusters in order to form higher level concepts
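As one concrete (hypothetical) instance, a tiny 1-D k-means can partition an attribute's values into clusters that then serve as discretization intervals; the seeding strategy here is an arbitrary choice, and the data is invented:

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: partitions the values of a numerical
    attribute into k clusters usable as discretization intervals."""
    s = sorted(values)
    # Seed centroids at evenly spaced positions in the sorted data
    # (a simple, arbitrary initialization).
    centroids = [float(s[i * (len(s) - 1) // (k - 1)]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in s:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return clusters

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # invented, clearly clustered
clusters = kmeans_1d(data, 3)
# clusters -> [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
```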
Concept Hierarchy Generation
for
Categorical Data
Concept Hierarchy Generation for
Categorical Data
• Generalization is
• The generation of concept hierarchies for categorical data
• Categorical attributes have
• A finite (but possibly large) number of distinct values with no
ordering among the values
• Examples
• Geographic location
• Job category
• Item type
• Etc.
Concept Hierarchy Generation for
Categorical Data
• Several methods for the generation of concept hierarchies for
categorical data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Specification of a portion of a hierarchy by explicit data
grouping
• Specification of a set of attributes but not of their partial
ordering
Concept Hierarchy Generation for
Categorical Data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Example: A relational database may contain the following group of
attributes: street, city, state, and country
• A user or expert can easily define a concept hierarchy by specifying
a total ordering of these attributes at the schema level, such as:
street < city < state < country
Concept Hierarchy Generation for
Categorical Data
• Specification of a portion of a hierarchy by explicit data grouping
• We can easily specify explicit groupings for a small portion of
intermediate-level data
• Example
• After specifying that state and country form a hierarchy at the
schema level
• A user could define some intermediate levels manually such
as:
{Urbana, Chicago} < Illinois
Concept Hierarchy Generation for
Categorical Data
• Specification of a set of attributes but not of their partial ordering
• A user may specify a set of attributes forming a concept
hierarchy without their partial ordering
• The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy
• Example
• Suppose a user selects a set of location-oriented attributes
such as street, country, state, and city from a database D,
but does not specify the hierarchical ordering among the
attributes
Concept Hierarchy Generation for
Categorical Data
• Automatic generation of a schema concept hierarchy
• Based on the number of distinct attribute values
• The attribute with the most distinct values is
placed at the lowest level of the hierarchy
• Example
• Time attributes such as year, quarter, month, and weekday can be
ordered automatically by their number of distinct values
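The distinct-value-count heuristic can be sketched as follows (a minimal illustration; the sample columns and their values are invented):

```python
def schema_hierarchy(columns):
    """Order attributes so the one with the fewest distinct values sits
    at the top of the hierarchy and the one with the most at the bottom."""
    counts = {attr: len(set(vals)) for attr, vals in columns.items()}
    return sorted(counts, key=counts.get)  # ascending distinct counts

# Toy location data, invented for illustration.
columns = {
    "city":    ["LA", "SF", "NYC", "Jorhat", "Kochi", "Pune"],
    "country": ["US", "US", "US", "IN", "IN", "IN"],
    "state":   ["CA", "CA", "NY", "AS", "KL", "MH"],
}
hierarchy = schema_hierarchy(columns)
# hierarchy -> ['country', 'state', 'city'] (top to bottom)
```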
