
Data Mining

Data Discretization and
Concept Hierarchy Generation
Data Discretization

• Dividing the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduces the number of values for a given continuous attribute
• Some classification algorithms only accept categorical attributes
• This leads to a concise, easy-to-use, knowledge-level representation of mining results
Data Discretization
• Discretization techniques can be categorized based on whether they use
class information:
– Supervised discretization
◆ the discretization process uses class information
– Unsupervised discretization
◆ the discretization process does not use class information
• Discretization techniques can be categorized based on the direction in
which they proceed:
– Top-down
◆ Starts by finding one or a few points (called split
points or cut points) to split the entire attribute range, and then
repeats this recursively on the resulting intervals
– Bottom-up
◆ Starts by considering all of the continuous values as potential
split-points,
◆ Removes some by merging neighborhood values to form intervals, and
◆ Then recursively applies this process to the resulting intervals.
Data Discretization

• Typical methods:
– Binning
– Entropy-based discretization
– Interval merging by χ² analysis
– Clustering analysis

• All the methods can be applied recursively


• Each method assumes that the values to be discretized are
sorted in ascending order.
Binning
• The sorted values are distributed into a number of
buckets, or bins, and each bin value is then replaced by
the bin mean or median
• Binning is:
– a top-down splitting technique based on a specified number
of bins.
– an unsupervised discretization technique, because it does
not use class information
• Binning methods:
– Equal-width (distance) partitioning
– Equal-depth (frequency) partitioning
Equal-width (distance) partitioning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Example:
• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• W = (B –A)/N = (34 – 4) / 3 = 10
– Bin 1: 4-14, Bin 2: 15-24, Bin 3: 25-34
• Equal-width (distance) partitioning:
– Bin 1: 4, 8
– Bin 2: 15, 21, 21, 24
– Bin 3: 25, 28, 34
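A minimal Python sketch of the equal-width example above (the function name is illustrative, and the boundaries follow the slide's convention of closing each interval on the right, so 24 lands in Bin 2):

```python
# Equal-width (distance) partitioning -- a sketch of the example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted data

def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    cuts = [lo + width * k for k in range(1, n_bins)]  # interior boundaries
    bins = [[] for _ in range(n_bins)]
    for v in values:
        bins[sum(v > c for c in cuts)].append(v)  # count cuts below v
    return bins

print(equal_width_bins(prices, 3))
# [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
```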
Equal-depth (frequency) partitioning
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky

• Example:
• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• Equal-depth (frequency) partitioning:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
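A matching sketch for the equal-depth example, again with illustrative names; each bin receives roughly len(values) / n_bins samples:

```python
# Equal-depth (frequency) partitioning -- a sketch of the example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted data

def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of roughly equal size."""
    depth = len(values) / n_bins
    return [values[round(i * depth):round((i + 1) * depth)]
            for i in range(n_bins)]

print(equal_depth_bins(prices, 3))
# [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```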
Entropy-Based Discretization
• Entropy-based discretization is a supervised, top-down
splitting technique.
• It explores class distribution information in its
calculation and determination of split-points
• Let D consist of data instances defined by a set of
attributes and a class-label attribute.
• The class-label attribute provides the class information per
instance.
Entropy-Based Discretization
• The basic method for entropy-based discretization of an attribute A
within the set is as follows:
1. Each value of A can be considered as a potential interval boundary or split-point
(denoted split_point) to partition the range of A.
– That is, a split-point for A can partition the instances in D into two subsets satisfying
the conditions A≤ split_point and A > split_point, respectively,
– thereby creating a binary discretization.
2. The information gain after partitioning is
      InfoA(D) = (|D1| / |D|) Entropy(D1) + (|D2| / |D|) Entropy(D2)
– where D1 and D2 correspond to the instances in D satisfying the conditions
A ≤ split_point and A > split_point, respectively
– |D| is the number of instances in D, and so on.
– The entropy function for a given set is calculated based on the class
distribution of the tuples in the set.
– For example, given m classes, C1, C2, …, Cm, the entropy of D1 is:
      Entropy(D1) = − Σi=1..m pi log2(pi)
Entropy-Based Discretization
– where pi is the probability of class Ci in D1, determined by dividing the
number of tuples of class Ci in D1 by |D1|, the total number of
tuples in D1.
– Therefore, when selecting a split-point for attribute A, we want to pick
the attribute value that gives the minimum expected information
requirement (i.e., min(InfoA(D))).
3) The process of determining a split-point is recursively applied to each partition
obtained, until some stopping criterion is met, such as:
– when the minimum information requirement on all candidate
split-points is less than a small threshold, ε,
– or when the number of intervals is greater than a threshold,
max_interval.
• The interval boundaries (split-points) defined in this way may help improve
classification accuracy (see the sketch below)
• The entropy and information gain measures described here are also used for decision tree
induction.
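As a concrete illustration of steps 1-3, here is a minimal Python sketch that scans every candidate split_point and returns the one minimizing InfoA(D); the input format (a list of (value, class_label) pairs) and all names are illustrative assumptions, not from the slides:

```python
# Entropy-based split-point selection -- a minimal sketch.
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(data):
    """Return (min InfoA(D), split_point) over all candidate split-points."""
    data = sorted(data)                        # sort by attribute value A
    labels = [label for _, label in data]
    best = (float("inf"), None)
    for k in range(1, len(data)):              # binary split after position k
        d1, d2 = labels[:k], labels[k:]        # A <= split_point / A > split_point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(labels)
        best = min(best, (info, data[k - 1][0]))
    return best

print(best_split_point([(4, "no"), (8, "no"), (15, "yes"), (21, "yes")]))
# (0.0, 8): splitting at A <= 8 separates the two classes perfectly
```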
Interval Merge by 2 Analysis
• ChiMerge:
– It is a bottom-up method
– Find the best neighboring intervals and merge them to form
larger intervals recursively
– The method is supervised in that it uses class information.
– The basic notion is that for accurate discretization, the
relative class frequencies should be fairly consistent within
an interval.
– Therefore, if two adjacent intervals have a very similar
distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
– ChiMerge treats intervals as discrete categories
Interval Merge by 2 Analysis
• The ChiMerge method:
– Initially, each distinct value of a numerical attribute A is
considered to be one interval
– χ² tests are performed for every pair of adjacent
intervals
– Adjacent intervals with the least χ² values are merged
together, since low χ² values for a pair indicate similar class
distributions
– This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level, max-interval,
max inconsistency, etc.); a sketch of the χ² computation follows below
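A minimal sketch of the χ² statistic ChiMerge computes for one pair of adjacent intervals; the 2 × m contingency table and its counts are made up for the example:

```python
# chi^2 for two adjacent intervals -- a minimal sketch.
def chi_square(counts1, counts2):
    """chi^2 over a 2 x m table: rows = adjacent intervals, cols = classes.

    A_ij = observed count of class j in interval i
    E_ij = (row_i total * col_j total) / grand total
    """
    table = [counts1, counts2]
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(col)):
            e = row[i] * col[j] / n          # expected frequency E_ij
            if e:                            # skip classes absent from both
                chi2 += (table[i][j] - e) ** 2 / e
    return chi2

# Similar class distributions -> low chi^2 -> merge candidates.
print(chi_square([10, 2], [9, 3]))   # ~0.25
print(chi_square([10, 2], [2, 10]))  # ~10.67
```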
Cluster Analysis

• Cluster analysis is a popular data discretization method.


• A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning
the values of A into clusters or groups.
• Clustering takes the distribution of A into consideration, as well as the closeness of data
points, and therefore is able to produce high-quality discretization results.
• Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting
strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
• In the former, each initial cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy.
• In the latter, clusters are formed by repeatedly grouping neighboring clusters in order to form higher-
level concepts.
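As one possible sketch, k-means from scikit-learn (an assumed dependency, not named in the slides) can discretize the price values used in the earlier binning examples; each resulting cluster becomes one interval:

```python
# Clustering-based discretization -- each k-means cluster becomes one interval.
import numpy as np
from sklearn.cluster import KMeans

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Cluster the one-dimensional attribute values into 3 groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    prices.reshape(-1, 1))

# Each cluster forms a node (interval) of the concept hierarchy.
for k in sorted(set(labels)):
    print(f"cluster {k}: {sorted(prices[labels == k])}")
```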
Concept Hierarchy Generation for Categorical Data
Concept Hierarchy Generation for Categorical Data

• Generalization is the generation of concept hierarchies
for categorical data
• Categorical attributes have a finite (but possibly large)
number of distinct values, with no ordering among the
values.
• Examples include
– geographic location,
– job category, and
– item type.
Concept Hierarchy Generation for Categorical Data

• There are several methods for the generation of concept hierarchies
for categorical data:
– Specification of a partial ordering of attributes explicitly at the schema
level by users or experts
– Specification of a portion of a hierarchy by explicit data grouping
– Specification of a set of attributes, but not of their partial ordering
Concept Hierarchy Generation for Categorical Data

• Specification of a partial ordering of attributes explicitly at
the schema level by users or experts
– Example: a relational database or a dimension location of a data
warehouse may contain the following group of attributes: street, city,
province or state, and country.
– A user or expert can easily define a concept hierarchy by specifying an
ordering of the attributes at the schema level.
– A hierarchy can be defined by specifying the total ordering among these
attributes at the schema level, such as:
◆ street < city < province or state < country
Concept Hierarchy Generation for Categorical Data

• Specification of a portion of a hierarchy by explicit data
grouping
– We can easily specify explicit groupings for a small portion
of intermediate-level data.
– For example, after specifying that province and country
form a hierarchy at the schema level, a user could define
some intermediate levels manually, such as:
◆ {Urbana, Champaign, Chicago} < Illinois
Concept Hierarchy Generation for Categorical Data

• Specification of a set of attributes, but not of their
partial ordering
– A user may specify a set of attributes forming a concept
hierarchy, but omit to explicitly state their partial ordering.
– The system can then try to automatically generate the
attribute ordering so as to construct a meaningful concept
hierarchy.
– Example: Suppose a user selects a set of location-oriented
attributes, street, country, province_or_state, and city, from
the AllElectronics database, but does not specify the
hierarchical ordering among the attributes.
Concept Hierarchy Generation for Categorical Data
• Automatic generation of a schema concept hierarchy based on
the number of distinct attribute values.
• The attribute with the most distinct values is placed at
the lowest level of the hierarchy
• Exceptions exist, e.g., a time hierarchy: weekday has only 7
distinct values, fewer than month (12), quarter, or year, yet
weekday does not belong at the top (see the sketch below)
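A minimal sketch of this heuristic; the toy location table is illustrative and not the actual AllElectronics data:

```python
# Automatic hierarchy generation: fewer distinct values => higher level.
location = {
    "country":           ["USA", "USA", "USA", "Canada", "Canada"],
    "province_or_state": ["IL", "IL", "NY", "BC", "BC"],
    "city":              ["Chicago", "Urbana", "New York", "Vancouver", "Vancouver"],
    "street":            ["Main St", "Green St", "5th Ave", "Oak Ave", "Pine Rd"],
}

# Sort attributes by number of distinct values, most distinct at the bottom.
# Note the slide's caveat: attributes like weekday break this heuristic.
hierarchy = sorted(location, key=lambda a: len(set(location[a])), reverse=True)
print(" < ".join(hierarchy))
# street < city < province_or_state < country
```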
THANK YOU
