4 - Discretization and Concept Hierarchy


Data Discretization

and
Concept Hierarchy Generation

6th Semester
Department of Computer Science & Engineering
Jorhat Engineering College
Introduction
• Data Discretization
• Dividing the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce the number of values for a given continuous attribute
• This leads to a concise, easy-to-use, knowledge-level representation of
mining results
• Also, some classification algorithms accept only categorical attributes
• Can be divided into
• Discretization and Concept Hierarchy Generation
– For Numerical Data
• Discretization and Concept Hierarchy Generation
– For Categorical Data
Introduction
• Formation of a concept hierarchy
• Recursively reduce the data
– By collecting low-level concepts, such as numeric values (for
example, age)
– And replacing them with higher-level concepts, such as
• Young, middle-aged, or senior
Discretization and Concept Hierarchy Generation
for
Numerical Data
Data Discretization Techniques :: Categories
• Discretization techniques can be categorized based on whether they
use class information:
• Supervised discretization
• A process that uses class information
• Unsupervised discretization
• A process that does not use class information
Data Discretization
Discretization techniques can also be categorized based on the direction
in which they proceed:
• Top-down
• If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range
• Then repeats this recursively on the resulting intervals
• Bottom-up
• Starts by considering all of the continuous values as potential split
points
• Removes some by merging neighborhood values to form intervals
• Then recursively applies this process to the resulting intervals
Data Discretization Methods
• Typical methods
• Binning
• Entropy-based Discretization
• Interval merging by χ2 (chi-Square) Analysis
• Clustering analysis

• Assumptions
• All the methods can be applied recursively
• Each method assumes that the values to be discretized are sorted
in ascending order
Binning
• The sorted values are distributed into a number of buckets, or bins,
and each value is then replaced by the mean or median of its bin
• Binning is
• An unsupervised discretization technique, because it does not
use class information
• A top-down splitting technique based on a specified number of
bins

• Binning methods
• Equal-width (distance) partitioning
• Equal-depth (frequency) partitioning
Binning :: Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B – A) / N
• The most straightforward method, but outliers may dominate the presentation
• Skewed data is not handled well
• Example: Original data: 21, 28, 34, 24, 21, 15, 25, 4, 8
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Width of Intervals, W = (B – A) / N = (34 – 4) / 3 = 10
Bin 1: interval 4–14, elements 4, 8
Bin 2: interval 15–24, elements 15, 21, 21, 24
Bin 3: interval 25–34, elements 25, 28, 34

• Replace each value with mean or median of the bin
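The equal-width procedure above can be sketched in Python (a minimal illustration, not library code; the function name `equal_width_bins` is invented here, and the boundary handling is chosen to reproduce the inclusive intervals 4–14 / 15–24 / 25–34 from the example):

```python
import math

def equal_width_bins(values, n_bins):
    """Equal-width (distance) partitioning: split the range [A, B] into
    n_bins intervals of width W = (B - A) / n_bins, then replace each
    value with the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Upper-inclusive boundaries, so 24 falls in the 15-24 bin
        # as in the example above.
        idx = 0 if v == lo else math.ceil((v - lo) / width) - 1
        bins[idx].append(v)
    means = [sum(b) / len(b) for b in bins if b]
    return bins, means

data = [21, 28, 34, 24, 21, 15, 25, 4, 8]
bins, means = equal_width_bins(data, 3)
# bins  -> [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
# means -> [6.0, 20.25, 29.0]
```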


Binning :: Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately
the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
• Example:
Original data: 21, 28, 34, 24, 21, 15, 25, 4, 8
Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Bin 1: elements 4, 8, 15
Bin 2: elements 21, 21, 24
Bin 3: elements 25, 28, 34

• Replace each value with mean or median of the bin
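A corresponding minimal sketch for equal-depth partitioning (again an illustration under the same assumptions, not library code):

```python
def equal_depth_bins(values, n_bins):
    """Equal-depth (frequency) partitioning: each bin receives
    approximately the same number of values."""
    s = sorted(values)
    size = len(s) / n_bins  # target number of values per bin
    return [s[round(i * size):round((i + 1) * size)] for i in range(n_bins)]

data = [21, 28, 34, 24, 21, 15, 25, 4, 8]
bins = equal_depth_bins(data, 3)
# bins -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```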


Entropy-Based Data Discretization
• A supervised, top-down splitting technique
• Explores class distribution information in its calculation and
determination of split-points
• Let D consist of data instances defined by a set of attributes and a
class-label attribute
• The class-label attribute provides the class information per instance
Entropy-Based Data Discretization
• The basic method for entropy-based discretization of an attribute A
within the set D is
• Each value of A can be considered as a potential split-point to
partition the range of A
• That is, a split-point for A can partition the instances in D into
two subsets satisfying the conditions
A ≤ split_point and A > split_point,
respectively
• Creates a binary discretization
• Entropy
• A concept from information theory (as well as a measurable physical
property) most commonly used as a measure of disorder; here it
measures the impurity of the class distribution in a set
Entropy-Based Data Discretization
• The expected information requirement after partitioning on a
candidate split-point is

Info_A(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)

where
• D1 and D2 correspond to the instances in D satisfying
A ≤ split_point and A > split_point, respectively
• |D| is the number of instances in D, and so on
• The entropy function for a given set is calculated based on the
class distribution of the tuples in the set
Entropy-Based Data Discretization
• For example, given m classes, C1, C2, …, Cm, the entropy of D1 is:

Entropy(D1) = − Σ (i = 1 to m) pi log2(pi)

where
• pi is the probability of class Ci in D1
• Determined by dividing the number of tuples of class Ci in D1
by |D1|, the total number of tuples in D1

• When selecting a split-point for attribute A
• Pick the value of A that gives the minimum expected information
requirement, i.e., min(Info_A(D))
Entropy-Based Data Discretization
• The process of determining a split-point is recursively applied to
each partition obtained, until some stopping criterion is met, such as:
• when the minimum information requirement on all candidate
split-points is less than a small threshold, t, or
• when the number of intervals is greater than a threshold,
max_interval
• The interval boundaries (split-points) defined may help improve
classification accuracy
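The split-point selection described above can be sketched as follows (a minimal illustration; the age values and yes/no class labels in the usage example are invented):

```python
import math

def entropy(labels):
    """Entropy(S) = - sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(values, labels):
    """Try each value of A as a candidate split-point and return the one
    minimising the expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        split = pairs[i - 1][0]  # partition: A <= split vs A > split
        left = [l for v, l in pairs if v <= split]
        right = [l for v, l in pairs if v > split]
        if not right:
            continue
        info = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_point, best_info = split, info
    return best_point, best_info

ages = [23, 25, 30, 35, 40, 45]            # invented example data
labels = ["no", "no", "no", "yes", "yes", "yes"]
point, info = best_split(ages, labels)
# point -> 30, info -> 0.0 (a perfect binary split)
```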
Interval Merge by χ2 (Chi square) Analysis
Chi Merge
• A supervised, bottom-up method, as it uses class information
• Find the best neighboring intervals and merge them to form larger
intervals recursively
• Treats intervals as discrete categories
• The basic notion is that
• For accurate discretization
• The relative class frequencies should be fairly consistent
within an interval
• Therefore
• If two adjacent intervals have a very similar distribution of
classes, then the intervals can be merged
• Otherwise, they should remain separate
Interval Merge by χ2 (Chi square) Analysis
• The Chi Merge method
• Initially, each distinct value of a numerical attribute, A, is
considered to be one interval
• χ2 tests are performed for every pair of adjacent intervals
• Adjacent intervals with the least χ2 values are merged together
• Since low χ2 values for a pair indicate similar class
distributions
• This merge process proceeds recursively until
• A predefined stopping criterion is met such as
• significance level
• max_interval
• max inconsistency
• etc.
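A compact sketch of ChiMerge (illustrative only: it implements just the max_interval stopping criterion and omits the significance-level test; the sample values and class labels are invented):

```python
def chi2(a, b, classes):
    """Pearson chi-square statistic comparing the class distributions of
    two adjacent intervals (a, b: dicts mapping class -> count)."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for interval in (a, b):
        n_i = sum(interval.values())
        for c in classes:
            expected = n_i * (a.get(c, 0) + b.get(c, 0)) / total
            if expected > 0:
                stat += (interval.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Start with one interval per distinct value, then repeatedly merge
    the adjacent pair with the lowest chi-square value until only
    max_intervals remain."""
    classes = sorted(set(labels))
    intervals = []  # list of [lower_bound, {class: count}]
    for v, l in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][l] = intervals[-1][1].get(l, 0) + 1
        else:
            intervals.append([v, {l: 1}])
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        for c, n in intervals[i + 1][1].items():  # merge intervals i, i+1
            intervals[i][1][c] = intervals[i][1].get(c, 0) + n
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower bound of each interval

cuts = chimerge([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"], 2)
# cuts -> [1, 7]: same-class neighbours merge first (chi-square = 0)
```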
Cluster Analysis
• A popular data discretization method
• A clustering algorithm can be applied
• To discretize a numerical attribute, A
• By partitioning the values of A into clusters or groups

• Clustering considers
• The distribution as well as the closeness of data points
• Therefore, it is able to produce high-quality discretization results
Cluster Analysis
• Clustering can be used
• To generate a concept hierarchy for A
• By following either
• A top-down splitting strategy or
• A bottom-up merging strategy
• where each cluster forms a node of the concept hierarchy
• In Top-down splitting strategy
• Each initial cluster or partition may be further partitioned into
sub-clusters, forming a lower level of the concept hierarchy
• In bottom-up merging strategy
• Clusters are formed by repeatedly grouping neighboring
clusters in order to form higher level concepts
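As one concrete (hypothetical) instance, a tiny 1-D k-means can partition an attribute's values into clusters that then serve as discretization intervals; the seeding strategy here is an arbitrary choice, and the data is invented:

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: partitions the values of a numerical
    attribute into k clusters usable as discretization intervals."""
    s = sorted(values)
    # Seed centroids at evenly spaced positions in the sorted data
    # (a simple, arbitrary initialization).
    centroids = [float(s[i * (len(s) - 1) // (k - 1)]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in s:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return clusters

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # invented, clearly clustered
clusters = kmeans_1d(data, 3)
# clusters -> [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
```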
Concept Hierarchy Generation
for
Categorical Data
Concept Hierarchy Generation for
Categorical Data
• Generalization is
• The generation of concept hierarchies for categorical data
• Categorical attributes have
• A finite (but possibly large) number of distinct values with no
ordering among the values
• Examples
• Geographic location
• Job category
• Item type
• Etc.
Concept Hierarchy Generation for
Categorical Data
• Several methods for the generation of concept hierarchies for
categorical data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Specification of a portion of a hierarchy by explicit data
grouping
• Specification of a set of attributes but not of their partial
ordering
Concept Hierarchy Generation for
Categorical Data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Example: A relational database may contain the following group of
attributes: street, city, state, and country
• A user or expert can easily define a concept hierarchy by specifying
a total ordering of these attributes at the schema level, such as:
street < city < state < country
Concept Hierarchy Generation for
Categorical Data
• Specification of a portion of a hierarchy by explicit data grouping
• We can easily specify explicit groupings for a small portion of
intermediate-level data
• Example
• After specifying that state and country form a hierarchy at the
schema level
• A user could define some intermediate levels manually such
as:
{Urbana, Chicago} < Illinois
Concept Hierarchy Generation for
Categorical Data
• Specification of a set of attributes but not of their partial ordering
• A user may specify a set of attributes forming a concept
hierarchy without their partial ordering
• The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy
• Example
• Suppose a user selects a set of location-oriented attributes
such as street, country, state, and city from a database D,
but does not specify the hierarchical ordering among the
attributes
Concept Hierarchy Generation for
Categorical Data
• Automatic generation of a schema concept hierarchy
• Based on the number of distinct attribute values
• The attribute with the most distinct values is
placed at the lowest level of the hierarchy
• Example
• Time attributes such as year, quarter, month, and weekday can be
ordered automatically by their number of distinct values
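The distinct-value-count heuristic can be sketched as follows (a minimal illustration; the sample columns and their values are invented):

```python
def schema_hierarchy(columns):
    """Order attributes so the one with the fewest distinct values sits
    at the top of the hierarchy and the one with the most at the bottom."""
    counts = {attr: len(set(vals)) for attr, vals in columns.items()}
    return sorted(counts, key=counts.get)  # ascending distinct counts

# Toy location data, invented for illustration.
columns = {
    "city":    ["LA", "SF", "NYC", "Jorhat", "Kochi", "Pune"],
    "country": ["US", "US", "US", "IN", "IN", "IN"],
    "state":   ["CA", "CA", "NY", "AS", "KL", "MH"],
}
hierarchy = schema_hierarchy(columns)
# hierarchy -> ['country', 'state', 'city'] (top to bottom)
```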
