Data Discretization

The document discusses various techniques for data discretization including binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. It provides examples and definitions for each technique and discusses their applications for data preprocessing and analytics.

Uploaded by

2200030218cseh

KONERU LAKSHMAIAH EDUCATION FOUNDATION

(Deemed to be University estd. u/s 3 of the UGC Act, 1956)
(NAAC Accredited “A++” Grade University)

Green Fields, Guntur District, A.P., India – 522502

B.Tech. IIInd Year

PROGRAM
A.Y.2023-24
Even Semester
22CS2227 - DATA ANALYTICS AND VISUALIZATION
CO 2

Session 25: Data Discretization

1. Course Description
Data Discretization Techniques in Data Science is an advanced course designed to provide
students with a comprehensive understanding of the fundamental concepts and methodologies
related to data discretization. In this course, students will delve into the principles, techniques, and
applications of data discretization, a critical step in the data preprocessing pipeline that enables
efficient and effective analysis of large datasets.
Prerequisites include basic knowledge of data structures and algorithms, familiarity with
programming fundamentals in languages such as Python or R, and an understanding of fundamental
concepts in statistics and data analysis.

2. Aim

The aim of data discretization is to transform continuous or numerical data into discrete or categorical form,
facilitating the analysis and processing of data by reducing complexity and noise while preserving the
underlying patterns and relationships within the data. This process allows for more efficient computation and
analysis, particularly in the context of machine learning algorithms and data mining tasks.

3. Instructional Objectives (Course Objectives)

• Gain a comprehensive understanding of data discretization concepts and methodologies.
• Develop proficiency in implementing various data discretization techniques using
programming languages and data preprocessing tools.
• Evaluate the impact of data discretization on data quality, model performance, and
data analysis outcomes.
• Apply data discretization techniques to real-world datasets and analyze their
implications on data mining tasks.

4. Learning Outcomes (Course Outcome)

By the end of the course, students will be able to implement diverse data
discretization techniques using programming languages and critically evaluate their
impact on data quality.

5. Module Description (CO-2 Description)

Applying various data discretization methods to explore the data.

6. Session Introduction
In this session, we will explore the fundamental concepts and methodologies related to
data discretization, emphasizing the significance of this preprocessing step in enabling
efficient data analysis. Through interactive discussions and practical demonstrations, we
aim to deepen your understanding of various discretization methods and their implications
for data quality and analysis outcomes. Get ready to delve into the intricacies of data
discretization and its role in shaping effective data science pipelines.

7. Session description

Data discretization is defined as a process of converting continuous data attribute values into
a finite set of intervals with minimal loss of information and associating with each interval
some specific data value or conceptual labels.

Data discretization is considered a data reduction mechanism because it reduces data
from a large domain of numeric values to a small set of categorical values.

For example, age can be transformed into intervals (0-10, 11-20, ….) or into conceptual
labels such as youth, adult, and senior.
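
The age example can be sketched in a few lines of Python. This is only an illustration: the cut points 18 and 60 and the helper names `age_to_label` and `age_to_interval` are assumptions made here, since the notes give the labels but not their boundaries.

```python
from bisect import bisect_right

# Assumed cut points (not from the notes): <18 youth, 18-59 adult, >=60 senior.
AGE_EDGES = [18, 60]
AGE_LABELS = ["youth", "adult", "senior"]


def age_to_label(age):
    """Map a numeric age to a conceptual label using the assumed cut points."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def age_to_interval(age):
    """Map a whole-number age to an interval string such as '0-10' or '11-20'."""
    if age <= 10:
        return "0-10"
    low = ((age - 1) // 10) * 10 + 1   # 11, 21, 31, ...
    return f"{low}-{low + 9}"


if __name__ == "__main__":
    for age in (7, 18, 35, 72):
        print(age, age_to_interval(age), age_to_label(age))
```

Both functions turn many distinct numeric values into a handful of categories, which is the data reduction effect described above.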

Fits the problem statement

Continuous data (such as weight) is often easier to understand when it is divided into
meaningful categories or groups. For example, we can divide a continuous variable,
weight, into the following groups:
Under 100 lbs (light), between 140–160 lbs (mid), and over 200 lbs (heavy)

The grouping is useful if there is no meaningful difference, for our purposes, between
values that fall in the same weight class.
In our example, weights of 85 lbs and 56 lbs convey the same information (the object is
light). Discretization therefore makes the data easier to understand, provided it fits the
problem statement.

Methods of Data Discretization

1. Binning
2. Histogram analysis
3. Cluster analysis
4. Decision tree analysis
5. Correlation analysis

1. Binning:
• Binning is a top-down splitting technique based on a specified number of bins.
• The main challenge in this form of discretization is choosing the number of intervals
(bins) and deciding on their width.
• Binning methods smooth a sorted data value by consulting its “neighborhood”, that is,
the values around it. The sorted values are distributed into several “buckets” or bins.
Because binning methods consult the neighborhood of values, they perform local
smoothing.
• Attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively.
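
A minimal sketch of equal-width binning, equal-frequency binning, and smoothing by bin means follows. The function names and the sample prices are illustrative assumptions, not part of the session material.

```python
def equal_width_bins(values, k):
    """Distribute sorted values into k bins that each span the same width."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in data:
        if width == 0:
            idx = 0                                   # all values are identical
        else:
            idx = min(int((v - lo) / width), k - 1)   # maximum goes in last bin
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Distribute sorted values into k bins holding (almost) the same count."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean (empty bins are skipped)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


if __name__ == "__main__":
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative data
    print(equal_width_bins(prices, 3))
    print(equal_frequency_bins(prices, 3))
    print(smooth_by_bin_means(equal_frequency_bins(prices, 3)))
```

Equal-width bins can end up with very different counts when the data are skewed, while equal-frequency bins keep the counts balanced at the cost of uneven widths.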

2. Histogram analysis:
• Histograms (or frequency histograms) are at least a century old and are widely used.
• “Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles.
• Plotting histograms is a graphical method for summarizing the distribution of a given
attribute, X.
• If X is nominal, such as automobile model or item type, then a pole or vertical bar is
drawn for each known value of X.
• The height of the bar indicates the frequency (i.e., count) of that X value.
• The resulting graph is more commonly known as a bar chart.
• If X is numeric, the term histogram is preferred.
• The range of values for X is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width.
• Typically, the buckets are of equal width.
For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest
dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on.
For each subrange, a bar is drawn with a height that represents the total count of items
observed within the subrange.
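
The price example can be sketched as follows, assuming whole-dollar prices of at least $1. The helper names `price_bucket` and `histogram` and the sample prices are hypothetical.

```python
from collections import Counter


def price_bucket(price, width=20):
    """Return the (low, high) subrange that contains a whole-dollar price >= 1."""
    low = ((price - 1) // width) * width + 1   # 1, 21, 41, ...
    return (low, low + width - 1)


def histogram(prices, width=20):
    """Count how many prices fall in each equal-width subrange."""
    counts = Counter(price_bucket(p, width) for p in prices)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [1, 20, 21, 40, 41, 200]          # illustrative data
    for (low, high), count in histogram(sample).items():
        print(f"{low:>3}-{high:<3} {'#' * count}")
```

Each printed row plays the role of one bar: its length is the count of items observed in that subrange.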

3. Cluster analysis:
Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the
values of A into clusters or groups based on similarity and storing only the cluster
representation (e.g., centroid and diameter).
Properties of clusters:
(i) All the data points in a cluster should be similar to each other.
(ii) The data points from different clusters should be as different as possible.
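
As a rough sketch, a one-dimensional k-means procedure can discretize a numeric attribute. The notes do not name a specific clustering algorithm, so k-means, the evenly spaced starting centroids, and the definition of diameter as maximum minus minimum are all assumptions made here for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster numbers into k groups with a simple k-means loop."""
    data = sorted(values)
    n = len(data)
    # Assumed deterministic start: k values spread evenly over the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:        # assignments have stopped changing
            break
        centroids = updated
    return centroids, clusters


def cluster_summary(clusters):
    """Keep only (centroid, diameter) per non-empty cluster."""
    return [(sum(c) / len(c), max(c) - min(c)) for c in clusters if c]


if __name__ == "__main__":
    ages = [1, 2, 3, 20, 21, 22, 50, 51, 52]   # illustrative data
    centroids, clusters = kmeans_1d(ages, 3)
    print(clusters)
    print(cluster_summary(clusters))
```

Each resulting cluster acts as one discrete category, and only its summary needs to be stored in place of the raw values.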

4. Decision tree analysis:

A decision tree is a hierarchical model used in decision support that depicts decisions and
their potential outcomes, incorporating chance events, resource expenses, and utility. It is
a non-parametric, supervised learning model built from conditional control statements, and
it is useful for both classification and regression tasks. The tree consists of a root node,
branches, internal nodes, and leaf nodes. When a tree is used for discretization, the split
points it selects on a numeric attribute (for example, by information gain against the class
labels) serve as the interval boundaries.
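
One way a tree-style split can discretize a numeric attribute is to pick the cut point with the highest information gain against the class labels. The sketch below finds a single such cut; the function names and the sample data are hypothetical, and a full tree would repeat the search inside each resulting interval.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def best_split(values, labels):
    """Return (cut_point, information_gain) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                    # no boundary between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if base - weighted > best_gain:
            best_gain, best_cut = base - weighted, cut
    return best_cut, best_gain


if __name__ == "__main__":
    ages = [22, 25, 30, 45, 50, 60]                    # illustrative data
    buys = ["no", "no", "no", "yes", "yes", "yes"]     # illustrative labels
    print(best_split(ages, buys))
```

Because the cut is chosen using the class labels, this is a supervised discretization method, unlike binning and histogram analysis.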

Example of Decision Tree

Let’s understand decision trees with the help of an example:

In this example, the tree first asks about the weather: is it sunny, cloudy, or rainy?
Depending on the answer, it moves to the next feature, which is humidity or wind. It then
checks whether the wind is strong or weak; if the wind is weak and it is rainy, the person
may go and play.

5. Correlation analysis:
Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables. It quantifies how much a change in one variable is
associated with a change in the other; it does not by itself show that one causes the
other. A high correlation points to a strong relationship between the two variables, while
a low correlation means that the variables are weakly related.
Researchers use correlation analysis to analyze quantitative data collected through research
methods like surveys and live polls for market research. They try to identify relationships,
patterns, significant connections, and trends between two variables or datasets. There is a
positive correlation between two variables when an increase in one variable is accompanied
by an increase in the other. A negative correlation means that when one variable
increases, the other decreases, and vice versa.
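
The strength of a linear relationship is commonly measured with the Pearson correlation coefficient, which ranges from -1 to +1. The notes do not name a specific coefficient, so using Pearson's r here is an assumption; the function name and sample data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    hours = [1, 2, 3, 4, 5]            # illustrative data
    score = [2, 4, 6, 8, 10]           # rises with hours: positive correlation
    errors = [10, 8, 6, 4, 2]          # falls with hours: negative correlation
    print(pearson_r(hours, score))
    print(pearson_r(hours, errors))
```

The function is undefined when either variable is constant, because the denominator becomes zero.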

8. Activities/ Case studies/related to the session

Case Study: Discretization Impact on Classification Models

Background:
Participants are provided with a dataset containing continuous variables, and they are tasked
with applying different discretization techniques such as equal width and equal frequency
discretization. They will then train classification models using the discretized data and
evaluate the impact on model performance metrics such as accuracy, precision, and recall.
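
For the evaluation step of this case study, accuracy, precision, and recall can be computed directly from true and predicted labels. The sketch below assumes a binary task with the positive class encoded as 1; the function name and the sample labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 when nothing predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0      # 0.0 when no true positives exist
    return accuracy, precision, recall


if __name__ == "__main__":
    actual = [1, 0, 1, 1, 0, 0]        # illustrative labels
    predicted = [1, 0, 0, 1, 1, 0]     # illustrative model output
    print(classification_metrics(actual, predicted))
```

Running the same function on models trained with equal-width and with equal-frequency features gives a like-for-like comparison of the two discretization choices.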

9. Examples & contemporary extracts of articles/practices to convey the idea of the session

Enhancing Privacy Preservation through Data Discretization in Healthcare Analytics:

Example:

In the field of healthcare analytics, data discretization has emerged as a key strategy for
preserving patient privacy while enabling comprehensive analysis. A recent study by
Johnson et al. (2023) showcased how the application of differential privacy techniques
combined with data discretization methods allowed for effective analysis of patient health
records, ensuring compliance with privacy regulations without compromising the utility of
the data for research and analysis purposes.

10. SAQs - Self-Assessment Questions

1. What is the primary goal of data discretization?

a) To increase data dimensionality
b) To reduce the number of data points
c) To handle continuous data
d) To remove outliers

2. How does data discretization contribute to handling missing or noisy data?

a) It removes the noisy data points
b) It reduces the impact of missing data
c) It increases the overall data complexity
d) It has no impact on missing or noisy data

11. Summary
Data discretization is a data preprocessing technique that transforms
continuous data into discrete form, enabling easier analysis and interpretation. It simplifies
complex datasets by partitioning numerical values into intervals or categories, reducing
computational complexity and noise. Discretization can also support data privacy and
security, particularly in sensitive domains such as healthcare and finance, by generalizing
values that could otherwise identify individuals. It can improve the performance of some
machine learning models by reducing overfitting and improving generalization, although
it also discards detail. Through techniques like equal width
and equal frequency discretization, it facilitates the identification of meaningful patterns
and trends in the data, enabling more informed decision-making. Moreover, it plays a
critical role in data mining tasks, including classification, clustering, and association rule
mining, by facilitating efficient data exploration and pattern recognition.

12. Terminal Questions

1) What are the different methods of data discretization?

2) Give examples of real-world applications where data discretization is commonly used.

13. Case Studies (Co Wise)

Enhancing Privacy Preservation through Data Discretization in Healthcare Analytics:

Example:

Solution:
1. Data discretization is defined as a process of converting continuous data attribute values
into a finite set of intervals with minimal loss of information and associating with each
interval some specific data value or conceptual labels.

Data Discretization is considered as a data reduction mechanism, because it diminishes data
from large domains of numeric values to a subset of categorical values.

Ex. age can be transformed to (0-10,11-20….) or to conceptual labels like youth, adult,
senior.

Methods of Data Discretization

1. Binning
2. Histogram analysis
3. Cluster analysis
4. Decision tree analysis
5. Correlation analysis

2. Data discretization is commonly employed in various real-world applications across
different domains to enable effective data analysis and decision-making. Some notable
examples of its usage include:

Healthcare: Patient health records often contain sensitive and continuous data, such
as medical test results and vital signs. Data discretization techniques are applied to
preserve patient privacy while enabling analysis for medical research and predictive
modeling.

Finance: Financial institutions use data discretization to analyze transactional data
for fraud detection and risk assessment. By discretizing transactional patterns, they
can identify irregularities and suspicious activities, enhancing security measures and
reducing financial risks.

Marketing: Customer data, including purchase history and demographic information,
is discretized to identify customer segments and behavior patterns. This facilitates
targeted marketing campaigns and personalized product recommendations,
improving customer engagement and retention.

Telecommunications: Telecom companies utilize data discretization to analyze
customer usage patterns and network performance. By discretizing network traffic
data and customer behavior, they can optimize network resource allocation and
improve service quality based on different usage categories.

Manufacturing: Data discretization is employed in quality control processes to
categorize production data and identify patterns related to product defects and
machine performance. This aids in improving production efficiency, minimizing
defects, and ensuring product quality and reliability.

Education: Educational institutions use data discretization to analyze student
performance data and identify learning patterns and trends. By discretizing academic
performance metrics, educators can personalize learning experiences and
interventions, leading to improved educational outcomes for students.

These examples illustrate the diverse applications of data discretization across various
industries, demonstrating its vital role in facilitating data analysis, pattern recognition, and
decision-making processes.

15. Glossary

Textual Annotation: The practice of adding comments, labels, or metadata to textual content to
provide additional information, context, or insights.
Labels: Short descriptions or tags attached to text to categorize or classify it, making it easier to
organize and search for.
Contextual Information: Additional data or details that surround the text, offering a better
understanding of the content's significance.

16. Reference Books:

1. Python Data Science Handbook, by Jake VanderPlas, released November 2016.
Publisher(s): O'Reilly Media, Inc. ISBN: 9781491912058

Sites and Web links:

Text and Annotation | Python Data Science Handbook (jakevdp.github.io)

17. Keywords
Discretization, Data preprocessing, Continuous data, Categorical data, Equal width
discretization, Equal frequency discretization, Supervised discretization, Unsupervised
discretization, Information gain, Clustering-based discretization, Decision tree-based
discretization
