
Data Transformation

Data transformation is the process of converting data into a suitable format for analysis, involving tasks like cleaning and normalization. Normalization adjusts numerical data to a common scale to prevent bias and improve model performance, with methods including Min-Max, Z-Score, and Decimal Scaling. Additionally, discretization by binning simplifies continuous data into categorical data, and concept hierarchy generation groups nominal data into broader categories for easier analysis.

Uploaded by

mahithavg


Data Transformation

What is Data Transformation?

Data transformation is the process of converting data from its
original format or structure into a format that is suitable for analysis,
reporting, or use in other systems. This process often involves cleaning,
organizing, and adjusting the data to meet specific requirements for
downstream applications. Data transformation can include tasks like
aggregation, filtering, encoding, and applying mathematical or statistical
operations to modify data.
Data Transformation by Normalization
• Normalization is a specific type of data transformation that
involves adjusting the values of numerical data to a common scale,
without distorting differences in the ranges of values.
• The purpose of normalization is to make sure that features
(variables) have comparable scales, which can be important when
working with machine learning models or statistical analysis.
Why Normalize Data?
• Prevent Bias: In many models (for example, distance-based methods
such as k-nearest neighbors), variables with larger ranges or values
can dominate the analysis, overshadowing other variables.
• Improve Model Performance: Normalized data allows models to
learn more efficiently because no feature outweighs the others
purely because of its scale.
Common Methods of Normalization:
1. Min-Max Normalization: This method scales the data to a specific
range, usually between 0 and 1.
X' = (X - X_{min}) / (X_{max} - X_{min})
X is the original value.
X_{min} and X_{max} are the minimum and maximum values of the
feature in the dataset.
2. Z-Score Normalization (Standardization): This technique transforms the data so
that it has a mean of 0 and a standard deviation of 1.
Z = (X - mu) / sigma
X is the original value.
mu is the mean of the feature.
sigma is the standard deviation of the feature.
3. Decimal Scaling: This method normalizes the data by moving the decimal point.
X' = X / 10^j
j is the smallest integer such that the maximum absolute value of the transformed
data is less than 1.
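The three normalization methods above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names and the sample values are hypothetical, and the Z-score version assumes the population standard deviation (the slides do not say which variant is intended).

```python
import statistics


def min_max_normalize(values):
    """Scale values to [0, 1] using (X - X_min) / (X_max - X_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def z_score_normalize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def decimal_scale(values):
    """Divide by 10^j, with j the smallest integer giving max |X'| < 1."""
    max_abs = max(abs(x) for x in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values], j


# Hypothetical sample feature, chosen only for illustration.
data = [200, 300, 400, 600, 1000]

print(min_max_normalize(data))
print(z_score_normalize(data))
print(decimal_scale(data))
```

Min-max scaling keeps the relative spacing of the values but is driven by the two extreme points, whereas Z-score scaling depends on the mean and spread of the whole feature.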
Discretization by binning

• Discretization by binning is a technique used in data preprocessing to
transform continuous numerical data into categorical data.
• This process involves dividing a continuous range of values into a set of
intervals or "bins" and then replacing the values in each bin with a
representative value (such as the bin's midpoint or the average value of the
bin).
• The purpose is to simplify the data and reduce its granularity, making it
easier to analyze, visualize, or use in machine learning models.
Types of Binning Methods:
1. Equal-width Binning:
• In equal-width binning, the range of the data is divided into intervals (bins)
of equal width (size).
• The width of each bin is calculated by dividing the difference between the
maximum and minimum values by the number of bins you want to create.
Example:
For data: [2, 5, 7, 9, 12, 15, 18, 21]
Divide it into 3 bins (equal width): width = (21 - 2) / 3 ≈ 6.33
• Bin 1: [2, 8.33) contains 2, 5, 7
• Bin 2: [8.33, 14.67) contains 9, 12
• Bin 3: [14.67, 21] contains 15, 18, 21
2. Equal-frequency Binning:
• In equal-frequency binning, each bin contains an equal number of data
points, rather than an equal range of values.
• This ensures that each bin has approximately the same number of
elements, which can be useful when dealing with skewed or imbalanced
data.
Example:
For data: [2, 5, 7, 9, 12, 15, 18, 21]
If you want 3 bins, you can divide the data into three groups:
• Bin 1: [2, 5, 7]
• Bin 2: [9, 12, 15]
• Bin 3: [18, 21]
3. Custom Binning:
• Custom binning allows you to define your own bin edges based on domain knowledge
or the specific nature of the data.
• For example, if you're categorizing ages, you might define the following bins:
• Bin 1: 0-18 (Child)
• Bin 2: 19-35 (Young Adult)
• Bin 3: 36-60 (Adult)
• Bin 4: 61+ (Senior)

4. Cluster-based Binning:
• This method groups data into clusters using clustering algorithms like K-means.
Each cluster represents a bin, and values within each cluster are assigned to the
same bin.
• Cluster-based binning is useful when data naturally forms groups or clusters.
Example:
If you have data points that naturally group into clusters like [2, 5, 7], [9, 12, 15], and
[18, 21], each of these could form a bin.
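The first three binning methods, together with smoothing by bin means, can be sketched as follows. This is an illustrative outline under stated assumptions rather than a definitive implementation: all function names are hypothetical, the age edges assume whole-number ages, and cluster-based binning is left out because it needs a clustering algorithm such as K-means.

```python
from bisect import bisect_right


def equal_width_bins(values, k):
    """Split values into k bins whose intervals all have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for x in values:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum value would land on index k, so clamp it to k - 1.
            idx = min(int((x - lo) / width), k - 1)
        bins[idx].append(x)
    return bins


def equal_frequency_bins(values, k):
    """Split sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # earlier bins take the remainder
        bins.append(ordered[start:start + size])
        start += size
    return bins


def custom_bin(value, edges, labels):
    """Return the label of the bin that value falls into.

    edges holds the lower bound of every bin except the first.
    """
    return labels[bisect_right(edges, value)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


data = [2, 5, 7, 9, 12, 15, 18, 21]
print(equal_width_bins(data, 3))
print(equal_frequency_bins(data, 3))
print(smooth_by_bin_means(equal_frequency_bins(data, 3)))

# Age groups from the custom binning example (whole-number ages assumed).
age_edges = [19, 36, 61]
age_labels = ["Child", "Young Adult", "Adult", "Senior"]
print(custom_bin(42, age_edges, age_labels))
```

On this data set the equal-width bins hold 3, 2, and 3 values while the equal-frequency bins hold 3, 3, and 2, which shows how the two methods can disagree even on a small, evenly spread sample.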
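Advantages of Binning:
• Noise Reduction: Replacing values with a bin representative smooths out small fluctuations.
• Improved Model Performance: Some models can be more accurate or more robust on binned features.
• Interpretability: A handful of labeled ranges is simpler to read and explain than raw values.
Disadvantages of Binning:
• Loss of Information: Values within a bin become indistinguishable (generalization).
• Choice of Number of Bins: The number and boundaries of bins are often chosen arbitrarily.
• Sensitive to Outliers: An extreme value stretches the range, which distorts equal-width bins in particular.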
Concept Hierarchy Generation for Nominal Data
What is Nominal Data?
Nominal data is a type of categorical data that represents distinct categories without
any inherent order or ranking.
In other words, nominal data consists of labels or names that are used to identify
categories, but there is no meaningful way to order them.
Here are some key features of nominal data:
• Categories with no order: The values are simply different from each other, but there’s
no ranking or order between them.
• No arithmetic or ordering: You can't meaningfully add, subtract, or average nominal
values, and you can’t say one category is "greater" or "less" than another. Only
equality checks and frequency counts (such as finding the most common category) make sense.
• Labels: Nominal data is often used to label things in different categories.
Examples of Nominal Data:
1. Colors: Red, Blue, Green, Yellow
These are simply different colors, and there is no inherent order like “Red > Blue.”
2. Countries: USA, Canada, Germany, India
Countries are distinct categories, but there’s no order like “USA is greater than Canada” in a
meaningful way for nominal data.
3. Fruits: Apple, Banana, Cherry, Mango
Fruits are just different types of food, and there's no ranking or order in how they are
categorized.
4. Gender: Male, Female, Non-binary
Gender categories are different from each other, but there's no ranking order in the nominal
sense.
5. Types of Animals: Dog, Cat, Elephant, Tiger
These are distinct categories with no order.
• Concept Hierarchy Generation for nominal data refers to grouping those
categories into higher-level concepts or abstractions. This helps to generalize and
simplify the data.
• For example, imagine we have a dataset with the following nominal data about
animals:
Animal
------
Dog
Cat
Elephant
Tiger

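We can group these animals into broader categories (higher-level concepts)
based on certain properties, like "Mammals" or "Wild Animals". These two labels
are used loosely for illustration: biologically, elephants and tigers are mammals
too, so a strict hierarchy would use non-overlapping groups such as
"Domestic Animals" and "Wild Animals".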
Concept Hierarchy Example for Nominal Data:

1. Original Nominal Data:
Animal
------
Dog
Cat
Elephant
Tiger

2. Generated Concept Hierarchy (Grouping by Categories):
Animal
├── Mammals
│   ├── Dog
│   └── Cat
└── Wild Animals
    ├── Elephant
    └── Tiger

3. Generalized Data: After applying the hierarchy, we can replace
the specific animal names with their generalized categories:

Generalized Animal
-------------------
Mammals
Mammals
Wild Animals
Wild Animals
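The generalization step shown in the table can be sketched as a simple lookup. This is a minimal illustration, assuming the hierarchy is stored as a child-to-parent dictionary; the names `HIERARCHY` and `generalize`, and the "Other" fallback for unlisted values, are hypothetical choices rather than part of the original example.

```python
from collections import Counter

# One level of the concept hierarchy: specific animal -> broader category.
HIERARCHY = {
    "Dog": "Mammals",
    "Cat": "Mammals",
    "Elephant": "Wild Animals",
    "Tiger": "Wild Animals",
}


def generalize(values, hierarchy, default="Other"):
    """Replace each value with its parent concept.

    Values missing from the hierarchy fall back to `default` (an assumption).
    """
    return [hierarchy.get(v, default) for v in values]


animals = ["Dog", "Cat", "Elephant", "Tiger"]
generalized = generalize(animals, HIERARCHY)
print(generalized)

# Counting at the higher level gives one tally per category instead of per animal.
print(Counter(generalized))
```

Storing the hierarchy as a plain mapping keeps the example short; a deeper hierarchy could apply the same lookup repeatedly, once per level.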
Why is Concept Hierarchy Important?
• Simplifies the Data: It helps in reducing complexity by grouping detailed categories into
broader, more generalized concepts.
• Improves Understanding: It makes it easier to understand patterns or trends in the data,
because you can analyze the data at a higher level (e.g., analyzing "Mammals" rather than
individual animals like "Dog" and "Cat").
• Data Mining Efficiency: By generalizing the data, algorithms can work more efficiently because
they don’t have to handle every small category separately.
• Better Insights: It helps to see relationships that might not be obvious when working with
individual categories.
Thank You
