DM 02 04 Data Transformation

The document outlines various data transformation tasks essential for data mining, including normalization, attribute construction, aggregation, attribute subset selection, discretization, and generalization. It details methods for normalization such as min-max, z-score, and decimal scaling, as well as techniques for attribute subset selection and discretization. Each section provides insights into the importance and application of these tasks in preparing data for effective analysis and mining.

Uploaded by

Jithin S

Data Mining

Data Preprocessing

Data Transformation

Outline
• Introduction
• Normalization
• Attribute Construction
• Aggregation
• Attribute Subset Selection
• Discretization
• Generalization
• References

Introduction

• Data transformation
– the data are transformed into forms appropriate for mining.
• Data transformation tasks:
– Normalization
– Attribute construction
– Aggregation
– Attribute Subset Selection
– Discretization
– Generalization

Normalization

• An attribute is normalized by scaling its values so that
they fall within a small specified range, such as 0.0 to
1.0.
• Normalization is particularly useful for classification
algorithms involving
– neural networks
– distance measurements such as nearest-neighbor
classification and clustering.
• If using the neural network backpropagation algorithm
for classification mining, normalizing the input values
for each attribute measured in the training instances
will help speed up the learning phase.
• For distance-based methods, normalization helps
prevent attributes with initially large ranges (e.g.,
income) from outweighing attributes with initially
smaller ranges (e.g., binary attributes).
• Normalization methods
– Min-max normalization
– z-score normalization
– Normalization by decimal scaling

Min-max Normalization
• Min-max normalization
– performs a linear transformation on the original data.
• Suppose that:
– minA and maxA are the minimum and maximum values of
an attribute, A.
• Min-max normalization maps a value, v, of A to v′ in
the range [new_minA, new_maxA] by computing:

$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

Example: Min-max Normalization

• Suppose the income attribute ranges from $12,000 to $98,000 and is
normalized to [0.0, 1.0].
• Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
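The min-max mapping can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides; the function name `min_max_normalize` and its parameter names are assumptions chosen for readability. It reuses the income figures from the example, with 73,600 as the value being mapped.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # A constant attribute has no range to rescale.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: range $12,000 to $98,000, target range [0.0, 1.0].
income_scaled = min_max_normalize(73_600, 12_000, 98_000)
print(round(income_scaled, 3))  # 0.716
```

A value outside [min_a, max_a] maps outside the target range, so unseen data may need clipping or re-fitting of the minimum and maximum.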

z-score normalization

• In z-score normalization (or zero-mean normalization)
– the values for an attribute, A, are normalized based on the
mean (Ā) and standard deviation (σA) of A.
• A value, v, of A is normalized to v′ by computing

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

Example: z-score Normalization

• Let Ā = 54,000 and σA = 16,000 for the attribute income.
• With z-score normalization, a value of $73,600 for
income is transformed to:

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
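A short Python sketch of z-score normalization follows. The function names are assumptions, the sample list `values` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumed choice; the slides do not say whether the population or sample form is intended.

```python
import statistics


def z_score(v, mean_a, std_a):
    """Normalize a single value given the attribute's mean and standard deviation."""
    if std_a == 0:
        raise ValueError("standard deviation must be non-zero")
    return (v - mean_a) / std_a


def z_score_normalize(values):
    """Normalize every value of an attribute using its own mean and population std."""
    mean_a = statistics.mean(values)
    std_a = statistics.pstdev(values)  # assumption: population standard deviation
    return [z_score(v, mean_a, std_a) for v in values]


# Income example: mean 54,000 and standard deviation 16,000.
income_z = z_score(73_600, 54_000, 16_000)
print(round(income_z, 3))  # 1.225

# Hypothetical attribute values with mean 5 and population std 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score_normalize(values))
```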

Decimal Scaling
• Normalization by decimal scaling
– normalizes by moving the decimal point of values
of attribute A.
– The number of decimal places moved depends on
the maximum absolute value of A.
– A value, v, of A is normalized to v′ by computing

$$v' = \frac{v}{10^j}$$

– where j is the smallest integer such that max(|v′|) < 1.

Example: Decimal Scaling

• Suppose that the recorded values of A range from -986 to 917.
• The maximum absolute value of A is 986.
• To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3), so that -986 normalizes
to -0.986 and 917 normalizes to 0.917.
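Decimal scaling can be sketched in Python as below. The function names are assumptions, and the sketch restricts j to non-negative integers, which is a simplification: for attributes whose values are all already below 1 in absolute value it returns j = 0 and leaves them unchanged.

```python
def decimal_scaling_exponent(values):
    """Return the smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j and return the scaled values together with j."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values], j


# Extremes of the example attribute: -986 and 917.
scaled, j = decimal_scale([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```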

Normalization

• Note that normalization can change the original data
quite a bit, especially the z-score method.

Attribute Construction

• Attribute construction (feature construction)
– new attributes are constructed from the given attributes and
added in order to help improve the accuracy and
understanding of structure in high-dimensional data.
• Example
– we may wish to add the attribute area based on the
attributes height and width.
• By combining attributes, attribute construction can discover
missing information about the relationships between attributes.
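As a small illustration of the area example, the sketch below derives a new attribute from two existing ones. The record layout (a dictionary per row) and the sample measurements are hypothetical.

```python
def add_area(record):
    """Return a copy of the record with a constructed 'area' attribute."""
    extended = dict(record)  # copy so the original record is left untouched
    extended["area"] = record["height"] * record["width"]
    return extended


# Hypothetical records with only the given attributes height and width.
records = [
    {"height": 2.0, "width": 3.0},
    {"height": 1.5, "width": 4.0},
]
extended_records = [add_area(r) for r in records]
print(extended_records[0])  # {'height': 2.0, 'width': 3.0, 'area': 6.0}
```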

Data Aggregation

• Aggregation is the method of storing and presenting data in a
summary format.
• This is a crucial step, since the accuracy of data
analysis insights is highly dependent on the quantity
and quality of the data used.
• Example: sales data for a given branch of AllElectronics for the
years 2002 to 2004. In the original figure, the left side shows the
sales per quarter; on the right, the data are aggregated to provide
the annual sales.
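Aggregating quarterly sales into annual totals can be sketched as follows. The sales amounts are invented for illustration (the figure's actual numbers are not available in this text), and the tuple layout `(year, quarter, amount)` is an assumption.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2002, "Q1", 200), (2002, "Q2", 400), (2002, "Q3", 350), (2002, "Q4", 550),
    (2003, "Q1", 300), (2003, "Q2", 450), (2003, "Q3", 400), (2003, "Q4", 600),
    (2004, "Q1", 350), (2004, "Q2", 500), (2004, "Q3", 450), (2004, "Q4", 700),
]


def aggregate_annual(rows):
    """Sum the quarterly amounts for each year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_annual(quarterly_sales)
print(annual_sales)  # {2002: 1500, 2003: 1750, 2004: 2000}
```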

Attribute Subset Selection

• Why attribute subset selection
– Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or
redundant.
• For example,
– if the task is to classify customers as to whether or not they
are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike
attributes such as age or music_taste.

• A domain expert can be used to pick out some of the useful
attributes
– Sometimes this can be a difficult and time-consuming task,
especially when the behavior of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant
attributes can result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or
redundant attributes can slow down the mining
process.

• Attribute subset selection (feature selection):
– Reduce the data set size by removing irrelevant or
redundant attributes
• Goal:
– select a minimum set of features (attributes) such that the
probability distribution of different classes given the values
for those features is as close as possible to the original
distribution given the values of all features
– It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier to
understand.

• How can we find a ‘good’ subset of the original
attributes?
– For n attributes, there are 2^n possible subsets.
– An exhaustive search for the optimal subset of attributes
can be prohibitively expensive, especially as n increases.
– Heuristic methods that explore a reduced search space are
commonly used for attribute subset selection.
– These methods are typically greedy in that, while searching
through attribute space, they always make what looks to be
the best choice at the time.
– Such greedy methods are effective in practice and may
come close to estimating an optimal solution.

• Heuristic methods:
– Step-wise forward selection
– Step-wise backward elimination
– Combining forward selection and backward elimination
– Decision-tree induction

• The “best” (and “worst”) attributes are typically determined using:
– tests of statistical significance, which assume that the attributes are
independent of one another.
– the information gain measure used in building decision trees for
classification.
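The information gain measure mentioned above can be sketched for categorical attributes as follows. The function names, the dictionary-per-row layout, and the four sample rows are assumptions made for illustration; the attribute names echo the AllElectronics example but the values are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the class minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical customers: music_taste predicts the class, phone_parity does not.
rows = [
    {"music_taste": "pop", "phone_parity": "even", "buys": "yes"},
    {"music_taste": "pop", "phone_parity": "odd", "buys": "yes"},
    {"music_taste": "jazz", "phone_parity": "even", "buys": "no"},
    {"music_taste": "jazz", "phone_parity": "odd", "buys": "no"},
]
gain_taste = information_gain(rows, "music_taste", "buys")
gain_phone = information_gain(rows, "phone_parity", "buys")
print(gain_taste, gain_phone)  # 1.0 0.0
```

A selection procedure would rank attributes by this score and treat the highest-scoring attribute as the "best" candidate at that step.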

• Stepwise forward selection:
– The procedure starts with an empty set of attributes as the reduced set.
– First: the best single attribute is picked.
– Next: at each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
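A greedy sketch of stepwise forward selection is shown below. The scoring function is supplied by the caller, and the stopping rule (stop when no remaining attribute improves the score) is an assumption, since the slides do not state one. The `USEFULNESS` weights and `toy_score` are hypothetical stand-ins for a real evaluation measure such as information gain or a significance test.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Best remaining attribute when added to the current set.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # assumed stopping rule: no improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical per-attribute usefulness with a small penalty per attribute kept.
USEFULNESS = {"age": 0.5, "music_taste": 0.3, "telephone_number": 0.0}


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.05 * len(subset)


chosen = forward_selection(list(USEFULNESS), toy_score)
print(chosen)  # ['age', 'music_taste']
```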

• Stepwise backward elimination:
– The procedure starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the
set.
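Stepwise backward elimination can be sketched in the same greedy style. Here the "worst" attribute is taken to be the one whose removal leaves the highest-scoring subset, and elimination stops once any removal would lower the score; both choices are assumptions. The `USEFULNESS` weights and `toy_score` are hypothetical.

```python
def backward_elimination(attributes, score):
    """Greedily drop the attribute whose removal hurts score(subset) the least."""
    selected = list(attributes)
    best_score = score(selected)
    while selected:
        # Worst attribute: removing it gives the best remaining subset.
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced = [x for x in selected if x != worst]
        reduced_score = score(reduced)
        if reduced_score < best_score:
            break  # assumed stopping rule: every removal lowers the score
        selected, best_score = reduced, reduced_score
    return selected


# Hypothetical per-attribute usefulness with a small penalty per attribute kept.
USEFULNESS = {"age": 0.5, "music_taste": 0.3, "telephone_number": 0.0}


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.05 * len(subset)


kept = backward_elimination(list(USEFULNESS), toy_score)
print(kept)  # ['age', 'music_taste']
```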

• Combining forward selection and backward elimination:
– The stepwise forward selection and backward elimination
methods can be combined
– At each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.

• Decision tree induction:
– Decision tree algorithms, such as ID3, C4.5, and CART,
were originally intended for classification.
– Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
– When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
– All attributes that do not appear in the tree are assumed to
be irrelevant.

Discretization

• Data Discretization:
– Dividing the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce the number of values for a given continuous
attribute
– Some classification algorithms only accept categorical
attributes.
– This leads to a concise, easy-to-use, knowledge-level
representation of mining results.

• Discretization techniques can be categorized based on
whether they use class information, as:
– Supervised discretization
◆ the discretization process uses class information
– Unsupervised discretization
◆ the discretization process does not use class information

• Discretization techniques can be categorized based on
the direction in which they proceed, as:
– Top-down
◆ starts by finding one or a few points (called split
points or cut points) to split the entire attribute range, and then
repeats this recursively on the resulting intervals.
– Bottom-up
◆ starts by considering all of the continuous values as potential split
points,
◆ removes some by merging neighboring values to form intervals,
and then recursively applies this process to the resulting intervals.

• Typical methods:
– Binning
◆ top-down split, unsupervised
– Clustering analysis
◆ either top-down split or bottom-up merge, unsupervised
– Interval merging by χ² analysis
◆ supervised, bottom-up merge

• All the methods can be applied recursively

Binning
• Binning
– The sorted values are distributed into a number of buckets,
or bins, and each bin value is then replaced by the bin mean
or median.
– Binning is a top-down splitting technique based on a
specified number of bins.
– Binning is an unsupervised discretization technique,
because it does not use class information
• Binning methods:
– Equal-width (distance) partitioning
– Equal-depth (frequency) partitioning

Equal-width (distance) partitioning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate
presentation
– Skewed data is not handled well

Example: Equal-width (distance) partitioning

• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• W = (B – A)/N = (34 – 4) / 3 = 10
– Bin 1: [4, 14], Bin 2: (14, 24], Bin 3: (24, 34]
• Equal-width (distance) partitioning:
– Bin 1: 4, 8
– Bin 2: 15, 21, 21, 24
– Bin 3: 25, 28, 34
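The equal-width example can be reproduced with the Python sketch below. The function name is an assumption, and so is the boundary convention: bins are treated as the intervals [4, 14], (14, 24], (24, 34], chosen so that the value 24 lands in Bin 2 as in the bin contents listed above.

```python
import math


def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are identical; the width would be zero")
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Right-closed intervals; the minimum value goes into the first bin.
        index = math.ceil((v - low) / width) - 1
        index = min(max(index, 0), n_bins - 1)
        bins[index].append(v)
    return width, bins


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width, price_bins = equal_width_bins(prices, 3)
print(width)       # 10.0
print(price_bins)  # [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
```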

Equal-depth (frequency) partitioning

• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling

Example: Equal-depth (frequency) partitioning

• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• Equal-depth (frequency) partitioning:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
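A sketch of equal-depth partitioning, followed by the bin-mean replacement described in the Binning slide, is given below. The function names are assumptions, as is the rule for uneven splits (earlier bins receive the extra values).

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    if n_bins < 1 or n_bins > len(ordered):
        raise ValueError("n_bins must be between 1 and the number of values")
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # earlier bins take the remainder
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth_bins = equal_depth_bins(prices, 3)
print(depth_bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(depth_bins))  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```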

Cluster Analysis
• Cluster analysis is a popular data discretization
method.
• A clustering algorithm can be applied to discretize a
numerical attribute, A, by partitioning the values of A
into clusters or groups.
• Clustering takes the distribution of A into
consideration, as well as the closeness of data points,
and therefore is able to produce high-quality
discretization results.
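As one possible illustration, the sketch below discretizes a numerical attribute with a simple one-dimensional k-means. The slides do not name a specific clustering algorithm, so k-means, the function name, and the initialization (evenly spaced positions in the sorted data) are all assumptions. The price values from the binning examples are reused as input.

```python
def kmeans_1d(values, k, max_iter=100):
    """Group one-dimensional values into k clusters with a basic k-means loop."""
    ordered = sorted(values)
    n = len(ordered)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of values")
    # Assumed initialization: centers taken at evenly spaced sorted positions.
    centers = [float(ordered[i * (n - 1) // (k - 1)]) for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return clusters


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
price_clusters = kmeans_1d(prices, 3)
print(price_clusters)  # [[4, 8], [15, 21, 21, 24, 25], [28, 34]]
```

Each resulting cluster covers a contiguous range of values, so the cluster boundaries can serve as interval cut points.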

Interval Merge by χ² Analysis
• ChiMerge:
– It is a bottom-up method
– Find the best neighboring intervals and merge them to form
larger intervals recursively
– The method is supervised in that it uses class information.
– The basic notion is that for accurate discretization, the
relative class frequencies should be fairly consistent within
an interval.
– Therefore, if two adjacent intervals have a very similar
distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
– ChiMerge treats intervals as discrete categories

• The ChiMerge method:
– Initially, each distinct value of a numerical attribute A is
considered to be one interval
– χ² tests are performed for every pair of adjacent intervals
– Adjacent intervals with the least χ² values are merged
together, since low χ² values for a pair indicate similar class
distributions
– This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level, max-
interval, max inconsistency, etc.)
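The merge loop can be sketched in Python as follows. This is a simplified illustration, not the published ChiMerge implementation: the function names, the representation of an interval as `(lower_bound, class_counts)`, the sample intervals, and the threshold of 2.706 (the χ² critical value for one degree of freedom at the 0.10 significance level) are all assumptions.

```python
def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = sorted(set(counts_a) | set(counts_b))
    rows = [[c.get(k, 0) for k in classes] for c in (counts_a, counts_b)]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(len(classes))]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(classes)):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2


def chimerge(intervals, threshold):
    """Repeatedly merge the adjacent pair with the lowest chi-square value."""
    intervals = [(low, dict(counts)) for low, counts in intervals]
    while len(intervals) > 1:
        scores = [chi_square(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break  # stopping criterion: remaining neighbors differ significantly
        low, left = intervals[i]
        right = intervals[i + 1][1]
        merged = {k: left.get(k, 0) + right.get(k, 0) for k in set(left) | set(right)}
        intervals[i:i + 2] = [(low, merged)]
    return intervals


# Hypothetical start: one interval per distinct value, with class counts.
start = [(1, {"a": 1}), (3, {"a": 2}), (7, {"b": 1}), (9, {"b": 2})]
merged_intervals = chimerge(start, threshold=2.706)
print(merged_intervals)  # [(1, {'a': 3}), (7, {'b': 3})]
```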

Generalization

• Generalization is the generation of concept hierarchies
for categorical data
• Categorical attributes have a finite (but possibly large)
number of distinct values, with no ordering among the
values.
• Examples include
– geographic location,
– job category, and
– item type.

Example: Generalization
• A relational database or a dimension location of a data
warehouse may contain the following group of
attributes: street, city, province or state, and country.
• A user or expert can easily define a concept hierarchy
by specifying ordering of the attributes at the schema
level.
• A hierarchy can be defined by specifying the total
ordering among these attributes at the schema level,
such as:
◆ street < city < province or state < country
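The schema-level ordering can be used to generalize records to a higher level, as in the sketch below. The attribute key `province_or_state`, the function name, and the sample addresses are hypothetical.

```python
from collections import Counter

# Schema-level total ordering, from lowest to highest level.
LOCATION_LEVELS = ["street", "city", "province_or_state", "country"]


def generalize(record, level):
    """Keep only the attributes at or above the requested hierarchy level."""
    start = LOCATION_LEVELS.index(level)
    return tuple(record[name] for name in LOCATION_LEVELS[start:])


# Hypothetical location records.
records = [
    {"street": "12 Main St", "city": "Vancouver", "province_or_state": "BC", "country": "Canada"},
    {"street": "98 Oak Ave", "city": "Vancouver", "province_or_state": "BC", "country": "Canada"},
    {"street": "5 Lake Rd", "city": "Chicago", "province_or_state": "IL", "country": "USA"},
]

city_counts = Counter(generalize(r, "city") for r in records)
country_counts = Counter(generalize(r, "country") for r in records)
print(city_counts[("Vancouver", "BC", "Canada")])  # 2
print(country_counts[("USA",)])                    # 1
```

Replacing the street with its city, or the city with its country, reduces the number of distinct values while keeping the records comparable at the chosen level.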

Data Transformation Tasks
• Normalization
– the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0 or 0.0 to 1.0
• Attribute construction (or feature construction)
– new attributes are constructed and added from the given set
of attributes to help the mining process.
• Aggregation
– summary or aggregation operations are applied to the data.
– For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts.

• Discretization
– Dividing the range of a continuous attribute into intervals
– For example, values for numerical attributes, like age, may
be mapped to higher-level concepts, like youth, middle-
aged, and senior.
• Generalization
– where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept
hierarchies.
– For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.

References

• J. Han, M. Kamber, Data Mining: Concepts and Techniques,
Elsevier Inc. (2006). (Chapter 2)
