Data Transformation

This document discusses strategies for data transformation and discretization. It describes techniques for smoothing, attribute construction, aggregation, normalization, and discretization of data. For discretization, it covers top-down and bottom-up approaches, as well as supervised and unsupervised methods. Specific normalization techniques explained are min-max normalization, z-score normalization, and decimal scaling. Discretization by binning is also covered.

Uploaded by

Avudaiappan S

DATA TRANSFORMATION & DATA DISCRETIZATION

DATA TRANSFORMATION STRATEGIES

– Smoothing :
• To remove noise.
• Techniques : binning, regression, and clustering.
– Attribute Construction :
• Also known as feature construction.
• New attributes are constructed from given attributes.
– Aggregation
• Summary or aggregation operations applied to data.
• Typically used in constructing data cube for data analysis at multiple abstraction levels.
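Attribute construction and aggregation can be illustrated with a minimal Python sketch. The records, the field names (`height`, `width`, `area`) and the sales figures below are hypothetical, chosen only for illustration; they do not come from the slides.

```python
from collections import defaultdict

# Attribute construction: derive a new attribute (area) from given ones.
# The records and field names here are made up for illustration.
plots = [
    {"height": 2.0, "width": 3.0},
    {"height": 4.0, "width": 1.5},
]
for plot in plots:
    plot["area"] = plot["height"] * plot["width"]

# Aggregation: roll quarterly sales up to annual totals, the kind of
# summary a data cube stores at a coarser abstraction level.
quarterly_sales = [
    (2023, "Q1", 100.0), (2023, "Q2", 150.0),
    (2023, "Q3", 120.0), (2023, "Q4", 130.0),
    (2024, "Q1", 110.0), (2024, "Q2", 160.0),
]
annual_sales = defaultdict(float)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print([plot["area"] for plot in plots])
print(dict(annual_sales))
```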
DATA TRANSFORMATION STRATEGIES

– Normalization :
• Attribute data are scaled so as to fall within a smaller range.
– Discretization :
• raw values of a numeric attribute are replaced by interval labels or conceptual labels.
– Concept hierarchy generation for nominal data :
• Attributes can be generalized to higher-level concepts.
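As a rough sketch of discretization, the snippet below replaces raw ages with either interval labels or conceptual labels. The cut points (18 and 65) and the label names are assumptions made for this example, not values given in the slides.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
CUT_POINTS = [18, 65]
INTERVAL_LABELS = ["[0, 18)", "[18, 65)", "65+"]
CONCEPT_LABELS = ["youth", "adult", "senior"]

def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the interval.
    return labels[bisect_right(cut_points, value)]

ages = [7, 18, 42, 65, 80]
print([discretize(a, CUT_POINTS, INTERVAL_LABELS) for a in ages])
print([discretize(a, CUT_POINTS, CONCEPT_LABELS) for a in ages])
```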
DATA DISCRETIZATION

Based on which direction it proceeds :

– Top-down (splitting): finds one or a few split points that divide the entire attribute range, and then repeats this recursively on the resulting intervals.
– Bottom-up (merging): starts by treating all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals.
Whether class information is used :
– Supervised - The discretization process uses class information
– Unsupervised - The discretization process does not use class information
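The two directions can be contrasted with a small unsupervised sketch. The splitting rule (cut at the midpoint of the range) and the merging rule (merge the two neighbors with the smallest gap) are assumptions picked to keep the example short; the slides do not prescribe a particular criterion, and supervised methods would use class information instead.

```python
def top_down_split(values, max_depth):
    """Top-down: split the range at its midpoint, then recurse on each part.

    Assumes values is non-empty. The midpoint rule is an illustrative choice.
    """
    ordered = sorted(values)
    if max_depth == 0 or ordered[0] == ordered[-1]:
        return [ordered]
    mid = (ordered[0] + ordered[-1]) / 2
    left = [v for v in ordered if v <= mid]
    right = [v for v in ordered if v > mid]
    return top_down_split(left, max_depth - 1) + top_down_split(right, max_depth - 1)

def bottom_up_merge(values, k):
    """Bottom-up: start with one interval per distinct value, then repeatedly
    merge the two neighboring intervals separated by the smallest gap
    until only k intervals remain. The gap rule is an illustrative choice.
    """
    intervals = [[v] for v in sorted(set(values))]
    while len(intervals) > k:
        gaps = [intervals[i + 1][0] - intervals[i][-1]
                for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
    return intervals

data = [1, 2, 3, 10, 11, 12, 30, 31]  # made-up values
print(top_down_split(data, max_depth=2))
print(bottom_up_merge(data, k=3))
```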
Data Transformation by Normalization
– To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.
– The data should fall within a smaller or common range such as [−1, 1] or [0.0, 1.0].
– It gives all attributes an equal weight.
Methods of Data Normalization

1. Min-max normalization
2. Z-score normalization
3. Normalization by decimal scaling.
Min-max normalization

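– Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, $v_i$, of A to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing:

$$v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
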
Example

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0].
By min-max normalization, a value of $73,600 for income is transformed to:

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.716$$

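The min-max mapping can be sketched in a few lines of Python. The function name `min_max_normalize` and the three-element income list are assumptions made for illustration; only the $12,000, $73,600 and $98,000 figures come from the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The formula divides by (max - min), so a constant attribute
        # cannot be rescaled this way.
        raise ValueError("min-max normalization needs at least two distinct values")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12_000, 73_600, 98_000]
normalized = min_max_normalize(incomes)
print([round(v, 3) for v in normalized])  # [0.0, 0.716, 1.0]
```

Note that a future value outside the original [min, max] range would map outside the target range, so the bounds may need to be recomputed when new data arrives.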
z-score normalization (or zero-mean normalization)

– The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, $v_i$, of A is normalized to $v_i'$ by computing:

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$$

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of A, respectively.

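Also, the standard deviation can be replaced by the mean absolute deviation of A,

$$s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right)$$

so that z-score normalization using the mean absolute deviation is

$$v_i' = \frac{v_i - \bar{A}}{s_A}$$

(The effect of outliers is reduced, because the deviations from the mean are not squared.)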

Example

Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively.
With z-score normalization, a value of $73,600 for income is transformed to:

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

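A short Python sketch of both z-score variants follows. It assumes the population standard deviation (`statistics.pstdev`), since the slides do not say whether the population or sample form is meant, and the five-element list is made-up data; the function names are likewise hypothetical.

```python
from statistics import fmean, pstdev

def z_score(value, mean, spread):
    """Normalize one value given the attribute's mean and a spread measure."""
    return (value - mean) / spread

def z_score_normalize(values, use_mean_abs_dev=False):
    """Normalize a list using the standard deviation or, optionally,
    the mean absolute deviation, which is less sensitive to outliers."""
    mean = fmean(values)
    if use_mean_abs_dev:
        spread = fmean([abs(v - mean) for v in values])
    else:
        spread = pstdev(values)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("all values are equal; z-score is undefined")
    return [z_score(v, mean, spread) for v in values]

# Income example: mean $54,000, standard deviation $16,000.
print(z_score(73_600, 54_000, 16_000))  # 1.225

sample = [20, 30, 40, 50, 60]  # made-up data
z_std = z_score_normalize(sample)
z_mad = z_score_normalize(sample, use_mean_abs_dev=True)
print([round(z, 3) for z in z_std])
print([round(z, 3) for z in z_mad])
```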
Decimal Scaling

– Normalizes by moving the decimal point of the values of attribute A.

– The number of decimal places moved depends on the maximum absolute value of A. A value, $v_i$, of A is normalized to $v_i'$ by computing:

$$v_i' = \frac{v_i}{10^j}$$

where j is the smallest integer such that $\max(|v_i'|) < 1$.

Example

Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986.
To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3)
so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
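Decimal scaling can be sketched as a search for the smallest exponent that brings every absolute value below 1. The function name `decimal_scale` is hypothetical, and the sketch assumes j is non-negative, so values already below 1 in absolute value are left unchanged.

```python
def decimal_scale(values):
    """Return (j, scaled_values), dividing every value by 10**j.

    j is the smallest non-negative integer such that the largest
    absolute scaled value is below 1 (non-negativity is an assumption).
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```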
Discretization By Binning

– It is a top-down splitting technique based on a specified number of bins.

– It is an unsupervised discretization technique, since it does not use class information.
– Used for data reduction and concept hierarchy generation.
– Attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.
– Sensitive to the user-specified number of bins and to the presence of outliers.
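The following Python sketch shows equal-width binning, equal-frequency binning, and smoothing by bin means. The price list and the choice of three bins are illustrative assumptions, as are the function names.

```python
from statistics import fmean

def equal_width_bins(values, k):
    """Partition values into k bins that each span the same width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("equal-width binning needs at least two distinct values")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Partition sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean (empty bins are skipped)."""
    return [[fmean(b)] * len(b) for b in bins if b]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
print(smooth_by_bin_means(equal_frequency_bins(prices, 3)))
```

With this data, the equal-width bins hold different numbers of values, while the equal-frequency bins each hold three; a single extreme value would stretch the equal-width bins and leave most of them nearly empty, which is the outlier sensitivity noted above.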
THANK YOU
