Data Transformation and Standardization
Data Transformation
Data transformation in data mining refers to the process of converting raw data into
a format that is suitable for analysis and modeling. The goal of data transformation
is to prepare the data for data mining so that it can be used to extract useful insights
and knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and
missing values in the data.
2. Data integration: Combining data from multiple sources, such as
databases and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such
as between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a
subset of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories
or bins.
6. Data aggregation: Combining data at different levels of granularity, such
as by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it helps to
ensure that the data is in a format suitable for analysis and modeling, and that it is
free of errors and inconsistencies. Data transformation can also improve the
performance of data mining algorithms by reducing the dimensionality of the data
and by scaling the data to a common range of values. The first two steps, cleaning
and integration, are sketched below.
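As a minimal, hypothetical sketch of cleaning and integration using pandas (the
column names and values are invented for illustration):

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, None, 29],   # one missing value to clean
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [120.0, 80.0, 200.0],
})

# Data cleaning: fill the missing age with the column median
customers["age"] = customers["age"].fillna(customers["age"].median())

# Data integration: combine the two sources on a shared key
combined = customers.merge(orders, on="customer_id", how="left")
print(combined)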
The data are transformed in ways that make them suitable for mining. Data
transformation involves the following steps:
1. Smoothing: This is a process used to remove noise from the dataset using some
algorithm. It highlights the important features present in the dataset and helps in
predicting patterns. When data are collected, they can be manipulated to eliminate
or reduce variance or other forms of noise. The idea behind data smoothing is that
it can identify simple changes and thereby help predict trends and patterns. This
helps analysts or traders who need to look at a lot of data, which can often be
difficult to digest, to find patterns they would not otherwise see; a moving-average
sketch follows below.
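One common smoothing technique (one of several; the series values and window
size here are arbitrary assumptions) is a moving average, sketched with pandas:

import pandas as pd

noisy = pd.Series([10, 14, 9, 15, 11, 16, 12, 18, 13, 19])

# A centered 3-point rolling mean damps short-term fluctuations
smoothed = noisy.rolling(window=3, center=True).mean()
print(smoothed)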
2. Aggregation: Data aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources, which are
integrated into a single data-analysis description. This is a crucial step, since the
accuracy of data-analysis insights is highly dependent on the quantity and quality
of the data used: gathering accurate data of high quality and in large enough
quantity is necessary to produce relevant results. Aggregated data are useful for
everything from decisions concerning financing or business strategy to pricing,
operations, and marketing strategies. For example, sales data may be aggregated to
compute monthly and annual totals, as in the sketch below.
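A minimal sketch of that sales example, assuming hypothetical dates and amounts:

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20",
                            "2023-02-03", "2024-02-25"]),
    "amount": [100.0, 150.0, 200.0, 50.0],
})

# Monthly totals: group the transaction-level records by calendar month
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Annual totals: group the same records by year
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)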
3. Discretization: This is the process of transforming continuous data into a set of
small intervals. Most data mining activities in the real world involve continuous
attributes, yet many existing data mining frameworks are unable to handle them.
Even when a data mining task can manage a continuous attribute, its efficiency can
be significantly improved by replacing the continuous attribute with its discrete
values. For example, ages may be grouped into intervals (1-10, 11-20, ...) or into
categories (young, middle age, senior), as sketched below.
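A sketch of that age example using pandas binning (the bin edges are assumptions):

import pandas as pd

ages = pd.Series([15, 23, 38, 47, 61, 72])

# Map the continuous ages into three labeled intervals
categories = pd.cut(ages, bins=[0, 30, 55, 120],
                    labels=["young", "middle age", "senior"])
print(categories)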
4. Attribute Construction: New attributes are created from the given set of
attributes and applied to assist the mining process. This simplifies the original data
and makes the mining more efficient, as in the sketch below.
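For instance, an area attribute can be constructed from hypothetical width and
length attributes:

import pandas as pd

rooms = pd.DataFrame({"width_m": [4.0, 3.5], "length_m": [5.0, 6.0]})

# Attribute construction: derive a new attribute from existing ones
rooms["area_m2"] = rooms["width_m"] * rooms["length_m"]
print(rooms)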
5. Generalization: This converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, age initially in numerical form
(22, 25) is converted into a categorical value (young, old). Similarly, categorical
attributes such as house addresses may be generalized to higher-level concepts
such as town or country, as in the sketch below.
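A minimal sketch of generalizing addresses with a hand-built concept hierarchy
(the addresses and the mapping are invented for illustration):

import pandas as pd

addresses = pd.Series(["221B Baker Street, London",
                       "350 5th Avenue, New York"])

# Concept hierarchy: map each low-level address up to its city
hierarchy = {
    "221B Baker Street, London": "London",
    "350 5th Avenue, New York": "New York",
}
cities = addresses.map(hierarchy)
print(cities)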
6. Normalization: Data normalization involves converting all data variables into a
given range. Techniques used for normalization include:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that min_A and max_A are the minimum and maximum values of an
attribute A, and that [new_min_A, new_max_A] is the target range. A value v of A
is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value
you get after normalizing the old value.
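A direct sketch of the formula with NumPy, assuming a target range of [0, 1]:

import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0])
new_min, new_max = 0.0, 1.0   # assumed target range

v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalized)   # 20 -> 0.0, 80 -> 1.0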
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute
A are normalized based on the mean of A and its standard deviation.
A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
where mean_A and std_A are the mean and standard deviation of A.
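The same computation as a NumPy sketch (using the population standard
deviation, which is NumPy's default):

import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0])

# Subtract the mean and divide by the standard deviation
z_scores = (values - values.mean()) / values.std()
print(z_scores)   # result has mean 0 and standard deviation 1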
Decimal Scaling:
This normalizes the values of an attribute by moving the position of their decimal
points. The number of places by which the decimal point is moved is determined
by the maximum absolute value of attribute A.
A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute P vary from -99 to 99; the maximum absolute
value of P is then 99. To normalize the values we divide each number by 100 (i.e.,
j = 2, the number of digits in the largest absolute value), so the values come out as
0.98, 0.97, and so on.
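A sketch of decimal scaling on that example (the log-based choice of j assumes
the maximum absolute value is not an exact power of 10):

import numpy as np

values = np.array([-99.0, 45.0, 98.0])

# Smallest j such that max(|v / 10**j|) < 1; here max(|v|) = 99, so j = 2
j = int(np.ceil(np.log10(np.abs(values).max())))
scaled = values / (10 ** j)
print(scaled)   # [-0.99  0.45  0.98]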
Data Standardization
Data standardization is the process of converting data into a standard format that
computers can easily understand and use. For example, when you standardize data,
you might convert all measurements into the metric system or all dates into a single
format (such as YYYY-MM-DD).
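A minimal sketch of date standardization with pandas, using invented input
formats:

import pandas as pd

# Dates arriving in mixed formats from different sources
raw_dates = ["03/14/2023", "2023-03-15", "April 1, 2023"]

# Parse each date and re-emit it in the single YYYY-MM-DD format
standardized = [pd.to_datetime(d).strftime("%Y-%m-%d") for d in raw_dates]
print(standardized)   # ['2023-03-14', '2023-03-15', '2023-04-01']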
Normalization and standardization differ as follows. In normalization, the
minimum and maximum values of the features are used for scaling; in
standardization, the mean and standard deviation are used for scaling.
Normalization squishes the n-dimensional data into an n-dimensional unit
hypercube, while standardization translates the mean vector of the original data to
the origin and squishes or expands the data.
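The contrast can be seen side by side with scikit-learn's two scalers (a sketch on a
tiny made-up feature):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [7.0], [10.0]])

# Normalization: rescale into [0, 1] using the feature's min and max
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: center on the mean and scale by the standard deviation
print(StandardScaler().fit_transform(X).ravel())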