Data Transformation and Standardization
Data Transformation
Data transformation in data mining refers to the process of converting raw data into
a format that is suitable for analysis and modeling. The goal of data transformation
is to prepare the data for data mining so that it can be used to extract useful insights
and knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and
missing values in the data.
2. Data integration: Combining data from multiple sources, such as
databases and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such
as between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a
subset of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories
or bins.
6. Data aggregation: Combining data at different levels of granularity, such
as by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it helps to
ensure that the data is in a format suitable for analysis and modeling, and that it is
free of errors and inconsistencies. Data transformation can also improve the
performance of data mining algorithms by reducing the dimensionality of the data
and by scaling the data to a common range of values. The first two steps, cleaning
and integration, are sketched below.
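As a minimal, hypothetical sketch of cleaning and integration using pandas (the
column names and values are invented for illustration):

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, None, 29],   # one missing value to clean
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [120.0, 80.0, 200.0],
})

# Data cleaning: fill the missing age with the column median
customers["age"] = customers["age"].fillna(customers["age"].median())

# Data integration: combine the two sources on a shared key
combined = customers.merge(orders, on="customer_id", how="left")
print(combined)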
The data are transformed in ways that make them suitable for mining. Data
transformation involves the following steps:
1. Smoothing: This is a process used to remove noise from the dataset using some
algorithm. It highlights the important features present in the dataset and helps in
predicting patterns. When data are collected, they can be manipulated to eliminate
or reduce variance or other forms of noise. The idea behind data smoothing is that
it can identify simple changes and thereby help predict trends and patterns. This
helps analysts or traders who need to look at a lot of data, which can often be
difficult to digest, to find patterns they would not otherwise see; a moving-average
sketch follows below.
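One common smoothing technique (one of several; the series values and window
size here are arbitrary assumptions) is a moving average, sketched with pandas:

import pandas as pd

noisy = pd.Series([10, 14, 9, 15, 11, 16, 12, 18, 13, 19])

# A centered 3-point rolling mean damps short-term fluctuations
smoothed = noisy.rolling(window=3, center=True).mean()
print(smoothed)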
2. Aggregation: Data aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources, which are
integrated into a single data-analysis description. This is a crucial step, since the
accuracy of data-analysis insights is highly dependent on the quantity and quality
of the data used: gathering accurate data of high quality and in large enough
quantity is necessary to produce relevant results. Aggregated data are useful for
everything from decisions concerning financing or business strategy to pricing,
operations, and marketing strategies. For example, sales data may be aggregated to
compute monthly and annual totals, as in the sketch below.
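A minimal sketch of that sales example, assuming hypothetical dates and amounts:

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20",
                            "2023-02-03", "2024-02-25"]),
    "amount": [100.0, 150.0, 200.0, 50.0],
})

# Monthly totals: group the transaction-level records by calendar month
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Annual totals: group the same records by year
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)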
3. Discretization: This is the process of transforming continuous data into a set of
small intervals. Most data mining activities in the real world involve continuous
attributes, yet many existing data mining frameworks are unable to handle them.
Even when a data mining task can manage a continuous attribute, its efficiency can
be significantly improved by replacing the continuous attribute with its discrete
values. For example, ages may be grouped into intervals (1-10, 11-20, ...) or into
categories (young, middle age, senior), as sketched below.
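A sketch of that age example using pandas binning (the bin edges are assumptions):

import pandas as pd

ages = pd.Series([15, 23, 38, 47, 61, 72])

# Map the continuous ages into three labeled intervals
categories = pd.cut(ages, bins=[0, 30, 55, 120],
                    labels=["young", "middle age", "senior"])
print(categories)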
4. Attribute Construction: New attributes are created from the given set of
attributes and applied to assist the mining process. This simplifies the original data
and makes the mining more efficient, as in the sketch below.
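For instance, an area attribute can be constructed from hypothetical width and
length attributes:

import pandas as pd

rooms = pd.DataFrame({"width_m": [4.0, 3.5], "length_m": [5.0, 6.0]})

# Attribute construction: derive a new attribute from existing ones
rooms["area_m2"] = rooms["width_m"] * rooms["length_m"]
print(rooms)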
5. Generalization: This converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, age initially in numerical form
(22, 25) is converted into a categorical value (young, old). Similarly, categorical
attributes such as house addresses may be generalized to higher-level concepts
such as town or country, as in the sketch below.
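A minimal sketch of generalizing addresses with a hand-built concept hierarchy
(the addresses and the mapping are invented for illustration):

import pandas as pd

addresses = pd.Series(["221B Baker Street, London",
                       "350 5th Avenue, New York"])

# Concept hierarchy: map each low-level address up to its city
hierarchy = {
    "221B Baker Street, London": "London",
    "350 5th Avenue, New York": "New York",
}
cities = addresses.map(hierarchy)
print(cities)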
6. Normalization: Data normalization involves converting all data variables into a
given range. Techniques used for normalization include:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that min_A and max_A are the minimum and maximum values of an
attribute A, and that [new_min_A, new_max_A] is the target range. A value v of A
is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value
you get after normalizing the old value.
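A direct sketch of the formula with NumPy, assuming a target range of [0, 1]:

import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0])
new_min, new_max = 0.0, 1.0   # assumed target range

v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalized)   # 20 -> 0.0, 80 -> 1.0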
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute
A are normalized based on the mean of A and its standard deviation.
A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
where mean_A and std_A are the mean and standard deviation of A.
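The same computation as a NumPy sketch (using the population standard
deviation, which is NumPy's default):

import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0])

# Subtract the mean and divide by the standard deviation
z_scores = (values - values.mean()) / values.std()
print(z_scores)   # result has mean 0 and standard deviation 1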
Decimal Scaling:
This normalizes the values of an attribute by moving the position of their decimal
points. The number of places by which the decimal point is moved is determined
by the maximum absolute value of attribute A.
A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute P vary from -99 to 99; the maximum absolute
value of P is then 99. To normalize the values we divide each number by 100 (i.e.,
j = 2, the number of digits in the largest absolute value), so the values come out as
0.98, 0.97, and so on.
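A sketch of decimal scaling on that example (the log-based choice of j assumes
the maximum absolute value is not an exact power of 10):

import numpy as np

values = np.array([-99.0, 45.0, 98.0])

# Smallest j such that max(|v / 10**j|) < 1; here max(|v|) = 99, so j = 2
j = int(np.ceil(np.log10(np.abs(values).max())))
scaled = values / (10 ** j)
print(scaled)   # [-0.99  0.45  0.98]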
Data Standardization
Data standardization is the process of converting data into a standard format that
computers can easily understand and use. For example, when you standardize data,
you might convert all measurements into the metric system or all dates into a single
format (such as YYYY-MM-DD).
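A minimal sketch of date standardization with pandas, using invented input
formats:

import pandas as pd

# Dates arriving in mixed formats from different sources
raw_dates = ["03/14/2023", "2023-03-15", "April 1, 2023"]

# Parse each date and re-emit it in the single YYYY-MM-DD format
standardized = [pd.to_datetime(d).strftime("%Y-%m-%d") for d in raw_dates]
print(standardized)   # ['2023-03-14', '2023-03-15', '2023-04-01']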
Normalization and standardization differ as follows. In normalization, the
minimum and maximum values of the features are used for scaling; in
standardization, the mean and standard deviation are used for scaling.
Normalization squishes the n-dimensional data into an n-dimensional unit
hypercube, while standardization translates the mean vector of the original data to
the origin and squishes or expands the data.
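The contrast can be seen side by side with scikit-learn's two scalers (a sketch on a
tiny made-up feature):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [7.0], [10.0]])

# Normalization: rescale into [0, 1] using the feature's min and max
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: center on the mean and scale by the standard deviation
print(StandardScaler().fit_transform(X).ravel())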