Data transformation in data mining
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing
values in the data.
2. Data integration: Combining data from multiple sources, such as databases
and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset
of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or
bins.
6. Data aggregation: Combining data at different levels of granularity, such as
by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it ensures that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. Data transformation can also improve the performance of data mining algorithms, for example by reducing the dimensionality of the data or by scaling values to a common range.
The data are transformed in ways that make them ideal for mining. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from a dataset with the help of algorithms such as moving averages. It highlights the important features present in the dataset and helps in predicting patterns. Data collected in the real world usually carries variance and other forms of noise, which smoothing reduces or eliminates. The idea behind data smoothing is that it makes simple, gradual changes visible, which helps in predicting trends and patterns. This is a help to analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not see otherwise.
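As a minimal sketch of the idea, assuming pandas and an arbitrary window size of 3 (neither is prescribed by the text), a moving average dampens isolated spikes in a series:

import pandas as pd

# Noisy measurements (made-up values for illustration).
readings = pd.Series([10, 12, 45, 11, 13, 12, 50, 14, 13, 12])

# A 3-point centered moving average dampens the isolated spikes
# (45 and 50) while preserving the overall level of the series.
smoothed = readings.rolling(window=3, center=True).mean()
print(smoothed)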
2. Aggregation: Aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights depends heavily on the quantity and quality of the data used; gathering accurate data of high quality, and in large enough quantity, is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning financing or the business strategy of a product to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals, as sketched below.
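A minimal sketch in Python, assuming pandas and an invented table of daily sales:

import pandas as pd

# Hypothetical daily sales records (made-up values).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10",
                            "2023-02-25", "2023-03-15"]),
    "amount": [200, 150, 300, 120, 500],
})

# Aggregate individual transactions into monthly and annual totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.to_period("Y"))["amount"].sum()
print(monthly)
print(annual)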
3. Discretization: Discretization is the process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle such attributes directly. Moreover, even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, numeric ranges such as 1-10 and 11-20, or age mapped to young, middle age, and senior.
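A short sketch of such binning in Python, assuming pandas; the cut points 20 and 50 are illustrative, not canonical:

import pandas as pd

ages = pd.Series([5, 17, 25, 38, 49, 61, 74])

# Map the continuous "age" attribute onto three named intervals.
labels = pd.cut(ages, bins=[0, 20, 50, 100],
                labels=["young", "middle age", "senior"])
print(labels)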
4. Attribute Construction: New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
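For instance, a small Python sketch (the width, length, and area attributes are invented for illustration):

import pandas as pd

rooms = pd.DataFrame({"width_m": [3.0, 4.5, 2.5],
                      "length_m": [4.0, 5.0, 3.0]})

# Construct a new attribute from the existing ones: a single "area"
# column can support mining better than width and length separately.
rooms["area_m2"] = rooms["width_m"] * rooms["length_m"]
print(rooms)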
5. Generalization: Generalization converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old). Likewise, categorical attributes such as house addresses may be generalized to higher-level definitions, such as town or country.
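A minimal sketch in Python, with an invented lookup table standing in for a real concept hierarchy:

import pandas as pd

customers = pd.DataFrame({"age": [22, 25, 67],
                          "city": ["New Delhi", "Mumbai", "Tokyo"]})

# Generalize low-level values to higher-level concepts: numeric age
# to a category, and city to its country (illustrative table).
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
customers["age_group"] = pd.cut(customers["age"], bins=[0, 40, 120],
                                labels=["young", "old"])
customers["country"] = customers["city"].map(city_to_country)
print(customers)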
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_A and max_A are the minimum and maximum values of an attribute A, and that the target range is [new_min_A, new_max_A]. A value v of A is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value you get after normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / stddev_A
Decimal Scaling:
It normalizes the values of an attribute by changing the position of their decimal points. The number of points by which the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute A vary from -99 to 99. The maximum absolute value of A is 99. To normalize, we divide each value by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values come out as 0.98, 0.97, and so on.
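The three techniques above can be sketched in a few lines of NumPy; the sample values and the target range [0, 1] are illustrative:

import numpy as np

values = np.array([-99.0, -20.0, 0.0, 45.0, 99.0])  # attribute A

# Min-max normalization into the target range [0, 1].
new_min, new_max = 0.0, 1.0
min_a, max_a = values.min(), values.max()
minmax = (values - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score normalization: zero mean, unit standard deviation.
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such
# that max(|v'|) < 1.  Here max |v| = 99, so j = 2.
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal = values / 10 ** j

print(minmax, zscore, decimal, sep="\n")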
Another example of aggregation is web analytics, where we gather statistics about website visitors. For example, all visitors who reach the site from an IP address located in India are reported together at the country level.
Some famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist the inspection of a data distribution, revealing, for example, outliers, skewness, or an approximately normal shape.
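A quick sketch with matplotlib on synthetic data (a normal sample with a few injected outliers, invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic continuous attribute: bulk near 50, outliers above 100.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 10, 500), [120, 130, 140]])

# The histogram makes the distribution's shape visible at a glance.
plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Histogram of a continuous attribute")
plt.show()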
Binning
Binning is a data smoothing technique that groups a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
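A brief sketch of two common flavors of binning, equal-width and equal-frequency, assuming pandas and made-up price values:

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width binning: 3 intervals of identical span.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency binning: 3 intervals holding roughly the same
# number of values each.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())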
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed on the values of a numeric attribute x, partitioning them into clusters; each cluster then stands for one discrete value of x.
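A minimal sketch with scikit-learn's KMeans on a one-dimensional attribute; the values and the choice k = 3 are illustrative:

import numpy as np
from sklearn.cluster import KMeans

# One-dimensional attribute with three natural groupings.
x = np.array([1, 2, 2, 3, 10, 11, 12, 25, 26, 27]).reshape(-1, 1)

# k-means assigns each value to a cluster; the cluster labels then
# serve as the discrete version of the attribute.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(km.labels_)           # discrete category per value
print(km.cluster_centers_)  # representative value of each cluster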
Data discretization using decision tree analysis
Decision tree analysis discretizes data using a top-down splitting technique. It is a supervised procedure. To discretize a numeric attribute, you first select the split point that yields the least entropy over the class labels, and then apply the procedure recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion.
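One way to sketch this is with scikit-learn (an assumption on our part, since the text names no library): fit a shallow entropy-based tree on the single attribute and read off the learned split thresholds:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A numeric attribute and class labels (the supervised signal).
age = np.array([18, 22, 25, 30, 35, 42, 50, 58, 63, 70]).reshape(-1, 1)
label = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

# An entropy-based tree limited to a few leaves chooses the split
# points that best separate the classes; those thresholds define
# the discretization intervals.
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=3,
                              random_state=0).fit(age, label)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)  # the learned interval boundaries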
Data discretization using correlation analysis
Correlation-based discretization (ChiMerge is a well-known example) works bottom-up: the most similar neighboring intervals are found, and the intervals are then merged recursively into larger ones until a stopping criterion is met, yielding the final set of disjoint intervals. It is a supervised procedure.
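A highly simplified sketch of the bottom-up merging idea, modeled on ChiMerge; the class counts and the stop-at-two-intervals rule are invented for illustration (a real implementation stops at a chi-square significance threshold):

import numpy as np

def chi_square(a, b):
    # Chi-square statistic comparing the class counts of two
    # adjacent intervals (the similarity test ChiMerge relies on).
    table = np.array([a, b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    expected[expected == 0] = 1e-9  # guard against empty classes
    return float(((table - expected) ** 2 / expected).sum())

# Each entry: (lower boundary of the interval, [class 0 count, class 1 count]).
intervals = [(1, [3, 0]), (2, [2, 1]), (5, [0, 4]), (8, [1, 3])]

# Repeatedly merge the adjacent pair whose class distributions are
# most alike (lowest chi-square) until two intervals remain.
while len(intervals) > 2:
    scores = [chi_square(intervals[i][1], intervals[i + 1][1])
              for i in range(len(intervals) - 1)]
    i = int(np.argmin(scores))
    merged = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
    intervals[i:i + 2] = [(intervals[i][0], merged)]

print([lo for lo, _ in intervals])  # surviving interval boundaries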
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy is a sequence of mappings from a set of low-level, specific concepts to more general, high-level concepts. There are many hierarchical systems in computer science; for example, a document placed in a folder, at a specific position in the Windows directory tree, is a familiar instance of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand the concept hierarchy for the dimension location with the help of an example. A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
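In code, such a hierarchy can be as simple as nested lookup tables; a minimal Python sketch with an invented mapping:

# A small location hierarchy: city -> country -> continent.
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Paris": "France"}
country_to_continent = {"India": "Asia", "France": "Europe"}

def generalize(city):
    # Walk up the concept hierarchy one level at a time.
    country = city_to_country[city]
    return country, country_to_continent[country]

print(generalize("New Delhi"))  # ('India', 'Asia')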
Top-down mapping
Top-down mapping generally starts at the top with general information and ends at the bottom with specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with specialized information and ends at the top with generalized information.