
Data transformation in data mining

Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing
values in the data.
2. Data integration: Combining data from multiple sources, such as databases
and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset
of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or
bins.
6. Data aggregation: Combining data at different levels of granularity, such as
by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it ensures that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. It can also improve the performance of data mining algorithms, for example by reducing the dimensionality of the data and by scaling values to a common range.
Each of these techniques is described in more detail below:
1. Smoothing: Smoothing removes noise from a dataset, highlighting the important features that are present and helping to predict patterns. When data are collected, they can be processed to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is that it makes simple, gradual changes visible, which helps in predicting trends and patterns. This is a real help to analysts or traders who need to look at large amounts of data that would otherwise be difficult to digest, exposing patterns they would not see in the raw values.
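
As a minimal sketch, assuming a small hypothetical price series, the pandas snippet below applies a 3-point moving average, one common smoothing technique:

    import pandas as pd

    # Hypothetical noisy daily price series (values are illustrative)
    prices = pd.Series([100, 104, 98, 107, 103, 110, 108, 115, 112, 118])

    # A 3-point moving average dampens short-term noise so the
    # underlying upward trend is easier to see
    smoothed = prices.rolling(window=3).mean()
    print(smoothed)

Other smoothing schemes, such as binning or regression, follow the same idea of replacing raw values with smoother summaries.
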
2. Aggregation: Aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple sources and integrated into a single description for analysis. This is a crucial step, since the accuracy of the resulting insights depends heavily on the quantity and quality of the data used; gathering enough accurate, high-quality data is necessary to produce relevant results. Aggregated data is useful for everything from decisions about financing and product or business strategy to pricing, operations, and marketing. For example, daily sales data may be aggregated to compute monthly and annual totals.
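
A minimal sketch of this sales example, assuming a hypothetical table of daily sales records:

    import pandas as pd

    # Hypothetical daily sales records
    sales = pd.DataFrame({
        "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03",
                                "2023-02-17", "2023-03-08"]),
        "amount": [250, 310, 180, 420, 275],
    })

    # Aggregate the daily rows into monthly totals
    monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
    print(monthly)

    # Annual total over the whole table
    print(sales["amount"].sum())
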
3. Discretization: Discretization transforms continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle them directly. Moreover, even when a data mining task can manage a continuous attribute, its efficiency can often be significantly improved by replacing the continuous attribute with its discrete values. For example, age values may be grouped into intervals such as 1-10 and 11-20, or into labels such as young, middle age, and senior.
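
A minimal sketch of this age example, assuming hypothetical ages and interval boundaries:

    import pandas as pd

    # Hypothetical continuous age values
    ages = pd.Series([3, 7, 15, 18, 25, 40, 45, 67, 72])

    # Map each age onto a labelled interval (boundaries are assumptions)
    labels = pd.cut(ages, bins=[0, 30, 60, 100],
                    labels=["young", "middle age", "senior"])
    print(labels)
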
4. Attribute Construction: New attributes are constructed from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
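
A minimal sketch, assuming hypothetical width and height attributes from which a more informative area attribute is constructed:

    import pandas as pd

    # Hypothetical raw attributes
    df = pd.DataFrame({"width": [2.0, 3.5, 1.2], "height": [4.0, 2.0, 5.5]})

    # Construct a new attribute from the existing ones; the miner can
    # then work with one informative feature instead of two raw values
    df["area"] = df["width"] * df["height"]
    print(df)
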
5. Generalization: Generalization converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 70) can be converted into categorical values (young, old), and categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as town or country.
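
A minimal sketch of address generalization, assuming a small hand-written concept hierarchy (all names are hypothetical):

    # Hypothetical two-level concept hierarchy: address -> city -> country
    address_to_city = {
        "221B Baker Street": "London",
        "10 Downing Street": "London",
        "350 Fifth Avenue": "New York",
    }
    city_to_country = {"London": "UK", "New York": "USA"}

    addresses = ["221B Baker Street", "350 Fifth Avenue"]
    cities = [address_to_city[a] for a in addresses]
    countries = [city_to_country[c] for c in cities]
    print(cities)     # ['London', 'New York']
    print(countries)  # ['UK', 'USA']
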
6. Normalization: Data normalization involves converting the values of an attribute into a given range. Techniques used for normalization are:

• Min-Max Normalization: This transforms the original data linearly. Suppose that min_A is the minimum and max_A is the maximum of an attribute A, and that [new_min_A, new_max_A] is the target range. A value v of attribute A is normalized to v' by computing:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

where v is the value you want to map into the new range and v' is the new value you get after normalizing the old value.

• Z-Score Normalization: In z-score normalization (or zero-mean normalization) the values of an attribute A are normalized based on the mean of A (mean_A) and its standard deviation (std_A). A value v of attribute A is normalized to v' by computing:

v' = (v - mean_A) / std_A

• Decimal Scaling: It normalizes the values of an attribute by moving the position of their decimal points. The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing:

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1. Suppose the values of an attribute A vary from -99 to 99. The maximum absolute value is 99, so we divide each value by 100 (i.e., j = 2, the number of digits in the largest absolute value), and the normalized values then fall between -0.99 and 0.99.
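
A minimal sketch of all three techniques, assuming a small hypothetical value array:

    import numpy as np

    # Hypothetical attribute values in [-99, 99]
    v = np.array([-99.0, -50.0, 0.0, 45.0, 99.0])

    # Min-max normalization to the new range [0, 1]
    minmax = (v - v.min()) / (v.max() - v.min())

    # Z-score normalization: subtract the mean, divide by the std deviation
    zscore = (v - v.mean()) / v.std()

    # Decimal scaling: divide by 10^j with the smallest j such that
    # max(|v'|) < 1; for a maximum absolute value of 99, j = 2
    j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
    decimal = v / 10 ** j

    print(minmax)
    print(zscore)
    print(decimal)
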


Advantages of Data Transformation in Data Mining:


1. Improves Data Quality: Data transformation helps to improve the quality of
data by removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data
from multiple sources, which can improve the accuracy and completeness of
the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for
analysis and modeling by normalizing, reducing dimensionality, and
discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive
data, or to remove sensitive information from the data, which can help to
increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can
improve the performance of data mining algorithms by reducing the
dimensionality of the data and scaling the data to a common range of values.

Disadvantages of Data Transformation in Data Mining

1. Time-consuming: Data transformation can be a time-consuming process, especially when dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized skills and knowledge to implement and to interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing continuous data or when removing attributes or features from the data.
4. Biased Transformation: Data transformation can introduce bias if the data is not properly understood before it is transformed.
5. High Cost: Data transformation can be an expensive process, requiring significant investments in hardware, software, and personnel.
6. Overfitting: Data transformation can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it performs poorly on new, unseen data.

Discretization in data mining


Data discretization refers to a method of converting a large number of data values into smaller ones, so that evaluating and managing the data becomes easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimum data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization uses the class information of the data, while unsupervised discretization does not; unsupervised methods are instead classified by how the operation proceeds, using either a top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute Age with the following values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization, the values are grouped into labelled intervals:

Child: 1, 5, 4, 9, 7
Young: 11, 14, 17, 13, 18, 19
Mature: 31, 33, 36, 42, 44, 46
Old: 70, 74, 77, 78
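
A minimal sketch reproducing this grouping, assuming interval boundaries at 10, 30, and 60:

    import pandas as pd

    ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                      31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

    # Four labelled intervals matching the grouping above
    groups = pd.cut(ages, bins=[0, 10, 30, 60, 100],
                    labels=["Child", "Young", "Mature", "Old"])
    print(ages.groupby(groups).apply(list))
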

Another example is web analytics, where we gather statistics about website visitors. For example, all visitors who visit the site from an IP address located in India are reported together at the country level.
Some famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, revealing, for example, outliers, skewness, or an approximately normal shape.
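
A minimal sketch, assuming hypothetical normally distributed data; the equal-width bin edges that the histogram produces can double as discretization boundaries:

    import numpy as np

    # Hypothetical continuous data
    data = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)

    # Five equal-width bins; counts show the frequency distribution
    counts, edges = np.histogram(data, bins=5)
    print(counts)   # how many values fall in each bin
    print(edges)    # bin boundaries, usable as interval cut points
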
Binning
Binning is a data smoothing technique that groups a large number of continuous values into a smaller number of bins. The technique can also be used for data discretization and for developing concept hierarchies.
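
A minimal sketch of smoothing by bin means, assuming sorted hypothetical values and equal-frequency bins of depth 4:

    import numpy as np

    # Hypothetical sorted values, partitioned into equal-frequency bins
    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34],
                      dtype=float)
    bins = values.reshape(-1, 4)   # three bins of four values each

    # Smoothing by bin means: every value is replaced by its bin's mean
    smoothed = np.repeat(bins.mean(axis=1), 4)
    print(smoothed)
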
Cluster Analysis
Cluster analysis is a form of data discretization in which a clustering algorithm is executed on the values of an attribute x, dividing them into clusters; each cluster then acts as one discrete value of x.
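
A minimal sketch using k-means, assuming hypothetical one-dimensional values and three clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical continuous values, reshaped to one feature column
    x = np.array([1, 2, 3, 20, 22, 25, 70, 75, 80],
                 dtype=float).reshape(-1, 1)

    # Each cluster label becomes one discrete category of the attribute
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
    print(km.labels_)   # cluster indices serve as the discretized values
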
Data discretization using decision tree analysis
Decision tree analysis discretizes data using a top-down splitting technique and is a supervised procedure. To discretize a numeric attribute, the split point that yields the least entropy is selected first, and the procedure is then run recursively: each step divides the values into discretized, disjoint intervals, from top to bottom, using the same splitting criterion.
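
A minimal sketch, assuming hypothetical attribute values and class labels; the entropy-based split thresholds of a small tree are read out as interval boundaries:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical continuous attribute with class labels (supervised)
    x = np.array([2, 5, 9, 14, 18, 25, 40, 55, 62, 71],
                 dtype=float).reshape(-1, 1)
    y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

    # Fit a small tree; its split points partition x into intervals
    tree = DecisionTreeClassifier(criterion="entropy",
                                  max_leaf_nodes=3).fit(x, y)
    cuts = sorted(t for t in tree.tree_.threshold if t != -2)
    print(cuts)   # cut points defining the discretized intervals
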
Data discretization using correlation analysis
Correlation-based discretization (for example, the ChiMerge technique) works bottom-up: starting from small initial intervals, it finds the best neighboring intervals and then merges them recursively to form larger intervals, until a stopping criterion is reached. It is a supervised procedure.
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their level of importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to higher-level (more general) concepts. Computer science offers many examples of hierarchical systems; a document placed in a folder at a specific position in the Windows directory tree is a good example of a hierarchical tree model. There are two types of mapping in a hierarchy: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an
example.
A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping starts at the top of the hierarchy with general information and ends at the bottom with specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom of the hierarchy with specialized information and ends at the top with generalized information.

Data discretization and binarization in data mining


Data discretization is a method of converting the values of a continuous attribute into a finite set of intervals with minimum data loss. In contrast, data binarization is used to transform continuous and discrete attributes into binary attributes.
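
A minimal sketch of binarization, assuming a hypothetical table with one continuous and one discrete attribute:

    import pandas as pd

    # Hypothetical records: a continuous and a discrete attribute
    df = pd.DataFrame({"income": [20000, 55000, 80000],
                       "city": ["Delhi", "Tokyo", "Delhi"]})

    # Continuous -> binary via a threshold (the cutoff is an assumption)
    df["high_income"] = (df["income"] > 50000).astype(int)

    # Discrete -> one 0/1 indicator column per category
    binary = pd.get_dummies(df, columns=["city"], dtype=int)
    print(binary)
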
Why is Discretization important?
Continuous data pose mathematical problems with an infinite number of degrees of freedom, so for many purposes data scientists need to apply discretization. It is also used to improve the signal-to-noise ratio.
