Data transformation in data mining
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing
values in the data.
2. Data integration: Combining data from multiple sources, such as databases
and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset
of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or
bins.
6. Data aggregation: Combining data at different levels of granularity, such as
by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it ensures that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. Data transformation can also improve the performance of data mining algorithms, for example by reducing the dimensionality of the data or by scaling values to a common range.
The data are transformed in ways that make them ideal for mining. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from a dataset with the help of algorithms such as moving averages. It highlights the important features present in the dataset and helps in predicting patterns. Data collected in the real world usually carries variance and other forms of noise, which smoothing reduces or eliminates. The idea behind data smoothing is that it makes simple, gradual changes visible, which helps in predicting trends and patterns. This is a help to analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not see otherwise.
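As a minimal sketch of the idea, assuming pandas and an arbitrary window size of 3 (neither is prescribed by the text), a moving average dampens isolated spikes in a series:

import pandas as pd

# Noisy measurements (made-up values for illustration).
readings = pd.Series([10, 12, 45, 11, 13, 12, 50, 14, 13, 12])

# A 3-point centered moving average dampens the isolated spikes
# (45 and 50) while preserving the overall level of the series.
smoothed = readings.rolling(window=3, center=True).mean()
print(smoothed)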
2. Aggregation: Aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights depends heavily on the quantity and quality of the data used; gathering accurate data of high quality, and in large enough quantity, is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning financing or the business strategy of a product to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals, as sketched below.
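A minimal sketch in Python, assuming pandas and an invented table of daily sales:

import pandas as pd

# Hypothetical daily sales records (made-up values).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10",
                            "2023-02-25", "2023-03-15"]),
    "amount": [200, 150, 300, 120, 500],
})

# Aggregate individual transactions into monthly and annual totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.to_period("Y"))["amount"].sum()
print(monthly)
print(annual)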
3. Discretization: Discretization is the process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle such attributes directly. Moreover, even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, numeric ranges such as 1-10 and 11-20, or age mapped to young, middle age, and senior.
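A short sketch of such binning in Python, assuming pandas; the cut points 20 and 50 are illustrative, not canonical:

import pandas as pd

ages = pd.Series([5, 17, 25, 38, 49, 61, 74])

# Map the continuous "age" attribute onto three named intervals.
labels = pd.cut(ages, bins=[0, 20, 50, 100],
                labels=["young", "middle age", "senior"])
print(labels)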
4. Attribute Construction: New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
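For instance, a small Python sketch (the width, length, and area attributes are invented for illustration):

import pandas as pd

rooms = pd.DataFrame({"width_m": [3.0, 4.5, 2.5],
                      "length_m": [4.0, 5.0, 3.0]})

# Construct a new attribute from the existing ones: a single "area"
# column can support mining better than width and length separately.
rooms["area_m2"] = rooms["width_m"] * rooms["length_m"]
print(rooms)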
5. Generalization: Generalization converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old). Likewise, categorical attributes such as house addresses may be generalized to higher-level definitions, such as town or country.
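A minimal sketch in Python, with an invented lookup table standing in for a real concept hierarchy:

import pandas as pd

customers = pd.DataFrame({"age": [22, 25, 67],
                          "city": ["New Delhi", "Mumbai", "Tokyo"]})

# Generalize low-level values to higher-level concepts: numeric age
# to a category, and city to its country (illustrative table).
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
customers["age_group"] = pd.cut(customers["age"], bins=[0, 40, 120],
                                labels=["young", "old"])
customers["country"] = customers["city"].map(city_to_country)
print(customers)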
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_A and max_A are the minimum and maximum values of an attribute A, and that the target range is [new_min_A, new_max_A]. A value v of A is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value you get after normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / stddev_A
Decimal Scaling:
It normalizes the values of an attribute by changing the position of their decimal points. The number of points by which the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute A vary from -99 to 99. The maximum absolute value of A is 99. To normalize, we divide each value by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values come out as 0.98, 0.97, and so on.
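The three techniques above can be sketched in a few lines of NumPy; the sample values and the target range [0, 1] are illustrative:

import numpy as np

values = np.array([-99.0, -20.0, 0.0, 45.0, 99.0])  # attribute A

# Min-max normalization into the target range [0, 1].
new_min, new_max = 0.0, 1.0
min_a, max_a = values.min(), values.max()
minmax = (values - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score normalization: zero mean, unit standard deviation.
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such
# that max(|v'|) < 1.  Here max |v| = 99, so j = 2.
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal = values / 10 ** j

print(minmax, zscore, decimal, sep="\n")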
Another example of aggregation is web analytics, where we gather statistics about website visitors. For example, all visitors who reach the site from an IP address located in India are reported together at the country level.
Some famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist the inspection of a data distribution, revealing, for example, outliers, skewness, or an approximately normal shape.
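A quick sketch with matplotlib on synthetic data (a normal sample with a few injected outliers, invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic continuous attribute: bulk near 50, outliers above 100.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 10, 500), [120, 130, 140]])

# The histogram makes the distribution's shape visible at a glance.
plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Histogram of a continuous attribute")
plt.show()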
Binning
Binning is a data smoothing technique that groups a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
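A brief sketch of two common flavors of binning, equal-width and equal-frequency, assuming pandas and made-up price values:

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width binning: 3 intervals of identical span.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency binning: 3 intervals holding roughly the same
# number of values each.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())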
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed on the values of a numeric attribute x, partitioning them into clusters; each cluster then stands for one discrete value of x.
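A minimal sketch with scikit-learn's KMeans on a one-dimensional attribute; the values and the choice k = 3 are illustrative:

import numpy as np
from sklearn.cluster import KMeans

# One-dimensional attribute with three natural groupings.
x = np.array([1, 2, 2, 3, 10, 11, 12, 25, 26, 27]).reshape(-1, 1)

# k-means assigns each value to a cluster; the cluster labels then
# serve as the discrete version of the attribute.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(km.labels_)           # discrete category per value
print(km.cluster_centers_)  # representative value of each cluster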
Data discretization using decision tree analysis
Decision tree analysis discretizes data using a top-down splitting technique. It is a supervised procedure. To discretize a numeric attribute, you first select the split point that yields the least entropy over the class labels, and then apply the procedure recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion.
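One way to sketch this is with scikit-learn (an assumption on our part, since the text names no library): fit a shallow entropy-based tree on the single attribute and read off the learned split thresholds:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A numeric attribute and class labels (the supervised signal).
age = np.array([18, 22, 25, 30, 35, 42, 50, 58, 63, 70]).reshape(-1, 1)
label = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

# An entropy-based tree limited to a few leaves chooses the split
# points that best separate the classes; those thresholds define
# the discretization intervals.
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=3,
                              random_state=0).fit(age, label)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)  # the learned interval boundaries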
Data discretization using correlation analysis
Correlation-based discretization (ChiMerge is a well-known example) works bottom-up: the most similar neighboring intervals are found, and the intervals are then merged recursively into larger ones until a stopping criterion is met, yielding the final set of disjoint intervals. It is a supervised procedure.
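A highly simplified sketch of the bottom-up merging idea, modeled on ChiMerge; the class counts and the stop-at-two-intervals rule are invented for illustration (a real implementation stops at a chi-square significance threshold):

import numpy as np

def chi_square(a, b):
    # Chi-square statistic comparing the class counts of two
    # adjacent intervals (the similarity test ChiMerge relies on).
    table = np.array([a, b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    expected[expected == 0] = 1e-9  # guard against empty classes
    return float(((table - expected) ** 2 / expected).sum())

# Each entry: (lower boundary of the interval, [class 0 count, class 1 count]).
intervals = [(1, [3, 0]), (2, [2, 1]), (5, [0, 4]), (8, [1, 3])]

# Repeatedly merge the adjacent pair whose class distributions are
# most alike (lowest chi-square) until two intervals remain.
while len(intervals) > 2:
    scores = [chi_square(intervals[i][1], intervals[i + 1][1])
              for i in range(len(intervals) - 1)]
    i = int(np.argmin(scores))
    merged = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
    intervals[i:i + 2] = [(intervals[i][0], merged)]

print([lo for lo, _ in intervals])  # surviving interval boundaries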
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy is a sequence of mappings from a set of low-level, specific concepts to more general, high-level concepts. There are many hierarchical systems in computer science; for example, a document placed in a folder, at a specific position in the Windows directory tree, is a familiar instance of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand the concept hierarchy for the dimension location with the help of an example. A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
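In code, such a hierarchy can be as simple as nested lookup tables; a minimal Python sketch with an invented mapping:

# A small location hierarchy: city -> country -> continent.
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Paris": "France"}
country_to_continent = {"India": "Asia", "France": "Europe"}

def generalize(city):
    # Walk up the concept hierarchy one level at a time.
    country = city_to_country[city]
    return country, country_to_continent[country]

print(generalize("New Delhi"))  # ('India', 'Asia')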
Top-down mapping
Top-down mapping generally starts at the top with general information and ends at the bottom with specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with specialized information and ends at the top with generalized information.