
Data Mining

Data Integration and Transformation
Data Integration
*Data integration involves combining data from several disparate sources, which are stored using various technologies, to provide a unified view of the data.
* The resulting unified repository is often called a data warehouse.
* It merges the data from multiple data stores (data sources).
* These can include multiple databases, data cubes, or flat files.
*Metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity all contribute towards smooth data integration.
Advantages :
1. Independence.
2. Faster query processing.
3. Complex query processing.
4. Advanced data summarization & storage possible.
5. High-volume data processing.
Disadvantages :
1. Latency (since data needs to be loaded using ETL).
2. Higher cost (data localization, infrastructure, security).
There are a number of issues to consider during data integration:
1. Schema Integration.
2. Redundancy.
3. Detection and resolution of data value conflicts.
Schema integration :
Matching up real-world entities from multiple data sources is referred to as the entity identification problem.
For example,
How can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same entity? Databases and data warehouses typically have metadata, that is, data about the data, which can be used to help avoid errors in schema integration.
Redundancy :
* It is another important issue.
*An attribute may be redundant if it can be “derived” from another table, such as annual revenue.
*Some redundancies can be detected by correlation analysis.
For example,
Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
The correlation between attributes A and B can be measured by

r(A,B) = Σ (ai - Ā)(bi - B̄) / ((n - 1) σA σB)

where n is the number of tuples, ai and bi are the values of A and B in tuple i, Ā and B̄ are the mean values of A and B, and σA and σB are their standard deviations. A result close to +1 or -1 indicates that one attribute strongly implies the other, so one of them may be redundant.
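As an illustration (not from the original slides), here is a minimal Python sketch of this correlation computation, assuming the two attributes are given as equal-length numeric lists:

import math

def correlation(a, b):
    # Pearson correlation coefficient r(A, B); the (n - 1) factors in the
    # formula above cancel once the standard deviations are expanded.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

# A result near +1 or -1 suggests one attribute implies the other,
# flagging a possible redundancy.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0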
Detection and resolution of data value conflicts :
*A third important issue in data integration is the detection and resolution of data value conflicts.
*For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding.
*An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
*For example, the total sales in one database may refer to one branch of All Electronics, while an attribute of the same name in another database may refer to the total sales for All Electronics stores in a given region.
Data Transformation
*In data transformation, the data are transformed or consolidated into forms appropriate for mining.
* Data transformation can involve:
1. Smoothing.
2. Aggregation.
3. Generalization.
4. Normalization.
5. Attribute construction.
Smoothing :
Works to remove the noise from the data. Techniques include binning, clustering, and regression.
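As a sketch only (not part of the slides), the following Python snippet shows one binning technique, smoothing by bin means, assuming equal-frequency bins of a chosen size:

def smooth_by_bin_means(values, bin_size):
    # Sort, partition into equal-frequency bins, and replace each value
    # with the mean of its bin.
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]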
Aggregation :
*Where summary or aggregation operations are applied
to the data.
*For example, the daily sales data may be aggregated so
as to compute monthly and annual total amounts.
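The slides give no code, but a minimal Python sketch of this daily-to-monthly roll-up, assuming hypothetical (date, amount) records with "YYYY-MM-DD" dates, could look like:

from collections import defaultdict

daily_sales = [("2024-01-03", 120.0), ("2024-01-17", 80.0), ("2024-02-05", 200.0)]

monthly_totals = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]  # "YYYY-MM" key, i.e. aggregate by month
    monthly_totals[month] += amount

print(dict(monthly_totals))  # {'2024-01': 200.0, '2024-02': 200.0}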
Generalization :
*Low-level or “primitive” data are replaced by higher-level concepts through the use of concept hierarchies.
*For example, a categorical attribute like street can be generalized to the higher-level concept city or country, and a numeric attribute like age can be generalized to the higher-level concepts young, middle-aged, and senior.
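As an illustration only, here is a small Python sketch of climbing one level of such a concept hierarchy for a numeric age attribute; the cut-off values are hypothetical:

def generalize_age(age):
    # Map a raw age to a higher-level concept (hypothetical boundaries).
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([generalize_age(a) for a in [22, 47, 71]])  # ['young', 'middle-aged', 'senior']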
Normalization :
Where the attribute data are scaled so as to fall within a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute construction :
Where new attributes are constructed and added from the given set of attributes to help the mining process.
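A brief, hypothetical Python sketch of attribute construction, deriving a new area attribute from existing width and height attributes so the combined information is directly available to the miner:

records = [{"width": 2.0, "height": 3.0}, {"width": 4.0, "height": 2.5}]
for r in records:
    # Construct the new attribute from the given ones.
    r["area"] = r["width"] * r["height"]

print(records)  # each record now also carries "area": 6.0 and 10.0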
•Smoothing is a form of Data Cleaning
•Aggregation and Generalization are forms of Data Reduction
•Normalization is useful in Classification
•Attribute Construction helps to Improve the Accuracy and Understanding of Structure in High-Dimensional Data.

Normalization :
There are three methods for data normalization:
* Min-Max normalization.
* Z-Score normalization.
* Normalization by decimal scaling.
Min-Max Normalization :
It performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

Example: Suppose the min and max values for the attribute income are $12,000 and $98,000 respectively, and the mapping is into the range [0, 1]. By min-max normalization, what is the value of $73,600?
Solution: v' = ((73600 - 12000) / (98000 - 12000)) × (1 - 0) + 0
= 0.716
It preserves the relationships among the original data values.
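A small Python sketch (not from the slides) of min-max normalization, reproducing the income example above:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linearly map v from [min_a, max_a] into [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The slide's income example: $73,600 with min $12,000 and max $98,000.
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716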

Z-Score Normalization :
In z-score normalization, the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - Ā) / σA

where Ā is the mean value of attribute A and σA is the standard deviation of A.

Example:
Suppose the mean and standard deviation of the attribute income are $54,000 and $16,000 respectively. Using z-score normalization, what is the value of $73,600?
Solution:
v' = (73600 - 54000) / 16000
= 1.225
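Similarly, a minimal Python sketch of z-score normalization, checked against the income example:

def z_score_normalize(v, mean_a, std_a):
    # Normalize v using the mean and standard deviation of attribute A.
    return (v - mean_a) / std_a

# The slide's income example: mean $54,000, standard deviation $16,000.
print(z_score_normalize(73600, 54000, 16000))  # 1.225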
Normalization by Decimal Scaling :
Normalization by decimal scaling normalizes by moving the
decimal point of values of attribute A.
The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

Example: Suppose the absolute maximum value of A is 986.
Solution: By decimal scaling, v' = 986 / 1000 = 0.986, where j = 3.
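Finally, a short Python sketch of decimal scaling, assuming the attribute's values are supplied as a list:

def decimal_scale(values):
    # Divide by 10**j, where j is the smallest integer making max |v'| < 1.
    j = 0
    max_abs = max(abs(v) for v in values)
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

scaled, j = decimal_scale([986, -217, 45])
print(scaled, j)  # [0.986, -0.217, 0.045] 3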
Thank You
