Data Integration & Transformation

Data Integration
*Data integration involves combining data from several disparate sources, which are stored using various technologies, to provide a unified view of the data.
* This initiative is often realized as a data warehouse.
* It merges data from multiple data stores (data sources).
* These may include multiple databases, data cubes, or flat files.
*Metadata, correlation analysis, data conflict detection
and resolution of semantic heterogeneity contribute towards
smooth data integration.
Advantages :
1. Independence.
2. Faster query processing.
3. Complex query processing.
4. Advanced data summarization & storage possible.
5. High volume data processing.
Disadvantages :
1. Latency (since data needs to be loaded using ETL).
2. Costlier (data localization, infrastructure, security).
There are a number of issues to consider during data integration.
1. Schema Integration.
2. Redundancy.
3. Detection and resolution of data value conflicts.
Schema integration :
Real-world entities from multiple data sources must be matched; this is referred to as the entity identification problem.
For example,
how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same entity? Databases and data warehouses typically have metadata, that is, data about the data, which can be used to help avoid errors in schema integration.
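A minimal sketch of resolving such a match, assuming two hypothetical pandas DataFrames in which metadata has confirmed that customer_id and cust_number name the same entity:

import pandas as pd

# Two sources naming the customer key differently (illustrative data).
sales_db = pd.DataFrame({"customer_id": [101, 102], "total_sales": [500, 750]})
crm_db = pd.DataFrame({"cust_number": [101, 102], "region": ["East", "West"]})

# Rename one key and merge into a unified view of the customer.
unified = sales_db.merge(
    crm_db.rename(columns={"cust_number": "customer_id"}),
    on="customer_id",
)
print(unified)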
Redundancy :
* Redundancy is another important issue.
* An attribute may be redundant if it can be "derived" from another table; annual revenue is one example.
* Some redundancies can be detected by correlation analysis.
For example, given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by

r_{A,B} = Σ (A - Ā)(B - B̄) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations.
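A minimal sketch of such an analysis, computing the correlation coefficient r_{A,B} with NumPy (the attribute names and data are illustrative):

import numpy as np

# Hypothetical attribute columns from the integrated data.
monthly_revenue = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
annual_revenue = np.array([120.0, 144.0, 114.0, 168.0, 132.0])

# |r| near 1 suggests one attribute can be derived from the other,
# i.e. it is redundant and a candidate for removal.
r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(f"r = {r:.3f}")  # 1.000 here: annual_revenue is 12 x monthly_revenue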
Detection and resolution of data value conflicts :
*A third important issue in data integration is the
detection and resolution of data value conflicts.
*For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding.
*An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
*For example, the total sales in one database may refer to one branch of All Electronics, while an attribute of the same name in another database may refer to the total sales for All Electronics stores in a given region.
Data Transformation
*In data transformation, the data are transformed or consolidated into forms appropriate for mining.
* Data transformation can involve
1. Smoothing.
2. Aggregation.
3. Generalization.
4. Normalization.
5. Attribute construction.
Smoothing :
Smoothing works to remove noise from the data. Such techniques include binning, clustering, and regression.
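A minimal sketch of smoothing by bin means, assuming a small sorted list of noisy values (the data are illustrative):

# Partition sorted values into equal-size bins and replace each value
# with the mean of its bin.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    smoothed.extend([mean] * len(bin_values))
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]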
Aggregation :
*Where summary or aggregation operations are applied
to the data.
*For example, the daily sales data may be aggregated so
as to compute monthly and annual total amounts.
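A minimal sketch of such aggregation, assuming hypothetical daily sales in a pandas DataFrame:

import pandas as pd

# Illustrative daily sales records.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-03"]),
    "sales": [100.0, 250.0, 175.0],
})

# Roll daily figures up into monthly totals.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)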
Generalization :
*In generalization, low-level or "primitive" data are replaced by higher-level concepts through the use of concept hierarchies.
*For example, a categorical attribute like street can be generalized to the higher-level concept city or country, while a numeric attribute like age can be mapped to higher-level concepts such as young, middle-aged, and senior.
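A minimal sketch of generalization through a concept hierarchy; the age cut-offs below are illustrative assumptions, not taken from the text:

# Map a numeric age up the concept hierarchy to a higher-level concept.
def generalize_age(age: int) -> str:
    if age < 30:          # assumed boundary
        return "young"
    elif age < 60:        # assumed boundary
        return "middle-aged"
    return "senior"

print([generalize_age(a) for a in [22, 45, 71]])
# ['young', 'middle-aged', 'senior']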
Normalization :
Where the attribute data are scaled so as to fall within a
specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute construction :
New attributes are constructed and added from the given set of attributes to help the mining process.
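A minimal sketch of attribute construction, deriving a hypothetical area attribute from existing height and width attributes:

# Construct a new attribute, area, from the given set of attributes.
records = [{"height": 2.0, "width": 3.0}, {"height": 4.0, "width": 1.5}]
for r in records:
    r["area"] = r["height"] * r["width"]
print(records)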
•Smoothing is a form of Data Cleaning.
•Aggregation and Generalization are forms of Data Reduction.
•Normalization is useful in Classification.
•Attribute Construction helps to improve the Accuracy and Understanding of Structure in High-Dimensional Data.
Normalization :
There are three methods for data normalization.
* Min-Max normalization.
* Z-Score normalization.
* Normalization by decimal scaling.
Min – Max Normalization:
It performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

v' = ((v - min_A) / (max_A - min_A)) (new_max_A - new_min_A) + new_min_A
Example: Suppose the minimum and maximum values for the attribute income are $12000 and $98000 respectively, and income is to be mapped to the range [0, 1]. By min-max normalization, what is the transformed value of $73600?
Solution: v' = ((73600 - 12000) / (98000 - 12000)) × (1 - 0) + 0
= 0.716
It preserves the relationships among the original data values.
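A minimal sketch of min-max normalization in Python, reproducing the income example above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716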
Z – Score Normalization :
In z-score normalization, the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - Ā) / σ_A

where Ā is the mean value of attribute A and σ_A is the standard deviation of A.
Example:
Suppose the mean and standard deviation of the attribute income are $54000 and $16000 respectively. Using z-score normalization, what is the transformed value of $73600?
Solution:
v' = (73600 - 54000) / 16000
= 1.225
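A minimal sketch of z-score normalization, reproducing the income example above:

def z_score(v, mean_a, std_a):
    """Normalize v by the mean and standard deviation of attribute A."""
    return (v - mean_a) / std_a

print(z_score(73600, 54000, 16000))  # 1.225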
Normalization by Decimal Scaling :
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
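A minimal sketch of decimal scaling, assuming a small illustrative set of values:

import math

def decimal_scale(values):
    """v' = v / 10^j, with j the smallest integer making max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917]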