1.6 - Data Integration, 1.10 - Transformation

1. Data integration is a technique to merge data from multiple sources, such as flat files and multi-dimensional databases, into a coherent data store like a data warehouse.
2. Issues that can arise during data integration include schema integration, where different sources use the same attributes differently; redundancy, where some attributes provide duplicate information; and conflicts in data values that are represented differently across sources.
3. Data transformation techniques prepare data for mining and include smoothing to remove noise, aggregation to summarize data, generalization to replace low-level concepts with higher-level ones, and normalization to scale attribute values into a standard range.


1.6 DATA INTEGRATION:
It is a preprocessing technique that merges data from multiple sources, such as flat files, multi-dimensional databases, and data cubes, into a single coherent data store such as a data warehouse.
Assume that our data warehouse is fed with data from two different companies, A and B.
Ex.:

Company A:
Emp.No  Name   DOB    Age  Price
1       John   .....  28   ₹100
2       Rasul  .....  32   ₹500

Company B:
Emp.ID  Name   Price
4       Siva   $100
5       Ram    $500
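
A minimal sketch of these two source tables, assuming the pandas library (the DataFrames below are hypothetical stand-ins for the companies' actual files, and the DOB values are invented placeholders since the note leaves them blank):

import pandas as pd

# Company A's table (DOB values are invented placeholders).
company_a = pd.DataFrame({
    "Emp.No": [1, 2],
    "Name": ["John", "Rasul"],
    "DOB": ["1997-01-01", "1993-01-01"],
    "Age": [28, 32],
    "Price": [100, 500],          # in rupees (₹)
})

# Company B's table, designed independently of A's schema.
company_b = pd.DataFrame({
    "Emp.ID": [4, 5],
    "Name": ["Siva", "Ram"],
    "Price": [100, 500],          # in dollars ($)
})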

While maintaining and integrating the data in the warehouse, the following issues may arise (a sketch resolving all three follows the list):
- Schema Integration and Object Matching
Both companies use the same property (a numeric datatype) under different attribute names, Emp.No and Emp.ID. During automated integration, the system cannot tell that these two attributes describe the same thing, so it may lump all the numbers together in one place.
- Redundancy (unwanted attributes)
Company A's table carries both DOB and Age, yet only one is needed, since each implies the other: given DOB, Age can be calculated, and vice versa. One of them can therefore simply be ignored or removed.
- Detection and Resolution of Data Value Conflicts
Suppose Company A represents Price in rupees while Company B uses dollars. Simple replacement or substitution is not sufficient, since $100 and ₹100 are different amounts and would conflict. The mistake must not only be detected but also resolved by correctly converting the values.
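
A sketch of resolving all three issues before loading the warehouse, continuing the tables above (the exchange rate is an assumed figure for illustration, not real data):

# Schema integration: map both ID attributes onto one warehouse key.
a = company_a.rename(columns={"Emp.No": "EmpID"})
b = company_b.rename(columns={"Emp.ID": "EmpID"})

# Redundancy: Age is implied by DOB, so drop it.
a = a.drop(columns=["Age"])

# Data value conflict: convert Company B's dollar prices to rupees
# so both sources use one unit (83 is an assumed illustrative rate).
USD_TO_INR = 83.0
b["Price"] = b["Price"] * USD_TO_INR

# One coherent table for the warehouse (B's missing DOBs become NaN).
warehouse = pd.concat([a, b], ignore_index=True)
print(warehouse)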
https://www.youtube.com/watch?v=UKUq7hZdZUw (Data Integration)
1.10 DATA TRANSFORMATION:
It is a data preprocessing technique that transforms or consolidates the data into alternative forms appropriate for mining.
Here, the following four techniques are involved (steps a–c are sketched in code after this list; normalization is sketched at the end of the section):
a) Smoothing – removing noise from the data (using techniques such as binning, regression, or clustering).
b) Aggregation – using summary or aggregate functions, a data cube (multi-dimensional database) can be constructed. This process is very helpful in OLAP (On-Line Analytical Processing) operations.
c) Generalization – low-level concepts are replaced with higher-level concepts. Ex.: in some databases, street is replaced simply by city or country.
d) Normalization – attribute values are scaled so that they fall within a specified range. Ex.: given a set of values such as 2, 100, 1, 500, 35, 900, they must be rescaled so that all of them fall within a chosen range, say 0 to 1.
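
A sketch of steps a–c on a toy table (the column names, streets, and prices below are made up for illustration, and smoothing by bin means stands in for the binning technique mentioned above):

import pandas as pd

df = pd.DataFrame({
    "street": ["MG Road", "Anna Salai", "MG Road", "Brigade Road"],
    "city":   ["Bengaluru", "Chennai", "Bengaluru", "Bengaluru"],
    "price":  [2, 100, 1, 500],
})

# a) Smoothing by bin means: partition prices into equal-width bins,
#    then replace each value with the mean of its bin.
bins = pd.cut(df["price"], bins=2)
df["price_smoothed"] = df.groupby(bins, observed=True)["price"].transform("mean")

# b) Aggregation: summarize with an aggregate function, as an OLAP
#    data cube would along the "city" dimension.
per_city = df.groupby("city")["price"].sum()

# c) Generalization: replace the low-level concept (street) with the
#    higher-level one (city) and drop the detailed attribute.
df_general = df.drop(columns=["street"])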
This normalization can be done in two ways:
- Min-Max Normalization – the new value of an attribute is found using the formula
v' = (v - min_x) / (max_x - min_x), where
v' is the new value,
v is the actual (original) attribute value, and
min_x and max_x are the minimum and maximum values of the given set of elements.
In the above example, for the first value v = 2, min_x = 1 and max_x = 900, so v' = (2 - 1) / (900 - 1) ≈ 0.0011, and so on.
- Z-Score (Zero-Mean) Normalization – here the following formula is adopted:
v' = (v - x̄) / σ_x, where
v' is the new value,
v is the actual (original) attribute value,
x̄ is the mean of the attribute, and
σ_x is the standard deviation of the attribute.
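
A minimal sketch of both formulas applied to the example values above, in plain Python (the population standard deviation is assumed for the z-score; a sample standard deviation could equally be used):

import statistics

values = [2, 100, 1, 500, 35, 900]   # the example values from the note

# Min-max normalization into [0, 1]: v' = (v - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]
# for v = 2: (2 - 1) / (900 - 1) ≈ 0.0011

# Z-score normalization: v' = (v - mean) / standard deviation
mean = statistics.mean(values)       # ≈ 256.33
sigma = statistics.pstdev(values)    # population standard deviation
zscores = [(v - mean) / sigma for v in values]

print(minmax)
print(zscores)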
https://www.youtube.com/watch?v=RQ0I1u-q8N8 (Data Transformation)
