1.6 - Data Integration, 1.10 - Transformation
1.6 DATA INTEGRATION:
It is a technique to merge data from multiple sources into a coherent data store, such as a
data warehouse.
It is a preprocessing method that involves merging data files from different sources (flat
files, multi-dimensional databases, data cubes, etc.) in order to form a data store such as a
data warehouse.
Assume that our data warehouse is provided with data from two different companies,
named A and B.
Ex.:

Company A:
Emp.No  Name   DOB  Age  Price
1       John   ….   28   ₹100
2       Rasul  ….   32   ₹500

Company B:
Emp.ID  Name  Price
4       Siva  $100
5       Ram   $500
While maintaining and integrating the data in the warehouse, the following issues may
arise:
- Schema integration and object matching
Both companies store the same kind of value (a numeric identifier) under different
attribute names, Emp.No and Emp.ID. During automated integration, the system cannot
tell on its own that these two attributes refer to the same thing, so it may simply lump
all the numbers together in one place.
- Redundancy (unwanted attributes)
In Company A's table, the attributes DOB and Age are redundant, since each implies
the other: if DOB is known, Age can be calculated, and vice versa. So one of them
can simply be ignored or removed.
- Detection and resolution of data value conflicts
Here, Company A represents Price in rupees while Company B uses dollars, so simple
replacement or substitution is not sufficient: $100 and ₹100 are different amounts.
The mismatch must not only be detected but also resolved by correctly converting the
values (a short sketch follows this list).
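To make these three issues concrete, here is a minimal sketch in Python using pandas. The unified column name EmpID, the DOB values, and the INR-to-USD rate are illustrative assumptions, not data from these notes:

```python
import pandas as pd

# Company A: prices in rupees; DOB values are made-up placeholders
# (the notes elide the real dates)
company_a = pd.DataFrame({
    "Emp.No": [1, 2],
    "Name": ["John", "Rasul"],
    "DOB": ["1996-03-15", "1992-07-02"],   # hypothetical dates
    "Age": [28, 32],
    "Price": [100, 500],                   # rupees
})

# Company B: the same kind of records under different attribute names,
# with prices in dollars
company_b = pd.DataFrame({
    "Emp.ID": [4, 5],
    "Name": ["Siva", "Ram"],
    "Price": [100, 500],                   # dollars
})

# 1) Schema integration / object matching: map Emp.No and Emp.ID onto
#    one common attribute name before merging.
company_a = company_a.rename(columns={"Emp.No": "EmpID"})
company_b = company_b.rename(columns={"Emp.ID": "EmpID"})

# 2) Redundancy: Age is implied by DOB, so drop it.
company_a = company_a.drop(columns=["Age"])

# 3) Data value conflicts: convert rupees to dollars before combining
#    (83 INR per USD is an assumed placeholder rate).
INR_PER_USD = 83.0
company_a["Price"] = (company_a["Price"] / INR_PER_USD).round(2)

# The two sources can now be merged into one coherent store; columns
# missing from one source (here DOB for Company B) become NaN.
warehouse = pd.concat([company_a, company_b], ignore_index=True)
print(warehouse)
```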
https://www.youtube.com/watch?v=UKUq7hZdZUw (Data Integration)
1.10 DATA TRANSFORMATION:
It is a data preprocessing technique that transforms or consolidates the data into alternative
forms appropriate for mining.
Here, the following four steps are involved:
a) Smoothing – removing the noise from the data
(using techniques such as binning, regression, clustering, etc.)
b) Aggregation – summary or aggregate functions are applied to the data; for example,
a data cube (multi-dimensional database) can be constructed. This is particularly
helpful for OLAP (On-Line Analytical Processing) operations.
c) Generalization – here, low-level concepts are replaced with higher-level concepts.
Ex.: in some databases, street may be replaced simply by city or country.
d) Normalization – here, attribute values are normalized by scaling them so that
they fall within a specified range.
Ex.: given a set of values like 2, 100, 1, 500, 35, 900, these must be scaled so
that they all fall within a chosen range, say 0 to 1.
This normalization process can be done in two ways (a code sketch covering both
methods follows at the end of this section):
Min-Max Normalization – in this method, the new value of an attribute is found
by the formula
v′ = (v − min_x) / (max_x − min_x), where
v′ is the new value,
v is the actual/original attribute value,
min_x and max_x are the minimum and maximum values of the given set of elements.
In the example above, for the first value, v = 2, min_x = 1 and max_x = 900, so
v′ = (2 − 1) / (900 − 1) ≈ 0.0011; the remaining values are scaled the same way.
Z-score Normalization (Zero-mean Normalization)
Here, the following formula is adopted:
v′ = (v − x̄) / σ_x, where
v′ is the new value,
v is the actual/original attribute value,
x̄ is the mean of the attribute,
σ_x is the standard deviation of the attribute.
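As a minimal sketch, both normalization methods can be applied to the example values above using NumPy (the numbers come from the example in step d; everything else is standard arithmetic):

```python
import numpy as np

values = np.array([2, 100, 1, 500, 35, 900], dtype=float)

# Min-max normalization: v' = (v - min_x) / (max_x - min_x),
# which scales every value into the range [0, 1].
v_min, v_max = values.min(), values.max()
min_max = (values - v_min) / (v_max - v_min)
print(min_max)   # first value: (2 - 1) / (900 - 1) ≈ 0.0011

# Z-score (zero-mean) normalization: v' = (v - mean) / std,
# after which the values have mean 0 and standard deviation 1.
mean, std = values.mean(), values.std()
z_score = (values - mean) / std
print(z_score)
```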
https://www.youtube.com/watch?v=RQ0I1u-q8N8 (Data Transformation)