Lecture 4-Data Preprocessing - Integration
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data transformation and discretization
Summary
Data Integration
Data integration combines data from multiple sources into a coherent store.
Data Integration (Problem 1)
Attribute naming (in schema integration)
Problem: the entity identification problem, i.e., identifying the same real-world entities across multiple data sources. Attributes are named differently in different data sources, e.g., A.cust-id ≡ B.cust-#, so metadata from the different sources must be integrated.
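[Figure: source tables passing through an Extraction, Transformation, and Loading (ETL) tool into one integrated table. Attribute names: CustomerID, CustID, and ClientID are unified as CustomerID. Value codes: Gender stored as Male/Female in one source and M/F in another is unified as Male/Female. Units: Weight (kilograms) 6, 10 and Weight (pounds) 6, 10 are unified as Weight (kilograms) 6, 10, 2.72, 4.54.]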
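The three mismatches illustrated above (attribute names, value codes, units) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how any particular ETL tool works: the lookup tables, the function name integrate_record, its weight_unit parameter, and the customer ID values are all assumptions made for the example.

```python
# Hypothetical lookup tables; the names and codes come from the figure,
# but the table structure itself is an assumption for illustration.
COLUMN_ALIASES = {"CustID": "CustomerID", "ClientID": "CustomerID"}
GENDER_CODES = {"M": "Male", "F": "Female"}
KG_PER_POUND = 0.45359237  # definition of the international pound


def integrate_record(record, weight_unit="kg"):
    """Map one source record onto the unified schema (hypothetical helper)."""
    # 1. Rename source-specific attribute names to the unified name.
    unified = {COLUMN_ALIASES.get(name, name): value
               for name, value in record.items()}
    # 2. Normalize value encodings (M/F -> Male/Female).
    if "Gender" in unified:
        unified["Gender"] = GENDER_CODES.get(unified["Gender"], unified["Gender"])
    # 3. Convert units so every weight is stored in kilograms.
    if "Weight" in unified and weight_unit == "lb":
        unified["Weight"] = round(unified["Weight"] * KG_PER_POUND, 2)
    return unified


# Made-up customer IDs; the gender codes and weights follow the figure.
source_a = [{"CustID": 1, "Gender": "Male", "Weight": 6},
            {"CustID": 2, "Gender": "Female", "Weight": 10}]
source_b = [{"ClientID": 3, "Gender": "M", "Weight": 6},
            {"ClientID": 4, "Gender": "F", "Weight": 10}]

store = ([integrate_record(r) for r in source_a]
         + [integrate_record(r, weight_unit="lb") for r in source_b])

for row in store:
    print(row)
```

The pound weights 6 and 10 come out as 2.72 and 4.54 kilograms, matching the integrated column in the figure.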
Handling Redundant Data in Data Integration
Redundant data often occur when multiple databases are integrated
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table, e.g., annual revenue
Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
Redundancy can be checked using correlation analysis.
Correlation Analysis for Detecting Redundancy
For numeric attributes, we can use correlation and covariance.
The correlation between two attributes X and Y is given by:
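$$r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{Y})^2}} = \frac{\sum_{i=1}^{n}(x_i y_i) - n\bar{X}\bar{Y}}{n\,\sigma_X\,\sigma_Y}$$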
1. If r > 0, X and Y are positively correlated: as X increases, Y tends to increase. If r is close to 1, either X or Y can be removed as redundant.
2. If r = 0, X and Y have no linear correlation.
3. If r < 0, X and Y are negatively correlated: as X increases, Y tends to decrease.
Covariance between two attributes is
$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{n} = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}\right)$$
1. Covariance is not a very good measure of dependence, because a zero value does not always mean independence.
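The two measures can be sketched in plain Python as follows. The function names are assumptions chosen for this example, and the code uses the population form (division by n), as in the formulas above. The sample data is made up to illustrate the last point: Y is fully determined by X (Y = X²), yet the covariance is zero.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    """Population standard deviation (square root of Cov(X, X))."""
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """Correlation coefficient r = Cov(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Made-up data: Y depends entirely on X, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

cov_xy = covariance(xs, ys)
r_xy = correlation(xs, ys)
print(cov_xy, r_xy)  # both are zero even though Y is a function of X
```

Both measures only detect linear relationships, which is why a zero value cannot be read as proof of independence.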
Example: Correlation and Covariance
For the following data, find the correlation and covariance values.
Tuple    X    Y
t1       6   20
t2       5   10
t3       4   14
t4       3    5
t5       2    5
Total   20   54
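One way to work the example, assuming the first value column is X and the second is Y, and using the population formulas (division by n = 5):

$$\bar{X} = \frac{20}{5} = 4, \qquad \bar{Y} = \frac{54}{5} = 10.8$$

$$\sum_{i=1}^{5} x_i y_i = 120 + 50 + 56 + 15 + 10 = 251$$

$$\mathrm{Cov}(X,Y) = \frac{251 - 5 \cdot 4 \cdot 10.8}{5} = \frac{35}{5} = 7$$

$$\sigma_X = \sqrt{\frac{10}{5}} \approx 1.414, \qquad \sigma_Y = \sqrt{\frac{162.8}{5}} \approx 5.706$$

$$r_{X,Y} = \frac{35}{5 \cdot 1.414 \cdot 5.706} \approx 0.867$$

Here 10 and 162.8 are the sums of squared deviations of X and Y from their means. Under these assumptions the two attributes are strongly positively correlated.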
χ² Correlation Test for Nominal Data
A correlation relationship between two nominal attributes can be discovered by a χ² (chi-square) test.
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}$$
where