Lecture 8 -Data Prep-Integration - M
Data Mining
Lecture # 8
Data Preprocessing
(Ch # 3)
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data Transformation and
Discretization
Summary
Data Integration
Data integration:
combines data from multiple sources into a
coherent store
Data Integration (Problem 1)
Attribute naming (in schema integration)
Problem: the entity identification problem, i.e., identifying the same
real-world entities across multiple data sources. Attributes are often named
differently in different data sources, e.g., A.cust-id and
B.cust-# refer to the same attribute (metadata from the different sources must be integrated).
[Figure: the source columns CustomerID, CustID, and ClientID are mapped by an Extraction, Transformation, and Loading (ETL) tool to a single CustomerID column.]
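The attribute-naming problem above can be sketched in a few lines of Python. This is only an illustration, not how any particular ETL tool works: the alias table `COLUMN_ALIASES`, the helper `unify_columns`, and the sample records are all hypothetical, and the only names taken from the slide are CustomerID, CustID, and ClientID.

```python
# Hypothetical alias table: every source-specific name maps to one unified name.
COLUMN_ALIASES = {
    "CustomerID": "CustomerID",
    "CustID": "CustomerID",
    "ClientID": "CustomerID",
}


def unify_columns(record, aliases=COLUMN_ALIASES):
    """Return a copy of `record` with known alias keys renamed.

    Keys that are not in the alias table are kept unchanged.
    """
    return {aliases.get(key, key): value for key, value in record.items()}


# Made-up sample rows from two sources that name the same attribute differently.
source_a = {"CustID": 101, "City": "Lahore"}
source_b = {"ClientID": 102, "City": "Karachi"}

integrated = [unify_columns(row) for row in (source_a, source_b)]
print(integrated)
```

In practice the alias table would come from the integrated metadata of the sources rather than being hard-coded.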
[Figure: Gender is coded as Male/Female in one source and as M/F in another; after integration it is stored as Gender: Male/Female.]
[Figure: Weight (kilograms) 6, 10 from one source and Weight (pounds) 6, 10 from another are integrated into Weight (kilograms): 6, 10, 2.72, 4.54.]
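The two value-level conflicts shown above (different codes for gender, different units for weight) could be resolved along the following lines. This is a minimal sketch: the helpers `normalize_gender` and `to_kilograms` and the unit labels "kg" and "lb" are assumptions made for illustration, while the sample weights 6 and 10 come from the slide.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound

# Hypothetical code table: every known spelling maps to one canonical value.
GENDER_CODES = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}


def normalize_gender(value):
    """Map a source-specific gender code to the canonical label."""
    return GENDER_CODES[value.strip()]


def to_kilograms(weight, unit):
    """Convert a weight to kilograms, rounded to two decimals."""
    if unit == "kg":
        return round(float(weight), 2)
    if unit == "lb":
        return round(weight * LB_TO_KG, 2)
    raise ValueError(f"unknown unit: {unit!r}")


# Weights in pounds from the second source, converted to the unified unit.
print([to_kilograms(w, "lb") for w in (6, 10)])  # [2.72, 4.54]
print([normalize_gender(g) for g in ("M", "F", "Male")])
```

Raising an error on an unknown unit, rather than guessing, keeps silent unit mix-ups out of the integrated store.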
Other data integration
problems
Delays in delivering data
Data Privacy and Security risks
Data quality issues
Scalability
Performance
Complexity
Handling Redundant Data in Data
Integration
Redundant data often occur when
multiple databases are integrated
The same attribute may have different names in
different databases
One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Careful integration of the data from multiple
sources may help reduce/avoid redundancies
and inconsistencies and improve mining
speed and quality
Redundancy can be checked using
correlation analysis.
Correlation Analysis for Detecting
Redundancy
For numeric attributes we can use correlation and covariance
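Correlation between two attributes X and Y can be checked by:
$$r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}}{n\,\sigma_X\sigma_Y}$$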
1. Resulting value > 0: X and Y are positively correlated. If
X increases, Y also tends to increase. If the value of r is close to 1, either
X or Y can be removed as redundant.
2. Resulting value = 0: X and Y have no linear correlation.
3. Resulting value < 0: X and Y are negatively correlated. If
the value of X increases, the value of Y tends to decrease.
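The correlation coefficient and its three cases can be checked with a short Python sketch. The function name `pearson_r` and the sample lists are hypothetical; the computation follows the first form of the formula (sum of products of deviations divided by the product of the root sums of squared deviations).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return numerator / denominator


# Made-up data: y is an exact multiple of x, so r is close to 1 (case 1).
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# Made-up data: y falls as x rises, so r is close to -1 (case 3).
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```

An r close to 1, as in the first call, is the situation in which one of the two attributes is a candidate for removal.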
Covariance between two attributes is
$$\mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}\right)$$
1. It is not a very good measure because a zero value does not
always mean independence.
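The point that zero covariance does not imply independence can be illustrated with a small Python sketch. The function `covariance` and the data are made up for this purpose: y is fully determined by x (y = x squared), yet the covariance comes out as exactly zero because the x values are symmetric around 0.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Made-up data: y depends completely on x, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # [4, 1, 0, 1, 4]

cov_xy = covariance(xs, ys)
print(cov_xy)  # 0.0, although y is a function of x
```

Covariance only captures the linear part of a relationship, which is why a zero value here says nothing about the quadratic dependence.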
Example: Correlation and Covariance
For the following data, find the correlation and covariance
values.
χ² Correlation Test for Nominal Data
A correlation relationship between two nominal attributes can
be discovered by a χ² (chi-square) test:
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}$$
where