Lecture 4-Data Preprocessing - Integration

Uploaded by

ÀbdUŁ ßaŠiŤ
Copyright
© © All Rights Reserved

Data Mining

Data Preprocessing
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration
• Data reduction
• Data transformation and discretization
• Summary
Data Integration
• Data integration combines data from multiple sources into a coherent store.

Data Integration (Problem 1)
• Attribute naming (in schema integration)
Problem: Entity identification problem: identify real-world entities from
multiple data sources. Attributes are named differently across different
data sources, e.g., A.cust-id ≡ B.cust-# (integrate metadata from
different sources).

[Figure: multiple sources with attributes named CustomerID, CustID, and ClientID pass through an Extraction, Transformation, and Loading (ETL) tool into a coherent store with a single CustomerID attribute.]

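The renaming step in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping table `CANONICAL_NAMES`, the helper `unify_schema`, and the sample records are hypothetical and stand in for whatever metadata a real ETL tool would use.

```python
# Hypothetical mapping from source-specific attribute names to one canonical name.
CANONICAL_NAMES = {
    "CustomerID": "CustomerID",
    "CustID": "CustomerID",
    "ClientID": "CustomerID",
}


def unify_schema(record):
    """Return a copy of `record` with source-specific keys renamed to canonical ones."""
    return {CANONICAL_NAMES.get(key, key): value for key, value in record.items()}


# Three hypothetical sources that name the same attribute differently.
source_a = [{"CustomerID": 101, "city": "Islamabad"}]
source_b = [{"CustID": 102, "city": "Lahore"}]
source_c = [{"ClientID": 103, "city": "Karachi"}]

# Load every record into one store that uses a single attribute name.
coherent_store = [
    unify_schema(record)
    for source in (source_a, source_b, source_c)
    for record in source
]
print(coherent_store)
```

Attributes that are not listed in the mapping keep their original names, so unknown columns are passed through rather than silently dropped.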
Data Integration (Problem 2)
• Data Encoding
Problem: The same attribute has the same values denoted in
different ways.

[Figure: one source encodes Gender as Male/Female and another as M/F; the coherent store encodes Gender as Male/Female.]

Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’ or ‘ISBD’, may be
misspelled, or may be inconsistently capitalized (some programs are
case-sensitive).
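One way to resolve such encoding differences is a lookup table applied after case folding. The sketch below assumes hypothetical code tables (`GENDER_CODES`, `CITY_CODES`) and a hypothetical helper `normalize`; it handles alternative codes and capitalization, but not misspellings, which would need fuzzy matching or manual review.

```python
# Hypothetical code tables: every known variant (lower-cased) maps to one canonical value.
GENDER_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
CITY_CODES = {"isb": "Islamabad", "isbd": "Islamabad", "islamabad": "Islamabad"}


def normalize(value, codes):
    """Map a raw value to its canonical form, ignoring case and surrounding spaces."""
    key = value.strip().lower()
    # Unknown values are returned unchanged so they can be reviewed manually.
    return codes.get(key, value)


raw_genders = ["Male", "F", "m", "Female"]
raw_cities = ["isb", "ISB", "ISBD", "Islamabad"]

clean_genders = [normalize(v, GENDER_CODES) for v in raw_genders]
clean_cities = [normalize(v, CITY_CODES) for v in raw_cities]
print(clean_genders)
print(clean_cities)
```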
Data Integration (Problem 3)
• Measurement Basis (data value conflicts)
Problem: For the same real-world entity, attribute values
from different sources are different. Possible reasons:
different representations or different scales, e.g., metric vs.
British units, kg vs. lb.

[Figure: one source records Weight in kilograms (6, 10) and another records Weight in pounds (6, 10); the coherent store records Weight in kilograms (6, 10, 2.72, 4.54).]

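The conversion shown in the figure can be sketched as follows. The function name `to_kilograms` and the `(value, unit)` tuple layout are assumptions made for this illustration; the conversion factor is the standard definition of the pound.

```python
KG_PER_LB = 0.45359237  # definition of the international avoirdupois pound


def to_kilograms(value, unit):
    """Convert a weight to kilograms; `unit` must be 'kg' or 'lb'."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_LB
    raise ValueError(f"unknown unit: {unit}")


# Values from the slide: one source in kilograms, one in pounds.
source_kg = [(6, "kg"), (10, "kg")]
source_lb = [(6, "lb"), (10, "lb")]

coherent_store = [round(to_kilograms(v, u), 2) for v, u in source_kg + source_lb]
print(coherent_store)  # [6, 10, 2.72, 4.54]
```

Raising an error for an unrecognized unit is a deliberate choice here: silently passing the value through would reintroduce the very conflict the integration step is meant to remove.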
Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated.
• The same attribute may have different names in different databases.
• One attribute may be a “derived” attribute in another table, e.g., annual revenue.
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
• Redundancy can be checked using correlation analysis.
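Correlation Analysis for Detecting Redundancy
• For numeric attributes we can use correlation and covariance.
• The correlation between two attributes X and Y can be checked with the correlation coefficient:

$$r_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{X}\,\bar{Y}}{n\,\sigma_X\,\sigma_Y}$$

where $\bar{X}$ and $\bar{Y}$ are the means and $\sigma_X$ and $\sigma_Y$ the standard deviations (with divisor $n$) of X and Y.
1. Resulting value > 0: X and Y are positively correlated. If X increases, Y also increases. If the value of r is close to 1, either X or Y can be removed as redundant.
2. Resulting value = 0: X and Y have no linear correlation (this alone does not guarantee independence).
3. Resulting value < 0: X and Y are negatively correlated. If the value of X increases, the value of Y decreases.
• The covariance between two attributes is

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n} = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\,\bar{X}\,\bar{Y}\right)$$

1. It is not a very good measure because a zero value does not always mean independence.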
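The remark that zero covariance does not imply independence can be checked on a small constructed case. The data below are hypothetical and chosen for illustration: Y is completely determined by X (Y = X²), yet the covariance is zero because the relationship is not linear.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x - mean_x) * (y - mean_y))."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: ys depends entirely on xs, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

cov_xy = covariance(xs, ys)
print(cov_xy)  # zero, although ys is a function of xs
```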
Example: Correlation and Covariance
• For the following data, find the correlation and covariance values.

Time point   AllElectronics (X)   HighTech (Y)
t1           6                    20
t2           5                    10
t3           4                    14
t4           3                    5
t5           2                    5
Total        20                   54
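
A possible worked solution for the table above, written as a short Python sketch. It uses the population forms of the formulas (divisor n); the variable names are chosen here for illustration. With these data the covariance comes out at 7 and the correlation coefficient at roughly 0.87, so the two attributes tend to rise together.

```python
from math import sqrt

# Data from the table: AllElectronics (X) and HighTech (Y) at t1..t5.
x = [6, 5, 4, 3, 2]
y = [20, 10, 14, 5, 5]
n = len(x)

mean_x = sum(x) / n  # 20 / 5 = 4
mean_y = sum(y) / n  # 54 / 5 = 10.8

# Cov(X, Y) = (sum(x_i * y_i) - n * mean_x * mean_y) / n
sum_xy = sum(a * b for a, b in zip(x, y))  # 251
cov_xy = (sum_xy - n * mean_x * mean_y) / n

# Population standard deviations (divisor n).
sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)

# r = Cov(X, Y) / (sigma_x * sigma_y)
r_xy = cov_xy / (sigma_x * sigma_y)

print(round(cov_xy, 3), round(r_xy, 3))
```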

χ² Correlation Test for Nominal Data
• A correlation relationship between two nominal attributes A and B can be discovered by a χ² (chi-square) test:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where

$$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}$$

and the r values of A label the rows and the c values of B label the columns of the contingency table.
Example: For the following data, find whether the two attributes are
independent or not.

              male   female   Total
Fiction        250      200     450
Non-fiction     50     1000    1050
Total          300     1200    1500

χ² Correlation Test for Nominal Data
We formulate our null and alternative hypotheses as follows.
• Hypotheses
H0: There is no association between the two attributes (variables)
H1: There is an association between the two attributes (variables)
• Test statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where
o_ij is the observed cell count in the ith row and jth column of the table, and
e_ij is the expected cell count in the ith row and jth column of the table,
computed as

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$

• The calculated value is then compared to the critical value from
the χ² distribution table with degrees of freedom df = (r - 1)(c - 1) and
chosen significance level α = 0.05 or 0.01. If the calculated value >
critical value, then we reject the null hypothesis.
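
The fiction/non-fiction example can be worked through with a short Python sketch. Variable names are chosen here for illustration, and the critical values are the standard χ² table entries for one degree of freedom. For these counts the statistic is about 507.9, far above either critical value, so the null hypothesis is rejected and the two attributes are taken to be associated.

```python
# Observed counts: rows are Fiction / Non-fiction, columns are male / female.
observed = [
    [250, 200],
    [50, 1000],
]

row_totals = [sum(row) for row in observed]        # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 1200]
n = sum(row_totals)                                # 1500

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        # Expected count: (row i total * column j total) / n
        e_ij = row_totals[i] * col_totals[j] / n
        chi_square += (o_ij - e_ij) ** 2 / e_ij

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical values from a chi-square table for df = 1.
CRITICAL_DF1 = {0.05: 3.841, 0.01: 6.635}
reject_h0 = chi_square > CRITICAL_DF1[0.05]

print(round(chi_square, 2), df, reject_h0)
```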
