
CS06504

Data Mining
Lecture # 8
Data Preprocessing
(Ch # 3)
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration
• Data reduction
• Data transformation and discretization
• Summary
Data Integration
• Data integration: combines data from multiple sources into a coherent store
Data Integration (Problem 1)
• Attribute naming (in schema integration)
Problem: the entity identification problem, i.e., identifying real-world entities from multiple data sources. The same attribute may be named differently across data sources, e.g., A.cust-id ≡ B.cust-#, so metadata from the different sources must be integrated.
[Figure: attributes named CustomerID, CustID, and ClientID in the multiple sources are mapped to a single CustomerID column in the coherent store by an Extraction, Transformation, and Loading (ETL) tool; a small sketch of this step follows below.]
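Below is a minimal sketch, not part of the lecture, of how an ETL step could reconcile attribute names; the tables, column names, and values are hypothetical, and pandas is assumed to be available.

```python
# Minimal sketch (hypothetical data) of the attribute-naming step an ETL tool
# performs: map differently named ID columns onto one common schema.
import pandas as pd

source_a = pd.DataFrame({"CustomerID": [1, 2], "name": ["Ali", "Sara"]})
source_b = pd.DataFrame({"CustID": [3, 4], "name": ["Omar", "Hina"]})
source_c = pd.DataFrame({"ClientID": [5, 6], "name": ["Zain", "Noor"]})

# Metadata mapping: every source-specific name -> the canonical attribute name
column_map = {"CustID": "CustomerID", "ClientID": "CustomerID"}

coherent_store = pd.concat(
    [df.rename(columns=column_map) for df in (source_a, source_b, source_c)],
    ignore_index=True,
)
print(coherent_store)  # one CustomerID column covering all three sources
```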
Data Integration (Problem 2)
• Data Encoding
Problem: the same attribute has the same values denoted in different ways across sources.
[Figure: one source encodes Gender as Male/Female and another as M/F; in the coherent store the values are unified to a single encoding, e.g., Male/Female.]
Similarly, Islamabad might be denoted as 'isb', 'ISB', or 'ISBD', may be misspelled, or may be inconsistently capitalized (some programs are case-sensitive). A small normalization sketch follows below.
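A minimal sketch, with hypothetical values, of how such encoding differences could be normalized before loading; the mapping tables are assumptions, not part of the lecture.

```python
# Minimal sketch (hypothetical data): unify Gender codes and city-name variants.
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "F", "M", "Female"],
                   "City": ["isb", "ISB", "ISBD", "Islamabad"]})

gender_map = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}
city_map = {"isb": "Islamabad", "isbd": "Islamabad", "islamabad": "Islamabad"}

df["Gender"] = df["Gender"].map(gender_map)
# Lower-case first so the lookup is not case-sensitive, then map known variants
df["City"] = df["City"].str.lower().map(city_map)
print(df)
```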
Data Integration (Problem 3)
• Measurement Basis (data value conflicts)
Problem: for the same real-world entity, attribute values from different sources are different.
Possible reasons: different representations or different scales, e.g., metric vs. British units, kg vs. lb.
[Figure: one source stores Weight in kilograms (6, 10) and another in pounds (6, 10); in the coherent store all weights are kept in kilograms, so the pound values become 2.72 and 4.54.]
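A minimal sketch, under the assumption that each source record carries a unit label, of converting pounds to kilograms during integration; the column names are hypothetical.

```python
# Minimal sketch (hypothetical data): convert pound weights to kilograms
# so the coherent store uses a single measurement basis.
import pandas as pd

LB_TO_KG = 0.4536  # 1 pound is approximately 0.4536 kg

weights_kg = pd.DataFrame({"Weight": [6, 10], "Unit": ["kg", "kg"]})
weights_lb = pd.DataFrame({"Weight": [6, 10], "Unit": ["lb", "lb"]})

combined = pd.concat([weights_kg, weights_lb], ignore_index=True)
combined["Weight"] = combined["Weight"].astype(float)
combined.loc[combined["Unit"] == "lb", "Weight"] *= LB_TO_KG
combined["Unit"] = "kg"
print(combined.round(2))  # Weight column: 6.00, 10.00, 2.72, 4.54 (all kg)
```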
Other Data Integration Problems
• Delays in delivering data
• Data privacy and security risks
• Data quality issues
• Scalability
• Performance
• Complexity
Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple databases
• The same attribute may have different names in different databases
• One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
• Redundancy can be checked using correlation analysis.
Correlation Analysis for Detecting Redundancy
• For numeric attributes we can use correlation and covariance.
• The correlation between two attributes X and Y can be checked by:

$$r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}}{n\,\sigma_X \sigma_Y}$$

1. If the resulting value > 0, then X and Y are positively correlated: if X increases, Y also increases. If the value of r is close to 1, either X or Y can be removed as redundant.
2. If the resulting value = 0, then X and Y are independent (there is no linear correlation).
3. If the resulting value < 0, then X and Y are negatively correlated: if the value of X increases, the value of Y decreases.
• The covariance between two attributes is:

$$\mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y}) = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i\right) - \bar{X}\bar{Y}$$

1. Covariance alone is not a very good measure because a zero value does not always mean independence.

A computation sketch follows below.
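A minimal NumPy sketch of this redundancy check, using the AllElectronics/HighTech values from the example on the next slide; this code is illustrative and not part of the lecture.

```python
# Minimal sketch: Pearson correlation and (population) covariance of two
# numeric attributes, following the formulas on the slide above.
import numpy as np

x = np.array([6, 5, 4, 3, 2], dtype=float)     # AllElectronics (X)
y = np.array([20, 10, 14, 5, 5], dtype=float)  # HighTech (Y)

# Cov(X, Y) = (1/n) * sum((x_i - mean_x) * (y_i - mean_y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# r = Cov(X, Y) / (sigma_X * sigma_Y); np.std defaults to the population sigma
r_xy = cov_xy / (x.std() * y.std())

print(cov_xy)  # 7.0
print(r_xy)    # about 0.87 -> strong positive correlation, likely redundant
```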
Example: Correlation and Covariance
• For the following data, find the correlation and covariance values.

Time point   AllElectronics (X)   HighTech (Y)
t1           6                    20
t2           5                    10
t3           4                    14
t4           3                    5
t5           2                    5
Total        20                   54
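Working the formulas through by hand (a worked sketch, not part of the original slide): with $\bar{X} = 20/5 = 4$ and $\bar{Y} = 54/5 = 10.8$,

$$\mathrm{Cov}(X,Y) = \frac{6\cdot 20 + 5\cdot 10 + 4\cdot 14 + 3\cdot 5 + 2\cdot 5}{5} - 4 \times 10.8 = \frac{251}{5} - 43.2 = 7$$

$$r_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{7}{\sqrt{2}\times\sqrt{32.56}} \approx 0.87$$

Since the covariance is positive and r is close to 1, the two attributes rise and fall together, and one of them is a candidate for removal as redundant.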
χ² Correlation Test for Nominal Data
• A correlation relationship between two nominal attributes can be discovered by a χ² (chi-square) test.

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where

$$e_{ij} = \frac{count(A = a_i)\cdot count(B = b_j)}{n}$$

Example: For the following data, find whether the two attributes are independent or not.

              male   female   Total
Fiction       250    200      450
Non-fiction   50     1000     1050
Total         300    1200     1500
χ² Correlation Test for Nominal Data
We formulate our null and alternative hypotheses as:
• Hypotheses
H0: There is no association between the two attributes (variables)
H1: There is an association between the two attributes (variables)
• Test statistic

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where
o_ij is the observed cell count in the ith row and jth column of the table, and
e_ij is the expected cell count in the ith row and jth column of the table, computed as

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$

• The calculated value is then compared with the critical value from the χ² distribution table with degrees of freedom df = (R - 1)(C - 1) at a chosen significance level (e.g., 0.05 or 0.01). If the calculated value > critical value, then we reject the null hypothesis. A computation sketch for the example above follows below.
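A minimal sketch of this test on the fiction/gender table above, assuming SciPy is available; correction=False disables the Yates continuity correction so the result matches the plain formula.

```python
# Minimal sketch: chi-square test of independence for the fiction/gender table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # fiction:     male, female
                     [50, 1000]])    # non-fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)   # [[ 90. 360.] [210. 840.]]  (row total * column total / n)
print(chi2, dof)  # about 507.9 with df = 1
print(p_value)    # far below 0.05 -> reject H0: the attributes are associated
```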
. .

You might also like