Lecture 8 -Data Prep-Integration - M

The document discusses data preprocessing in data mining, focusing on the importance of data cleaning, integration, reduction, and transformation. It highlights various challenges in data integration, such as attribute naming discrepancies, data encoding issues, and measurement basis conflicts. Additionally, it covers methods for handling redundant data and detecting correlations using statistical analysis techniques like correlation and chi-square tests.

Uploaded by

gihel53025

CS06504

Data Mining
Lecture # 8
Data Preprocessing
(Ch # 3)
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration
• Data reduction
• Data transformation and discretization
• Summary
Data Integration
• Data integration combines data from multiple sources into a coherent store.
Data Integration (Problem 1)
• Attribute naming (in schema integration)
Problem: Entity identification problem: identify real-world entities from multiple data sources. Attributes are named differently across data sources, e.g., A.cust-id ≡ B.cust-# (integrate metadata from the different sources).

[Figure: multiple sources name the same attribute CustomerID, CustID, and ClientID; an Extraction, Transformation, and Loading (ETL) tool maps them to a single CustomerID attribute in the coherent store.]
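The renaming step performed by an ETL tool can be sketched as a lookup from source-specific attribute names to one canonical name. This is only a minimal illustration: the mapping table, the `unify_schema` helper, and the sample records are hypothetical and do not come from the lecture.

```python
# Hypothetical mapping from source-specific attribute names to one canonical name.
CANONICAL_NAMES = {
    "CustomerID": "CustomerID",
    "CustID": "CustomerID",
    "ClientID": "CustomerID",
}


def unify_schema(record, name_map=CANONICAL_NAMES):
    """Return a copy of `record` whose attribute names are canonical."""
    unified = {}
    for name, value in record.items():
        # Names missing from the map are kept unchanged.
        unified[name_map.get(name, name)] = value
    return unified


# Three illustrative sources that name the same attribute differently.
source_a = [{"CustomerID": 101, "city": "Lahore"}]
source_b = [{"CustID": 102, "city": "Karachi"}]
source_c = [{"ClientID": 103, "city": "Islamabad"}]

coherent_store = [
    unify_schema(record)
    for source in (source_a, source_b, source_c)
    for record in source
]
```

In practice the mapping would come from the metadata of each source rather than being hard-coded.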
Data Integration (Problem 2)
• Data Encoding
Problem: The same attribute has the same values denoted in different ways.

[Figure: one source encodes Gender as Male/Female and another as M/F; the coherent store uses Male/Female throughout.]

Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’, or ‘ISBD’, may be misspelled, or may be inconsistently capitalized (some programs are case-sensitive).
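One way to resolve encoding differences is to normalize case and whitespace and then look each value up in a code table. The sketch below assumes hypothetical code tables (`GENDER_CODES`, `CITY_CODES`) and a hypothetical `normalize` helper. It does not attempt to correct misspellings, which would need a separate step such as manual review.

```python
# Hypothetical code tables: lower-cased variants mapped to one canonical value.
GENDER_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
CITY_CODES = {"isb": "Islamabad", "isbd": "Islamabad", "islamabad": "Islamabad"}


def normalize(value, codes):
    """Map a raw value to its canonical form, ignoring case and outer spaces."""
    key = value.strip().lower()
    # Unknown values (e.g. misspellings) are returned unchanged for later review.
    return codes.get(key, value)


genders = [normalize(v, GENDER_CODES) for v in ["M", "Female", "f", "male"]]
cities = [normalize(v, CITY_CODES) for v in ["isb", "ISB", "ISBD", "Islamabad"]]
```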
Data Integration (Problem 3)
• Measurement Basis (data value conflicts)
Problem: For the same real-world entity, attribute values from different sources differ. Possible reasons: different representations or different scales, e.g., metric vs. British units (kg vs. lb).

[Figure: one source records Weight in kilograms (6, 10) and another in pounds (6, 10); the coherent store holds Weight in kilograms (6, 10, 2.72, 4.54).]
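The conversion in the weight example can be sketched as follows, assuming each source declares its unit. The `to_kilograms` helper and the `(values, unit)` layout are illustrative choices, not part of the lecture.

```python
KG_PER_POUND = 0.45359237  # definition of the international avoirdupois pound


def to_kilograms(value, unit):
    """Convert a weight to kilograms; `unit` is 'kg' or 'lb'."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return value * KG_PER_POUND
    raise ValueError(f"unknown unit: {unit!r}")


# Two illustrative sources holding the same numbers on different scales.
sources = [([6, 10], "kg"), ([6, 10], "lb")]

coherent_store = [
    round(to_kilograms(value, unit), 2)
    for values, unit in sources
    for value in values
]
```

Rounded to two decimals, the pound values 6 and 10 become 2.72 and 4.54 kilograms, matching the slide.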
Other Data Integration Problems
• Delays in delivering data
• Data privacy and security risks
• Data quality issues
• Scalability
• Performance
• Complexity
Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated:
  - The same attribute may have different names in different databases
  - One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
• Redundancy can be checked using correlation analysis.
Correlation Analysis for Detecting Redundancy
• For numeric attributes we can use correlation and covariance.
• The correlation between two attributes X and Y can be checked with:

$$r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{Y})^2}} = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{X}\bar{Y}}{n\,\sigma_X\,\sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y (computed with divisor n).

1. Resulting value > 0: X and Y are positively correlated. If X increases, Y tends to increase as well. If r is close to 1, either X or Y can be removed as redundant.
2. Resulting value = 0: X and Y are uncorrelated (there is no linear relationship between them).
3. Resulting value < 0: X and Y are negatively correlated. If the value of X increases, the value of Y tends to decrease.
• The covariance between two attributes is

$$\mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) = \frac{1}{n}\left(\sum_{i=1}^{n}x_i y_i - n\bar{X}\bar{Y}\right)$$

1. It is not a very good measure of independence, because a zero value does not always mean independence.
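The two measures can be sketched in plain Python using the divide-by-n forms given above. The function names and the sample data are illustrative assumptions. The second data set is chosen to show why a zero covariance does not imply independence: y is fully determined by x, yet the covariance is zero.

```python
import math


def covariance(xs, ys):
    """Population covariance: (1/n) * sum(x*y) - mean(x) * mean(y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


def correlation(xs, ys):
    """Pearson correlation coefficient r between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Redundant attributes: the same weights stored in kilograms and in pounds.
weight_kg = [6.0, 10.0, 2.0, 8.0]
weight_lb = [w / 0.45359237 for w in weight_kg]
r_redundant = correlation(weight_kg, weight_lb)  # very close to 1

# Dependent but uncorrelated: y = x**2 on values symmetric around zero.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
cov_dependent = covariance(xs, ys)  # zero, although y depends on x
```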
Example: Correlation and Covariance
• For the following data, find the correlation and covariance values.

Time point   AllElectronics (X)   HighTech (Y)
t1                    6                20
t2                    5                10
t3                    4                14
t4                    3                 5
t5                    2                 5
Total                20                54
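As a worked sketch, the values below follow from applying the divide-by-n definitions of covariance and correlation to the five time points. The arithmetic is derived here from the table and is not quoted from the slides.

$$\bar{X} = \frac{20}{5} = 4, \qquad \bar{Y} = \frac{54}{5} = 10.8$$

$$\sum_{i=1}^{5} x_i y_i = 6(20) + 5(10) + 4(14) + 3(5) + 2(5) = 251$$

$$\mathrm{Cov}(X,Y) = \frac{251}{5} - 4 \times 10.8 = 50.2 - 43.2 = 7$$

$$\sigma_X = \sqrt{\frac{10}{5}} \approx 1.414, \qquad \sigma_Y = \sqrt{\frac{162.8}{5}} \approx 5.706$$

$$r_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} \approx \frac{7}{1.414 \times 5.706} \approx 0.867$$

The covariance is positive and r is fairly close to 1, so the two stock prices tend to rise together.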
χ² Correlation Test for Nominal Data
• A correlation relationship between two nominal attributes can be discovered by a χ² (chi-square) test.

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}$$

where

$$e_{ij} = \frac{\mathrm{count}(A=a_i)\times \mathrm{count}(B=b_j)}{n}$$

Example: For the following data, find whether the two attributes are independent or not.

              male   female   Total
Fiction        250      200     450
Non-fiction     50     1000    1050
Total          300     1200    1500
χ² Correlation Test for Nominal Data
We formulate our null and alternative hypotheses as follows.
• Hypotheses
H0: There is no association between the two attributes (variables)
H1: There is an association between the two attributes (variables)
• Test statistic

$$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}$$

where the sum runs over all cells of a table with R rows and C columns,
$o_{ij}$ is the observed cell count in the ith row and jth column of the table, and
$e_{ij}$ is the expected cell count in the ith row and jth column of the table, computed as

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$

• The calculated value is then compared to the critical value from the χ² distribution table with degrees of freedom df = (R - 1)(C - 1) and a chosen significance level of 0.05 or 0.01. If the calculated value > critical value, then we reject the null hypothesis.
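The test can be sketched for the fiction/gender table as follows. The `chi_square` helper is an illustrative name. The critical value 3.841 is the standard χ² table entry for df = 1 at the 0.05 level, supplied here as an assumption rather than taken from the slides.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count under independence: row total * column total / n.
            e_ij = row_totals[i] * col_totals[j] / n
            statistic += (o_ij - e_ij) ** 2 / e_ij

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: Fiction, Non-fiction. Columns: male, female.
observed = [[250, 200],
            [50, 1000]]

statistic, df = chi_square(observed)

CRITICAL_VALUE = 3.841  # chi-square table, df = 1, significance level 0.05
reject_h0 = statistic > CRITICAL_VALUE
```

Here the expected counts are 90, 360, 210, and 840, giving a statistic of about 507.94 with df = 1. This is far above the critical value, so H0 is rejected and the two attributes are judged to be associated.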
. .

BAC Product and Application Handbook
No ratings yet
BAC Product and Application Handbook
513 pages
The mathematics of quantum mechanics
From Everand
The mathematics of quantum mechanics
Alessio Mangoni
No ratings yet
Queue Program in C
No ratings yet
Queue Program in C
3 pages
Elevation..retail Store
No ratings yet
Elevation..retail Store
1 page
Lecture 4-Data Preprocessing - Integration
No ratings yet
Lecture 4-Data Preprocessing - Integration
12 pages
UpdatedUnit 1 Data Preprocessing
No ratings yet
UpdatedUnit 1 Data Preprocessing
38 pages
CH 03-01 Data Preprocessing
No ratings yet
CH 03-01 Data Preprocessing
27 pages
Ch 3-Final
No ratings yet
Ch 3-Final
39 pages
Ue22cs342aa2 20240827192243
No ratings yet
Ue22cs342aa2 20240827192243
28 pages
DP
No ratings yet
DP
44 pages
DM_merged
No ratings yet
DM_merged
169 pages
Unit 3
No ratings yet
Unit 3
164 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
80 pages
03 Pre Processing
No ratings yet
03 Pre Processing
89 pages
IT326 - Ch3
No ratings yet
IT326 - Ch3
33 pages
' 3 IT326 - Ch2 - Pre-Processing
No ratings yet
' 3 IT326 - Ch2 - Pre-Processing
48 pages
data mining 3
No ratings yet
data mining 3
57 pages
03Preprocessing (2)
No ratings yet
03Preprocessing (2)
80 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
56 pages
Slide 05 Chapter3 Data Preprocessing
No ratings yet
Slide 05 Chapter3 Data Preprocessing
58 pages
2020 Preprocessing
No ratings yet
2020 Preprocessing
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
39 pages
Wk6 Preprocessing
No ratings yet
Wk6 Preprocessing
64 pages
Lecture6
No ratings yet
Lecture6
19 pages
03 Preprocessing
No ratings yet
03 Preprocessing
63 pages
Concepts and Techniques: - Chapter 3
No ratings yet
Concepts and Techniques: - Chapter 3
63 pages
03Preprocessing_20160222
No ratings yet
03Preprocessing_20160222
65 pages
PPT1
No ratings yet
PPT1
93 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
63 pages
Module 2
No ratings yet
Module 2
62 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
03 Preprocessing
No ratings yet
03 Preprocessing
63 pages
03preprocessing Part2
No ratings yet
03preprocessing Part2
15 pages
Data Mining: Dosen: Dr. Vitri Tundjungsari
No ratings yet
Data Mining: Dosen: Dr. Vitri Tundjungsari
64 pages
03preprocessing DMDW
No ratings yet
03preprocessing DMDW
81 pages
Chapter3 DataPreprocessing
No ratings yet
Chapter3 DataPreprocessing
50 pages
Concepts and Techniques: - Chapter 3
No ratings yet
Concepts and Techniques: - Chapter 3
64 pages
Data Mining P5
No ratings yet
Data Mining P5
32 pages
_03Preprocessing
No ratings yet
_03Preprocessing
60 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
61 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
Data Pre Processing
No ratings yet
Data Pre Processing
63 pages
Lec7
No ratings yet
Lec7
45 pages
03 Pre Processing
No ratings yet
03 Pre Processing
63 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
54 pages
Lecture 3
No ratings yet
Lecture 3
47 pages
Chapter 3
No ratings yet
Chapter 3
63 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
62 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
66 pages
Data Mining and Knowledge Discovery
No ratings yet
Data Mining and Knowledge Discovery
65 pages
Unit 2 Data Preprocessing
No ratings yet
Unit 2 Data Preprocessing
40 pages
Mining
No ratings yet
Mining
63 pages
Unit2 Part2
No ratings yet
Unit2 Part2
67 pages
10-1 Data analysis and pre-processing part 3.pdf
No ratings yet
10-1 Data analysis and pre-processing part 3.pdf
19 pages
Chapter 3: Data Preprocessing
100% (1)
Chapter 3: Data Preprocessing
41 pages
Module 5 03preprocessing
No ratings yet
Module 5 03preprocessing
63 pages
Data_Preprocessing-1-19
No ratings yet
Data_Preprocessing-1-19
19 pages
Chapter 3
No ratings yet
Chapter 3
56 pages
Lecture 2.3.1-2.3.3
No ratings yet
Lecture 2.3.1-2.3.3
67 pages
DM LAQs (CT 1)
No ratings yet
DM LAQs (CT 1)
40 pages
03Preprocessing
No ratings yet
03Preprocessing
38 pages
Co-Clustering: Models, Algorithms and Applications
From Everand
Co-Clustering: Models, Algorithms and Applications
Gérard Govaert
No ratings yet
Lecture 7 -Data Preprocessing - Cleaning-M
No ratings yet
Lecture 7 -Data Preprocessing - Cleaning-M
21 pages
Lecture 9 -Data Prep - Reduction - PCA-M
No ratings yet
Lecture 9 -Data Prep - Reduction - PCA-M
44 pages
Company Research
No ratings yet
Company Research
5 pages
Chapter_5_v8.2
No ratings yet
Chapter_5_v8.2
21 pages
Lecture 13-Supervised Learning-Decision Trees-M
No ratings yet
Lecture 13-Supervised Learning-Decision Trees-M
47 pages
Lecture 12 - Weka Tutorial
No ratings yet
Lecture 12 - Weka Tutorial
84 pages
Lecture 10-Assiciation Rule Mining-I-M
No ratings yet
Lecture 10-Assiciation Rule Mining-I-M
30 pages
synthetic_tourism_dataset
No ratings yet
synthetic_tourism_dataset
112 pages
synthetic_tourism_dataset_gilgit_baltistan
No ratings yet
synthetic_tourism_dataset_gilgit_baltistan
112 pages
Key Performance Indicator & Monitoring (KPIM) : Actual
No ratings yet
Key Performance Indicator & Monitoring (KPIM) : Actual
13 pages
DOC-20250410-WA0005.
No ratings yet
DOC-20250410-WA0005.
18 pages
Module 1 - Lesson 2 Thoughts
No ratings yet
Module 1 - Lesson 2 Thoughts
2 pages
Flange Calculation As Per BS - XLSX - 170111
No ratings yet
Flange Calculation As Per BS - XLSX - 170111
7 pages
WSMOI - Session 05
No ratings yet
WSMOI - Session 05
38 pages
02b Ethernet Frames Exercise
No ratings yet
02b Ethernet Frames Exercise
6 pages
Power Steering Gears
No ratings yet
Power Steering Gears
5 pages
Circuit Breaker and Switchgear Vol.1
No ratings yet
Circuit Breaker and Switchgear Vol.1
113 pages
The Influence of TikTok Reviews On Students' Travel Choices
No ratings yet
The Influence of TikTok Reviews On Students' Travel Choices
2 pages
Randomness Vs Arbitrariness Author(s) : ZAHA HADID Source: AA Files, No. 2 (July 1982), P. 62 Published By: Stable URL: Accessed: 12/06/2014 21:22
No ratings yet
Randomness Vs Arbitrariness Author(s) : ZAHA HADID Source: AA Files, No. 2 (July 1982), P. 62 Published By: Stable URL: Accessed: 12/06/2014 21:22
2 pages
InstallationLog
No ratings yet
InstallationLog
5 pages
Aritree Saha_11730823014
No ratings yet
Aritree Saha_11730823014
9 pages
Utility-Scale Batteries: Innovation Landscape Brief
No ratings yet
Utility-Scale Batteries: Innovation Landscape Brief
24 pages
Biểu Đồ Không Có Tiêu Đề.drawio
No ratings yet
Biểu Đồ Không Có Tiêu Đề.drawio
46 pages
Albany RR 200 Owners Manual
No ratings yet
Albany RR 200 Owners Manual
46 pages
Safety Valve Qap 270622
No ratings yet
Safety Valve Qap 270622
2 pages
Speed, Velocity, and Acceleration Worksheet Live Worksheets
No ratings yet
Speed, Velocity, and Acceleration Worksheet Live Worksheets
1 page
CCTV Fundamentals New
No ratings yet
CCTV Fundamentals New
4 pages
AeroTrak - Plus - A100 31 35 50 51 55 - APC - User Manual 6016408 - US
No ratings yet
AeroTrak - Plus - A100 31 35 50 51 55 - APC - User Manual 6016408 - US
74 pages
SUNDOO - SH Series Digital Force Gauge
No ratings yet
SUNDOO - SH Series Digital Force Gauge
2 pages
Reliability, Maintainability & Availability Introduction
100% (1)
Reliability, Maintainability & Availability Introduction
42 pages
Tennis Court
No ratings yet
Tennis Court
27 pages
Owen Oil Tools Has Been Manufacturing Expendable Retrievable
No ratings yet
Owen Oil Tools Has Been Manufacturing Expendable Retrievable
7 pages
d401.01 Tech Review
No ratings yet
d401.01 Tech Review
23 pages
Financial Technology
No ratings yet
Financial Technology
4 pages
BLG305 Ders8 e
No ratings yet
BLG305 Ders8 e
33 pages
777d Update Test Charts
100% (1)
777d Update Test Charts
38 pages

Lecture 8 -Data Prep-Integration - M

Uploaded by

Lecture 8 -Data Prep-Integration - M

Uploaded by

CS06504

Multiple Sources Coherent Store

Multiple Sources Coherent Store

Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’ or

Multiple Sources Coherent Store

Time point AllElctronics HighTech (Y)

eij count ( Aai )count (B bi )

male female Total

You might also like