Lecture 4-Data Preprocessing - Integration

Uploaded by

ÀbdUŁ ßaŠiŤ

Data Mining

Data Preprocessing
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration
• Data reduction
• Data transformation and discretization
• Summary
Data Integration
• Data integration:
  • combines data from multiple sources into a coherent store
Data Integration (Problem 1)
• Attribute naming (in schema integration)
Problem (entity identification): the same real-world entity must be recognized across multiple data sources, but its attributes are named differently in each source, e.g., A.cust-id ≡ B.cust-#. Resolving this requires integrating the metadata from the different sources.

[Figure: Multiple Sources → Coherent Store. Three source tables, keyed by CustomerID, CustID, and ClientID, pass through an Extraction, Transformation, and Loading (ETL) tool and are merged into one table keyed by CustomerID.]
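
The renaming step in the figure can be sketched in a few lines of Python. This is only an illustration, not an ETL tool: the mapping `ATTRIBUTE_MAP`, the helper `unify_schema`, and the sample records are all hypothetical names and data introduced here.

```python
# Hypothetical mapping from each source's key attribute to the unified name.
ATTRIBUTE_MAP = {
    "CustomerID": "CustomerID",
    "CustID": "CustomerID",
    "ClientID": "CustomerID",
}


def unify_schema(record, attribute_map=ATTRIBUTE_MAP):
    """Return a copy of `record` with source-specific attribute names renamed.

    Attributes that do not appear in the mapping keep their original name.
    """
    return {attribute_map.get(name, name): value for name, value in record.items()}


# Made-up records from three sources that name the customer key differently.
source_a = [{"CustomerID": 1, "name": "Ali"}]
source_b = [{"CustID": 2, "name": "Sara"}]
source_c = [{"ClientID": 3, "name": "Omar"}]

# Merge all sources into one list that uses a single attribute name.
coherent_store = [
    unify_schema(record)
    for source in (source_a, source_b, source_c)
    for record in source
]
print(coherent_store)
```

In practice the mapping would come from the metadata of each source rather than being hard-coded.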

Data Integration (Problem 2)
• Data Encoding
Problem: the same attribute value is denoted in different ways in different sources

[Figure: Multiple Sources → Coherent Store. One source encodes Gender as Male/Female and another as M/F; the coherent store uses a single encoding, Male/Female.]

Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’, or ‘ISBD’, might be misspelled, or might be inconsistently capitalized (some programs are case-sensitive).
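
One way to handle such encoding differences is a lookup table applied after trimming whitespace and lowercasing. The sketch below assumes the variants are known in advance; the dictionaries `GENDER_CODES` and `CITY_CODES` and the function `normalize` are hypothetical names, and misspellings are not handled.

```python
# Hypothetical lookup tables: every known variant maps to one canonical value.
GENDER_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
CITY_CODES = {"isb": "Islamabad", "isbd": "Islamabad", "islamabad": "Islamabad"}


def normalize(value, codes):
    """Map a raw value to its canonical form, ignoring case and outer spaces.

    Values with no entry in `codes` are returned unchanged so they can be
    reviewed separately (for example, misspellings).
    """
    key = value.strip().lower()
    return codes.get(key, value)


print(normalize("M", GENDER_CODES))     # Male
print(normalize("ISB", CITY_CODES))     # Islamabad
print(normalize(" isbd ", CITY_CODES))  # Islamabad
```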
Data Integration (Problem 3)
• Measurement Basis (data value conflicts)
Problem: for the same real-world entity, attribute values from different sources differ. Possible reasons are different representations or different scales, e.g., metric vs. British units (kg vs. lb).

[Figure: Multiple Sources → Coherent Store. One source stores Weight in kilograms (6, 10) and another stores Weight in pounds (6, 10); the coherent store holds Weight in kilograms: 6, 10, 2.72, 4.54.]
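
The conversion in the figure can be reproduced with a short sketch that brings both sources to kilograms before merging. The function name `to_kilograms` and the unit labels "kg" and "lb" are assumptions made for this illustration.

```python
KG_PER_LB = 0.45359237  # definition of the international pound in kilograms


def to_kilograms(value, unit):
    """Convert a weight to kilograms; `unit` is "kg" or "lb" (assumed labels)."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_LB
    raise ValueError(f"unknown unit: {unit}")


kg_source = [6, 10]  # weights already in kilograms
lb_source = [6, 10]  # weights recorded in pounds

coherent_store = [round(to_kilograms(w, "kg"), 2) for w in kg_source]
coherent_store += [round(to_kilograms(w, "lb"), 2) for w in lb_source]
print(coherent_store)  # [6, 10, 2.72, 4.54]
```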

Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated
  • The same attribute may have different names in different databases
  • One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
• Redundancy can be checked using correlation analysis.
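Correlation Analysis for Detecting Redundancy
• For numeric attributes, we can use correlation and covariance
• The correlation between two attributes X and Y can be checked with the correlation coefficient:

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{X})^2 \, \sum_{i=1}^{n}(y_i-\bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}}{n\,\sigma_X \sigma_Y} $$

where n is the number of tuples, $\bar{X}$ and $\bar{Y}$ are the means, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.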
1. Resulting value > 0: X and Y are positively correlated; as X increases, Y also tends to increase. If r is close to 1, the two attributes are largely redundant, so either X or Y can be removed.
2. Resulting value = 0: X and Y are uncorrelated (there is no linear relationship between them).
3. Resulting value < 0: X and Y are negatively correlated; as the value of X increases, the value of Y tends to decrease.
• The covariance between two attributes is

$$ \mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}\right) $$

1. Covariance is not a very good test of independence, because a value of zero does not always mean that the attributes are independent.
Example: Correlation and Covariance
• For the following data, find the correlation and covariance values

Time point   AllElectronics (X)   HighTech (Y)
t1           6                    20
t2           5                    10
t3           4                    14
t4           3                    5
t5           2                    5
Total        20                   54
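
The example can be worked through with a small sketch that applies the covariance and correlation formulas directly, dividing by n (population form) as in the slides. The function and variable names are chosen here for illustration only.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by sigma_X * sigma_Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sigma_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance(xs, ys) / (sigma_x * sigma_y)


all_electronics = [6, 5, 4, 3, 2]  # X
high_tech = [20, 10, 14, 5, 5]     # Y

cov = covariance(all_electronics, high_tech)
r = correlation(all_electronics, high_tech)
print(f"Cov(X, Y) = {cov:.2f}")  # about 7.00
print(f"r(X, Y)   = {r:.3f}")    # about 0.867
```

By hand: the means are 4 and 10.8, the sum of x·y is 251, so Cov(X, Y) = (251 − 5·4·10.8)/5 = 7. Both values are positive, so the two attributes tend to rise together.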

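χ² Correlation Test for Nominal Data
• A correlation relationship between two nominal attributes A and B can be discovered by a χ² (chi-square) test.

$$ \chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^2}{e_{ij}} $$

where

$$ e_{ij} = \frac{\mathrm{count}(A=a_i)\times \mathrm{count}(B=b_j)}{n} $$

Here $o_{ij}$ is the observed count of the joint event (A = a_i, B = b_j), $e_{ij}$ is its expected count, and n is the total number of tuples.
Example: For the following data, determine whether the two attributes are independent.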

              male   female   Total
Fiction        250     200     450
Non-fiction     50    1000    1050
Total          300    1200    1500
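
As a sketch of how the statistic could be computed for this table, the function below derives each expected count from the row and column totals and sums the scaled squared differences. The name `chi_square` and the list-of-rows layout are assumptions made for this illustration.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if the two attributes were independent.
            e_ij = row_totals[i] * col_totals[j] / n
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


# Rows: Fiction, Non-fiction. Columns: male, female.
observed = [
    [250, 200],
    [50, 1000],
]

chi2 = chi_square(observed)
print(f"chi-square = {chi2:.2f}")  # about 507.94
```

The expected counts for this table are 90, 360, 210, and 840, so the statistic is 25600/90 + 25600/360 + 25600/210 + 25600/840 ≈ 507.94.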
χ² Correlation Test for Nominal Data
We formulate our null and alternative hypotheses as
• Hypotheses
H0: There is no association between the two attributes (variables)
H1: There is an association between the two attributes (variables)
• Test statistic

$$ \chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(o_{ij}-e_{ij})^2}{e_{ij}} $$

where
R and C are the numbers of rows and columns of the table
$o_{ij}$ is the observed cell count in the ith row and jth column of the table
$e_{ij}$ is the expected cell count in the ith row and jth column of the table, computed as

$$ e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n} $$

• The calculated value is then compared to the critical value from the χ² distribution table with degrees of freedom df = (R - 1)(C - 1) and a chosen significance level of 0.05 or 0.01. If the calculated value > critical value, we reject the null hypothesis.
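
The decision rule can be sketched as a lookup against a few standard critical values of the χ² distribution. The table `CRITICAL_VALUES` covers only df = 1 to 4 at the 0.05 and 0.01 levels, and the function name `reject_null` is hypothetical; a full implementation would consult a complete distribution table or a statistics library.

```python
# Standard chi-square critical values, keyed by (degrees of freedom, alpha).
# Only a small excerpt of the distribution table is included here.
CRITICAL_VALUES = {
    (1, 0.05): 3.841, (1, 0.01): 6.635,
    (2, 0.05): 5.991, (2, 0.01): 9.210,
    (3, 0.05): 7.815, (3, 0.01): 11.345,
    (4, 0.05): 9.488, (4, 0.01): 13.277,
}


def reject_null(chi2_value, n_rows, n_cols, alpha=0.05):
    """Return True if H0 (no association) is rejected at level `alpha`."""
    df = (n_rows - 1) * (n_cols - 1)
    critical = CRITICAL_VALUES[(df, alpha)]  # KeyError if outside the excerpt
    return chi2_value > critical


# A 2x2 table has df = 1; a statistic of about 507.94 is what the formula
# gives for the fiction / non-fiction by gender table.
print(reject_null(507.94, n_rows=2, n_cols=2))              # True
print(reject_null(507.94, n_rows=2, n_cols=2, alpha=0.01))  # True
```

Since 507.94 is far above 3.841, the null hypothesis is rejected for that table: gender and preferred reading are associated.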
