Data Mining - Lecture 3
Data Pre-processing: Integration, Reduction, Transformation
By Dr. Nora Shoaip
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration
Entity Identification Problem
How can equivalent real-world entities from multiple data sources be matched up? For example, customer_id in one database and cust_number in another may refer to the same attribute. Metadata (name, meaning, data type, range of values) helps avoid errors in schema integration.
Data Integration
Redundancy and Correlation Analysis
An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis: for nominal data, the chi-square (χ²) test; for numeric data, the correlation coefficient and covariance.
For two nominal attributes A and B, χ² = Σ (oij − eij)² / eij, where oij is the observed frequency of the joint event (ai, bj) and eij = count(A = ai) × count(B = bj) / n is the frequency expected if A and B were independent.
Example (nominal attributes): gender vs. preferred reading for a group of 1500 people, observed frequencies:

                     gender
                  male    female   Total
Fiction            250       200     450
Non-fiction         50      1000    1050
Total              300      1200    1500
The same table with expected frequencies in parentheses, e.g. e11 = count(male) × count(fiction) / n = 300 × 450 / 1500 = 90:

                     gender
                  male         female       Total
Fiction            250 (90)     200 (360)     450
Non-fiction         50 (210)   1000 (840)    1050
Total              300         1200          1500
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
With (2 − 1)(2 − 1) = 1 degree of freedom, the χ² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828. Since 507.93 > 10.828, gender and preferred reading are strongly correlated.
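To double-check the arithmetic, the same statistic can be computed with SciPy. A minimal sketch; only the observed counts come from the table above, the rest is standard library usage:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # Fiction:     male, female
                     [50, 1000]])   # Non-fiction: male, female

# correction=False gives the plain Pearson chi-square statistic,
# matching the hand computation (SciPy applies Yates' correction
# by default for 2x2 tables).
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(dof)       # 1
print(expected)  # [[90., 360.], [210., 840.]]
print(p)         # far below 0.001 -> reject independence
```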
Data Integration
Redundancy and Correlation Analysis (numeric data)
For numeric attributes, correlation is evaluated with the Pearson correlation coefficient:
r(A,B) = Σ (ai − Ā)(bi − B̄) / (n σA σB)
where −1 ≤ r ≤ 1: r > 0 means A and B are positively correlated, r = 0 means no linear correlation, and r < 0 means they are negatively correlated.
Covariance is closely related:
Cov(A,B) = E((A − Ā)(B − B̄)) = E(A·B) − Ā·B̄, and r(A,B) = Cov(A,B) / (σA σB).
Example: values of two numeric attributes, A and B, observed at five time points:

Time    A     B
T1      6    20
T2      5    10
T3      4    14
T4      3     5
T5      2     5

Ā = 20/5 = 4 and B̄ = 54/5 = 10.8
E(A·B) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 = 251/5 = 50.2
Cov(A,B) = 50.2 − 4 × 10.8 = 7 > 0, so A and B tend to rise together (positively correlated).
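A quick numeric check of this example as a minimal NumPy sketch; only the five (A, B) pairs come from the table:

```python
import numpy as np

A = np.array([6, 5, 4, 3, 2])       # attribute A at time points T1..T5
B = np.array([20, 10, 14, 5, 5])    # attribute B

# Population covariance, matching the hand computation above:
# E(A*B) - mean(A)*mean(B)
cov = (A * B).mean() - A.mean() * B.mean()
print(cov)                # 7.0 -> positive: A and B rise together

# Pearson correlation coefficient (scale-free version of covariance)
r = np.corrcoef(A, B)[0, 1]
print(r)                  # ~0.87
```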
Data Integration
More Issues
Tuple duplication: the use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy, e.g. a purchaser's name and address repeated with every purchase.
Data value conflict: the same real-world entity may have different attribute values in different sources, e.g. one institute records grades as letters (A, B, ...) while another uses percentages (90%, 80%, ...).
Data Reduction
Strategies
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction
Attribute Subset Selection
Find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
An exhaustive search over all attribute subsets can be prohibitively expensive, so heuristic (greedy) search is typically used:
◦ Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set (see the sketch after this list).
◦ Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set.
◦ Combination of forward selection and backward elimination.
◦ Decision tree induction: build a tree from the data; attributes that do not appear in the tree are assumed to be irrelevant.
Attribute construction can also help, e.g. adding an area attribute based on height and width attributes.
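A minimal sketch of stepwise forward selection. The attribute-quality measure here (a decision tree scored by cross-validation) is an illustrative assumption; the slides do not fix a particular measure:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, max_attrs=None):
    """Greedily grow the reduced attribute set, one best attribute at a time."""
    n_attrs = X.shape[1]
    max_attrs = max_attrs or n_attrs
    selected, best_score = [], -np.inf
    while len(selected) < max_attrs:
        # Score every candidate attribute added to the current set
        scores = {}
        for j in range(n_attrs):
            if j in selected:
                continue
            cols = selected + [j]
            model = DecisionTreeClassifier(random_state=0)
            scores[j] = cross_val_score(model, X[:, cols], y, cv=5).mean()
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # no remaining attribute helps
            break
        best_score = scores[j_best]
        selected.append(j_best)
    return selected
```

Stepwise backward elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.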
Data Reduction - Numerosity Reduction
Regression
In (simple) linear regression, the data are modeled to fit a straight line y = wx + b; the regression coefficients w (slope) and b (y-intercept) are found by the method of least squares. The data can then be represented by the model parameters instead of the original tuples.
Example: five observations of two numeric attributes:

 X      Y
1.00   1.00
2.00   2.00
3.00   1.30
4.00   3.75
5.00   2.25
Fitting the least-squares line: X̄ = 3 and Ȳ = 2.06, so
w = Σ(xi − X̄)(yi − Ȳ) / Σ(xi − X̄)² = 4.25 / 10 = 0.425
b = Ȳ − w X̄ = 2.06 − 0.425 × 3 = 0.785
and the five points are summarized by the line ŷ = 0.425x + 0.785.
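The same fit in a couple of lines of NumPy; a minimal sketch using only the (X, Y) pairs from the table:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])

# Least-squares line y = w*x + b; the five points are then
# represented by just the two parameters (w, b).
w, b = np.polyfit(x, y, deg=1)
print(w, b)          # ~0.425, ~0.785
print(w * 6 + b)     # estimate at an unseen x = 6
```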
Data Reduction
Histograms
Histograms use binning to approximate data distributions: a histogram partitions the values of an attribute into disjoint buckets and stores only the bucket boundaries and counts.
Equal-width: the width of each bucket range is uniform (e.g., a width of $10 per bucket).
Equal-frequency (equal-depth): the buckets are created so that, roughly, each bucket contains the same number of contiguous data values.
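A minimal sketch of both bucket types, assuming a small made-up list of prices (the slides do not give the data):

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                   12, 14, 14, 15, 15, 18, 18, 20, 21, 25])

# Equal-width: 3 buckets of uniform range, each (25 - 1)/3 = 8 wide
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries
print(counts)   # how many values fall in each bucket

# Equal-frequency: boundaries at quantiles, so each bucket
# holds roughly the same number of values
q_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(q_edges)
```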
Data Reduction
Sampling
Sampling represents a large data set D of N tuples by a much smaller random sample s:
◦ Simple random sample without replacement (SRSWOR) of size s: draw s of the N tuples, every tuple equally likely; once drawn, a tuple is not replaced.
◦ Simple random sample with replacement (SRSWR): as above, but each drawn tuple is recorded and then placed back, so it may be drawn again.
◦ Cluster sample: if the tuples are grouped into M disjoint "clusters" (e.g., pages), draw a simple random sample of s clusters.
◦ Stratified sample: divide D into disjoint strata (e.g., age groups) and draw a simple random sample from each stratum.
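A minimal sketch of the two simple random sampling schemes on a toy table of 100 tuples; the data and sample size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(1, 101)   # a toy "table" of 100 tuples

# SRSWOR: sample of 10 tuples WITHOUT replacement
srswor = rng.choice(data, size=10, replace=False)

# SRSWR: WITH replacement -- the same tuple may be drawn twice
srswr = rng.choice(data, size=10, replace=True)

print(srswor)
print(srswr)
```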
Transformation and Discretization
Transformation Strategies
• Smoothing: remove noise from the data (binning, regression, clustering)
• Attribute construction: new attributes are constructed from the given ones
• Aggregation: summary operations are applied to the data (e.g., daily sales aggregated into monthly totals)
• Normalization: attribute values are scaled to fall within a small range such as 0.0 to 1.0
• Discretization: raw values of a numeric attribute are replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior); see the sketch below
• Concept hierarchy generation for nominal data
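A minimal sketch of discretization into conceptual labels, assuming illustrative ages and cut points (none of these values come from the slides):

```python
import pandas as pd

ages = pd.Series([13, 22, 35, 48, 67, 81])

# Map raw ages to conceptual labels via interval boundaries
labels = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["youth", "adult", "senior"])
print(labels.tolist())  # ['youth', 'adult', 'adult', 'adult', 'senior', 'senior']
```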
Transformation and Discretization
Transformation by Normalization
Normalization scales attribute values into a small, common range so that no attribute dominates simply because of its units:
◦ Min-max normalization: v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
  e.g. mapping income from [$12,000, $98,000] to [0.0, 1.0], a value of $73,600 becomes (73,600 − 12,000) / (98,000 − 12,000) = 0.716.
◦ Z-score normalization: v' = (v − Ā) / σA, useful when the actual minimum and maximum are unknown or when outliers dominate min-max.
◦ Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
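A minimal sketch of all three normalizations, assuming a small income list that includes the example's minimum, maximum, and $73,600 (the other value is made up):

```python
import numpy as np

income = np.array([12000., 35000., 73600., 98000.])

# Min-max normalization to [0, 1]
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax)   # 73600 maps to ~0.716, as in the worked example

# Z-score normalization
zscore = (income - income.mean()) / income.std()
print(zscore)

# Decimal scaling: divide by 10^j so all values fall in (-1, 1)
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)   # here j = 5
```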
Transformation and Discretization
Concept Hierarchy
A concept hierarchy allows low-level concepts to be recursively replaced by higher-level concepts, e.g. street < city < province_or_state < country for a location attribute, or numeric age generalized to youth, adult, senior.
Summary
Which techniques serve each preprocessing task (several appear under more than one):
Cleaning: binning, regression, outlier analysis
Integration: correlation analysis
Reduction: regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, regression, correlation analysis, histogram analysis, clustering, attribute construction, aggregation, normalization, concept hierarchy