Data Mining
Overview
Data
Data Pre-processing: Cleaning, Integration
By
Dr. Nora Shoaip
Lecture 3
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2023 - 2024
Overview
Why Preprocess Data?
Major Preprocessing Tasks That Improve Quality of Data
Data Cleaning
Missing Values
• Ignore the tuple: not very effective, unless the tuple contains several attributes with missing values
• Fill in the missing value manually: time consuming, not feasible for large data sets
• Use a global constant: replace all missing attribute values by the same value (e.g. "unknown"); the mining program may mistakenly think that "unknown" is an interesting concept
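The two automatable strategies above (ignoring the tuple and filling with a global constant) can be sketched in a few lines of plain Python. This is only a minimal illustration: the `records` list, its attribute names, and the helper names `ignore_incomplete` and `fill_constant` are hypothetical and do not come from the lecture.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"name": "A", "income": 52000, "city": "Cairo"},
    {"name": "B", "income": None, "city": "Alexandria"},
    {"name": "C", "income": None, "city": None},
]


def ignore_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_constant(rows, constant="unknown"):
    """Global constant: replace every missing value with the same label."""
    return [
        {k: (constant if v is None else v) for k, v in r.items()}
        for r in rows
    ]


kept = ignore_incomplete(records)   # only the fully populated row survives
filled = fill_constant(records)     # same number of rows, no None left

print(len(kept), "of", len(records), "tuples kept")
print(filled[2])
```

Note how dropping tuples discards two of the three rows here, while the constant fill keeps every row but makes "unknown" look like an ordinary value to any later analysis.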
Noisy Data
Smoothing by bin means (equal-frequency bins)
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
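The smoothed bins above can be reproduced with a short sketch, assuming the sorted price list 4, 8, 15, 21, 21, 24, 25, 28, 34 and three equal-frequency bins. The function names are illustrative, and two conventions are assumptions: bin means are rounded to whole dollars, and a value exactly halfway between the two boundaries goes to the lower one.

```python
def equal_frequency_bins(values, k):
    """Sort the values and split them into k bins of (nearly) equal size.

    Assumes k <= len(values), so no bin is empty.
    """
    data = sorted(values)
    size, extra = divmod(len(data), k)
    bins, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

for name, result in [("bins", bins), ("means", by_means), ("boundaries", by_boundaries)]:
    print(name, result)
```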
Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-width) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34
Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
Data Integration
Entity Identification Problem
Redundancy and Correlation Analysis
                          gender
                          male    female   Total
Preferred   Fiction       250     200      450
reading     Non-fiction   50      1000     1050
            Total         300     1200     1500
                          gender
                          male        female       Total
Preferred   Fiction       250 (90)    200 (360)    450
reading     Non-fiction   50 (210)    1000 (840)   1050
            Total         300         1200         1500
(expected frequencies in parentheses)
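The expected frequencies in the table follow from (row total × column total) / grand total, and the chi-square statistic sums (observed − expected)² / expected over all four cells. The sketch below recomputes both from the observed counts; the variable names are illustrative only. For a 2 × 2 table there is 1 degree of freedom, and the resulting value is far above 10.828, the chi-square critical value at the 0.001 significance level, so gender and preferred reading are strongly correlated in this sample.

```python
# Observed counts: rows = fiction, non-fiction; columns = male, female.
observed = [
    [250, 200],
    [50, 1000],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Pearson chi-square statistic.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", expected)
print("chi-square:", round(chi2, 2), "with", dof, "degree(s) of freedom")
```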
T1 6 20
T2 5 10
T3 4 14
T4 3 5
T5 2 5
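As a worked check, the covariance of the two numeric columns in the table above can be computed as E(A·B) − mean(A)·mean(B). The column names `a` and `b` are placeholders, since the table gives no headers, and the population form (dividing by n) is assumed.

```python
# The two numeric columns of the table, rows T1..T5.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Cov(A, B) = E(A * B) - mean(A) * mean(B), dividing by n (population form).
e_ab = sum(x * y for x, y in zip(a, b)) / n
cov = e_ab - mean_a * mean_b

print("mean A:", mean_a, "mean B:", mean_b)
print("covariance:", round(cov, 2))
```

A positive covariance indicates that the two attributes tend to rise and fall together.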
More Issues
• Tuple duplication
o The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
o e.g. purchaser name and address stored alongside each purchase
• Data value conflict
o e.g. grading systems in two different institutes: A, B, … versus 90%, 80%, …
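A minimal sketch of how tuple duplication might be detected in a denormalized purchases table. All rows, names, and helper functions here are hypothetical; the example only illustrates the idea that repeating purchaser details on every purchase row allows both exact duplicates and inconsistent copies to creep in.

```python
from collections import defaultdict

# Hypothetical denormalized table: (purchaser name, address, item).
purchases = [
    ("Mona", "12 Nile St", "laptop"),
    ("Mona", "12 Nile St", "mouse"),
    ("Mona", "7 Port Rd", "monitor"),   # same purchaser, different address
    ("Omar", "3 Canal Ave", "phone"),
    ("Omar", "3 Canal Ave", "phone"),   # exact duplicate tuple
]


def exact_duplicates(rows):
    """Return rows that repeat an earlier row exactly."""
    seen, duplicates = set(), []
    for row in rows:
        if row in seen:
            duplicates.append(row)
        seen.add(row)
    return duplicates


def conflicting_addresses(rows):
    """Return purchasers that appear with more than one address."""
    addresses = defaultdict(set)
    for name, address, _item in rows:
        addresses[name].add(address)
    return {name: sorted(a) for name, a in addresses.items() if len(a) > 1}


dups = exact_duplicates(purchases)
conflicts = conflicting_addresses(purchases)
print("duplicates:", dups)
print("conflicts:", conflicts)
```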
Quiz
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
• Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute their covariance.
o Hint: n = 18
o SD for age and %fat are 12.85 and 8.99 respectively
o Mean for age and %fat are 46.44 and 28.78 respectively
o E(age × %fat) = 1431.29
• Partition the age data into three bins using both equal-frequency and equal-width partitioning
• Use smoothing by bin boundaries to smooth these data
Quiz Solution
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
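The covariance and correlation coefficient asked for in the quiz can be checked with the sketch below, which uses the population formulas (dividing by n = 18), consistent with the hint values. Variable names are illustrative. The result is a covariance of about 94.5 and a correlation coefficient of about 0.82, so age and %fat are positively correlated.

```python
from math import sqrt

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

n = len(age)
mean_age = sum(age) / n
mean_fat = sum(fat) / n

# Cov(age, fat) = E(age * fat) - mean(age) * mean(fat)
e_product = sum(a * f for a, f in zip(age, fat)) / n
cov = e_product - mean_age * mean_fat

# Population standard deviations (divide by n, not n - 1).
sd_age = sqrt(sum((a - mean_age) ** 2 for a in age) / n)
sd_fat = sqrt(sum((f - mean_fat) ** 2 for f in fat) / n)

# Pearson correlation coefficient.
r = cov / (sd_age * sd_fat)

print("covariance:", round(cov, 2))
print("correlation:", round(r, 2))
```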
• Equal frequency bin for age
o Bin 1= 23,23,27,27,39,41
o Bin 2= 47,49,50,52,54,54
o Bin 3= 56,57,58,58,60,61
• Smoothing by boundary
o Bin 1= 23,23,23,23,41,41
o Bin 2= 47,47,47,54,54,54
o Bin 3= 56,56,56,56,61,61
• Equal width bin for age
o Bin 1= 23,23,27,27
o Bin 2= 39,41,47,49
o Bin 3= 50,52,54,54,56,57,58,58,60,61
• Smoothing by boundary
o Bin 1= 23,23,27,27
o Bin 2= 39,39,49,49
o Bin 3= 50,50,50,50,61,61,61,61,61,61
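Boundary smoothing of the equal-width age bins can be checked mechanically: each value moves to whichever of its bin's minimum or maximum is closer. The sketch takes the three equal-width bins as listed rather than deriving them, and assumes that a value exactly halfway between the boundaries goes to the lower one (no such tie occurs in this data).

```python
def smooth_by_boundaries(bin_values):
    """Replace each value by the closer of the bin's min and max."""
    lo, hi = min(bin_values), max(bin_values)
    return [lo if v - lo <= hi - v else hi for v in bin_values]


# Equal-width bins for age, taken as given.
equal_width_bins = [
    [23, 23, 27, 27],
    [39, 41, 47, 49],
    [50, 52, 54, 54, 56, 57, 58, 58, 60, 61],
]

smoothed = [smooth_by_boundaries(b) for b in equal_width_bins]
for i, b in enumerate(smoothed, start=1):
    print(f"Bin {i} =", b)
```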