Data Mining - Lecture 2
similarity
data quality
Know Your Data
Preprocessing
By
Dr. Nora Shoaip
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Measuring Data Similarity & Dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for (Nominal, Binary) attributes
• Dissimilarity of Numerical Data
Data matrix & dissimilarity matrix
Proximity Measures for Nominal attributes
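A minimal sketch of a nominal proximity measure, assuming the usual textbook definition d(i, j) = (p - m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match. The function name and the sample objects are hypothetical, chosen only for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of nominal attributes on which two objects differ."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must describe the same attributes")
    p = len(obj_i)  # total number of attributes
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
first = ("red", "small", "round")
second = ("red", "large", "round")
print(nominal_dissimilarity(first, second))
```

Under this definition the similarity is simply 1 - d(i, j), so the two objects above, which differ on one of three attributes, have dissimilarity 1/3.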
Proximity Measures for Binary attributes
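A minimal sketch of binary dissimilarity, assuming the common contingency-table definitions: q counts attributes where both objects are 1, r where only the first is 1, s where only the second is 1, and t where both are 0. The symmetric measure is (r + s) / (q + r + s + t); the asymmetric measure drops t and is (r + s) / (q + r + s). The function name and the sample vectors are hypothetical.

```python
def binary_dissimilarity(obj_i, obj_j, symmetric=True):
    """Dissimilarity of two 0/1 vectors from their contingency counts."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must describe the same attributes")
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a and b:
            q += 1      # both 1
        elif a:
            r += 1      # only the first object is 1
        elif b:
            s += 1      # only the second object is 1
        else:
            t += 1      # both 0
    # Asymmetric attributes ignore the 0/0 matches counted in t.
    denominator = q + r + s + (t if symmetric else 0)
    if denominator == 0:
        return 0.0      # nothing to compare; treated as identical here
    return (r + s) / denominator


# Hypothetical objects with six binary attributes.
x = (1, 0, 1, 0, 0, 0)
y = (1, 0, 1, 0, 1, 0)
print(binary_dissimilarity(x, y, symmetric=True))
print(binary_dissimilarity(x, y, symmetric=False))
```

The asymmetric form is the one usually preferred when a value of 1 (for example, a positive test) carries more information than a value of 0.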
Proximity Measures for Binary attributes: Example
Dissimilarity of Numerical Data
Minkowski Distance
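A minimal sketch of the Minkowski distance, assuming the standard definition d(i, j) = (sum over attributes of |x_if - x_jf|^h)^(1/h) with h >= 1. Setting h = 1 gives the Manhattan distance and h = 2 the Euclidean distance; the supremum distance is the limit as h grows. Function names and sample points are hypothetical.

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if h < 1:
        raise ValueError("h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def supremum_distance(x, y):
    """Limit of the Minkowski distance as h grows: the largest attribute gap."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical two-dimensional points.
p1, p2 = (1, 2), (3, 5)
print(minkowski_distance(p1, p2, 1))   # Manhattan: |1-3| + |2-5|
print(minkowski_distance(p1, p2, 2))   # Euclidean: sqrt(2^2 + 3^2)
print(supremum_distance(p1, p2))       # largest single-attribute difference
```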
Why preprocess data?
Major tasks
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Overview
Why Preprocess Data?
Major Preprocessing Tasks That Improve Quality of Data
Data Cleaning
Missing Values
• Ignore the tuple: not very effective, unless the tuple contains several attributes with missing values
• Fill in the missing value manually: time consuming, not feasible for large data sets
• Use a global constant: replace all missing attribute values by the same value (e.g. "unknown"); the mining program may mistakenly think that "unknown" is an interesting concept
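The first and third strategies above can be sketched in plain Python. This is an illustrative sketch only: rows are assumed to be dictionaries, a missing value is assumed to be represented by None, and the function names and sample records are hypothetical.

```python
MISSING = None  # assumption: missing values are stored as None


def ignore_incomplete_tuples(rows):
    """Drop every tuple that has at least one missing attribute value."""
    return [row for row in rows if MISSING not in row.values()]


def fill_with_constant(rows, constant="unknown"):
    """Replace every missing attribute value with the same global constant."""
    return [
        {key: (constant if value is MISSING else value) for key, value in row.items()}
        for row in rows
    ]


# Hypothetical customer records.
customers = [
    {"name": "A", "income": 40000},
    {"name": "B", "income": None},
]
print(ignore_incomplete_tuples(customers))
print(fill_with_constant(customers))
```

After filling, every formerly missing income shares the value "unknown", which is exactly why a mining program may treat those tuples as one meaningful group.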
Data Cleaning
Noisy Data
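Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34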
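The smoothed values above can be reproduced with a short sketch, assuming the sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34 are split into equal-frequency bins of three values each. The function names are hypothetical, and sending ties to the lower boundary is an assumption of this sketch.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Split sorted values into consecutive bins of bin_size values each."""
    return [
        sorted_values[i:i + bin_size]
        for i in range(0, len(sorted_values), bin_size)
    ]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means pulls every value toward the bin center, while smoothing by boundaries keeps the extremes of each bin and snaps interior values to the nearer one.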
Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-width) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34
Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
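A minimal sketch of equal-width partitioning for the price data. The example does not state the bin edges, so the edges used here (width 12, starting at 4, with each later interval closed on the right) are an assumption chosen only because they reproduce the three bins listed above; the function names are hypothetical. The sketch also shows that the Bin 2 mean is 23.8, which the example reports rounded to 24.

```python
import math


def fixed_width_bins(values, start, width, n_bins):
    """Assign values to n_bins intervals of equal width.

    The first interval is [start, start + width]; each later interval is
    open on the left and closed on the right.
    """
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if v < start:
            raise ValueError(f"{v} lies below the first bin edge")
        index = 0 if v == start else math.ceil((v - start) / width) - 1
        if index >= n_bins:
            raise ValueError(f"{v} lies above the last bin edge")
        bins[index].append(v)
    return bins


def rounded_bin_means(bins):
    """Mean of each non-empty bin, rounded to the nearest integer."""
    return [round(sum(b) / len(b)) for b in bins if b]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
# Assumed edges: [4, 16], (16, 28], (28, 40]
bins = fixed_width_bins(prices, start=4, width=12, n_bins=3)
print(bins)
print(sum(bins[1]) / len(bins[1]))  # unrounded mean of the second bin
print(rounded_bin_means(bins))
```

Unlike equal-frequency bins, equal-width bins can hold very different numbers of values, as the single-value third bin shows.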