Data Mining - Lecture 2

Data Mining and Business Intelligence
Lecture 2: Know Your Data & Preprocessing
(similarity, data quality)

By
Dr. Nora Shoaip

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Measuring Data Similarity & Dissimilarity
• Data matrix & dissimilarity matrix
• Proximity measures for nominal and binary attributes
• Dissimilarity of numerical data
Data Matrix & Dissimilarity Matrix
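The body of this slide did not survive extraction. As a minimal sketch of the idea (the Euclidean measure below is an assumption, not taken from the slide): a data matrix of n objects and their attributes yields an n-by-n dissimilarity matrix of pairwise distances d(i, j), usually stored as a lower triangle with zeros on the diagonal.

```python
def euclidean(x, y):
    # Straight-line distance between two numeric vectors
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dissimilarity_matrix(data, dist=euclidean):
    # Lower-triangular matrix: row i holds d(i, 0) .. d(i, i), with d(i, i) = 0
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]

points = [(0, 0), (3, 4), (6, 8)]
dm = dissimilarity_matrix(points)
# dm[1][0] is d(object 1, object 0); diagonal entries are 0.0
```

Any dissimilarity function can be swapped in for `euclidean`, which is why the data matrix and the dissimilarity matrix are treated as two separate structures.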
Proximity Measures for Nominal Attributes
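The slide body is not preserved here. A standard measure for nominal attributes (a sketch, assumed rather than read from the slide) is the simple matching ratio d(i, j) = (p - m) / p, where p is the total number of attributes and m the number that match.

```python
def nominal_dissimilarity(i, j):
    # d(i, j) = (p - m) / p : p attributes in total, m of them matching
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

# hypothetical objects described by (color, size, shape)
d = nominal_dissimilarity(("red", "small", "round"), ("red", "large", "round"))
# 1 of the 3 attributes differs
```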
Proximity Measures for Binary Attributes
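The slide's worked example is not preserved. As a sketch of the usual approach (assumed, not transcribed): count the four contingency cases over the two binary vectors, then use all four counts for symmetric attributes, or drop the 0/0 matches for asymmetric ones.

```python
def binary_dissimilarity(i, j, symmetric=True):
    # Contingency counts: q = 1/1, r = 1/0, s = 0/1, t = 0/0
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    if symmetric:
        return (r + s) / (q + r + s + t)
    # asymmetric: 0/0 matches (t) carry no information and are dropped
    return (r + s) / (q + r + s)

i, j = [1, 0, 1, 0], [1, 1, 0, 0]
d_sym = binary_dissimilarity(i, j)                     # (1 + 1) / 4
d_asym = binary_dissimilarity(i, j, symmetric=False)   # (1 + 1) / 3
```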
Dissimilarity of Numerical Data: Minkowski Distance
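The formula itself did not survive extraction, so here is a minimal sketch of the Minkowski distance the slide title refers to: d(i, j) = (Σ|x_k - y_k|^h)^(1/h), with h = 1 giving the Manhattan (city block) distance and h = 2 the Euclidean distance.

```python
def minkowski(x, y, h):
    # d(i, j) = (sum over attributes of |x_k - y_k|^h) ** (1/h)
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

d1 = minkowski((1, 2), (4, 6), 1)  # Manhattan: |1-4| + |2-6| = 7
d2 = minkowski((1, 2), (4, 6), 2)  # Euclidean: sqrt(9 + 16) = 5
```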
Why Preprocess Data? Major Tasks
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation

Overview

Databases are highly susceptible to noisy, missing, and inconsistent data. Low-quality data will lead to low-quality mining results.

"How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?"
Why Preprocess Data?

Data quality is measured by how well the data satisfy the requirements of the intended use. Factors of data quality:
◦ Accuracy: lacking due to faulty instruments, human/computer/transmission errors, deliberate errors, ...
◦ Completeness: lacking due to different design phases, optional attributes
◦ Consistency: lacking due to differences in semantics, data types, field formats, ...
◦ Timeliness: how up to date the data are
◦ Believability: how much the data are trusted by users
◦ Interpretability: how easily the data are understood
Major Preprocessing Tasks That Improve Data Quality

• Data cleaning: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
• Data integration: including data from multiple sources in the analysis, mapping semantic concepts, inferring attributes, ...
• Data reduction: obtaining a reduced representation of the data set that is much smaller in volume, yet produces almost the same analytical results
• Discretization: raw data values for attributes are replaced by ranges or higher conceptual levels
• Data transformation: e.g., normalization
Data Cleaning

Data in the real world is dirty!
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., Occupation = "" (missing data)
◦ noisy: containing noise, errors, or outliers
  e.g., Salary = "-10" (an error)
◦ inconsistent: containing discrepancies in codes or names
  e.g., Age = "42" but Birthday = "03/07/2010"; a rating that was "1, 2, 3" is now "A, B, C"; discrepancies between duplicate records
◦ intentional (disguised missing data): e.g., Jan. 1 recorded as everyone's birthday
Data Cleaning

Data cleaning routines fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

A missing value does not necessarily imply an error in the data!
◦ e.g., a missing driver's license number may simply mean the person does not drive
Data Cleaning: Missing Values

• Ignore the tuple: not very effective, unless the tuple contains several attributes with missing values
• Fill in the missing value manually: time consuming; not feasible for large data sets
• Use a global constant: replace all missing attribute values by the same value (e.g., "unknown"); the mining program may mistakenly think that "unknown" is an interesting concept
Data Cleaning: Missing Values

• Use the mean or median: for normal (symmetric) data distributions the mean is used, while skewed data distributions should employ the median
• Use the mean or median of all samples belonging to the same class as the given tuple: e.g., the mean or median income of customers in the same age group
• Use the most probable value: predicted with regression or inference-based tools such as Bayesian inference or a decision tree; this is the most popular strategy
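The per-class strategy above can be sketched as follows, with toy records and field names (`age_group`, `income`) invented for illustration:

```python
from statistics import median

# Toy customer records; some incomes are missing (None)
records = [
    {"age_group": "20-30", "income": 3000},
    {"age_group": "20-30", "income": None},
    {"age_group": "20-30", "income": 5000},
    {"age_group": "31-40", "income": 7000},
    {"age_group": "31-40", "income": None},
    {"age_group": "31-40", "income": 9000},
]

# Median income per class (age group), computed over the known values only
known = {}
for r in records:
    if r["income"] is not None:
        known.setdefault(r["age_group"], []).append(r["income"])
medians = {group: median(values) for group, values in known.items()}

# Fill each missing income with the median of its own class
for r in records:
    if r["income"] is None:
        r["income"] = medians[r["age_group"]]
```

Conditioning on the class gives a more plausible fill value than one global mean or median over the whole data set.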
Data Cleaning: Noisy Data

Noise is a random error or variance in a measured variable.

Data smoothing techniques:
1. Binning
2. Regression
3. Outlier Analysis
Data Cleaning: Noisy Data

1. Binning: smooth a sorted data value by consulting its "neighborhood"
◦ sorted values are partitioned into a number of "buckets," or bins (local smoothing)
◦ equal-frequency bins: each bin holds the same number of values
◦ equal-width bins: the interval range of values per bin is constant
  • Smoothing by bin means: each bin value is replaced by the bin mean
  • Smoothing by bin medians: each bin value is replaced by the bin median
  • Smoothing by bin boundaries: each bin value is replaced by the closest boundary value (the min and max in a bin are its boundaries)
Data Cleaning: Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
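The equal-frequency example above can be reproduced with a short sketch (ties in the boundary rule are broken toward the bin minimum, an assumption consistent with the figures shown):

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

k = 3
size = len(prices) // k  # 3 values per bin: equal frequency
bins = [prices[b * size:(b + 1) * size] for b in range(k)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
```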
Data Cleaning: Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-width) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
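Taking the equal-width partition above as given, the same smoothing rules reproduce its figures; note that the bin 2 mean is 119 / 5 = 23.8, shown rounded to 24. A sketch:

```python
bins = [[4, 8, 15], [21, 21, 24, 25, 28], [34]]  # equal-width partition from the example

# Smoothing by bin means (bin 2: 119 / 5 = 23.8, rounds to 24)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries (24 is nearer to 21, 25 is nearer to 28)
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
```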
Data Cleaning: Noisy Data

2. Regression: conform data values to a function
◦ Linear regression: find the "best" line to fit two attributes so that one attribute can be used to predict the other

3. Outlier Analysis: detect outliers, e.g., by clustering

Potter's Wheel is an automated, interactive data cleaning tool.
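Regression-based smoothing can be sketched as follows (toy data, not from the lecture): fit a least-squares line to two attributes, then replace the noisy y values with the line's predictions.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # noisy measurements of the second attribute

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Least-squares fit of y ≈ slope * x + intercept
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

# Smoothed values: each y is conformed to the fitted line
smoothed = [slope * a + intercept for a in x]
```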
