Data Mining

The document discusses various techniques for preprocessing data in order to improve its quality for data mining purposes, including data cleaning techniques to handle missing values, noisy data, and inconsistencies. It also covers data integration topics such as resolving semantic heterogeneity across multiple data sources and analyzing and removing redundancy. The overall goal of data preprocessing is to produce a reduced and cleaner representation of the data to improve the effectiveness of subsequent data mining and analysis.

Data Mining and Business Intelligence

Overview

Data Pre-processing: Cleaning & Integration
By
Dr. Nora Shoaip

Lecture 3

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Quiz

Draw the Box-Plot for the following dataset


4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8,
4.4, 4.2, 4.5, 4.4
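The five-number summary behind the box plot can be checked with a short Python sketch; it assumes `statistics.quantiles` with the default exclusive method matches the hand computation used in class:

```python
import statistics

# Quiz data, sorted for the five-number summary
data = sorted([4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
               4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4])

q1, med, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers extend to the extreme values inside the 1.5*IQR fences
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print(f"Q1={q1}, median={med}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```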

2
Quiz

(Box-plot solution drawn as a figure)

3
Overview

Databases are highly susceptible to noisy, missing, and inconsistent data.
Low-quality data will lead to low-quality mining results.

“How can the data be preprocessed in order to help improve the quality of the
data and, consequently, of the mining results? How can the data be
preprocessed so as to improve the efficiency and ease of the mining process?”

4
Why Preprocess Data?

To satisfy the requirements of the intended use

 Factors of data quality:
◦Accuracy  may be lacking due to faulty instruments; human, computer, or transmission errors; deliberate errors …
◦Completeness  may be lacking due to different design phases, optional attributes
◦Consistency  may be lacking due to differing semantics, data types, field formats …
◦Timeliness  how up to date the data are
◦Believability  how much the data are trusted by users
◦Interpretability  how easily the data are understood

5
Major Preprocessing Tasks
That Improve Quality of Data

 Data cleaning  fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
 Data integration  include data from multiple sources in the analysis,
map semantic concepts, infer attributes …
 Data reduction  obtain a reduced representation of the data set that
is much smaller in volume yet produces almost the same analytical
results
 Discretization  raw attribute values are replaced by ranges or higher
conceptual levels
 Data transformation  e.g., normalization
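As a small illustration of the transformation step, min-max normalization rescales an attribute to [0, 1]; the sketch below uses made-up income values:

```python
# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [12000, 30000, 54000, 98000]  # hypothetical attribute values
print(min_max(incomes))  # smallest maps to 0.0, largest to 1.0
```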

6
Data Cleaning

 Data in the real world is dirty!

◦incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation=“ ” (missing data)
◦noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
◦inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
◦intentional (disguised missing data)  e.g., Jan. 1 recorded as everyone’s birthday
7
Data Cleaning

… fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data

A missing value does not necessarily imply an error in the data!

◦e.g., a driver’s license number may legitimately be blank for non-drivers

8
Data Cleaning
Missing Values

 Ignore the tuple  not very effective, unless the tuple contains
several attributes with missing values
 Fill in the missing value manually  time consuming, not
feasible for large data sets
 Use a global constant  replace all missing attribute values by the
same value (e.g., “unknown”)
 the mining program may mistakenly think that “unknown” is an interesting concept

9
Data Cleaning
Missing Values

 Use the mean or median  for normal (symmetric) data
distributions the mean is used, while skewed data distributions
should employ the median
 Use the mean or median of all samples belonging to the same
class as the given tuple  e.g., the mean or median of customers in
a certain age group
 Use the most probable value  using regression or inference-
based tools such as the Bayesian formula or a decision tree
 the most popular strategy
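The class-conditional fill-in (mean or median per group) can be sketched as follows; the records and field names are made up for illustration:

```python
import statistics

# Toy customer records; income is missing (None) for some of them
records = [
    {"age_group": "20-30", "income": 3000},
    {"age_group": "20-30", "income": None},
    {"age_group": "30-40", "income": 5200},
    {"age_group": "30-40", "income": 4800},
    {"age_group": "30-40", "income": None},
]

# Collect known incomes per age group, plus a global fallback
global_median = statistics.median(
    r["income"] for r in records if r["income"] is not None)
by_group = {}
for r in records:
    if r["income"] is not None:
        by_group.setdefault(r["age_group"], []).append(r["income"])

# Fill each missing income with the median of its own age group
for r in records:
    if r["income"] is None:
        vals = by_group.get(r["age_group"])
        r["income"] = statistics.median(vals) if vals else global_median
```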

10
Data Cleaning
Noisy Data

Noise is a random error or variance in a measured variable

Data smoothing techniques:
1. Binning
2. Regression
3. Outlier analysis

11
Data Cleaning
Noisy Data

1. Binning  smooth a sorted data value by consulting its
“neighborhood”
◦sorted values are partitioned into a number of “buckets,” or bins  local
smoothing
◦equal-frequency bins  each bin has the same number of values
◦equal-width bins  the interval range of values per bin is constant
 Smoothing by bin means  each bin value is replaced by the bin mean
 Smoothing by bin medians  each bin value is replaced by the bin median
 Smoothing by bin boundaries  each bin value is replaced by the closest
boundary value (the min and max in a bin are its boundaries)
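The three smoothing variants can be sketched in a few lines of Python; this minimal version assumes the data are sorted and divide evenly into equal-frequency bins:

```python
# Equal-frequency binning with three smoothing strategies (sketch)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Replace every value in a bin by the bin mean, median, or closest boundary
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_medians = [[b[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The output matches the worked price example on the next slides.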
12
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
13
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-width) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34

Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
14
Data Cleaning
Noisy Data

2. Regression  conform data values to a function
◦Linear regression  find the “best” line to fit two attributes, so that one
attribute can be used to predict the other
3. Outlier analysis
 Potter’s Wheel  an automated, interactive data
cleaning tool

15
Data Integration

 Entity identification problem
 Redundancy and correlation analysis
 Tuple duplication
 Data value conflict detection

16
Data Integration

Merging data from multiple data stores

Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
 Semantic heterogeneity  the entity identification problem
 Structure of data  functional dependencies and referential constraints
 Redundancy

17
Data Integration
Entity Identification Problem

 Schema integration and object matching  matching equivalent
real-world entities from multiple data sources

 Metadata  name, meaning, data type, and range of values
permitted; null rules for handling blank, zero, or null values

 Such metadata can help avoid errors in schema integration and data
transformation

18
Data Integration
Redundancy and Correlation Analysis

19
Data Integration
Redundancy and Correlation Analysis

                               gender
                            male    female    Total
Preferred    Fiction         250       200      450
reading      Non-fiction      50      1000     1050
             Total           300      1200     1500

20
Data Integration
Redundancy and Correlation Analysis

                               gender
                            male        female        Total
Preferred    Fiction         250 (90)    200 (360)      450
reading      Non-fiction      50 (210)  1000 (840)     1050
             Total           300        1200           1500
(expected counts shown in parentheses)

21
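The expected counts in parentheses come from e = (row total × column total) / n; plugging observed and expected counts into the χ² statistic can be checked with a short sketch:

```python
# Chi-square test of independence for the gender × preferred-reading table
observed = [[250, 200],    # fiction:     male, female
            [50, 1000]]    # non-fiction: male, female

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum((obs - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i, row in enumerate(observed)
           for j, obs in enumerate(row))

print(round(chi2, 2))  # ≈ 507.94, far above the critical value for
                       # 1 degree of freedom → the attributes are correlated
```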
Data Integration
Redundancy and Correlation Analysis

23
Data Integration
Redundancy and Correlation Analysis

24
Data Integration
Redundancy and Correlation Analysis

25
Data Integration
Redundancy and Correlation Analysis

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5
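Covariance for the two stock prices follows the identity cov(A, B) = E(A·B) − Ā·B̄ (population form); a positive value means the prices tend to rise and fall together:

```python
# Covariance of the two stocks from the table: cov(A, B) = E(A*B) - mean(A)*mean(B)
ae = [6, 5, 4, 3, 2]     # AllElectronics, T1..T5
ht = [20, 10, 14, 5, 5]  # HighTech, T1..T5
n = len(ae)

mean_ae = sum(ae) / n                               # 4.0
mean_ht = sum(ht) / n                               # 10.8
e_ab = sum(a * b for a, b in zip(ae, ht)) / n       # E(A*B) = 50.2
cov = e_ab - mean_ae * mean_ht

print(round(cov, 2))  # 7.0 → positive covariance: the stocks move together
```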

26
Data Integration
More Issues

Tuple duplication
The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy.
e.g., a purchaser’s name and address repeated with each purchase
Data value conflict
e.g., grading systems in two different institutes  A, B, … versus 90%,
80%, …

27
Quiz
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Calculate the correlation coefficient. Are these two attributes positively or negatively
correlated? Compute their covariance.
o (Hint: n = 18
o SDs for Age and %fat: 12.85 and 8.99, respectively
o Means for Age and %fat: 46.44 and 28.78, respectively
o E(Age × %fat) = 1431.29)
• Partition the data into three bins by each of equal-frequency and equal-width partitioning
• Use smoothing by bin boundaries to smooth these data
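The correlation part of the quiz can be checked numerically: r = cov(Age, %fat) / (σ_age · σ_fat), using population standard deviations to match the hinted SD values. This sketch uses the full-precision data, so the covariance differs slightly from what the rounded hints give:

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
n = len(age)

# cov(A, B) = E(A*B) - mean(A)*mean(B), population form
cov = (sum(a * f for a, f in zip(age, fat)) / n
       - statistics.fmean(age) * statistics.fmean(fat))
r = cov / (statistics.pstdev(age) * statistics.pstdev(fat))

print(round(cov, 2), round(r, 2))  # ≈ 94.46 and 0.82 → positively correlated
```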

28
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

29
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Equal-frequency bins for age
o Bin 1 = 23, 23, 27, 27, 39, 41
o Bin 2 = 47, 49, 50, 52, 54, 54
o Bin 3 = 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries
o Bin 1 = 23, 23, 23, 23, 41, 41
o Bin 2 = 47, 47, 47, 54, 54, 54
o Bin 3 = 56, 56, 56, 56, 61, 61
• Equal-width bins for age
o Bin 1 = 23, 23, 27, 27
o Bin 2 = 39, 41, 47, 49
o Bin 3 = 50, 52, 54, 54, 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries
o Bin 1 = 23, 23, 27, 27
o Bin 2 = 39, 39, 49, 49
o Bin 3 = 50, 50, 50, 50, 61, 61, 61, 61, 61, 61

30
