Data Mining

The document discusses various techniques for preprocessing data in order to improve its quality for data mining purposes, including data cleaning techniques to handle missing values, noisy data, and inconsistencies. It also covers data integration topics such as resolving semantic heterogeneity across multiple data sources and analyzing and removing redundancy. The overall goal of data preprocessing is to produce a reduced and cleaner representation of the data to improve the effectiveness of subsequent data mining and analysis.

Data Mining and Business Intelligence

Overview

Data Pre-processing: Cleaning & Integration
By
Dr. Nora Shoaip

Lecture 3

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Quiz

Draw the Box-Plot for the following dataset


4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8,
4.4, 4.2, 4.5, 4.4
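The five-number summary behind the box plot can be checked with a short Python sketch; it assumes `statistics.quantiles` with the default exclusive method matches the hand computation used in class:

```python
import statistics

# Quiz data, sorted for the five-number summary
data = sorted([4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
               4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4])

q1, med, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers extend to the extreme values inside the 1.5*IQR fences
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print(f"Q1={q1}, median={med}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```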

2
Quiz

(Box-plot solution drawn as a figure)

3
Overview

Databases are highly susceptible to noisy, missing, and inconsistent data.
Low-quality data will lead to low-quality mining results.

“How can the data be preprocessed in order to help improve the quality of the
data and, consequently, of the mining results? How can the data be
preprocessed so as to improve the efficiency and ease of the mining process?”

4
Why Preprocess Data?

To satisfy the requirements of the intended use

 Factors of data quality:
◦Accuracy  may be lacking due to faulty instruments; human, computer, or transmission errors; deliberate errors …
◦Completeness  may be lacking due to different design phases, optional attributes
◦Consistency  may be lacking due to differing semantics, data types, field formats …
◦Timeliness  how up to date the data are
◦Believability  how much the data are trusted by users
◦Interpretability  how easily the data are understood

5
Major Preprocessing Tasks
That Improve Quality of Data

 Data cleaning  fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
 Data integration  include data from multiple sources in the analysis,
map semantic concepts, infer attributes …
 Data reduction  obtain a reduced representation of the data set that
is much smaller in volume yet produces almost the same analytical
results
 Discretization  raw attribute values are replaced by ranges or higher
conceptual levels
 Data transformation  e.g., normalization
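As a small illustration of the transformation step, min-max normalization rescales an attribute to [0, 1]; the sketch below uses made-up income values:

```python
# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [12000, 30000, 54000, 98000]  # hypothetical attribute values
print(min_max(incomes))  # smallest maps to 0.0, largest to 1.0
```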

6
Data Cleaning

 Data in the real world is dirty!

◦incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation=“ ” (missing data)
◦noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
◦inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
◦intentional (disguised missing data)  e.g., Jan. 1 recorded as everyone’s birthday
7
Data Cleaning

… fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data

A missing value does not necessarily imply an error in the data!

◦e.g., a driver’s license number may legitimately be blank for non-drivers

8
Data Cleaning
Missing Values

 Ignore the tuple  not very effective, unless the tuple contains
several attributes with missing values
 Fill in the missing value manually  time consuming, not
feasible for large data sets
 Use a global constant  replace all missing attribute values by the
same value (e.g., “unknown”)
 the mining program may mistakenly think that “unknown” is an interesting concept

9
Data Cleaning
Missing Values

 Use the mean or median  for normal (symmetric) data
distributions the mean is used, while skewed data distributions
should employ the median
 Use the mean or median of all samples belonging to the same
class as the given tuple  e.g., the mean or median of customers in
a certain age group
 Use the most probable value  using regression or inference-
based tools such as the Bayesian formula or a decision tree
 the most popular strategy
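The class-conditional fill-in (mean or median per group) can be sketched as follows; the records and field names are made up for illustration:

```python
import statistics

# Toy customer records; income is missing (None) for some of them
records = [
    {"age_group": "20-30", "income": 3000},
    {"age_group": "20-30", "income": None},
    {"age_group": "30-40", "income": 5200},
    {"age_group": "30-40", "income": 4800},
    {"age_group": "30-40", "income": None},
]

# Collect known incomes per age group, plus a global fallback
global_median = statistics.median(
    r["income"] for r in records if r["income"] is not None)
by_group = {}
for r in records:
    if r["income"] is not None:
        by_group.setdefault(r["age_group"], []).append(r["income"])

# Fill each missing income with the median of its own age group
for r in records:
    if r["income"] is None:
        vals = by_group.get(r["age_group"])
        r["income"] = statistics.median(vals) if vals else global_median
```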

10
Data Cleaning
Noisy Data

Noise is a random error or variance in a measured variable

Data smoothing techniques:
1. Binning
2. Regression
3. Outlier analysis

11
Data Cleaning
Noisy Data

1. Binning  smooth a sorted data value by consulting its
“neighborhood”
◦sorted values are partitioned into a number of “buckets,” or bins  local
smoothing
◦equal-frequency bins  each bin has the same number of values
◦equal-width bins  the interval range of values per bin is constant
 Smoothing by bin means  each bin value is replaced by the bin mean
 Smoothing by bin medians  each bin value is replaced by the bin median
 Smoothing by bin boundaries  each bin value is replaced by the closest
boundary value (the min and max in a bin are its boundaries)
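The three smoothing variants can be sketched in a few lines of Python; this minimal version assumes the data are sorted and divide evenly into equal-frequency bins:

```python
# Equal-frequency binning with three smoothing strategies (sketch)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Replace every value in a bin by the bin mean, median, or closest boundary
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_medians = [[b[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The output matches the worked price example on the next slides.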
12
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
13
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-width) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34

Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34

Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
14
Data Cleaning
Noisy Data

2. Regression  conform data values to a function
◦Linear regression  find the “best” line to fit two attributes, so that one
attribute can be used to predict the other
3. Outlier analysis
 Potter’s Wheel  an automated, interactive data
cleaning tool

15
Data Integration

 Entity identification problem
 Redundancy and correlation analysis
 Tuple duplication
 Data value conflict detection

16
Data Integration

Merging data from multiple data stores

Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
 Semantic heterogeneity  the entity identification problem
 Structure of data  functional dependencies and referential constraints
 Redundancy

17
Data Integration
Entity Identification Problem

 Schema integration and object matching  matching equivalent
real-world entities from multiple data sources

 Metadata  name, meaning, data type, and range of values
permitted; null rules for handling blank, zero, or null values

 Such metadata can help avoid errors in schema integration and data
transformation

18
Data Integration
Redundancy and Correlation Analysis

19
Data Integration
Redundancy and Correlation Analysis

                               gender
                            male    female    Total
Preferred    Fiction         250       200      450
reading      Non-fiction      50      1000     1050
             Total           300      1200     1500

20
Data Integration
Redundancy and Correlation Analysis

                               gender
                            male        female        Total
Preferred    Fiction         250 (90)    200 (360)      450
reading      Non-fiction      50 (210)  1000 (840)     1050
             Total           300        1200           1500
(expected counts shown in parentheses)

21
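The expected counts in parentheses come from e = (row total × column total) / n; plugging observed and expected counts into the χ² statistic can be checked with a short sketch:

```python
# Chi-square test of independence for the gender × preferred-reading table
observed = [[250, 200],    # fiction:     male, female
            [50, 1000]]    # non-fiction: male, female

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum((obs - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i, row in enumerate(observed)
           for j, obs in enumerate(row))

print(round(chi2, 2))  # ≈ 507.94, far above the critical value for
                       # 1 degree of freedom → the attributes are correlated
```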
Data Integration
Redundancy and Correlation Analysis

23
Data Integration
Redundancy and Correlation Analysis

24
Data Integration
Redundancy and Correlation Analysis

25
Data Integration
Redundancy and Correlation Analysis

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5
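Covariance for the two stock prices follows the identity cov(A, B) = E(A·B) − Ā·B̄ (population form); a positive value means the prices tend to rise and fall together:

```python
# Covariance of the two stocks from the table: cov(A, B) = E(A*B) - mean(A)*mean(B)
ae = [6, 5, 4, 3, 2]     # AllElectronics, T1..T5
ht = [20, 10, 14, 5, 5]  # HighTech, T1..T5
n = len(ae)

mean_ae = sum(ae) / n                               # 4.0
mean_ht = sum(ht) / n                               # 10.8
e_ab = sum(a * b for a, b in zip(ae, ht)) / n       # E(A*B) = 50.2
cov = e_ab - mean_ae * mean_ht

print(round(cov, 2))  # 7.0 → positive covariance: the stocks move together
```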

26
Data Integration
More Issues

Tuple duplication
The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy.
e.g., a purchaser’s name and address repeated with each purchase
Data value conflict
e.g., grading systems in two different institutes  A, B, … versus 90%,
80%, …

27
Quiz
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Calculate the correlation coefficient. Are these two attributes positively or negatively
correlated? Compute their covariance.
o (Hint: n = 18
o SDs for Age and %fat: 12.85 and 8.99, respectively
o Means for Age and %fat: 46.44 and 28.78, respectively
o E(Age × %fat) = 1431.29)
• Partition the data into three bins by each of equal-frequency and equal-width partitioning
• Use smoothing by bin boundaries to smooth these data
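The correlation part of the quiz can be checked numerically: r = cov(Age, %fat) / (σ_age · σ_fat), using population standard deviations to match the hinted SD values. This sketch uses the full-precision data, so the covariance differs slightly from what the rounded hints give:

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
n = len(age)

# cov(A, B) = E(A*B) - mean(A)*mean(B), population form
cov = (sum(a * f for a, f in zip(age, fat)) / n
       - statistics.fmean(age) * statistics.fmean(fat))
r = cov / (statistics.pstdev(age) * statistics.pstdev(fat))

print(round(cov, 2), round(r, 2))  # ≈ 94.46 and 0.82 → positively correlated
```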

28
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

29
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Equal-frequency bins for age
o Bin 1 = 23, 23, 27, 27, 39, 41
o Bin 2 = 47, 49, 50, 52, 54, 54
o Bin 3 = 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries
o Bin 1 = 23, 23, 23, 23, 41, 41
o Bin 2 = 47, 47, 47, 54, 54, 54
o Bin 3 = 56, 56, 56, 56, 61, 61
• Equal-width bins for age
o Bin 1 = 23, 23, 27, 27
o Bin 2 = 39, 41, 47, 49
o Bin 3 = 50, 52, 54, 54, 56, 57, 58, 58, 60, 61
• Smoothing by bin boundaries
o Bin 1 = 23, 23, 27, 27
o Bin 2 = 39, 39, 49, 49
o Bin 3 = 50, 50, 50, 50, 61, 61, 61, 61, 61, 61

30
