
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 6

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Outline

◼ Data Preprocessing
◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation

Data Preprocessing: Why Preprocess the Data?

Data Quality
◼ Elements defining data quality:
◼ Accuracy: are the values correct and accurate, or wrong?

◼ Completeness: are values missing, not recorded, or unavailable?

◼ Consistency: no inconsistent naming (customer ID vs. customer no.), coding, or formats (date and time formats: "1 pm" vs. "13:00")

◼ Timeliness: is the data updated in a timely manner? (e.g., calculating revenue from old, not-yet-updated data)

◼ Believability: how much do users trust the data? This depends on the collection source, method, and rules; the most trustworthy data can usually be obtained from governmental institutions such as a national statistics institute.

◼ Interpretability: how easily is the data understood? Imagine all attributes were symbols or abbreviations with no key explaining or describing them. Such descriptions belong in metadata, the data that describes the data.
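A minimal sketch of how two of these quality elements, completeness and consistency, can be checked programmatically. The records and field names below are illustrative, not from the lecture:

```python
import re

# Illustrative customer records (made up for this sketch).
records = [
    {"customer_id": "C001", "signup": "2024-01-15", "revenue": 120.0},
    {"customer_id": "C002", "signup": "15/01/2024", "revenue": None},  # mixed date format, missing value
    {"customer_id": None,   "signup": "2024-02-01", "revenue": 80.0},  # missing ID
]

iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Completeness: rows containing any missing (None) value.
incomplete = [i for i, r in enumerate(records)
              if any(v is None for v in r.values())]

# Consistency: rows whose signup date does not follow the ISO format.
inconsistent = [i for i, r in enumerate(records)
                if r["signup"] is not None and not iso_date.match(r["signup"])]

print("incomplete rows:", incomplete)          # rows 1 and 2
print("inconsistent date formats:", inconsistent)  # row 1
```

Checks like these are typically the first step of data cleaning, before any mining algorithm is applied.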
Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

◼ Data Integration
◼ Integration of multiple databases or files

◼ Data Reduction
◼ Dimensionality reduction
◼ Numerosity reduction

◼ Data Transformation
◼ Normalization (was applied in similarity and dissimilarity of ordinal data)
◼ Concept hierarchy generation: (university/faculty/specialization) apply generalization by using only the high-level university attribute
◼ Discretization: divide data into intervals, such as dividing age into intervals (childhood, youth, …)
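Two of the transformation tasks above can be sketched in a few lines: min-max normalization to [0, 1] and discretization of ages into named intervals. The interval boundaries are illustrative assumptions, not values from the lecture:

```python
ages = [3, 15, 22, 40, 67]

# Min-max normalization: rescale each value to [0, 1].
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

def discretize(age):
    # Illustrative interval boundaries (not specified in the lecture).
    if age < 13:
        return "childhood"
    if age < 30:
        return "youth"
    if age < 60:
        return "adulthood"
    return "senior"

labels = [discretize(a) for a in ages]
print(normalized)  # [0.0, 0.1875, 0.296875, 0.578125, 1.0]
print(labels)      # ['childhood', 'youth', 'youth', 'adulthood', 'senior']
```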
Data Integration: Attribute Redundancy and Correlation Analysis
◼ Redundant data often occur when multiple databases are integrated.
◼ Causes of redundancy:
◼ An attribute may be redundant if it can be "derived" from another attribute or set of attributes (e.g., the date-of-birth attribute and the age attribute).
◼ Inconsistencies in attribute naming can also cause redundancies in the resulting dataset (e.g., student ID vs. Student No.).
◼ Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality (e.g., 1,500 records of which 500 were redundant). Otherwise, the analysis may be biased toward the redundant attributes, whose effect appears larger than it really is and does not reflect the real data.
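The naming-inconsistency case above (student ID vs. Student No.) can be caught mechanically after integration by comparing attribute value vectors. A sketch over a made-up table:

```python
# Made-up integrated table: "student_id" and "student_no" came from two
# different sources but hold exactly the same values.
table = {
    "student_id": [101, 102, 103],
    "student_no": [101, 102, 103],  # duplicate under a different name
    "grade":      [85, 90, 78],
}

# Compare every pair of columns; identical value vectors flag redundancy.
redundant_pairs = []
cols = list(table)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if table[cols[i]] == table[cols[j]]:
            redundant_pairs.append((cols[i], cols[j]))

print(redundant_pairs)  # [('student_id', 'student_no')]
```

Exact equality only catches verbatim duplicates; derived attributes (age from date of birth) need the correlation analysis described next.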
How to Handle Redundancy with Correlation Analysis?
◼ Some redundancies can be detected by correlation analysis.
◼ Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
◼ Each type of data has its own correlation measure.
◼ Nominal data: blood type, hair color, nationality

[Figure: scatter plots contrasting independent attributes with dependent attributes that are positively or negatively correlated]
Correlation Analysis (Nominal Data)

◼ For nominal data, correlation between two attributes A and B can be discovered by a chi-square (χ²) test:

χ² = Σ (observed − expected)² / expected

◼ The observed value is the actual count of a joint event (A = aᵢ, B = bⱼ), and the expected value is its expected frequency:

expected = (count(A = aᵢ) × count(B = bⱼ)) / n

where n is the total number of data tuples.
Chi-Square Calculation: An Example

Observed (actual) counts, with expected frequencies in parentheses:

                          male        female       total
like science fiction      250 (90)    200 (360)     450
don't like sci-fiction     50 (210)  1000 (840)    1050
total                     300        1200          1500

For the first column, finish all rows and then move to the second column:

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
Chi-Square Calculation: An Example Cont.

Degrees of freedom are the maximum number of logically independent values that may vary in a data sample. For a contingency table, df = (#rows − 1) × (#columns − 1); here df = (2 − 1) × (2 − 1) = 1.

Critical value at the 0.05 level (df = 1): 3.841
Chi-Square Calculation: An Example Cont.

Critical value at the 0.05 level: 3.841
Chi-square: 507.93

1- Check the degrees of freedom: df = 1
2- Use probability level (alpha) = 0.05
3- Look up the critical value (cv) that matches both the df value and the alpha value

Chi-square > critical value (507.93 > 3.841). Therefore, there is a strong correlation between gender and preferred reading.
Chi-Square Calculation: An Example Cont.

Chi-square > critical value (507.93 > 3.841). Therefore, there is a strong correlation between gender and preferred reading.

From the contingency table:
◼ male & like science fiction: observed value (250) > expected value (90)
◼ female & don't like science fiction: observed value (1000) > expected value (840)

THEREFORE

Specifically, the result shows that "male" and "like science fiction" are correlated in the group, and "female" and "don't like science fiction" are correlated in the group.
Correlation Analysis (Numeric Data)

Correlation Coefficient

For two numeric attributes A (students' grades: a₁, …, aₙ) and B (studying hours: b₁, …, bₙ), the correlation coefficient measures the relation between them:

r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n σ_A σ_B)

where Ā and B̄ are the means, and σ_A and σ_B are the standard deviations of A and B.
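A sketch of the computation with made-up numbers (the lecture table names attributes A = students' grades and B = studying hours but shows no values):

```python
import math

grades = [55, 60, 70, 80, 95]  # A (illustrative values)
hours  = [2, 3, 5, 6, 9]       # B (illustrative values)

n = len(grades)
mean_a, mean_b = sum(grades) / n, sum(hours) / n

# Numerator: sum of products of deviations from the means (= n * covariance).
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(grades, hours)) / n

# Population standard deviations of each attribute.
std_a = math.sqrt(sum((a - mean_a) ** 2 for a in grades) / n)
std_b = math.sqrt(sum((b - mean_b) ** 2 for b in hours) / n)

r = cov / (std_a * std_b)
print(round(r, 3))  # close to +1: grades rise together with studying hours
```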
Covariance (Numeric Data)

Covariance is a statistical tool used to determine the relationship between the movements of two random variables: to what extent they change together. When two stocks tend to move together, they are said to have a positive covariance; a change in one variable is accompanied by a change in the other in the same direction. When they move inversely, the covariance is negative.
Covariance (Numeric Data)

Cov(A,B) = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄)

Notice: this is very near to the correlation formula, and we can get the correlation from the covariance:

r(A,B) = Cov(A,B) / (σ_A σ_B)
Covariance (Numeric Data)

◼ Independence implies Cov = 0.
◼ But Cov = 0 does not imply independence.

Therefore, the correlation coefficient is preferable to the covariance.
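A classic counterexample illustrates the second bullet: take Y = X², so Y is completely determined by X, yet their covariance is exactly zero:

```python
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]  # y is fully determined by x (a nonlinear dependence)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n  # 0 and 2

# Positive and negative deviation products cancel out exactly.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

print(cov)  # 0.0 despite the perfect dependence
```

Covariance (and the correlation coefficient) only detects linear relationships; zero covariance says nothing about nonlinear dependence.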
Covariance: An Example

A     B
2     5
3     8
5    10
4    11
6    14

Ā = 4 and B̄ = 9.6, so Cov(A,B) = (1/5) Σᵢ (aᵢ − 4)(bᵢ − 9.6) = 4 > 0 → positively dependent
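Recomputing the example above in code:

```python
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]

n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n  # 4 and 9.6

# Cov(A, B) = (1/n) * sum of products of deviations from the means.
cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

print(round(cov, 2))  # 4.0 > 0 -> A and B are positively dependent
```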
Covariance: Another Example

Cov > 0 → positively dependent
Thank You
