
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 6

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Outline

◼ Data Preprocessing
◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation

Data Preprocessing: Why Preprocess the Data?

Data Quality
◼ Elements defining data quality:
◼ Accuracy: are the values correct and accurate, or wrong?

◼ Completeness: are values missing, not recorded, or unavailable?

◼ Consistency: no inconsistent naming (customer ID vs. customer no.), coding, or formats (date and time formats: "1 pm" vs. "13:00")

◼ Timeliness: is the data updated in a timely manner? (e.g., calculating revenue from old, not-yet-updated data)

◼ Believability: how much do users trust the data? This depends on the collection source, method, and rules; the most trustworthy data can usually be obtained from governmental institutions such as a national statistics institute.

◼ Interpretability: how easily is the data understood? Imagine all attributes were symbols or abbreviations with no key explaining or describing them. Such descriptions belong in metadata, the data that describes the data.
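A minimal sketch of how two of these quality elements, completeness and consistency, can be checked programmatically. The records and field names below are illustrative, not from the lecture:

```python
import re

# Illustrative customer records (made up for this sketch).
records = [
    {"customer_id": "C001", "signup": "2024-01-15", "revenue": 120.0},
    {"customer_id": "C002", "signup": "15/01/2024", "revenue": None},  # mixed date format, missing value
    {"customer_id": None,   "signup": "2024-02-01", "revenue": 80.0},  # missing ID
]

iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Completeness: rows containing any missing (None) value.
incomplete = [i for i, r in enumerate(records)
              if any(v is None for v in r.values())]

# Consistency: rows whose signup date does not follow the ISO format.
inconsistent = [i for i, r in enumerate(records)
                if r["signup"] is not None and not iso_date.match(r["signup"])]

print("incomplete rows:", incomplete)          # rows 1 and 2
print("inconsistent date formats:", inconsistent)  # row 1
```

Checks like these are typically the first step of data cleaning, before any mining algorithm is applied.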
Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

◼ Data Integration
◼ Integration of multiple databases or files

◼ Data Reduction
◼ Dimensionality reduction
◼ Numerosity reduction

◼ Data Transformation
◼ Normalization (was applied in similarity and dissimilarity of ordinal data)
◼ Concept hierarchy generation: (university/faculty/specialization) apply generalization by using only the high-level university attribute
◼ Discretization: divide data into intervals, such as dividing age into intervals (childhood, youth, …)
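Two of the transformation tasks above can be sketched in a few lines: min-max normalization to [0, 1] and discretization of ages into named intervals. The interval boundaries are illustrative assumptions, not values from the lecture:

```python
ages = [3, 15, 22, 40, 67]

# Min-max normalization: rescale each value to [0, 1].
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

def discretize(age):
    # Illustrative interval boundaries (not specified in the lecture).
    if age < 13:
        return "childhood"
    if age < 30:
        return "youth"
    if age < 60:
        return "adulthood"
    return "senior"

labels = [discretize(a) for a in ages]
print(normalized)  # [0.0, 0.1875, 0.296875, 0.578125, 1.0]
print(labels)      # ['childhood', 'youth', 'youth', 'adulthood', 'senior']
```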
Data Integration: Attribute Redundancy and Correlation Analysis
◼ Redundant data often occur when multiple databases are integrated.
◼ Causes of redundancy:
◼ An attribute may be redundant if it can be "derived" from another attribute or set of attributes (e.g., the date-of-birth attribute and the age attribute).
◼ Inconsistencies in attribute naming can also cause redundancies in the resulting dataset (e.g., student ID vs. Student No.).
◼ Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality (e.g., 1,500 records of which 500 were redundant). Otherwise, the analysis may be biased toward the redundant attributes, whose effect appears larger than it really is and does not reflect the real data.
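The naming-inconsistency case above (student ID vs. Student No.) can be caught mechanically after integration by comparing attribute value vectors. A sketch over a made-up table:

```python
# Made-up integrated table: "student_id" and "student_no" came from two
# different sources but hold exactly the same values.
table = {
    "student_id": [101, 102, 103],
    "student_no": [101, 102, 103],  # duplicate under a different name
    "grade":      [85, 90, 78],
}

# Compare every pair of columns; identical value vectors flag redundancy.
redundant_pairs = []
cols = list(table)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if table[cols[i]] == table[cols[j]]:
            redundant_pairs.append((cols[i], cols[j]))

print(redundant_pairs)  # [('student_id', 'student_no')]
```

Exact equality only catches verbatim duplicates; derived attributes (age from date of birth) need the correlation analysis described next.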
How to Handle Redundancy with Correlation Analysis?
◼ Some redundancies can be detected by correlation analysis.
◼ Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
◼ Each type of data has its own correlation measure.
◼ Nominal data: blood type, hair color, nationality

[Figure: scatter plots contrasting independent attributes with dependent attributes that are positively or negatively correlated]
Correlation Analysis (Nominal Data)

◼ For nominal data, correlation between two attributes A and B can be discovered by a chi-square (χ²) test:

χ² = Σ (observed − expected)² / expected

◼ The observed value is the actual count of a joint event (A = aᵢ, B = bⱼ), and the expected value is its expected frequency:

expected = (count(A = aᵢ) × count(B = bⱼ)) / n

where n is the total number of data tuples.
Chi-Square Calculation: An Example

Observed (actual) counts, with expected frequencies in parentheses:

                          male        female       total
like science fiction      250 (90)    200 (360)     450
don't like sci-fiction     50 (210)  1000 (840)    1050
total                     300        1200          1500

For the first column, finish all rows and then move to the second column:

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
Chi-Square Calculation: An Example Cont.

Degrees of freedom are the maximum number of logically independent values that may vary in a data sample. For a contingency table, df = (#rows − 1) × (#columns − 1); here df = (2 − 1) × (2 − 1) = 1.

Critical value at the 0.05 level (df = 1): 3.841
Chi-Square Calculation: An Example Cont.

Critical value at the 0.05 level: 3.841
Chi-square: 507.93

1- Check the degrees of freedom: df = 1
2- Use probability level (alpha) = 0.05
3- Look up the critical value (cv) that matches both the df value and the alpha value

Chi-square > critical value (507.93 > 3.841). Therefore, there is a strong correlation between gender and preferred reading.
Chi-Square Calculation: An Example Cont.

Chi-square > critical value (507.93 > 3.841). Therefore, there is a strong correlation between gender and preferred reading.

From the contingency table:
◼ male & like science fiction: observed value (250) > expected value (90)
◼ female & don't like science fiction: observed value (1000) > expected value (840)

THEREFORE

Specifically, the result shows that "male" and "like science fiction" are correlated in the group, and "female" and "don't like science fiction" are correlated in the group.
Correlation Analysis (Numeric Data)

Correlation Coefficient

For two numeric attributes A (students' grades: a₁, …, aₙ) and B (studying hours: b₁, …, bₙ), the correlation coefficient measures the relation between them:

r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n σ_A σ_B)

where Ā and B̄ are the means, and σ_A and σ_B are the standard deviations of A and B.
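A sketch of the computation with made-up numbers (the lecture table names attributes A = students' grades and B = studying hours but shows no values):

```python
import math

grades = [55, 60, 70, 80, 95]  # A (illustrative values)
hours  = [2, 3, 5, 6, 9]       # B (illustrative values)

n = len(grades)
mean_a, mean_b = sum(grades) / n, sum(hours) / n

# Numerator: sum of products of deviations from the means (= n * covariance).
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(grades, hours)) / n

# Population standard deviations of each attribute.
std_a = math.sqrt(sum((a - mean_a) ** 2 for a in grades) / n)
std_b = math.sqrt(sum((b - mean_b) ** 2 for b in hours) / n)

r = cov / (std_a * std_b)
print(round(r, 3))  # close to +1: grades rise together with studying hours
```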
Covariance (Numeric Data)

Covariance is a statistical tool used to determine the relationship between the movements of two random variables: to what extent they change together. When two stocks tend to move together, they are said to have a positive covariance; a change in one variable is accompanied by a change in the other in the same direction. When they move inversely, the covariance is negative.
Covariance (Numeric Data)

Cov(A,B) = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄)

Notice: this is very near to the correlation formula, and we can get the correlation from the covariance:

r(A,B) = Cov(A,B) / (σ_A σ_B)
Covariance (Numeric Data)

◼ Independence implies Cov = 0.
◼ But Cov = 0 does not imply independence.

Therefore, the correlation coefficient is preferable to the covariance.
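A classic counterexample illustrates the second bullet: take Y = X², so Y is completely determined by X, yet their covariance is exactly zero:

```python
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]  # y is fully determined by x (a nonlinear dependence)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n  # 0 and 2

# Positive and negative deviation products cancel out exactly.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

print(cov)  # 0.0 despite the perfect dependence
```

Covariance (and the correlation coefficient) only detects linear relationships; zero covariance says nothing about nonlinear dependence.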
Covariance: An Example

A     B
2     5
3     8
5    10
4    11
6    14

Ā = 4 and B̄ = 9.6, so Cov(A,B) = (1/5) Σᵢ (aᵢ − 4)(bᵢ − 9.6) = 4 > 0 → positively dependent
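Recomputing the example above in code:

```python
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]

n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n  # 4 and 9.6

# Cov(A, B) = (1/n) * sum of products of deviations from the means.
cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

print(round(cov, 2))  # 4.0 > 0 -> A and B are positively dependent
```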
Covariance: Another Example

Cov > 0 → positively dependent
Thank You
