
Data Preprocessing:

Concepts and Techniques

Lecture 02
Data Preprocessing

An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?

A multi-dimensional view of data quality:

• Accuracy: correct or incorrect, precise or imprecise
• Completeness: missing records, unavailable values, ...
• Consistency: some values modified while related ones are not, discrepancies, ...
• Timeliness: is the data updated in a timely manner?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
• Real Data: flawed in various ways, e.g., faulty instruments, human or
computer error, transmission errors
• Incomplete Data: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy Data: containing noise, errors, or outliers
• e.g., Age=“−10” (an error)
• Inconsistent Data: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional Data (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Missing Data
• Data are not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
Weather Monitoring System:
Date        Temperature (°C)  Humidity (%)  Wind Speed (km/h)
2023-07-01  25.0              60            15
2023-07-02  26.5              58            10
2023-07-03  N/A               55            20
2023-07-04  N/A               62            12
2023-07-05  27.0              59            18

Explanation: On July 3rd and 4th, the temperature sensor malfunctioned, resulting in no temperature data being recorded for those days.
Missing Data
Missing data may be due to
• inconsistent with other recorded data and thus deleted: Sales Database

Transaction ID  Customer ID  Product ID  Quantity  Price   Total
001             1001         2001        2         $50.00  $100.00
002             1002         2002        -50       $30.00  N/A
003             1003         2003        3         $20.00  $60.00
004             1004         2004        1         $15.00  $15.00

Explanation: Transaction 002 was flagged and its total value was deleted because
it showed an impossible purchase quantity of -50 units, which is inconsistent with
the other recorded data.

• data not entered due to misunderstanding


Hospital Patient Records:
Patient ID  Age  Gender  Smoking Status  Diagnosis
001         50   Female  N/A             Diabetes
002         37   Male    Non-smoker      Asthma

Explanation: The "Smoking Status" field is missing for several patients because
a nurse misunderstood the form and left this field blank for those entries.
Missing Data
Missing data may be due to
• certain data may not be considered important at the time of entry
Customer Database:
Customer ID Name Email Occupation
1 Alice [email protected] N/A
2 Bob [email protected] N/A
3 Charlie [email protected] Engineer
4 Daisy [email protected] Teacher

Explanation: The company initially did not record customers' occupation, considering it
unimportant. Later, when analyzing customer demographics, this data was found missing for
initial entries.
• history or changes of the data are not registered

Product Inventory System:
Date        Product ID  Product Name  Price
2023-07-15  3001        Widget A      $12
2023-08-01  3001        Widget A      $15
2023-08-15  3002        Widget B      $20
Explanation: The price of Widget A is updated regularly, but past prices are not stored. If
analysis requires historical pricing data, only the most recent price is available, and previous
prices are missing.
• In such cases, it may be necessary to infer missing data.
How to Handle Missing Data?

Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   78          B
3           21   N/A         N/A
4           22   88          A
5           23   90          A

Ignore the tuple: usually done when the class label is missing (when doing
classification). Not effective when the % of missing values per attribute
varies considerably.

Student ID 3 has no grade in the data given, so we can ignore (drop) that row:

Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   78          B
4           22   88          A
5           23   90          A
How to Handle Missing Data?
Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

Fill it in automatically with a global constant, e.g., "unknown" (effectively a new class?!):

Student ID  Age      Exam Score  Grade
1           20       85          A
2           21       unknown     B
3           unknown  78          C
4           22       88          A
5           23       90          A
How to Handle Missing Data?
• Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

• Fill it in automatically with the attribute mean:
Age mean = (20 + 21 + 22 + 23) / 4 = 21.5
Exam Score mean = (85 + 78 + 88 + 90) / 4 = 85.25

Student ID  Age   Exam Score  Grade
1           20    85          A
2           21    85.25       B
3           21.5  78          C
4           22    88          A
5           23    90          A
How to Handle Missing Data?
Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

Fill it in automatically with the attribute mean for all samples belonging to the same class (smarter):

Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   85          B
3           22   78          C
4           22   88          A
5           23   90          A
How to Handle Missing Data?

Fill it in automatically with the most probable value: inference-based, such as a
Bayesian formula or a decision tree.
(This will be explained with examples and hands-on work in the classification topic.)

Fill in the missing value manually: this is tedious and often infeasible.
Numerical

Fill in Missing Values: Given the dataset:

Age Salary
25 5000
30 ?
35 8000
? 10000
27 5500
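A minimal Python sketch of one way to do this exercise, assuming pandas is available and filling each gap with its column mean (other strategies from the previous slides, such as a global constant, work the same way):

```python
import pandas as pd

# Dataset from the exercise; None marks the missing entries ("?")
df = pd.DataFrame({
    "Age":    [25, 30, 35, None, 27],
    "Salary": [5000, None, 8000, 10000, 5500],
})

# Replace every missing value with the mean of its own column
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```

With this data the fills would be Age = (25 + 30 + 35 + 27) / 4 = 29.25 and Salary = (5000 + 8000 + 10000 + 5500) / 4 = 7125.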
Noisy Data
• Noise: random error or variance in a measured variable
- A temperature sensor records 100°C in one reading while nearby sensors
show around 22°C. The 100°C reading is likely random noise.
• Incorrect attribute values may be due to
• faulty data collection instruments
- A humidity sensor reads 120% (which is impossible) due to a
malfunction.
• data entry problems
- An income field incorrectly shows -$5,000 due to a typing error.
• data transmission problems
- A transaction record shows missing data (e.g., Amount is N/A) because
of a transmission error.
• inconsistency in naming convention
- "john doe" vs. "John Doe" in user records, leading to inconsistent
naming.
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
Numerical

Smooth Noisy Data: Given the dataset:

Temperature
20
22
21
100
23
20
25
19

Apply a moving average filter with a window size of 3 to smooth the data.
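A small sketch of this smoothing in Python, assuming pandas; a centered window of size 3 is used, so the first and last values have no complete window:

```python
import pandas as pd

temps = pd.Series([20, 22, 21, 100, 23, 20, 25, 19])

# Centered moving average with window size 3
smoothed = temps.rolling(window=3, center=True).mean()
print(smoothed)
```

Note that the 100 °C spike still inflates the averages of its neighbours; a rolling median (temps.rolling(3, center=True).median()) is more robust to such outliers.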
Data Cleaning - IQR (Interquartile Range) method

The IQR method is used to identify and remove outliers from a dataset.
It is based on the range within which the middle 50% of data values lie.
Outliers are values that fall outside this range.

Quartiles divide a dataset into four equal parts. Q1 (First Quartile) is the 25th percentile,
and Q3 (Third Quartile) is the 75th percentile. Here’s how to calculate them:
Steps to Calculate Q1 and Q3:
Sort the Data: Arrange your data in ascending order.
Determine the Position of Q1 and Q3:
Q1 Position: Position of Q1=(N+1)/4
Q3 Position: Position of Q3=3×(N+1)/4
Where N is the number of data points.

Find Q1 and Q3:


If the position is an integer: The quartile value is the data point at that position.
If the position is not an integer: Interpolate between the two closest data points.
Data Cleaning - IQR (Interquartile Range) method
Calculate the IQR:
Find the first quartile (Q1) and the third quartile (Q3).
Compute the IQR as IQR=Q3−Q1

Determine the bounds:


Calculate the lower bound as Lower Bound=Q1−1.5×IQR
Calculate the upper bound as Upper Bound=Q3+1.5×IQR
Identify outliers:
Any data point outside the lower and upper bounds is considered an
outlier.
Remove the outlier from the data set.
Numerical
Identify or Remove Outliers: Given the dataset:
Scores
80
86
79
150
88
90
84
92
25
Identify the outlier using the IQR (Interquartile Range) method and remove it.
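A Python sketch of this exercise that follows the (N+1)-position quartile rule described above (library defaults such as numpy.percentile use a slightly different interpolation, so their quartiles can differ a little):

```python
def quartile(sorted_values, fraction):
    """Quartile using the (N+1)-position rule with linear interpolation."""
    pos = fraction * (len(sorted_values) + 1)   # 1-based position
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_values[lower - 1]
    # interpolate between the two closest data points
    return sorted_values[lower - 1] + frac * (sorted_values[lower] - sorted_values[lower - 1])

scores = sorted([80, 86, 79, 150, 88, 90, 84, 92, 25])
q1 = quartile(scores, 0.25)                  # position 2.5 -> 79.5
q3 = quartile(scores, 0.75)                  # position 7.5 -> 91.0
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 62.25 and 108.25
outliers = [x for x in scores if x < low or x > high]
cleaned = [x for x in scores if low <= x <= high]
print(outliers)   # [25, 150]
```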
Data Integration

• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
Example- Integration of Multiple Databases
Database 1:
ID  Age
1   34
2   45
3   27
4   40
5   19

Database 2:
ID  Name
1   John
2   Alice
3   Bob
4   Rahul
5   Harry

Integrating the databases to form a single dataset:
ID  Name   Age
1   John   34
2   Alice  45
3   Bob    27
4   Rahul  40
5   Harry  19
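A minimal sketch of this integration with pandas, assuming both sources share the ID key as in the example:

```python
import pandas as pd

db1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "Age": [34, 45, 27, 40, 19]})
db2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["John", "Alice", "Bob", "Rahul", "Harry"]})

# Join on the shared key; schema integration means recognising that
# both sources use "ID" for the same entity
merged = pd.merge(db2, db1, on="ID", how="inner")
print(merged)   # ID, Name, Age for all five customers
```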
Handling Redundancy in Data Integration

Redundant data occur often when integration of multiple databases


Object identification: The same attribute or object may have different
names in different databases
Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
Redundant attributes can often be detected by correlation analysis
and covariance analysis (a small sketch follows this slide)
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
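A hedged sketch of correlation-based redundancy detection on a made-up table; the column names here are hypothetical, chosen to mimic the "derived attribute" case above:

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 15, 11],
    "annual_revenue":  [120, 144, 108, 180, 132],   # 12 x monthly -> derivable, redundant
    "employees":       [5, 8, 4, 9, 7],
})

# Pearson correlation matrix; pairs with |r| close to 1 are redundancy candidates
corr = df.corr()
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > 0.95:      # threshold is a judgement call
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")
```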
Data Reduction Strategies

Data reduction:
Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical
results

Why data reduction?


A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
Data Reduction Strategies
Data reduction strategies
Dimensionality reduction
Wavelet transforms
Example: Compressing an image by converting it into wavelet coefficients and
keeping only the important ones.
Principal Components Analysis (PCA)
Example: Reducing a dataset of 100 features to just 10 principal components (see the sketch after this list).
Feature subset selection, feature creation
Example: Using only the height and weight from a health dataset, ignoring other
less important features.

Numerosity reduction


Regression and Log-Linear Models
Example: Using a simple line to predict house prices based on size.
Histograms, clustering, sampling
Example: Creating a bar chart to show the frequency of different age groups in a
survey.
Data cube aggregation
Example: Summarizing sales data by week instead of daily.

Data compression
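A minimal PCA sketch for the dimensionality-reduction item above, assuming scikit-learn; the data here is random, so the retained variance is low, whereas real, correlated features would keep much more:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))      # toy data: 200 samples, 100 features

pca = PCA(n_components=10)           # keep the 10 leading principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (200, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```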
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Min-max normalization

Min-Max Normalization rescales feature values to a specified range, usually


[0, 1].
Sometimes it rescales feature values to [-1, 1].

It is used to ensure that all features contribute equally to the analysis by


standardizing their range.
Normalized Value = (X − min(X)) / (max(X) − min(X))

Where:
X is the original value.
min(X) is the minimum value of the feature.
max(X) is the maximum value of the feature.

Person  Age
A       23
B       45
C       30
D       50
E       40
Example

Person  Age
A       23
B       45
C       30
D       50
E       40

Minimum Age (Min) = 23
Maximum Age (Max) = 50

Apply Min-Max Normalization to each value:
Normalized Age = (Age − 23) / (50 − 23)

Person A: Normalized Age = (23 − 23) / (50 − 23) = 0/27 = 0
Person B: Normalized Age = (45 − 23) / (50 − 23) = 22/27 ≈ 0.81
Person C: Normalized Age = (30 − 23) / (50 − 23) = 7/27 ≈ 0.26
Person D: Normalized Age = (50 − 23) / (50 − 23) = 27/27 = 1
Person E: Normalized Age = (40 − 23) / (50 − 23) = 17/27 ≈ 0.63

Person  Age  Normalized Age
A       23   0.00
B       45   0.81
C       30   0.26
D       50   1.00
E       40   0.63
Numerical

Normalization: Given the dataset:


Value
100
400
200
500
Normalize the values to a range of [0, 1].
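A quick check of this exercise in plain Python, applying the min-max formula from the previous slides:

```python
values = [100, 400, 200, 500]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.75, 0.25, 1.0]
```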
Data Discretization Methods
Typical methods (all the methods can be applied recursively):

Binning (top-down split, unsupervised)

Binning is a data preprocessing technique that transforms numerical variables
into categorical ones by dividing the range of the variable into bins.
It is typically top-down: the range of the variable is divided into intervals
(bins) in an unsupervised manner.

Customer ID  Age
1            23
2            45
3            31
4            52
5            37

Customer ID  Age  Age Group
1            23   20-29
2            45   40-49
3            31   30-39
4            52   50-59
5            37   30-39
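A short sketch of this binning with pandas; pd.cut gives the equal-width decade bins used above, while pd.qcut would give equal-frequency bins instead:

```python
import pandas as pd

customers = pd.DataFrame({"Customer ID": [1, 2, 3, 4, 5],
                          "Age": [23, 45, 31, 52, 37]})

# Decade-style, equal-width bins: [20, 30), [30, 40), [40, 50), [50, 60)
bins = [20, 30, 40, 50, 60]
labels = ["20-29", "30-39", "40-49", "50-59"]
customers["Age Group"] = pd.cut(customers["Age"], bins=bins,
                                labels=labels, right=False)
print(customers)
```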
Data Discretization Methods
Histogram analysis (top-down split, unsupervised)

It involves creating histograms to understand the distribution of data points
within bins.

Approach: This method can be seen as a top-down split where data is divided
into bins based on the value range in an unsupervised manner.

Customer ID  Age
1            23
2            45
3            31
4            52
5            37

Customer ID  Age  Age Group
1            23   20-29
2            45   40-49
3            31   30-39
4            52   50-59
5            37   30-39

Age Group  Frequency
20-29      1
30-39      2
40-49      1
50-59      1
Data Discretization Methods
Clustering analysis (unsupervised, top-down split or bottom-up merge)

Clustering is the task of dividing a set of objects into groups (clusters) so
that objects in the same cluster are more similar to each other than to those
in other clusters.

Approach:
Top-down split: methods like divisive clustering start with all data points in
one cluster and split them recursively.
Bottom-up merge: methods like agglomerative clustering start with each data
point as its own cluster and merge them recursively.

Customer ID  Age  Purchase Amount
1            23   200
2            45   500
3            31   150
4            52   700
5            37   300

Clusters (assuming 2 clusters based on age and purchase amount):
Cluster 1: Customers 1, 3, 5 (younger, lower purchases)
Cluster 2: Customers 2, 4 (older, higher purchases)
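The slide describes divisive (top-down) and agglomerative (bottom-up) clustering; the sketch below uses k-means with scikit-learn as a simple stand-in to group the same five customers, scaling the features first so the purchase amounts do not dominate the distances:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# (Age, Purchase Amount) for customers 1-5
X = [[23, 200], [45, 500], [31, 150], [52, 700], [37, 300]]

X_scaled = MinMaxScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)   # customers 1, 3, 5 fall in one cluster; 2 and 4 in the other
```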
Data Discretization Methods
Decision-tree analysis (supervised, top-down split)

Decision-tree analysis involves using a tree-like model to make decisions based
on the values of input features. It is a supervised learning method.

Approach: The method starts at the root and splits the data recursively into
subsets based on feature values (top-down split).

Customer ID  Age  Purchase Amount  Repeat Purchase
1            23   200              No
2            45   500              No
3            31   150              No
4            52   700              Yes
5            37   300              No

Decision Tree:
Node 1 (Root): Age
  Age < 40: Repeat Purchase = No
  Age ≥ 40: split on Purchase Amount
    Purchase Amount < 600: Repeat Purchase = No
    Purchase Amount ≥ 600: Repeat Purchase = Yes
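A minimal sketch of the same idea with scikit-learn's DecisionTreeClassifier on the five rows above; with so little data the learned split (e.g., a single threshold on Purchase Amount or Age) may differ in detail from the hand-drawn tree:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Age":            [23, 45, 31, 52, 37],
    "PurchaseAmount": [200, 500, 150, 700, 300],
    "RepeatPurchase": ["No", "No", "No", "Yes", "No"],
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[["Age", "PurchaseAmount"]], df["RepeatPurchase"])

# Print the learned splits as text
print(export_text(tree, feature_names=["Age", "PurchaseAmount"]))
```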
Data Discretization Methods
Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

Correlation analysis examines the relationship between two or more


variables. The chi-squared (χ2) test is often used to test the independence
of two categorical variables.

Approach: In a bottom-up merge, variables that have a significant


correlation are grouped together, which can be seen in methods like
hierarchical clustering that use correlation measures to merge clusters.
Correlation (e.g., χ2) analysis
Step 1: Create a Contingency Table

A contingency table is used to display the frequency distribution of the
variables. It shows how often certain combinations of categories occur.

1. Identify the variables: in this case, the variables are "Age Group" and
"Repeat Purchase".
2. Organize the data: arrange the data into a table format, showing the counts
of each combination of the variables.

Customer ID  Age  Repeat Purchase
1            23   No
2            45   Yes
3            31   No
4            52   Yes
5            37   No
6            28   No
7            55   Yes
8            49   Yes

Age Group     Repeat Purchase: Yes  Repeat Purchase: No  Row Total
20-29         0                     2                    2
30-39         0                     2                    2
40-49         2                     0                    2
50-59         2                     0                    2
Column Total  4                     4                    8
Correlation (e.g., χ2) analysis
Step 2: Calculate Expected Frequencies

Expected frequencies are calculated to determine what the frequencies would be
if there was no association between the variables.

Formula for expected frequency: Expected frequency = (Row total × Column total) / Grand total

Apply the formula: for each cell in the table, calculate the expected frequency.

Age Group     Repeat Purchase: Yes  Repeat Purchase: No  Row Total
20-29         0                     2                    2
30-39         0                     2                    2
40-49         2                     0                    2
50-59         2                     0                    2
Column Total  4                     4                    8

Age Group  Expected Yes  Expected No
20-29      1             1
30-39      1             1
40-49      1             1
50-59      1             1
Correlation (e.g., χ2) analysis
Step 3: Calculate the Chi-Squared Statistic

The chi-squared statistic measures how much the observed frequencies deviate
from the expected frequencies.

Formula for chi-squared: χ2 = Σ [(Oi − Ei)² / Ei]
where Oi is the observed frequency and Ei is the expected frequency.

Calculate the statistic: compute the chi-squared value for each cell and sum them up.

Age Group  Observed (Yes)  Expected (Yes)  (O−E)²/E (Yes)  Observed (No)  Expected (No)  (O−E)²/E (No)
20-29      0               1               1               2              1              1
30-39      0               1               1               2              1              1
40-49      2               1               1               0              1              1
50-59      2               1               1               0              1              1

Sum the chi-squared values for all cells: χ2 = 1+1+1+1+1+1+1+1 = 8
The Chi Square Distribution
Correlation (e.g., χ2) analysis
Step 4: Calculate Degrees of Freedom

Degrees of freedom are used to determine the critical value from the chi-squared
distribution.

Formula for degrees of freedom: Degrees of freedom=(No of rows−1)×(No of columns−1)


Calculate: Degrees of freedom=(4−1)×(2−1)=3

Step 5: Perform Chi-Squared Test

Compare the calculated chi-squared statistic to the critical value from the chi-squared
distribution table at the desired significance level (e.g., 0.05) with the calculated degrees
of freedom.

Find the critical value: For 3 degrees of freedom at the 0.05 significance level, the
critical value for χ2 is approximately 7.815.

Compare the values: Since the calculated χ2=8 is greater than 7.815, reject the null
hypothesis that the variables are independent.
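The same test can be reproduced with SciPy's chi2_contingency, assuming SciPy is available; it returns the statistic, the p-value, the degrees of freedom, and the expected counts from Step 2:

```python
from scipy.stats import chi2_contingency

# Observed contingency table from Step 1 (rows: age groups, columns: Yes / No)
observed = [
    [0, 2],   # 20-29
    [0, 2],   # 30-39
    [2, 0],   # 40-49
    [2, 0],   # 50-59
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)   # chi2 = 8.0, dof = 3, p ≈ 0.046 < 0.05 -> reject independence
print(expected)             # every expected count is 1, matching Step 2
```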
Analyse the correlation (e.g., χ2) for the given data

Outcome        Treatment A  Treatment B  Control
Recovered      30           50           20
Not Recovered  20           40           40
Summary
• Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
