Lecture 02
Data Preprocessing
An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data reduction: dimensionality reduction, data compression
• Data transformation and data discretization: normalization, concept hierarchy generation
Data Cleaning
• Real Data: flawed in various ways, e.g., faulty instruments, human or
computer error, transmission error
• Incomplete Data: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy Data: containing noise, errors, or outliers
• e.g., Age=“−10” (an error)
• Inconsistent Data: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional Data: (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
Weather Monitoring System:
Date        Temperature (°C)  Humidity (%)  Wind Speed (km/h)
2023-07-01  25.0              60            15
2023-07-02  26.5              58            10
2023-07-03  N/A               55            20
2023-07-04  N/A               62            12
2023-07-05  27.0              59            18
• data not entered due to misunderstanding
Explanation: The "smoking status" field is missing for several patients because
a nurse misunderstood the form and left this field blank for those entries.
Missing Data
Missing data may be due to
• certain data may not be considered important at the time of entry
Customer Database:
Customer ID Name Email Occupation
1 Alice [email protected] N/A
2 Bob [email protected] N/A
3 Charlie [email protected] Engineer
4 Daisy [email protected] Teacher
Explanation: The company initially did not record customers' occupation, considering it
unimportant. Later, when analyzing customer demographics, this data was found missing for
initial entries.
• history or changes of the data were not recorded
Product Inventory System:
Date        Product ID  Product Name  Price
2023-07-15  3001        Widget A      $12
2023-08-01  3001        Widget A      $15
2023-08-15  3002        Widget B      $20
Explanation: The price of Widget A is updated regularly, but past prices are not stored. If
analysis requires historical pricing data, only the most recent price is available, and previous
prices are missing.
• In such cases, it may be necessary to infer missing data.
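A minimal sketch (not part of the lecture; the values are copied from the weather table above) of how one might locate missing entries with pandas before inferring them:

```python
# Find the rows whose temperature reading is missing.
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "Date": ["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-04", "2023-07-05"],
    "Temperature_C": [25.0, 26.5, np.nan, np.nan, 27.0],
    "Humidity_pct": [60, 58, 55, 62, 59],
})

print(weather[weather["Temperature_C"].isna()])  # the two N/A readings
```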
How to Handle Missing Data?
Fill in the missing value manually: tedious and often infeasible for large data sets
Numerical Example
Age  Salary
25   5000
30   ?
35   8000
?    10000
27   5500
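A minimal sketch (not part of the lecture) of filling the table above automatically with the attribute mean, using pandas:

```python
# Fill missing Age/Salary values with the mean of each column (pandas).
import pandas as pd

df = pd.DataFrame({
    "Age":    [25, 30, 35, None, 27],
    "Salary": [5000, None, 8000, 10000, 5500],
})

# df.mean() gives per-column means (Age: 29.25, Salary: 7125.0);
# fillna() substitutes them into the missing cells column-wise.
df_filled = df.fillna(df.mean())
print(df_filled)
```

Smarter fills (e.g., the most probable value from a regression model) follow the same pattern.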
Noisy Data
• Noise: random error or variance in a measured variable
- A temperature sensor records 100°C in one reading while nearby sensors
show around 22°C. The 100°C reading is likely random noise.
• Incorrect attribute values may be due to
• faulty data collection instruments
- A humidity sensor reads 120% (which is impossible) due to a
malfunction.
• data entry problems
- An income field incorrectly shows -$5,000 due to a typing error.
• data transmission problems
- A transaction record shows missing data (e.g., Amount is N/A) because
of a transmission error.
• inconsistency in naming convention
- "john doe" vs. "John Doe" in user records, leading to inconsistent
naming.
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, bin medians, bin boundaries, etc.
(see the sketch after this list)
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
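A minimal sketch of binning (the sample values are assumed, not from the lecture): sort the data, partition it into equal-frequency bins, then replace each value by its bin mean.

```python
# Equal-frequency binning with smoothing by bin means (plain Python).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

n_bins = 3
size = len(data) // n_bins  # equal-frequency: the same number of values per bin

smoothed = []
for i in range(n_bins):
    bin_vals = data[i * size:(i + 1) * size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))

print(smoothed)
# Bins [4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]
# -> smoothed to [9.0]*4 + [22.75]*4 + [29.25]*4
```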
Numerical Example
Apply a moving average filter with a window size of 3 to smooth the data.
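One possible setup for this exercise (the series values are assumed), using pandas' rolling mean:

```python
# Smooth a noisy series with a centred moving average of window size 3.
import pandas as pd

s = pd.Series([10, 12, 45, 13, 11, 14, 12])  # 45 looks like a noisy spike

smoothed = s.rolling(window=3, center=True).mean()
print(smoothed)  # endpoints are NaN: no full window exists there
```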
Data Cleaning - IQR (Interquartile Range) method
The IQR method is used to identify and remove outliers from a dataset.
It is based on the range within which the middle 50% of data values lie (from Q1 to Q3).
Outliers are commonly defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Quartiles divide a dataset into four equal parts. Q1 (First Quartile) is the 25th percentile,
and Q3 (Third Quartile) is the 75th percentile. Here’s how to calculate them:
Steps to Calculate Q1 and Q3:
Sort the Data: Arrange your data in ascending order.
Determine the positions of Q1 and Q3:
Q1 position = (N + 1) / 4
Q3 position = 3 × (N + 1) / 4
where N is the number of data points.
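A minimal sketch of the IQR rule with numpy (sample data assumed; note that np.percentile interpolates slightly differently from the (N + 1)/4 position rule above):

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")  # 102 is flagged
```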
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
Example: Integration of Multiple Databases

Database 1:
ID  Age
1   34
2   45
3   27
4   40
5   19

Database 2:
ID  Name
1   John
2   Alice
3   Bob
4   Rahul
5   Harry

Integrating the databases to form a single dataset:
ID  Name   Age
1   John   34
2   Alice  45
3   Bob    27
4   Rahul  40
5   Harry  19
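A minimal sketch of this integration as a pandas merge on the shared ID key:

```python
# Combine the two databases into a single dataset keyed on ID.
import pandas as pd

db1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "Age": [34, 45, 27, 40, 19]})
db2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["John", "Alice", "Bob", "Rahul", "Harry"]})

merged = db2.merge(db1, on="ID")  # columns: ID, Name, Age
print(merged)
```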
Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated (e.g., the
same attribute under different names, or one attribute derivable from another)
Data Reduction
• Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical results
• Strategies: dimensionality reduction, data compression
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Min-max normalization
Maps a value v of an attribute A to v′ in a new range [new_min_A, new_max_A]:
v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
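A minimal sketch (sample values assumed) of min-max normalization to [0, 1], alongside the z-score normalization listed in the methods above:

```python
# Min-max and z-score normalization with numpy.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max: map the range [min, max] linearly onto [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: centre on the mean and scale by the standard deviation.
zscore = (x - x.mean()) / x.std()

print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(zscore)
```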
Clustering analysis (unsupervised, top-down split or bottom-up merge)
Approach:
• Top-down split: methods like divisive clustering start with all data points
in one cluster and split them recursively.
• Bottom-up merge: methods like agglomerative clustering start with each data
point as its own cluster and merge them recursively.
Clusters: assuming 2 clusters based on age and purchase amount:
• Cluster 1: Customers 1, 3, 5 (younger, lower purchases)
• Cluster 2: Customers 2, 4 (older, higher purchases)
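A minimal sketch of the bottom-up merge idea with scikit-learn's AgglomerativeClustering (the ages and purchase amounts are taken from the customer table on the next slide):

```python
# Agglomerative (bottom-up) clustering of customers by age and purchase amount.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[23, 200], [45, 500], [31, 150], [52, 700], [37, 300]])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # customers 1, 3, 5 in one cluster; 2 and 4 in the other
```

In practice the two attributes should be scaled first; otherwise purchase amount dominates the distance computation.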
Data Discretization Methods
Decision-tree analysis (supervised, top-down split)
Decision-tree analysis involves using a tree-like model to make decisions based
on the values of input features. It is a supervised learning method.
Approach: The method starts at the root and splits the data recursively into
subsets based on feature values (top-down split).

Customer ID  Age  Purchase Amount  Repeat Purchase
1            23   200              No
2            45   500              No
3            31   150              No
4            52   700              Yes
5            37   300              No

Decision Tree:
Node 1 (Root): Age
Age < 40: Repeat Purchase = No
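A minimal sketch reproducing this analysis with scikit-learn (values and labels from the table above):

```python
# Fit a small decision tree on (Age, Purchase Amount) -> Repeat Purchase.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[23, 200], [45, 500], [31, 150], [52, 700], [37, 300]]
y = ["No", "No", "No", "Yes", "No"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Age", "Purchase Amount"]))
```

With only five tuples the tree may choose Purchase Amount rather than Age for the root split; either attribute separates the single "Yes" tuple here.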
2. Organize the data: arrange the data into a table format, showing the counts
of each combination of the variables.

Age Group     Repeat Purchase: Yes  Repeat Purchase: No  Row Total
20-29         0                     2                    2
30-39         0                     2                    2
40-49         2                     0                    2
50-59         2                     0                    2
Column Total  4                     4                    8
Correlation (e.g., χ²) analysis
Step 2: Calculate Expected Frequencies
Apply the formula: for each cell in the table, calculate the expected frequency
as E = (row total × column total) / grand total.
The chi-squared statistic measures how the observed frequencies deviate from the
expected frequencies:
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency.
Calculate the statistic: Compute the chi-squared value for each cell and sum them up.
Degrees of freedom are used to determine the critical value from the chi-squared
distribution.
Compare the calculated chi-squared statistic to the critical value from the chi-squared
distribution table at the desired significance level (e.g., 0.05) with the calculated degrees
of freedom.
Find the critical value: for 3 degrees of freedom at the 0.05 significance level,
the critical value for χ² is approximately 7.815.
Compare the values: since the calculated χ² = 8 is greater than 7.815, reject the
null hypothesis that the variables are independent.
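The computation can be checked with scipy's chi2_contingency on the age-group table above (no Yates correction is applied since the table is larger than 2 × 2):

```python
# Chi-squared test of independence for the 4x2 age-group table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[0, 2],   # 20-29: Yes, No
                     [0, 2],   # 30-39
                     [2, 0],   # 40-49
                     [2, 0]])  # 50-59

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)  # chi2 = 8.0, dof = 3, p ~ 0.046 < 0.05 -> dependent
```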
Analyse the correlation (e.g., χ²) for the given data:

Outcome        Treatment A  Treatment B  Control
Recovered      30           50           20
Not Recovered  20           40           40
Summary
• Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
• Data cleaning: e.g., missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization