Topic 05 - Data Preprocessing

University of Science, VNU-HCM

Faculty of Information Technology

Foundations of Artificial Intelligence Course


Introduction to Data Science Course

Data Preprocessing

Le Ngoc Thanh
[email protected]
Department of Computer Science

Ho Chi Minh City


Contents
◎ Why do we need to preprocess data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation

Data
◎ Attribute (Key) - Value
◎ Data types
○ numeric, categorical
○ static, dynamic (time)
◎ Other data types
○ Distributed data
○ Text data
○ Web data, metadata
○ Pictures, audio / video
○ ....

Data quality
◎ Missing, incomplete: missing attribute values, missing attributes of interest, or containing only aggregate data
○ Example: age, weight = ""
◎ Noise: contains errors or outliers
○ Example: salary = "-100 000"
◎ Conflict: inconsistency in codes or names
○ Example: age = 42 but birth = 03/07/1997; US = USA?

Consequences of poor data quality
◎ Correct decisions must be based on accurate data
○ For example, duplicated or missing data can lead to inaccurate, or even misleading, statistics.
◎ A data warehouse needs consistent integration of quality data

"Poor-quality data -> poor mining results"

Solutions?
◎ Data Cleaning
○ Fill in missing values, smooth noisy data, identify and remove discrepancies, and resolve conflicting data
◎ Data Integration
○ Combine and integrate data from multiple databases and files
◎ Data Transformation
○ Aggregation
◎ Data Reduction
○ Reduce the data size while preserving analytical results

Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation

Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process of:
○ Filling in missing values
○ Identifying and eliminating noisy data
○ Resolving conflicting data

Fill in missing values (1/2)
◎ Delete records with missing values:
○ Commonly used when the class label is missing (in classification)
○ Easy, but not effective, especially when the ratio of missing values is high
◎ Fill in missing values manually: tedious and usually not feasible
◎ Fill in missing values automatically:
○ Replace with a global constant, e.g. "unknown"; this can become a new class in the data

Fill in missing values (2/2)
◎ Fill in missing values automatically (see the sketch below):
○ Replace with the attribute's mean
○ Replace with the attribute's mean within the same class
○ Replace with the most likely value: inferred with a Bayesian formula, a decision tree, or the EM (Expectation Maximization) algorithm
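
A minimal sketch of these automatic filling strategies using pandas (the column names and the sentinel value are illustrative, not from the slides):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "cls":    ["a", "a", "b", "b", "b"],
        "weight": [50.0, np.nan, 70.0, np.nan, 80.0],
    })

    # Replace with a global constant (the slides suggest a label like
    # "unknown"; a numeric sentinel keeps the column numeric)
    df["w_const"] = df["weight"].fillna(-1.0)

    # Replace with the attribute's overall mean
    df["w_mean"] = df["weight"].fillna(df["weight"].mean())

    # Replace with the attribute's mean within the same class
    df["w_class_mean"] = df["weight"].fillna(
        df.groupby("cls")["weight"].transform("mean"))

    print(df)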

Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process of:
○ Filling in missing values
○ Identifying and eliminating noisy data
○ Resolving conflicting data

Noise reduction
◎ The basic methods of noise reduction:
○ Binning method:
◉ Sort and divide data into equal-width or equal-depth bins
◉ Reduce noise using the bin mean, median, margin (boundaries), …
○ Clustering method:
◉ Detect and remove outliers
○ Regression method:
◉ Fit the data to a regression function

Noise reduction – Binning (1/4)
◎ Binning method
○ Divide data into equal-width bins:
◉ Divide the range of values into N intervals of approximately equal size
◉ The width of each interval = (maximum value - minimum value) / N
○ Divide data into equal-depth bins:
◉ Divide the range of values into N intervals that each contain approximately the same number of samples
Noise reduction – Binning (2/4)
◎ Example with equal-width bins:
The temperature values with N = 7:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Bin:   [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
Count:    2       2       4       2       0       2       2

Each bin covers left bound <= value < right bound. The range of values is divided into N intervals; the width of each interval = (maximum value - minimum value) / N = (85 - 64) / 7 = 3.
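
The bin edges and counts above can be reproduced with numpy as a quick check of the worked example (numpy is an assumed dependency here):

    import numpy as np

    temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    counts, edges = np.histogram(temps, bins=7)  # width = (85 - 64) / 7 = 3

    print(edges)   # [64. 67. 70. 73. 76. 79. 82. 85.]
    print(counts)  # [2 2 4 2 0 2 2]

Note that np.histogram's last bin is closed on the right, matching [82,85] above.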
Noise reduction – Binning (3/4)
◎ But equal-width binning is not good for skewed data

(Figure: histogram of salaries in the company, with equal-width bins from [0 – 200,000) up to [1,800,000 – 2,000,000]; the skewed data concentrates in a few bins.)
Noise reduction – Binning (4/4)
◎ Example with equal-depth bins:
The temperature values with N = 4:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Bin:   [64 .. 69] [70 .. 72] [73 .. 81] [83 .. 85]
Count:     4          4          4          2

Depth = 4, except for the last bin. The range of values is divided into N intervals that each contain approximately the same number of samples.
Noise reduction with binning
◎ Sorted prices:
4, 8, 15, 21, 21, 24, 25, 28, 34
◎ Divide the data into equal-depth bins with N = 3:
○ Bin 1: 4, 8, 15
○ Bin 2: 21, 21, 24
○ Bin 3: 25, 28, 34
→ How should the values in each bin be smoothed?

Noise reduction with binning
○ Bin 1: 4, 8, 15
○ Bin 2: 21, 21, 24
○ Bin 3: 25, 28, 34

Smoothing by mean:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29

Smoothing by median:
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21
- Bin 3: 28, 28, 28

Smoothing by margin:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
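
A minimal sketch of the three smoothing rules on the sorted prices, assuming equal-depth bins of depth 3 (numpy is an assumed dependency):

    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = prices.reshape(3, 3)  # three equal-depth bins

    # Smoothing by mean / median: every value becomes the bin's statistic
    by_mean   = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)        # 9, 22, 29
    by_median = np.repeat(np.median(bins, axis=1), 3).reshape(3, 3)  # 8, 21, 28

    # Smoothing by margin: each value moves to the closer bin boundary
    lo, hi = bins[:, [0]], bins[:, [-1]]
    by_margin = np.where(bins - lo <= hi - bins, lo, hi)
    print(by_margin)  # [[ 4  4 15] [21 21 24] [25 25 34]]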

Exercises
◎ Prices:
15, 17, 19, 25, 29, 31, 33, 41, 42, 45, 45, 47, 52, 52, 64
◎ Apply the binning method with equal-width and equal-depth binning, using four bins:
○ Compute the bin values using median smoothing.
○ Compute the bin values using margin smoothing.
○ Compute the bin values using mean smoothing.
○ Comment on the results.

Noise reduction?
◎ The basic methods of noise reduction:
○ Binning method:
◉ Sort and divide data into equal-width or equal-depth bins
◉ Reduce noise using the bin mean, median, margin (boundaries), …
○ Clustering method:
◉ Detect and remove outliers
○ Regression method:
◉ Fit the data to a regression function

Noise reduction – clustering

Noise reduction – regression

(Figure: data fitted to the regression line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1' on the line.)
Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process of:
○ Filling in missing values
○ Identifying and eliminating noisy data
○ Resolving conflicting data

Resolve conflicts
◎ How should conflicting data be handled?
◎ Give an example of each conflict resolution method.

Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation

Data Integration
◎ Select and aggregate data from many different sources
into one database
◎ What problems arise when selecting and aggregating data?

Data integration process (1/4)
◎ Process:
○ Select only the data required for the data mining process
○ Match the data schemas
○ Eliminate redundant and duplicated data
○ Detect and resolve data inconsistencies

Data integration process (2/4)
◎ Schema matching
○ The entity identification problem
◉ How can equivalent entities from multiple data sources be matched?
◉ US = USA; customer_id = cust_number
○ Use metadata

Data integration process (3/4)
◎ Eliminate redundant and duplicated data
○ An attribute is redundant if it can be inferred from other attributes
○ The same attribute can have different names in different databases
○ Some records in the data are repeated
○ Use correlation analysis, e.g. the correlation coefficient r (see the sketch below):
◉ r = 0: X and Y are not correlated
◉ r > 0: positive correlation (X↑ ⇒ Y↑)
◉ r < 0: negative correlation (X↑ ⇒ Y↓)
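
A minimal sketch of correlation analysis using numpy's Pearson correlation coefficient; the data and the 0.95 threshold are illustrative choices, not from the slides:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # y is almost a rescaled copy of x

    r = np.corrcoef(x, y)[0, 1]  # Pearson's r
    if abs(r) > 0.95:            # illustrative redundancy threshold
        print(f"r = {r:.3f}: one of the two attributes is likely redundant")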

Data integration process (4/4)
◎ Resolve inconsistencies in data
○ For example, weight measured in kilograms vs. pounds
○ Define standards and mappings based on metadata

Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation

Data reduction
◎ The data may be too large for some data mining applications, making analysis time-consuming.
◎ Data reduction is the process of reducing the data size while obtaining the same (or almost the same) analytical results.

Methods of data reduction
◎ Methods:
○ Aggregation
○ Dimensionality reduction
○ Data compression
○ Numerosity reduction
○ Discretization and Concept hierarchies

Data reduction – Aggregation (1/3)
◎ Aggregation
○ Combine two or more attributes (or objects) into one attribute (or object)
◉ Example: cities aggregated into regions, regions into countries, …
○ Aggregate low-level data into high-level data:
◉ Decreases the data set size: fewer attributes
◉ Increases the interestingness of the resulting patterns

Data reduction – Aggregation (2/3)

Data reduction – Aggregation (3/3)

Data reduction – Dimensionality reduction (1/6)
◎ Dimensionality reduction
○ Feature selection (subset of attributes)
◉ Choose m from n attributes
◉ Remove irrelevant, redundant attributes
○ How to identify irrelevant attributes?
◉ Statistics
◉ Information gain

Data reduction – Dimensionality reduction (2/6)
◎ How to reduce the data dimensionality?
○ Brute force
◉ There are 2^d attribute subsets of d attributes
◉ The computational complexity is too high
○ Heuristic methods
◉ Stepwise forward selection
◉ Stepwise backward elimination
◉ A combination of the two
◉ Inductive decision tree

Data reduction – Dimensionality reduction (3/6)
◎ Heuristic - Stepwise forward
○ Step 1: choose the best single attribute
○ Step 2: Choose the best attribute from the rest,...
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
○ Result ={}
◉ S1: Result = {A1}
◉ S2: Result = {A1,A4}
◉ S3: Result = {A1,A4,A6}
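
One way to run stepwise forward selection in practice is scikit-learn's SequentialFeatureSelector; this is a hedged sketch on synthetic data standing in for A1..A6, not the method used to produce the example above:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic data with 6 attributes, standing in for {A1, ..., A6}
    X, y = make_classification(n_samples=200, n_features=6,
                               n_informative=3, random_state=0)

    # Greedily add the attribute that most improves cross-validated accuracy
    selector = SequentialFeatureSelector(KNeighborsClassifier(),
                                         n_features_to_select=3,
                                         direction="forward")
    selector.fit(X, y)
    print(selector.get_support())  # boolean mask of the 3 selected attributes

Passing direction="backward" instead gives stepwise backward elimination (next slide).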

Data reduction – Dimensionality reduction (4/6)
◎ Heuristic - Stepwise backward
○ Step 1: remove the worst single attribute
○ Step 2: continue removing the worst of the remaining attributes, …
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
○ Result ={A1,A2,A3,A4,A5,A6}
◉ S1: Result = {A1,A3,A4,A5,A6}
◉ S2: Result = {A1,A4,A5,A6}
◉ S3: Result = {A1,A4, A6}

Data reduction – Dimensionality reduction (5/6)
◎ Heuristic – Combine Forward and Backward
○ Step 1: select the best single attribute and remove the worst single attribute
○ Step 2: continue selecting the best and removing the worst among the remaining attributes, …
◎ Example with initial attribute set: {A1,A2,A3,A4,A5,A6}
○ Result = {A1,A2,A3,A4,A5,A6}
◉ S1: Result = {A1,A3,A4,A5,A6}
◉ S2: Result = {A1,A4,A5,A6}
◉ S3: Result = {A1,A4, A6}

Data reduction – Dimensionality reduction (6/6)
◎ Heuristic – Inductive decision tree
○ Step 1: build a decision tree
○ Step 2: remove any attributes that do not appear in the tree
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
⇒ Result = {A1, A4, A6}

(Decision tree: the root tests A4; its subtrees test A1 and A6; the leaves are Class 1 and Class 2. Attributes A2, A3, A5 do not appear in the tree and are removed.)
Data reduction – Compression
◎ Data Compression:
○ Encode or transform the data
○ Lossless compression
◉ Data can be recovered
○ Lossy compression
◉ Data cannot be fully recovered
○ Using wavelet transforms, principal component analysis (PCA), ...

Data reduction – Numerosity reduction
◎ Numerosity reduction: choose an alternative, "smaller" representation of the data
◎ Some methods:
○ Parametric methods:
◉ Fit a mathematical model and store only its parameters
◉ Regression and log-linear models
○ Non-parametric methods:
◉ Do not use a mathematical model; store a reduced representation instead
◉ Histograms, clustering, sampling

Data reduction – Numerosity reduction
◎ Linear regression: Y = a + b X (see the sketch below)
◎ Multiple linear regression: Y = b0 + b1 X1 + b2 X2
◎ Log-linear model:
○ Probability: p(a, b, c, d) = α_ab β_ac γ_ad δ_bcd
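
A minimal sketch of parametric numerosity reduction: fit Y = a + b X and keep only the two parameters instead of the raw points (numpy and the synthetic data are assumptions here):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 1000)
    y = 3.0 + 0.5 * x + rng.normal(0, 0.1, x.size)  # 1000 noisy points

    b, a = np.polyfit(x, y, deg=1)  # returns (slope, intercept)
    print(a, b)  # store only these two numbers, roughly 3.0 and 0.5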

Data reduction – Numerosity reduction
◎ Histogram
○ A common method for data reduction
○ Divide the data into bins; the height of each column is the number of objects in the bin. Store only the average of each bin.
○ The shape of the chart depends on the number of bins

Data reduction – Numerosity reduction
◎ Clustering
○ Divide the data into clusters and store only cluster representatives.
○ Very effective if the data is naturally clustered, but ineffective when the data is scattered.
○ Many clustering algorithms exist.

Data reduction – Numerosity reduction
◎ Sampling
○ Use a much smaller random sample instead of the large data set (see the sketch below).
○ Simple random sampling without replacement (SRSWOR)
○ Simple random sampling with replacement (SRSWR)
○ Cluster / stratified sampling
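
A minimal sketch of SRSWOR and SRSWR with numpy (the data and sample size are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(100)  # stand-in for a large data set

    srswor = rng.choice(data, size=10, replace=False)  # without replacement
    srswr  = rng.choice(data, size=10, replace=True)   # with replacement; duplicates possible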

Data reduction – Numerosity reduction

(Figure: SRSWOR (simple random sampling without replacement) and SRSWR applied to the raw data.)
Data reduction – Numerosity reduction

(Figure: raw data and the corresponding cluster/stratified sample.)
Data reduction – Discretization and Concept hierarchies

◎ Discretization:
○ Convert a (continuous) attribute's value domain by dividing it into intervals.
○ Store interval labels instead of actual values (see the sketch below).
○ Suitable for continuous numeric data.
○ Methods: binning, histogram analysis, clustering, entropy-based discretization, natural partitioning.
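
A minimal sketch of equal-width and equal-depth discretization with pandas (the ages and cut points are illustrative):

    import pandas as pd

    ages = pd.Series([13, 15, 19, 22, 25, 33, 35, 45, 52, 70])

    width_bins = pd.cut(ages, bins=4)   # 4 equal-width intervals
    depth_bins = pd.qcut(ages, q=4)     # 4 equal-depth (quantile) intervals

    # Store interval labels instead of raw values
    labels = pd.cut(ages, bins=[0, 30, 55, 120],
                    labels=["young", "middle-aged", "old"])
    print(labels.tolist())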

Data reduction – Discretization and Concept hierarchies

◎ Concept hierarchies:
○ Collect and replace low-level concepts with higher-level concepts.
○ Suitable for non-numeric data: build a hierarchy.

Data reduction – Discretization and Concept hierarchies

◎ Example:
○ Convert logical (boolean) values to numbers (e.g. 1/0)
○ Convert date values to numbers
○ Convert columns with large numeric values into a smaller range of values, for example by dividing them by a constant factor.
○ Group values with the same semantics: activity before the August Revolution is group 1; from 01/08/45 to 31/06/54 is group 2; from 01/07/54 to 30/04/75 is group 3, …
○ Replace age values with young, middle-aged, old

Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation

Data transformation
◎ Data transformation: convert data into a form that is
suitable and convenient for algorithms
◎ The data transformation process:
○ Smoothing
○ Aggregation
○ Generalization
○ Normalization
○ Attribute construction

Data transformation process
◎ Smoothing: remove noise from the data.
◎ Aggregation: summarize or aggregate the data.
◎ Generalization: replace low-level concepts with high-level concepts.
◎ Normalization: scale attribute values into a small range, such as 0 to 1 (see the sketch below).
◎ Attribute construction: create new attributes and add them to the given attribute set.
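
A minimal sketch of min-max normalization into [0, 1] (numpy and the sample values are illustrative):

    import numpy as np

    x = np.array([15.0, 29.0, 45.0, 52.0, 64.0])

    x_norm = (x - x.min()) / (x.max() - x.min())  # min -> 0.0, max -> 1.0
    print(x_norm)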

Conclusion
◎ Data is often missing, noisy, inconsistent, and
multidimensional.
◎ Good data is the key to creating reliable and valid models.
◎ Data preparation includes the following processes:
○ Cleaning
○ Selection
○ Reduction
○ Transformation

Exercises
◎ Why is preparing data so important and time-consuming?
◎ How can the problem of missing values in database records be solved?
◎ Assume a database has an Age attribute with the following values (ascending): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
○ Smooth the data using bin means with n = 4 bins. Explain the effectiveness of this technique on the above data.
○ Plot an equal-width histogram with width = 10.

Exercises
◎ Why do we need to select / integrate data? Describe the data selection process.
◎ Why do we need data reduction? Can the data reduction process lose information? If so, how can this be mitigated?
◎ Study the data transformation processes. Give an example of each.

