Slide 2 - Data Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies
• Data integration (if needed)
– Integration of multiple databases, data cubes,
or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation that is reduced in volume but
produces the same or similar analytical results
• Data discretization
– Of particular importance for numerical data
– Reduces the number of values of an attribute
– Often transforms quantitative data into
qualitative data
Forms of data preprocessing
Data Cleaning
Data cleaning refers to methods for finding, removing, and replacing bad
or missing data.
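As a small illustration of replacing missing data, the sketch below fills missing entries with the median of the observed values. The function name `fill_missing`, the use of `None` to mark a missing value, and the choice of the median are assumptions made for this example; the slides do not prescribe a particular replacement strategy.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # assumption: the median is the replacement value
    return [fill if v is None else v for v in values]

raw = [5, None, 11, 13, None, 35]   # hypothetical attribute with two gaps
cleaned = fill_missing(raw)
print(cleaned)
```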
Example (equal-width binning into 3 bins): the available data are 5, 10, 11, 13, 15, 35, 50, 55,
72, 92, 204, 215. The range is 215 - 5 = 210, so the bin width is 210 / 3 = 70.
– bin1: 5, 10, 11, 13, 15, 35, 50, 55, 72 (values from 5 up to 75)
– bin2: 92 (holds only one value, the only one
between 75 and 145)
– bin3: 204, 215 (values from 145 up to 215)
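The binning example can be checked with a short sketch of equal-width binning. The function name `equal_width_bins` and the choice to put the maximum value into the last bin are assumptions of this example, not part of the slides.

```python
def equal_width_bins(values, k):
    """Split values into k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        if width == 0:
            idx = 0  # all values identical: everything goes into the first bin
        else:
            # The maximum would compute index k, so clamp it into the last bin.
            idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return width, bins

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
width, bins = equal_width_bins(data, 3)
print(width)          # 70.0
for b in bins:
    print(b)
```

With three bins the boundaries fall at 75 and 145, which is why 92 ends up alone in the second bin.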
Simple Discretization Methods:
Binning
[Figure: plot of the line y = x + 1, with X1 marked on the x axis and Y1 on the y axis]
Clustering
• Partition data set (it means the values of an attribute in
case of preprocessing) into clusters, and one can store
cluster representation only, i.e. replace all values of the
cluster by the one value representing this cluster.
• We also can use hierarchical clustering that can be
stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms (Chapter 7)
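The idea of storing only the cluster representation can be sketched for a single numeric attribute. This example uses a simple one-dimensional k-means as one possible clustering algorithm; the function names, the evenly spaced initialisation, and the use of the cluster mean as representative are all assumptions of this sketch.

```python
def kmeans_1d(values, k, iterations=20):
    """Very small 1-D k-means; returns the k cluster means."""
    data = sorted(values)
    # Assumed initialisation: k values spread evenly through the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def replace_by_representative(values, centers):
    """Replace every value by the nearest cluster mean."""
    return [min(centers, key=lambda c: abs(v - c)) for v in values]

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
centers = kmeans_1d(data, 3)
reduced = replace_by_representative(data, centers)
print(sorted(set(reduced)))  # at most 3 distinct values remain
```

After the replacement the attribute takes at most k distinct values, so only the representatives (and, if needed, cluster sizes) have to be stored.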
Sampling
• Sampling allows the learning algorithm to run in
complexity that is potentially sub-linear to the size of the
data
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
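The two simple random sampling schemes, without replacement (SRSWOR) and with replacement (SRSWR), can be sketched with Python's standard `random` module. The helper names, the seed, and the stand-in raw data are assumptions chosen for illustration.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: each item is drawn at most once."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: the same item may be drawn again."""
    return rng.choices(data, k=n)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
raw = list(range(1, 101))       # hypothetical raw data: 100 distinct records
without = srswor(raw, 10, rng)
with_repl = srswr(raw, 10, rng)
print(without)
print(with_repl)
```

Only the sample is passed on to the learning algorithm, so its cost depends on the sample size n rather than on the full data size.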
Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
– Divide the range of a continuous attribute into
intervals
– Some classification algorithms only accept
categorical (non-numerical) attributes
– Reduce data size (the number of attribute values) by discretization
– Prepare for further analysis
Discretization and Concept Hierarchies for
numerical data
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute (values
of the attribute) into intervals.
– Interval labels are then used to replace actual data
values.
• Concept hierarchies
– reduce the data by collecting and replacing low-level
concepts (such as numeric values for the attribute
age) by higher-level concepts (such as young,
middle-aged, or senior), transforming numerical
attributes into categorical ones
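Replacing numeric values with interval labels can be sketched for the age example. The cut points 30 and 60 are hypothetical; the slides name the concepts young, middle-aged, and senior but do not give their boundaries.

```python
from bisect import bisect_right

# Hypothetical cut points: [0, 30) young, [30, 60) middle-aged, [60, ...) senior.
CUTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]

def age_to_concept(age):
    """Map a numeric age to its higher-level concept label."""
    return LABELS[bisect_right(CUTS, age)]

ages = [23, 35, 47, 61, 29, 70]
labels = [age_to_concept(a) for a in ages]
print(labels)
```

The six distinct numeric values collapse to three categorical labels, which is the data reduction the concept hierarchy is meant to provide.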
Segmentation by natural partitioning
The 3-4-5 rule can be used to segment numeric data (attribute
values) into relatively uniform, natural intervals.
• If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3
intervals (equi-width for 3, 6 and 9; grouped 2-3-2 for 7)
• If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 equi-width intervals
• If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 equi-width intervals
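One level of the 3-4-5 rule can be sketched as follows. The sketch assumes that the range endpoints are integers already rounded to the most significant digit (for example -1000 and 2000), and it handles the case of 7 distinct values with the usual 2-3-2 grouping rather than equal widths. The function names are hypothetical.

```python
def msd_unit(low, high):
    """Place value of the most significant digit (assumes integer bounds >= 1 in magnitude)."""
    largest = max(abs(low), abs(high))
    unit = 1
    while unit * 10 <= largest:
        unit *= 10
    return unit

def partition_345(low, high):
    """Apply one level of the 3-4-5 rule to the range [low, high]."""
    unit = msd_unit(low, high)
    distinct = round((high - low) / unit)  # distinct values at the msd
    if distinct == 7:
        weights = [2, 3, 2]                # 3 intervals, grouped 2-3-2
    elif distinct in (3, 6, 9):
        weights = [1, 1, 1]                # 3 equal-width intervals
    elif distinct in (2, 4, 8):
        weights = [1] * 4                  # 4 equal-width intervals
    elif distinct in (1, 5, 10):
        weights = [1] * 5                  # 5 equal-width intervals
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    step = (high - low) / sum(weights)
    edges = [low]
    for w in weights:
        edges.append(edges[-1] + w * step)
    return list(zip(edges[:-1], edges[1:]))

print(partition_345(-1000, 2000))  # 3 distinct values at the msd
print(partition_345(0, 700))       # 7 distinct values, grouped 2-3-2
```

In practice the rule is applied recursively to each resulting interval to build a concept hierarchy; this sketch shows only the top level.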
Concept Hierarchy generation for
Categorical data
• A concept hierarchy can be generated by:
– Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
– Specification of a portion of a hierarchy by
explicit data grouping
– Specification of a set of attributes, but not of
their partial ordering
– Specification of only a partial set of attributes
Specification of a set of attributes