Lecture 10 - Data Transformation-M
Data Mining
Lecture # 9
Data Preprocessing
Transformation and
Discretization
(Ch # 3)
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data Transformation and
Discretization
Summary
Data Transformation
The data are transformed or consolidated into
forms appropriate for mining.
Strategies include:
Smoothing – binning, regression, and clustering
Attribute construction – new attributes are constructed
from the given set of attributes.
Aggregation – summary or aggregation operations are
applied, e.g. construction of a data cube
Normalization – data are scaled so as to fall within a
smaller range, e.g. -1.0 to 1.0 or 0.0 to 1.0
Discretization – values of a numeric attribute are
replaced by interval labels or conceptual labels
(concept hierarchy for a numeric attribute)
Concept hierarchy generation for nominal data –
nominal attribute values are generalized to higher-level
concepts e.g. street is generalized to block, city or
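As an illustration of the discretization strategy listed above, here is a minimal sketch in Python that replaces numeric values with conceptual labels. The function name `discretize`, the ages, the cut points, and the labels are all hypothetical and chosen only for this example; they do not come from the lecture.

```python
def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value.

    cut_points holds the exclusive upper bound of every interval
    except the last one, so len(labels) == len(cut_points) + 1.
    """
    for upper, label in zip(cut_points, labels):
        if value < upper:
            return label
    return labels[-1]


# Hypothetical ages and interval boundaries (illustration only).
ages = [13, 25, 47, 70]
cut_points = [20, 60]
labels = ["youth", "adult", "senior"]

age_labels = [discretize(a, cut_points, labels) for a in ages]
print(age_labels)  # ['youth', 'adult', 'adult', 'senior']
```

The labels form one level of a concept hierarchy for the numeric attribute; coarser or finer cut points would give other levels.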
Data Normalization
A database can contain any number of continuous
attributes.
An attribute with a larger range, or noise, can
distort the distance between objects.
(Remember: the similarity of all continuous attributes is
measured with a single Euclidean distance formula.)
For example, the ‘Income’ attribute can dominate the
distance compared with the ‘Weight’ and ‘Age’
attributes.
The objective of normalization is to rescale all numeric
attributes so that their values fall within a
small specified range, such as 0.0 to 1.0.
Normalization is particularly useful for clustering
and for distance-based algorithms such as
k-nearest-neighbor.
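The dominance effect described above can be seen in a small Python sketch. The two objects and their (income, weight, age) values are hypothetical, invented only to show how one wide-range attribute swamps the others in an unnormalized Euclidean distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical objects: (income, weight in kg, age in years).
p = (50000, 60, 25)
q = (51000, 90, 60)

d_all = euclidean(p, q)            # distance over all three attributes
d_income_only = abs(p[0] - q[0])   # distance using income alone

print(d_all, d_income_only)
```

Although the two objects differ by 30 kg and 35 years, the full distance is only about 1 unit larger than the income difference of 1000, so weight and age have almost no influence until the attributes are rescaled.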
Data Normalization
min-max normalization
$$v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$
x 2.5 0.5 2.2 1.9 3.1 2.3 2.0 1.0 1.5 1.1
y 2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9
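The min-max formula can be sketched in Python and applied to the x and y rows listed above. Treating those rows as input for min-max normalization is an assumption, since the slide does not say what they are for; the function name `min_max_normalize` and its default target range are also choices made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The formula divides by (max - min), so a constant
        # attribute cannot be rescaled this way.
        raise ValueError("min-max normalization is undefined for constant data")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

x_scaled = min_max_normalize(x)             # target range 0.0 to 1.0
y_scaled = min_max_normalize(y, -1.0, 1.0)  # target range -1.0 to 1.0

print([round(v, 3) for v in x_scaled])
print([round(v, 3) for v in y_scaled])
```

The smallest value of each attribute maps to the lower bound of the target range and the largest to the upper bound; every other value keeps its relative position in between.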