Mod1 DM Part2
incomplete: lacking attribute values
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
Regression: smooth by fitting the data into regression functions
[Figure: regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1’]
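To illustrate smoothing by regression, here is a minimal sketch that fits a least-squares line and replaces each observed y with its fitted value. The function names and the sample points are hypothetical; the points were chosen to lie roughly along y = x + 1.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4]
ys = [2.1, 2.9, 4.2, 4.8]
smoothed = smooth(xs, ys)
```

Each smoothed value lies exactly on the fitted line, so the noise in the original y values is removed at the cost of assuming a linear relationship.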
Data cleaning as a process
Discrepancy detection
Field overloading
Unique rules
Consecutive rules
Null rules
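The three rule types listed above can be sketched as simple checks. This is an illustrative sketch only: the function names, the sample values, and the set of null markers are assumptions, not part of the original slides.

```python
# Assumed markers that a null rule might designate as "missing".
NULL_MARKERS = {"", "?", "N/A"}


def violates_unique_rule(values):
    """Unique rule: every value of the attribute must differ from all others."""
    return len(set(values)) != len(values)


def missing_in_consecutive_range(values):
    """Consecutive rule: no gaps between the lowest and highest value.

    Returns the integers missing from the range. The rule also requires
    uniqueness, which is checked separately by violates_unique_rule.
    """
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_positions(values, markers=NULL_MARKERS):
    """Null rule: report positions holding None or a designated null marker."""
    return [
        i for i, v in enumerate(values)
        if v is None or str(v).strip() in markers
    ]


# Hypothetical check numbers and occupation values.
check_numbers = [101, 102, 104, 104]
occupations = ["clerk", " ", None, "?"]

has_duplicates = violates_unique_rule(check_numbers)
gaps = missing_in_consecutive_range(check_numbers)
nulls = null_positions(occupations)
```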
Min-max normalization
Min-max normalization performs a linear transformation on the original
data.
Suppose that min_A and max_A are the minimum and maximum values of
attribute A. Min-max normalization maps a value v of A to v’ in the
range [new_min_A, new_max_A] by computing:
v’ = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
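A minimal sketch of min-max normalization in Python follows. The function name, the default target range [0, 1], and the sample income values are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), so a constant attribute is undefined.
        raise ValueError("min-max normalization needs at least two distinct values")
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical income values.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
```

The smallest value maps to new_min and the largest to new_max; every other value keeps its relative position between them.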
Data Transformation: Normalization
Z-score Normalization:
In z-score normalization, the values of attribute A are normalized
based on the mean and standard deviation of A. A value v of A is
normalized to v’ by computing:
v’ = (v - mean_A) / σ_A
where mean_A and σ_A are the mean and standard deviation of A.
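A minimal sketch of z-score normalization, which subtracts the mean of A from each value and divides by the standard deviation of A. The function name and sample values are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("z-score normalization is undefined for a constant attribute")
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = z_score_normalize(scores)
```

After normalization the values have mean 0 and standard deviation 1, which is useful when the actual minimum and maximum of A are unknown or when outliers would dominate min-max normalization.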
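Data reduction obtains a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same)
analytical results.
A database or data warehouse may store terabytes of data. Complex data
analysis may take a very long time.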
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
popular form of
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
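One way to reduce a sorted list like the one above is an equal-width histogram, which stores one count per bucket instead of every value. The sketch below is illustrative: the function name and the bucket width of 10 are assumptions, not taken from the slides.

```python
from collections import Counter

# The sorted values listed above.
values = [
    1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
    15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
    21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]


def equal_width_histogram(data, width=10, start=1):
    """Count values per bucket, where each bucket spans `width` integers.

    Buckets are keyed by their inclusive (low, high) bounds, beginning at `start`.
    """
    counts = Counter((v - start) // width for v in data)
    return {
        (start + i * width, start + (i + 1) * width - 1): counts[i]
        for i in sorted(counts)
    }


histogram = equal_width_histogram(values)
```

With a width of 10, the 52 values collapse into three buckets (1 to 10, 11 to 20, 21 to 30), so only three counts need to be stored.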
Supervised discretization
Unsupervised discretization
Top-down discretization or splitting
Bottom-up discretization or merging
April 13, 2021
Data Discretization and Concept Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
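As a small illustration of how a concept hierarchy discretizes a numerical attribute, the sketch below maps ages onto higher-level labels. The cut points, labels, and function name are hypothetical and chosen only for the example.

```python
from bisect import bisect_right

# Hypothetical interval boundaries and the concept label for each interval:
# below 30 -> "youth", 30 to 59 -> "middle_aged", 60 and above -> "senior".
AGE_CUTS = [30, 60]
AGE_LABELS = ["youth", "middle_aged", "senior"]


def discretize(value, cuts=AGE_CUTS, labels=AGE_LABELS):
    """Replace a numeric value with the label of the interval it falls into."""
    return labels[bisect_right(cuts, value)]


ages = [25, 30, 47, 75]
concepts = [discretize(a) for a in ages]
```

Replacing raw values with interval labels loses detail but leaves far fewer distinct values to mine.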