Data Preprocessing: Why Preprocess The Data?
Data cleaning
Data reduction
Summary
How to Handle Missing Data?
1. Ignore the tuple: usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: replace all missing attribute values with the same constant, such as a label like “Unknown” or −∞.
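As an illustration of strategies 1 and 3, a minimal sketch in pandas; the data frame and its column names are hypothetical:

    import numpy as np
    import pandas as pd

    # Hypothetical data set with missing values in two attributes
    df = pd.DataFrame({
        "income":   [54000, np.nan, 73600, 61000],
        "category": ["A", "B", None, "A"],
    })

    # Strategy 1: ignore (drop) every tuple with a missing value
    dropped = df.dropna()

    # Strategy 3: fill missing values with a global constant
    filled = df.fillna({"income": 0, "category": "Unknown"})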
[Figure: linear regression. Data points are fitted to the straight line y = x + 1; an input x1 maps to the predicted value y1′ on the line.]
Correlation coefficient (Pearson's product-moment coefficient) for numeric attributes A and B, where n is the number of tuples, \bar{A} and \bar{B} are the respective means, and \sigma_A and \sigma_B are the respective standard deviations:

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}
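A direct translation of the formula into numpy; the two sample arrays are made up for illustration:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
    n = len(a)

    # r = sum((a_i - mean(A)) * (b_i - mean(B))) / ((n-1) * sigma_A * sigma_B)
    # ddof=1 gives the sample standard deviation, matching the (n-1) term
    r = np.sum((a - a.mean()) * (b - b.mean())) / (
        (n - 1) * a.std(ddof=1) * b.std(ddof=1))
    print(r)  # close to +1: A and B are positively correlated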
Χ² (chi-square) test:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
The larger the Χ² value, the more likely the variables are related.
The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count.
Correlation Analysis (Nominal Data)
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a chi-square test. Suppose A has c distinct values, a1, a2, …, ac, and B has r distinct values, b1, b2, …, br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table.
Let N be the total number of data tuples. The expected frequency of the joint event (Ai, Bj) is e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N}.
The chi-square statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (r − 1)(c − 1) degrees of freedom.
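A sketch of the test on a small, made-up 2×2 contingency table, using scipy; correction=False disables Yates' continuity correction so the statistic matches the formula above:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical contingency table: rows are the r values of B,
    # columns are the c values of A, cells are observed joint counts
    observed = np.array([[250,  200],
                         [ 50, 1000]])

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(chi2, dof)   # dof = (r - 1)(c - 1) = 1
    # A small p-value rejects the hypothesis that A and B are
    # independent, i.e., the attributes are likely related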
Normalization by z-score (μ: mean, σ: standard deviation of attribute A):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let μ = 54,000 and σ = 16,000. Then v = 73,600 is normalized to (73,600 − 54,000) / 16,000 = 1.225.
Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1.
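Both normalizations in numpy; the z-score numbers repeat the example above, and the decimal-scaling values are made up:

    import numpy as np

    # Z-score normalization: v' = (v - mu) / sigma
    v, mu, sigma = 73_600, 54_000, 16_000
    print((v - mu) / sigma)        # 1.225

    # Decimal scaling: v' = v / 10**j, with j the smallest integer
    # such that max(|v'|) < 1
    values = np.array([-986.0, 917.0])
    j = int(np.floor(np.log10(np.abs(values).max()))) + 1
    print(values / 10**j)          # [-0.986  0.917]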
Data Reduction
Data Reduction 4: Numerosity Reduction
Parametric Data Reduction: Regression and Log-Linear Models
Linear regression: Y = wX + b. A random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X.
Multiple regression: Y = b0 + b1 X1 + b2 X2. Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, Y, to be modeled as a linear function of two or more predictor variables.
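A minimal least-squares fit of both models in numpy; the data points are hypothetical:

    import numpy as np

    # Simple linear regression: Y = w*X + b
    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([2.1, 2.9, 4.2, 5.0])
    w, b = np.polyfit(X, Y, deg=1)            # least-squares w and b

    # Multiple regression: Y = b0 + b1*X1 + b2*X2
    X1 = np.array([1.0, 2.0, 3.0, 4.0])
    X2 = np.array([0.5, 1.5, 1.0, 2.0])
    design = np.column_stack([np.ones_like(X1), X1, X2])
    (b0, b1, b2), *_ = np.linalg.lstsq(design, Y, rcond=None)

Storing only the fitted coefficients rather than the raw tuples is what makes regression a parametric reduction method.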
Log-linear models
Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are thus useful for dimensionality reduction and data smoothing.
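A sketch of the idea for three discretized (binary) attributes: iterative proportional fitting, one standard way to fit a log-linear model, reconstructs the full 3-D cell probabilities from the pairwise (2-D) marginals alone; the counts are made up:

    import numpy as np

    # Hypothetical observed counts for binary attributes A, B, C
    observed = np.array([[[20., 10.], [ 8., 12.]],
                         [[ 5., 15.], [ 9., 21.]]])

    # Keep only the three pairwise (2-D) marginals
    ab = observed.sum(axis=2)
    ac = observed.sum(axis=1)
    bc = observed.sum(axis=0)

    # Iterative proportional fitting: rescale a uniform table
    # until it matches each pairwise marginal in turn
    fitted = np.ones_like(observed)
    for _ in range(100):
        fitted *= (ab / fitted.sum(axis=2))[:, :, None]
        fitted *= (ac / fitted.sum(axis=1))[:, None, :]
        fitted *= (bc / fitted.sum(axis=0))[None, :, :]

    probs = fitted / fitted.sum()   # estimated cell probabilities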
Sampling