Week 4 - 5 - Data Preprocessing
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete:
• lacking attribute values
• lacking certain attributes of interest
• containing only aggregate data
– noisy:
• containing errors or outliers
– inconsistent:
• containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
• 1. Data cleaning
• 2. Data integration
• 3. Data transformation
• 4. Data reduction
How to Handle Missing Data?
• Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-
based, such as a Bayesian formula or decision tree (see the sketch below)
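To make the class-conditional mean strategy concrete, here is a minimal pandas sketch; the table, column names, and values are my own illustration, not from the lecture:

```python
import pandas as pd

# Hypothetical table: "cls" is the class label, "income" has gaps.
df = pd.DataFrame({
    "cls":    ["a", "a", "a", "b", "b", "b"],
    "income": [30_000, None, 50_000, 80_000, 90_000, None],
})

# Fill each missing income with the mean income of its own class.
df["income"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)  # class a's gap -> 40,000; class b's gap -> 85,000
```

The inference-based variant would instead train a model (e.g., a decision tree) on the complete rows and predict each missing entry from the other attributes.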
Noisy Data
[Figure: smoothing by regression. A line y = x + 1 is fit to the data; the value Y1 at X1 is smoothed to the fitted value Y1', and points far from the line are outliers.]
• Normalization by z-score:
$$v' = \frac{v - \mu_A}{\sigma_A}$$
– Ex. Let μ = 54,000 and σ = 16,000. Then $v = 73{,}600$ is normalized to
$v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
• Normalization by decimal scaling:
$$v' = \frac{v}{10^j}$$
where j is the smallest integer such that $\max(|v'|) < 1$
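Both normalizations are easy to check in code. A minimal numpy sketch (the function names are mine, and decimal scaling assumes a nonzero maximum):

```python
import numpy as np

def z_score(v, mu, sigma):
    """Z-score normalization: v' = (v - mu) / sigma."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Decimal scaling: v' = v / 10**j, where j is the smallest
    integer such that max(|v'|) < 1."""
    values = np.asarray(values, dtype=float)
    m = np.abs(values).max()
    j = int(np.floor(np.log10(m))) + 1  # assumes m > 0
    return values / 10 ** j

print(z_score(73_600, 54_000, 16_000))   # 1.225, as in the example
print(decimal_scaling([986, -345, 12]))  # [0.986, -0.345, 0.012]
```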
Discretization and Concept Hierarchy
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
• Concept hierarchy
– reduce the data by collecting and replacing low-level
concepts (such as numeric values for the attribute
age) with higher-level concepts (such as young,
middle-aged, or senior); see the sketch below.
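A small pandas sketch of such a hierarchy for age; the cut points and labels below are arbitrary choices for illustration:

```python
import pandas as pd

ages = pd.Series([13, 25, 40, 58, 67, 72])

# Replace raw ages by higher-level concept labels.
concepts = pd.cut(
    ages,
    bins=[0, 30, 60, 120],  # illustrative boundaries
    labels=["young", "middle-aged", "senior"],
)
print(concepts.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```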
Binning
• Attribute values (for one attribute e.g., age):
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equal-width binning – for a bin width of, e.g., 10:
– Bin 1: 0, 4 [−∞, 10) bin
– Bin 2: 12, 16, 16, 18 [10, 20) bin
– Bin 3: 24, 26, 28 [20, +∞) bin
• Equal-frequency binning – for a bin density of, e.g., 3 values per bin (see the sketch after this list):
– Bin 1: 0, 4, 12
– Bin 2: 16, 16, 18
– Bin 3: 24, 26, 28
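Both schemes reproduce in a few lines of pandas on the values above; a sketch using pd.cut for equal width and pd.qcut for equal frequency:

```python
import pandas as pd

values = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width: fixed boundaries at 10 and 20, width 10.
equal_width = pd.cut(
    values, bins=[float("-inf"), 10, 20, float("inf")], right=False
)

# Equal-frequency: three bins with (roughly) 3 values each.
equal_freq = pd.qcut(values, q=3)

print(pd.DataFrame({"value": values,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```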
Data Reduction
• Data reduction
– Obtains a reduced representation of the data set that is
much smaller in volume yet produces the same
(or almost the same) analytical results
• Heuristic methods (due to exponential # of choices; see the sketch after this list):
1. step-wise forward selection
2. step-wise backward elimination
3. combining forward selection and backward
elimination
4. decision-tree induction
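As one concrete instance of methods 1 and 2, scikit-learn ships a greedy step-wise selector. A sketch assuming scikit-learn ≥ 0.24 is available; the dataset and estimator are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forward selection: start from the empty attribute set and
# greedily add the attribute that most improves CV accuracy.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",  # "backward" gives step-wise elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the 4 attributes
```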
[Figure: data compression. The original data vs. a lossy approximation reconstructed from the compressed form.]
[Figure: sampling. SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data; a cluster or stratified sample drawn from clusters C1, C2, C3.]
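A minimal pandas sketch of the two simple random sampling schemes; the frame below is a stand-in for the raw data:

```python
import pandas as pd

raw = pd.DataFrame({"x": range(100)})

# SRSWOR: simple random sample without replacement.
srswor = raw.sample(n=10, replace=False, random_state=0)

# SRSWR: with replacement, so the same row can be drawn twice.
srswr = raw.sample(n=10, replace=True, random_state=0)
```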
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
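For instance, rolling call-level records up to one row per customer (the lowest cube level here) might look like this in pandas; the schema is hypothetical:

```python
import pandas as pd

calls = pd.DataFrame({
    "customer": ["ann", "ann", "bob", "bob", "bob"],
    "minutes":  [3, 12, 7, 1, 9],
})

# Aggregate from individual calls to the per-customer level.
per_customer = calls.groupby("customer")["minutes"].sum()
print(per_customer)  # ann: 15, bob: 17
```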
Correlation Analysis (Numeric Data)
• Correlation coefficient (Pearson's product-moment coefficient):
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or
expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard
deviations of A and B.
[Figure: scatter plots showing similarity (correlation) ranging from −1 to 1.]
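A quick numpy check of the formula against numpy's built-in; the arrays are illustrative:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

n = len(a)
# Sample standard deviations (ddof=1) match the (n - 1) in the formula.
r = ((a - a.mean()) @ (b - b.mean())) / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1)
)
print(r, np.corrcoef(a, b)[0, 1])  # the two values agree
```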
Covariance
• Covariance of two numeric attributes A and B:
$$\mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})] = E(A \cdot B) - \bar{A}\,\bar{B}$$
• Positive covariance: if $\mathrm{Cov}(A,B) > 0$, then A and B both tend to be larger than
their expected values.
• Negative covariance: if $\mathrm{Cov}(A,B) < 0$, then when A is larger than its expected value,
B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then $\mathrm{Cov}(A,B) = 0$, but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence.
• Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
• Question: if the stocks are affected by the same industry trends, will their
prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
– Cov(A,B) = E(A·B) − E(A)·E(B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6
= 42.4 − 38.4 = 4
– Cov(A,B) = 4 > 0, so the prices of A and B tend to rise together.
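Checking the arithmetic with numpy, in the population form (divide by n) used above:

```python
import numpy as np

a = np.array([2, 3, 5, 4, 6], dtype=float)
b = np.array([5, 8, 10, 11, 14], dtype=float)

# Cov(A,B) = E(A*B) - E(A)*E(B)
cov = np.mean(a * b) - a.mean() * b.mean()
print(cov)  # 4.0 -> positive, so the prices tend to rise together
```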