Chapter 2: Data Preprocessing
A multi-dimensional view of data quality:
◼ Completeness
◼ Consistency
◼ Timeliness
◼ Believability
◼ Value added
◼ Interpretability
◼ Accessibility
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data transformation
◼ Normalization and aggregation
◼ Data reduction
◼ Obtains a reduced representation that is much smaller in volume
but produces the same or similar analytical results
◼ Data discretization
◼ Part of data reduction but with particular importance, especially
for numerical data
◼ Motivation
◼ To better understand the data: central tendency, variation
and spread
◼ Data dispersion characteristics
◼ median, mean, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
◼ Data dispersion
◼ Boxplot or quantile analysis on sorted intervals
◼ Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample)    $\mu = \frac{\sum x}{N}$ (population)
◼ Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Median: A holistic measure
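A minimal Python sketch of these three measures (the data and weights below are made up for illustration):

    import statistics

    data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # hypothetical values
    weights = [1.0] * 11 + [0.5]                               # hypothetical weights

    mean = sum(data) / len(data)                      # algebraic: combines partial sums
    weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)
    median = statistics.median(data)                  # holistic: needs the whole data set
    print(mean, weighted_mean, median)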
◼ Histogram
◼ Boxplot
◼ Quantile plot: each value xi is paired with fi indicating
that approximately 100·fi % of the data are ≤ xi
◼ Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles
of another
◼ Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
◼ Loess (local regression) curve: add a smooth curve to a
scatter plot to provide better perception of the pattern of
dependence
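A small Python sketch of two of these displays, with synthetic data standing in for a real attribute (fi is computed as (i − 0.5)/n):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.default_rng(0).normal(54, 16, size=200)   # synthetic attribute

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.boxplot(x)                                   # five-number summary at a glance
    ax1.set_title("Boxplot")

    xs = np.sort(x)
    f = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)  # ~100*f_i % of data are <= xs[i]
    ax2.plot(f, xs, ".")
    ax2.set_title("Quantile plot")
    ax2.set_xlabel("f")
    plt.show()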
Data Cleaning
◼ Importance
◼ “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
◼ “Data cleaning is the number one problem in data
warehousing”—DCI survey
◼ Data cleaning tasks
◼ Fill in missing values
◼ Identify outliers and smooth out noisy data
◼ Correct inconsistent data
◼ Resolve redundancy caused by data integration
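As a minimal sketch of the first task (the attribute values are hypothetical, with NaN marking missing entries), missing values can be filled with the attribute mean, one of several common strategies:

    import math

    humidity = [71.0, 65.0, float("nan"), 80.0, float("nan"), 75.0]  # hypothetical

    known = [v for v in humidity if not math.isnan(v)]
    mean = sum(known) / len(known)                     # mean of the observed values
    filled = [mean if math.isnan(v) else v for v in humidity]
    print(filled)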
◼ Clustering
◼ detect and remove outliers
◼ Regression
◼ smooth by fitting the data into regression functions
[Figure: a regression line y = x + 1 fit to points in the (x, y) plane, e.g., x = Temperature, y = Humidity; a noisy point (X1, Y1) is smoothed to (X1, Y1′) on the line]
◼ Correlation coefficient (Pearson's product-moment coefficient):
$r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B
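A small Python sketch computing r directly from this formula (the attribute values for A and B are hypothetical):

    import math

    A = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical values a_i
    B = [1.0, 3.0, 5.0, 9.0, 11.0]   # hypothetical values b_i
    n = len(A)

    mean_A, mean_B = sum(A) / n, sum(B) / n
    # sample standard deviations, matching the (n - 1) divisor in the formula
    std_A = math.sqrt(sum((a - mean_A) ** 2 for a in A) / (n - 1))
    std_B = math.sqrt(sum((b - mean_B) ** 2 for b in B) / (n - 1))

    r = sum((a - mean_A) * (b - mean_B) for a, b in zip(A, B)) / ((n - 1) * std_A * std_B)
    print(r)   # r > 0: positively correlated; r = 0: uncorrelated; r < 0: negative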
◼ Normalization by z-score: $v' = \frac{v - \mu_A}{\sigma_A}$
◼ Ex. Let μ = 54,000 and σ = 16,000. Then $v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
◼ Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
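A minimal Python sketch of both normalizations, reusing the z-score example above and hypothetical values for decimal scaling:

    # z-score normalization: v' = (v - mu) / sigma
    mu, sigma = 54_000, 16_000            # from the example above
    v = 73_600
    print((v - mu) / sigma)               # 1.225

    # decimal scaling: v' = v / 10^j, smallest j with max(|v'|) < 1
    values = [917, -986, 312]             # hypothetical attribute values
    j = len(str(max(abs(v) for v in values)))   # digits of the largest magnitude
    print([v / 10 ** j for v in values])        # [0.917, -0.986, 0.312]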
Data Reduction
◼ Data reduction: obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the same) analytical results
◼ Data reduction strategies
◼ Data cube aggregation
◼ Data compression
◼ Attribute subset selection: reduces the number of attributes so the
discovered patterns are easier to understand
◼ Heuristic methods (due to exponential # of choices):
◼ Step-wise forward selection
◼ Decision-tree induction
[Figure: decision-tree induction for attribute subset selection. Example: loan approval with candidate attributes Salary, Credit Score, House, Monthly payment, Age; internal nodes test attributes (A4?, A1?, A6?) and leaves assign Class 1 or Class 2; attributes that do not appear in the tree are discarded]
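A hedged scikit-learn sketch of the same idea (the synthetic data and names A1..A6 are illustrative, not the slide's loan data): attributes that never appear in the induced tree form the discarded set.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # synthetic stand-in for the loan-approval table: 6 attributes, 2 classes
    X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                               n_redundant=0, random_state=0)

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # indices < 0 mark leaves; the rest are the attributes the tree actually tests
    used = sorted({f"A{i + 1}" for i in tree.tree_.feature if i >= 0})
    print("Reduced attribute set:", used)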
◼ String compression
◼ Typically lossless, but only limited manipulation is possible without expansion
◼ Audio/video compression
◼ Typically lossy compression, with progressive refinement
◼ Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
[Figure: lossless compression recovers the Original Data exactly, while lossy compression yields the Original Data only Approximated]
◼ Parametric methods
◼ Assume the data fits some model, estimate the model parameters,
and store only the parameters (discarding the data, except possible outliers)
◼ Non-parametric methods
◼ Do not assume models; major families are histograms, clustering, and sampling
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line
and are to be estimated by using the data at hand
◼ Estimated by applying the least squares criterion to the known
values of Y1, Y2, …, X1, X2, ….
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2.
◼ Many nonlinear functions can be transformed into the
above
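A minimal sketch of the least squares estimates for w and b (the observations below are made up):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical known X values
    Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # hypothetical known Y values

    # closed-form least squares for Y = w X + b
    w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b = Y.mean() - w * X.mean()
    print(w, b)   # storing (w, b) replaces storing every (X, Y) pair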
An R example of linear regression:
http://msenux.redwoods.edu/math/R/regression.php
Data Reduction Method (3): Clustering
◼ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered but not if data is
“smeared”
◼ There are many choices of clustering definitions and clustering
algorithms
◼ Cluster analysis will be studied in depth in Chapter 7
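A minimal sketch, assuming scikit-learn's KMeans and synthetic well-clustered data: only a centroid and an approximate diameter are stored per cluster.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # three tight synthetic clusters of 100 2-D points each
    data = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

    # keep 3 (centroid, diameter) pairs instead of 300 raw points
    for k in range(3):
        members = data[km.labels_ == k]
        diameter = 2 * np.linalg.norm(members - km.cluster_centers_[k], axis=1).max()
        print(km.cluster_centers_[k], diameter)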
Sampling: Cluster or Stratified Sampling
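A minimal sketch of stratified sampling (the records and strata below are hypothetical): drawing the same fraction from every stratum keeps skewed groups represented.

    import random

    random.seed(0)
    records = [("gold", i) for i in range(10)] + [("bronze", i) for i in range(90)]

    def stratified_sample(rows, key, frac):
        """Sample approximately frac of the rows from each stratum."""
        strata = {}
        for row in rows:
            strata.setdefault(key(row), []).append(row)
        sample = []
        for group in strata.values():
            k = max(1, round(len(group) * frac))   # at least one per stratum
            sample.extend(random.sample(group, k))
        return sample

    print(stratified_sample(records, key=lambda r: r[0], frac=0.1))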