Data Analysis 2
Data cleaning
Data reduction
Summary
June 19, 2025 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records
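The three kinds of dirty data listed above can be checked mechanically. The following is a minimal sketch, not a general cleaning routine: the function name `find_quality_issues`, the record layout, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_quality_issues(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    issues = []

    # Incomplete: the attribute is present but holds no usable value.
    if not record.get("occupation", "").strip():
        issues.append("incomplete: occupation is empty")

    # Noisy: a salary cannot be negative.
    if record["salary"] < 0:
        issues.append("noisy: salary is negative")

    # Inconsistent: the stated age must agree with the age implied by the birthday.
    birthday = record["birthday"]
    had_birthday_this_year = (today.month, today.day) >= (birthday.month, birthday.day)
    implied_age = today.year - birthday.year - (0 if had_birthday_this_year else 1)
    if implied_age != record["age"]:
        issues.append("inconsistent: age does not match birthday")

    return issues

# Values taken from the slide examples; the date format (day/month/year) is assumed.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_quality_issues(record, today=date(2025, 6, 19))
for issue in issues:
    print(issue)
```

Each check corresponds to one category on the slide; a real pipeline would need domain-specific rules for every attribute.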
Mining Data Descriptive Characteristics
Motivation
• To better understand the data: central tendency, variation, and spread
Data dispersion characteristics
• median, mean, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
• Data dispersion: boxplot or quantile analysis on sorted intervals
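Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values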
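The three measures of central tendency above can be sketched in a few lines of Python. The sample values, the weights, and the `trim_fraction` parameter (the share of values chopped from each end) are assumptions chosen for illustration, not data from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, trim_fraction):
    """Mean after chopping trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up sample with one extreme value (100).
values = [2, 4, 4, 6, 100]
print(mean(values))                            # pulled upward by the extreme value
print(weighted_mean(values, [1, 1, 1, 1, 0]))  # a weight of 0 ignores the extreme value
print(trimmed_mean(values, 0.2))               # drops the lowest and highest value
```

The plain mean is sensitive to the single extreme value, which is the motivation for the trimmed mean.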
Histogram
Boxplot
Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
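As an illustration of the quantile plot, the pairs $(f_i, x_i)$ can be computed from the sorted data. The slide does not say how $f_i$ is obtained; this sketch assumes the common convention $f_i = (i - 0.5)/n$ for the $i$-th smallest value, and the function name and input values are made up.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Made-up data; plotting f_i on the horizontal axis and x_i on the
# vertical axis gives the quantile plot.
points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f, x)
```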
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in
data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
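$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\Sigma(AB)$ is the sum of the AB cross-product.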
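Both forms of the correlation coefficient can be checked against each other numerically. This is a sketch under one assumption: because the denominator uses $n - 1$, $\sigma_A$ and $\sigma_B$ are taken to be sample standard deviations (`statistics.stdev`). The function names and data are illustrative only.

```python
from statistics import mean, stdev

def correlation_centered(a, b):
    """r = sum((A - mean_A)(B - mean_B)) / ((n - 1) * sigma_A * sigma_B)."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return numerator / ((n - 1) * stdev(a) * stdev(b))

def correlation_cross_product(a, b):
    """r = (sum(AB) - n * mean_A * mean_B) / ((n - 1) * sigma_A * sigma_B)."""
    n = len(a)
    numerator = sum(x * y for x, y in zip(a, b)) - n * mean(a) * mean(b)
    return numerator / ((n - 1) * stdev(a) * stdev(b))

# Made-up attributes: b rises with a, c falls as a rises.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
c = [5.0, 4.0, 3.0, 2.0, 1.0]
print(correlation_centered(a, b), correlation_cross_product(a, b))
print(correlation_centered(a, c), correlation_cross_product(a, c))
```

The two functions differ only in the numerator, which mirrors the two sides of the identity; the cross-product form avoids a second pass over centered values but can lose precision when the means are large.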
• Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
• Normalization by decimal scaling
  $v' = \frac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
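The worked example and the decimal-scaling rule can be reproduced in code. The example with μ and σ appears to be z-score normalization, $v' = (v - \mu)/\sigma$; that reading, the function names, and the sample values passed to `decimal_scaling` are assumptions. The sketch also assumes $j \ge 0$, so values already below 1 in magnitude are left unchanged.

```python
def z_score(v, mu, sigma):
    """Normalize v using the mean mu and the standard deviation sigma."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

# Slide example: mu = 54,000, sigma = 16,000, v = 73,600.
print(z_score(73_600, 54_000, 16_000))

# Made-up values for decimal scaling.
j, scaled = decimal_scaling([-986, 917])
print(j, scaled)
```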
Chapter 2: Data Preprocessing
Data cleaning
Data reduction
Summary