Data Mining: Concepts and Techniques
— Chapter 2 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
April 25, 2024 Data Mining: Concepts and Techniques 1
Chapter 2: Data Preprocessing
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
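A boxplot is built from the five-number summary (min, Q1, median, Q3, max). The sketch below computes that summary with Python's standard `statistics` module. The function names, the "inclusive" quantile method, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not something these slides specify.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population (an assumed choice).
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

For example, `iqr_outliers([1, 2, 3, 4, 5, 100])` flags only the value 100.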
Measuring the Central Tendency
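Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values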
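The three measures of central tendency listed here can be sketched in a few lines of Python. The function names and the `fraction` parameter of the trimmed mean (the share of values dropped from each end) are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the sorted values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * fraction)  # number of values chopped per end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)
```

With `[1, 2, 3, 4, 100]`, the plain mean is pulled up to 22 by the extreme value, while `trimmed_mean(..., 0.2)` drops 1 and 100 and returns 3.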
Variance (population):
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
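Assuming the formula at this point is the population variance, $\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2 = \frac{1}{N}\sum x_i^2 - \mu^2$, the sketch below computes it both ways to show that the two forms agree. The function names are illustrative.

```python
def population_variance(values):
    """Definition form: the mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Shortcut form: the mean of the squares minus the squared mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2
```

The shortcut form needs only running sums of $x_i$ and $x_i^2$, so it can be computed in a single pass, though it can lose precision when the mean is large relative to the spread.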
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
technology limitation
incomplete data
inconsistent data
Clustering
detect and remove outliers
[Figure: regression line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′]
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B}$
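The correlation coefficient $r_{A,B}$ can be computed directly from its definition. This minimal sketch assumes $\sigma_A$ and $\sigma_B$ are sample standard deviations (with an $n - 1$ denominator), which keeps the result in $[-1, 1]$. The function name is illustrative.

```python
import math


def pearson_r(a, b):
    """Correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (assumed: n - 1 in the denominator).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Raises ZeroDivisionError if either attribute is constant.
    return cross / ((n - 1) * sd_a * sd_b)
```

A value near +1 or −1 suggests that one of the two attributes may be redundant after integration; a value near 0 indicates no linear relationship.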
χ² (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the χ² value, the more likely the variables are
related
The cells that contribute the most to the χ² value are
those whose actual count is very different from the
expected count
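The χ² statistic for a contingency table can be sketched as follows. The expected count of each cell is taken as (row total × column total) / grand total, which is the usual independence assumption. The function name and the counts in the usage note are hypothetical and do not come from this section.

```python
def chi_square(observed):
    """Chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2
```

For the hypothetical table `[[250, 200], [50, 1000]]` the expected counts are 90, 360, 210 and 840, and the statistic is about 507.94. The first cell (250 observed against 90 expected) contributes the most.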
Correlation does not imply causality
The number of hospitals and the number of car thefts in a city are correlated
Both are causally linked to a third variable: population
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73,600 - 54,000}{16,000} = 1.225$
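The example here is a z-score normalization, $v' = (v - \mu) / \sigma$. A one-function sketch, with illustrative names, reproduces the arithmetic:

```python
def z_score(value, mean, std_dev):
    """Normalize a value by its distance from the mean in standard deviations."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


# The example's numbers: (73,600 - 54,000) / 16,000
normalized = z_score(73_600, 54_000, 16_000)
```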
Normalization by decimal scaling
$v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
Suppose that the recorded values of A range from −986 to
917. The maximum absolute value of A is 986. To normalize
by decimal scaling, we therefore divide each value by 1,000
(i.e., j = 3) so that −986 normalizes to −0.986 and 917
normalizes to 0.917.
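The decimal-scaling example can be checked with a short sketch. The function name is illustrative, and j is restricted to non-negative integers here, which is an added assumption.

```python
def decimal_scaling(values):
    """Return (j, scaled values), with j the smallest non-negative integer
    such that every scaled value has absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Values from the example: A ranges from -986 to 917.
j, scaled = decimal_scaling([-986, 917])
```

Here `j` comes out as 3, so every value is divided by 1,000.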
smaller in volume but yet produce the same (or almost the
same) analytical results
Data reduction strategies
Data cube aggregation:
Data Compression
understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Decision-tree induction
[Figure: induced decision tree; the root tests A4, and its Y and N branches test A1 and A6]
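Step-wise forward selection, named in the list of heuristics above, starts from an empty attribute set and repeatedly adds the attribute that most improves some quality score. The sketch below is one greedy variant. The `score` callback, the stopping rule (stop when no attribute improves the score), and the toy scores in the usage example are all assumptions made for illustration.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate each remaining attribute added to the current subset.
        candidate, candidate_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no single attribute improves the score any further
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Toy score (hypothetical): three attributes help, every other one hurts.
gains = {"A1": 3, "A4": 5, "A6": 2}


def toy_score(subset):
    return sum(gains.get(a, -1) for a in subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

In practice the score would come from a statistical significance test or from the accuracy of a model trained on the candidate subset. The greedy search evaluates a polynomial number of subsets, not all 2^d of them.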
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
[Figure: principal components Y1 and Y2 drawn over the original axes X1 and X2]
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line
above
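One common way to estimate the two coefficients w and b from data is ordinary least squares. The slides do not say which estimator is meant, so that choice is an assumption, as is the function name.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the line Y = w X + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx  # slope; raises ZeroDivisionError if all xs are equal
    b = mean_y - w * mean_x  # intercept
    return w, b
```

For data reduction, only `w` and `b` need to be stored in place of the original points, and approximate values of Y can then be regenerated from X.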
Log-linear models:
The multi-way table of joint probabilities is
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
Cluster analysis will be studied in depth in Chapter 7
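The idea of storing only a cluster representation can be sketched as below: each cluster is replaced by its centroid and diameter. The diameter is taken here as the largest pairwise Euclidean distance, which is one of several definitions in use. The function name is illustrative, and the cluster assignments are assumed to come from some clustering algorithm that is not shown.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of equal-length tuples."""
    dims = len(points[0])
    centroid = tuple(
        sum(p[d] for p in points) / len(points) for d in range(dims)
    )
    # Diameter: largest distance between any two members (0.0 for one point).
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)), default=0.0
    )
    return centroid, diameter
```

For the four points `(0, 0)`, `(3, 4)`, `(0, 4)` and `(3, 0)`, the summary is the centroid `(1.5, 2.0)` with diameter `5.0`: two stored items in place of four points.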
[Figure: Step 3: (-$1,000 to $2,000); Step 4: (-$400 to $5,000)]