Data Preprocessing for Clustering
Preprocessing
What is Data?
Attributes
Collection of data objects and their attributes
➢ Disadvantages:
➢ Outliers may dominate presentation.
➢ Skewed data is not handled well.
Binning (Equal-frequency)
➢ Equal-depth (frequency) partitioning:
➢ Disadvantage:
➢ Many occurrences of the same continuous value could
cause the values to be assigned into different bins
➢ Managing categorical attributes can be tricky.
Binning Example
Attribute values (for one attribute e.g., age):
• 0, 4, 12, 16, 16, 18, 24, 26, 28
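The two partitioning schemes can be compared on these values with a minimal sketch. The number of bins (3) and the function names `equal_width_bins` and `equal_frequency_bins` are assumptions made for illustration; they are not given on the slide. The equal-width version also assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum sits on the last boundary, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, 3))      # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(ages, 3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

With equal-width bins the middle interval collects four of the nine values, while equal-frequency bins each hold exactly three.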
➢ Example: Let the min and max values for the attribute income be
$12,000 and $98,000, respectively.
➢ Map income to the range [0.0;1.0].
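This kind of mapping is commonly called min-max normalization. The sketch below shows one way to compute it. The function name `min_max_normalize` and its parameter names are hypothetical, and the sample value of $73,600 is borrowed from the z-score example elsewhere in these slides rather than stated for this mapping.

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Income bounds taken from the example: min = 12,000 and max = 98,000.
print(min_max_normalize(12_000, 12_000, 98_000))            # 0.0
print(min_max_normalize(98_000, 12_000, 98_000))            # 1.0
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```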
Data Normalization
➢ z-score normalization (or zero-mean normalization)
➢ The values of an attribute A are normalized based on the mean and
standard deviation of A.
$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$
➢ Example: Let mean = 54,000 and standard deviation = 16,000 for the
attribute income.
➢ With z-score normalization, a value of $73,600 for income is
transformed to (73,600 − 54,000) / 16,000 = 1.225.
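The income example can be checked with a short sketch. The helper names `z_score` and `z_score_column` are hypothetical. Using the population standard deviation (`statistics.pstdev`) when normalizing a whole column is an assumption, since the slides do not say which variant is intended.

```python
import statistics


def z_score(v, mean_a, stand_dev_a):
    """Normalize one value using the mean and standard deviation of attribute A."""
    return (v - mean_a) / stand_dev_a


def z_score_column(values):
    """Normalize every value of an attribute, estimating mean and deviation from the data."""
    mean_a = statistics.mean(values)
    stand_dev_a = statistics.pstdev(values)  # assumption: population standard deviation
    return [z_score(v, mean_a, stand_dev_a) for v in values]


# Income example: mean = 54,000 and standard deviation = 16,000.
print(z_score(73_600, 54_000, 16_000))  # 1.225
```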
Continuous and Categorical Attributes
How to apply association analysis formulation to non-asymmetric
binary variables?
Session Id  Country    Session Length (sec)  Number of Web Pages viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …
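One common way to approach this question is to turn every attribute-value pair into its own binary item, discretizing continuous attributes into intervals first. The sketch below illustrates that idea for session 1 and is not necessarily the method the slides go on to present. The helper names `interval_label` and `to_binary_items` are hypothetical, and the cut points are invented purely for illustration.

```python
from bisect import bisect_right


def interval_label(value, cuts):
    """Return the half-open interval [lo,hi) defined by `cuts` that contains value."""
    i = bisect_right(cuts, value)
    lo = "-inf" if i == 0 else cuts[i - 1]
    hi = "inf" if i == len(cuts) else cuts[i]
    return f"[{lo},{hi})"


def to_binary_items(record, categorical, cut_points):
    """Turn one record into a set of 'attribute=value' items, each present or absent."""
    items = {f"{attr}={record[attr]}" for attr in categorical}
    for attr, cuts in cut_points.items():
        items.add(f"{attr}={interval_label(record[attr], cuts)}")
    return items


# Session 1 from the table.
session_1 = {
    "Country": "USA", "Session Length (sec)": 982,
    "Number of Web Pages viewed": 8, "Gender": "Male",
    "Browser Type": "IE", "Buy": "No",
}
categorical = ["Country", "Gender", "Browser Type", "Buy"]
# Hypothetical cut points, chosen only for illustration.
cut_points = {
    "Session Length (sec)": [600, 1200],
    "Number of Web Pages viewed": [5, 10],
}
items = to_binary_items(session_1, categorical, cut_points)
print(sorted(items))
```

Each resulting item, such as `Country=USA` or `Session Length (sec)=[600,1200)`, is either present in a session or not, which is the asymmetric binary form that association analysis expects.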
Similarity and Dissimilarity
Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
➢ Euclidean Distance
$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
Where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are
the kth attributes (components) of data objects p and q.
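The definition translates directly into a few lines of code. This is a minimal sketch: the function name `euclidean_distance` and the sample points are assumptions for illustration, and data objects are represented as plain sequences of numbers.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences over all n attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# Two hypothetical data objects with n = 2 attributes each.
print(euclidean_distance((0, 2), (3, 6)))  # 5.0
```

In Python 3.8 and later, the standard library function `math.dist(p, q)` computes the same quantity.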