Chapter 2: Data
Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
Outline
● Types of Data
● Data Quality
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
– The data type you see (often numbers or strings) may not capture all the properties of an attribute, or may suggest properties that are not present
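To make the finite-precision point concrete, here is a small Python aside (not from the slides): floating-point variables store continuous values with a limited number of binary digits, so even simple decimals are only approximated.

```python
# Continuous values are stored with a finite number of binary digits,
# so many decimal values are only approximated.
x = 0.1 + 0.2
print(x)         # 0.30000000000000004, not exactly 0.3
print(x == 0.3)  # False: the finite representation differs from 0.3
```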
Important Characteristics of Data
– Sparsity
◆ Only presence counts
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on the size of the data
● Spatio-Temporal Data
[Figure: average monthly values over locations and time]
Missing Values
● Causes?
Duplicate Data
● Data cleaning
– Process of dealing with duplicate data issues
Similarity and Dissimilarity Measures
● Similarity measure
– Numerical measure of how alike two data objects are
– Is higher when objects are more alike
– Often falls in the range [0, 1]
● Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
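The slides define the two measures but give no code. A minimal Python sketch follows; the choice of Euclidean distance as the dissimilarity and of the transform s = 1/(1 + d) are assumptions here (both are standard options).

```python
import math

def dissimilarity(x, y):
    """Euclidean distance: a standard dissimilarity for numeric objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    """Map the dissimilarity into [0, 1]: 1 for identical objects,
    approaching 0 as objects grow farther apart."""
    return 1.0 / (1.0 + dissimilarity(x, y))

p, q = (1.0, 2.0), (4.0, 6.0)
print(dissimilarity(p, q))  # 5.0 (lower when objects are more alike)
print(similarity(p, q))     # ~0.17 (higher when objects are more alike)
```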
Euclidean Distance
● dist(x, y) = ( sum_k (x_k - y_k)^2 )^0.5, where the sum runs over the attributes of objects x and y
● Standardization is necessary if scales differ
[Table: example points p1-p4 and their pairwise Euclidean distance matrix]
● More generally, the Minkowski distance is dist(x, y) = ( sum_k |x_k - y_k|^r )^(1/r); r = 2 gives the Euclidean distance
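A hedged numpy sketch of the distance-matrix computation; the coordinates for p1..p4 are made up for illustration, since the original table did not survive extraction.

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: (sum_k |x_k - y_k|^r)^(1/r); r=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

# Hypothetical coordinates for points p1..p4 (not given in the text).
points = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0], [5.0, 1.0]])

# If attribute scales differed, we would standardize first, e.g.:
# points = (points - points.mean(axis=0)) / points.std(axis=0)

n = len(points)
dist = np.array([[minkowski(points[i], points[j], r=2) for j in range(n)]
                 for i in range(n)])
print(np.round(dist, 3))  # symmetric 4x4 distance matrix with zero diagonal
```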
Mahalanobis Distance
● mahalanobis(x, y) = (x - y)^T Σ^(-1) (x - y), where Σ is the covariance matrix of the input data
● Example: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
– Mahal(A, B) = 5
– Mahal(A, C) = 4
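A small numpy check of this example. The covariance matrix is not given in the surviving text; Σ = [[0.3, 0.2], [0.2, 0.3]] is assumed here because it reproduces the quoted values Mahal(A, B) = 5 and Mahal(A, C) = 4.

```python
import numpy as np

# Covariance matrix: an assumption, chosen because it reproduces the
# quoted values Mahal(A,B) = 5 and Mahal(A,C) = 4.
cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahal(x, y):
    """Squared Mahalanobis distance (x - y)^T Sigma^-1 (x - y)."""
    d = np.asarray(x) - np.asarray(y)
    return float(d @ cov_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal(A, B))  # 5.0 (up to float rounding)
print(mahal(A, C))  # 4.0 (up to float rounding)
```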
Cosine Similarity
● If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the inner (dot) product of vectors d1 and d2, and ||d|| is the length of vector d.
● Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
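The same computation in a few lines of numpy, as an illustrative check on the worked example:

```python
import numpy as np

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(d1, d2), 4))  # 0.315
```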
Drawback of Correlation
● x = (-3, -2, -1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), so that y_i = x_i^2
● mean(x) = 0, mean(y) = 4
● std(x) = 2.16, std(y) = 3.74
● corr(x, y) = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0
● Correlation is 0 even though y is completely determined by x: correlation captures only linear relationships
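A quick numpy check of this example (illustrative, not from the slides):

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2  # y is a deterministic, but nonlinear, function of x

# Pearson correlation only detects linear relationships, so it is 0 here.
print(np.corrcoef(x, y)[0, 1])  # 0.0
```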
Information and Entropy
● Information relates to the possible outcomes of an event, e.g., the measurement of a piece of data
● The more certain an outcome, the less information it contains, and vice versa
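As an illustration of this idea, a minimal sketch of Shannon entropy, H = -sum_i p_i log2(p_i), which is the standard way to quantify it; the probability values below are made up.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i).
    Low entropy means the outcome is nearly certain and carries little information."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.08 bits: an almost-certain outcome
```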
Data Preprocessing
● Aggregation
● Sampling
● Discretization and Binarization
● Attribute Transformation
● Dimensionality Reduction
● Feature subset selection
● Feature creation
Sample Size
[Figure: the same data set drawn at several sample sizes]
Discretization
● Discretization is the process of converting a continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped into a small number of categories
– Discretization is used in both unsupervised and supervised settings
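For instance, a minimal unsupervised discretization sketch using equal-width binning; the values and the number of bins are illustrative, and equal-width binning is just one of several standard techniques.

```python
import numpy as np

# Equal-width binning of a continuous attribute into 4 ordinal categories.
values = np.array([2.7, 14.1, 5.8, 9.9, 12.3, 1.2, 7.5, 18.6])

bins = np.linspace(values.min(), values.max(), num=5)  # 4 equal-width bins
codes = np.digitize(values, bins[1:-1])                # ordinal codes 0..3
print(codes)  # each continuous value mapped to a small set of categories
```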
Dimensionality Reduction
● Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
● Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
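PCA finds orthogonal directions, the eigenvectors of the data's covariance matrix, that capture the most variance, and projects the data onto the top few; the closely related SVD can compute the same projection. Below is a minimal numpy sketch with illustrative data, not the textbook's code.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components: the eigenvectors of
    the covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # coordinates in the reduced space

# Illustrative data: 100 points in 3-D that mostly vary along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.05 * rng.normal(size=(100, 3))
print(pca(X, k=2).shape)  # (100, 2): reduced from 3 attributes to 2
```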