Lecture Notes for Chapter 2, Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
Types of Data
Data Quality
Data Preprocessing
Attributes and Objects
An attribute is a property or characteristic of an object
– Attribute is also known as variable, field, characteristic, dimension, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example:

Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
A More Complete View of Data
Measurement of Length
– The way you measure an attribute may not match the attribute's properties.

[Figure: the length of objects A–D measured on two scales. One scale preserves only the ordering property of length; the other preserves both the ordering and additivity properties of length.]
Attribute Type               Transformation             Comments
Nominal                      Any permutation of values  If all employee ID numbers were
(categorical, qualitative)                              reassigned, would it make any
                                                        difference?

This categorization of attributes is due to S. S. Stevens.
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important
– Words present in documents
– Items present in customer transactions

Key messages for attribute types
– Biased scale: interval or ratio?
– The data type you see – often numbers or strings – may not capture all the properties or may suggest properties that are not present

Important characteristics of data
– Sparsity: only presence counts
– Resolution: patterns depend on the scale
– Size: type of analysis may depend on size of data
Document data (each document becomes a term vector; the value of each component is the number of times the term appears in the document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
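The transaction table above can be viewed as a set of asymmetric binary attributes: one 0/1 column per item, where only the 1s carry information. A minimal Python sketch of that encoding (the variable names are illustrative, not from the notes):

    # Encode the transaction table above as asymmetric binary attributes.
    # Only the 1s (presence) carry information; matching 0s are not meaningful.
    transactions = {
        1: {"Bread", "Coke", "Milk"},
        2: {"Beer", "Bread"},
        3: {"Beer", "Coke", "Diaper", "Milk"},
        4: {"Beer", "Bread", "Diaper", "Milk"},
        5: {"Coke", "Diaper", "Milk"},
    }

    items = sorted({item for basket in transactions.values() for item in basket})

    # One row of 0/1 flags per transaction, one column per item.
    matrix = [[int(item in basket) for item in items]
              for basket in transactions.values()]

    print(items)
    for tid, row in zip(transactions, matrix):
        print(tid, row)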
Graph Data
Examples: generic graph, a molecule, and webpages
[Figure: a generic graph with numbered nodes, a molecular structure, and linked webpages]

Ordered Data
Sequences of transactions
[Figure: a sequence of customer transactions; each element of the sequence is a set of items/events]
Genomic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean. Causes?]
Duplicate Data
Data sets may include data objects that are duplicates, or almost duplicates, of one another
– Example: same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Similarity measure
– Numerical measure of how alike two data objects are
– Is higher when objects are more alike
– Often falls in the range [0, 1]
Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to either a similarity or a dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the k-th attributes (components) of data objects x and y.
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Figure: the four points plotted in two dimensions]
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
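A short NumPy sketch that reproduces this distance matrix from the point table (a sanity check, not part of the original notes):

    import numpy as np

    # The four points from the table above: p1..p4.
    points = np.array([[0, 2],
                       [2, 0],
                       [3, 1],
                       [5, 1]], dtype=float)

    # Pairwise Euclidean distances: d(x, y) = sqrt(sum_k (x_k - y_k)^2).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    print(np.round(dist, 3))
    # Matches the matrix above, e.g. d(p1, p2) = 2.828, d(p1, p4) = 5.099.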
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:

d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the k-th attributes (components) of data objects x and y.

Examples:
– r = 1. City block (Manhattan, taxicab, L1 norm) distance
– r = 2. Euclidean distance (L2 norm)
– r → ∞. "Supremum" (Lmax or L∞ norm) distance

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0

Distance Matrix
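The same points can be used to check all three Minkowski matrices at once; a minimal sketch assuming NumPy:

    import numpy as np

    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
    diff = np.abs(points[:, None, :] - points[None, :, :])

    # Minkowski distance for r = 1 (city block), r = 2 (Euclidean),
    # and the r -> infinity limit (supremum / L_inf norm).
    L1 = diff.sum(axis=-1)
    L2 = np.sqrt((diff ** 2).sum(axis=-1))
    Linf = diff.max(axis=-1)

    print(L1)               # matches the L1 matrix above
    print(np.round(L2, 3))  # matches the L2 matrix
    print(Linf)             # matches the L-infinity matrix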
Mahalanobis Distance

\text{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{T} \, \Sigma^{-1} (\mathbf{x} - \mathbf{y})

where \Sigma is the covariance matrix of the input data.

Example:
Covariance matrix:
\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
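A sketch verifying the two Mahalanobis values above with NumPy (the formula, as given here, omits the square root):

    import numpy as np

    # Covariance matrix and points from the example above.
    Sigma = np.array([[0.3, 0.2],
                      [0.2, 0.3]])
    Sigma_inv = np.linalg.inv(Sigma)

    def mahalanobis(x, y):
        d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        # (x - y)^T Sigma^{-1} (x - y), as in the formula above.
        return d @ Sigma_inv @ d

    A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
    print(mahalanobis(A, B))  # 5.0
    print(mahalanobis(A, C))  # 4.0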
Example (two binary vectors):
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
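These two vectors match the textbook's SMC-versus-Jaccard example; a sketch computing both measures from the standard definitions (f11, f00, f10, f01 are the usual match/mismatch counts; these labels and the computed values are reconstructed, not taken from this copy):

    # SMC and Jaccard for the two binary vectors above.
    x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # both 1
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))  # both 0
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

    smc = (f11 + f00) / (f01 + f10 + f11 + f00)  # counts 0-0 matches
    jaccard = f11 / (f01 + f10 + f11)            # ignores 0-0 matches

    print(smc)      # 0.7
    print(jaccard)  # 0.0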
Visually Evaluating Correlation
[Figure: scatter plots showing correlation values from –1 to 1]

Drawback of Correlation
– x = (-3, -2, -1, 0, 1, 2, 3)
– y_i = x_i^2, so y = (9, 4, 1, 0, 1, 4, 9)
– mean(x) = 0, mean(y) = 4
– std(x) = 2.16, std(y) = 3.74
– The correlation between x and y is 0, even though y is completely determined by x, because correlation only measures linear relationships
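A quick check of this drawback, using the x values (-3 … 3) consistent with the stated mean and standard deviation:

    import numpy as np

    # x values consistent with mean(x) = 0, std(x) = 2.16 (sample std).
    x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
    y = x ** 2  # mean(y) = 4, std(y) = 3.74

    # Pearson correlation: covariance / (std_x * std_y).
    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 10))  # 0.0 -- correlation misses the perfect nonlinear relationship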
Domain of application
– Similarity measures tend to be specific to the type of attribute and data
– Record data, images, graphs, sequences, 3-D protein structure, etc. tend to have different measures
However, one can talk about various properties that you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and produce results that agree with domain knowledge
Information Based Measures

Entropy
For
– a variable (event) X,
– with n possible values (outcomes) x1, x2, …, xn,
– each outcome having probability p1, p2, …, pn,
the entropy of X, H(X), is given by

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

Entropy of a Sample
Suppose we have
– a number of observations (m) of some attribute X, e.g., the hair color of students in the class,
– where there are n different possible values,
– and the number of observations in the i-th category is m_i.
Then, for this sample,

H = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}
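A minimal sketch of the sample-entropy formula above; the hair-color counts here are hypothetical:

    import numpy as np

    def entropy_from_counts(counts):
        """Sample entropy: H = -sum_i (m_i/m) * log2(m_i/m)."""
        m_i = np.asarray(counts, dtype=float)
        p = m_i / m_i.sum()
        p = p[p > 0]  # 0 * log(0) is taken as 0
        return -(p * np.log2(p)).sum()

    # Hypothetical hair-color counts for a class of 10 students.
    print(entropy_from_counts([5, 3, 2]))  # about 1.49 bits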
Mutual Information
Information one variable provides about another. Formally,

I(X, Y) = H(X) + H(Y) - H(X, Y)

where H(X, Y) is the joint entropy of X and Y,

H(X, Y) = -\sum_i \sum_j p_{ij} \log_2 p_{ij}

and p_ij is the probability that the i-th value of X and the j-th value of Y occur together.

Example: mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624
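A sketch of I(X, Y) = H(X) + H(Y) - H(X, Y) computed from a joint probability table; the 2x3 table below is hypothetical, since the original Student Status / Grade table is not reproduced in these notes:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def mutual_information(joint):
        """I(X, Y) = H(X) + H(Y) - H(X, Y) from a joint probability table p_ij."""
        joint = np.asarray(joint, dtype=float)
        px = joint.sum(axis=1)   # marginal distribution of X
        py = joint.sum(axis=0)   # marginal distribution of Y
        return entropy(px) + entropy(py) - entropy(joint.ravel())

    # Hypothetical joint distribution over 2 statuses and 3 grades.
    joint = np.array([[0.20, 0.10, 0.10],
                      [0.05, 0.30, 0.25]])
    print(mutual_information(joint))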
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
– Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc.
Days aggregated into weeks, months, or years
– More “stable” data
Aggregated data tends to have less variability
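A small pandas sketch of the "days aggregated into months" example; the data is synthetic, and the point is only that the monthly aggregates vary less than the daily values:

    import numpy as np
    import pandas as pd

    # Hypothetical daily measurements for one year.
    days = pd.date_range("2019-01-01", "2019-12-31", freq="D")
    daily = pd.Series(np.random.randn(len(days)), index=days)

    # Aggregate days into months; monthly means vary less than daily values.
    monthly = daily.resample("MS").mean()
    print(daily.std(), monthly.std())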
Curse of Dimensionality
– When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Dimensionality Reduction
Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition (SVD)
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
– Goal is to find a projection that captures the largest amount of variation in the data
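A minimal NumPy sketch of PCA as described above (projection onto the eigenvectors of the covariance matrix; the data is synthetic):

    import numpy as np

    def pca(X, k):
        """Project X onto its top-k principal components (eigenvectors of the
        covariance matrix, i.e. the directions of maximum variance)."""
        Xc = X - X.mean(axis=0)                 # center the attributes
        cov = np.cov(Xc, rowvar=False)          # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
        return Xc @ top

    # Hypothetical data: 100 points in 3-D reduced to 2-D.
    X = np.random.randn(100, 3)
    print(pca(X, 2).shape)  # (100, 2)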
Discretization
Discretization is the process of converting a continuous attribute into an ordinal attribute.

[Figure: histogram of petal length counts (frequency vs. petal length)]

[Figure: unsupervised discretization example. The data consists of four groups of points and two outliers; the data is one-dimensional, but a random y component is added to reduce overlap.]
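A short sketch contrasting equal-width and equal-frequency binning, two standard unsupervised discretization approaches consistent with the figure (the four-group data is synthetic):

    import numpy as np

    # Hypothetical 1-D values: four groups of points.
    values = np.concatenate([np.random.normal(mu, 0.3, 25) for mu in (2, 4, 6, 8)])

    k = 4
    # Equal width: split the range into k intervals of the same length.
    width_edges = np.linspace(values.min(), values.max(), k + 1)
    # Equal frequency: each bin gets (roughly) the same number of points.
    freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

    width_bins = np.digitize(values, width_edges[1:-1])
    freq_bins = np.digitize(values, freq_edges[1:-1])
    print(np.bincount(width_bins), np.bincount(freq_bins))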
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.