3_Introduction to Data (3)
Types of Data
Data Quality
Objects
– Attribute is also known as variable, field, characteristic, dimension, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes
Attribute Values
Interval
Measured on a scale of equal-sized units
Values have order
– E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as a multiple of another, or of the unit
of measurement (10 K is twice as high as 5 K).
– e.g. length, counts, monetary quantities
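
A small sketch of why the zero point matters, assuming the standard 273.15 offset between degrees Celsius and kelvins. The helper name `celsius_to_kelvin` is illustrative and not from the slides. Ratios are meaningful in kelvins, while on the Celsius scale only differences are.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvins (assumed offset: 273.15)."""
    return c + 273.15

# Ratio scale: the Kelvin scale has an inherent zero, so 10 K is twice 5 K.
ratio_kelvin = 10.0 / 5.0

# Interval scale: the raw numbers 10 and 5 also have ratio 2, but the
# temperatures they denote (10 °C and 5 °C) do not differ by a factor of 2.
ratio_true = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences are meaningful on both scales and agree with each other.
diff_celsius = 10.0 - 5.0
diff_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)

print(ratio_kelvin, round(ratio_true, 4))
print(diff_celsius, round(diff_kelvin, 10))
```

The ratio of the two Celsius temperatures, once expressed on a scale with a true zero, is close to 1 rather than 2, which is why ratios are reserved for ratio attributes.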
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
01/27/2021 Introduction to Data Mining, 2nd Edition 8
Tan, Steinbach, Karpatne, Kumar
Basic Statistical Descriptions of Data
Motivation
– To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities
of precision
– Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
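
As a rough illustration of the dispersion measures listed above, the sketch below computes a five-number summary with Python's `statistics` module and flags outliers. The sample values are hypothetical, and the 1.5 × IQR cutoff is a common convention assumed here, not something the slides specify.

```python
import statistics

# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles; method="inclusive" interpolates between the sample min and max.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR beyond the quartiles
# are treated as outliers (the usual boxplot whisker rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

five_number_summary = (min(data), q1, q2, q3, max(data))
print("min, Q1, median, Q3, max:", five_number_summary)
print("outliers:", outliers)
```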
Measuring the Central Tendency
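Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (population)
Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values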
Median:
– Middle value if odd number of values, or average of
the middle two values otherwise
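
The measures of central tendency above can be sketched in a few lines of Python. The data, the weights, and the helper names `weighted_mean` and `trimmed_mean` are hypothetical; the trimmed mean here drops the same number of values from each end, which is one common reading of "chopping extreme values".

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 39]           # hypothetical sample
weights = [1, 1, 1, 1, 1, 1, 0.5]         # hypothetical weights

def weighted_mean(xs, ws):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(xs):
        raise ValueError("k is too large for this sample")
    ordered = sorted(xs)
    return statistics.fmean(ordered[k:len(ordered) - k])

mean = statistics.mean(values)            # pulled upward by the value 39
median = statistics.median(values)        # middle value (odd count)
median_even = statistics.median([1, 2, 3, 4])  # average of the middle two

print(mean, median, median_even)
print(weighted_mean(values, weights), trimmed_mean(values, 1))
```

On this sample the single extreme value moves the mean to 10 while the median stays at 5, which is the usual motivation for the trimmed mean and the median.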
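Variance (sample $s^2$, population $\sigma^2$):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$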
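
A minimal sketch checking that the definitional and computational forms of the sample and population variance agree. The data are hypothetical; `statistics.variance` (divides by n - 1) and `statistics.pvariance` (divides by N) are used only as a cross-check.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]   # hypothetical sample
n = len(data)
mean = sum(data) / n

# Definitional forms: average squared deviation from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in data)
sample_var = sum_sq_dev / (n - 1)
population_var = sum_sq_dev / n

# Computational forms: need only the sum and the sum of squares.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
sample_var_short = (sum_x2 - sum_x ** 2 / n) / (n - 1)
population_var_short = sum_x2 / n - mean ** 2

print(sample_var, sample_var_short, statistics.variance(data))
print(population_var, population_var_short, statistics.pvariance(data))
```

The computational forms need only one pass over the data, but they can lose precision when the mean is large relative to the spread.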
Quantile-Quantile (Q-Q) Plot
– Sparsity
Only presence counts
– Resolution
Patterns depend on the scale
– Size
Type of analysis may depend on size of data
Terms: game, score, timeout, season, play, team, win, ball, lost
Document 1 3 0 5 0 2 6 0 2 0 2
Document 2 0 7 0 2 1 0 0 3 0 0
Document 3 0 1 0 0 1 2 2 0 3 0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
[Figure: example graph]
Sequences of transactions
Items/Events
An element of the sequence
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
Average monthly temperature of land and ocean
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
Distance Matrix
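
The distance matrix can be reproduced from the four points with a short sketch. It assumes the usual definition of Euclidean distance, the square root of the summed squared coordinate differences; the function name `euclidean` is illustrative.

```python
import math

# The four points from the table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

names = list(points)
matrix = [
    [round(euclidean(points[r], points[c]), 3) for c in names]
    for r in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal carry information.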
Minkowski Distance
r = 2. Euclidean distance
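point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1     p2     p3     p4
p1     0      4      4      6
p2     4      0      2      4
p3     4      2      0      2
p4     6      4      2      0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1     p2     p3     p4
p1     0      2      3      5
p2     2      0      1      3
p3     3      1      0      2
p4     5      3      2      0
Distance Matrix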
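
A hedged sketch of the Minkowski distance, assuming the standard definition $d(x, y) = \left(\sum_k |x_k - y_k|^r\right)^{1/r}$, with r = 1 giving the L1 (city block) distance, r = 2 the Euclidean distance, and r → ∞ the supremum (L∞) distance. The function names are illustrative.

```python
import math

# The four points from the table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(a, b, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if math.isinf(r):
        return max(diffs)              # limit of the formula as r grows
    return sum(d ** r for d in diffs) ** (1.0 / r)

def distance_matrix(r):
    """Pairwise distances between all points, rounded to 3 decimals."""
    names = list(points)
    return [
        [round(minkowski(points[p], points[q], r), 3) for q in names]
        for p in names
    ]

l1 = distance_matrix(1)
l2 = distance_matrix(2)
l_inf = distance_matrix(math.inf)

for label, m in (("L1", l1), ("L2", l2), ("Linf", l_inf)):
    print(label, m)
```

For any pair of points the three distances are ordered L∞ ≤ L2 ≤ L1, which the first rows of the three matrices illustrate.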
Common Properties of a Distance
Jaccard Coefficient
Counts only presences (1-1 matches) and is frequently used for asymmetric binary
attributes.
J = number of 11 matches / number of attributes that are not 0 in both objects
= (f11) / (f01 + f10 + f11)
where fab is the number of attributes with x = a and y = b
x= 1000000000
y= 0000001001
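
A minimal sketch that tallies the match counts for the two binary vectors and evaluates the Jaccard coefficient from the formula above. The function names are illustrative, and returning 0 when both vectors are all zeros is an assumed convention, since the formula is undefined in that case.

```python
from collections import Counter

x = "1000000000"
y = "0000001001"

def match_counts(a, b):
    """Count attribute pairs by pattern: '11', '10', '01', '00'."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return Counter(ai + bi for ai, bi in zip(a, b))

def jaccard(a, b):
    """f11 / (f01 + f10 + f11); 0-0 matches are ignored."""
    f = match_counts(a, b)
    denom = f["01"] + f["10"] + f["11"]
    # Assumed convention: report 0.0 when neither vector has any 1s.
    return f["11"] / denom if denom else 0.0

f = match_counts(x, y)
print(f["01"], f["10"], f["00"], f["11"])
print(jaccard(x, y))
```

Although x and y agree on seven of ten attributes, all of those agreements are 0-0 matches, so the Jaccard coefficient is 0.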
Example (cosine similarity):
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = <d1, d2> / (|| d1 || * || d2 ||) = 5 / (6.481 * 2.449) = 0.3150
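
The same calculation as a short sketch, assuming cosine similarity is the dot product divided by the product of the vector lengths. The function name `cosine_similarity` is illustrative.

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

print(round(norm(d1), 3), round(norm(d2), 3))
print(round(cosine_similarity(d1, d2), 4))
```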
$y_i = x_i^2$
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
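
The slide does not list the x values. As a sketch, assume x runs over the integers -3 to 3, an assumption chosen because it reproduces the reported means and sample standard deviations.

```python
import statistics

# Assumed values: not given in the text, chosen to match the reported statistics.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]          # y_i = x_i^2

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
std_x = statistics.stdev(x)        # sample standard deviation (n - 1)
std_y = statistics.stdev(y)

print(mean_x, mean_y)
print(round(std_x, 2), round(std_y, 2))
```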