X Chapter 02 Data
1
Getting to Know Your Data
2
MODULE 5
3
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents represented as term-frequency vectors
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular structures
Ordered
Video data: sequence of images
Temporal data: time series
Sequential data: transaction sequences
Genetic sequence data
Spatial, image and multimedia
Spatial data: maps
Image data
Video data
Example of document data (term-frequency vectors):
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
Example of transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
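The term-frequency representation of document data can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the sample sentence, and the helper name `term_frequency_vector` are assumptions, and real text would need proper tokenization.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "the coach said the team will play and the team will score"

vector = term_frequency_vector(doc, vocabulary)
print(vector)  # [2, 1, 1, 0, 1]
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.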
4
Data Objects
5
Attributes
Attribute (also called a dimension in data warehousing, a feature in machine learning, or a variable in statistics): a data field representing a characteristic or feature of a data object
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
6
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Values have no meaningful order and are not quantitative
Mathematical operations on values of nominal attributes are not meaningful
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
7
Attribute Types
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Central tendency can be represented by the mode and the median
The mean cannot be defined
8
Numeric Attribute Types
Numeric
Quantitative: a measurable quantity, represented in integer or real values
Can be interval-scaled or ratio-scaled
Interval
Continuous measurements on a roughly linear scale
Measured on a scale of equal-sized units
Allows us to compare and quantify the difference between values
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero point
Neither 0 °C nor 0 °F indicates "no temperature"
Ratios are not valid
9
Numeric Attribute Types
Ratio
Inherent zero-point
Values are ordered, and one value can be expressed as a multiple (ratio) of another
Can compute differences, mean, median and mode
We can speak of a value as being a multiple of the unit of measurement (10 K is twice as high as 5 K, because 0 K = -273.15 °C is a true zero point)
E.g., temperature in Kelvin, length, counts, monetary quantities
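The difference between interval and ratio scales can be checked numerically. The sketch below uses two hypothetical temperature readings: differences agree on both scales, but the ratio is only meaningful in Kelvin, which has a true zero point.

```python
import math


def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin."""
    return c + 273.15


# Hypothetical readings in degrees Celsius.
a_c, b_c = 10.0, 5.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are the same on both scales (interval property).
diff_c = a_c - b_c
diff_k = a_k - b_k
print(math.isclose(diff_c, diff_k))  # True

# Ratios disagree: 10 °C is not "twice as hot" as 5 °C.
ratio_c = a_c / b_c
ratio_k = a_k / b_k
print(ratio_c, round(ratio_k, 3))  # 2.0 1.018
```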
10
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes represented as integer variables
Binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
11
Getting to Know Your Data
Data Visualization
Summary
12
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects
are
Lower when objects are more alike
Minimum dissimilarity is often 0
Proximity refers to a similarity or dissimilarity
13
Data Matrix and Dissimilarity Matrix
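Data structures
1. Data matrix
Stores the n data objects with p attributes in the form of a relational table
It is an n-by-p matrix:
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$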
14
Data Matrix and Dissimilarity Matrix
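Data structures
2. Dissimilarity matrix
This structure stores the collection of proximities that are available for all pairs of the n objects:
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
d(i, j) is the measured difference (dissimilarity) between objects i and j; it is 0 when the objects are the same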
15
Proximity Measure for Nominal Attributes
16
Dissimilarity- Nominal data
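Obj. ID   test-1   test-2
101       A        X
102       B        X
103       A        Y
104       C        Z
Calculate the dissimilarity matrix, which has the general form
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Result for the four objects:
$$
\begin{bmatrix}
0 \\
1/2 & 0 \\
1/2 & 1 & 0 \\
1 & 1 & 1 & 0
\end{bmatrix}
$$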
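The result for the four objects can be reproduced with a short sketch. It assumes the usual mismatch ratio for nominal attributes, d(i, j) = (p - m) / p, where p is the number of attributes and m the number of matches; that formula is not shown on the slide but agrees with the values 0, 1/2 and 1 in the result. The function name is illustrative.

```python
from fractions import Fraction


def nominal_dissimilarity(a, b):
    """Fraction of attributes on which two objects disagree: (p - m) / p."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return Fraction(p - m, p)


# The four objects from the table: (test-1, test-2).
objects = {101: ("A", "X"), 102: ("B", "X"), 103: ("A", "Y"), 104: ("C", "Z")}
ids = list(objects)

# Lower-triangular dissimilarity matrix, row by row.
matrix = [
    [nominal_dissimilarity(objects[i], objects[j]) for j in ids[: k + 1]]
    for k, i in enumerate(ids)
]
for row in matrix:
    print(" ".join(str(d) for d in row))
# 0
# 1/2 0
# 1/2 1 0
# 1 1 1 0
```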
17
Proximity Measure for Binary Attributes
A contingency table for binary data
                      Object j
                   1       0      sum
Object i    1      q       r      q + r
            0      s       t      s + t
           sum   q + s   r + t      p
where q = number of attributes equal to 1 for both objects
t = number of attributes equal to 0 for both objects
s = number of attributes equal to 0 for object i and 1 for object j
r = number of attributes equal to 1 for object i and 0 for object j
18
Proximity Measure for Binary Attributes
19
Dissimilarity between Binary Variables
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$
$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$
$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
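The three values can be checked with a small sketch built on the contingency counts q, r, s and t. It assumes the standard asymmetric binary dissimilarity (r + s) / (q + r + s), which ignores 0-0 matches and matches the arithmetic of the example, and it leaves out the symmetric attribute Gender as the example does. Function names are illustrative.

```python
def binary_counts(a, b):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def asymmetric_dissimilarity(a, b):
    """(r + s) / (q + r + s); undefined when both vectors are all zeros."""
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1 .. Test-4 with Y/P -> 1 and N -> 0.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_dissimilarity(jim, mary), 2))   # 0.75
```

For symmetric binary attributes, the 0-0 matches would also count in the denominator: (r + s) / (q + r + s + t).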
24
Dissimilarity on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$$ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} $$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
A distance that satisfies these properties is a metric
25
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are different
between two binary vectors
$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$
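On binary vectors the Manhattan distance and the Hamming distance coincide, because each differing bit contributes exactly 1 to the sum. A minimal check with two hypothetical bit vectors:

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 norm)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical binary vectors, for illustration only.
u = [1, 0, 1, 1, 0, 0, 1]
v = [1, 1, 1, 0, 0, 1, 1]

print(manhattan(u, v), hamming(u, v))  # 3 3
```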
26
Example: Minkowski Distance
Data points
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5
(Figure: scatter plot of the four points x1 to x4)
Dissimilarity Matrices
Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
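The three matrices can be recomputed from the four points with one general function. This sketch assumes the standard definitions: h = 2 gives the Euclidean distance and h → ∞ gives the supremum distance, i.e. the largest absolute coordinate difference. The helper `lower_triangle` is an illustrative name.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h; h = float('inf') gives the supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def lower_triangle(h):
    """Distances for every pair (row, column) below the diagonal."""
    names = list(points)
    return {
        (names[i], names[j]): round(minkowski(points[names[i]], points[names[j]], h), 2)
        for i in range(len(names))
        for j in range(i)
    }


l1 = lower_triangle(1)
l2 = lower_triangle(2)
l_sup = lower_triangle(float("inf"))
print(l1[("x2", "x1")], l2[("x2", "x1")], l_sup[("x2", "x1")])  # 5.0 3.61 3
```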
27
Ordinal Variables
28
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$
$\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
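A minimal sketch of the combined formula follows. Several details are assumptions made for illustration: missing values are encoded as `None`, ordinal values are given as ranks 1..M, the numeric "normalized distance" is taken to be the absolute difference divided by the attribute's range, and the sample objects are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges=None, levels=None):
    """Average the per-attribute dissimilarities over comparable attributes.

    kinds[f] is one of "nominal", "symmetric", "asymmetric", "numeric", "ordinal".
    ranges[f] is max - min of numeric attribute f; levels[f] is M_f for ordinal f.
    """
    ranges = ranges or {}
    levels = levels or {}
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # indicator is 0: a value is missing
        if kind == "asymmetric" and a == 0 and b == 0:
            continue  # indicator is 0: 0-0 match on an asymmetric binary attribute
        if kind in ("nominal", "symmetric", "asymmetric"):
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / ranges[f]
        elif kind == "ordinal":
            steps = levels[f] - 1
            d = abs((a - 1) / steps - (b - 1) / steps)  # |z_if - z_jf|
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical objects: (nominal, ordinal rank out of 3, numeric with range 42).
kinds = ["nominal", "ordinal", "numeric"]
levels = {1: 3}
ranges = {2: 42.0}
obj_a = ("A", 3, 45.0)
obj_b = ("B", 1, 22.0)

d_ab = mixed_dissimilarity(obj_a, obj_b, kinds, ranges, levels)
print(round(d_ab, 2))  # 0.85
```

Skipping an attribute removes it from both the numerator and the denominator, which is what setting its indicator to 0 does in the formula.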
29
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
30
Example: Cosine Similarity
$\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$,
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
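The same calculation in code, as a minimal sketch using only the standard library. The function name is illustrative, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```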
31
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Central tendency
Location of the middle or center of the data distribution
Dispersion
How the data are spread out
Range, quartiles, interquartile range, five-number summary.
32
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} $$
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
$$ \text{median} = L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} $$
$L_1$: lower boundary of the median interval
n: the number of values in the entire data set
$(\sum \text{freq})_l$: sum of the frequencies of all intervals lower than the median interval
$\text{freq}_{\text{median}}$: frequency of the median interval
width: width of the median interval
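The measures above can be sketched with a few small helpers. All data values and the grouped-data parameters below are hypothetical, and the trimming rule (drop the same fraction from each end) is one common convention rather than something the slide specifies.

```python
import math
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k: len(ordered) - k]
    return sum(kept) / len(kept)


def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median for grouped data: L1 + ((n/2 - freq_below) / freq_median) * width."""
    return lower + (n / 2 - freq_below) / freq_median * width


# Hypothetical data set with one extreme value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(data))            # 58
print(statistics.median(data))          # 54.0
print(trimmed_mean(data, 0.1))          # 55.6
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25

# Hypothetical grouped data: median interval starts at 20, has width 10 and
# frequency 25; 40 of the 100 values fall in lower intervals.
print(grouped_median(lower=20, width=10, n=100, freq_below=40, freq_median=25))  # 24.0
```

The trimmed mean (55.6) sits closer to the median than the plain mean (58) because the extreme value 110 has been dropped.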
33
Measuring the Central Tendency
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
$$ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) $$
34
Symmetric vs. Skewed Data
Median, mean and mode of symmetric and skewed data
Variance (population):
$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 $$
36
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: split the sorted data into four equal-size, consecutive sets
Q1 (25th percentile), Q3 (75th percentile)
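Quartiles, the interquartile range and the five-number summary can be computed with the standard library. The data set below is hypothetical, the `inclusive` interpolation method is one of several conventions for quartiles, and the 1.5 × IQR outlier rule is a common convention assumed here rather than taken from the slides.

```python
import statistics

# Hypothetical data set with one extreme value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles by linear interpolation between the sorted values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)  # (30, 49.25, 54.0, 64.75, 110)

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(iqr, outliers)  # 15.5 [110]
```

A boxplot draws exactly these quantities: the box spans Q1 to Q3, the line inside marks the median, and flagged values are plotted individually.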
37
Boxplot Analysis
38
Visualization of Data Dispersion: 3-D Boxplots
40
Graphic Displays of Basic Statistical Descriptions
43
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
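The points of a quantile plot can be computed as follows. The slide does not say how f_i is obtained; this sketch assumes the common convention f_i = (i - 0.5) / n for the i-th smallest of n values, and the data are hypothetical.

```python
def quantile_points(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical measurements, for illustration only.
points = quantile_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting f_i on the horizontal axis against x_i on the vertical axis gives the quantile plot.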
45
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
46
Positively and Negatively Correlated Data
47
Uncorrelated Data
48