Lect 3
— Chapter 2 —
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction data
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Graph and network
World Wide Web
Social or information networks
Molecular structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data
Video data
Important Characteristics of
Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
Data Objects
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Binary Attribute Types
Binary variables: attributes with only 2 states (0 and 1)
A binary attribute can be symmetric or asymmetric
How is dissimilarity computed?
Matching approach: d(i, j) = (p - m) / p
m is the number of attributes on which i and j match
p is the total number of attributes describing i and j
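A minimal Python sketch of the matching approach, d(i, j) = (p - m) / p. The function name and the two sample vectors are made up for illustration and do not come from the slides.

```python
def matching_dissimilarity(obj_i, obj_j):
    """Return (p - m) / p for two equal-length attribute vectors."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                      # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Hypothetical binary objects: they agree on 2 of 4 attributes.
i = (1, 0, 1, 1)
j = (1, 1, 0, 1)
d_ij = matching_dissimilarity(i, j)
print(d_ij)
```

Treating every attribute the same way, as here, is only appropriate when the binary attributes are symmetric.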
Ordinal Attribute Types
Ordinal
Values have a meaningful order (ranking)
army rankings
Numeric Attribute Types
Interval-Scaled Variables
Continuous measurements of a roughly
linear scale
Weight, height, latitude, temperature
Discrete vs. Continuous
Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in
a collection of documents
Sometimes, represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of
Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed
cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
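The measures on this slide can be sketched in Python with the standard `statistics` module. The helper names, the sample values and the trimming fraction are illustrative assumptions. In `grouped_median`, `l1` is assumed to be the lower boundary of the median interval and `freq_below` the summed frequency of all intervals below it.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)   # how many values to chop per side
    return mean(ordered[k:len(ordered) - k])


def grouped_median(l1, n, freq_below, freq_median, width):
    """Interpolated median for grouped data."""
    return l1 + (n / 2 - freq_below) / freq_median * width


# Made-up sample (e.g., salaries in thousands); 110 is an extreme value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))                # arithmetic mean
print(median(data))              # average of the two middle values
print(multimode(data))           # all most-frequent values (bimodal here)
print(trimmed_mean(data, 0.1))   # drops the smallest and largest value
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median(l1=20, n=100, freq_below=30, freq_median=40, width=10))
```

The trimmed mean is lower than the plain mean here because the single extreme value no longer pulls the average up.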
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed and negatively skewed data
[Figure: three distribution curves, labeled symmetric, positively skewed, and negatively skewed]
Positively and Negatively Correlated
Data
Uncorrelated Data
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-Oriented Visualization
Techniques
For a data set of m dimensions, create m windows on the
screen, one for each dimension
The m dimension values of a record are mapped to m pixels
at the corresponding positions in the windows
The colors of the pixels reflect the corresponding values
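The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are laid out in input order, row by row, each window has a hypothetical fixed `width`, and "color" is reduced to a gray level from 0 to 255. Real pixel-oriented techniques usually sort the records and use more elaborate layouts.

```python
def pixel_windows(records, width):
    """Return one 2-D grid of gray levels per dimension.

    records: list of equal-length numeric tuples (n records, m dimensions).
    width:   pixels per row in every window (assumed layout parameter).
    """
    m = len(records[0])
    windows = []
    for dim in range(m):
        column = [rec[dim] for rec in records]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1          # avoid dividing by zero for constant columns
        # The pixel color reflects the value: scale it to 0..255.
        levels = [round(255 * (v - lo) / span) for v in column]
        # Record k occupies the same position in every window.
        rows = [levels[k:k + width] for k in range(0, len(levels), width)]
        windows.append(rows)
    return windows


# Four made-up records with m = 2 dimensions give two 2x2 windows.
records = [(1, 10), (2, 30), (3, 20), (4, 40)]
windows = pixel_windows(records, width=2)
for window in windows:
    print(window)
```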
Measure of Similarity and Dissimilarity
In clustering, outlier analysis and nearest-neighbor classification, we need to assess how alike and unalike objects are in comparison to one another.
Example: A store may want to search for clusters of customer
objects, resulting in groups of customers with similar
characteristics (e.g. similar income, area of residence, and
age). Such information can then be used for marketing.
A cluster is a collection of data objects such that the objects
within a cluster are similar to one another and dissimilar to
the objects in other clusters.
Knowledge of similarities can be used in a nearest-neighbor classification scheme.
Knowledge of dissimilarities can be used in outlier analysis.
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects i and j are
Value is higher when objects are more alike
0 if the objects are unalike
1 if the objects are alike (complete similarity)
Often falls in the range [0,1]
Dissimilarity (e.g., distance; the opposite of similarity)
Numerical measure of how different two data objects i and j are
Value is lower when objects are more alike
0 if the objects are the same
>0 if the objects are dissimilar
Minimum dissimilarity is often 0
Upper limit varies. The higher the dissimilarity value, the more dissimilar the two objects are.
Proximity refers to either similarity or dissimilarity
Data Matrix
Data matrix (object-by-attribute structure)
n data objects with p attributes.
Each row corresponds to an object.
Two-mode matrix (rows for objects and columns for attributes)
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity Matrix
Dissimilarity matrix (object-by-object structure)
n data points, but registers only the distance d(i, j), i.e., the difference between objects i and j.
d(i, j) is close to 0 if objects are highly similar or near each other, and becomes larger the more they differ.
d(i, j) = d(j, i)
A triangular matrix
One-mode matrix (contains one kind of entity)
$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Similarity and Dissimilarity
Matrix
Measures of similarity can often be expressed as a function of measures of dissimilarity. For nominal data, for example:
sim(i, j) = 1 - d(i, j)
Many clustering and nearest-neighbor
algorithms operate on a dissimilarity
matrix.
Data in the form of a data matrix can be
transformed into a dissimilarity matrix
before applying such algorithms.
Proximity Measure for Nominal
Attributes
Can take 2 or more states, e.g., red, yellow,
blue, green (generalization of a binary attribute)
Method 1: Simple matching
m: # of matches, p: total # of attributes
$d(i, j) = \frac{p - m}{p}$
Method 2: Use a large number of binary
attributes
creating a new binary attribute for each of the
M nominal states
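Method 2 can be sketched as follows. This is a hedged illustration: the function name `to_binary_attributes` is hypothetical, and the state list reuses the colors mentioned on this slide.

```python
def to_binary_attributes(value, states):
    """Encode one nominal value as M binary attributes, one per state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == state else 0 for state in states]


states = ["red", "yellow", "blue", "green"]   # M = 4 nominal states
encoded_blue = to_binary_attributes("blue", states)
encoded_red = to_binary_attributes("red", states)
print(encoded_blue)
print(encoded_red)
```

Exactly one of the M new attributes is 1 for any object, so the encoded objects can then be compared with a proximity measure for binary attributes.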
Proximity Measure for Binary
Attributes
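A contingency table for binary data
[Table: 2x2 contingency table, with the values (1, 0) of object i along the rows and the values (1, 0) of object j along the columns]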
Dissimilarity between Binary
Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N 0
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       1      0      1       0       0       0
Mary  F       1      0      1       0       1       0
Jim   M       1      1      0       0       0       0
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$ (likely to have a similar disease)
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$ (unlikely to have a similar disease)
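The three values in this example can be reproduced with a short Python sketch. It assumes the usual asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 for i and 0 for j, and s those that are 0 for i and 1 for j. Attributes that are 0 for both are ignored. The function and variable names are illustrative, and Gender is left out because it is symmetric.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """(r + s) / (q + r + s); 0-0 matches are ignored.

    Raises ZeroDivisionError if the objects share no 1 and differ nowhere.
    """
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y and P coded as 1, N as 0)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```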
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
[Figure: scatter plot of the points x1, x2, x3 and x4]
Dissimilarity Matrix
(with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
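A minimal sketch of how a data matrix is turned into a lower-triangular dissimilarity matrix with Euclidean distance. It uses `math.dist` from the Python standard library (Python 3.8+). The function name is an illustrative choice.

```python
import math


def dissimilarity_matrix(points):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i)."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]


# The four points x1..x4 from the data matrix above.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = dissimilarity_matrix(points)
for row in matrix:
    print([round(d, 2) for d in row])
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.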
Distance on Numeric Data
Manhattan distance
Minkowski distance
Euclidean distance
Let i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) be two p-dimensional data objects (i.e., objects described by p numeric attributes)
The Euclidean distance between two objects i and j is defined as
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Euclidean and Manhattan distance satisfy the following properties:
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, i) = 0: the distance from an object to itself is 0 (identity of indiscernibles)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
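h = 1: Manhattan (city block, L1 norm) distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
E.g., the Hamming distance: the number of bits that are different between two binary vectors
h = 2: Euclidean (L2 norm) distance
h → ∞: supremum (L∞ norm) distance, the maximum difference between any attribute of the two objects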
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
[Figure: scatter plot of the points x1, x2, x3 and x4]
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
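The three matrices above can be recomputed with one general function. This is a sketch under the assumption that the supremum distance is requested by passing `h = float("inf")`; that calling convention and the names are illustrative choices, not part of the slides.

```python
def minkowski(obj_i, obj_j, h):
    """Minkowski distance of order h; h = inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(obj_i, obj_j)]
    if h == float("inf"):
        return max(diffs)               # largest difference over all attributes
    return sum(d ** h for d in diffs) ** (1 / h)


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for label, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", float("inf"))]:
    print(label)
    for row, a in enumerate(names):
        # Print only the lower triangle, as in the tables above.
        values = [round(minkowski(points[a], points[b], h), 2) for b in names[:row + 1]]
        print(a, values)
```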
Ordinal Variables
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
One may use a weighted formula to combine their
effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
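A hedged Python sketch of the weighted formula for mixed attribute types. Several details are assumptions, because the slide does not spell them out: the indicator weight is taken to be 0 when a value is missing (`None` here) or when both values are 0 on an asymmetric binary attribute, and 1 otherwise. The normalized numeric distance is taken to be |x_if - x_jf| divided by the attribute's range. The `specs` structure, the attribute kinds and the sample objects are all hypothetical.

```python
def attribute_distance(x, y, spec):
    """Per-attribute dissimilarity in [0, 1]."""
    kind = spec["kind"]
    if kind in ("nominal", "binary", "asymmetric_binary"):
        return 0.0 if x == y else 1.0
    if kind == "numeric":
        # Assumed normalization: divide by the attribute's range.
        return abs(x - y) / (spec["max"] - spec["min"])
    if kind == "ordinal":
        # x and y are ranks 1..M_f; map each to z = (r - 1) / (M_f - 1).
        m = spec["states"]
        return abs((x - 1) / (m - 1) - (y - 1) / (m - 1))
    raise ValueError(f"unknown attribute kind: {kind}")


def mixed_dissimilarity(obj_i, obj_j, specs):
    """Weighted average of per-attribute distances over usable attributes."""
    numerator = denominator = 0.0
    for x, y, spec in zip(obj_i, obj_j, specs):
        if x is None or y is None:
            continue   # weight 0: missing value
        if spec["kind"] == "asymmetric_binary" and x == 0 and y == 0:
            continue   # weight 0: 0-0 match on an asymmetric attribute
        numerator += attribute_distance(x, y, spec)
        denominator += 1
    return numerator / denominator   # fails if no attribute is usable


specs = [
    {"kind": "nominal"},
    {"kind": "ordinal", "states": 3},           # ranks: 1 = fair, 2 = good, 3 = excellent
    {"kind": "numeric", "min": 22, "max": 64},
]
a = ("code-A", 3, 45)
b = ("code-B", 1, 22)
d_ab = mixed_dissimilarity(a, b, specs)
print(round(d_ab, 2))
```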
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords)
or phrase in the document.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = 25 / (6.481 * 4.12) = 0.94
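The calculation above, written as a small Python function. It assumes cosine similarity is the dot product divided by the product of the two vector lengths; the function name is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)   # undefined for an all-zero vector


# Term-frequency vectors d1 and d2 from the example above.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

Because only the angle between the vectors matters, scaling a document's term counts by a constant leaves the similarity unchanged.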
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web,
image.
Gain insight into the data by:
Basic statistical data description: central tendency,
dispersion, graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
The above steps are the beginning of data preprocessing.
Many methods have been developed, but this is still an active area of research.