Concepts and Techniques: - Chapter 2
Data Visualization
Summary
Record
Relational records
Transaction data
Molecular Structures
Ordered
Image data:
Video data:
Transaction data table: columns TID and Items, with TIDs 1 to 5
Important Characteristics of Structured Data
Dimensionality
Sparsity
Resolution
Curse of dimensionality
Distribution
Data Objects
Examples:
Attributes
Attribute Types
e.g., gender
No true zero-point
Ratio
Inherent zero-point
Discrete Attribute
Has only a finite or countably infinite set of values
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
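Mean (sample): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median (estimated by interpolation for grouped data):
$median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
Mode
Empirical formula: $mean - mode \approx 3 \times (mean - median)$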
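The central tendency measures above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper names `weighted_mean` and `grouped_median` and all data values are assumptions made for the example.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def grouped_median(intervals, freqs):
    """Interpolated median for grouped data.

    intervals: sorted, contiguous (low, high) bounds (hypothetical layout)
    freqs: number of observations in each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    below = 0  # cumulative frequency of the intervals below the median interval
    for (low, high), freq in zip(intervals, freqs):
        if below + freq >= n / 2:
            # L1 + (n/2 - (sum freq)_l) / freq_median * width
            return low + (n / 2 - below) / freq * (high - low)
        below += freq

# Illustrative values only
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = statistics.mean(salaries)
median = statistics.median(salaries)
modes = statistics.multimode(salaries)  # all most-frequent values

wmean = weighted_mean([1, 2, 3], [1, 1, 2])
gmedian = grouped_median([(0, 10), (10, 20), (20, 30)], [2, 5, 3])
```

The mean (58) lies above the median (54) here because the single large value 110 pulls it upward, which is the pattern expected for positively skewed data.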
Symmetric vs. Skewed Data
Figure panels: symmetric, positively skewed, negatively skewed
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
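A boxplot summary can be sketched with the standard library. The 1.5 x IQR rule for whiskers and outliers is a common convention assumed here, not something stated above; the function name `boxplot_stats` and the data are also illustrative.

```python
import statistics

def boxplot_stats(data, whisker=1.5):
    """Quartiles, whisker ends, and outliers for a boxplot.

    Assumes the conventional rule: points farther than whisker * IQR
    from the box are plotted individually as outliers.
    """
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = sorted(x for x in data if x < lo_fence or x > hi_fence)
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # illustrative data
```

Different quantile conventions give slightly different quartiles; `method="inclusive"` is one choice among several.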
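Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$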
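The two forms of the sample variance (the definition and the sum-of-squares shortcut) and the population variance can be checked numerically. A minimal sketch; function names and data are illustrative assumptions.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Equivalent form: [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def population_variance(xs):
    """Population form: mean of squares minus squared mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
s2 = sample_variance(data)
s2_short = sample_variance_shortcut(data)
sigma2 = population_variance(data)
```

The shortcut form needs only one pass over the data, but it can lose precision when the values are large relative to their spread.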
Boxplot Analysis
Boxplot
Histogram Analysis
Quantile Plot
Scatter plot
Uncorrelated Data
Data Visualization
Pixel-Oriented Visualization Techniques
Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age
Methods
Direct visualization
Landscapes
Prosection views
Hyperslice
Parallel coordinates
Scatterplot Matrices
Landscapes
News articles visualized as a landscape
Parallel Coordinates
Figure axes: Attr. 1, Attr. 2, Attr. 3, ..., Attr. k
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figures
General techniques
Chernoff Faces
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values; generated using Mathematica (S. Dickson)
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at
Methods
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
Dimensional Stacking
Figure axis labels: attribute 1, attribute 2, attribute 3, attribute 4
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
Nvision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
Auto Visual: Static interaction by means of queries
Tree-Map
Ack.: http://www.cs.umd.edu/hcil/treemap-
InfoCube
Ack.: http://nadeausoftware.com/articles/visualizatio
The importance of a tag is represented by font size/color
Besides text data, there are also methods to visualize relationships, such as visualizing social networks
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
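Data matrix
n data points with p dimensions
Two modes
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$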
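The step from a data matrix (n objects by p attributes) to a lower-triangular dissimilarity matrix can be sketched as below. Euclidean distance via `math.dist` is an assumed choice of d(i, j); the function name and the points are illustrative.

```python
import math

def dissimilarity_matrix(points, dist=math.dist):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i).

    Only the lower triangle is stored because d(i, j) = d(j, i)
    and d(i, i) = 0.
    """
    return [
        [dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]

# Illustrative data matrix: 4 objects, 2 attributes
data_matrix = [(1, 2), (3, 5), (2, 0), (4, 5)]
dm = dissimilarity_matrix(data_matrix)
```

Any other dissimilarity function taking two objects can be passed as `dist` without changing the matrix layout.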
$d(i, j) = \frac{p - m}{p}$, where m is the number of matches and p is the total number of attributes
Method 2: Use a large number of binary attributes
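The matching-based dissimilarity d(i, j) = (p - m) / p can be sketched in a few lines. The function name and the attribute values are illustrative assumptions.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes.

    p: total number of attributes; m: number of attributes on which
    the two objects take the same value.
    """
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

# Illustrative objects with three nominal attributes
d = nominal_dissimilarity(("red", "small", "round"), ("red", "large", "round"))
```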
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
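One way to work through the example table is sketched below. It rests on assumptions that are not stated above: Gender is treated as symmetric and left out, the remaining attributes are treated as asymmetric binary with Y and P coded as 1 and N as 0, and dissimilarity is computed as (r + s) / (q + r + s), where q counts 1/1 matches and r, s count the two kinds of mismatch.

```python
def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); 0/0 matches are ignored as uninformative."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        return 0.0  # no positive values in either object
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4 from the table (Gender omitted by assumption)
records = {"Jack": "YNPNNN", "Mary": "YNPNPN", "Jim": "YPNNNN"}
encoded = {name: [1 if c in "YP" else 0 for c in row]
           for name, row in records.items()}

d_jack_mary = asymmetric_binary_dissimilarity(encoded["Jack"], encoded["Mary"])
d_jack_jim = asymmetric_binary_dissimilarity(encoded["Jack"], encoded["Jim"])
d_jim_mary = asymmetric_binary_dissimilarity(encoded["Jim"], encoded["Mary"])
```

Under these assumptions Jack and Mary are the most alike (about 0.33), while Jim and Mary are the least alike (0.75).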
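Z-score: $z = \frac{x - \mu}{\sigma}$
Mean absolute deviation: $s_f = \frac{1}{n} (|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
where $m_f = \frac{1}{n} (x_{1f} + x_{2f} + \cdots + x_{nf})$
Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$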
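Standardizing one attribute f with the mean absolute deviation, rather than the standard deviation, can be sketched as follows. The function name and the values are illustrative.

```python
def standardize(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation.

    Raises ZeroDivisionError if all values are identical (s_f = 0).
    """
    n = len(values)
    m = sum(values) / n                      # m_f: mean of attribute f
    s = sum(abs(x - m) for x in values) / n  # s_f: mean absolute deviation
    return [(x - m) / s for x in values]

z = standardize([1, 2, 3, 4, 5])  # illustrative values; m_f = 3, s_f = 1.2
```

Because deviations are not squared, s_f is less affected by outliers than the standard deviation, so outlying values keep relatively large z-scores.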
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
Dissimilarity Matrix (with Euclidean Distance)
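Minkowski distance: $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)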
Properties
Manhattan (L1)
Euclidean (L2)
Supremum
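The three distances named above are special cases of the L-h norm: h = 1 (Manhattan), h = 2 (Euclidean), and the limit as h goes to infinity (supremum). A minimal sketch; the function name and the two points are illustrative.

```python
import math

def minkowski(x, y, h):
    """L-h norm distance between two p-dimensional objects.

    h = 1: Manhattan; h = 2: Euclidean; h = math.inf: supremum,
    the largest difference over any single attribute.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)  # illustrative points
manhattan = minkowski(x1, x2, 1)
euclidean = minkowski(x1, x2, 2)
supremum = minkowski(x1, x2, math.inf)
```

For any pair of objects the values are ordered supremum <= Euclidean <= Manhattan, which is visible here as 3 <= 3.61 <= 5.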
Ordinal Variables
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
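The weighted combination over attributes of mixed type can be sketched as below. Several details are assumptions made for the example: the indicator is set to 0 only when a value is missing (`None`), numeric attributes are normalized by an attribute range supplied by the caller, and the `schema` layout, function name, and data are hypothetical.

```python
def mixed_dissimilarity(a, b, schema):
    """Average of per-attribute dissimilarities over usable attributes.

    schema maps attribute name to one of:
      ("nominal",)           0 if equal, 1 otherwise
      ("numeric", lo, hi)    |x - y| / (hi - lo)
      ("ordinal", levels)    rank r mapped to z = (r - 1) / (M - 1)
    """
    total, count = 0.0, 0
    for name, spec in schema.items():
        x, y = a.get(name), b.get(name)
        if x is None or y is None:
            continue  # indicator delta = 0: attribute contributes nothing
        kind = spec[0]
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            lo, hi = spec[1], spec[2]
            d = abs(x - y) / (hi - lo)
        elif kind == "ordinal":
            levels = spec[1]
            m = len(levels)
            # list index is r - 1, so index / (M - 1) equals (r - 1) / (M - 1)
            d = abs(levels.index(x) - levels.index(y)) / (m - 1)
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        total += d
        count += 1
    if count == 0:
        raise ValueError("no comparable attributes")
    return total / count

schema = {
    "color": ("nominal",),
    "size": ("ordinal", ["small", "medium", "large"]),
    "weight": ("numeric", 10.0, 30.0),
}
a = {"color": "red", "size": "small", "weight": 10.0}
b = {"color": "blue", "size": "medium", "weight": 20.0}
d_ab = mixed_dissimilarity(a, b, schema)
d_missing = mixed_dissimilarity(a, {"color": "blue", "size": "medium", "weight": None}, schema)
```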
Cosine Similarity
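Cosine similarity is commonly defined as the dot product of two vectors divided by the product of their lengths, cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||). A minimal sketch under that definition; the function name and the term-frequency vectors are illustrative.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

# Illustrative term-frequency vectors for two documents
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(d1, d2)
```

The measure depends only on the angle between the vectors, so scaling a document's term counts by a constant leaves the similarity unchanged.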
Summary