Data Mining: Concepts and Techniques - Chapter 2
Data Visualization
Summary
Record
Relational records
Transaction data
Molecular Structures
Ordered
Image data:
Video data:
[Table: transaction data with columns TID and Items; TIDs 1 to 5]
Important Characteristics of Structured Data
Dimensionality
Sparsity
Resolution
Curse of dimensionality
Distribution
Data Objects
Examples:
Attributes
Attribute Types
e.g., gender
No true zero-point
Ratio
Inherent zero-point
Discrete Attribute
Has only a finite or countably infinite set of values
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
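Mean (sample): \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Weighted arithmetic mean: \( \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \)
Median:
Estimated by interpolation for grouped data:
\[ \text{median} = L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]
Mode
Empirical formula: \( \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \)
[Figure: median interval]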
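The central-tendency measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `weighted_mean` and `grouped_median` and the sample values are assumptions introduced here, and `grouped_median` takes L1 as the lower bound of the median interval and width as its length.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def grouped_median(intervals, freqs):
    """Approximate the median of grouped data by interpolation.

    intervals: list of (low, high) bounds in ascending order
    freqs:     number of observations falling in each interval
    """
    n = sum(freqs)
    below = 0  # (sum of freq)_l: total frequency of intervals below the current one
    for (low, high), freq in zip(intervals, freqs):
        if freq > 0 and below + freq >= n / 2:
            # low = L1, freq = frequency of the median interval, high - low = width
            return low + (n / 2 - below) / freq * (high - low)
        below += freq
    raise ValueError("no observations")


# Illustrative values only (not taken from the slides)
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))       # arithmetic mean
print(median(data))     # middle value (average of the two middle values here)
print(multimode(data))  # all most-frequent values; the data may be multimodal
print(grouped_median([(0, 10), (10, 20), (20, 30)], [2, 5, 3]))
```

`statistics.multimode` is used instead of `statistics.mode` so that bimodal data returns every mode rather than only the first one.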
Symmetric vs. Skewed Data
[Figure: positively skewed, symmetric, and negatively skewed distributions]
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
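The quantities a boxplot draws can be computed directly. The sketch below is one possible way to do it and makes two assumptions that the slide does not state: quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles`, and a point counts as an outlier when it lies more than 1.5 × IQR beyond a quartile (a common convention). The function name `boxplot_stats` and the sample values are illustrative.

```python
from statistics import quantiles


def boxplot_stats(values, k=1.5):
    """Return the quartiles, whisker ends, and outliers for a boxplot.

    k is the fence multiplier; k=1.5 is a common convention (an assumption here).
    """
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": outliers,        # plotted individually
    }


# Illustrative values only
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
stats = boxplot_stats(sample)
print(stats)
```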
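Sample variance:
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]
Population variance:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]
Standard deviation s (or σ) is the square root of variance s² (or σ²)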
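As a quick check that the two forms of the sample variance agree, here is a small sketch. The function names and the sample values are assumptions made for illustration; the standard library's `statistics.variance` and `statistics.pvariance` compute the same quantities.

```python
from math import sqrt


def sample_variance(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n]."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = 1/N * sum(x_i^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2


# Illustrative values only
xs = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(xs), sample_variance_shortcut(xs))
print(population_variance(xs), sqrt(population_variance(xs)))
```

The shortcut forms need only one pass over the data, but they can lose precision when the mean is large relative to the spread, so the two-pass form is the safer default.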
Boxplot Analysis
Boxplot
Histogram Analysis
Quantile Plot
Scatter plot
Uncorrelated Data
Data Visualization
Pixel-Oriented Visualization Techniques
[Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age]
Methods
Direct visualization
Landscapes
Prosection views
Hyperslice
Parallel coordinates
Scatterplot Matrices
Landscapes
[Figure: news articles visualized as a landscape]
Parallel Coordinates
[Figure: parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figures
General techniques
Chernoff Faces
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Data Mining: Concepts and
Techniques
Similarity
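Data matrix
n data points with p dimensions
Two modes
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
\[
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]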
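Method 1: Simple matching
\[ d(i,j) = \frac{p - m}{p} \]
m: number of matches, p: total number of attributes
Method 2: Use a large number of binary attributes
[Table: contingency table for binary attributes, Object i against Object j]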
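Simple matching for nominal attributes, d(i, j) = (p − m) / p with m the number of matching attributes and p the total number of attributes, can be sketched as follows. The function name and the example objects are hypothetical.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                     # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Hypothetical objects with three nominal attributes each
a = ("red", "round", "small")
b = ("red", "square", "small")
print(simple_matching_dissimilarity(a, b))  # one mismatch out of three attributes
```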
Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
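One way to compare the three patients in the example is the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r, s count attributes that are 1 for only one of them. The sketch below rests on assumptions not recoverable from the slide text: Y and P are encoded as 1, N as 0, and Gender is left out because it is treated as a symmetric attribute.

```python
ATTRIBUTES = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]

# Values from the example table; Gender omitted (assumed symmetric)
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}


def encode(values):
    """Assumed encoding: Y and P become 1, N becomes 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]


def asymmetric_binary_dissimilarity(x, y):
    """d = (r + s) / (q + r + s); negative matches (0, 0) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    if q + r + s == 0:
        return 0.0  # no positive values on either side
    return (r + s) / (q + r + s)


encoded = {name: encode(vals) for name, vals in patients.items()}
d_jack_mary = asymmetric_binary_dissimilarity(encoded["Jack"], encoded["Mary"])
d_jack_jim = asymmetric_binary_dissimilarity(encoded["Jack"], encoded["Jim"])
d_jim_mary = asymmetric_binary_dissimilarity(encoded["Jim"], encoded["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.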
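Z-score:
Mean absolute deviation:
\[ s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right) \]
where
\[ m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right) \]
Standardized measure (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]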
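A minimal sketch of standardizing one attribute f with the mean absolute deviation s_f, rather than the standard deviation, is shown below. The function name `standardize` and the sample column are assumptions for illustration.

```python
def standardize(column):
    """Return z_if = (x_if - m_f) / s_f for every value of one attribute.

    m_f is the mean of the attribute and s_f its mean absolute deviation.
    """
    n = len(column)
    m_f = sum(column) / n
    s_f = sum(abs(x - m_f) for x in column) / n
    if s_f == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - m_f) / s_f for x in column]


# Illustrative values only
z = standardize([1, 2, 3, 4, 5])
print(z)
```

The mean absolute deviation is less sensitive to outliers than the standard deviation, because deviations are not squared.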
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
Dissimilarity Matrix (with Euclidean Distance)
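Since the matrices of this example did not survive, the sketch below builds a lower-triangular dissimilarity matrix from a small data matrix using Euclidean distance. The four 2-dimensional points are hypothetical stand-ins, and `dissimilarity_matrix` is a name chosen here.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def dissimilarity_matrix(points):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i) with d(i, i) = 0."""
    return [
        [dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]


# Hypothetical data matrix: 4 data points with 2 dimensions
points = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = dissimilarity_matrix(points)
for row in matrix:
    print("  ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is stored because d(i, j) = d(j, i).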
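Minkowski distance:
\[ d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)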
Properties
Manhattan (L1)
Euclidean (L2)
Supremum
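The three special cases can be compared on a single pair of objects. This is a small sketch with hypothetical points; `minkowski` and `supremum` are names chosen here, and the supremum distance is taken as the largest per-attribute difference (the limit of the Minkowski distance as the order h grows).

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h=1: Manhattan, h=2: Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Supremum distance: the maximum difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical 2-dimensional objects
x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))  # Manhattan (L1)
print(minkowski(x1, x2, 2))  # Euclidean (L2)
print(supremum(x1, x2))      # Supremum
```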
Ordinal Variables
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
f is binary or nominal:
d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks r_{if} and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Treat z_{if} as interval-scaled
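The weighted combination over attributes of different types can be sketched as below. Several details are assumptions rather than statements from the slides: the indicator δ is taken as 0 when either value is missing (`None`) and 1 otherwise, the normalized numeric distance is the absolute difference divided by the attribute's range, and the `specs` layout, function name, and sample objects are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, specs):
    """Combine per-attribute dissimilarities d_ij^(f), weighted by delta_ij^(f).

    specs is a list with one dict per attribute, for example:
      {"type": "nominal"}
      {"type": "numeric", "min": 0, "max": 100}
      {"type": "ordinal", "levels": ["fair", "good", "excellent"]}
    """
    numerator, denominator = 0.0, 0
    for x, y, spec in zip(obj_i, obj_j, specs):
        if x is None or y is None:
            continue  # delta = 0: a missing value contributes nothing
        kind = spec["type"]
        if kind == "nominal":  # also covers binary attributes
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / (spec["max"] - spec["min"])
        elif kind == "ordinal":
            levels = spec["levels"]
            m_f = len(levels)
            # rank r = index + 1, so z = (r - 1) / (M_f - 1) = index / (M_f - 1)
            z_x = levels.index(x) / (m_f - 1)
            z_y = levels.index(y) / (m_f - 1)
            d = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        numerator += d
        denominator += 1  # delta = 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical attribute descriptions and objects
specs = [
    {"type": "nominal"},
    {"type": "ordinal", "levels": ["fair", "good", "excellent"]},
    {"type": "numeric", "min": 0, "max": 100},
]
obj_a = ("code-A", "excellent", 45)
obj_b = ("code-B", "fair", 22)
print(mixed_dissimilarity(obj_a, obj_b, specs))
```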
Cosine Similarity
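The slide body for this topic is missing, so the following is only a sketch of the usual definition: the dot product of two vectors divided by the product of their Euclidean lengths. The term-frequency vectors and the function name are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term-frequency vectors for two documents
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

Because only the angle between the vectors matters, two documents with the same word proportions but different lengths get a similarity of 1.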
KL Divergence: Comparing Two Probability Distributions
Discrete form:
\[ D_{KL}(p(x) \,\|\, q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \]
The KL divergence measures the expected number of extra bits required to code samples from p(x) (the true distribution) when using a code based on q(x), which represents a theory, model, description, or approximation of p(x)
Its continuous form:
\[ D_{KL}(p(x) \,\|\, q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx \]
The KL divergence is not a distance measure and not a metric: it is asymmetric and does not satisfy the triangle inequality
How to Compute the KL Divergence?
Based on the formula, D_KL(P, Q) ≥ 0, and D_KL(P || Q) = 0 if and only if P = Q
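A minimal sketch of the discrete KL divergence follows. It assumes base-2 logarithms, so the result is in bits, and it treats terms with p(x) = 0 as contributing nothing while returning infinity when q(x) = 0 for an outcome that p can produce. The function name and the two distributions are hypothetical.

```python
from math import log2


def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) * log2(p(x) / q(x)), in bits."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if q_x == 0:
            return float("inf")  # q rules out an outcome that p can produce
        total += p_x * log2(p_x / q_x)
    return total


# Hypothetical distributions over two outcomes
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # cost of coding samples from p with a code built for q
print(kl_divergence(q, p))  # a different value: the divergence is asymmetric
print(kl_divergence(p, p))  # zero when the two distributions are identical
```

In practice, zero probabilities in q are often smoothed with a small constant before computing the divergence, so that the result stays finite.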
Summary