Lecture 4: Basic Statistical Descriptions of Data
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation, and spread
Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Frequency and Mode
The frequency of an attribute value is the percentage of times the value occurs in the data set
The mode of an attribute is the most frequent attribute value
The notions of frequency and mode are typically used with categorical data
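Frequency and mode can be computed directly from value counts. The following is a minimal sketch, assuming the attribute values are held in a plain Python list; the `eye_color` data and both function names are invented for illustration.

```python
from collections import Counter


def frequencies(values):
    """Return {value: percentage of occurrences} for a list of attribute values."""
    counts = Counter(values)
    n = len(values)
    return {value: 100.0 * count / n for value, count in counts.items()}


def modes(values):
    """Return every value tied for the highest count (data may be multimodal)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Hypothetical categorical attribute, invented for illustration.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
print(frequencies(eye_color))
print(modes(eye_color))
```

Returning a list of modes rather than a single value keeps the sketch usable for bimodal or trimodal data.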
Measures of Location (Central Tendency): Mean and Median
The mean is the most common measure of the
location of a set of points.
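• However, the mean is very sensitive to outliers.
Median: the middle value of the sorted data; for grouped data it can be estimated by interpolation:
$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula (for moderately skewed unimodal data):
$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$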
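The sensitivity of the mean to outliers and the interpolation estimate of the median for grouped data can both be checked with a short sketch. The function `grouped_median` and the sample numbers are hypothetical; the sketch assumes the intervals are given as `(lower, upper, freq)` tuples sorted by lower boundary.

```python
from statistics import mean, median


def grouped_median(intervals):
    """Estimate the median from (lower, upper, freq) intervals sorted by lower bound.

    Implements L1 + ((n/2 - cumulative freq below) / freq of median interval) * width.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # cumulative frequency of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("intervals contain no data")


# One extreme value drags the mean but leaves the median unchanged.
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]
print(mean(clean), median(clean))                # 3 3
print(mean(with_outlier), median(with_outlier))  # 22 3

# Hypothetical grouped data: 2 values in [0, 10), 5 in [10, 20), 3 in [20, 30).
groups = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_median(groups))
```

For the grouped data above, the median interval is [10, 20), so the estimate is 10 + ((5 - 2) / 5) × 10 = 16.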
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed, and negatively skewed data
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
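The five-number summary and the 1.5 × IQR outlier rule can be sketched as follows. Quartile conventions differ between textbooks and libraries; this sketch assumes linear interpolation between data points (`method="inclusive"` in Python's `statistics.quantiles`), and the sample data and function names are invented for illustration.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using linear interpolation between points."""
    xs = sorted(data)
    q1, med, q3 = quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, med, q3, xs[-1]


def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data with one extreme value
print(five_number_summary(values))  # (1, 3.25, 5.5, 7.75, 100)
print(iqr_outliers(values))         # [100]
```

Here IQR = 7.75 − 3.25 = 4.5, so the fences are −3.5 and 14.5, and only 100 is flagged.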
Variance and standard deviation (sample: s, population: σ)
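Variance: (algebraic, scalable computation)
Sample variance:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
Population variance:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$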
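The variance can be computed either from deviations about the mean or from running sums of the values and their squares, which needs only one pass over the data. The sketch below compares both forms on invented sample data; the function names are hypothetical.

```python
from statistics import variance


def sample_variance_two_pass(xs):
    """s^2 = sum((x - mean)^2) / (n - 1), computed after finding the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_one_pass(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1), using running sums only."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum(x^2) / N - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean 5
print(sample_variance_two_pass(data))
print(sample_variance_one_pass(data))
print(population_variance(data))  # 4.0
```

The one-pass form is what makes the measure easy to accumulate over partitions of a large data set, but it can lose floating-point precision when the mean is large relative to the spread.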
Percentiles
Example: you are the fourth tallest person in a group of 20, so 16 of the 20 people (80%) are shorter than you and you are at the 80th percentile
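The height example can be reproduced with a small sketch. It assumes one common convention, where the percentile rank of a value is the percentage of values strictly below it; other conventions count ties differently. The heights are invented for illustration.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x (one common convention)."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)


heights = list(range(150, 190, 2))  # 20 hypothetical distinct heights in cm
fourth_tallest = sorted(heights, reverse=True)[3]
print(percentile_rank(heights, fourth_tallest))  # 80.0
```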
Post-processing
Visualization
The human eye is a powerful analytical tool
Visualization is the way to present the data so that patterns can be seen
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Quantile Plot
Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information
For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
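The points of a quantile plot can be computed before any plotting library is involved. The slide does not say how $f_i$ is obtained; this sketch assumes the common choice $f_i = (i - 0.5)/N$ for the $i$-th smallest of $N$ values, and the function name and data are invented for illustration.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for the sorted data.

    The formula for f_i is an assumption; it is not stated in the slide.
    """
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


prices = [40, 10, 30, 20]  # hypothetical unit prices
for f, x in quantile_plot_points(prices):
    print(f"{100 * f:.1f}% of the data are at or below {x}")
```

Plotting `x` against `f` for these pairs gives the quantile plot; every observation appears as one point, so unusual values stay visible.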
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
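Data matrix
n data points with p dimensions
Two modes
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
Proximity/Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$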
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
A contingency table for binary data (rows: object i; columns: object j)
Dissimilarity between Binary Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
d(Jack, Mary)
d(Jack, Jim)
d(Jim, Mary)
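The three dissimilarities can be worked out from the table with a short sketch. Several things here are assumptions not stated in the slide text: Gender is treated as a symmetric attribute and left out, the remaining attributes are treated as asymmetric binary with Y and P coded as 1 and N as 0, and the dissimilarity used is the standard asymmetric binary coefficient d = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r and s count the mismatches.

```python
def encode(row):
    """Map Y/P to 1 and N to 0 (coding assumed for this example)."""
    return [1 if value in ("Y", "P") else 0 for value in row]


def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        raise ValueError("objects share no positive attribute; d is undefined")
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1 .. Test-4 from the table (Gender excluded).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim": encode(["Y", "P", "N", "N", "N", "N"]),
}

for i, j in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[i], patients[j])
    print(f"d({i}, {j}) = {d:.2f}")
```

Under these assumptions the values are 0.33, 0.67, and 0.75, so Jack and Mary are the most similar pair and Jim and Mary the least.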
Standardizing Numeric Data
Z-score:
$z = \frac{x - \mu}{\sigma}$
x: raw score to be standardized, μ: mean of the population, σ: standard deviation
The distance between the raw score and the population mean in units of the standard deviation
Negative when the raw score is below the mean, positive when above
An alternative way: calculate the mean absolute deviation
$s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)$
where
$m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)$
Standardized measure (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
Using mean absolute deviation is more robust than using standard deviation
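Both standardizations can be compared on a small sample. The sketch below uses the population standard deviation for the ordinary z-score and the mean absolute deviation for the alternative; the data and function names are invented for illustration.

```python
from statistics import mean, pstdev


def z_scores(xs):
    """Standardize with the mean and the population standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def mad_scores(xs):
    """Standardize with the mean and the mean absolute deviation s_f."""
    m = mean(xs)
    s = sum(abs(x - m) for x in xs) / len(xs)
    return [(x - m) / s for x in xs]


values = [10, 12, 11, 13, 54]  # hypothetical attribute with one extreme value
print([round(z, 2) for z in z_scores(values)])
print([round(z, 2) for z in mad_scores(values)])
```

Because deviations are not squared, the extreme value inflates the mean absolute deviation less than the standard deviation, so its standardized score stays larger (2.5 versus about 2.0 here) and it remains easier to spot.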
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
[Figure: scatter plot of the points x1 to x4]
Dissimilarity Matrix
(with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
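The lower-triangular dissimilarity matrix for these four points can be rebuilt with a few lines. This is a minimal sketch using `math.dist` (Python 3.8 or later) for the Euclidean distance; the function name is hypothetical.

```python
import math

points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1, x2, x3, x4 from the data matrix


def dissimilarity_matrix(pts):
    """Return the lower triangle (diagonal included) of Euclidean distances."""
    rows = []
    for i, p in enumerate(pts):
        rows.append([round(math.dist(p, q), 2) for q in pts[: i + 1]])
    return rows


for name, row in zip(["x1", "x2", "x3", "x4"], dissimilarity_matrix(points)):
    print(name, row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, which is why the matrix has a single mode.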
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are different
between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
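For binary vectors, the Manhattan distance counts the positions where the bits differ, which is the Hamming distance. A minimal sketch, with invented bit vectors and a hypothetical function name:

```python
def manhattan(i, j):
    """Sum of absolute coordinate differences (Minkowski distance with h = 1)."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(i, j))


a = (1, 0, 1, 1, 0, 0, 1)  # hypothetical binary vectors
b = (1, 1, 1, 0, 0, 1, 1)
print(manhattan(a, b))  # 3: the vectors differ in three positions
```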
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
[Figure: scatter plot of the points x1 to x4]
Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
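The three matrices can be reproduced with one function that takes the order h as a parameter. This sketch treats the supremum distance as the limiting case h → ∞, computed as the largest coordinate difference; the function name is hypothetical.

```python
import math

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(i, j, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(i, j)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


names = list(points)
for label, h in [("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)]:
    print(label)
    for row, name in enumerate(names):
        values = [round(minkowski(points[name], points[other], h), 2)
                  for other in names[: row + 1]]
        print(" ", name, values)
```

For x1 and x2, for example, the coordinate differences are 2 and 3, giving 5 (Manhattan), √13 ≈ 3.61 (Euclidean), and 3 (supremum).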
Ordinal Variables
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
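The weighted formula can be sketched for objects that mix nominal, ordinal, and numeric attributes. Several choices below are assumptions rather than part of the slide: the indicator δ is set to 0 only when a value is missing (`None`) and to 1 otherwise, the normalized numeric distance is taken as |x − y| divided by the attribute's range, and the `schema` layout, function name, and sample objects are invented for illustration.

```python
def mixed_dissimilarity(obj_i, obj_j, schema):
    """Average the per-attribute dissimilarities over comparable attributes."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, spec in zip(obj_i, obj_j, schema):
        if x is None or y is None:
            continue  # delta = 0: a missing value contributes nothing
        kind = spec["type"]
        if kind in ("nominal", "binary"):
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / spec["range"]  # assumed normalization: max - min
        elif kind == "ordinal":
            levels = spec["levels"]        # ordered from lowest to highest
            top = len(levels) - 1          # M_f - 1
            z_x = levels.index(x) / top    # (r - 1) / (M_f - 1), rank r = index + 1
            z_y = levels.index(y) / top
            d = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        weighted_sum += d
        weight_total += 1.0
    if weight_total == 0:
        raise ValueError("no comparable attributes")
    return weighted_sum / weight_total


schema = [
    {"type": "nominal"},
    {"type": "ordinal", "levels": ["fair", "good", "excellent"]},
    {"type": "numeric", "range": 40},
]
a = ("A", "excellent", 50)
b = ("B", "fair", 30)
c = ("A", None, 50)
print(mixed_dissimilarity(a, b, schema))  # (1 + 1 + 0.5) / 3
print(mixed_dissimilarity(a, c, schema))  # 0.0, the ordinal attribute is skipped
```

Mapping every attribute's contribution into [0, 1] before averaging is what lets different attribute types be combined on a common scale.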
Cosine Similarity
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
Example: Cosine Similarity
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product, ||d||: the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2 ) = 0.94
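The arithmetic in this example can be checked with a short sketch; the function name is hypothetical, and the vectors are the two term-frequency vectors given above.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

The function divides by the vector lengths, so it raises `ZeroDivisionError` for an all-zero vector; a document with no counted terms would need separate handling.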
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web,
image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives