
Revision: Chapter 1-6

Applied Multivariate Statistics – Spring 2012


Overview

 Cov, Cor, Mahalanobis, MV normal distribution
 Visualization: Stars plot, mosaic plot with shading
 Outliers: chisq.plot
 Missing values: md.pattern, mice
 MDS: Metric / non-metric
 Dissimilarities: daisy
 PCA
 LDA
Two variables: Covariance and Correlation

 Covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] \in (-\infty, \infty)$

 Correlation: $\mathrm{Corr}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$

 Sample covariance: $\widehat{\mathrm{Cov}}(x, y) = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

 Sample correlation: $r_{xy} = \widehat{\mathrm{Cor}}(x, y) = \dfrac{\widehat{\mathrm{Cov}}(x, y)}{\hat{\sigma}_x \hat{\sigma}_y}$

 Correlation is invariant to changes in units, covariance is not
(e.g. kilo/gram, meter/kilometer, etc.)
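A minimal R sketch of these quantities and of the unit invariance of correlation; the variable names and simulated data are made up for illustration:

# Simulated data: weight roughly linear in height, plus noise
set.seed(1)
height_m  <- rnorm(100, mean = 1.75, sd = 0.10)
weight_kg <- 60 + 30 * height_m + rnorm(100, sd = 5)

cov(height_m, weight_kg)          # depends on the units of both variables
cov(height_m * 100, weight_kg)    # height in cm: covariance scales by a factor 100

cor(height_m, weight_kg)          # unitless, always in [-1, 1]
cor(height_m * 100, weight_kg)    # unchanged by the change of units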
Scatterplot: Correlation is scale invariant

Intuition and pitfalls for correlation
Correlation measures only LINEAR relationships
Covariance matrix / correlation matrix:
Table of pairwise values
 True covariance matrix: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$
 True correlation matrix: $C_{ij} = \mathrm{Cor}(X_i, X_j)$

 Sample covariance matrix: $S_{ij} = \widehat{\mathrm{Cov}}(x_i, x_j)$ (diagonal: variances)
 Sample correlation matrix: $R_{ij} = \widehat{\mathrm{Cor}}(x_i, x_j)$ (diagonal: 1)

 R: Functions “cov”, “cor” in package “stats”
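A minimal sketch of both matrices in R, using the numeric columns of the built-in iris data:

S <- cov(iris[, 1:4])   # sample covariance matrix; diagonal: variances
R <- cor(iris[, 1:4])   # sample correlation matrix; diagonal: 1
round(S, 2)
round(R, 2)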
Sq. Mahalanobis Distance $MD^2(x)$ and the Multivariate Normal Distribution

 Multivariate normal distribution: most common model choice
 $MD^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$: squared distance from the mean in standard deviations, in the direction of x
 Density (d dimensions):
$f(x; \mu, \Sigma) = \dfrac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Mahalanobis distance: Example

$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$

 Point (0, 10): $MD^2 = 0^2/25 + 10^2/1 = 100$, so MD = 10
 Point (10, 7): $MD^2 = 10^2/25 + 7^2/1 = 53$, so MD ≈ 7.3
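These values can be reproduced in R with the base function mahalanobis(), which returns the squared distance:

mu    <- c(0, 0)
Sigma <- matrix(c(25, 0,
                   0, 1), nrow = 2)

sqrt(mahalanobis(c(0, 10), center = mu, cov = Sigma))   # 10
sqrt(mahalanobis(c(10, 7), center = mu, cov = Sigma))   # approx. 7.28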
Glyphplots: Stars

• Which cities are special?
• Which cities are like New Orleans?
• Seattle and Miami are quite far apart; how do they compare?

• R: Function “stars” in package “graphics”
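A minimal sketch of a star/glyph plot; the lecture used a cities data set that is not reproduced here, so the built-in mtcars data stands in:

# Each observation is drawn as a star; each ray is one (scaled) variable
stars(mtcars[, c("mpg", "hp", "wt", "qsec", "disp")],
      key.loc = c(13, 1.5), main = "Star plot of car characteristics")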
Mosaic plot with shading

 Shading marks cells whose observed counts are surprisingly small or surprisingly large under independence
 p-value of the independence test: highly significant
 R: Function “mosaic” in package “vcd”
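A minimal sketch, assuming the vcd package is installed, using the built-in Titanic table:

library(vcd)
# shade = TRUE colours the cells by their Pearson residuals under independence
mosaic(Titanic, shade = TRUE, legend = TRUE)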
Outliers: Theory of Mahalanobis Distance

 Assume the data is multivariate normally distributed (d dimensions)

 The squared Mahalanobis distance of the samples then follows a Chi-Square distribution with d degrees of freedom
Expected value: d
(“By definition”: the sum of d squared standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
Outliers: Check for multivariate outliers

 Are there samples with an estimated Mahalanobis distance that does not fit a Chi-Square distribution at all?
 Check with a QQ-plot
 Technical details:
- the Chi-Square distribution is still reasonably good for estimated Mahalanobis distances
- use robust estimates for $\mu, \Sigma$

 R: Function «chisq.plot» in package «mvoutlier»

Outliers: chisq.plot
Outlier easily detected!
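A minimal sketch, assuming the mvoutlier package is installed; note that chisq.plot() is interactive and asks whether to remove the most extreme point:

library(mvoutlier)
# QQ-plot of squared (robust) Mahalanobis distances against Chi-Square quantiles
chisq.plot(swiss)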
Missing values: Problem of Single Imputation

 Too optimistic: the imputation model (e.g. in Y = a + bX) is just estimated, it is not the true model
 Thus, imputed values have some uncertainty
 Single imputation ignores this uncertainty
 Coverage probability of confidence intervals is wrong

 Solution: Multiple Imputation
Incorporates both
- residual error
- model uncertainty (excluding model mis-specification)

 R: Package «mice» for Multiple Imputation using chained equations
Multiple Imputation: MICE

 Workflow: impute several times; do the standard analysis for each imputed data set and get an estimate and its std. error; aggregate the results
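A minimal sketch of this workflow, assuming the mice package is installed and using its built-in nhanes example data:

library(mice)

md.pattern(nhanes)                                        # inspect the missing-data pattern

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)   # impute 5 times
fit <- with(imp, lm(chl ~ bmi + age))                     # standard analysis per imputed data set
summary(pool(fit))                                        # aggregate (pool) the results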
Idea of MDS

 Represent a high-dimensional point cloud in few (usually 2) dimensions, keeping distances between points similar
 Classical/Metric MDS: use a clever projection
- guaranteed to find the optimal solution only for Euclidean distance
- fast
R: Function “cmdscale” in the base distribution
 Non-metric MDS:
- squeeze the data onto the table = minimize STRESS
- only conserve ranks = allow monotonic transformations before reducing dimensions
- slow(er)
R: Function “isoMDS” in package “MASS”
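A minimal sketch of both variants on the built-in eurodist road-distance matrix (MASS is assumed to be installed):

library(MASS)

fit_metric <- cmdscale(eurodist, k = 2)     # classical / metric MDS
fit_nonmet <- isoMDS(eurodist, k = 2)       # non-metric MDS, minimizes STRESS

plot(fit_metric, type = "n", asp = 1)
text(fit_metric, labels = labels(eurodist))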
Distance: To scale or not to scale…

 If variables are not scaled
- the variable with the largest range has the most weight
- the distance depends on the scale
 Scaling gives every variable equal weight
 A similar alternative is re-weighting:
$d(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_p (x_{ip} - x_{jp})^2}$
 Scale if
- variables measure different units (kg, meter, sec, …)
- you explicitly want to have equal weight for each variable
 Don’t scale if units are the same for all variables
 Most often: better to scale.
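A minimal sketch of the effect of scaling on Euclidean distances, using made-up toy data:

x <- data.frame(weight_kg = c(60, 80, 100),
                height_m  = c(1.60, 1.80, 2.00))

dist(x)           # dominated by the variable with the largest range (weight in kg)
dist(scale(x))    # after scaling, both variables get equal weight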
Dissimilarity for mixed data: Gower’s Dissim.

 Idea: use a distance measure between 0 and 1 for each variable: $d_{ij}^{(f)}$

 Aggregate: $d(i, j) = \dfrac{1}{p} \sum_{f=1}^{p} d_{ij}^{(f)}$

 Binary (asymmetric/symmetric), nominal: use the methods discussed before
- asymmetric: one group is much larger than the other
 Interval-scaled: $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{R_f}$
$x_{if}$: value for object i in variable f
$R_f$: range of variable f over all objects
 Ordinal: use normalized ranks; then treat like interval-scaled based on the range

 R: Function “daisy” in package “cluster”
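A minimal sketch, assuming the cluster package and its built-in mixed-type flower data:

library(cluster)
data(flower)

# flower mixes binary, nominal, ordinal and numeric variables;
# daisy() then uses Gower's dissimilarity
d_gower <- daisy(flower, metric = "gower")
summary(d_gower)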
PCA: Goals

 Goal 1: dimension reduction to a few dimensions while explaining most of the variance
(use the first few PCs)
 Goal 2: find a one-dimensional index that separates objects best
(use the first PC)
PCA (Version 1): Orthogonal directions

• PC 1 is the direction of largest variance
• PC 2 is
- perpendicular to PC 1
- again the direction of largest variance
• PC 3 is
- perpendicular to PC 1 and PC 2
- again the direction of largest variance
• etc.
How many PCs: Blood Example

Rule 1: 5 PCs
Rule 2: 3 PCs
Rule 3: Elbow after PC 1 (?)
Biplot: Show info on samples AND variables

Approximately true:
• Data points: projection on the first two PCs
• Distance in the biplot ~ true distance
• Projection of a sample onto an arrow gives the original (scaled) value of that variable
• Arrow length: variance of the variable
• Angle between arrows: correlation

Approximation is often crude, but good for a quick overview
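A minimal PCA sketch covering the number-of-PCs rules and the biplot, using the built-in USArrests data:

pca <- prcomp(USArrests, scale. = TRUE)   # scale the variables first

summary(pca)                      # proportion of variance explained per PC
screeplot(pca, type = "lines")    # look for the "elbow"
biplot(pca)                       # samples and variables in the plane of PC 1 and PC 2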
Supervised Learning: LDA

$P(C \mid X) = \dfrac{P(C)\, P(X \mid C)}{P(X)} \propto P(C)\, P(X \mid C)$

 Prior / prevalence P(C): fraction of samples in that class
 Class density P(X | C): find some estimate; assume $X \mid C \sim N(\mu_C, \Sigma)$

 Bayes rule: choose the class where P(C | X) is maximal
(the rule is “optimal” if all types of error are equally costly)

 Special case: two classes (0/1)
- choose c = 1 if P(C = 1 | X) > 0.5, or
- choose c = 1 if the posterior odds P(C = 1 | X) / P(C = 0 | X) > 1

 In practice: estimate $P(C), \mu_C, \Sigma$
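A minimal sketch of LDA in R with the MASS package, on the built-in iris data:

library(MASS)

fit <- lda(Species ~ ., data = iris)   # priors default to the class fractions
fit$prior                              # estimated P(C)

pred <- predict(fit, iris)             # class with maximal posterior P(C | X)
table(predicted = pred$class, true = iris$Species)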
LDA: Orthogonal directions of best separation

 1st Linear Discriminant = 1st Canonical Variable
(in the figure contrasted with the 1st Principal Component)
 Linear decision boundary
 Classify to which class? Consider:
• the prior
• the Mahalanobis distance to the class center
i.e. balance prior and Mahalanobis distance
LDA: Quality of classification

 Using the training data also as test data: overfitting
Too optimistic for the error on new data
 Separate test data: split the rows into a training and a test set
 Cross validation (CV; e.g. “leave-one-out” cross validation):
every row is the test case once, the rest is the training data
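A minimal sketch of leave-one-out cross validation with MASS::lda, which supports it directly via CV = TRUE:

library(MASS)

fit_cv <- lda(Species ~ ., data = iris, CV = TRUE)    # leave-one-out CV
mean(fit_cv$class == iris$Species)                    # CV estimate of the accuracy
table(predicted = fit_cv$class, true = iris$Species)  # CV confusion matrix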
