
STAT 456 Study Guide

Nils DM

01/12/2021

1 Terminology
1. Multivariate Statistics: A subfield of statistics encompassing the simultaneous observation and
analysis of more than one response variable.
2. Multivariate Analysis: Based on the principles of multivariate statistics, multivariate analysis is typically used in situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structure are important.
3. Parameters:
Let $y_1, \ldots, y_p$ be $p$ variables where

$$\mu_i = E(y_i), \qquad \sigma_i^2 = V(y_i), \qquad i = 1, 2, \ldots, p$$

We then define the random vector as

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}$$

where the mean and variance are given by the mean vector $\mu$ and the covariance matrix $\Sigma$:

$$\mu = E(y) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}, \qquad
\Sigma = \mathrm{cov}(y) = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_2^2 & & \vdots \\
\vdots & & \ddots & \\
\sigma_{p1} & \cdots & & \sigma_p^2
\end{pmatrix}$$

where $\sigma_{ij} = \mathrm{cov}(y_i, y_j)$.

4. Estimation of µ and Σ:
$$\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

$E(\hat{\mu}) = \mu \implies$ unbiased estimator.

$\hat{\Sigma} = S$, the sample covariance matrix, where:

$$S_{jj} = S_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 \quad \text{(the sample variance of variable } y_j\text{)}$$

$$S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) \quad \text{(the sample covariance of the } j\text{th and } k\text{th variables)}$$

$S$ is an unbiased estimator of $\Sigma$ and is positive semi-definite.

Let $R$ denote the sample correlation matrix, which has diagonal entries equal to 1 and $r_{jk} = \frac{S_{jk}}{\sqrt{S_{jj} S_{kk}}}$.

5. Measures of Overall Variability:


The determinant of the sample covariance matrix S is called the Generalized Sample Variance
The trace of the sample covariance matrix S is called the Total Sample Variance

1 Code

data_1 <- read.table("../Examples_Code/T3_4_CALCIUM.DAT")


data_1 <- data_1[-1]
colMeans(data_1) # Calculate mu-hat (sample mean)

## V2 V3 V4
## 28.100 7.180 3.089

cov(data_1) # Sample covariance matrix (S)

## V2 V3 V4
## V2 140.544444 49.680000 1.9412222
## V3 49.680000 72.248444 3.6760889
## V4 1.941222 3.676089 0.2501211

cor(data_1) # Sample correlation matrix (R)

## V2 V3 V4
## V2 1.0000000 0.4930154 0.327411
## V3 0.4930154 1.0000000 0.864762
## V4 0.3274110 0.8647620 1.000000

det(cov(data_1)) # Generalized Sample Variance |S|

## [1] 459.9555

sum(diag(cov(data_1))) # Total Sample Variance tr(S)

## [1] 213.043

2 Multivariate Normal Distribution
1. Mahalanobis Distance:
A measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.

∆² = (y − µ)′Σ⁻¹(y − µ)

(A worked check based on this distance appears at the end of the 2 Code section.)

2. Identity:
If A is a constant q × p matrix of rank q, where q ≤ p (i.e., A has full row rank), then:
Ay ∼ Nq (Aµ, AΣA′ )

3. Standardized Multivariate Normal Variables:


A multivariate normal random variable z is standardized if it has the distribution

z ∼ Np (0, Ip )

The preferred way to standardize a multivariate normal random variable is as:

z = (Σ1/2 )−1 (y − µ) ∼ Np (0, Ip )

If z ∼ Np(0, Ip), then $z'z = z_1^2 + \ldots + z_p^2 \sim \chi^2_p$


If y ∼ Np (µ, Σ), then (y − µ)′ Σ−1 (y − µ) ∼ χ2p (i.e. ∆2 ∼ χ2p )
4. Independence:
For a multivariate normal vector y with covariance matrix Σ, if σij = 0 for some i ̸= j, then the corresponding variables yi and yj are independent.
5. Ay Calculations:
Given a distribution y ∼ Np(µ, Σ), we can find the distribution of z = C1 y1 + C2 y2 + . . . + Cp yp as follows (assume for the example that p = 3; a code sketch appears at the end of this list):

(a) Create the vector a = (C1, C2, C3)′
(b) Calculate the mean as a′µ
(c) Calculate the variance as a′Σa

To find the joint distribution of two linear combinations z1 and z2, create the matrix A whose rows hold their coefficients,

$$A = \begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \end{pmatrix}$$

where $C_{ij}$ is the coefficient of $y_j$ in $z_i$, and apply the same steps as above: mean Aµ and covariance AΣA′.

For example, to find the joint distribution of y1, y3 and ½(y1 + y2), set the matrix A as:

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{pmatrix}$$

6. Maximum Likelihood Estimation:
Given a random sample y1 , y2 , . . . , yn from the distribution Np (µ, Σ), we have the following MLE’s:
(a) $\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

(b) $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})' = \frac{n-1}{n} S$

(c) $E(\bar{y}) = \frac{1}{n} \sum_{i=1}^{n} E(y_i) = \mu$, so $\hat{\mu}$ is unbiased

(d) $\mathrm{Cov}(\bar{y}) = \frac{1}{n}\Sigma$; note that $\hat{\Sigma}$ is biased, since $E(\hat{\Sigma}) = \frac{n-1}{n}\Sigma$

Therefore, $\bar{y} \sim N_p(\mu, \Sigma/n)$.

7. Wishart Distribution:
In our covariance matrix S, there are p variances and $\binom{p}{2}$ covariances, for a total of $p + \binom{p}{2} = \frac{p(p+1)}{2}$ distinct entries in S.

If we let W = (n − 1)S, then the joint distribution of these $\frac{p(p+1)}{2}$ distinct entries of W is the Wishart distribution, denoted

$$W \sim W_p(n - 1, \Sigma)$$

where n − 1 denotes the degrees of freedom.
The Wishart distribution is the multivariate analog of the chi-square distribution: in the univariate case,

$$\sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
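Below is a minimal sketch of the Ay calculation from item 5, assuming p = 3 and using made-up values for µ and Σ chosen only for illustration.

# Distribution of z = Ay when y ~ N3(mu, Sigma)
# (mu and Sigma are hypothetical values chosen for illustration)
mu <- c(1, 2, 3)
Sigma <- matrix(c(4, 1, 0,
                  1, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)

# Joint distribution of z1 = y1, z2 = y3, z3 = (y1 + y2)/2
A <- rbind(c(1,   0,   0),
           c(0,   0,   1),
           c(1/2, 1/2, 0))

A %*% mu              # mean vector of z = Ay
A %*% Sigma %*% t(A)  # covariance matrix of z = Ay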

2 Code

#### Checks for normality


library(mvtnorm)

par(mfrow = c(1,2), pty = "s")


f1 <- rmvnorm(n = 5000, mean = c(0,0), sigma = matrix(c(1, 2, 2, 4), ncol= 2))
d1 <- data.frame(f1)

# Q-Q plots for each variable


qqnorm(d1$X1)
qqline(d1$X1, col = 2)
qqnorm(d1$X2)
qqline(d1$X2, col = 2)

[Figure: two normal Q-Q plots ("Normal Q-Q Plot"), one per simulated variable, showing sample quantiles vs. theoretical quantiles with reference lines.]

#### scatter plots (2D)


#### library(plotly)

# plot_ly(d1, x= ~X1, y=~X2)

#### density (3D)


# dens <- kde2d(d1$X1,d1$X2)
# plot_ly(x = dens$x, y = dens$y,z = dens$z) %>% add_surface()
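As an additional normality check (a sketch based on items 1 and 3 of this section), the squared Mahalanobis distances of multivariate normal data should behave like χ²p draws. The sigma used for f1 above is singular (its correlation equals 1), so its sample covariance matrix cannot be inverted; fresh data with a nonsingular, made-up sigma are simulated instead.

# Simulate nondegenerate bivariate normal data (hypothetical parameters)
f2 <- rmvnorm(n = 5000, mean = c(0, 0), sigma = matrix(c(1, 0.5, 0.5, 2), ncol = 2))

# Squared Mahalanobis distances (y - ybar)' S^{-1} (y - ybar)
d2 <- mahalanobis(f2, center = colMeans(f2), cov = cov(f2))

# Under multivariate normality these should follow a chi-square(p = 2) distribution
qqplot(qchisq(ppoints(length(d2)), df = 2), d2,
       xlab = "Chi-square(2) quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1, col = 2)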

3 Hypothesis Testing
A general procedure for hypothesis testing includes the following steps:

1. Define the population parameter(s)*

2. State the null hypothesis H0 and the alternative/research hypothesis H1 *


3. State the test statistic and its distribution under H0 *
4. Based on the random sample, compute the observed test statistic

5. Compute the p-value


6. Draw a conclusion

*These should be stated before a random sample is collected

3.1 Tests on One or Two Mean Vectors


1. Tests on µ with Σ known:

(a) Parameters:
µ: population mean vector (unknown), µ = E(y) = (E(y1), . . . , E(yp))′
Σ: population covariance matrix, Σ = cov(y) (known)
(b) Hypotheses:
H0 : µ = µ0 (µ0 is a given vector)
H1 : µ ̸= µ0 (do not use > or <)
(c) Random Sample:
y1, y2, . . . , yn ∼ Np(µ, Σ)
We compute: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
(d) Test Statistic:
Under H0, the test statistic is calculated and distributed as:

$$z^2 = n(\bar{y} - \mu_0)'\Sigma^{-1}(\bar{y} - \mu_0) \sim \chi^2_p$$

Then compute the observed value $z^2_{obs}$.
(e) P-value:
The p-value is the probability, computed assuming H0 is true, of observing a test statistic at least as extreme as the one actually observed:

$$\text{p-value} = P(\chi^2_p \geq z^2_{obs})$$

(f) Conclusion:
We define:
i. The acceptance region: $z^2_{obs} \leq \chi^2_{\alpha,p}$ (we do not reject H0)
ii. The rejection region: $z^2_{obs} > \chi^2_{\alpha,p}$ (we reject H0)
(A code sketch of this test appears after item 3 below.)

2. Tests on µ with Σ unknown:
(a) Hypotheses:
H0 : µ = µ0 (µ0 is a given vector)
H1 : µ ̸= µ0 (do not use > or <)
(b) Random Sample:
y1, y2, . . . , yn ∼ Np(µ, Σ)
We compute: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $S = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})'$
(c) Test Statistic:
Under H0, the test statistic is calculated and distributed as:

$$T^2 = n(\bar{y} - \mu_0)' S^{-1} (\bar{y} - \mu_0)$$

where $T^2 \sim T^2_{p,n-1}$ denotes Hotelling's $T^2$ distribution with:
i. p = the dimension of S
ii. n − 1 = the degrees of freedom
(d) Conclusion:
We reject H0 if $T^2_{obs} > T^2_{\alpha,p,n-1}$
(e) Conversion to F-statistic:
The statistic $T^2$ can be converted to an F-statistic as follows:

$$\frac{v - p + 1}{vp} T^2_{p,v} = F_{p,v-p+1}$$
3. Comparing two mean vectors:
Assumptions:

(a) The two samples are independent


(b) Σ1 = Σ2 = Σ (Σ is unknown)

(a) Hypotheses:
H0 : µ1 = µ2
H1 : µ1 ̸= µ2
(b) Random Sample:
Sample one: y11, y12, . . . , y1n1 ∼ Np(µ1, Σ1)
Sample two: y21, y22, . . . , y2n2 ∼ Np(µ2, Σ2)
Compute the following:
i. The sample mean vectors $\bar{y}_1, \bar{y}_2$
ii. The sample covariance matrices $S_1, S_2$
iii. The pooled covariance matrix

$$S_{pl} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n_1 + n_2 - 2}$$

Note that $E(S_{pl}) = \Sigma$.
(c) Test Statistic:
Under H0, the test statistic is computed and distributed as:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{y}_1 - \bar{y}_2)' S_{pl}^{-1} (\bar{y}_1 - \bar{y}_2) \sim T^2_{p,\, n_1 + n_2 - 2}$$

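A minimal sketch of the known-Σ test from item 1. The values of Σ0, µ0, and the simulated data are hypothetical; rmvnorm() comes from the mvtnorm package loaded in the Section 2 code.

# z^2 test for H0: mu = mu0 when Sigma is known (hypothetical values)
set.seed(1)
Sigma0 <- matrix(c(2, 0.5, 0.5, 1), ncol = 2)
mu0 <- c(0, 0)
y <- rmvnorm(n = 40, mean = c(0.2, -0.1), sigma = Sigma0)

n <- nrow(y)
p <- ncol(y)
z2 <- n * t(colMeans(y) - mu0) %*% solve(Sigma0) %*% (colMeans(y) - mu0)
pvalue <- 1 - pchisq(drop(z2), df = p)
c(z2 = drop(z2), pvalue = pvalue)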
3.1.1 Two mean vector tests when samples are not independent

Our hypotheses are:


H0 : µ1 = µ2
H1 : µ1 ̸= µ2
and our samples
Sample one: y11, y12, . . . , y1n ∼ Np(µ1, Σ1)
Sample two: y21, y22, . . . , y2n ∼ Np(µ2, Σ2)
are not independent (e.g., paired observations), then we proceed by defining the following:

1. di = y1i − y2i
2. µd = µ1 − µ2
3. $\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i$ (the sample mean vector)
4. Sd, the sample covariance matrix computed from d1, . . . , dn
5. H0 : µd = 0
6. H1 : µd ̸= 0
7. The test statistic $T^2 = n \bar{d}' S_d^{-1} \bar{d}$
8. $T^2 \sim T^2_{p,n-1}$ under H0

Assumptions:

1. No assumptions are needed for Σ1 and Σ2


2. d1 , . . . , dn ∼ Np (µd , Σd )

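A minimal sketch of the paired test above, using simulated paired data (the means and covariance matrices are made up for illustration; rmvnorm() is from the mvtnorm package loaded earlier).

# Paired T^2 test for H0: mu_d = 0 (simulated, hypothetical parameters)
set.seed(2)
y1_mat <- rmvnorm(n = 30, mean = c(5, 10, 15), sigma = diag(3))
y2_mat <- y1_mat + rmvnorm(n = 30, mean = c(0.3, 0, -0.2), sigma = 0.5 * diag(3))

d <- y1_mat - y2_mat           # differences d_i = y1_i - y2_i
n <- nrow(d); p <- ncol(d)
dbar <- colMeans(d)
Sd <- cov(d)

T2 <- n * t(dbar) %*% solve(Sd) %*% dbar
F_obs <- (n - p) * drop(T2) / ((n - 1) * p)   # convert T^2 to an F statistic
pvalue <- 1 - pf(F_obs, p, n - p)
c(T2 = drop(T2), F = F_obs, pvalue = pvalue)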
3.1 Code

EDA Visualization

library(Hotelling)
library(tidyverse)
library(data.table)
data_3 <- read.table("../Examples_Code/T5_1_PSYCH.DAT")
colnames(data_3) <- c("Group","y1","y2","y3","y4")

melted_data <- data_3 %>% melt(id = "Group")


m_1 <- melted_data[melted_data$Group == 1,]
m_2 <- melted_data[melted_data$Group == 2,]

ggplot() +
geom_boxplot(data = m_1, aes(x = variable, y = value, colour = variable),
notch = TRUE) +
geom_boxplot(data = m_2, aes(x = variable, y = value, colour = variable, fill = variable),
notch = TRUE, alpha = 0.1) +
labs(title = "Overlaying box plot", caption = "Shaded box plots = group 2")

[Figure: "Overlaying box plot" - overlaid box plots of y1-y4 for the two groups; shaded box plots are group 2.]

EDA Summary Statistics

x <- data_3[data_3[,1] == 1, 2:5]


y <- data_3[data_3[,1] == 2, 2:5]
colMeans(x)

## y1 y2 y3 y4
## 15.96875 15.90625 27.18750 22.75000

colMeans(y)

## y1 y2 y3 y4
## 12.34375 13.90625 16.65625 21.93750

round(cov(x),2)

## y1 y2 y3 y4
## y1 5.19 4.55 6.52 5.25
## y2 4.55 13.18 6.76 6.27
## y3 6.52 6.76 28.67 14.47
## y4 5.25 6.27 14.47 16.65

round(cov(y),2)

## y1 y2 y3 y4
## y1 9.14 7.55 4.86 4.15
## y2 7.55 18.60 10.22 5.45
## y3 4.86 10.22 30.04 13.49
## y4 4.15 5.45 13.49 28.00

Tests for two mean vectors

print(hotelling.test(x, y))

## Test stat: 97.601


## Numerator df: 4
## Denominator df: 59
## P-value: 1.464e-11

Test Statistic for one mean vector

data_3 <- data_3[,-1]


n <- dim(data_3)[1]
p <- dim(data_3)[2]
v <- n - 1

colMeans(data_3)

## y1 y2 y3 y4
## 14.15625 14.90625 21.92188 22.34375

mu0 <- c(14, 15, 22, 22) # Not a significant difference


T2 <- n * t(colMeans(data_3) - mu0) %*% solve(cov(data_3)) %*% (colMeans(data_3) - mu0)
T2

## [,1]
## [1,] 1.036205

mu0 <- c(12, 17, 23, 22) # Significant difference


T2_2 <- n * t(colMeans(data_3) - mu0) %*% solve(cov(data_3)) %*% (colMeans(data_3) - mu0)
T2_2

## [,1]
## [1,] 135.0627

F_obs <- (v - p + 1) * T2/(v * p)


F_obs

## [,1]
## [1,] 0.2467156

F_obs_2 <- (v - p + 1) * T2_2/(v * p)


F_obs_2

## [,1]
## [1,] 32.15778

pvalue <- 1 - pf(F_obs, p, v - p + 1)


pvalue

## [,1]
## [1,] 0.9105544

pvalue_2 <- 1 - pf(F_obs_2, p, v - p + 1)


pvalue_2

## [,1]
## [1,] 2.553513e-14

3.2 Tests on Covariance Matrices
1. Testing a specified pattern for Σ
Assumptions:
i. Independent observations
ii. Normality
iii. µ is unknown
iv. Σ0 is given
(a) Hypotheses:
H0 : Σ = Σ 0
H1 : Σ ̸= Σ0
(b) Random Sample:
y1 , y2 , . . . , yn ∼ Np (µ, Σ)
(c) Test Statistic:
Let:
i. S = the sample covariance matrix
ii. v = n − 1 = degrees of freedom of S
iii. p = the number of variables
Then the test statistic is:
When v = n − 1 is large:
$$u = v\left[\ln|\Sigma_0| - \ln|S| + \mathrm{tr}(S\Sigma_0^{-1}) - p\right]$$

When v = n − 1 is moderate, u′ gives a better approximation:

$$u' = u \cdot \left[1 - \frac{1}{6v - 1}\left(2p + 1 - \frac{2}{p+1}\right)\right]$$

Note that u can be written as:

$$u = v\left[\sum_{i=1}^{p} (\lambda_i - \ln\lambda_i) - p\right]$$

where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of $S\Sigma_0^{-1}$.
(d) Conclusion:
Under H0, u (or u′) is approximately distributed as $\chi^2_{\frac{1}{2}p(p+1)}$.
Notice that if S = Σ0, then u = 0.

Special Case:
If we wish to test whether the variables y1, . . . , yp are independent and have unit variance, in other words:

y ∼ Np(µ, Ip)

then we set Σ0 = I:
H0 : Σ = I
H1 : Σ ̸= I

2. Testing Sphericity
(a) Hypotheses:
H0 : Σ = σ 2 I
H1 : Σ ̸= σ 2 I
In this case, σ 2 is unknown, therefore the test statistic needs to be derived differently.
(b) Test Statistic:
Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of S, and define

$$u = \frac{p^p |S|}{(\mathrm{tr}(S))^p} = \frac{p^p \prod_{i=1}^{p} \lambda_i}{\left(\sum_{i=1}^{p} \lambda_i\right)^p}$$

Then we get the following test statistic:

$$-2\ln(LR) = -n\ln(u)$$

which is approximately distributed $\chi^2_{df}$ for large n, where

$$df = \frac{p(p+1)}{2} - 1$$

A code sketch of this test appears at the end of the 3.2 Code section below.
3. Testing for Equality of Covariance Matrices
(a) Hypotheses:
H0 : Σ1 = Σ2 = . . . = Σk
H1 : At least two matrices are different
(b) Random Sample:
There are k independent samples:
Sample 1: n1 observations of y from Np (µ1 , Σ1 )
..
.
Sample k: nk observations of y from Np (µk , Σk )
Assumption: In order for S1 , . . . , Sk to be nonsingular, ni > p + 1, i = 1, . . . , k
(c) Test statistic: From the k samples, we calculate the test statistic as follows:
i. Calculate the sample covariance matrices S1 , . . . , Sk for each sample
ii. Calculate Spl:

$$S_{pl} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2 + \ldots + (n_k - 1)S_k}{\sum_{i=1}^{k} n_i - k}$$

iii. Calculate M:

$$M = \frac{|S_1|^{(n_1-1)/2} \cdot |S_2|^{(n_2-1)/2} \cdot \ldots \cdot |S_k|^{(n_k-1)/2}}{|S_{pl}|^{\sum_{i=1}^{k}(n_i-1)/2}}$$

iv. Calculate c1:

$$c_1 = \left[\sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{\sum_{i=1}^{k}(n_i - 1)}\right] \cdot \frac{2p^2 + 3p - 1}{6(p+1)(k-1)}$$

v. Calculate u:

$$u = -2(1 - c_1)\ln(M)$$

Under H0, u is approximately distributed:

$$\chi^2_{\frac{1}{2}(k-1)p(p+1)}$$

3.2 Code

Test for equality of two covariance matrices

data_3_2 <- read.table("../Examples_Code/T5_1_PSYCH.DAT")


colnames(data_3_2) <- c("Group","y1","y2","y3","y4")

X <- data_3_2[data_3_2[,1] == 1, 2:5]


Y <- data_3_2[data_3_2[,1] == 2, 2:5]

Sx <- cov(X)
Sy <- cov(Y)
v1 <- dim(X)[1] - 1
v2 <- dim(Y)[1] - 1

Spl <- (v1 * Sx + v2 * Sy)/(v1 + v2)

#test statistic
u <- -v1 * log(det(Sx)) - v2 * log(det(Sy)) + (v1 + v2) * log(det(Spl)) # 14.561

#chi-square approximation
p <- 4
c1 <- (1/v1 + 1/v2 - 1 / (v1 + v2)) * (2 * p^2 + 3 * p - 1)/(6 * (p + 1) * (2 - 1)) #0.069
u2 <- (1 - c1) * u # 13.550
df1 <- 0.5 * (2 - 1) * p * (p + 1) # 10

pvalue <- 1 - pchisq(u2, df1) # 0.195
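A minimal sketch of the sphericity test from item 2 of this subsection, applied to group 1 (the object X above); it follows the −n ln(u) statistic as stated in the notes.

# Sphericity test: H0: Sigma = sigma^2 * I for group 1 (X)
n_x <- nrow(X)
p <- ncol(X)
S_x <- cov(X)

u_spher <- (p^p * det(S_x)) / (sum(diag(S_x))^p)
stat <- -n_x * log(u_spher)           # -2 ln(LR) = -n ln(u)
df_spher <- p * (p + 1)/2 - 1
pvalue_spher <- 1 - pchisq(stat, df_spher)
c(statistic = stat, df = df_spher, pvalue = pvalue_spher)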

4 MANOVA
1. MANOVA: A procedure for comparing multivariate sample means. It is used when there are two or more dependent variables, and is often followed by significance tests involving the individual dependent variables separately.
2. Variable Names:
(a) n = the number of observations
(b) p = the number of variables
(c) k = the number of samples
3. Assumptions: For a data set of k samples, where every observation is denoted yij with i = 1, . . . , k and j = 1, . . . , n, and each yij is a (p × 1) vector containing p variables (attributes for each observation), we assume:

yij ∼ Np (µi , Σ)

The k samples are independent.


When p = 1, it becomes ANOVA

Sample 1 Sample 2 ... Sample k


y11 y21 ... yk1
y12 y22 ... yk2
.. .. .. ..
. . . .
y1n y2n ... ykn

4. Notation:
We can write the MANOVA model as:

yij = µi + ϵij , i = 1, . . . , k, j = 1, . . . , n

5. Hypotheses:
H0 : µ1 = µ2 = . . . = µk
H1 : At least two µ’s are different
6. Test Statistic:
Begin by calculating the following:

(a) The sample (column) means:

$$\bar{y}_{i.}, \quad i = 1, \ldots, k$$

(b) The overall mean:

$$\bar{y}_{..} = \frac{1}{nk} \sum_{i=1}^{k} \sum_{j=1}^{n} y_{ij}$$

(c) The "between" matrix H:

$$H = n \sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..})(\bar{y}_{i.} - \bar{y}_{..})'$$

(d) The "within" matrix E:

$$E = \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})(y_{ij} - \bar{y}_{i.})'$$

E is also called the error matrix.

The expectation of E is $E(E) = k(n-1)\Sigma$, and

$$\mathrm{tr}\left((E + H)^{-1}H\right) = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i}$$

where $\lambda_1, \ldots, \lambda_s$ are the nonzero eigenvalues of $E^{-1}H$.

There are three main types of test statistics for MANOVA:

(a) Wilks

$$\Lambda = \frac{|E|}{|E + H|}$$

We reject H0 if $\Lambda \leq \Lambda_{\alpha,p,v_H,v_E}$. If H is "small" relative to E, then we probably do not have evidence to reject H0.
Note that the Wilks test statistic is invariant to scale changes (the test statistic remains the same when the data are multiplied uniformly by constant values).
(b) Roy

$$\theta = \frac{\lambda_1}{1 + \lambda_1}$$

where $\lambda_1$ is the largest eigenvalue of $E^{-1}H$.
We reject H0 if $\theta \geq \theta_{\alpha,s,m,n}$.
(c) Lawley-Hotelling
Let $\lambda_1, \ldots, \lambda_s$ be the eigenvalues of $E^{-1}H$; then

$$U^{(s)} = \sum_{i=1}^{s} \lambda_i = \mathrm{tr}(E^{-1}H)$$

We reject H0 if $U^{(s)} \geq U^{(s)}_{\alpha}$.
(A sketch computing H, E, and Λ directly appears at the end of the 4 Code section.)

4 Code

data_4 <- read.table(file="../Examples_Code/T6_2_ROOT.DAT")


colnames(data_4) <- c("rootstock","y1","y2","y3","y4")

rootstock <- data_4[,1] # Isolate the root stock observations


rootstock <- as.factor(rootstock)
summary(rootstock)

## 1 2 3 4 5 6
## 8 8 8 8 8 8

y1 <- data_4[,2]
y2 <- data_4[,3]
y3 <- data_4[,4]
y4 <- data_4[,5]

root.manova <- manova(cbind(y1, y2, y3, y4) ~ rootstock)


summary(root.manova, test = "Wilks")

## Df Wilks approx F num Df den Df Pr(>F)


## rootstock 5 0.15401 4.9369 20 130.3 7.714e-09 ***
## Residuals 42
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

summary(root.manova, test = "Roy")

## Df Roy approx F num Df den Df Pr(>F)


## rootstock 5 1.8757 15.756 5 42 1.002e-08 ***
## Residuals 42
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

summary(root.manova, test = "Hotelling")

## Df Hotelling-Lawley approx F num Df den Df Pr(>F)


## rootstock 5 2.9214 5.4776 20 150 2.568e-10 ***
## Residuals 42
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

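As a sketch of how the matrices behind these tests are built (item 6 of this section), H, E, and Wilks' Λ can be computed directly for this balanced design; the Wilks value should reproduce the one reported by summary() above.

# Manual computation of H, E and Wilks' Lambda for the rootstock data
Ymat <- as.matrix(data_4[, 2:5])
groups <- split(as.data.frame(Ymat), rootstock)   # k = 6 groups, n = 8 each
n <- 8
ybar_overall <- colMeans(Ymat)

H <- matrix(0, 4, 4)
E <- matrix(0, 4, 4)
for (g in groups) {
  g <- as.matrix(g)
  ybar_i <- colMeans(g)
  H <- H + n * tcrossprod(ybar_i - ybar_overall)   # "between" matrix
  E <- E + crossprod(sweep(g, 2, ybar_i))          # "within" (error) matrix
}

det(E) / det(E + H)          # Wilks' Lambda
sum(diag(solve(E) %*% H))    # Lawley-Hotelling trace, sum of eigenvalues of E^-1 H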
5 Discriminant Analysis
1. Two Groups:
Suppose we have two groups:
Group 1: y11 , y12 , . . . , y1n1 ∼ Np (µ1 , Σ)
Group 2: y21 , y22 , . . . , y2n2 ∼ Np (µ2 , Σ)
And we reject H0 : µ1 = µ2
Then we may want to do further analysis on the two groups.
Here, we try to find a linear combination such that the distance between the two transformed groups
is maximized. This transformation is denoted:

z = a′ y

After transformation, each group becomes:


Group 1: z1i = a′ y1i = a1 y1i1 + a2 y1i2 + . . . + ap y1ip , for i = 1, . . . , n1
Group 2: z2i = a′ y2i = a1 y2i1 + a2 y2i2 + . . . + ap y2ip , for i = 1, . . . , n2
We define the sample means of the transformed groups as $\bar{z}_1$ and $\bar{z}_2$.
Since we assume that the two groups have the same covariance matrix Σ, the transformed groups have the same variance too.
The pooled variance estimate is:

$$S_z^2 = a' S_{pl} a$$

where:

$$S_{pl} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n_1 + n_2 - 2}$$

The (standardized) distance between the two transformed groups is defined as:

$$d(a) = \frac{\bar{z}_1 - \bar{z}_2}{S_z}$$

We can compute a such that the squared distance $d^2(a)$ is maximized using:

$$a = S_{pl}^{-1}(\bar{y}_1 - \bar{y}_2)$$

If we reject H0 : µ1 = µ2, then we will also reject H0 : µz1 = µz2.

For testing H0 : µz1 = µz2 versus H1 : µz1 ̸= µz2, we use the test statistic:

$$T^2 = (\bar{y}_1 - \bar{y}_2)' \left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{pl}\right]^{-1} (\bar{y}_1 - \bar{y}_2)$$

(A code sketch of the two-group discriminant function appears at the end of the 5 Code section below.)

2. Several Groups:
Suppose there are k groups where k > 2, then we need more than one discriminant function to describe
group separation.
Let µ1 , µ2 , . . . , µk be the k group means.
H0 : µ1 = µ2 = . . . = µk
H1 : µi ̸= µj for some i and j.
Use the matrices E and H to calculate E −1 H with the corresponding eigenvalues λ1 , λ2 , . . . , λs (notice
that the eigenvalues decrease in value).
The corresponding eigenvectors are $a_1, a_2, \ldots, a_s$, and the s discriminant functions are:

$$z_1 = a_1'y, \quad z_2 = a_2'y, \quad \ldots, \quad z_s = a_s'y$$

The relative importance of each discriminant function $z_i$ can be assessed by considering its eigenvalue as a proportion of the total:

$$\frac{\lambda_i}{\sum_{j=1}^{s} \lambda_j}$$

Note the following:

(a) $E^{-1}H$ is a p × p matrix
(b) E is non-singular provided $n_1 + n_2 + \ldots + n_k - k \geq p$
(c) H can be singular; if $k - 1 < p$, H is singular
(d) $\lambda_1, \ldots, \lambda_s > 0$, with $s \leq k - 1$ and $s \leq p$

5 Code

library(MASS)
data_5 <- read.table(file="../Examples_Code/T6_2_ROOT.DAT")
colnames(data_5) <- c("rootstock","y1","y2","y3","y4")

rootstock <- data_5[,1]


rootstock <- as.factor(rootstock)

y1 <- data_5[,2]
y2 <- data_5[,3]
y3 <- data_5[,4]
y4 <- data_5[,5]

fit1 <- lda(rootstock ~ y1 + y2 + y3 + y4)

a1 <- fit1$scaling[,1]# Extract eigenvectors


a2 <- fit1$scaling[,2]
z1 <- a1[1] * y1 + a1[2] * y2 + a1[3] * y3 + a1[4] * y4 # Calculate LDA's
z2 <- a2[1] * y1 + a2[2] * y2 + a2[3] * y3 + a2[4] * y4
plot(z1, z2, col=rootstock)
[Figure: scatter plot of the discriminant scores z1 (x-axis) vs. z2 (y-axis), with points coloured by rootstock group.]

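A minimal sketch of the two-group discriminant function from item 1 of this section, re-reading the psychological test data used in Section 3 (the object names psych, g1, g2 are introduced here for illustration).

# Two-group discriminant function a = Spl^{-1}(ybar1 - ybar2)
psych <- read.table("../Examples_Code/T5_1_PSYCH.DAT")
colnames(psych) <- c("Group", "y1", "y2", "y3", "y4")

g1 <- psych[psych$Group == 1, -1]
g2 <- psych[psych$Group == 2, -1]
n1 <- nrow(g1); n2 <- nrow(g2)

Spl <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)
a <- solve(Spl) %*% (colMeans(g1) - colMeans(g2))   # discriminant coefficients
a

# Means of the transformed groups, zbar1 and zbar2
zbar1 <- drop(t(a) %*% colMeans(g1))
zbar2 <- drop(t(a) %*% colMeans(g2))
c(zbar1 = zbar1, zbar2 = zbar2)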
6 Classification Analysis
1. Classification into Two Groups:
Assumptions:
(a) The two groups (populations) have the same covariance matrix
(b) Normality is not required for Fisher’s procedure

Group 1: y1 , S1
Group 2: y2 , S2
Fisher's procedure:
The procedure is based on the discriminant function:

$$z = a'y = (\bar{y}_1 - \bar{y}_2)' S_{pl}^{-1} y$$

(a) Use the discriminant function to project the p-dimensional data onto one dimension.
(b) Calculate $\bar{z}_1, \bar{z}_2$:

$$\bar{z}_1 = (\bar{y}_1 - \bar{y}_2)' S_{pl}^{-1} \bar{y}_1, \qquad \bar{z}_2 = (\bar{y}_1 - \bar{y}_2)' S_{pl}^{-1} \bar{y}_2$$

(c) If z is closer to $\bar{z}_i$, classify the new observation into group i.

If we assume that $\bar{z}_1 > \bar{z}_2$, then assign the observation to Group 1 if $z > \frac{1}{2}(\bar{z}_1 + \bar{z}_2)$. (A code sketch of this rule appears just before the 6 Code section.)

This is a linear classification rule since it is a linear function of y. If the two populations are normally distributed with equal covariance matrices, this rule minimizes the misclassification rate.
2. Prior Probabilities:
If prior probabilities p1, p2 are known for the two populations, then the optimal classification rule is:

Assign y to G1 (Group 1) if:

$$p_1 f(y|G_1) > p_2 f(y|G_2)$$

3. Misclassification Cost:
If the cost of misclassifying an observation is different for the two groups, we can add this information to the optimal rule. Let:

(a) C12 be the cost of misclassifying an observation from group 1 into group 2
(b) C21 be the cost of misclassifying an observation from group 2 into group 1

Then the rule becomes: assign y to G1 if

$$p_1 C_{12} f(y|G_1) > p_2 C_{21} f(y|G_2)$$

(equivalently, the costs enter the cutoff through the term $\ln\frac{p_2 C_{21}}{p_1 C_{12}}$).
4. Quadratic Classification Rule:
If the two population covariance matrices are not the same, then we use the distance measures to
do the classification:

$$D_i^2(y) = (y - \bar{y}_i)' S_i^{-1} (y - \bar{y}_i)$$

Assign y to the group i for which $D_i^2(y)$ is smallest.

5. Classification into Several Groups:
Use the quadratic classification function, where each group has sample mean $\bar{y}_i$ and covariance $S_i$, and the new observation y is to be classified.
Use the distance measure $D_i^2(y)$ and assign y to the group for which $D_i^2(y)$ is the smallest.
If there are prior probabilities p1, . . . , pk, then define:

$$Q_i(y) = \ln(p_i) - \frac{1}{2}\ln|S_i| - \frac{1}{2}(y - \bar{y}_i)' S_i^{-1} (y - \bar{y}_i)$$

Assign y to the group for which $Q_i(y)$ is largest.

This requires a normality assumption.

If the covariance matrices are equal across the groups, the classification functions simplify to linear classification functions.

6. Misclassification Rates:
In order to visualize and calculate the misclassification rate, a confusion matrix is used, in which the off-diagonal entries count misclassifications,
e.g. n12 = the number of group 1 observations misclassified as group 2 observations.
We can calculate the error rate as follows:

$$\text{error rate} = \frac{\sum_{i \neq j} n_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij}}$$

We can also partition the sample to get improved estimates of error rates using cross-validation (e.g., leave-one-out, as in the holdout function at the end of the 6 Code section).

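A minimal sketch of Fisher's linear classification rule from item 1; it reuses psych, g1, g2, and a from the sketch at the end of the 5 Code section, and its table can be compared with the lda() table produced below.

# Fisher's linear rule: assign to group 1 if z > (zbar1 + zbar2)/2
# (zbar1 > zbar2 here because a'(ybar1 - ybar2) > 0)
cutoff <- 0.5 * drop(t(a) %*% (colMeans(g1) + colMeans(g2)))

z <- drop(as.matrix(psych[, -1]) %*% a)
predicted <- ifelse(z > cutoff, 1, 2)
table(psych$Group, predicted)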
6 Code

data_6 <- read.table("../Examples_Code/T5_1_PSYCH.DAT")


colnames(data_6) <- c("Group","y1","y2","y3","y4")
data_6 <- data.frame(data_6)

#Linear classification
LinearModel <- lda(Group~., data_6, prior=c(1, 1)/2)
predict(LinearModel, data_6)$class

## [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2
## Levels: 1 2

# Classification table
table(data_6$Group, predict(LinearModel,data_6)$class)

##
## 1 2
## 1 28 4
## 2 4 28

misclassification_rate <- (4 + 4)/(32 + 32)


misclassification_rate * 100

## [1] 12.5

# Quadratic Classification
Group <- data_6[,1]
data_6_2 <- data_6[,-1]
Quadratic_model <- qda(data_6_2, Group)
predict(Quadratic_model, data_6_2)$class

## [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## Levels: 1 2

table(data_6$Group, predict(Quadratic_model, data_6_2)$class)

##
## 1 2
## 1 28 4
## 2 2 30

misclassification_rate_2 <- (2 + 4)/(32 + 32)


misclassification_rate_2 * 100

## [1] 9.375

# Hold out / leave-one-out cross-validation function (written for A4;
# the fish data there has a group column `cooking_method` with 3 groups)
holdout_function <- function(data, model){

  group_data <- data[,1]
  data_length <- length(group_data)
  prediction_results <- rep(0, data_length)   # 1 if observation i is classified correctly
  predictions <- rep(0, data_length)          # predicted class for observation i

  for(i in 1:data_length){

    # Fit the model on all observations except the i-th
    holdout_data <- data[-i,]

    if(model == "lda"){
      ho_model <- lda(cooking_method ~ ., holdout_data, prior = c(1, 1, 1)/3)
    } else {
      ho_model <- qda(cooking_method ~ ., holdout_data, prior = c(1, 1, 1)/3)
    }

    # Classify the held-out observation
    predicted_class <- predict(ho_model, newdata = data[i,])$class

    # Used for the error table
    predictions[i] <- predicted_class

    # Used for the error rate
    if(predicted_class == data[i, 1]){
      prediction_results[i] <- 1
    }
  }

  print(predictions)
  cat("Estimated error rate:", 1 - mean(prediction_results), "\n")
  return(table(data[,1], predictions))
}

7 Principal Component Analysis
Imagine we have a data matrix Yn×p with n observations and p variables. The idea of principal component
analysis (PCA) is to project Yn×p → Zn×k where k < p. This is a dimension reduction technique where the
variables zi are linear combinations of y1 , . . . , yp .

1. Theory:
Assumption: y1 , . . . , yn is a random sample from one population.
Let:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad S = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})'$$

S is a p × p matrix and may be singular.
Let $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ be the eigenvalues of S.
Let a1 , . . . , ap be the corresponding eigenvectors of S.
Then we can calculate the principal components as:

z1 = a1 ′ y, . . . , zp = ap ′ y

Also written as:

$$z = \begin{pmatrix} a_1'y \\ a_2'y \\ \vdots \\ a_p'y \end{pmatrix} = Ay$$

The sample covariance matrix of the principal component scores is:

$$S_z = ASA' = D = \begin{pmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_p \end{pmatrix}$$

Notice that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ and

$$\mathrm{var}(z_1) = \lambda_1, \quad \mathrm{var}(z_2) = \lambda_2, \quad \ldots, \quad \mathrm{var}(z_p) = \lambda_p$$

Because the eigenvalues are the variances of the principal components, the proportion of variance explained by the first k components is:

$$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_k}{\lambda_1 + \lambda_2 + \ldots + \lambda_p} = \frac{\lambda_1 + \lambda_2 + \ldots + \lambda_k}{\mathrm{tr}(S)}$$

Sometimes we want to scale the variables so that they all have variance 1 before computing the principal components. This is equivalent to using the correlation matrix R for principal component analysis. This also implies that PCA is not scale invariant.
2. Computation:
There are p principal components.
In practice, a decision must be made on how many principal components should be retained in order
to effectively summarize the data. Some common methods for making this decision include:

(a) Retain Sufficient Components: Retain enough components to account for a specified percentage of the total variance, usually around 80%. Often a few principal components will explain 80% or more of the variation in the data.
(b) Greater Than Average: Retain the components whose eigenvalues are greater than the average of the eigenvalues:

$$\lambda_i > \frac{\sum_{j=1}^{p} \lambda_j}{p}, \quad i = 1, 2, \ldots, k$$

(c) Scree Graph: Use a scree graph, a plot of $\lambda_i$ versus i, and look for a "natural break" between the "large" eigenvalues and the "small" eigenvalues.
(A code sketch of these criteria appears at the end of the 7 Code section.)

3. High Dimensional Data: For high dimensional data (p > n), the sample covariance matrix Sp×p is singular and has rank(S) ≤ n − 1. Thus, S will have at most (n − 1) non-zero eigenvalues, which means that the data can be projected onto at most an (n − 1)-dimensional space:

Yn×p → Zn×(n−1)

4. Outliers:
Outliers have a large influence on PCA; we can use box plots for each variable, or robust methods, to help identify and remove outliers from our data.

7 Code

library(tidyquant)
data_7 <- read.table("../Examples_Code/T3_5_DIABETES.DAT")
colnames(data_7)=c("obs","y1","y2","x1","x2","x3")
boxplot_data <- data_7[,-1] %>% gather() %>% mutate(key = as.factor(key))
boxplot_data %>%
ggplot(aes(x = key, y = value)) +
geom_boxplot(fill = "Skyblue3") +
labs(
title = "Variability of each variable",
x = "Variable",
y = "Values"
) + theme_tq()

[Figure: "Variability of each variable" - box plots of the values of x1, x2, x3, y1, y2.]

# PCA using S
result.pca <- prcomp(data_7[,-1], scale = FALSE)
eigen_S <- cov(data_7[,-1]) %>% eigen()
eigen_S$values

## [1] 3.466182e+03 1.264471e+03 8.952684e+02 6.933524e+01 1.142288e-02

eigen_S$vectors

## [,1] [,2] [,3] [,4] [,5]


## [1,] -0.0004000338 0.0007797631 -0.001790748 -0.002853710 0.9999939407
## [2,] 0.0080430117 -0.0166018253 -0.028590488 -0.999416799 -0.0028870982
## [3,] -0.1547283109 -0.6382201919 -0.753510273 0.030914777 -0.0008253663
## [4,] -0.7429697521 -0.4279397984 0.514468587 -0.013590620 0.0009189842
## [5,] -0.6511453350 0.6397392336 -0.408318158 -0.004182095 -0.0015024637

# PCA using R
result.pca_2 <- prcomp(data_7[,-1], scale = TRUE)
eigen_R <- cor(data_7[,-1]) %>% eigen()
eigen_R$values

## [1] 1.7172403 1.2338489 0.9602027 0.7866324 0.3020758

eigen_R$vectors

## [,1] [,2] [,3] [,4] [,5]


## [1,] -0.41553044 0.5295048 0.4181274 0.3993240 0.4611605
## [2,] -0.07352227 0.6840790 -0.1558701 -0.7012017 -0.1032041
## [3,] -0.36395661 0.1974475 -0.7620396 0.4356454 -0.2409541
## [4,] -0.54219580 -0.4258696 -0.2490388 -0.3857172 0.5602327
## [5,] -0.62887855 -0.1769468 0.3976800 -0.1014490 -0.6362078

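A sketch of the retention criteria from item 2 (Computation), applied to the covariance-based PCA above using the eigen_S object already computed.

# Proportion of variance explained by each component (covariance-based PCA)
prop_var <- eigen_S$values / sum(eigen_S$values)
round(cbind(proportion = prop_var, cumulative = cumsum(prop_var)), 4)

# Scree graph: eigenvalues lambda_i plotted against i
plot(eigen_S$values, type = "b",
     xlab = "Component i", ylab = expression(lambda[i]), main = "Scree graph")

# Average-eigenvalue criterion: retain components with lambda_i above the mean
which(eigen_S$values > mean(eigen_S$values))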
8 Cluster Analysis
Cluster analysis differs fundamentally from classification analysis in the sense that we are trying to group
“similar” observations into the same group. Typically, we have n observations that we are trying to assign
to k clusters where k is unknown.

1. Hierarchical Clustering Methods:


The first things we consider are measures of distance:
(a) Euclidean Distance (between two vectors x and y):

$$d(x, y) = \left((x - y)'(x - y)\right)^{1/2} = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$

(b) Statistical (Mahalanobis) Distance:

$$d(x, y) = \left((x - y)' S^{-1} (x - y)\right)^{1/2}$$

(c) Manhattan Distance:

$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$

For n observations: y1 , . . . , yn , we can compute the distance d(yi , yj ) between any two observations yi
and yj as dij = d(yi , yj ) where D = (dij ) is an n × n symmetric matrix where the diagonal elements
are all zero.

(a) Single Linkage: The distance between two clusters A and B is defined as the minimum distance
between a point in A and a point in B.
At each step of the single linkage method, the distance is found for every pair of clusters, and
we merge the two clusters with the smallest distance. The number of clusters is reduced by one.
After two clusters are merged, the procedure is repeated for the next step. The distance between
all pairs of clusters are calculated again, and the pair with minimum distance is merged into a
single cluster.
(b) Complete Linkage: This procedure is similar to the single linkage procedure, but the distance
between the two clusters A and B is defined as the maximum distance between a point in A
and a point in B.
(c) Average Linkage: If we let nA and nB be the number of points in A and B respectively, then
we define average linkage distance as:
$$D(A, B) = \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(y_i, y_j)$$

2. Nonhierarchical Methods:
(a) Partitioning Approach: In the partitioning approach, the observations are separated into g
clusters without using a hierarchical approach based on a matrix of distances or similarity measures
between all pairs of points.
(b) K-means Method:
An optimization method which allows the observations to be moved from one cluster to another.
This type of reallocation is not available in the hierarchical methods.
The algorithm executes as follows:
i. First, we select g clusters as "seeds".
ii. After the seeds are chosen, each remaining point in the data set is assigned to the cluster
with the nearest seed based on Euclidean distance.
iii. As soon as a cluster has more than one member, the cluster seed is replaced by the centroid.
iv. After all items are assigned to clusters, each item is examined to see if it is closer to the centroid of another cluster than to the centroid of its own cluster. If so, the item is moved to the new cluster, and the two cluster centroids are updated. This process is continued until no further improvement is possible. (A k-means code sketch appears at the end of the 8 Code section.)
(c) Choosing Seeds:
i. Select at random g observations that are at least a distance r apart
ii. Select the first g observations that are at least a distance r apart
iii. Select the g observations that are mutually farthest apart
iv. Use the g centroids from the g-cluster solution of the average linkage (hierarchical) clustering method

8 Code

data_8 <- read.table(file="../Examples_Code/T15_1_CITYCRIME.dat")


colnames(data_8) <- c("City", "y1", "y2", "y3", "y4", "y5", "y6", "y7")
rownames(data_8)=data_8[,1]
par(mfrow = c(1,2), pty = "s")
#Hierarchical Clustering
Y <- data_8[,-1]
d1 <- dist(Y, method = "euclidean")
Cluster1 <- hclust(d1, method = "average")
plot(Cluster1, cex = 0.5, hang = -0.1, xlab = "average linkage")
abline(h = 600, col = "blue", lty = 2)
#Use discriminant analysis to view the two clusters
group <- c(1,1,1,2,2,2,1,2,2,2,2,1,2,2,1,2) #From the dendrogram
y1 <- Y[,1]
y2 <- Y[,2]
y3 <- Y[,3]
y4 <- Y[,4]
y5 <- Y[,5]
y6 <- Y[,6]
y7 <- Y[,7]
fit1 <- lda(group ~ y1 + y2 + y3 + y4 + y5 + y6 + y7)
a1 <- fit1$scaling[,1]
z <- a1[1] * y1 + a1[2]*y2 + a1[3] * y3 + a1[4] * y4 + a1[5] * y5 + a1[6] * y6 + a1[7] * y7
plot(group, z, col=group)

[Figure: left, average-linkage cluster dendrogram ("Cluster Dendrogram", hclust(*, "average")) of the 16 cities with the cut at height 600 shown; right, plot of the discriminant score z against the two cluster groups.]

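A minimal sketch of the k-means method from the nonhierarchical section, using g = 2 clusters on the same city crime data; note that k-means cluster labels are arbitrary and need not match the dendrogram's numbering.

# K-means clustering of the city crime data with g = 2 clusters
set.seed(456)
km <- kmeans(Y, centers = 2, nstart = 25)   # Y = data_8[,-1] from above
km$cluster                                  # cluster assignment for each city
km$centers                                  # cluster centroids

# Compare the k-means partition with the two-cluster dendrogram solution
table(kmeans = km$cluster, hierarchical = group)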