STAT456 Study Guide
Nils DM
01/12/2021
1 Terminology
1. Multivariate Statistics: A subfield of statistics encompassing the simultaneous observation and
analysis of more than one response variable.
2. Multivariate Analysis: Based on the principles of multivariate statistics, multivariate analysis is
typically used to address situations in which multiple measurements are made on each experimental
unit and the relationships among these measurements and their structure are important.
3. Parameters:
Let y1, . . . , yp be p variables, where:
µi = E(yi), i = 1, 2, . . . , p
σi² = V(yi), i = 1, 2, . . . , p
These means and variances are collected in the mean vector µ and the covariance matrix Σ:
µ = E(y) = (µ1, µ2, . . . , µp)′
Σ = cov(y) =
[ σ1²  σ12  · · ·  σ1p ]
[ σ21  σ2²  · · ·  σ2p ]
[  ⋮     ⋮     ⋱     ⋮  ]
[ σp1  σp2  · · ·  σp² ]
4. Estimation of µ and Σ:
µ̂ = ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ
E(µ̂) = µ =⇒ unbiased estimator
Σ̂ = S = the sample covariance matrix, where:
Sjj = Sj² = (1/(n−1)) ∑ᵢ₌₁ⁿ (yij − ȳj)² : the sample variance of variable yj
Sjk = (1/(n−1)) ∑ᵢ₌₁ⁿ (yij − ȳj)(yik − ȳk) : the sample covariance of the j-th and k-th variables
S is an unbiased estimator of Σ and is positive semi-definite.
Let R denote the sample correlation matrix, where the diagonal entries are 1 and rjk = Sjk / √(Sjj · Skk)
1 Code
## V2 V3 V4
## 28.100 7.180 3.089
## V2 V3 V4
## V2 140.544444 49.680000 1.9412222
## V3 49.680000 72.248444 3.6760889
## V4 1.941222 3.676089 0.2501211
## V2 V3 V4
## V2 1.0000000 0.4930154 0.327411
## V3 0.4930154 1.0000000 0.864762
## V4 0.3274110 0.8647620 1.000000
## [1] 459.9555
## [1] 213.043
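The commands that produced this output are not shown; the two scalars at the end appear to be the generalized sample variance |S| and the total sample variance tr(S). A minimal sketch of the computations, assuming the three variables are held in a data frame named data_1 (the name is an assumption) with columns V2, V3, V4:

colMeans(data_1)        # sample mean vector
S <- cov(data_1)        # sample covariance matrix
S
cor(data_1)             # sample correlation matrix R
det(S)                  # generalized sample variance |S|
sum(diag(S))            # total sample variance tr(S)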
2 Multivariate Normal Distribution
1. Mahalanobis Distance:
A measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
∆² = (y − µ)′ Σ⁻¹ (y − µ)
2. Identity:
If A is a constant q × p matrix of rank q, where q ≤ p (so that the number of linear combinations does not exceed the number of variables), then:
Ay ∼ Nq(Aµ, AΣA′)
z ∼ Np(0, Ip)
Example of such a matrix A (with p = 3):
A = [ 1    0    0 ]
    [ 0    0    1 ]
    [ 1/2  1/2  0 ]
6. Maximum Likelihood Estimation:
Given a random sample y1 , y2 , . . . , yn from the distribution Np (µ, Σ), we have the following MLE’s:
(a) µ̂ = ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ
(b) Σ̂ = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ȳ)(yᵢ − ȳ)′ = ((n − 1)/n) S
(c) E(ȳ) = (1/n) ∑ᵢ₌₁ⁿ E(yᵢ) = µ (so µ̂ is unbiased)
(d) Cov(ȳ) = (1/n) Σ (note that Σ̂ is biased, since E(Σ̂) = ((n − 1)/n) Σ)
7. Wishart Distribution:
In our covariance matrix S there are p variances and p(p − 1)/2 covariances, for a total of p + p(p − 1)/2 = p(p + 1)/2 distinct entries in S.
If we let W = (n − 1)S, then the joint distribution of these p(p + 1)/2 distinct variables in W is the Wishart distribution, denoted:
Wp(n − 1, Σ)
where n − 1 denotes the degrees of freedom.
The Wishart distribution is the multivariate analog of the chi-square distribution; in the univariate case:
∑ᵢ₌₁ⁿ (yᵢ − ȳ)²/σ² = (n − 1)S²/σ² ∼ χ²(n−1)
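As a quick illustration (not from the original notes), R's rWishart() can simulate W = (n − 1)S directly; the Sigma and n below are made-up values:

set.seed(456)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)            # hypothetical covariance matrix
n <- 20
W <- rWishart(1, df = n - 1, Sigma = Sigma)[, , 1]  # one draw from W_2(n - 1, Sigma)
W / (n - 1)                                         # behaves like a sample covariance matrix S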
2 Code
[Figure: two normal Q–Q plots (sample quantiles vs. theoretical quantiles)]
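The code for these Q–Q plots is not included above; a generic sketch of producing side-by-side normal Q–Q plots (the variables here are simulated placeholders, not the course data):

set.seed(456)
z1 <- rnorm(100)              # placeholder variable 1
z2 <- rnorm(100, sd = 2)      # placeholder variable 2
par(mfrow = c(1, 2))
qqnorm(z1); qqline(z1)
qqnorm(z2); qqline(z2)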
3 Hypothesis Testing
A general procedure for hypothesis testing includes the following steps:
(a) Parameters:
µ: the population mean vector (unknown), µ = E(y) = (E(y1), . . . , E(yp))′
Σ: the population covariance matrix, Σ = cov(y) (assumed known)
(b) Hypotheses:
H0 : µ = µ0 (µ0 is a given vector)
H1 : µ ̸= µ0 (do not use > or <)
(c) Random Sample:
y1, y2, . . . , yn ∼ Np(µ, Σ)
We compute: ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ
(d) Test Statistic:
Under H0, the test statistic and its distribution are:
Z² = n(ȳ − µ0)′ Σ⁻¹ (ȳ − µ0) ∼ χ²p
Then compute z²obs.
(e) P-value:
The p-value is the probability, computed assuming the null hypothesis is true, of observing a value at least as extreme as z²obs:
p-value = P(χ²p ≥ z²obs)
(f) Conclusion:
We define:
i. The acceptance region: z²obs ≤ χ²α,p (we do not reject H0)
ii. The rejection region: z²obs > χ²α,p (we reject H0)
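A minimal sketch of this test in R, assuming y is an n × p data matrix and mu0 and Sigma are the given mean vector and known covariance matrix (all three names are placeholders):

n <- nrow(y); p <- ncol(y)
ybar <- colMeans(y)
z2_obs <- as.numeric(n * t(ybar - mu0) %*% solve(Sigma) %*% (ybar - mu0))
p_value <- pchisq(z2_obs, df = p, lower.tail = FALSE)   # reject H0 if p_value < alpha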
2. Tests on µ with Σ unknown:
(a) Hypotheses:
H0 : µ = µ0 (µ0 is a given vector)
H1 : µ ̸= µ0 (do not use > or <)
(b) Random Sample:
y1, y2, . . . , yn ∼ Np(µ, Σ)
We compute: ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ
and S = (1/(n−1)) ∑ᵢ₌₁ⁿ (yᵢ − ȳ)(yᵢ − ȳ)′
(c) Test Statistic:
Under H0, the test statistic and its distribution are:
T² = n(ȳ − µ0)′ S⁻¹ (ȳ − µ0)
where T² ∼ T²p,n−1 denotes Hotelling's T² distribution with:
i. p = the dimension of S
ii. n − 1 = the degrees of freedom
(d) Conclusion:
We reject H0 if T²obs > T²α,p,n−1
(e) Conversion to F-statistic:
The statistic T² can be converted to an F-statistic as follows:
((v − p + 1)/(vp)) T²p,v = Fp,v−p+1
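A sketch of the one-sample T² test and its F conversion, assuming y is an n × p data matrix and mu0 a given vector (placeholder names):

n <- nrow(y); p <- ncol(y); v <- n - 1
ybar <- colMeans(y)
S <- cov(y)
T2 <- as.numeric(n * t(ybar - mu0) %*% solve(S) %*% (ybar - mu0))
F_stat <- (v - p + 1) / (v * p) * T2              # convert T^2 to an F statistic
p_value <- pf(F_stat, df1 = p, df2 = v - p + 1, lower.tail = FALSE)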
3. Comparing two mean vectors:
Assumptions: the two samples are independent random samples from multivariate normal populations with a common covariance matrix (Σ1 = Σ2 = Σ).
(a) Hypotheses:
H0 : µ1 = µ2
H1 : µ1 ̸= µ2
(b) Random Sample:
Sample one: y 11 , y 12 , . . . , y 1,n ∼ N (µ1 , Σ21 )
1
3.1.1 Two mean vector tests when samples are not independent
1. di = y1i − y2i
2. µd = µ1 − µ2
3. d̄ = (1/n) ∑ᵢ₌₁ⁿ dᵢ (the sample mean vector of the differences)
4. Sd = the sample covariance matrix of the differences di
5. H0: µd = 0
6. H1: µd ≠ 0
7. The test statistic: T² = n d̄′ Sd⁻¹ d̄
8. T² ∼ T²p,n−1 under H0
Assumptions: the differences d1, . . . , dn are a random sample from Np(µd, Σd).
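A sketch of the paired test, assuming y1_mat and y2_mat are n × p matrices of paired measurements (placeholder names):

d <- y1_mat - y2_mat                      # paired differences
n <- nrow(d); p <- ncol(d)
dbar <- colMeans(d)
Sd <- cov(d)
T2 <- as.numeric(n * t(dbar) %*% solve(Sd) %*% dbar)
F_stat <- (n - p) / ((n - 1) * p) * T2    # same T^2-to-F conversion with v = n - 1
p_value <- pf(F_stat, df1 = p, df2 = n - p, lower.tail = FALSE)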
3.1 Code
EDA Visualization
library(Hotelling)
library(tidyverse)
library(data.table)
data_3 <- read.table("../Examples_Code/T5_1_PSYCH.DAT")
colnames(data_3) <- c("Group","y1","y2","y3","y4")
# m_1 and m_2 (long-format data for groups 1 and 2) are not defined in the
# excerpt above; they are assumed to be built roughly as follows:
m_1 <- melt(as.data.table(data_3[data_3$Group == 1, -1]), measure.vars = c("y1","y2","y3","y4"))
m_2 <- melt(as.data.table(data_3[data_3$Group == 2, -1]), measure.vars = c("y1","y2","y3","y4"))
ggplot() +
geom_boxplot(data = m_1, aes(x = variable, y = value, colour = variable),
notch = TRUE) +
geom_boxplot(data = m_2, aes(x = variable, y = value, colour = variable, fill = variable),
notch = TRUE, alpha = 0.1) +
labs(title = "Overlaying box plot", caption = "Shaded box plots = group 2")
[Figure: overlaying box plots of y1–y4 for the two groups; shaded box plots = group 2]
EDA Summary Statistics
# x and y (the group 1 and group 2 observations) are not defined in the
# excerpt above; they are assumed to be the two halves of data_3:
x <- data_3[data_3$Group == 1, -1]
y <- data_3[data_3$Group == 2, -1]
colMeans(x)
## y1 y2 y3 y4
## 15.96875 15.90625 27.18750 22.75000
colMeans(y)
## y1 y2 y3 y4
## 12.34375 13.90625 16.65625 21.93750
round(cov(x),2)
## y1 y2 y3 y4
## y1 5.19 4.55 6.52 5.25
## y2 4.55 13.18 6.76 6.27
## y3 6.52 6.76 28.67 14.47
## y4 5.25 6.27 14.47 16.65
round(cov(y),2)
## y1 y2 y3 y4
## y1 9.14 7.55 4.86 4.15
## y2 7.55 18.60 10.22 5.45
## y3 4.86 10.22 30.04 13.49
## y4 4.15 5.45 13.49 28.00
print(hotelling.test(x, y))
Test Statistic for one mean vector
colMeans(data_3)
## y1 y2 y3 y4
## 14.15625 14.90625 21.92188 22.34375
## [,1]
## [1,] 1.036205
## [,1]
## [1,] 135.0627
## [,1]
## [1,] 0.2467156
## [,1]
## [1,] 32.15778
## [,1]
## [1,] 0.9105544
## [,1]
## [1,] 2.553513e-14
3.2 Tests on Covariance Matrices
1. Testing a specified pattern for Σ
Assumptions:
i. Independent observations
ii. Normality
iii. µ is unknown
iv. Σ0 is given
(a) Hypotheses:
H0 : Σ = Σ 0
H1 : Σ ̸= Σ0
(b) Random Sample:
y1 , y2 , . . . , yn ∼ Np (µ, Σ)
(c) Test Statistic:
Let:
i. S = the sample covariance matrix
ii. v = n − 1 = degrees of freedom of S
iii. p = the number of variables
Then the test statistic is:
u = v[ ln|Σ0| − ln|S| + tr(SΣ0⁻¹) − p ]
When v = n − 1 is large, u is approximately distributed χ² with p(p + 1)/2 degrees of freedom under H0.
Notice that if S = Σ0, then u = 0.
Special Case:
If we wish to test whether the variables y1, . . . , yp are independent and each have unit variance, in other words:
y ∼ Np(µ, Ip)
Then we set Σ0 = I
H0 : Σ = I
H1 : Σ ̸= I
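A sketch of this test in R, assuming S is the sample covariance matrix, Sigma0 the hypothesized matrix, and n the sample size (placeholder names; the degrees of freedom follow the large-sample approximation above):

v <- n - 1
p <- nrow(S)
u <- v * (log(det(Sigma0)) - log(det(S)) + sum(diag(S %*% solve(Sigma0))) - p)
p_value <- pchisq(u, df = p * (p + 1) / 2, lower.tail = FALSE)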
2. Testing Sphericity
(a) Hypotheses:
H0 : Σ = σ 2 I
H1 : Σ ̸= σ 2 I
In this case, σ 2 is unknown, therefore the test statistic needs to be derived differently.
(b) Test Statistic:
Let:
u = p^p |S| / (tr(S))^p = p^p ∏ᵢ₌₁ᵖ λᵢ / (∑ᵢ₌₁ᵖ λᵢ)^p
where λ1, λ2, . . . , λp are the eigenvalues of S.
Then we get the following test statistic:
−2 ln(LR) = −n ln(u)
which is approximately distributed χ²df for large n, where df = p(p + 1)/2 − 1.
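A sketch of the sphericity test, assuming S and n are the sample covariance matrix and sample size (placeholder names):

lambda <- eigen(S)$values
p <- length(lambda)
u <- prod(lambda) / mean(lambda)^p            # equals p^p |S| / tr(S)^p
stat <- -n * log(u)                           # -2 ln(LR)
p_value <- pchisq(stat, df = p * (p + 1) / 2 - 1, lower.tail = FALSE)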
3. Testing for Equality of Covariance Matrices
(a) Hypotheses:
H0 : Σ1 = Σ2 = . . . = Σk
H1 : At least two matrices are different
(b) Random Sample:
There are k independent samples:
Sample 1: n1 observations of y from Np (µ1 , Σ1 )
..
.
Sample k: nk observations of y from Np (µk , Σk )
Assumption: In order for S1 , . . . , Sk to be nonsingular, ni > p + 1, i = 1, . . . , k
(c) Test statistic: From the k samples, we calculate the test statistic as follows:
i. Calculate the sample covariance matrices S1 , . . . , Sk for each sample
ii. Calculate Spl
Spl = [ (n1 − 1)S1 + (n2 − 1)S2 + . . . + (nk − 1)Sk ] / ( ∑ᵢ₌₁ᵏ nᵢ − k )
iii. Calculate M:
M = ( |S1|^((n1−1)/2) · |S2|^((n2−1)/2) · . . . · |Sk|^((nk−1)/2) ) / |Spl|^( ∑ᵢ₌₁ᵏ (nᵢ−1)/2 )
iv. Calculate c1:
c1 = [ ∑ᵢ₌₁ᵏ 1/(nᵢ − 1) − 1 / ∑ᵢ₌₁ᵏ (nᵢ − 1) ] · [ (2p² + 3p − 1) / (6(p + 1)(k − 1)) ]
v. Calculate u:
u = −2(1 − c1) ln(M)
Under H0, u is approximately distributed:
χ² with (1/2)(k − 1)p(p + 1) degrees of freedom
3.2 Code
# X and Y are assumed to be the two groups' data matrices (e.g. the x and y
# built in Section 3.1)
Sx <- cov(X)
Sy <- cov(Y)
v1 <- dim(X)[1] - 1
v2 <- dim(Y)[1] - 1
Spl <- (v1 * Sx + v2 * Sy) / (v1 + v2)   # pooled covariance matrix
#test statistic
u <- -v1 * log(det(Sx)) - v2 * log(det(Sy)) + (v1 + v2) * log(det(Spl)) # 14.561
#chi-square approximation
p <- 4
c1 <- (1/v1 + 1/v2 - 1 / (v1 + v2)) * (2 * p^2 + 3 * p - 1)/(6 * (p + 1) * (2 - 1)) #0.069
u2 <- (1 - c1) * u # 13.550
df1 <- 0.5 * (2 - 1) * p * (p + 1) # 10
pchisq(u2, df = df1, lower.tail = FALSE)  # approximate p-value
4 MANOVA
1. MANOVA: A procedure for comparing multivariate sample means. It is used when there are two or
more dependent variables, and is often followed by significance tests involving each individual dependent
variable separately.
2. Variable Names:
(a) n = the number of observations
(b) p = the number of variables
(c) k = the number of samples
3. Assumptions: For a data set of k samples, where each observation is denoted yij with i = 1, . . . , k and j = 1, . . . , n, and each yij is a (p × 1) vector containing the p variables (attributes) measured on that observation, we assume:
yij ∼ Np (µi , Σ)
4. Notation:
We can write the MANOVA model as:
yij = µi + ϵij , i = 1, . . . , k, j = 1, . . . , n
5. Hypotheses:
H0 : µ1 = µ2 = . . . = µk
H1 : At least two µ’s are different
6. Test Statistic:
Begin by calculating the following:
(a) The sample mean vector of each group:
ȳi0 , i = 1, . . . , k
(b) The overall mean:
ȳ00 = (1/(nk)) ∑ᵢ₌₁ᵏ ∑ⱼ₌₁ⁿ yij
(c) The "between" matrix H:
H = n ∑ᵢ₌₁ᵏ (ȳi0 − ȳ00)(ȳi0 − ȳ00)′
(d) The "within" matrix E:
E = ∑ᵢ₌₁ᵏ ∑ⱼ₌₁ⁿ (yij − ȳi0)(yij − ȳi0)′
(a) Wilks' Λ:
Λ = |E| / |E + H|
We reject H0 if Λ ≤ Λα,p,vH,vE. If H is "small" relative to E, then we probably do not have evidence to reject H0.
Note that the Wilks test statistic is invariant to scale changes (the test statistic remains the same when the data are multiplied uniformly by constant values).
(b) Roy
θ = λ1 / (1 + λ1)
where λ1 = the largest eigenvalue of E⁻¹H
We reject H0 if θ ≥ θα,s,m,n
(c) Hotelling
Let λ1, . . . , λs be the eigenvalues of E⁻¹H, then:
V(s) = ∑ᵢ₌₁ˢ λᵢ / (1 + λᵢ)
We reject H0 if V(s) ≥ V(s)α
4 Code
## 1 2 3 4 5 6
## 8 8 8 8 8 8
y1 <- data_4[,2]
y2 <- data_4[,3]
y3 <- data_4[,4]
y4 <- data_4[,5]
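A sketch of fitting the MANOVA with the built-in manova() function, assuming data_4 is laid out like the rootstock data of Section 5, with the group label in its first column (the read-in code is not shown above, so this is an assumption):

group <- factor(data_4[, 1])                 # assumed group label column
fit <- manova(cbind(y1, y2, y3, y4) ~ group)
summary(fit, test = "Wilks")                 # Wilks' Lambda
summary(fit, test = "Roy")                   # Roy's largest root
summary(fit, test = "Hotelling-Lawley")      # Hotelling-Lawley trace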
5 Discriminant Analysis
1. Two Groups:
Suppose we have two groups:
Group 1: y11 , y12 , . . . , y1n1 ∼ Np (µ1 , Σ)
Group 2: y21 , y22 , . . . , y2n2 ∼ Np (µ2 , Σ)
And we reject H0 : µ1 = µ2
Then we may want to do further analysis on the two groups.
Here, we try to find a linear combination such that the distance between the two transformed groups
is maximized. This transformation is denoted:
z = a′y, with sample variance Sz² = a′Spl a
The (standardized) distance between the two transformed groups is defined as:
(z̄1 − z̄2) / Sz
This distance is maximized by choosing a = Spl⁻¹(ȳ1 − ȳ2), and the resulting separation is related to Hotelling's two-sample T²:
(ȳ1 − ȳ2)′ [ (1/n1 + 1/n2) Spl ]⁻¹ (ȳ1 − ȳ2) = T²
2. Several Groups:
Suppose there are k groups where k > 2, then we need more than one discriminant function to describe
group separation.
Let µ1 , µ2 , . . . , µk be the k group means.
H0 : µ1 = µ2 = . . . = µk
H1 : µi ̸= µj for some i and j.
Use the matrices E and H to calculate E −1 H with the corresponding eigenvalues λ1 , λ2 , . . . , λs (notice
that the eigenvalues decrease in value).
The corresponding eigenvectors are a1, a2, . . . , as; the s discriminant functions are then:
z1 = a1′y, z2 = a2′y, . . . , zs = as′y
The relative importance of each discriminant function zi can be assessed by considering its eigenvalue
as a proportion of the total:
λᵢ / ∑ⱼ₌₁ˢ λⱼ
(a) E −1 H is a p × p matrix
(b) E is non-singular ⇒ n1 + n2 + . . . nk > p
(c) H can be singular. If k < p, H is singular
(d) λ1 , . . . , λs > 0, s ≤ k − 1, s ≤ p
5 Code
library(MASS)
data_5 <- read.table(file="../Examples_Code/T6_2_ROOT.DAT")
colnames(data_5) <- c("rootstock","y1","y2","y3","y4")
y1 <- data_5[,2]
y2 <- data_5[,3]
y3 <- data_5[,4]
y4 <- data_5[,5]
[Figure: plot of the discriminant scores for the rootstock data (x-axis: z1)]
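A plot like the one above can be produced with MASS::lda(); a sketch using the data_5 read in above (the plotting details are assumptions):

lda_fit <- lda(rootstock ~ y1 + y2 + y3 + y4, data = data_5)
scores <- predict(lda_fit)$x                 # discriminant scores z1, z2, ...
plot(scores[, 1], scores[, 2], col = data_5$rootstock, pch = 19,
     xlab = "z1", ylab = "z2")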
6 Classification Analysis
1. Classification into Two Groups:
Assumptions:
(a) The two groups (populations) have the same covariance matrix
(b) Normality is not required for Fisher’s procedure
Group 1: sample mean ȳ1 and sample covariance S1
Group 2: sample mean ȳ2 and sample covariance S2
Fisher's procedure:
The procedure is based on the discriminant function:
z = a′y = (ȳ1 − ȳ2)′ Spl⁻¹ y
(a) Use the discriminant function to project the p-dimensional data onto one dimension.
(b) Calculate z̄1 and z̄2:
z̄1 = (ȳ1 − ȳ2)′ Spl⁻¹ ȳ1
z̄2 = (ȳ1 − ȳ2)′ Spl⁻¹ ȳ2
(c) Assign y to group 1 if z = (ȳ1 − ȳ2)′ Spl⁻¹ y is closer to z̄1 than to z̄2 (equivalently, z > ½(z̄1 + z̄2) when z̄1 > z̄2); otherwise assign y to group 2.
This is a linear classification rule since it is only a linear function of y. If the two populations are
normally distributed, then this rule is optimal to minimize the misclassification rate.
2. Prior Probabilities:
If prior probabilities p1 and p2 are known for the two populations, then the optimal classification rule assigns y to group 1 if
(ȳ1 − ȳ2)′ Spl⁻¹ [ y − ½(ȳ1 + ȳ2) ] ≥ ln(p2/p1)
and to group 2 otherwise.
3. Misclassification Cost:
If the cost of misclassifying an observation is different for the two groups, we can add this information to the optimal rule. Let:
(a) C12 be the cost of misclassifying an observation from group 1 into group 2
(b) C21 be the cost of misclassifying an observation from group 2 into group 1
The rule then assigns y to group 1 if
(ȳ1 − ȳ2)′ Spl⁻¹ [ y − ½(ȳ1 + ȳ2) ] ≥ ln( C21 p2 / (C12 p1) )
and to group 2 otherwise.
5. Classification into Several Groups:
Use the quadratic classification function, where each group has its own sample mean ȳi and covariance Si, and a new observation y is to be classified.
Use the distance measure Dᵢ²(y) = (y − ȳi)′ Sᵢ⁻¹ (y − ȳi) and assign y to the group for which Dᵢ²(y) is smallest.
If there are prior probabilities p1, . . . , pk, then define:
Qᵢ(y) = ln(pᵢ) − ½ ln|Sᵢ| − ½ (y − ȳi)′ Sᵢ⁻¹ (y − ȳi)
Assign y to the group for which Qᵢ(y) is largest.
If the covariance matrices are equal across the groups, the classification functions simplify to linear classification functions.
6. Misclassification Rates:
In order to visualize and calculate the misclassification rate, a confusion matrix is constructed: a k × k table of counts whose off-diagonal entries represent misclassifications.
e.g. n12 = the number of group 1 observations misclassified as group 2 observations
We can calculate the (apparent) error rate as follows:
error rate = ( ∑ of nij over all i ≠ j ) / ( ∑ᵢ₌₁ᵏ ∑ⱼ₌₁ᵏ nij )
We can also partition the sample to get improved estimates of the error rate using cross-validation.
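The values 12.5 and 9.375 printed in the code section below appear to be this error rate expressed as a percentage. A minimal sketch of the computation from a confusion matrix conf_mat (a placeholder name for the output of table()):

apparent_error_rate <- (sum(conf_mat) - sum(diag(conf_mat))) / sum(conf_mat)
100 * apparent_error_rate                     # error rate as a percentage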
6 Code
#Linear classification
LinearModel <- lda(Group~., data_6, prior=c(1, 1)/2)
predict(LinearModel, data_6)$class
## [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2
## Levels: 1 2
# Classification table
table(data_6$Group, predict(LinearModel,data_6)$class)
##
## 1 2
## 1 28 4
## 2 4 28
## [1] 12.5
# Quadratic Classification
Group <- data_6[,1]
data_6_2 <- data_6[,-1]
Quadratic_model <- qda(data_6_2, Group)
predict(Quadratic_model, data_6_2)$class
## [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## Levels: 1 2
##
## 1 2
## 1 28 4
## 2 2 30
## [1] 9.375
# Hold out/Leave one out cross validation function (Written for A4)
holdout_function <- function(data, model){
  # The loop bodies below are a reconstruction; the original code was cut off
  # at a page break. Fit on all rows except i, then classify held-out row i.
  data_length <- nrow(data)
  predictions <- rep(NA, data_length)
  if(model == "lda"){
    for(i in 1:data_length){
      fit <- lda(Group ~ ., data = data[-i, ])
      predictions[i] <- as.character(predict(fit, data[i, ])$class)
    }
  } else {
    for(i in 1:data_length){
      fit <- qda(Group ~ ., data = data[-i, ])
      predictions[i] <- as.character(predict(fit, data[i, ])$class)
    }
  }
  print(predictions)
  table(data$Group, predictions)
}
7 Principal Component Analysis
Imagine we have a data matrix Yn×p with n observations and p variables. The idea of principal component
analysis (PCA) is to project Yn×p → Zn×k where k < p. This is a dimension reduction technique where the
variables zi are linear combinations of y1 , . . . , yp .
1. Theory:
Assumption: y1 , . . . , yn is a random sample from one population.
Let:
ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ
S = (1/(n − 1)) ∑ᵢ₌₁ⁿ (yᵢ − ȳ)(yᵢ − ȳ)′
S is a p × p matrix and may be singular.
Let λ1 ≥ λ2 ≥, . . . , λp ≥ 0 be the eigenvalues of S.
Let a1 , . . . , ap be the corresponding eigenvectors of S.
Then we can calculate the principal components as:
z1 = a1′y, . . . , zp = ap′y
Stacking them into a vector, z = (a1′y, a2′y, . . . , ap′y)′ = Ay, where the rows of A are the eigenvectors a1′, . . . , ap′. The sample covariance matrix of z is diagonal:
Sz = ASA′ = D = diag(λ1, . . . , λp)
Notice that λ1 ≥ λ2 ≥ . . . ≥ λp ≥ 0 and
var(z1) = λ1, var(z2) = λ2, . . . , var(zp) = λp.
Because the eigenvalues are the variances of the principal components, the proportion of variance
explained by the first k components is:
(λ1 + λ2 + . . . + λk) / (λ1 + λ2 + . . . + λp) = (λ1 + λ2 + . . . + λk) / tr(S)
Sometimes we want to scale the variables so that they all have variance 1 before computing the principal components. This is equivalent to using the correlation matrix R for the principal component analysis, and it also implies that PCA is not scale invariant.
2. Computation:
There are p principal components.
In practice, a decision must be made on how many principal components should be retained in order
to effectively summarize the data. Some common methods for making this decision include:
(a) Retain Sufficient Components: Retain enough components to account for a specified percentage of the total variance, usually around 80%. Often a few principal components will explain 80% or more of the variation in the data.
(b) Greater Than Average: Retain the components whose eigenvalues are greater than the average of the eigenvalues:
λᵢ > (∑ⱼ₌₁ᵖ λⱼ) / p, i = 1, 2, . . . , k
(c) Scree Graph: Use a scree graph, a plot of λᵢ versus i, and look for a "natural break" between the "large" eigenvalues and the "small" eigenvalues.
3. High Dimensional Data: For high dimensional data (p > n), the sample covariance matrix Sp×p is singular with rank(S) ≤ n − 1. Thus S has at most n − 1 non-zero eigenvalues, which means that the data can be projected onto at most an (n − 1)-dimensional space.
Yn×p → Zn×(n−1)
4. Outliers:
Outliers have a large influence on PCA; we can use box plots of each variable, or robust methods, to help identify and remove outliers from our data.
7 Code
library(tidyquant)
data_7 <- read.table("../Examples_Code/T3_5_DIABETES.DAT")
colnames(data_7)=c("obs","y1","y2","x1","x2","x3")
boxplot_data <- data_7[,-1] %>% gather() %>% mutate(key = as.factor(key))
boxplot_data %>%
ggplot(aes(x = key, y = value)) +
geom_boxplot(fill = "Skyblue3") +
labs(
title = "Variability of each variable",
x = "Variable",
y = "Values"
) + theme_tq()
[Figure: box plots showing the variability of each variable (x1, x2, x3, y1, y2)]
# PCA using S
result.pca <- prcomp(data_7[,-1], scale = FALSE)
eigen_S <- cov(data_7[,-1]) %>% eigen()
eigen_S$values
eigen_S$vectors
# PCA using R
result.pca_2 <- prcomp(data_7[,-1], scale = TRUE)
eigen_R <- cor(data_7[,-1]) %>% eigen()
eigen_R$values
eigen_R$vectors
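A sketch of summarizing these fits: the proportion of variance explained and a scree graph, using base R's summary() and screeplot():

summary(result.pca)                    # standard deviations and cumulative proportion of variance
eigen_S$values / sum(eigen_S$values)   # proportion of variance from the eigenvalues of S
screeplot(result.pca, type = "lines")  # scree graph: eigenvalue (variance) vs. component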
8 Cluster Analysis
Cluster analysis differs fundamentally from classification analysis in the sense that we are trying to group
“similar” observations into the same group. Typically, we have n observations that we are trying to assign
to k clusters where k is unknown.
For n observations y1, . . . , yn, we can compute the distance between any two observations yi and yj as dij = d(yi, yj), where D = (dij) is an n × n symmetric matrix whose diagonal elements are all zero.
1. Hierarchical Methods:
(a) Single Linkage: The distance between two clusters A and B is defined as the minimum distance between a point in A and a point in B.
At each step of the single linkage method, the distance is found for every pair of clusters, and
we merge the two clusters with the smallest distance. The number of clusters is reduced by one.
After two clusters are merged, the procedure is repeated for the next step. The distance between
all pairs of clusters are calculated again, and the pair with minimum distance is merged into a
single cluster.
(b) Complete Linkage: This procedure is similar to the single linkage procedure, but the distance
between the two clusters A and B is defined as the maximum distance between a point in A
and a point in B.
(c) Average Linkage: If we let nA and nB be the number of points in A and B respectively, then
we define average linkage distance as:
D(A, B) = (1/(nA nB)) ∑ᵢ ∑ⱼ d(yᵢ, yⱼ), summing over the nA points yᵢ in A and the nB points yⱼ in B
2. Nonhierarchical Methods:
(a) Partitioning Approach: In the partitioning approach, the observations are separated into g
clusters without using a hierarchical approach based on a matrix of distances or similarity measures
between all pairs of points.
(b) K-means Method:
An optimization method which allows the observations to be moved from one cluster to another.
This type of reallocation is not available in the hierarchical methods.
The algorithm executes as follows:
i. First, we select g clusters as "seeds".
ii. After the seeds are chosen, each remaining point in the data set is assigned to the cluster
with the nearest seed based on Euclidean distance.
iii. As soon as a cluster has more than one member, the cluster seed is replaced by the cluster centroid.
iv. After all items are assigned to clusters, each item is examined to see if it is closer to the centroid of another cluster than to the centroid of its own cluster. If so, the item is moved to the new cluster and the two cluster centroids are updated. This process continues until no further improvement is possible.
(c) Choosing Seeds:
i. Select at random g observations that are at least a distance r apart
ii. Select the first g observations that are at least a distance r apart
iii. Select the g observations that are mutually farthest apart
iv. Use the g centroids from the g-cluster solution obtained from the average linkage (hierarchical) clustering method.
8 Code
[Figure: cluster dendrogram of U.S. cities (Atlanta, Boston, Chicago, Dallas, Denver, Detroit, Hartford, Honolulu, Houston, KC, LA, NO, NY, Portland, Tucson, Washington)]
31