Stat 5311 - Multivariate Statistics and Nonparametric Statistics
Outline
• Multidimensional Scaling
• Chapter 4 in EH and Chapter 12 in JW
• Notation and terminology: for the number of variables, JW uses p, DZ uses m, and EH uses q.
Multidimensional Scaling
Classical (Metric) MDS
• Works with the matrix D of Euclidean distances derived from a raw n × q
data matrix X whose columns have zero mean (subtract the column means).
• Find a low-dimensional representation of the data that reproduces
the original distances as closely as possible.
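• Let X be known (for now) and define the n × n inner-product matrix
$B = XX^T$.
• This gives $b_{ij} = \sum_{l=1}^{q} x_{il} x_{jl}$.
• The squared Euclidean distance between the ith row and the jth row is
$d_{ij}^2 = \sum_{l=1}^{q} (x_{il} - x_{jl})^2 = b_{ii} + b_{jj} - 2 b_{ij}$.
• Spectral decomposition: $B = V \Lambda V^T$, so that $X = V \Lambda^{1/2}$,
where $\Lambda^{1/2} = \mathrm{diag}(\lambda_1^{1/2}, \lambda_2^{1/2}, \ldots, \lambda_q^{1/2})$.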
• Classical MDS problem: given the proximities, find X.
• GOAL: Find adequate k-dimensional coordinates X̂n×k , k ≤ q.
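The steps in the bullets above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the course's own code: `X_demo` is simulated data made up for the example, and B is recovered from the squared distances by double centering, a standard step that the bullets do not spell out.

```r
# Classical MDS by hand on simulated data (X_demo is hypothetical).
set.seed(1)
X_demo <- scale(matrix(rnorm(20 * 3), nrow = 20), center = TRUE, scale = FALSE)

D2 <- as.matrix(dist(X_demo))^2          # squared Euclidean distances
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)      # centering matrix
B  <- -0.5 * J %*% D2 %*% J              # double centering recovers B = X X^T

eig   <- eigen(B, symmetric = TRUE)      # B = V Lambda V^T
k     <- 2
X_hat <- eig$vectors[, 1:k] %*% diag(sqrt(eig$values[1:k]))  # V Lambda^{1/2}

# B agrees with X X^T, and X_hat agrees with cmdscale() up to column signs.
b_err   <- max(abs(B - X_demo %*% t(X_demo)))
mds_err <- max(abs(abs(X_hat) - abs(cmdscale(dist(X_demo), k = k))))
print(c(b_err = b_err, mds_err = mds_err))
```

Eigenvectors are only defined up to sign, which is why the comparison with `cmdscale()` uses absolute values.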
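• Adequacy of a k-dimensional solution is judged by the share of the eigenvalues it retains:
$P_k = \frac{\sum_{i=1}^{k} |\lambda_i|^r}{\sum_{j=1}^{n} |\lambda_j|^r}, \quad r = 1, 2.$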
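As a rough illustration of the criterion, the sketch below computes P_k for r = 1 and r = 2 from the eigenvalues returned by `cmdscale()`. It uses `eurodist`, a distance object that ships with base R, rather than the course data. To the best of my reading of the `cmdscale` help page, its second `GOF` value uses max(λ, 0) in the denominator rather than squared eigenvalues, so only the r = 1 value is expected to match.

```r
# eurodist: road distances between 21 European cities (built-in dataset).
k   <- 2
fit <- cmdscale(eurodist, k = k, eig = TRUE)
lam <- fit$eig                            # all n eigenvalues, some negative

# P_k for r = 1 and r = 2, following the formula on the slide.
P_k <- sapply(c(r1 = 1, r2 = 2),
              function(r) sum(abs(lam[1:k])^r) / sum(abs(lam)^r))
print(P_k)
print(fit$GOF)                            # GOF[1] should agree with P_k for r = 1
```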
Example: Euclidean distances
# 10 observations on 5 variables, entered row by row
X <- c(3, 4, 4, 6, 1,
       5, 1, 1, 7, 3,
       6, 2, 0, 2, 6,
       1, 1, 1, 0, 3,
       4, 7, 3, 6, 2,
       2, 2, 5, 1, 0,
       0, 4, 1, 1, 1,
       0, 6, 4, 3, 5,
       7, 6, 5, 1, 4,
       2, 1, 4, 3, 1)
X <- matrix(X, nrow = 10, byrow = TRUE)
(D <- dist(X))
## 1 2 3 4 5 6 7 8
## 2 5.196152
## 3 8.366600 6.082763
## 4 7.874008 8.062258 6.324555
## 5 3.464102 6.557439 8.366600 9.273618
## 6 5.656854 8.426150 8.831761 5.291503 7.874008
## 7 6.557439 8.602325 8.185353 3.872983 7.416198 5.000000
## 8 6.164414 8.888194 8.366600 6.928203 6.000000 7.071068 5.744563
## 9 7.416198 9.055385 6.855655 8.888194 6.557439 7.549834 8.831761 7.416198
## 10 4.358899 6.164414 7.681146 4.795832 7.141428 2.645751 5.099020 6.708204
## 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10 8.000000
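As a quick check on how to read the output, the first entry (rows 1 and 2) can be reproduced by hand. The two vectors below are copied from the first two rows of X; the names `x1` and `x2` are introduced only for this sketch.

```r
x1 <- c(3, 4, 4, 6, 1)   # row 1 of X
x2 <- c(5, 1, 1, 7, 3)   # row 2 of X

# Differences are (-2, 3, 3, -1, -2); squares sum to 27.
d12_by_hand <- sqrt(sum((x1 - x2)^2))
d12_dist    <- as.numeric(dist(rbind(x1, x2)))
print(c(d12_by_hand, d12_dist))
```

Both values equal sqrt(27), which rounds to the 5.196152 shown in the table.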
## [1] 1.065814e-14
# compare the first 5 principal component scores with the 5-dimensional scaling solution
max(abs(prcomp(X)$x) - abs(cmdscale(D, k = 5)))
## [1] 2.663494e-14
cmdscale(D, k = 3, eig = TRUE)$GOF
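The agreement between principal component scores and classical MDS coordinates can be checked on any data set with Euclidean distances. The sketch below uses simulated data (`X_sim` is made up for illustration) and compares absolute values, because each coordinate axis is determined only up to sign.

```r
set.seed(42)
X_sim <- matrix(rnorm(15 * 4), nrow = 15)        # hypothetical 15 x 4 data

pc_scores  <- prcomp(X_sim)$x                    # PCA scores (centered data)
mds_coords <- cmdscale(dist(X_sim), k = 4)       # classical MDS, full dimension

# Largest absolute discrepancy after removing sign ambiguity.
max_diff <- max(abs(abs(pc_scores) - abs(mds_coords)))
print(max_diff)
```

A value at the level of machine precision indicates that the two solutions coincide for Euclidean distances.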
Example: Non-Euclidean (airline) distances between US cities.
library(psychTools)
data(cities)
cities
## ATL BOS ORD DCA DEN LAX MIA JFK SEA SFO MSY
## ATL 0 934 585 542 1209 1942 605 751 2181 2139 424
## BOS 934 0 853 392 1769 2601 1252 183 2492 2700 1356
## ORD 585 853 0 598 918 1748 1187 720 1736 1857 830
## DCA 542 392 598 0 1493 2305 922 209 2328 2442 964
## DEN 1209 1769 918 1493 0 836 1723 1636 1023 951 1079
## LAX 1942 2601 1748 2305 836 0 2345 2461 957 341 1679
## MIA 605 1252 1187 922 1723 2345 0 1092 2733 2594 669
## JFK 751 183 720 209 1636 2461 1092 0 2412 2577 1173
## SEA 2181 2492 1736 2328 1023 957 2733 2412 0 681 2101
## SFO 2139 2700 1857 2442 951 341 2594 2577 681 0 1925
## MSY 424 1356 830 964 1079 1679 669 1173 2101 1925 0
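When the proximities are not Euclidean, the matrix B is no longer positive semidefinite and some eigenvalues turn negative. The toy example below shows this with three made-up points whose dissimilarities violate the triangle inequality; the matrix `d` is hypothetical and is not taken from the airline data.

```r
# Hypothetical dissimilarities: d(1,3) = 3 > d(1,2) + d(2,3) = 2.
d <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), nrow = 3, byrow = TRUE)

n   <- nrow(d)
J   <- diag(n) - matrix(1 / n, n, n)      # centering matrix
B   <- -0.5 * J %*% d^2 %*% J             # double-centered squared dissimilarities
lam <- eigen(B, symmetric = TRUE)$values
print(lam)                                # at least one eigenvalue is negative

# cmdscale() reports the same eigenvalues when eig = TRUE.
fit <- cmdscale(as.dist(d), k = 1, eig = TRUE)
print(fit$eig)
```

Negative eigenvalues are the reason the adequacy criterion is written with absolute values or squares.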
Example
[Figure: two-dimensional MDS configuration of the 11 cities; Coordinate 1 on the horizontal axis, Coordinate 2 on the vertical axis.]
Example
# lam: eigenvalues of the MDS solution (assumed to be defined on an earlier slide)
par(mfrow = c(2, 1))
plot(1:length(lam), cumsum(lam))
plot(1:length(lam), cumsum(abs(lam^2)))
[Figure: two panels plotting cumsum(lam) (top) and cumsum(abs(lam^2)) (bottom) against 1:length(lam).]
Non-metric MDS
$\mathrm{Stress}(k) = \min \left\{ \frac{\sum_{i<j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i<j} d_{ij}^2} \right\}^{1/2}$
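To make the stress formula concrete, the sketch below recomputes it from a fitted non-metric solution. It assumes the `MASS` package and uses the built-in `eurodist` distances rather than the course data. It also assumes that `Shepard()` returns the configuration distances in `y` (the d_ij) and the monotone-regression fit in `yf` (the d̂_ij), and that `isoMDS()` reports stress in percent, as its help page describes.

```r
library(MASS)

fit <- isoMDS(eurodist, k = 2, trace = FALSE)   # Kruskal's non-metric MDS
sh  <- Shepard(eurodist, fit$points)            # x: dissimilarities, y: d_ij, yf: fitted

# Stress from the formula, expressed in percent to match isoMDS().
stress_by_hand <- 100 * sqrt(sum((sh$yf - sh$y)^2) / sum(sh$y^2))
print(c(by_hand = stress_by_hand, isoMDS = fit$stress))
```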
House of Representatives Example
Romesburg (1984) shows the number of times 15 congressmen from NJ voted differently on 19 environmental bills.
## Hunt(R) Sandman(R) Howard(D) Thompson(D) Freylinghuysen(R)
## Hunt(R) 0 8 15 15 10
## Sandman(R) 8 0 17 12 13
## Howard(D) 15 17 0 9 16
## Thompson(D) 15 12 9 0 14
## Freylinghuysen(R) 10 13 16 14 0
## Forsythe(R) 9 13 12 12 8
## Widnall(R) 7 12 15 13 9
## Roe(D) 15 16 5 10 13
## Heltoski(D) 16 17 5 8 14
## Rodino(D) 14 15 6 8 12
## Minish(D) 15 16 5 8 12
## Rinaldo(R) 16 17 4 6 12
## Maraziti(R) 7 13 11 15 10
## Daniels(D) 11 12 10 10 11
## Patten(D) 13 16 7 7 11
## Forsythe(R) Widnall(R) Roe(D) Heltoski(D) Rodino(D) Minish(D)
## Hunt(R) 9 7 15 16 14 15
## Sandman(R) 13 12 16 17 15 16
## Howard(D) 12 15 5 5 6 5
## Thompson(D) 12 13 10 8 8 8
## Freylinghuysen(R) 8 9 13 14 12 12
## Forsythe(R) 0 7 12 11 10 9
## Widnall(R) 7 0 17 16 15 14
## Roe(D) 12 17 0 4 5 5
## Heltoski(D) 11 16 4 0 3 2
## Rodino(D) 10 15 5 3 0 1
## Minish(D) 9 14 5 2 1 0
## Rinaldo(R) 10 15 3 1 2 1
## Maraziti(R) 6 10 12 13 11 12
## Daniels(D) 6 11 7 7 4 5
## Patten(D) 10 13 6 5 6 5
## Rinaldo(R) Maraziti(R) Daniels(D) Patten(D)
## Hunt(R) 16 7 11 13
Voting Example
library("MASS")
data("voting", package = "HSAUR2")
# Kruskal's non-metric MDS (default: k = 2)
voting_mds <- isoMDS(voting)
Kruskal (1964) rule of thumb for stress: poor = 20%, fair = 10%, good = 5%,
excellent = 2.5%.
[Figure: Stress vs. k; stress2 (vertical axis, 0 to 20) plotted against Index (1 to 7).]
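A stress-versus-k curve of this kind can be produced by refitting the non-metric solution for several dimensions. The sketch below is one way to do it, assuming the `MASS` package and using the built-in `eurodist` distances in place of the voting data; the names `ks` and `stress_by_k` are introduced for this example only.

```r
library(MASS)

ks <- 1:5
# Refit Kruskal's non-metric MDS for each k and keep the final stress (in percent).
stress_by_k <- sapply(ks, function(k) isoMDS(eurodist, k = k, trace = FALSE)$stress)
print(round(stress_by_k, 2))

# Draw the curve only in an interactive session.
if (interactive()) plot(ks, stress_by_k, type = "b", xlab = "k", ylab = "Stress (%)")
```

The values can then be read against Kruskal's rule of thumb to choose the smallest k with acceptable stress.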
Voting Example
[Figure: two-dimensional non-metric MDS solution for the voting data; congressmen labelled by name and party, Coordinate 1 on the horizontal axis, Coordinate 2 on the vertical axis.]
Voting behaviour is essentially along party lines, although there is more variation
among Republicans. The voting behaviour of one Republican (Rinaldo) seems to
be closer to that of his Democratic colleagues.
Voting Example (Shepard Plot)
[Figure: Shepard plot; Distance (vertical axis) against Dissimilarity (horizontal axis), both ranging from about 5 to 15.]
If proximities are similarities, the points should form a loose line from top left to bottom right. If proximities are
dissimilarities, the points should form a line from bottom left to top right.
Good fit (zero stress) means that the dashes (fit) and the asterisks ($d_{ij}$) coincide.
According to EH, the plot shows some discrepancies between the original dissimilarities and the multidimensional
scaling solution.
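The ingredients of a Shepard plot can be inspected directly. The sketch below assumes the `MASS` package and uses the built-in `eurodist` distances, not the voting data. It checks two properties described above: the fitted values (the dashes) never decrease as the dissimilarity increases, and the configuration distances (the asterisks) rise with the dissimilarities.

```r
library(MASS)

fit <- isoMDS(eurodist, k = 2, trace = FALSE)
sh  <- Shepard(eurodist, fit$points)   # x sorted dissimilarities, y distances, yf fit

fit_is_monotone <- all(diff(sh$yf) >= -1e-8)   # step function never decreases
trend           <- cor(sh$x, sh$y)             # positive: bottom left to top right
print(c(monotone = fit_is_monotone, correlation = round(trend, 3)))

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(sh$x, sh$y, pch = "*", xlab = "Dissimilarity", ylab = "Distance")
  lines(sh$x, sh$yf, type = "S")
}
```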
Voting Example (vegan package)
library(vegan)
set.seed(7634)
voting_mds2 <- metaMDS(voting)
[Figure: plot with Observed Dissimilarity on the horizontal axis (5 to 15); vertical axis from 0 to 10.]
Summary
HW/Lab for this week