
Stat 5311 - Multivariate Statistics and Nonparametric Statistics

Dexter Cahoy, University of Houston-Downtown

Outline

• Multidimensional Scaling
• Chapter 4 in EH and Chapter 12 in JW
• Notation and terminology: JW uses p, DZ uses m, and EH uses q.

Multidimensional Scaling

• Multidimensional scaling (MDS) is a method that uses proximities
  between objects, subjects, or stimuli to produce a spatial
  representation of these items.
• Proximities express the similarity or dissimilarity between data
  objects. They often arise from judgments made by human raters about
  how alike pairs of objects, stimuli, etc., are.
• It is a dimension-reduction technique that aims to find a set of
  points in low dimension (typically 2 dimensions) whose configuration
  reflects that of the high-dimensional data.

Classical (Metric) MDS
• Deals with Euclidean distances, D, derived from an n × q data matrix X
  whose columns have mean zero (subtract the column means).
• Find a low-dimensional representation of the data which mimics the
  original distances.
• Let X be known (for now) and

  B = XX^T.

• This suggests b_{ij} = Σ_{l=1}^{q} x_{il} x_{jl}.
• The squared Euclidean distance between the ith and jth rows is

  d_{ij}^2 = b_{ii} + b_{jj} − 2 b_{ij}.

• Under certain conditions (see p. 108 of EH), we can derive the elements
  of B, and hence of X, in terms of the squared Euclidean distances d_{ij}^2.
Classical (Metric) MDS

• By the spectral decomposition, if X is of full rank, then

  B = V Λ V^T,

  where V contains the first q eigenvectors and Λ contains the q
  non-zero eigenvalues λ_1, λ_2, . . . , λ_q.
• The required coordinate values are thus

  X = V Λ^{1/2},

  where Λ^{1/2} = diag(λ_1^{1/2}, λ_2^{1/2}, . . . , λ_q^{1/2}).
• Classical MDS problem: given the proximities, find X.
• GOAL: Find adequate k-dimensional coordinates X̂_{n×k}, k ≤ q.
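
Putting these two slides together, here is a minimal sketch of classical
MDS done "by hand" in R (assuming some column-centered data matrix X; the
names D2, J, and Xhat are just for illustration):

D2 <- as.matrix(dist(X))^2              # squared Euclidean distances
n <- nrow(D2)
J <- diag(n) - matrix(1/n, n, n)        # centering matrix
B <- -0.5 * J %*% D2 %*% J              # recovers B = XX^T from D
e <- eigen(B, symmetric = TRUE)
k <- 2
Xhat <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))  # V_k Lambda_k^{1/2}
# Xhat matches cmdscale(dist(X), k = 2) up to sign/reflection of columns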

Classical (Metric) MDS

• The adequacy of the k-dimensional representation can be judged by the
  size of the criterion

  P_k = ( Σ_{i=1}^{k} λ_i ) / ( Σ_{j=1}^{n} λ_j ).

• P_k > 0.8 is desirable (according to EH).
• If the proximity matrix is non-Euclidean, then one can use

  P_k = ( Σ_{i=1}^{k} |λ_i|^r ) / ( Σ_{j=1}^{n} |λ_j|^r ),  r = 1, 2.
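
This criterion is straightforward to compute from the eigenvalues returned
by cmdscale; a sketch (the helper name Pk is ours):

Pk <- function(lam, k, r = 1) sum(abs(lam[1:k])^r) / sum(abs(lam)^r)
#e.g., Pk(cmdscale(D, eig = TRUE)$eig, k = 2, r = 1)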

Example: Euclidean distances

X=c(3, 4, 4, 6, 1,
5, 1, 1 ,7, 3,
6, 2, 0, 2, 6,
1, 1, 1, 0, 3,
4, 7, 3 ,6 ,2,
2, 2, 5, 1, 0,
0, 4, 1, 1, 1,
0, 6, 4, 3, 5,
7, 6, 5, 1, 4,
2, 1, 4, 3, 1)

X=matrix(X,nr=10, byrow=TRUE)

Example: Euclidean distances

(D <- dist(X))

## 1 2 3 4 5 6 7 8
## 2 5.196152
## 3 8.366600 6.082763
## 4 7.874008 8.062258 6.324555
## 5 3.464102 6.557439 8.366600 9.273618
## 6 5.656854 8.426150 8.831761 5.291503 7.874008
## 7 6.557439 8.602325 8.185353 3.872983 7.416198 5.000000
## 8 6.164414 8.888194 8.366600 6.928203 6.000000 7.071068 5.744563
## 9 7.416198 9.055385 6.855655 8.888194 6.557439 7.549834 8.831761 7.416198
## 10 4.358899 6.164414 7.681146 4.795832 7.141428 2.645751 5.099020 6.708204
## 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10 8.000000

Example: Euclidean distances

hatX=cmdscale(D, k = 9, eig = TRUE)$points # \hat{X} solution


hatX

## [,1] [,2] [,3] [,4] [,5] [,6]


## [1,] -1.6038325 2.38060903 -2.2301092 -0.3656856 0.11536476 8.421598e-08
## [2,] -2.8246377 -2.30937202 -3.9523782 0.3419185 0.33169405 -2.312861e-08
## [3,] -1.6908272 -5.13970089 1.2880306 0.6503227 -0.05133897 1.074339e-08
## [4,] 3.9527719 -2.43233961 0.3833746 0.6863995 -0.03460933 3.254229e-08
## [5,] -3.5984894 2.75538195 -0.2551393 1.0783741 -1.26125237 -2.345830e-08
## [6,] 2.9520356 1.35475175 -0.1899027 -2.8211220 0.12385813 -3.319300e-08
## [7,] 3.4689928 0.76411068 0.3016531 1.6369166 -1.94209512 6.148806e-09
## [8,] 0.3545235 2.31408566 2.2161772 2.9240116 2.00450379 -6.654403e-09
## [9,] -2.9362323 -0.01279597 4.3117385 -2.5122743 -0.18911558 1.773836e-08
## [10,] 1.9256952 0.32526941 -1.8734445 -1.6188611 0.90299062 3.093294e-09
## [,7]
## [1,] 0.000000e+00
## [2,] 1.565503e-08
## [3,] 1.532955e-08
## [4,] 2.353591e-08
## [5,] 3.588394e-08
## [6,] 6.457705e-09
## [7,] 1.530651e-08
## [8,] 1.652824e-08
## [9,] 2.040312e-08
## [10,] 4.576136e-08

Example: Euclidean distances

hatX=cmdscale(D, k = 9, eig = TRUE)$points # \hat{X} solution


dist(hatX)

## 1 2 3 4 5 6 7 8
## 2 5.196152
## 3 8.366600 6.082763
## 4 7.874008 8.062258 6.324555
## 5 3.464102 6.557439 8.366600 9.273618
## 6 5.656854 8.426150 8.831761 5.291503 7.874008
## 7 6.557439 8.602325 8.185353 3.872983 7.416198 5.000000
## 8 6.164414 8.888194 8.366600 6.928203 6.000000 7.071068 5.744563
## 9 7.416198 9.055385 6.855655 8.888194 6.557439 7.549834 8.831761 7.416198
## 10 4.358899 6.164414 7.681146 4.795832 7.141428 2.645751 5.099020 6.708204
## 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10 8.000000

Example: Euclidean distances

cmdscale(D, k = 9, eig = TRUE)$eig

## [1] 7.518716e+01 5.880560e+01 4.960516e+01 3.042789e+01 1.037419e+01


## [6] 1.086006e-14 5.381235e-15 -5.316641e-15 -8.539862e-15 -1.050854e-14
#five-dimensional solution achieves complete
#recovery of the observed distance matrix.
max(abs(dist(X) - dist(cmdscale(D, k = 5))))

## [1] 1.065814e-14
#comparing the first 5 principal components with the 5-dimensional scaling solution
max(abs(prcomp(X)$x) - abs(cmdscale(D, k = 5)))

## [1] 2.663494e-14
cmdscale(D, k = 3, eig = TRUE)$GOF

## [1] 0.8181726 0.8181726

(cmdscale returns two goodness-of-fit values, using Σ|λ_j| and
Σ max(λ_j, 0) as denominators; they coincide here because all the
eigenvalues are non-negative, D being Euclidean.)

Example: Non-Euclidean (airline) distances of US cities.

library(psychTools)
data(cities)
cities

## ATL BOS ORD DCA DEN LAX MIA JFK SEA SFO MSY
## ATL 0 934 585 542 1209 1942 605 751 2181 2139 424
## BOS 934 0 853 392 1769 2601 1252 183 2492 2700 1356
## ORD 585 853 0 598 918 1748 1187 720 1736 1857 830
## DCA 542 392 598 0 1493 2305 922 209 2328 2442 964
## DEN 1209 1769 918 1493 0 836 1723 1636 1023 951 1079
## LAX 1942 2601 1748 2305 836 0 2345 2461 957 341 1679
## MIA 605 1252 1187 922 1723 2345 0 1092 2733 2594 669
## JFK 751 183 720 209 1636 2461 1092 0 2412 2577 1173
## SEA 2181 2492 1736 2328 1023 957 2733 2412 0 681 2101
## SFO 2139 2700 1857 2442 951 341 2594 2577 681 0 1925
## MSY 424 1356 830 964 1079 1679 669 1173 2101 1925 0

Example

city.loc2 <- cmdscale(cities, k=2,eig=TRUE)


#cmdscale(cities, k=2, eig=TRUE)$points
x <- -city.loc2$points[,1]
y <- -city.loc2$points[,2] # reflect so North is at the top
plot(x, y, xlab = "Coordinate 1", ylab = "Coordinate 2",
     xlim = range(x) * 1.2, type = "n")
text(x, y, labels = colnames(cities), cex = 0.7)

Example

[Figure: the two-dimensional classical MDS solution for the airline
distances (Coordinate 1 vs. Coordinate 2); the recovered configuration of
the 11 cities resembles a map of the US, with SEA/SFO/LAX in the west and
BOS/JFK/DCA in the east.]
Example

lam <- city.loc2$eig


lam

## [1] 1.097898e+07 1.972910e+06 1.335364e+04 1.579915e+03 6.352201e+02


## [6] 5.328605e+01 -1.236913e-10 -1.982622e+02 -1.054745e+03 -4.225182e+03
## [11] -4.352426e+04
cumsum(abs(lam)) / sum(abs(lam)) # non-Euclidean criterion P_k with r = 1

## [1] 0.8434654 0.9950352 0.9960611 0.9961825 0.9962313 0.9962354 0.9962354


## [8] 0.9962506 0.9963316 0.9966562 1.0000000
cumsum(lam^2) / sum(lam^2) # non-Euclidean criterion P_k with r = 2

## [1] 0.9687021 0.9999832 0.9999846 0.9999846 0.9999846 0.9999846 0.9999846


## [8] 0.9999846 0.9999846 0.9999848 1.0000000

Example
par(mfrow = c(2, 1))
plot(1:length(lam), cumsum(lam))
plot(1:length(lam), cumsum(lam^2))
[Figure: cumulative sums of the eigenvalues (top panel, cumsum(lam)) and
of the squared eigenvalues (bottom panel, cumsum(lam^2)) plotted against
1:length(lam); both curves are essentially flat after the first two
eigenvalues.]
Non-metric MDS

• The idea is to find the coordinates X̂_{n×k}, k ≤ q, of the spatial
  representation of the observed dissimilarities that minimize the
  Stress function

  Stress(k) = min { Σ_{i<j} (d̂_ij − d_ij)^2 / Σ_{i<j} d_ij^2 }^{1/2},

  where d̂_ij (the "disparity") is a monotone function of the proximity
  data and d_ij is the Euclidean distance between points i and j in the
  k-dimensional configuration.
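
To make the formula concrete, here is a small sketch that recomputes the
Stress of a fitted solution (using the eurodist data from the datasets
package); Shepard() returns the configuration distances y and the monotone
fits yf, which play the roles of d_ij and d̂_ij:

library(MASS)                          # isoMDS() and Shepard()
fit <- isoMDS(eurodist, trace = FALSE)
sh <- Shepard(eurodist, fit$points)
sqrt(sum((sh$yf - sh$y)^2) / sum(sh$y^2))  # Stress, as a proportion
# should agree (up to rounding) with fit$stress / 100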

House of Representatives Example
Romesburg (1984) shows the number of times 15 congressmen from NJ voted differently on 19 environmental bills.
## Hunt(R) Sandman(R) Howard(D) Thompson(D) Freylinghuysen(R)
## Hunt(R) 0 8 15 15 10
## Sandman(R) 8 0 17 12 13
## Howard(D) 15 17 0 9 16
## Thompson(D) 15 12 9 0 14
## Freylinghuysen(R) 10 13 16 14 0
## Forsythe(R) 9 13 12 12 8
## Widnall(R) 7 12 15 13 9
## Roe(D) 15 16 5 10 13
## Heltoski(D) 16 17 5 8 14
## Rodino(D) 14 15 6 8 12
## Minish(D) 15 16 5 8 12
## Rinaldo(R) 16 17 4 6 12
## Maraziti(R) 7 13 11 15 10
## Daniels(D) 11 12 10 10 11
## Patten(D) 13 16 7 7 11
## Forsythe(R) Widnall(R) Roe(D) Heltoski(D) Rodino(D) Minish(D)
## Hunt(R) 9 7 15 16 14 15
## Sandman(R) 13 12 16 17 15 16
## Howard(D) 12 15 5 5 6 5
## Thompson(D) 12 13 10 8 8 8
## Freylinghuysen(R) 8 9 13 14 12 12
## Forsythe(R) 0 7 12 11 10 9
## Widnall(R) 7 0 17 16 15 14
## Roe(D) 12 17 0 4 5 5
## Heltoski(D) 11 16 4 0 3 2
## Rodino(D) 10 15 5 3 0 1
## Minish(D) 9 14 5 2 1 0
## Rinaldo(R) 10 15 3 1 2 1
## Maraziti(R) 6 10 12 13 11 12
## Daniels(D) 6 11 7 7 4 5
## Patten(D) 10 13 6 5 6 5
## Rinaldo(R) Maraziti(R) Daniels(D) Patten(D)
## Hunt(R) 16 7 11 13
## ...
Voting Example

library("MASS")
data("voting", package = "HSAUR2")
#Kruskal's non-metric MDS, default: k=2
voting_mds <- isoMDS(voting)

## initial value 15.268246


## iter 5 value 10.264075
## final value 9.879047
## converged
#gives final stress achieved (in percent).

Voting Example
Kruskal (1964) guidelines for Stress: Poor = 20%, Fair = 10%, Good = 5%,
Excellent = 2.5%.
[Figure: Stress vs. k — the achieved Stress (in percent, vector stress2)
plotted against the number of dimensions k = 1, . . . , 7; Stress falls
below 10% by k = 2.]
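
A sketch of how such a plot could be produced (the vector name stress2
comes from the figure; simply re-fit isoMDS for each k):

stress2 <- sapply(1:7, function(k)
  isoMDS(voting, k = k, trace = FALSE)$stress)
plot(1:7, stress2, type = "b", xlab = "k", ylab = "Stress (%)")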

Voting Example
[Figure: the two-dimensional non-metric MDS solution for the voting data
(Coordinate 1 vs. Coordinate 2), with the 15 congressmen labeled.]
Voting behaviour is essentially along party lines, although there is more variation
among Republicans. The voting behaviour of one Republican (Rinaldo) seems to
be closer to that of his Democratic colleagues.
Voting Example (Shepard Plot)

voting_sh <- Shepard(voting[lower.tri(voting)], voting_mds$points)

#x-axis = original proximity (dissimilarity) x(i,j)
#y-axis = configuration distance d(i,j), plotted as asterisks
plot(voting_sh, pch = "*", cex = 1, col = "red", xlab = "Dissimilarity",
     ylab = "Distance", xlim = range(voting_sh$x), ylim = range(voting_sh$x))

#the fitted disparities \hat{d}(i,j) are drawn as a step line
lines(voting_sh$x, voting_sh$yf, type = "S", lwd = 2, col = "blue")

Voting Example (Shepard Plot)

[Figure: Shepard diagram for the voting data — configuration distances
(asterisks) plotted against the original dissimilarities, with the
monotone fit overlaid as a step line.]

If the proximities are similarities, the points should form a loose line from
top left to bottom right; if they are dissimilarities, the points should form
a line from bottom left to top right.

A good fit (zero Stress) means that the step line (the fitted d̂_ij) and the
asterisks (the distances d_ij) should coincide.

According to EH, the plot shows some discrepancies between the original
dissimilarities and the multidimensional scaling solution.
Voting Example (vegan package)
library(vegan)
set.seed(7634)
voting_mds2=metaMDS(voting)

## Run 0 stress 0.07328141


## Run 1 stress 0.09555359
## Run 2 stress 0.0768525
## Run 3 stress 0.06938356
## ... New best solution
## ... Procrustes: rmse 0.04121633 max resid 0.1105989
## Run 4 stress 0.0909891
## Run 5 stress 0.09777049
## Run 6 stress 0.07060562
## Run 7 stress 0.06628582
## ... New best solution
## ... Procrustes: rmse 0.04867576 max resid 0.1390413
## Run 8 stress 0.07060575
## Run 9 stress 0.09777044
## Run 10 stress 0.0768525
## Run 11 stress 0.0768525
## Run 12 stress 0.09777068
## Run 13 stress 0.07328141
## Run 14 stress 0.06628577
## ... New best solution
## ... Procrustes: rmse 0.0003259382 max resid 0.00105246
## ... Similar to previous best
## Run 15 stress 0.07328141
## Run 16 stress 0.06796735
## Run 17 stress 0.0768525
## Run 18 stress 0.07328141
## Run 19 stress 0.09777049
## Run 20 stress 0.0768525
## *** Solution reached
#voting_mds2
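
As a possible follow-up (a sketch, not from the slides), vegan's
procrustes() can be used to check how close the metaMDS configuration is
to the earlier isoMDS solution:

#rotate/scale the isoMDS points to best match the metaMDS configuration
pro <- procrustes(voting_mds2, voting_mds$points)
pro        # reports the Procrustes sum of squares
plot(pro)  # arrows show where the two configurations disagree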
Voting Example (vegan package)
The closer the fit statistics R² are to 1, the better the fit.
stressplot(voting_mds2)

[Figure: stress plot from stressplot(voting_mds2) — Ordination Distance
vs. Observed Dissimilarity, with Non-metric fit R² = 0.996 and Linear fit
R² = 0.983.]
Summary

• MDS finds a lower-dimensional representation of the data that mimics
  the original distances or minimizes the "Stress" function.
• MDS starts with an n × n proximity matrix D consisting of
  dissimilarities or distances.
• Classical MDS deals with distances (Euclidean and non-Euclidean).
• Non-metric MDS deals with proximities that are not distances; only
  their rank order is used.
• Use scree or Shepard plots to assess fit.

HW/Lab for this week

1 Explore the eurodist data from library(datasets). How many
  coordinates do we need? What do the Shepard and scree plots say?
  Plot the coordinates using k = 2.

2 Explore the WWIIleaders data
  (http://stat.ethz.ch/Teaching/Datasets/WBL/) using k = 2 and
  non-metric MDS. What do the Shepard and scree plots say about k = 2?
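
A possible starting point for problem 1 (a sketch):

loc <- cmdscale(eurodist, k = 2, eig = TRUE)
cumsum(abs(loc$eig)) / sum(abs(loc$eig)) # how many coordinates?
plot(loc$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(loc$points, labels = labels(eurodist), cex = 0.7)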

