
8

Multidimensional Scaling

In mathematics you don’t understand things. You just get used to them (John von
Neumann, 1903–1957; in Gary Zukav (1979), The Dancing Wu Li Masters).

8.1 Introduction
Suppose that we have n objects and that for each pair of objects a numeric quantity or a rank-
ing describes the relationship between objects. The objects could be geographic locations,
with a distance describing the relationship between locations. Other examples are different
types of food or drink, with judges comparing items pairwise and providing a score for each
pair. Multidimensional Scaling combines such pairwise information into a whole picture of
the data and leads to a visual representation of the relationships.
Visual and geometric aspects have been essential parts of Multidimensional Scaling. For
geographic locations, they lead to a map (see Figure 8.1). From comparisons and rankings
of foods, drinks, perfumes or laptops, one typically reconstructs low-dimensional represen-
tations of the data and displays these representations graphically in order to gain insight
into the relationships between the different objects of interest. In addition to these graphical
representations, in a ranking of wines, for example, we might want to know which features
result in wines that will sell well; the type of grape, the alcohol content and the region might
be of interest. Based on information about pairs of objects, the aims of Multidimensional
Scaling are
1. to construct vectors which represent the objects such that
2. the relationship between pairs of the original objects is preserved as much as possible in
the new pairs of vectors. In particular, if two objects are close, then their corresponding
new vectors should also be close.
The origins of Multidimensional Scaling go back to Young and Householder (1938) and
Richardson (1938) and their interest in psychology and the behavioural sciences. The method
received little attention until the seminal paper of Torgerson (1952), which was followed by a
book (Torgerson, 1958). In the early 1960s, Multidimensional Scaling became a fast-growing
research area with a series of major contributions from Shepard and Kruskal. These include
Shepard (1962a, 1962b), Kruskal (1964a, 1964b, 1969, 1972), and Kruskal and Wish (1978).
Gower (1966) was the first to formalise a concrete framework for this exciting discipline,
and since then, other approaches have been developed. Apart from the classical book by
Kruskal and Wish (1978), the books by Cox and Cox (2001) and Borg and Groenen (2005)
deal with many different issues and topics in Multidimensional Scaling.

Many of the original definitions and concepts have undergone revisions or refinements
over the decades. The notation has stabilised, and a consistent framework has emerged,
which I use in preference to the historical definitions. The distinction between the three
different approaches, referred to as classical, metric and non-metric scaling, is useful, and I
will therefore describe each approach separately.
Multidimensional Scaling is a dimension-reduction method – in the same way that Prin-
cipal Component Analysis and Factor Analysis are: We assume that the original objects
consist of d variables, we attempt to represent the objects by a smaller number of meaning-
ful variables, and we ask the question: How many dimensions do we require? If we want
to represent the new configuration as a map or shape, then two or three dimensions are
required, but for more general configurations, the answer is not so clear. In this chapter we
focus on the following:

• We explore the main ideas of Multidimensional Scaling in their own right.


• We relate Multidimensional Scaling to other dimension-reduction methods and in parti-
cular to Principal Component Analysis.

Section 8.2 sets the framework and introduces classical scaling. We find out about princi-
pal coordinates and the loss criteria ‘stress’ and ‘strain’. Section 8.3 looks at metric scaling,
a generalisation of classical scaling, which admits a range of proximity measures and addi-
tional stress measures. Section 8.4 deals with non-metric scaling, where rank order replaces
quantitative measures of distance. This section includes an extension of the strain crite-
rion to the non-metric environment. Section 8.5 considers data and their configurations
from different perspectives: I highlight advantages of the duality between X and X^T for
high-dimensional data and show how to construct a configuration for multiple data sets.
We look at results from Procrustes Analysis which explain the relationship between multi-
ple configurations. Section 8.6 starts with a relative of Multidimensional Scaling for count
or quantitative data, Correspondence Analysis. It looks at Multidimensional Scaling for
data that are known to belong to different classes, and we conclude with developments of
embeddings that integrate local information such as cluster centres and landmarks. Problems
pertaining to the material of this chapter are listed at the end of Part II.

8.2 Classical Scaling


Multidimensional Scaling is driven by data. It is possible to formulate Multidimensional
Scaling for the population and pairs of random vectors, but I will focus on n random vectors
and data.
The ideas of scaling have existed since the 1930s, but Gower (1966) first formalised them
and derived concrete solutions in a transparent manner. Gower’s framework – known as
Classical Scaling – is based entirely on distance measures.
We begin with a general framework: the observed objects, dissimilarities and criteria for
measuring closeness. The dissimilarities of the non-classical approaches are more general
than the Euclidean distances which Gower used in classical scaling. Measures of proxim-
ity, including dissimilarities and distances, are defined in Section 5.3.2. Unless otherwise
specified, Δ_E denotes the Euclidean distance or norm in this chapter.

Table 8.1 Ten Cities from Example 8.1


1 Tokyo 6 Jakarta
2 Sydney 7 Hong Kong
3 Singapore 8 Hiroshima
4 Seoul 9 Darwin
5 Kuala Lumpur 10 Auckland

 
Definition 8.1 Let X = [X_1 ··· X_n] be data. Let

O = {O_1, ..., O_n}

be a set of objects such that the object O_i is derived from or related to X_i. Let δ be a
dissimilarity for pairs of objects from O. We call

δ_{ik} = δ(O_i, O_k)

the dissimilarity of O_i and O_k, and {O, δ} the observed data or the (pairwise) observations.

The dissimilarities between objects are the quantities we observe. The objects are binary
or categorical variables, random vectors or just names, as in Example 8.1. The underlying
d-dimensional random vectors Xi are generally not observable or not available.

Example 8.1 We consider the ten cities listed in Table 8.1. The cities, given by their names,
are the objects. The dissimilarities δ_{ik} of the objects are given in (8.1) as distances in
kilometres between pairs of cities. In the matrix D = (δ_{ik}) of (8.1), the entry δ_{2,5} = 6,623
refers to the distance between the second city, Sydney, and the fifth city, Kuala Lumpur.
Because δ_{ik} = δ_{ki}, I show the δ_{ik} for i ≤ k in (8.1) only:

D (upper triangle):

0  7825  5328  1158  5329  5795  2893   682  5442  8849
      0  6306  8338  6623  5509  7381  7847  3153  2157
            0  4681   317   892  2588  4732  3351  8418
                  0  4616  5299  2100   604  5583  9631
                        0  1183  2516  4711  3661  8736
                              0  3266  5258  2729  7649
                                    0  2236  4274  9148
                                          0  5219  9062
                                                0  5142
                                                      0        (8.1)

The aim is to construct a map of these cities from the distance information alone. We
construct this map in Example 8.2.

Definition 8.2 Let {O, δ} be the observed data, with O = {O_1, ..., O_n} a set of n objects.
Fix p > 0. An embedding of objects from O into R^p is a one-to-one map f: O → R^p. Let
Δ be a distance, and put ϑ_{ik} = Δ[f(O_i), f(O_k)]. For functions g and h and positive weights
w_{ik}, with i, k ≤ n, the (raw) stress, regarded as a function of δ and f, is

Stress(δ, f) = [ Σ_{i<k}^{n} w_{ik} [g(δ_{ik}) − h(ϑ_{ik})]² ]^{1/2}.   (8.2)

If f* minimises Stress over embeddings into R^p, then

W = [W_1 ··· W_n]   with   W_i = f*(O_i)

is a p-dimensional (stress) configuration for {O, δ}, and the Δ(W_i, W_k) are the (pairwise)
configuration distances.

The stress extends the idea of the squared error, and the (stress) configuration, the min-
imiser of the stress, corresponds to the least-squares solution. In Section 5.4 we looked at
feature maps and embeddings. A glance back at Definition 5.6 tells us that R p corresponds
to the space F , f and f∗ are feature maps and the configuration f∗ (Oi ) with i ≤ n represents
the feature data.
A configuration depends on the dissimilarities δ, the functions g and h and the weights
w_{ik}. In classical scaling, we have

δ = Δ_E, the Euclidean distance,   g and h identity functions,   and   w_{ik} = 1.

Indeed, δ = Δ_E characterises classical scaling. We take g to be the identity function, as
done in the original classical scaling. This simplification is not crucial; see Meulman (1992,
1993), who allows more general functions g in an otherwise classical scaling framework.

8.2.1 Classical Scaling and Principal Coordinates


Let X be data of size d × n. Let {O, δ, f} be the observed data together with an embedding f
into R^p for some p ≤ d. Put

δ_{ik} = Δ_E(X_i, X_k)   and   ϑ_{ik} = Δ_E[f(O_i), f(O_k)]   for i, k ≤ n;   (8.3)

then the stress (8.2) leads to the notions

raw classical stress:

Stress_clas(δ, f) = [ Σ_{i<k}^{n} (δ_{ik} − ϑ_{ik})² ]^{1/2},   and

classical stress:

Stress*_clas(δ, f) = Stress_clas(δ, f) / [ Σ_{i<k}^{n} δ_{ik}² ]^{1/2}
                  = [ Σ_{i<k}^{n} (δ_{ik} − ϑ_{ik})² / Σ_{i<k}^{n} δ_{ik}² ]^{1/2}.   (8.4)
The classical stress is a standardised version of the raw stress. This standardisation is not
part of the generic definition (8.2) but is commonly used in classical scaling. If there is no
ambiguity, I will drop the subscript clas.
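As a concrete illustration, the raw and standardised classical stress of (8.4) can be computed directly from a dissimilarity matrix and a candidate configuration. The sketch below is a minimal NumPy version; the function and variable names are illustrative and not taken from the text.

```python
import numpy as np

def classical_stress(delta, config):
    """Raw and standardised classical stress as in (8.4).

    delta  : (n, n) symmetric matrix of dissimilarities delta_ik.
    config : (p, n) configuration W; its Euclidean distances give theta_ik.
    """
    n = delta.shape[0]
    diff = config[:, :, None] - config[:, None, :]   # (p, n, n) pairwise differences
    theta = np.sqrt((diff ** 2).sum(axis=0))         # configuration distances theta_ik
    iu = np.triu_indices(n, k=1)                     # pairs with i < k
    raw = np.sqrt(((delta[iu] - theta[iu]) ** 2).sum())
    standardised = raw / np.sqrt((delta[iu] ** 2).sum())
    return raw, standardised
```

For the city example below, delta would be the symmetrised version of the matrix in (8.1).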
For the classical setting, Gower (1966) proposed a simple construction of a p-dimensional
configuration W.
 
Theorem 8.3 Let X = X1 · · · Xn be centred data, and let {O , } be the observed data with
ik = (Xi − Xk ) (Xi − Xk ). Put Q n = X X, and let r be the rank of Q n .
2 T T

1. The n × n matrix Q n has entries


( !
1 2 n 1 n
qik = − 2
ik − ∑ 2
ik + 2 ∑ 2
ik .
2 n k=1 n i,k=1

2. If Q_n = V D² V^T is the spectral decomposition of Q_n, then

W = D V^T   (8.5)

defines a configuration in r variables which minimises Stress_clas. The configuration (8.5)
is unique up to multiplication on the left by an orthogonal r × r matrix.
Gower (1966) coined the phrases principal coordinates for the configuration vectors Wi
and Principal Coordinate Analysis for his method of constructing the configuration. The
term Principal Coordinate Analysis has become synonymous with classical scaling.
In Theorem 8.3, the dissimilarities δ_{ik} are assumed to be Euclidean distances of pairs
of random vectors X_i and X_k. In practice, the δ_{ik} are observed quantities, and we have no
way of checking whether they are pairwise Euclidean distances. This lack of information
does not detract from the usefulness of the theorem, which tells us how to exploit the δ_{ik}
in the construction of the principal coordinates. Further, if we know X, then the principal
coordinates are the principal component data.

Proof  From the definition of the δ_{ik}², it follows that δ_{ik}² = X_i^T X_i + X_k^T X_k − 2 X_i^T X_k = δ_{ki}². The
last equality follows by symmetry. The X_i are centred, so taking sums over the indices i, k ≤ n
leads to

(1/n) Σ_{k=1}^{n} δ_{ik}² = X_i^T X_i + (1/n) Σ_{k=1}^{n} X_k^T X_k   and   (1/n²) Σ_{i,k=1}^{n} δ_{ik}² = (2/n) Σ_{i=1}^{n} X_i^T X_i.   (8.6)

Combine (8.6) with q_{ik} = X_i^T X_k, and it follows that

q_{ik} = −(1/2) ( δ_{ik}² − X_i^T X_i − X_k^T X_k )
      = −(1/2) ( δ_{ik}² − (1/n) Σ_{j=1}^{n} δ_{ij}² + (1/n) Σ_{j=1}^{n} X_j^T X_j − (1/n) Σ_{j=1}^{n} δ_{jk}² + (1/n) Σ_{j=1}^{n} X_j^T X_j )
      = −(1/2) ( δ_{ik}² − (1/n) Σ_{j=1}^{n} δ_{ij}² − (1/n) Σ_{j=1}^{n} δ_{jk}² + (1/n²) Σ_{j,l=1}^{n} δ_{jl}² ).

To show part 2, we observe that the spectral decomposition Q_n = V D² V^T has a diagonal
matrix D² with n − r zero eigenvalues because Q_n has rank r. It follows that
Q_n = V_r D_r² V_r^T, where V_r consists of the first r eigenvectors of Q_n and D_r² is the corresponding
diagonal r × r matrix. The matrix W of (8.5) is a configuration in r variables because W = f(O), and
f minimises the classical stress Stress_clas by Result 5.7 of Section 5.5. The uniqueness of
W – up to pre-multiplication by an orthogonal matrix – also follows from Result 5.7.

In Theorem 8.3, the rank r of Q n is the dimension of the configuration. In practice, a
smaller dimension p may suffice or even be preferable. The following algorithm tells us
how to find p-dimensional coordinates.

Algorithm 8.1 Principal Coordinate Configurations in p Dimensions

Let {O, δ} with δ = Δ_E be the observed data derived from X = [X_1 ··· X_n].
Step 1. Construct the matrix Q_n from δ as in part 1 of Theorem 8.3.
Step 2. Determine the rank r of Q_n and its spectral decomposition Q_n = V D² V^T.
Step 3. For p ≤ r, put

W_p = D_p V_p^T,   (8.7)

where D_p is the p × p diagonal matrix consisting of the first p diagonal elements of D, and
V_p is the n × p matrix consisting of the first p eigenvectors of Q_n.
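A minimal NumPy sketch of Algorithm 8.1 is given below. It assumes that the full symmetric dissimilarity matrix is available (for the cities, the symmetrised matrix of (8.1)) and that the dissimilarities are Euclidean, as classical scaling requires; the function and variable names are illustrative.

```python
import numpy as np

def principal_coordinates(delta, p):
    """Algorithm 8.1: p-dimensional principal coordinate configuration W_p.

    delta : (n, n) symmetric matrix of dissimilarities, assumed Euclidean.
    p     : target dimension, p <= rank of Q_n.
    """
    n = delta.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    Qn = -0.5 * C @ (delta ** 2) @ C             # Step 1: Q_n as in part 1 of Theorem 8.3
    evals, evecs = np.linalg.eigh(Qn)            # Step 2: spectral decomposition
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    D_p = np.sqrt(np.clip(evals[:p], 0, None))   # first p singular values (clip guards round-off)
    V_p = evecs[:, :p]
    return D_p[:, None] * V_p.T                  # Step 3: W_p = D_p V_p^T, size p x n
```

For the ten cities, plotting the two rows of principal_coordinates(delta, 2) against each other reproduces a map such as Figure 8.1 up to rotation and reflection.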

If the X were known and centred, then the p-dimensional principal coordinates W_p would
equal the principal component data W(p) of (2.6) in Section 2.3. To see this, we use the
relationship between the singular value decomposition of X and the spectral decompositions
of the Q_d- and Q_n-matrices of X. Starting with the principal components U^T X as in
Section 5.5, for p ≤ d, we obtain

U_p^T X = U_p^T U D V^T = I_{p×d} D V^T = D_p V_p^T = W_p.

The calculation shows that we can construct the principal components of X, but we cannot –
even approximately – reconstruct X unless we know the orthogonal matrix U.
We construct principal coordinates for the ten cities.

Example 8.2 We continue with the ten cities and construct the two-dimensional configuration
(8.7) as in Algorithm 8.1 from the distances (8.1). The five non-zero eigenvalues of
Q_n are

λ_1 = 10,082 × 10^4,  λ_2 = 3,611 × 10^4,  λ_3 = 48 × 10^4,  λ_4 = 6 × 10^4,  λ_5 = 0.02 × 10^4.

The eigenvalues decrease rapidly, and Stress_clas = 419.8 for r = 5. Compared with the
size of the eigenvalues, this stress is small and shows good agreement between the original
distances and the distances calculated from the configuration.
Figure 8.1 shows a configuration for p = 2. The left subplot shows the first coordinates
of W2 on the x-axis and the second on the y-axis. The cities Tokyo, Hiroshima and Seoul
form the group of points on the top left, and the cities Kuala Lumpur, Singapore and Jakarta
form another group on the bottom left.
For a more natural view of the map, we consider the subplot on the right, which is obtained
by a 90-degree rotation. Although the geographic north is not completely aligned with the
vertical direction, the map on the right gives a good impression of the location of the cities.
For p = 2, the raw classical stress is 368.12, and the classical stress is 0.01. This stress shows
that the reconstruction is excellent, and we have achieved an almost perfect map from the
configuration W2 .

Figure 8.1 Map of locations of ten cities from Example 8.2. Principal coordinates (left) and
rotated principal coordinates (right).

8.2.2 Classical Scaling with Strain

The Stress compares dissimilarities of pairs (O_i, O_k) with distances of the pairs
[f(O_i), f(O_k)] and does not require knowledge of the (X_i, X_k). In a contrasting approach,
Torgerson (1952) starts with the observations X_1, ..., X_n and defines a loss criterion called
strain. The notion of strain was revitalised and extended in the 1990s in a number of papers,
including Meulman (1992, 1993) and Trosset (1998). The following definition of strain is
informed by these developments rather than the original work of Torgerson.

Definition 8.4 Let X = [X_1 ··· X_n] be centred data, and let r be the rank of X. For κ ≤ r, let
A be a κ × d matrix, and write AX for the κ-dimensional transformed data. Fix p ≤ r. Let f
be an embedding of X into R^p. The (classical) strain between AX and f(X) is

Strain[AX, f(X)] = ‖ (AX)^T AX − [f(X)]^T f(X) ‖²_Frob,   (8.8)

where ‖·‖_Frob is the Frobenius norm of Definition 5.2 in Section 5.3.
If f* minimises the Strain over all embeddings into R^p, then W = f*(X) is the
p-dimensional (strain) configuration for the transformed data AX.

The strain criterion starts with the transformed data AX and searches for the best
p-dimensional embedding. Because AX and f(X) have different dimensions, they are not
directly comparable. However, their respective Q n -matrices ( AX)T AX and [f(X)]T f(X) (see
(5.17) in Section 5.5), are both of size n × n and can be compared. To arrive at a configu-
ration, we fix the transformation A and then find the embedding f which minimises the loss
function.
I do not exclude the case AX = X in the definition of strain. In this case, one wants to find
the best p-dimensional approximation to the data, where ‘best’ refers to the loss defined by
the Strain. Let X be centred, and let U D V^T be the singular value decomposition of X. For
W = D_p V_p^T, as in (8.7), it follows that

Strain(X, W) = Σ_{j=p+1}^{r} d_j^4,   (8.9)

where the d_j are the singular values of X. Further, if W_2 = E D_p V_p^T and E is an
orthogonal p × p matrix, then Strain(X, W_2) = Strain(X, W). I defer the proof of this result
to the Problems at the end of Part II.
If the rank of X is d and the variables of X have different ranges, then we could take A to
be the scaling matrix S_diag^{−1/2} of Section 2.6.1.

Proposition 8.5 Let X be centred data with singular value decomposition X = U D V^T. Let
r be the rank of the sample covariance matrix S of X. Put A = S_diag^{−1/2}, where S_diag is the
diagonal matrix (2.18) of Section 2.6.1. For p < r, put
 
W = f(X) = ( S_diag^{−1/2} D )_p V_p^T.

Then the strain between AX and W is

Strain(AX, W) = (r − p)².


Proof The result follows from the fact that the trace of the covariance matrix of the scaled
data equals the rank of X. See Theorem 2.17 and Corollary 2.19 of Section 2.6.1.
Similar to the Stress, the Strain is a function of two variables, and each loss function
finds the embedding or feature map which minimises the respective loss. Apart from
the fact that the strain is defined as a squared error, there are other differences between the
two criteria. The stress compares real numbers: for pairs of objects, we calculate the difference
between their dissimilarities and the distance of their feature vectors. This requires the
choice of a dissimilarity. The strain compares n × n matrices: the transformed data matrix
and the feature data, both at the level of the Q_n-matrices of (5.17) in Section 5.5. The strain
requires a transformation A and does not use dissimilarities between objects. However, if
we think of the AX as the observed objects and use the Euclidean norm as the dissimilarity,
then the two loss criteria are essentially the same.
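A minimal sketch of the classical strain (8.8), computed from the two n × n Gram matrices (the names are illustrative, not from the text):

```python
import numpy as np

def classical_strain(AX, fX):
    """Classical strain (8.8) between transformed data AX (kappa x n)
    and an embedding fX (p x n), compared via their n x n Gram matrices."""
    G_data = AX.T @ AX          # (AX)^T AX
    G_feat = fX.T @ fX          # f(X)^T f(X)
    return np.linalg.norm(G_data - G_feat, ord='fro') ** 2
```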
The next example explores the connection between observed data {O, δ} and AX and
examines the configurations which result for different {O, δ}.

Example 8.3 The athletes data were collected at the Australian Institute of Sport by
Richard Telford and Ross Cunningham (see Cook and Weisberg 1999). For the 102 male
and 100 female athletes, the twelve variables are listed in Table 8.2, including the variable
number and an abbreviation for each variable.
We take X to be the 11 × 202 data consisting of all variables except variable 9, sex. A
principal component analysis of the scaled data (not shown here) reveals that the first two
principal components separate the male and female data almost completely. A similar split
does not occur for the raw data, and I therefore work with the scaled data, which I refer to
as X in this example.
The first principal component eigenvector of the scaled data has highest absolute weight
for LBM (variable 7), second highest for Hg (variable 5), and third highest for Hc (variable

Table 8.2 Athletes Data from Example 8.3

1 Bfat  Body fat                         7  LBM  Lean body mass
2 BMI   Body mass index                  8  RCC  Red cell count
3 Ferr  Plasma ferritin concentration    9  Sex  0 for male, 1 for female
4 Hc    Haematocrit                     10  SSF  Sum of skin folds
5 Hg    Haemoglobin                     11  WCC  White cell count
6 Ht    Height                          12  Wt   Weight

Table 8.3 Combinations of {O, δ}, κ and p from Example 8.3

   O    Variables        κ    p   Fig. 8.2, position
1  O1   5 and 7          2    2   Top left
2  O2   10 and 11        2    2   Bottom left
3  O3   4, 5 and 7       3    2   Top middle
4  O3   4, 5 and 7       3    3   Bottom middle
5  O4   All except 9     11   2   Top right
6  O4   All except 9     11   3   Bottom right

4); SSF (variable 10) and WCC (variable 11) have the two lowest PC1 weights in absolute
value. We want to examine the relationship between the PC weights and the effectiveness of the
corresponding configurations in splitting the male and female athletes. For this purpose, I
choose four sets of objects O_1, ..., O_4 which are constructed from subsets of variables of
the scaled data, as shown in Table 8.3. For example, the set O_1 is made up of the two ‘best’
PC1 variables, so O_1 = {O_i = (X_{i,5}, X_{i,7}): i ≤ n}.
We construct matrices A_ℓ such that A_ℓ X = O_ℓ for ℓ = 1, ..., 4 and X the scaled data.
Thus, A_ℓ projects the scaled data X onto the scaled variables that define the set of objects
O_ℓ. To obtain the set O_1 of Table 8.3, we take A_1 = (a_{ij}) to be the 2 × 11 matrix with entries
a_{1,5} = a_{2,7} = 1 and all other entries equal to zero. In the table, κ is the dimension of A_ℓ X, as in
Definition 8.4.
As standard in classical scaling, I choose the Euclidean distances as the dissimilarities,
and for the four {O_ℓ, δ} combinations, I calculate the classical stress (8.4) and the strain (8.8)
and find the configurations for p = 2 or 3, as shown in Table 8.3.
The stress and strain configurations look very similar, so I only show the strain con-
figurations. Figure 8.2 shows the 2D and 3D configurations with the placement of the
configurations as listed in Table 8.3. The first coordinate of each configuration is shown
on the x-axis, the second on the y-axis and the third – when applicable – on the
z-axis. The red points show the female observations, and the blue points show the male
observations.
A comparison of the three configurations in the top row of the figure shows that the
male and female points become more separate as we progress from O_1 on the left to O_4 on the
right, with hardly any overlap in the last case. Similarly, the right-most configuration in the
bottom row is better than the middle one at separating males and females. The bottom-left
plot shows the configuration obtained from O2 . These variables, which have the lowest PC1
weights, do not separate the male and female data.

2
2 2

0
0 0
−2

−2
−2 0 2 −2 0 2 −2 0 2

2
0.5
2
0
0 −0.5 0

2 −2 2
−2 0
0 2 −2
0
−2 −2 −2 0 2
−2 0 2

Figure 8.2 Two- and three-dimensional configurations for different {O, } from Table 8.3 and
Example 8.3; female data in red, male in blue.

The classical stress and strain are small in the configurations shown in the left and middle
columns. For the configurations in the right column, the strain decreases from 9 to 8, and
the normalised classical stress decreases from 0.0655 to 0.0272 when going from the 2D to
the 3D configuration.
The calculations show the dependence of the configuration on the objects. A ‘careless’
choice of objects, as in O2 , leads to configurations which may hide the structure of the
data.

For the athletes data, the four sets of objects O1 , . . ., O4 result in different configurations.
A judicious choice of O can reveal the structure of the data, whereas the hidden structure
may remain obscure for other choices of O . In the Problems at the end of Part II, we consider
other choices of O in order to gain a better understanding of the change in configurations with
the number of variables. Typically, the objects are given, and we have no control over how they
are obtained or how well they represent the structure in the data. As a consequence, we may
not be able to find structure that is present in the data.
In Example 8.3, I have fixed A, but one could fix the row size κ of A only and then
optimise the strain for both A and embeddings f. Meulman (1992) considered this latter
case and optimised the strain in a two-step process: find A and update, find f and the
configuration W and update and then repeat the two steps. The sparse principal compo-
nent criterion of Zou, Hastie, and Tibshirani (2006), which I describe in Definition 13.10
of Section 13.4.2, relies on the same two-step updating of their norms, which are closely
related to the Frobenius norm in the strain criterion.

8.3 Metric Scaling


In classical scaling, the dissimilarities are Euclidean distances, and principal coordinate con-
figurations provide lower-dimensional representations of the data. These configurations are

Table 8.4 Common Forms of Stress

Name              g            h           Squared stress
Classical stress  g(t) = t     h(t) = t    Σ_{i<k}^{n} (δ_{ik} − ϑ_{ik})²
Linear stress     g(t) = a+bt  h(t) = t    Σ_{i<k}^{n} [(a + b δ_{ik}) − ϑ_{ik}]²
Metric stress     g(t) = t     h(t) = t    (Σ_{i<k}^{n} δ_{ik}²)^{−1} Σ_{i<k}^{n} (δ_{ik} − ϑ_{ik})²
Sstress           g(t) = t²    h(t) = t²   (Σ_{i<k}^{n} δ_{ik}⁴)^{−1} Σ_{i<k}^{n} (δ_{ik}² − ϑ_{ik}²)²
Sammon stress     g(t) = t     h(t) = t    (Σ_{i<k}^{n} δ_{ik})^{−1} Σ_{i<k}^{n} (δ_{ik} − ϑ_{ik})²/δ_{ik}

linear in the data. Metric and non-metric scaling include non-linear solutions, and this fact
distinguishes Multidimensional Scaling from Principal Component Analysis. The avail-
ability of non-linear optimisation routines, in particular, has led to renewed interest in
Multidimensional Scaling.
The distinction between metric and non-metric scaling is not always made, and this may
not matter to the practitioner who wants to obtain a configuration for his or her data. I
follow Cox and Cox (2001), who use the term Metric Scaling for quantitative dissimilarities
and Non-metric Scaling for rank-order dissimilarities. We begin with metric scaling and
consider non-metric scaling in Section 8.4.

8.3.1 Metric Dissimilarities and Metric Stresses


We begin with the observed data {O, δ} consisting of n objects O_i and dissimilarities δ_{ik}
between pairs of objects. For embeddings f from O into R^p for some p ≥ 1 and distances
ϑ_{ik} = Δ[f(O_i), f(O_k)], we measure the disparity between the δ_{ik} and the ϑ_{ik} with the
stress (8.2).
In classical scaling, essentially a single stress function is used, namely the raw classical
stress or its standardised version (8.4), whereas metric scaling incorporates a number of stresses
which differ in their functions g and h and in their normalising factors. Table 8.4 lists stress
criteria that are common in metric scaling. For notational convenience, I give expressions for
the squared stress in Table 8.4; this avoids having to include square roots in each expression.
For completeness, I have included the classical stress in this list.
The classical stress and the metric stress differ by the normalising constant Σ_{i<k} δ_{ik}². The linear
stress represents an attempt at comparing non-Euclidean dissimilarities and configuration
distances while preserving linearity. Improved computing facilities have made this stress
less interesting in practice. The sstress is also called squared stress, but I will not use
the latter term because the sstress is not the square of the stress. The non-linear Sammon
stress of Sammon (1969) incorporates the dissimilarities in a non-standard form and, as a
consequence, can lead to interesting results.
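The stresses of Table 8.4 differ only in g, h and the normalising factor, so they can all be computed from the same pair of vectors of dissimilarities and configuration distances. The sketch below gives the squared versions, as in the table; the names are illustrative.

```python
import numpy as np

def squared_stresses(delta, theta):
    """Squared metric stresses of Table 8.4 for vectors of pairwise
    dissimilarities delta_ik and configuration distances theta_ik, pairs i < k.
    The Sammon stress requires strictly positive dissimilarities."""
    delta, theta = np.asarray(delta, float), np.asarray(theta, float)
    metric = ((delta - theta) ** 2).sum() / (delta ** 2).sum()
    sstress = ((delta ** 2 - theta ** 2) ** 2).sum() / (delta ** 4).sum()
    sammon = (((delta - theta) ** 2) / delta).sum() / delta.sum()
    return {'metric': metric, 'sstress': sstress, 'sammon': sammon}
```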

Unlike the classical stress, the stresses in metric scaling admit general dissimilarities. We
explore different dissimilarities and stresses in examples. As we shall see, some data give
rise to very different configurations when we vary the dissimilarities or stresses, whereas for
other data different stresses hardly affect the configurations.

Example 8.4 We return to the illicit drug market data which have seventeen different
series measured over sixty-six months. We consider the months as the variables, so we have
a high-dimension low sample size (HDLSS) problem. Multidimensional Scaling is partic-
ularly suitable for HDLSS problems because it replaces the d × d covariance-like matrix
Q d of (5.17) in Section 5.5 with the much smaller n × n dual matrix Q n . I will return to
this duality in Section 8.5.1.
For the scaled data, I calculate 1D and 2D sstress configurations W1 and W2 using the
Euclidean distance for the dissimilarities. Figure 8.3 shows the resulting configurations: the
graph in the top left shows W1 and that in the top right the first dimension of W2 , both
against the series number on the x-axis. At first glance, these plots look very similar. I show
them both because a closer inspection reveals that they are not the same. Series 15 is negative
in the left plot but has become positive in the plot on the right. A change such as this would
not occur for the principal component solution of classical scaling. Here it is a consequence
of the non-linearity of the sstress. Further, the sstress of the 2D configuration is smaller than
that of the 1D configuration: it decreases from 0.1974 for W1 to 0.1245 for W2 .
Calculations with different dissimilarities and stresses reveal that the resulting configura-
tions are similar for these data. For this reason, I do not show the other configurations.
The two panels in the lower part of Figure 8.3 show scatterplots of the dissimilarities
on the x-axis versus the configuration distances on the y-axis, with a point for each pair of

5 5

0 0

−5 −5

0 10 0 10

15 15

10 10

5 5

0 0
5 10 15 5 10 15
Figure 8.3 1D configuration W1 and first coordinate of W2 in the top row. Corresponding plots
of configuration distances versus dissimilarities for Example 8.4 below.

observations. The left panel refers to W1 , and the right panel refers to W2 . The black scat-
terplot on the right is tighter, and as a result, the sstress is smaller. In both plots there is
a larger spread for small distances because they contribute less to the overall loss and are
therefore not as important.
The plot in the top-right panel of Figure 8.3 is the same as the right panel of Figure 6.11 in
Section 6.5.2, apart from a sign change. These two plots divide the seventeen observations
into the same two groups, although different techniques are used in the two analyses.
The recurrence of the same split of the data by two different methods shows the robustness
of this split, a further indication that the two groups are really present in the data.

The next example shows the diversity of 2D configurations from different dissimilarities
and stresses.

Example 8.5 We continue with the athletes data, which consist of 100 female and 102 male
athletes. In Example 8.3 we compared the classical stress and the strain configurations, and
we observed that the 2D configurations are able to separate the male and female athletes.
In these calculations I used Euclidean distances as dissimilarities. Now we explore different
dissimilarities and stresses and examine whether the new combinations are able to separate
the male and female athletes.
As in the preceding example, I work with the scaled data and use all variables except
the variable sex. Figure 8.4 displays six different configurations. The top row is based on
the cosine distances (5.6) of Section 5.3.1 as the dissimilarities, and different stresses: the
(metric) stress in the left plot, the sstress in the middle plot and the Sammon stress in the
right plot. Females are shown in red, males in blue. In the bottom row I keep the stress fixed

0.8
0.8

0 0 0

−0.8
−0.8 −0.8
−0.8 0 0.8 −0.8 0 0.8 −0.8 0 0.8

0.8
4 2.5

0 0
0

−2.5
−4
−0.8
−4 0 4 −0.8 0 0.8 −2.5 0 2.5
Figure 8.4 Configurations for Example 8.5, females in red, males in blue or black. (Top row):
Cosine distances; from left to right: stress, sstress and Sammon stress. (Bottom row): Stress;
from left to right: Euclidean, correlation and ∞ distance.

and vary the dissimilarities: the Euclidean distance in the left plot, the correlation distance
(5.7) in the middle plot and the ∞ distance (5.5), both from Section 5.3.1, on the right.
The female athletes are shown in red and the male athletes in black. The figure shows that
the change in dissimilarity affects the pattern of the configurations more than a change in
the type of stress. The cosine distances separate the males and females more clearly than the
norm-based distances. The sstress (middle top) results in the tightest configuration but takes
six times longer to calculate than the stress. The Sammon stress takes about 2.5 times as
long as the stress. For these relatively small data sets, the computation time may not matter,
but it may become important for large data sets.
There is no ‘right’ or ‘wrong’ answer; the different dissimilarities and stresses result in
configurations which expose different information inherent in the data. It is a good idea to
calculate more than one configuration and to look for interpretations these configurations
allow.

When we vary the dissimilarities or the loss, we solve a different problem each time.
There is no single dissimilarity or stress that produces the ‘best’ result, and I therefore
recommend the use of different dissimilarities and stresses. The cosine dissimilarity often leads
to insightful interpretations of the data. We noticed a similar phenomenon in Cluster Analysis
(see Figure 6.2 in Example 6.1 of Section 6.2). As the dimension of the data increases,
the angle between two vectors provides a measure of closeness, and for HDLSS problems
in particular, the angle between vectors has become a standard tool for assessing the convergence
of observations to a given vector (see Johnstone and Lu 2009 and Jung and Marron
2009).
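Dissimilarities of the kind compared in Figure 8.4 can be computed with standard routines. A brief sketch using scipy follows; the metric names are scipy's conventions, with 'chebyshev' playing the role of the ∞ distance, and the claim that they mirror Example 8.5 is my reading rather than the text's.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dissimilarity_matrix(X, metric='euclidean'):
    """n x n dissimilarity matrix for data X of size d x n.

    metric: 'euclidean', 'cosine', 'correlation' or 'chebyshev' (the sup norm)."""
    return squareform(pdist(X.T, metric=metric))   # pdist expects observations as rows
```

The resulting matrix can then be passed to whichever stress or strain minimiser is being used.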

8.3.2 Metric Strain


So far we have focused on different types of stress for measuring the loss between dis-
similarities and configuration distances. Interestingly, the stress and sstress criteria were
proposed initially for non-metric scaling and were later adapted to metric scaling. In con-
trast, the strain criterion (8.8) is mostly associated with classical and metric scaling. Indeed,
the strain is a natural measure of discrepancy for metric scaling and relies on results from
distance geometry. I present the version given in Trosset (1998), but it is worth noting that
the original ideas go back to Schoenberg (1935) and Young and Householder (1938).
Theorem 8.7 requires additional notation, which we establish first.
Definition 8.6 Let A = (a_{ik}) and B = (b_{ik}) be m × m matrices. The Hadamard product
or Schur product of A and B is the matrix A ∘ B whose elements are defined by the
elementwise product

(A ∘ B)_{ik} = a_{ik} b_{ik}.   (8.10)

Let 1_{m×1} be the column vector of ones. The m × m centring matrix C is defined by

C = C_{m×m} = I_{m×m} − (1/m) 1_{m×1} (1_{m×1})^T,   (8.11)

that is, the m × m matrix with diagonal entries (m − 1)/m and off-diagonal entries −1/m.
The centring transformation τ maps an m × m matrix A to the matrix

τ(A) = −(1/2) C A C.

Using the newly defined terms, the matrix Q_n of part 1 in Theorem 8.3 becomes

Q_n = τ(P ∘ P) = −(1/2) C (P ∘ P) C,   where P = (δ_{ik}).   (8.12)
 
Theorem 8.7 [Trosset (1998)] Let X = [X_1 ··· X_n] be data of rank r. Let δ be dissimilarities
defined on X, and let P = (δ_{ik}) be the n × n matrix of dissimilarities.
1. There is a configuration W whose distance matrix Δ equals P if and only if τ(P ∘ P) is
a symmetric positive semidefinite matrix of rank at most r.
2. If the r × n configuration W satisfies τ(P ∘ P) = W^T W, then Δ = P, where Δ is the
distance matrix of W.
The theorem provides conditions for the existence of a configuration with the required
properties. If these conditions do not hold, then we can still use the ideas of the theorem and
find configurations which minimise the difference between τ(P ∘ P) and τ(Δ ∘ Δ).
 
Definition 8.8 Let X = [X_1 ··· X_n] be data with a dissimilarity matrix P. Let r be the rank
of X. Let Δ be an n × n matrix whose elements can be realised as pairwise distances of
points in R^r. The metric strain Strain_met, regarded as a function of P and Δ, is

Strain_met(P, Δ) = ‖ τ(P ∘ P) − τ(Δ ∘ Δ) ‖²_Frob.   (8.13)

Unlike the classical strain (8.8), which is a function of the transformed data AX and the
feature data f(X), the metric strain is more naturally based on the dissimilarity matrix P.
The matrix Δ corresponds to the matrix of distances of f(X) ⊂ R^r. With this identification,
we want to find an embedding f* with W = f*(X) and a distance matrix Δ* which
minimises (8.13).
I have defined the metric strain as a function of Δ. The definition shows that Strain_met
depends on Δ only through τ(Δ ∘ Δ). Putting B = τ(Δ ∘ Δ), it is convenient to write Strain_met
as a function of B, namely,

Strain_met(P, B) = ‖ τ(P ∘ P) − B ‖²_Frob.   (8.14)
 
Theorem 8.9 [Trosset (1998)] Let X = [X_1 ··· X_n] be data of rank r. Let δ be dissimilarities
defined on X, put P = (δ_{ik}), and let P ∘ P be the Hadamard product. Assume that τ(P ∘ P) has
the spectral decomposition τ(P ∘ P) = V* D*² V*^T, with eigenvectors v_i* for i ≤ n. Write D*_{(r)}
for the n × n diagonal matrix whose first r diagonal elements d_1*, ..., d_r* agree with those of
D* and whose remaining n − r diagonal elements are zero. Then

B* = V* D*²_{(r)} V*^T

is the minimiser of (8.14), and the r × n matrix

W   with rows   d_1* (v_1*)^T, ..., d_r* (v_r*)^T

is a configuration whose matrix of pairwise distances minimises the strain (8.13).

Theorem 8.9 combines theorem 2 and corollary 1 of Trosset (1998). For a proof of the
theorem, see Trosset (1997, 1998), and chapter 14 of Mardia, Kent, and Bibby (1992).
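A minimal sketch of the construction in Theorem 8.9 under the convention τ(A) = −(1/2)CAC used above; the function and variable names are illustrative, and clamping small negative eigenvalues to zero is a practical safeguard rather than part of the theorem.

```python
import numpy as np

def strain_optimal_configuration(P, r):
    """Theorem 8.9 sketch: configuration whose distances minimise the metric strain (8.13).

    P : (n, n) symmetric dissimilarity matrix.
    r : rank of the data, i.e. number of rows of the configuration."""
    n = P.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    tau = -0.5 * C @ (P ** 2) @ C                 # tau(P o P)
    evals, evecs = np.linalg.eigh(tau)
    order = np.argsort(evals)[::-1]               # keep the r largest eigenvalues
    d2 = np.clip(evals[order][:r], 0, None)       # clamp round-off negatives to zero
    V = evecs[:, order][:, :r]
    B_star = V @ np.diag(d2) @ V.T                # minimiser B* of (8.14)
    W = np.sqrt(d2)[:, None] * V.T                # rows d_i* v_i*^T, size r x n
    return B_star, W
```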
The configurations obtained in Theorems 8.3 and 8.9 look similar. There are, however,
differences. Theorem 8.3 refers to the classical set-up: the matrix Q_n = V D² V^T is constructed
from Euclidean distances, and the resulting configuration minimises the classical stress (8.4).
The matrix B* of Theorem 8.9 admits more general dissimilarities and minimises (8.14).
The resulting configurations therefore will differ because they solve different problems. Meulman (1992) noted
that strain optimal configurations underestimate configuration distances, whereas this is not
the case for stress optimal configurations. In practice, these differences may not be apparent
visually and can be negligible, as we have seen in Example 8.5.

8.4 Non-Metric Scaling


8.4.1 Non-Metric Stress and the Shepard Diagram
Shepard (1962a, 1962b) first formulated the ideas of non-metric scaling in an attempt to
capture processes that cannot be described by distances or dissimilarities. The starting point
is the observed data {O, δ} – as in metric scaling – but the dissimilarities are replaced by
rankings, also called rank orders. The pairwise rankings of objects are the available obser-
vations. We might like to think of these rank orders as preferences in wine, breakfast foods,
perfumes or other merchandise. Because the dissimilarities are replaced by rankings, we use
the same notation δ, but refer to them as rankings or ranked dissimilarities, and write
them in increasing order:

δ_1 = min_{i,k; i≠k} δ_{ik} ≤ ··· ≤ max_{i,k; i≠k} δ_{ik} = δ_N,   with N = n(n − 1)/2.   (8.15)

Shepard’s aim was to construct distances whose rank order, for each pair of observations, is
informed by that of the δ_{ik}. To achieve this goal, he placed the n points at the vertices of a
regular simplex in R^{n−1}. He calculated the Euclidean distances ϑ_{ik} between all pairs of vertices,
ranked the distances and compared the ranking of the dissimilarities with that of the distances ϑ_{ik}.
Points that are in the wrong rank order are moved in or out, and a new ranking of the
distances is determined. The process is iterated until no further improvements in ranking are
achieved. Cosmetics such as a rotation of the coordinates are applied so that the points agree
with the principal axes in R^{n−1}. For a p-dimensional configuration, Shepard proposed to
take the first p principal coordinates as the desired configuration.
Shepard achieved the monotonicity of the ranked dissimilarities and the corresponding
distances essentially by trial and error. It turns out that the monotonicity is the key to making
non-metric scaling work. A few years later, Kruskal (1964a, 1964b) placed these intuitive
ideas on a more rigorous basis by defining a measure of loss, the non-metric stress.

Definition 8.10 Let {O, δ} be the observed data, with O = {O_1, ..., O_n} and ranked dissimilarities
δ_{ik}. For p > 0, let f be an embedding from O into R^p. Let Δ be a distance, and put
ϑ_{ik} = Δ(f(O_i), f(O_k)). Disparities are real-valued functions, defined for pairs of objects O_i
and O_k and denoted by d̂_{ik} = d̂(O_i, O_k), which satisfy the following:

• There is a monotonic function f such that

d̂_{ik} = f(ϑ_{ik})

for every pair (i, k) with i, k ≤ n.

• For pairs of coefficients (i, k) and (i', k'),

d̂_{ik} ≤ d̂_{i'k'}   (8.16)

whenever δ_{ik} < δ_{i'k'}.

The non-metric stress Stress_nonmet, regarded as a function of d̂ and f, is

Stress_nonmet(d̂, f) = [ Σ_{i<k}^{n} (ϑ_{ik} − d̂_{ik})² / Σ_{i<k}^{n} ϑ_{ik}² ]^{1/2},   (8.17)

and the non-metric sstress SStress_nonmet of d̂ and f is

SStress_nonmet(d̂, f) = [ Σ_{i<k}^{n} (ϑ_{ik}² − d̂_{ik}²)² / Σ_{i<k}^{n} ϑ_{ik}⁴ ]^{1/2}.   (8.18)

The matrix W = [f(O_1) ··· f(O_n)] which minimises Stress_nonmet (or SStress_nonmet) is
called the p-dimensional configuration for the non-metric stress (or the non-metric sstress,
respectively).

I will omit the subscript nonmet if there is no ambiguity. For classical and metric scaling,
the stress is a function of the dissimilarities, and for given dissimilarities, we want to find
the embedding and the distances  which minimise the stress. In non-metric scaling, the
stress is defined from the disparities and uses the dissimilarities only via the monotonicity
relationship (8.16). A consequence of this definition is that we are searching for distances
and disparities which jointly minimise the stress or sstress.
The specific form of the monotone functions f , which was of paramount importance in
the early days, is no longer the main object of interest because there are good algorithms
which minimise the non-metric stress and sstress and calculate the p-dimensional config-
urations, the distances and the disparities. The following examples show how these ideas
work and perform in practice.
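In practice, the disparities for a given configuration are usually obtained by monotone (isotonic) regression of the configuration distances on the order of the ranked dissimilarities. The sketch below computes disparities and the non-metric stress (8.17) for one candidate configuration; the use of scikit-learn's IsotonicRegression is my choice and not prescribed by the text.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def nonmetric_stress(delta, theta):
    """Disparities and non-metric stress (8.17) for one candidate configuration.

    delta : vector of ranked dissimilarities delta_ik, pairs i < k.
    theta : vector of configuration distances theta_ik for the same pairs."""
    delta, theta = np.asarray(delta, float), np.asarray(theta, float)
    # disparities: best fit to the distances that is monotone in the delta-order
    d_hat = IsotonicRegression(increasing=True).fit_transform(delta, theta)
    stress = np.sqrt(((theta - d_hat) ** 2).sum() / (theta ** 2).sum())
    return d_hat, stress
```

A complete non-metric scaling algorithm alternates this step with an update of the configuration that reduces the stress for the current disparities.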

Example 8.6 The cereal data consist of seventy-seven brands of breakfast cereals, for
which eleven quantities are recorded. I use the first ten variables, which are listed in Table
8.5, and exclude the binary variable type.
The top-left panel of Figure 8.5 shows a parallel coordinate plot of the data with the
variable numbers on the x-axis. I take the Euclidean distances of pairs of observations as the
ranked dissimilarities and calculate the 2D configuration of these data with the non-metric
sstress (8.18). The red dots in the top-right panel of the figure show this configuration, with
the first component on the x-axis and the second on the y-axis. Superimposed in blue is the
2D PC data, with the first PC on the x-axis. The two plots look similar but are not identical.
The bottom row of Figure 8.5 depicts Shepard diagrams, plots which show the configu-
ration distances or disparities on the y-axis against the ranked dissimilarities on the x-axis.

Table 8.5 Cereal Data from Example 8.6


1 # of calories 6 Carbohydrates (g)
2 Protein (g) 7 Sugars (g)
3 Fat (g) 8 Display shelf (1,2 etc)
4 Sodium (mg) 9 Potassium (mg)
5 Fibre (g) 10 Vitamins (% enriched)

Figure 8.5 Cereal data (top left) from Example 8.6. (Top right): Non-metric configuration in
red and first two PCs in blue. (Bottom): Configuration distances (blue) and disparities (black)
versus ranked dissimilarities on the x-axis.

Shepard diagrams give an insight into the degree of monotonicity between distances or dis-
parities and rankings. The bottom-left panel displays the configuration distances against the
ranked dissimilarities, and the bottom-right panel displays the disparities against the ranked
dissimilarities. There is a wider spread in the blue plot on the left than in the black plot, indi-
cating that the smaller distances are not estimated as well as the larger distances. Overall
both plots show clear monotonic behaviour.
Following this initial analysis of all data, I separately analyse two subgroups of
the seventy-seven brands: the twenty-three brands of Kellogg’s cereals and the twenty-
one brands from General Mills. An analysis of the Kellogg’s sample is also given in
Cox and Cox (2001). The purpose of the separate analyses is to determine differences
between the Kellogg’s and General Mills brands of cereals which may not otherwise be
apparent.
The stress and sstress configurations are very similar for both subsets, so I only show the
sstress-related results in Figures 8.6 and 8.7. The top panels of Figure 8.6 refer to Kellogg’s,
the bottom panels to General Mills. Parallel coordinate plots of the data – displayed in the
left panels – show that the Kellogg’s cereals contain more sodium, variable 4, and potas-
sium, variable 9, than General Mills’ cereals. The plots on the right in Figure 8.6 display
Shepard diagrams: the blue scatter plot shows the configuration distances, and the red line
the disparities, both against the ranked dissimilarities, which are shown on the x-axis. We
note that the range of distances and dissimilarities is larger for the Kellogg’s sample than for
the General Mills’ sample, but both show good monotone behaviour.

Figure 8.6 Kellogg’s (blue) and General Mills’ (red) brands of cereal data from Example 8.6
and their Shepard diagrams on the right.

Figure 8.7 Configurations of Kellogg’s (blue) and General Mills’ (red) samples from
Example 8.6 with fibre content given by the numbers next to the dots.

Figure 8.7 shows the 2D sstress configurations for Kellogg’s in the top panel in blue and
General Mills in the lower panel in red. The numbers next to each red or blue dot show
the fibre content of the particular brand: 0 means no fibre, and the higher the number, the
higher is the fibre content of the brand. The Kellogg’s cereals have higher fibre content than
the General Mills’ cereals, but there is a ‘low-fibre cluster’ in both sets of configurations.
The clustering behaviour of these low-fibre cereals indicates that they have other properties
in common; in particular, they have low potassium content. These properties of the cereals
brands are clearly of interest to buyers and are accessible in these configurations.

The example shows that Multidimensional Scaling can discover information that is not
exhibited in a principal component analysis, in this case the cluster of brands with low
fibre content. Variable 5, fibre, has a much smaller range than variables 4 and 9, sodium and
potassium. The latter two contribute most to the first two principal components, unlike fibre,
which has much smaller weights in the first and second PCs and is therefore not noticeable
in the first two PCs.
Next we look at an example where ranking of the dissimilarities and subsequent non-
metric scaling lead to poor results.

Example 8.7 Instead of using the Euclidean distances for the ten cities of Example 8.1, I
now use rank orders obtained from the distances (8.1). Thus, δ_1 = δ_{(3,5)}, the rank obtained
from the smallest distance between the third city, Singapore, and the fifth city, Kuala
Lumpur, in Table 8.1 and (8.1). Similarly, for N = 45, the largest rank δ_{45} = δ_{(4,10)} is
obtained for the fourth and tenth cities, Seoul and Auckland.
The top-left panel of Figure 8.8 shows the 2D configuration calculated with the non-
metric stress (8.17). The right panel shows the rotated configuration with a 90-degree
rotation as in Figure 8.1, and the numbers next to the dots refer to the cities in Table 8.1.
The bottom-left panel shows a Shepard diagram: the blue points show the configuration dis-
tances, and the black points show disparities, plotted against the ranked dissimilarities. The
disparities are monotonic with the dissimilarities; the configuration distances less so.
A comparison with Figure 8.1 shows that the map of the cities does not bear any rela-
tionship to the real geographic layout of these ten cities, and rank order based non-metric
scaling produces less satisfactory results for these data than classical scaling.

This example shows the superiority of the classical dissimilarities over the rank orders
and confirms that we get better results when we use all available information. If the rank
orders are all the available information, then configurations using non-metric scaling may
be the best we can do.

Figure 8.8 Non-metric configuration from rank orders for Example 8.7 and Shepard diagram.

8.4.2 Non-Metric Strain


In metric scaling we considered two types of loss: stress and strain. Historically, Kruskal’s
stress and sstress were the non-metric loss criteria. Meulman (1992) popularised the notion
of strain in metric scaling, and Trosset (1998) proposed a non-metric version of strain which
I outline now.
 
Let X = [X_1 ··· X_n] be data with dissimilarities δ^0. Put P^0 = (δ^0_{ik}), and let

δ^0_1 ≤ ··· ≤ δ^0_N   with N = n(n − 1)/2

be the ordered ranks as in (8.15). Instead of constructing disparities from the configuration
distances, we start with P^0 and consider dissimilarity matrices P for X whose entries δ_{ik}
satisfy, for pairs (i, k) and (i', k'),

δ_{ik} ≤ δ_{i'k'}   whenever   δ^0_{ik} ≤ δ^0_{i'k'}.   (8.19)

Put

M(P^0) = {P: P is a dissimilarity matrix for X whose entries satisfy (8.19)}.

Then M(P^0) furnishes a rich supply of dissimilarity matrices compatible with P^0.


 
Definition 8.11 Let X = [X_1 ··· X_n] be data of rank r with an n × n dissimilarity matrix
P^0 = (δ^0_{ik}). Put r_0² = ‖P^0‖²_Frob. For p ≤ r, let 𝒟_n(p) be the set of n × n matrices whose
elements can be realised as pairwise distances of points in R^p. For P ∈ M(P^0) such that
‖P‖²_Frob ≥ r_0² and for Δ ∈ 𝒟_n(p), the non-metric strain Strain_nonmet is

Strain_nonmet(P, Δ) = ‖ τ(P ∘ P) − τ(Δ ∘ Δ) ‖²_Frob.   (8.20)

Putting B = τ(Δ ∘ Δ) and using Theorem 8.7, the non-metric strain between symmetric
positive semidefinite n × n matrices B of rank at most r and P ∈ M(P^0) which satisfies ‖P‖²_Frob ≥ r_0²
becomes

Strain_nonmet(P, B) = ‖ τ(P ∘ P) − B ‖²_Frob.   (8.21)

The minimiser (Δ*, P*) of (8.20) is the desired distance matrix and dissimilarity matrix, respectively,
for X. The definition of the non-metric strain Strain_nonmet is very similar to that of
the metric strain Strain_met of (8.13), apart from the extra condition ‖P‖²_Frob ≥ r_0², which is
required to avoid degenerate solutions. Trosset (1998) discussed the non-linear optimisation
problem (8.21) in more detail and showed how to obtain a minimiser.

8.5 Data and Their Configurations


Multidimensional Scaling started as an exploratory technique that focused on reconstructing
data from partial information and on representing data visually. Since its beginnings in the

late 1930s, attempts have been made to formulate models suitable for statistical inferences.
Ramsay (1982) and the discussion of his paper by some fourteen experts gave an understand-
ing of the issues involved. A number of these discussants share Silverman’s reservations, of
feeling a ‘little uneasy about the use of Multidimensional Scaling as a model-based infer-
ential technique, rather than just an exploratory or presentational method’ (Ramsay 1982,
p. 307). These reservations need not detract from the merits of the method, and indeed,
many of the newer non-linear approaches in Multidimensional Scaling successfully extend
or complement linear dimension-reduction methods such as Principal Component Analysis.
In this section we consider two specific topics: scaling for high-dimensional data and
relationships between different configurations of the same data.

8.5.1 HDLSS Data and the X and X^T Duality


In classical scaling, Theorem 8.3 tells us how to obtain the Q n matrix from the dissimilari-
ties without requiring X. Traditionally, n > d, and if X are available, it is clearly preferable
to work directly with X.
For HDLSS data X, we exploit the duality between Q_n = X^T X and Q_d = X X^T; see
Section 5.5 for details. The Q_d matrix is closely related to the sample covariance matrix of
X and thus can be used instead of S to derive the principal component data. We now explore
the relationship between Q n and the principal components of X.
 
Proposition 8.12 Let X = X1 · · · Xn be d-dimensional centred data, with d > n and rank r .
Put Q n = XT X, and write

X = U DV T and Q n = V D 2 V T

for the singular value decomposition of X and the spectral decomposition of Q n ,
respectively. For k ≤ r , put

W(k) = UkT X and V(k) = Dk VkT . (8.22)

Then W(k) = V(k) .

This proposition tells us that the principal component data W(k) can be obtained from the
left eigenvectors of X, the columns of U , or equivalently from the right eigenvectors of X,
the columns of V . Classically, for n > d, one calculates the left eigenvectors, which coincide
with the eigenvectors of the sample covariance matrix S of X. However, if d > n, finding
the eigenvectors of the smaller n × n matrix Q n is computationally much faster.

Proof  Fix k ≤ r. Using the singular value decomposition of X, we have

W(k) = U_k^T U D V^T,

where U is of size d × r, D is the diagonal r × r matrix of singular values and V is of size
n × r. Because U is r-orthogonal, U_k^T U = I_{k×r}, and from this equality it follows that

W(k) = I_{k×r} D V^T = D_k V_k^T = V(k).

The second equality holds because I_{k×r} D is the k × r matrix whose first k columns form D_k
and whose last r − k columns are zero.
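A small numerical check of Proposition 8.12, with illustrative names and simulated HDLSS data, might look as follows.

```python
import numpy as np

def pc_scores_via_Qn(X, k):
    """Principal component scores of centred d x n data X (d > n) computed
    from the n x n matrix Q_n = X^T X, as in Proposition 8.12."""
    Qn = X.T @ X                                   # n x n, cheap when d >> n
    evals, evecs = np.linalg.eigh(Qn)
    order = np.argsort(evals)[::-1]
    Dk = np.sqrt(np.clip(evals[order][:k], 0, None))
    Vk = evecs[:, order][:, :k]
    return Dk[:, None] * Vk.T                      # V_(k) = D_k V_k^T, size k x n

# compare with the SVD route W_(k) = U_k^T X (agreement up to the sign of each row)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))                 # d = 500 variables, n = 20 observations
X = X - X.mean(axis=1, keepdims=True)              # centre each variable
U, d, Vt = np.linalg.svd(X, full_matrices=False)
W2 = U[:, :2].T @ X
V2 = pc_scores_via_Qn(X, 2)
print(np.allclose(np.abs(W2), np.abs(V2)))         # True up to sign
```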

The illicit drug market data of Example 8.4 fit into the HDLSS framework, but in the
calculations of the stress and sstress configurations, I did not mention the duality between X
and X^T. Our next example makes explicit use of Proposition 8.12.

Example 8.8 We continue with the breast tumour gene expression data of Example 2.15
of Section 2.6.2, which consist of 4,751 genes, the variables, and seventy-eight patients.
These data clearly fit into the HDLSS framework. The rank of the data is at most seventy-
eight, and it therefore makes sense to calculate the principal component data from the
eigenvectors of the matrix Q n .
In a standard principal component analysis, the eigenvectors of S contain the weights
for each variable. I calculate the eigenvectors of Q n , and we therefore obtain weights
for the individual observations. The analysis shows that observation 54 has much larger
absolute weights for the first three eigenvectors of Q n than any of the other observations.
Observation 54 is shown in blue in the bottom-right corner of the top-left plot of Figure 8.9.
The top row shows the 2D and 3D PC data, with V(k) as in (8.22). The 2D scatterplot in the
top left, in particular, marks observation 54 as an outlier.
For each observation, the data contain a survival time in months and a binary out-
come which is 1 for patients surviving five years and 0 otherwise. There are forty-four
patients who survived five years – shown in black in the figure – and thirty-four who did
not – shown in blue. As we can see, in a principal component analysis the blue and black
points are not separated; instead, outliers are exposed. These contribute strongly to the total
variance.
The scatterplots in the bottom row of Figure 8.9 result from a calculation of V(k) in (8.22)
after observation 54 has been removed. A comparison between the 2D scatterplots shows
that after removal of observation 54, the PC directions are aligned with the x- and y-axis.
This is not the case in the top-left plot. Because the sample size is small, I am not advocating

Figure 8.9 PC configurations in two and three dimensions from Example 8.8. (Top): all
seventy-eight samples. (Bottom): without the blue observation 54. (Black dots): survived more
than five years. (Blue dots): survived less than five years.

leaving out observation 54 but suggest instead carrying out any subsequent analysis with and
without this particular observation and then comparing the results.
As we have seen, the eigenvectors of Q_n provide useful information about the observa-
tions. A real benefit of using Q_n – instead of Q_d or S – is the computational efficiency.
Calculation of the 78 × 78 matrix Q_n takes about 3 per cent of the time it takes to cal-
culate Q_d. The computational advantage increases further when we calculate the
eigenvalues and eigenvectors of Q_n and Q_d, respectively.

The duality of X and XT goes back at least as far as Gower (1966) and classical scaling.
For HDLSS data, it results in computational efficiencies. In the theoretical development
of Principal Component Analysis – see Theorem 2.25 of Section 2.7.2 – Jung and Marron
(2009) exploited this duality to prove the convergence of the sample eigenvalues to the
population counterparts. In addition, the duality of the Principal Component Analysis for X
and XT can be regarded as a forerunner or a special case of Kernel Principal Component
Analysis, which we consider in Section 12.2.2.

8.5.2 Procrustes Rotations


In the earlier parts of this chapter we constructed p-dimensional configurations based on
different dissimilarities and loss criteria. As the dissimilarities and loss criteria vary, so
do the resulting configurations. In this section we quantify the difference between two
configurations.
Let X be d-dimensional data. If W is a p-dimensional configuration for X with p ≤ d
and E is an orthogonal p × p matrix, then EW is also a configuration for X which has the
same distance matrix  as W. Theorem 8.3 states this fact for classical scaling with stress,
but it holds more generally. As a consequence, configurations are unique up to orthogonal
transformations. Following Gower (1971), we consider the difference between two config-
urations W and EV and find the orthogonal matrix E which minimises this difference. As
it turns out, E has a simple form.
Theorem 8.13 Let W and V be matrices of size p × n whose rows are centred. Let E be an
orthogonal matrix of size p × p, and put

χ(E) = ‖W − EV‖²_Frob.

Put T = VW^T, and write T = U D V^T for its singular value decomposition. If E* = V U^T,
then

E* = argmin χ(E)    and    χ(E*) = ‖W‖²_Frob + ‖V‖²_Frob − 2 tr(D).

The orthogonal matrix E* is the Procrustes rotation of V relative to W.


Proof For any orthogonal E,

tr(EVV^T E^T) = tr(VV^T) = ‖V‖²_Frob,

and hence we have the identity

χ(E) = ‖W‖²_Frob + ‖V‖²_Frob − 2 tr(EVW^T).

Minimising χ(E) is therefore equivalent to maximising tr(EVW^T). We determine the max-
imiser by introducing a symmetric p × p matrix L of Lagrange multipliers and then find the
maximiser of

χ̃(E) = tr[ E T − (1/2) L (E E^T − I_{p×p}) ]    with T = VW^T.

Differentiating χ̃ with respect to E and setting the derivative equal to the zero matrix lead to

T = L E^T.

Using the singular value decomposition of T and the symmetry of L, we have

L² = (T E)(T E)^T = T T^T = U D V^T V D U^T = U D² U^T.

By the uniqueness of the square root, we obtain L = U D U^T. From T = L E^T it follows that

U D V^T = U D U^T E^T,

and hence V = EU.
The desired expression for the maximiser follows. Further, tr(E* VW^T) = tr(D) by the
trace property, and hence the expression for χ(E*) holds.
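The proof shows that E* is available from a single singular value decomposition. The sketch below (a numpy illustration of Theorem 8.13; the function name is mine) computes the Procrustes rotation of V relative to W and checks it in the case where W is an exact rotation of V.

    import numpy as np

    def procrustes_rotation(W, V):
        # Orthogonal E* minimising ||W - E V||_Frob, as in Theorem 8.13;
        # W and V are p x n configurations whose rows are centred.
        T = V @ W.T                                # T = V W^T
        U_T, D, Vt_T = np.linalg.svd(T)            # T = U D V^T
        return Vt_T.T @ U_T.T                      # E* = V U^T

    rng = np.random.default_rng(1)
    p, n = 3, 50
    V = rng.standard_normal((p, n))
    V = V - V.mean(axis=1, keepdims=True)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # a random orthogonal matrix
    W = Q @ V                                          # W is a rotated copy of V
    E_star = procrustes_rotation(W, V)
    print(np.linalg.norm(W - E_star @ V))              # essentially zero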
Theorem 8.13 relates matrices of the same size, and applying it to two configuration
matrices of X tells us how different these configurations are. For configurations with a dif-
ferent number of dimensions p1 and p2 , we ‘pad’ the smaller configuration matrix with
zeroes and then apply Theorem 8.13 to the two matrices of the same size. The padding can
also be applied when we want to compare data with a lower-dimensional configuration. I
summarise the results in the following corollary.
Corollary 8.14 Let X be centred d × n data of rank r, and write X = U D V^T for the sin-
gular value decomposition. For p ≤ r, let W_p = D_p V_p^T be the p × n configuration obtained
in step 3 of Algorithm 8.1. Let U_{p,0} be the d × d matrix whose first p columns coincide with
U_p and whose remaining d − p columns are zero vectors. The following hold.
1. The matrix W_{p,0} = U_{p,0}^T X agrees with W_p in the first p rows, and its remaining d − p
   rows are zero.
2. The Procrustes rotation E* of W_{p,0} relative to X is E* = U_{p,0}, and

   E* W_{p,0} = U_{p,0} U_{p,0}^T X = U_p U_p^T X.

Proof From Proposition 8.12, W_p = U_p^T X, and thus W_{p,0} = U_{p,0}^T X is a centred d × n matrix
satisfying part 1. For T = W_{p,0} X^T = U_{p,0}^T X X^T, we obtain the decomposition

T = U_{p,0}^T U D² U^T.

The Procrustes rotation E* of Theorem 8.13 is E* = U (U_{p,0}^T U)^T = U_{p,0}.

This corollary asserts that E* = U_{p,0} is the minimiser of χ in Theorem 8.13. Part 2 of
the corollary thus tells us that U_p U_p^T X is the best approximation to X with respect to χ. A
comparison with Corollary 2.14 in Section 2.5.2 shows that the two corollaries arrive at the
same conclusion but by different routes.

The Procrustes rotation of Theorem 8.13 is a special case of the more general transforma-
tions

V −→ P(V) = cEV + b,

where c is a scale factor, E is an orthogonal matrix and b is a vector. Extending χ of The-
orem 8.13 to transformations of this form, we may want to compare two configurations W
and V and then find the optimal transformation parameters of V with respect to W. These
parameters minimise

χ(E, c, b) = ‖W − P(V)‖²_Frob.

Problems of this type and their solutions are the topic of Procrustes Analysis, which
derives its name from Procrustes, the ‘stretcher’ of Greek mythology who viciously
stretched or scaled each passer-by to the size of his iron bed. An introduction to Procrustes
Analysis is given in Cox and Cox (2001). Procrustes Analysis is not restricted to configu-
rations; it is an important method for matching or registering matrices, images or general
shapes in high-dimensional space in shape analysis. For details, see Dryden and Mardia
(1998).
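Returning to the general transformation P(V) = cEV + b above: the text does not give the optimal parameters explicitly, but a standard solution, sketched below under the assumption that the centroids are matched first, removes b by centring both configurations, takes E from Theorem 8.13 applied to the centred matrices, and sets c = tr(D)/‖V_c‖²_Frob. The function name and this formulation are mine.

    import numpy as np

    def extended_procrustes(W, V):
        # Fit W ~ c E V + b 1^T over orthogonal E, scale c and shift b (a sketch).
        w_bar = W.mean(axis=1, keepdims=True)
        v_bar = V.mean(axis=1, keepdims=True)
        W_c, V_c = W - w_bar, V - v_bar
        U_T, D, Vt_T = np.linalg.svd(V_c @ W_c.T)
        E = Vt_T.T @ U_T.T                         # rotation, as in Theorem 8.13
        c = D.sum() / np.sum(V_c ** 2)             # scale c = tr(D) / ||V_c||^2_Frob
        b = w_bar - c * E @ v_bar                  # shift matches the centroids
        return E, c, b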

8.5.3 Individual Differences Scaling


Multidimensional Scaling traditionally deals with one data matrix X and the observed data
{O, Δ} and constructs configurations for X from {O, Δ}. If we have repetitions of an exper-
iment, or if we have a number of ‘judges’, each of whom creates a dissimilarity matrix for
the given data, then extensions of this traditional set-up are required, such as
1. treating multiple sets X_ℓ and {O_ℓ, Δ_ℓ} simultaneously, and
2. considering multiple sets {O_ℓ, Δ_ℓ} for a single X, where ℓ = 1, . . ., M.
Analyses for these extensions of Multidimensional Scaling are sometimes also called Three-
Way Scaling.
For the two scenarios – repetitions of X and {O, Δ} and multiple ‘judges’ {O_ℓ, Δ_ℓ} for a
single X – one wants to construct configurations. The aims and methods of solution differ
depending on whether one wants to construct a single configuration from the M sets {O_ℓ, Δ_ℓ}
or M separate configurations. I will not describe solutions but mention some approaches that
address these problems.
Tucker and Messick (1963) worked directly with the dissimilarities and defined an aug-
mented matrix of dissimilarities based on the dissimilarities δ_{ik,ℓ}, where ℓ ≤ M refers to the
judges. The augmented matrix P_TM of dissimilarities is

          ⎡ δ_{12,1}        . . .   δ_{12,M}      ⎤
P_TM  =   ⎢     ⋮             ⋱        ⋮          ⎥                      (8.23)
          ⎣ δ_{(n−1)n,1}    . . .   δ_{(n−1)n,M}  ⎦

so P_TM has n(n − 1)/2 rows and M columns. The ℓth column contains the dissimilarities
of judge ℓ, and the kth row contains the dissimilarities for a specific pair of observations
across all judges. Using a stress loss, individual configurations or an average configuration
are constructed.
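As a concrete illustration (not from the text), the augmented matrix (8.23) can be assembled from the M individual dissimilarity matrices; in the sketch below the pairs are ordered along the strict upper triangle, and the names are my own.

    import numpy as np

    def augmented_dissimilarities(delta_list):
        # Stack M symmetric n x n dissimilarity matrices into the
        # n(n-1)/2 x M matrix P_TM of (8.23); column l holds judge l's
        # dissimilarities delta_{ik,l} for the pairs i < k.
        n = delta_list[0].shape[0]
        rows, cols = np.triu_indices(n, k=1)
        return np.column_stack([D[rows, cols] for D in delta_list])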

Carroll and Chang (1970) used the augmented matrix (8.23) and assigned to judge ℓ a
weight vector ω_ℓ ∈ R^d such that the entries ω_{ℓj} of ω_ℓ correspond to the d variables of
X. Using the weights, the configuration distances for the observations X_i and X_k and judge
ℓ are

d_{ik,ℓ} = [ ∑_{j=1}^{d} ω_{ℓj} (X_{ij} − X_{kj})² ]^{1/2}.

The configuration distances give rise to configurations W_{p,ℓ} as in Algorithm 8.1. The opti-
mal configuration minimises a stress loss based on the average over the M judges. Many
algorithms have been developed which build on the work of Carroll and Chang (1970); see,
for example, Davies and Coxon (1982).
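For a single judge ℓ with weight vector ω_ℓ, the distances d_{ik,ℓ} amount to Euclidean distances after rescaling each variable by the square root of its weight. The following sketch (names mine) computes the full matrix of these distances.

    import numpy as np

    def weighted_distances(X, omega):
        # d_{ik} = [ sum_j omega_j (X_ij - X_kj)^2 ]^(1/2) for the columns of the
        # d x n matrix X, with a non-negative weight omega_j for each variable.
        Xw = np.sqrt(omega)[:, None] * X
        sq = np.sum(Xw ** 2, axis=0)
        D2 = sq[:, None] + sq[None, :] - 2 * Xw.T @ Xw
        return np.sqrt(np.maximum(D2, 0))          # guard against round-off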
Meulman (1992) extended the strain criterion of Definition 8.4 to data sets X_1, . . ., X_M.
The desired configuration W* minimises the average strain over the AX_j, where A is a κ × d
matrix and

Strain(AX_1, . . ., AX_M, W) = (1/M) ∑_{j=1}^{M} Strain(AX_j, W).

Of special interest is the case M = 2, which has connections to Canonical Correlation Anal-
ysis. In a subsequent paper, Meulman (1996) examined this connection between the two
methods. This research area is known as Homogeneity Analysis.

8.6 Scaling for Grouped and Count Data


In this last section we look at a number of methods that have evolved from Multidimensional
Scaling in response to the needs of different types of data, such as categorical and count data
or data that are characterised locally, for example, through their class membership.

8.6.1 Correspondence Analysis


Multidimensional Scaling was proposed for continuous d-dimensional data. The mathemat-
ical equations and developments governing Multidimensional Scaling, however, also apply
to count or categorical data. We now look at data that are available in the form of contingency
tables, and we consider a scaling-like development for such data, known as Correspon-
dence Analysis. I give a brief introduction to Correspondence Analysis in the language and
notation we have developed in this chapter. There are many books on Correspondence Anal-
ysis; Greenacre (1984) and the more recent Greenacre (2007) are two that provide a broad
view of and detail on this topic.

Definition 8.15 Let X be a matrix of size r × c whose entries X_{ℓk} are the counts for the
cell corresponding to column ℓ and row k. We present data of this form in a table, called a
contingency table, which is shown in Table 8.6. The centre part of the table contains the
counts X_{ℓk} in the r × c cells. The entries of the first column are row indices, written r_j for
the jth row, and the entries of the first row are column indices, written c_i for the ith column.
Further, the last row and column contain the column and row totals, respectively.

Table 8.6 Contingency Table of Size r × c

                 c_1          c_2          ···    c_c          Row totals
r_1              X_{11}       X_{21}       ...    X_{c1}       ∑_k X_{k1}
r_2              X_{12}       X_{22}       ...    X_{c2}       ∑_k X_{k2}
...              ...          ...          ...    ...          ...
r_r              X_{1r}       X_{2r}       ...    X_{cr}       ∑_k X_{kr}
Column totals    ∑_ℓ X_{1ℓ}   ∑_ℓ X_{2ℓ}   ...    ∑_ℓ X_{cℓ}   N = ∑_{k,ℓ} X_{kℓ}

The rows of a contingency table are observations in c variables, and the columns are
observations in r variables. It is common practice to give the row and column totals in a
contingency table and to let N be the sum total of all cell counts.
There is an abundance of examples which can be represented in contingency tables,
including political preferences in different locations, treatment and outcomes and ranked
responses in marketing or psychology. Typically, one examines and tests – by means of
suitably chosen χ 2 statistics – whether rows are independent or have the same proportions.
Voter preference in different states could be examined in this way. The role of rows and
columns can be interchanged, so one might want to test for independence of columns instead
of rows.
In a correspondence analysis of contingency table data, the aims include the construction
of low-dimensional configurations but are not restricted to these. Although Correspondence
Analysis (CA) makes use of ideas from Multidimensional Scaling (MDS), there are differ-
ences between the two methods that go beyond those relating to the type of data, as the
following summary shows:
(MDS) For d × n data X in Multidimensional Scaling, the columns X_i of X are the obser-
vations or random vectors, and the row vector X_{•j} is the jth variable or dimension
across all n observations.
(CA.a) For r ×c count data X, we may regard the rows as the observations (in c variables)
or the columns as the observations (in r variables). As a consequence, a scaling-like
analysis of X or XT or both is carried out in Correspondence Analysis.
(CA.b) For the r × c matrix of count data X, we want to examine the sameness of rows
of X.
We first look at (CA.a), and then turn to (CA.b).
Let r and c be positive integers, and let X be given by the r × c matrix of Table 8.6.
Let r0 be the rank of X, and write X = U DV T for its singular value decomposition. We
treat the rows and columns of X as separate vector spaces. The r × r0 matrix U of left
eigenvectors of X forms part of a basis for the r variables of the columns. Similarly, the
c ×r0 matrix V of right eigenvectors of X forms part of a basis for the c variables of the rows
of X. For p ≤ r0 , we construct two separate configurations based on the ideas of classical
scaling:
D_p U_p^T    a configuration for the r-dimensional columns, and
D_p V_p^T    a configuration for the c-dimensional rows.                 (8.24)

Because there is no preference between the rows and the columns, one calculates both con-
figurations, and typically displays the first two or three dimensions of each configuration in
one figure.
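Both configurations in (8.24) come from one singular value decomposition of the table of counts. The sketch below (my own illustration; the function name is not from the text) returns the two p-dimensional configurations, which can then be plotted in a single figure as in the examples that follow.

    import numpy as np

    def ca_configurations(X, p=2):
        # X is an r x c table of counts; return D_p U_p^T and D_p V_p^T of (8.24).
        U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
        D_p = np.diag(s[:p])
        config_cols = D_p @ U[:, :p].T    # configuration for the r-dimensional columns
        config_rows = D_p @ Vt[:p, :]     # configuration for the c-dimensional rows
        return config_cols, config_rows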

Example 8.9 We consider the assessment marks of a class of twenty-three students in a


second-year statistics course I taught. The total assessment consisted of five assignments
and a final examination, and we therefore have r = 6 and c = 23. For each student, the mark
he or she achieved in each assessment task is an integer, and in this example I regard these
integer values as counts.
The raw data are shown in the two parallel coordinate plots in the top panels of
Figure 8.10: the twenty-three curves in the left plot show the students’ marks with the assess-
ment tasks 1,. . . , 6 on the x-axis. The six curves in the right plot show the marks of the six
assessment tasks, with the student number on the x-axis. In the left plot I highlight the per-
formance of three students (with indices 1, 5 and 16) in red and of another (with index 9)
in black. The three red curves are very similar, whereas the black curve is clearly different
from all other curves. In the plot on the right, the assignment marks are shown in blue and
the examination marks in black. The ‘black’ student from the left panel is the student with
x-value 9 in the right panel.
The bottom-left panel of Figure 8.10 shows the 2D configurations D2 U2T and D2 V2T of
(8.24) in the same figure and so differs from the configurations we are used to seeing in
Multidimensional Scaling. Here U2 is a 6 × 2 matrix, and V2 is a 23 × 2 matrix. The dots
represent D2 V2T , and the red and black dots show the same student indices as earlier. The
diamond symbols represent D2 U2T , with the black diamond on the far right the examination
mark. The configurations show which rows or columns are alike. The three red dots are
very close together, whereas the black dot is clearly isolated from the rest of the data. Two
diamonds are close to each other – with coordinates close to (20,20), so two assessment
tasks had similar marks. The rest of the diamonds are not clustered, which shows that the
distribution of assignment marks and examination marks may not be the same.

Figure 8.10 Parallel coordinate plots of Example 8.9 and corresponding 2D configurations in
the lower panels.
Figure 8.11 2D configurations of Example 8.10.

The scatterplot in the bottom right compares dissimilarities with configuration distances
for the configuration D2 V2T ; here I use the Euclidean distance in both cases. The dissimi-
larities of the observations are shown on the x-axis against the configuration distances on
the y-axis, and the red line is the diagonal y = x. Note that most distances are small, and
more points are below the red line than above. The latter shows that the pairwise distances
of the columns are larger than those of the 2D configurations. Overall, the points are close
to the line.

The illicit drug market data of Example 8.4 also fit into the realm of Correspondence
Analysis: the matrix X consists of the seventeen series of count data observed over sixty-six
months.

Example 8.10 In the analysis of the illicit drug market data in Example 8.4, I construct 1D
and 2D configurations for the seventeen series and compare the first components of the two
configurations. In this example I combine the 2D configuration of Example 8.4 with the 2D
configuration (8.24) for the sixty-six months. The two configurations are shown in the same
plot in Figure 8.11: black dots refer to the seventeen series and blue asterisks to the sixty-six
months.
The figure shows a clear gap which splits the data into two parts. The months split into the
first forty-nine on the right-hand side and the remaining later months on the left-hand side of the
plot. This split agrees with the two parts in Figure 2.14 in Example 2.14 in Section 2.6.2. In
Figure 2.14, all PC1 weights before month 49 are positive, and all PC1 weights after month
49 are negative. Gilmour et al. (2006) interpreted this change in the sign of the weights as a
consequence of the heroin shortage in early 2001.
The configuration for the series emphasises the split into two parts further. The black dots
in the right part of the figure are those listed as cluster 1 in Table 6.9 in Example 6.10 in
Section 6.5.2 but also include series 15, robbery 2, whereas the black dots on the left are
essentially the series belonging to cluster 2.
The visual effect of the combined configuration plot is a clearer view of the partitions that
exist in these data than is apparent in separate configurations.

These examples illustrate that cluster structure may become more obvious or pronounced
when we combine the configurations for rows and columns. This is not typically done in
Multidimensional Scaling but is standard in Correspondence Analysis.
We now turn to the second analysis (CA.b), and consider distances between observations
which are given as counts. Although the second analysis can be done for the columns or
rows, I will mostly focus on columns and, without loss of generality, regard each column as
an observation.
Definition 8.16 Let X be a matrix of size r × c whose entries are counts. The (column)
profile X_i of X consists of the r entries {X_{i1}, . . . , X_{ir}} of X_i. Two profiles X_i and X_k are
equivalent if X_i = λX_k for some constant λ.
For i ≤ c and C_i = ∑_j X_{ij}, the column total of profile X_i, put X̃_{ij} = X_{ij}/C_i for j ≤ r, and
call X̃_i = [X̃_{i1} · · · X̃_{ir}]^T the ith normalised profile.
The χ² distance or profile distance of two profiles X_k and X_ℓ is the distance δ_{kℓ} of the
normalised profiles X̃_k and X̃_ℓ, defined by

δ_{kℓ} = [ ∑_{j=1}^{r} (1 / ∑_{i=1}^{c} X̃_{ij}) (X̃_{kj} − X̃_{ℓj})² ]^{1/2}.      (8.25)

When the data are counts, it is often more natural to compare proportions. The definitions
imply that equivalent profiles result in normalised profiles that agree.
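A direct implementation of (8.25) is short; the following sketch (names mine) normalises each column of the count table and accumulates the weighted squared differences between normalised profiles.

    import numpy as np

    def profile_distances(X):
        # Pairwise chi^2 profile distances (8.25) between the columns of a count table.
        X = np.asarray(X, dtype=float)
        X_tilde = X / X.sum(axis=0, keepdims=True)   # normalised profiles (columns sum to 1)
        w = 1.0 / X_tilde.sum(axis=1)                # weights 1 / sum_i X_tilde_{ij}, one per row
        c = X.shape[1]
        delta = np.zeros((c, c))
        for k in range(c):
            for l in range(k + 1, c):
                diff = X_tilde[:, k] - X_tilde[:, l]
                delta[k, l] = delta[l, k] = np.sqrt(np.sum(w * diff ** 2))
        return delta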
The name χ² distance is used because of the connection of the δ²_{kℓ} with the χ² statistic
in the analysis of contingency tables. In Correspondence Analysis we do not make distri-
butional assumptions about the data – such as the multinomial assumption inherent in the
analysis of contingency tables – and thus we cannot have recourse to the inferential interpre-
tation of the χ² distribution. Instead, the graphical displays of Correspondence Analysis
provide a visual tool which shows which profiles are alike. The connection between the
profile distances and the configurations is summarised in the following proposition.
profile distances and the configurations is summarised in the following proposition.
Proposition 8.17 Let X be a matrix of size r × c whose entries are counts and whose
columns are profiles. Let X ) be the matrix of normalised profiles. Assume that r ≤ c and
) )
that X has rank r . Write X = U )D ) Let k be
) V) T for the singular value decomposition of X.
the profile distance (8.25) of profiles Xk and X . Then
2 )D
= (ek − e )T U ) 2U
) T (ek − e ), (8.26)
k

where the ek are unit vectors with a 1 in the kth entry and zero otherwise.
A proof of this proposition is deferred to the Problems at the end of Part II. The proposi-
tion elucidates how Correspondence Analysis combines ideas of Multidimensional Scaling
and the analysis of contingency tables and shows the mathematical links between them.
To illustrate the second type of Correspondence Analysis (CA.b), I calculate profile
distances for the proteomics profiles and propose an interpretation of these distances.

Example 8.11 The curves of the ovarian cancer proteomics data are measurements taken
at points of a tissue sample. In Example 6.12 of Section 6.5.3 we explored clustering of
the binary curves using the Hamming and the cosine distances. Figure 6.12 and Table 6.10

in Example 6.12 show that the data naturally fall into four different tissue types, but they
also alert us to the difference in the grouping by the two approaches. From the figures and
numbers in the table alone, it is not clear how different the methods really are and how big
the differences are between various tissue types.
To obtain a more quantitative assessment of the sameness and the differences between the
curves in the different cluster arrangements, I calculate the distances (8.25). Starting with
the binary data, as in Example 6.12, we first need to define the count data. Each binary curve
has a 0 or 1 at each of the 1,331 m/z values. It is natural to consider the total counts within
a cluster for each m/z value. Let C1 = ham1, . . ., C4 = ham4 be the four clusters obtained
with the Hamming distance, and let C5 = cos1, . . . , C8 = cos4 be the four clusters obtained
with the cosine distance. The clusters lead to c = 8 profiles, which I refer to as the counts
profiles or cluster profiles. The cluster profiles are given by

X^(k) = ∑_{X_ι ∈ C_k} X_ι    for k ≤ 8,                                  (8.27)

and each cluster profile has r = 1,331 entries, the counts at the 1,331 m/z values.
For the counts data [X^(1) · · · X^(8)] and their normalised profiles [X̃^(1) · · · X̃^(8)] I calculate
the pairwise profile distances δ_{kℓ} of (8.25). The δ_{kℓ} values are shown in Table 8.7.
Table 8.7 compares eight counts profiles; ham1 refers to the cancer clusters obtained
with the Hamming distance and shown in yellow in Figure 6.12, ham2 refers to the adi-
pose tissue which is shown in green in the figure, ham3 refers to peritoneal stroma shown
in blue and ham4 refers to the grey background cluster. The cosine-based clusters are listed
in the same order. The information presented in the table is shown by differently coloured
squares in Figure 8.12, with ham1,. . . , cos4 squares arranged from left to right and from
top to bottom in the same order as in the table. Thus, the second square from the left
in the top row of Figure 8.12 represents δ_{12}, the distance between the cancer cluster
ham1 and the adipose cluster ham2. The darker the colour of the square, the smaller is the
distance δ_{kℓ}.
Table 8.7 and Figure 8.12 show that there is good agreement between the Hamming and
cosine counts profiles for each tissue type. The match agrees least for the cancer counts,
with a δ-value of 0.17. The three non-cancerous tissue types are closer to each other than to
the cancer tissue type, and the cancer tissue and background tissue differ most.
The relative lack of agreement between the two cancer-counts profiles ham1 and cos1
may be indicative of the fact that the groups are obtained from a cluster analysis rather
than a discriminant analysis, and points to the need for good classification rules for these
data.

8.6.2 Analysis of Distance


Gower and Krzanowski (1999) combined ideas from the Analysis of Variance (ANOVA)
and Multidimensional Scaling and proposed a technique called Analysis of Distance which
applies to a larger class of data than the Analysis of Variance. A fundamental assumption
in their approach is that the data partition into κ distinct groups. I describe the main ideas
of Gower and Krzanowski, adjusted to our framework. As we shall see, the Analysis of
Distance has strong connections with Discriminant Analysis and Multidimensional Scaling.

Table 8.7 Profile Distances δ_{kℓ} for the Hamming and Cosine Counts Data from Example 8.11

        ham1   ham2   ham3   ham4   cos1   cos2   cos3   cos4
ham1    0      0.55   0.54   0.62   0.17   0.55   0.56   0.62
ham2           0      0.40   0.49   0.44   0.01   0.42   0.49
ham3                  0      0.53   0.41   0.40   0.04   0.53
ham4                         0      0.53   0.49   0.55   0.03
cos1                                0      0.44   0.44   0.53
cos2                                       0      0.42   0.49
cos3                                              0      0.55
cos4                                                     0

Figure 8.12 Profile distances δ_{kℓ} based on the cluster profiles of the ham1, . . . , cos4 clusters of
Table 8.7. The smaller the distance, the darker is the colour.

 
Let X = [X_1 · · · X_n] be d-dimensional centred data. Let δ_ik be classical Euclidean dissim-
ilarities for pairs of columns of X. Let K be the n × n matrix with elements −δ²_ik/2. In the
notation of (8.10) to (8.12), K = −P ◦ P/2, and

Q_n = X^T X = Γ K Γ,

where Γ denotes the centring matrix of (8.11).
Gower and Krzanowski collapsed the d × n data matrix into a d × κ matrix based on the
κ groups into which X partitions, and then applied Multidimensional Scaling to this much
smaller matrix.
 
Definition 8.18 Let X = [X_1 · · · X_n] be d-dimensional centred data. Let δ_ik be classical
Euclidean dissimilarities, and let K be the matrix with entries −δ²_ik/2. Assume that the data
belong to κ distinct groups, with κ < d; the kth group has n_k members from X, and ∑ n_k = n,
for k ≤ κ. Let G be the κ × n matrix with elements

g_ik = 1 if X_i belongs to the kth group, and g_ik = 0 otherwise,

and let N be the κ × κ diagonal matrix with diagonal elements n_k. The matrix of group
means X_G and the matrix K_G derived from the group dissimilarities are

X_G = X G^T N^{-1}    and    K_G = N^{-1} G K G^T N^{-1}.                (8.28)


A little reflection shows that X_G is of size d × κ and K_G is of size κ × κ, and thus the
transformation X → X_G reduces the d × n raw data to the d × κ matrix of group means. Put
Q_{<G,κ>} = X_G^T X_G. Using (8.12) and Theorem 8.3, it follows that

Q_{<G,κ>} = K_G.                                                         (8.29)

If V_G D_G² V_G^T is the spectral decomposition of Q_{<G,κ>}, then W_G = D_G V_G^T is an r-
dimensional configuration for X_G, where r is the rank of X_G.
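The quantities of Definition 8.18 are easily computed from the data and the group labels, as in the sketch below (my own illustration; the names and the label convention are assumptions, not taken from the text).

    import numpy as np

    def group_matrices(X, labels):
        # X is d x n centred data; labels[i] in {0, ..., kappa-1} gives the group of X_i.
        labels = np.asarray(labels)
        kappa, n = labels.max() + 1, X.shape[1]
        G = np.zeros((kappa, n))
        G[labels, np.arange(n)] = 1.0                # kappa x n indicator matrix of Definition 8.18
        N_inv = np.diag(1.0 / G.sum(axis=1))         # N^{-1}, reciprocals of the group sizes
        sq = np.sum(X ** 2, axis=0)
        K = -0.5 * (sq[:, None] + sq[None, :] - 2 * X.T @ X)   # entries -delta_ik^2 / 2
        X_G = X @ G.T @ N_inv                        # d x kappa matrix of group means
        K_G = N_inv @ G @ K @ G.T @ N_inv            # kappa x kappa matrix of (8.28)
        return X_G, K_G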
So far we have applied ideas from classical scaling to the group data. To establish a
connection with the Analysis of Variance, Gower and Krzanowski (1999) started with the
n × n matrix K of Definition 8.18 and considered submatrices K_k of size n_k × n_k which
contain all entries of K relating to the kth group. Let δ̄_ik be the Euclidean distance between
the ith and kth group means, and let K̄ be the matrix of size κ × κ with entries −δ̄²_ik/2.
For

T = −(1/n) (1_{n×1})^T K 1_{n×1},

W = −∑_{k=1}^{κ} (1/n_k) (1_{n_k×1})^T K_k 1_{n_k×1},

B = −(1/n) [n_1 . . . n_κ] K̄ [n_1 . . . n_κ]^T,

Gower and Krzanowski showed the fundamental identity of the Analysis of Distance,
namely,

T = W + B,                                                               (8.30)
which reminds us of the decomposition of the total variance into between-class and within-
class variances.
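The identity (8.30) can be verified numerically. The sketch below (my own construction, with simulated data and equal group sizes for simplicity) builds K from Euclidean dissimilarities, forms T, W and B exactly as above and confirms that T = W + B.

    import numpy as np

    def neg_half_sq_dist(Y):
        # Matrix with entries -||y_i - y_k||^2 / 2 for the columns of Y.
        sq = np.sum(Y ** 2, axis=0)
        return -0.5 * (sq[:, None] + sq[None, :] - 2 * Y.T @ Y)

    rng = np.random.default_rng(2)
    d, n, kappa = 5, 60, 3
    X = rng.standard_normal((d, n))
    X = X - X.mean(axis=1, keepdims=True)
    labels = np.repeat(np.arange(kappa), n // kappa)          # three groups of 20

    K = neg_half_sq_dist(X)
    n_k = np.array([np.sum(labels == k) for k in range(kappa)])
    means = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(kappa)])
    K_bar = neg_half_sq_dist(means)                           # kappa x kappa, from group means

    T = -(np.ones(n) @ K @ np.ones(n)) / n
    W = -sum((np.ones(m) @ K[np.ix_(labels == k, labels == k)] @ np.ones(m)) / m
             for k, m in enumerate(n_k))
    B = -(n_k @ K_bar @ n_k) / n
    print(np.isclose(T, W + B))                               # True: the identity (8.30)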

Example 8.12 The ovarian cancer proteomics data from Example 8.11 are the type of data
suitable for an Analysis of Distance; the groups correspond to clusters or classes. A little
thought tells us that the matrix of group means (8.28) consists of the normalised profiles
calculated from the counts profiles.
Instead of the profile distances which I calculated in the preceding analysis of these data,
in an Analysis of Distance, one is interested in lower-dimensional profiles and the identity

(8.30), which highlights the connection with Discriminant Analysis. We pursue these cal-
culations in the Problems at the end of Part II and compare the results with those obtained
in the correspondence analysis of these data in the preceding example. The two approaches
could be combined to form the basis of a classification strategy for such count data.

In the Analysis of Variance, F-tests allow us to draw inferences. Because the Analysis of
Distance is defined non-parametrically, F-tests are not appropriate; however, one can still
calculate T, W and B. The quantities B and W are closely related to similar quantities in
Discriminant Analysis: the between-class matrix B̂ and the within-class matrix Ŵ, which
are defined in Corollary 4.9 in Section 4.3.2. In Discriminant Analysis we typically use
the projection onto the first eigenvector of Ŵ^{-1} B̂. In contrast, in an Analysis of Distance,
we construct low-dimensional configurations of the matrix of group means, similar to the
low-dimensional configurations in Multidimensional Scaling.
The Analysis of Distance reduces the computational burden of Multidimensional Scaling
by replacing X and Q n with XG , the matrix of sample means of the groups, and the matrix
Q <G,κ> of (8.29). Both XG and Q <G,κ> are typically of much smaller size than X and
Q n . Based on Q <G,κ> , one calculates configurations for the grouped data. To extend the
configurations to X, one makes use of the group structure of the data and adds points as in
Gower (1968) which are consistent with the existing configuration distances of the group
means.

8.6.3 Low-Dimensional Embeddings


Classical scaling with Euclidean dissimilarities has become the starting point for a number
of new research directions. In Section 8.2 we considered objects and dissimilarities and
constructed configurations without knowledge of the underlying data X. The more recent
developments have a different focus: they start with d-dimensional data X, where both d
and n may be large, and the goal is that of embedding the data into a low-dimensional
space which contains the important structure of the data. The embeddings rely on Euclidean
dissimilarities of pairs of observations, and they are often non-linear and map into manifolds.
Local Multidimensional Scaling, Non-Linear Dimension Reduction, and Manifold Learning
are some of the names associated with these post-classical scaling methods.
To give a flavour of the research in this area, I focus on common themes and outline three
approaches:
• the Distributional Scaling of Quist and Yona (2004),
• the Landmark Multidimensional Scaling of de Silva and Tenenbaum (2004), and
• the Local Multidimensional Scaling of Chen and Buja (2009).
 
We begin with common notation for the three approaches. Let X = [X_1 · · · X_n] be d-
dimensional data. For p < r, with r the rank of X, let f be an embedding of X or subsets
of X into R^p. The dissimilarities δ on X and distances d on f(X) are defined for pairs of
observations X_i and X_k by

δ_ik = ‖X_i − X_k‖    and    d_ik = ‖f(X_i) − f(X_k)‖.

Section 8.5.1 highlights advantages of the duality between X and XT for HDLSS data. For
arbitrary data with large sample sizes, the dissimilarity matrix becomes very large, and the

Q n -matrix approach becomes less desirable for constructing low-dimensional configura-
tions. For data that partition into κ groups, Section 8.6.2 describes an effective reduction in
sample size by replacing the raw data with their group means. To be able to do this, it is nec-
essary that the data belong to κ groups or classes and that the group or class membership is
known and distance-based. If the data belong to different clusters, with an unknown number
of clusters, then the approach of Gower and Krzanowski (1999) does not work.
Group membership is a local property of the observations. We now consider other local
properties that can be exploited and integrated into scaling. Suppose that the data X have
some local structure L = {ℓ_1, . . ., ℓ_κ} consistent with pairwise distances δ_ik. The task of
finding a good global embedding f which preserves the local structure splits into two parts:
1. finding a good embedding for each of the local units ℓ_k, and
2. extending the embedding to all observations so that the structure L is preserved.

Distributional Scaling. Quist and Yona (2004) used clusters as the local structures and
assumed that the cluster assignment is known or can be estimated. The first step of the
global embedding is similar to that of Gower and Krzanowski (1999) in Section 8.6.2. For
the second step, Quist and Yona proposed using a penalised stress loss. For pairs of clusters
C_k and C_m, they considered a function ρ_km : R → R which reflects the distribution of the
distances δ_ij between points of the two clusters:

ρ_km(x) = [ ∑_{X_i ∈ C_k} ∑_{X_j ∈ C_m} w_ij δ(x − δ_ij) ] / [ ∑_{X_i ∈ C_k} ∑_{X_j ∈ C_m} w_ij ],

where the w_ij are weights, and δ(·) is the Kronecker delta function. Similarly, they defined a
function ρ̃_km for the embedded clusters f(C_k) and f(C_m). For a tuning parameter 0 ≤ α ≤ 1, the
penalised stress Stress_QY is regarded as a function of the distances and the embedding f:

Stress_QY(δ, f) = (1 − α) Stress(δ, f) + α ∑_{k≤m} W_km D(ρ_km, ρ̃_km).

The last sum is taken over pairs of clusters, the W_km are weights for pairs of clusters C_k and
C_m, and D(ρ_km, ρ̃_km) is a measure of the dissimilarity of ρ_km and its embedded version ρ̃_km.
Examples and parameter choices are given in their paper. Instead of the stress Stress(δ, f),
they also used the Sammon stress (see Table 8.4).
The success of the method depends on how much is known about the cluster structure or
how well the cluster structure – the number of clusters and the cluster membership – can be
estimated. The final embedding f is obtained iteratively, and the authors suggested starting
with an embedding that results in low stress. Their method depends on a large number of
parameters. It is not easy to see how the choices of the tuning parameter α, the weights wi j
between points and the weights Wkm between clusters affect the performance of the method.
Although the number of parameters may seem overwhelming, a judicious choice can lead to
a low-dimensional configuration which contains the important structure of the data.
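The function ρ_km is a weighted sum of point masses placed at the between-cluster dissimilarities. In practice one might summarise such a distribution by a weighted histogram, as in the sketch below; the binning is my own simplification and is not part of Quist and Yona's proposal.

    import numpy as np

    def between_cluster_distribution(delta, idx_k, idx_m, w, bins=20):
        # Weighted histogram approximating rho_km for clusters C_k and C_m:
        # weight w_ij is placed at each between-cluster dissimilarity delta_ij.
        d_km = delta[np.ix_(idx_k, idx_m)].ravel()
        w_km = w[np.ix_(idx_k, idx_m)].ravel()
        hist, edges = np.histogram(d_km, bins=bins, weights=w_km)
        return hist / w_km.sum(), edges              # normalised, like rho_km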

Landmark Multidimensional Scaling. de Silva and Tenenbaum (2004) used κ landmark
points as the local structure and calculated distances from each of the n observations to
each of the landmark points. As a consequence, the dissimilarity matrix P is reduced to a
dissimilarity matrix Pκ of size κ × n. Classical scaling is carried out for the landmark points.

The landmark points are usually chosen randomly from the n observations, but other choices
are possible. If there are real landmarks in the data, such as cluster centres, they can be used.
This approach results in the p × κ configuration

W_(κ) = D_p V_p^T,                                                       (8.31)

which is based on the Q_n matrix derived from the weights of P_κ as in Algorithm 8.1, and
p is the desired dimension of the configuration. The subscript κ indicates the number of
landmark points.
In a second step – the extension of the embedding to all observations –
de Silva and Tenenbaum (2004) defined a distance-based procedure which determines
where the remaining n − κ observations should be placed. For notational convenience, one
assumes that the first κ column vectors of P_κ are the distances between the landmark points.
Let P_κ ◦ P_κ be the Hadamard product, and let p_k be the kth column vector of P_κ ◦ P_κ. The
remaining n − κ observations are represented by the column vectors p_m of P_κ ◦ P_κ, where
κ < m ≤ n. Put

μ = (1/κ) ∑_{k=1}^{κ} p_k,
W^#_(κ) = D_p^{-1} V_p^T,

and define the embedding f for non-landmark observations X_i and columns p_i by

X_i −→ f(X_i) = −(1/2) W^#_(κ) (p_i − μ).
In a final step, de Silva and Tenenbaum applied a principal component analysis to the
configuration axes in order to align the axes with those of the data.
As in Quist and Yona (2004), the dissimilarities of the data are taken into account in the
second step of the approach of de Silva and Tenenbaum (2004). Because the landmark points
may be randomly chosen, de Silva and Tenenbaum suggested running the method multiple
times and discarding bad choices of landmarks. According to de Silva and Tenenbaum,
poorly chosen landmarks have low correlation with good landmark choices.
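The two steps of Landmark Multidimensional Scaling translate into a short computation. The sketch below follows the formulas above but is my own rendering; the function and variable names are not from de Silva and Tenenbaum (2004), and the landmark columns are assumed to come first.

    import numpy as np

    def landmark_mds(P_kappa, p=2):
        # P_kappa is the kappa x n matrix of dissimilarities from the kappa landmarks
        # to all n observations; its first kappa columns are the landmark block.
        kappa = P_kappa.shape[0]
        S = P_kappa ** 2                                   # Hadamard product P_kappa o P_kappa
        # Classical scaling on the landmark block.
        J = np.eye(kappa) - np.ones((kappa, kappa)) / kappa
        B = -0.5 * J @ S[:, :kappa] @ J
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1][:p]
        D_p = np.sqrt(eigvals[order])
        V_p = eigvecs[:, order]
        W_config = np.diag(D_p) @ V_p.T                    # p x kappa configuration (8.31)
        # Second step: place the observations with the map f above.
        mu = S[:, :kappa].mean(axis=1)                     # mean of the landmark columns p_k
        W_sharp = np.diag(1.0 / D_p) @ V_p.T               # W^# = D_p^{-1} V_p^T
        embedded = -0.5 * W_sharp @ (S - mu[:, None])      # columns are the embedded observations
        return W_config, embedded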

Local Multidimensional Scaling. Chen and Buja (2009) worked with the stress loss, which
they split into a ‘local’ part and a ‘non-local’ part, where ‘local’ refers to nearest neigh-
bours. A naive restriction of stress to pairs of observations which are close does not lead to
meaningful configurations (see Graef and Spence 1979). For this reason, care needs to be
taken when dealing with non-local pairs of observations. Chen and Buja (2009) considered
symmetrised k-nearest neighbourhood (kNN) graphs based on sets
N_sym = {(i, m) : X_i ∈ N(X_m, k) and X_m ∈ N(X_i, k)},

where i and m are the indices of the observations X_i and X_m, and the sets N(X_m, k) are those
defined in (4.34) of Section 4.7.1. The key idea of Chen and Buja (2009) was to modify the
stress loss for pairs of observations which are not in N_sym and to consider, for a fixed c,

Stress_CB(δ, f) = ∑_{(i,m) ∈ N_sym} (δ_im − d_im)² − c ∑_{(i,m) ∉ N_sym} d_im.

Chen and Buja (2009) call the first sum in Stress_CB the ‘local stress’ and the second the
‘repulsion’. Whenever (i, m) ∈ N_sym, the local stress forces the configuration distances to
be small, and thus preserves the local structure of neighbourhoods. Chen and Buja (2009)
have developed a criterion, called local continuity, for choosing the parameter c in Stress_CB.
With regard to the choice of k, their computations show that a number of values should be
tried in applications.
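Given input dissimilarities, a candidate configuration and the symmetrised neighbourhood set, the criterion Stress_CB is straightforward to evaluate, as in the following sketch (my own naming).

    import numpy as np

    def stress_cb(delta, dist, nbr_mask, c):
        # delta and dist are n x n matrices of dissimilarities delta_im and
        # configuration distances d_im; nbr_mask is the boolean matrix of N_sym.
        i, m = np.triu_indices_from(delta, k=1)      # each unordered pair once
        local = nbr_mask[i, m]
        local_stress = np.sum((delta[i, m][local] - dist[i, m][local]) ** 2)
        repulsion = np.sum(dist[i, m][~local])
        return local_stress - c * repulsion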
Chen and Buja (2009) reviewed a number of methods that have evolved from classical
scaling. These methods include the Isomap of Tenenbaum, de Silva, and Langford (2000),
the Local Linear Embeddings of Roweis and Saul (2000) and Kernel Principal Component
Analysis of Schölkopf, Smola, and Müller (1998). We consider Kernel Principal Compo-
nent Analysis in Section 12.2.2, a non-linear dimension-reduction method which extends
Principal Component Analysis and Multidimensional Scaling. For more general approaches
to non-linear dimension reduction, see Lee and Verleysen (2007) or chapter 16 in Izenman
(2008).
In this chapter we only briefly touched on how to choose the dimension of a configuration.
If visualisation is the aim, then two or three dimensions is the obvious choice. Multidimen-
sional Scaling is a dimension-reduction method, in particular, for high-dimensional data,
whose more recent extensions focus on handling and reducing the dimension and taking
into account local or localised structure. Finding dimension-selection criteria for non-linear
dimension reduction may prove to be a bigger challenge than that posed by Principal Com-
ponent Analysis, a challenge which will require mathematics we may not understand but
have to get used to.
