Interactive Data Visualization With Multidimensional Scaling
We discuss interactive techniques for multidimensional scaling (MDS) and two systems, named “GGvis” and “XGvis”, that implement these techniques.
MDS is a method for visualizing proximity data, that is, data where objects are char-
acterized by dissimilarity values for all pairs of objects. MDS constructs maps (called
“configurations”) of these objects in IR^k by interpreting the dissimilarities as distances.
As a data-mapping technique, MDS is fundamentally a visualization method. It is
hence plausible that MDS gains in power if it is embedded in a data visualization
environment. Consequently, the MDS systems presented here are conceived as exten-
sions of multivariate data visualization systems (“GGvis” and “X/GGobi” in this case).
The visual analysis of MDS output profits from dynamic projection tools for viewing
high-dimensional configurations, from brushing multiple linked views, from plot en-
hancements such as labels, glyphs, colors, lines, and from selective removal of groups
of objects. Powerful is also the ability to move points and groups of points interactively
and thereby create new starting configurations for MDS optimization.
In addition to the benefits of a data visualization environment, we enhance MDS by
providing interactive control over numerous options and parameters, a few of them
novel. They include choices of 1) metric versus nonmetric MDS, 2) classical versus dis-
tance MDS, 3) the configuration dimension, 4) power transformations for metric MDS,
5) distance transformations and 6) Minkowski metrics for distance MDS, 7) weights in
1 Andreas Buja is the Liem Sioe Liong / First Pacific Company Professor, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104-6302. (http://www-stat.wharton.upenn.edu/~buja)
2 Deborah F. Swayne is Senior Technical Staff Member, AT&T Labs, 180 Park Ave., P.O. Box 971, Florham Park, NJ 07932-0971. ([email protected], http://www.research.att.com/~dfs)
3 Michael L. Littman is Associate Research Professor, Rutgers University, Department of Computer Science, Hill Center Room 409, Piscataway, NJ 08854-8019. ([email protected], http://www.cs.rutgers.edu/~mlittman/)
4 Nathaniel Dean is Associate Professor, Computational and Applied Mathematics - MS 134, Rice University, 6100 Main Street, Houston, TX 77005. ([email protected], http://www.caam.rice.edu/~nated)
5 Heike Hofmann is Assistant Professor, Dept. of Statistics, Iowa State University, Ames, IA 50011. ([email protected], http://www1.math.uni-augsburg.de/~hofmann)
the form of powers of dissimilarities and 8) as a function of group memberships, 9) var-
ious types of group-dependent MDS such as multidimensional unfolding and external
unfolding, 10) random subselection of dissimilarities, 11) perturbation of configura-
tions, and 12) a separate window for diagnostics, including the Shepard plot.
MDS was originally developed for the social sciences, but it is now also used for laying
out graphs. Graph layout is usually done in 2-D, but we allow layouts in arbitrary
dimensions. We show applications to the mapping of computer usage data, to the
dimension reduction of marketing segmentation data, to the layout of mathematical
graphs and social networks, and finally to the spatial reconstruction of molecules.
Key Words: Proximity Data, Dissimilarity Data, Multivariate Analysis, Dimension
Reduction, Multidimensional Unfolding, External Unfolding, Graph Layout, Social
Networks, Molecular Conformation.
www.research.att.com/areas/stat/xgobi/
www.ggobi.org
XGvis is currently more established, but GGvis is more recent, easier to run under MS
Windows, and programmable from other systems such as R. In what follows we refer to the
two systems as X/GGvis.
    D =
        0 1 2 3 4
        1 0 1 2 3
        2 1 0 1 2
        3 2 1 0 1
        4 3 2 1 0

    D =
        0 3 4
        3 0 5
        4 5 0

Figure 1: Simple Examples of Dissimilarity Matrices and Their Optimal Scaling Solutions.
This contrasts with multivariate data that consist of covariate information for individual ob-
jects. If the objects are labeled i = 1, ..., N , proximity data can be assumed to be dissimilarity
values Di,j . If the data are given as similarities, some monotone decreasing transformation
will convert them to dissimilarities. Dissimilarity data occur in many areas (see Section 1.3).
The goal of MDS is to map the objects i = 1, ..., N to points x1, ..., xN ∈ IR^k in such
a way that the given dissimilarities Di,j are well-approximated by the distances ‖xi − xj‖.
Psychometricians often call these distances the “model” fitted to the data Di,j.
The dissimilarity matrices of Figure 1 are simple examples with easily recognized error-
free MDS solutions: the left matrix suggests mapping the five objects to an equispaced linear
arrangement; the right matrix suggests mapping the three objects to a right triangle. The
figure shows configurations actually found by MDS. The first configuration can be embedded
in k = 1 dimension, while the second needs k = 2 dimensions. The choice of embedding
dimension k is arbitrary in principle, but low in practice: k = 1, 2, 3 are the most frequently
used dimensions, for the simple reason that the points serve as easily visualized
representatives of the objects.
In real data, there are typically many more objects, and the dissimilarities usually contain
error as well as bias with regard to the fitted distances.
The oldest version of MDS, called classical scaling, is due to Torgerson (1952). It is,
however, a later version due to Kruskal (1964a,b) that has become the leading MDS method.
It is defined in terms of minimization of a cost function called “Stress”, which is simply a
measure of lack of fit between dissimilarities Di,j and distances ‖xi − xj‖. In the simplest
case, Stress is a residual sum of squares:

    StressD(x1, ..., xN) = ( Σi,j (Di,j − ‖xi − xj‖)² )^{1/2},
where the outer square root is just a convenience that gives greater spread to small values. For
a given dissimilarity matrix D = (Di,j), MDS minimizes Stress over all point configurations
(x1, ..., xN), thought of as k × N-dimensional hypervectors of unknown parameters. The
minimization can be carried out by straightforward gradient descent applied to StressD,
viewed as a function on IR^{kN}.
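This minimization can be sketched in a few lines of numpy. The following is our own minimal illustration, not the X/GGvis implementation; the step size, iteration count, and random start are arbitrary choices, and the example data are the 3-4-5 triangle of Figure 1:

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of the configuration X (N x k)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)), diff

def stress(X, D):
    """Square root of the residual sum of squares between D and the distances."""
    d, _ = pairwise_distances(X)
    return np.sqrt(((D - d) ** 2).sum())

def mds(D, k=2, steps=5000, lr=0.01, seed=0):
    """Minimize Stress over configurations in IR^k by plain gradient descent."""
    D = (D + D.T) / 2.0                  # symmetrize: MDS sees only the symmetric part
    X = np.random.default_rng(seed).standard_normal((D.shape[0], k))
    for _ in range(steps):
        d, diff = pairwise_distances(X)
        np.fill_diagonal(d, 1.0)         # avoid 0/0; diff is zero on the diagonal anyway
        # gradient of sum_ij (D_ij - d_ij)^2 w.r.t. each x_i (up to a positive constant)
        grad = -(((D - d) / d)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# the 3-4-5 right triangle of Figure 1; gradient descent should drive Stress near zero
D = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
X = mds(D, k=2)
```

The outer square root of Stress is omitted inside the loop, since minimizing the residual sum of squares minimizes its square root as well.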
We note a technical detail: MDS is blind to asymmetries in the dissimilarity data because

    (Di,j − ‖xi − xj‖)² + (Dj,i − ‖xj − xi‖)² = 2 · ((Di,j + Dj,i)/2 − ‖xi − xj‖)² + ... ,

where ... is an expression that does not depend on ‖xi − xj‖. Without loss of generality
we assume from now on that the dissimilarities are symmetric: Di,j = Dj,i. If they are not,
they should be symmetrized by forming pairwise averages. The assumption of symmetry will
later be broken in one special case, when one of the two values is permitted to be missing
(Section 4.4).
1.3 Applications of MDS
Here is an incomplete list of application areas for MDS:
• MDS was invented for the analysis of proximity data which arise in the following areas:
– The social sciences: Proximity data take the form of similarity ratings for pairs
of stimuli such as tastes, colors, sounds, people, nations, ...
– Archaeology: Similarity of two digging sites can be quantified based on the fre-
quency of shared features in artifacts found in the sites.
– Classification problems: In classification with large numbers of classes, pairwise
misclassification rates produce confusion matrices that can be analyzed as similar-
ity data. An example would be confusion rates of phonemes in speech recognition.
• Another early use of MDS was for dimension reduction: Given high-dimensional data
y1, ..., yN ∈ IR^K (K large), compute a matrix of pairwise distances dist(yi, yj) = Di,j,
and use distance scaling to find lower-dimensional x1, ..., xN ∈ IR^k (k << K) whose
pairwise distances reflect the high-dimensional distances Di,j as well as possible. In
this application, distance scaling is a non-linear competitor of principal components.
Classical scaling, on the other hand, is identical to principal components when used
for dimension reduction. For a development of multivariate analysis from the point of
view of distance approximation, see Meulman (1992).
• In chemistry, MDS can be used for molecular conformation, that is, the problem of
reconstructing spatial structure of molecules. This situation differs from the above
areas in that 1) actual distance information is available from experiments or theory, and
2) the only meaningful embedding dimension is k = 3, physical space. Configurations
are here called “conformations.” Some references are Crippen and Havel (1978), Havel
(1991), Glunt et al. (1993), and Trosset (1998a). See also our nanotube example in
Section 7.
• A fourth use of MDS is for graph layout, an active area at the intersection of discrete
mathematics and network visualization, see Di Battista et al. (1994). An early example
before its time was Kruskal and Seery (1980). From graphs one can derive distances,
such as shortest-path metrics, which can be subjected to MDS for planar or spatial
layout. Note that shortest-path metrics are generally strongly non-Euclidean, hence
significant residual should be expected in this type of application.
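The graph-layout pipeline in the last bullet can be made concrete with a short sketch (ours, with our own function names): derive a shortest-path metric from an edge list via Floyd-Warshall, and the resulting matrix can then be handed to MDS. For a path graph, the shortest-path metric reproduces the left matrix of Figure 1:

```python
import numpy as np
from itertools import product

def shortest_path_distances(n_nodes, edges):
    """All-pairs shortest-path lengths via Floyd-Warshall (unit edge weights)."""
    D = np.full((n_nodes, n_nodes), np.inf)
    np.fill_diagonal(D, 0.0)
    for i, j in edges:
        D[i, j] = D[j, i] = 1.0
    # k must be the outermost loop index; product varies its first coordinate slowest
    for k, i, j in product(range(n_nodes), repeat=3):
        D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D

# 5-node path graph 0-1-2-3-4
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
D = shortest_path_distances(5, edges)
```

The same construction, applied to a grid graph, yields the distances used in the Figure 4 example.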
In Section 7 we will show examples of data in all four categories. Our overall experience
with the various types of MDS is as follows: distance scaling is sometimes not very good at
dimension reduction, whereas classical inner-product scaling is sometimes not very good at
graph layout. The examples will illustrate these points.
Figure 2: Rothkopf’s Morse Code Data: four 2-D configurations. Top row: metric and
nonmetric distance scaling. Bottom row: metric and nonmetric classical scaling.
This yielded all non-negative values because the confusion matrix (si,j )i,j is diagonally domi-
nant (most identical code pairs are correctly judged even by inexperienced subjects). Unlike
other conversion methods, this one has the desirable property Di,i = 0. We also symmetrized
the dissimilarities before the conversion, as MDS responds only to the symmetric
part of proximity data, as noted earlier.
Applying all four scaling methods in k = 2 dimensions to the Morse code dissimilarities
produced the configurations shown in Figure 2. We decorated the plots with labels and lines
to aid interpretation. In particular, we connected groups of codes of the same length, except
for codes of length four which we broke up into three groups and a singleton. We observe
that, roughly, the length of the codes increases left to right, and the fraction of dots increases
bottom up. Both of these observations agree with the many published accounts (Shepard
1962, Kruskal and Wish 1978, p. 13, Borg and Groenen 1997, p. 59, for example). The
orientation of the configurations in IR^2 is arbitrary due to rotation invariance of the distances
‖xi − xj‖ and inner products ⟨xi, xj⟩. We were therefore free to rotate the configurations in
order to achieve this particular interpretation of the horizontal and vertical axes.
The configurations produced by the four scaling methods show significant differences.
Nonmetric distance scaling (top right) produces probably the most satisfying configuration,
with the exception of the placement of the codes of length 1 (“E” and “T”). Metric scaling
(top left) suffers from circular bending but it places “E” and “T” in the most plausible
location. The classical scaling methods (bottom row) bend the configurations in different
ways. More problematically, they overlay the codes of length 4 and 5 and invert the placement
of the codes of length 1 and 2, both of which seem artifactual. In fact they aren’t: classical
scaling requires a third dimension to distinguish between these two pairs of groups. Distance
scaling is better at achieving compromises in lower dimensions, while classical scaling is more
rigid in this regard.
• Start up with dissimilarity data, multivariate data, or graph data. X/GGvis will
perform proximity analysis, dimension reduction, or graph layout, respectively. Provide
an initial configuration, or else a random configuration is generated automatically.
• Select one of the four scaling methods. The default is metric distance scaling.
• Choose a dimension. The default is 3.
Figure 3: The Major GGvis Windows. On the left is the master control panel, on the right is
the GGobi window for the configuration. Below is the GGobi window for an optional Shepard
plot.
• Initiate optimization (“Run MDS”) and watch the animation of the configuration and
the progression of the Stress or Strain value. When the shape of the configuration stops
changing, slow the optimization down by lowering the stepsize interactively. Finally,
stop the optimization (toggle “Run MDS”). X/GGvis does not have an automatic
convergence criterion, and optimization does not stop on its own.
• Examine the shape of the optimized configuration: If the chosen dimension k is two,
remain in “XY Plot”. If k is three, use 3-D “Rotation” or the “Grand/2-D Tour”. If
k is higher, use the “Grand/2-D Tour”.
• Interpret the configuration: Assuming informative object labels were provided with the
input data, search the configuration by labeling points (“Identify”). If covariates are
available in addition to the dissimilarities, interpretation can be further aided by linked
color brushing between covariate views and configuration views (“Brush”). Multiple
X/GGobi windows are automatically linked for brushing and labeling. As this is only
a tentative search for interpretable structure, use “transient” brushing.
• Enhance the configuration: After acquiring a degree of familiarity with the configura-
tion, use “persistent” brushing in order to permanently characterize subsets of interest.
Enhance the configuration further by persistently labeling interesting points. Finally,
enhance the overall perception of shape by connecting selected pairs of nearby points
with lines (“Line Editing”) and coloring the lines (“Brush”).
• Turn on optimization and leave it continually running. Observe the effects of
We described elsewhere (Swayne et al. 1998) X/GGobi operations such as 3-D rotations
and grand tours, as well as (linked) brushing, labeling and line editing. Moving points is
also an X/GGobi operation, but in conjunction with MDS optimization it takes on a special
importance and is therefore described in Section 3. MDS parameters as well as weighting
and subsetting of dissimilarities affect the cost function and are therefore specific to MDS.
They are the subject of Section 4.
Figure 4: Snapshots from an MDS Animation. The figure shows nine stages of a Stress
minimization in three dimensions. It reconstructed a 5 × 5 square grid from a random
configuration. The grid was defined as a graph with 25 nodes and 40 edges. The distances
were computed as the lengths of the minimal paths in the graph (= city block-, Manhattan-,
or L1 -distances). These distances are not Euclidean, causing curvature in the configuration.
McFarlane and Young call this methodology “sensitivity analysis” because moving points
and observing the response of the optimization amounts to checking the stability of the
configuration.
ViSta-MDS implements a two-step mode of operation in which users alternate between
animated optimization and manipulation. X/GGvis permits the two-step mode also, but
in addition it implements a fluid mode of operation in which the user runs a never-ending
optimization loop, with no stopping criterion whatsoever. The user manipulates the config-
uration points while the optimization is in progress. The optimizer ignores the manipulated
point, but the other points “feel” the dislocation through the change in the cost function,
and they are therefore slowly dragged. They try to position themselves in a local minimum
configuration with regard to the manipulated point. As soon as the manipulated point is
let go, it snaps into a position that turns the configuration into a local minimum of the cost
function. The resulting feel for the user is that of pricking and tearing the fabric of the
Figure 5: Four Local Minima Found by Moving the Group {E,T} into Different Locations.
The Stress values are, left to right and top to bottom: 0.2101, 0.2187, 0.2189, 0.2207.
convex mixture of the present configuration and a random Gaussian configuration. The default
mixing parameter is 100% random, which means a completely new random configuration is
generated. A smaller fraction of 20% or so can be used for local stability checks: if optimiza-
tion always drives the perturbed configuration back to its previous state, it is stable under
20% perturbation. Further stability checks will be discussed below. For a discussion of the
problem of local minimum configurations, see Borg and Groenen (1997), Section 13.4.
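A perturbation of this kind is essentially a one-liner. The sketch below is our own notation (the name `perturb` and the parameter `fraction`, which plays the role of the mixing parameter, are ours):

```python
import numpy as np

def perturb(X, fraction=1.0, scale=1.0, seed=None):
    """Convex mixture of the configuration X and a random Gaussian configuration.

    fraction=1.0 generates a completely new random configuration;
    fraction around 0.2 serves as a local stability check.
    """
    rng = np.random.default_rng(seed)
    return (1 - fraction) * X + fraction * scale * rng.standard_normal(X.shape)

X0 = np.zeros((4, 2))
X_restart = perturb(X0, fraction=1.0, seed=0)   # fresh random start
X_nudge = perturb(X0, fraction=0.2, seed=0)     # 20% perturbation
```

After such a perturbation one would resume the optimization and observe whether the configuration returns to its previous state.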
As a final point, here is another use of manual dragging: rotation of configurations for
interpretability, essentially the factor rotation problem translated to MDS. After examining
a configuration and decorating it with labels, colors, glyphs and lines, one usually obtains
interpretations of certain directions in configuration space. This is when one develops a
desire to rotate the configuration to line these directions up with the horizontal and vertical
axes. One can achieve this by dragging points while optimization is running continuously.
When points are moved gently, the optimization will try to rotate the configuration so that
the moved point maintains a local minimum location relative to the other points. We used
this effect to rotate the four configurations of Figure 2 into their present orientations.
4.1 Stress
Although the simplest form of cost function for distance scaling is a residual sum of squares,
it is customary to report Stress values that are standardized and unit-free. This may take
the form
    StressD(x1, ..., xN) = ( Σi,j (Di,j − ‖xi − xj‖)²  /  Σi,j Di,j² )^{1/2}
In this form of Stress one can explicitly optimize the size of the configuration:
    min_t StressD(t·x1, ..., t·xN) = ( 1 − ( Σi,j Di,j · ‖xi − xj‖ )²  /  ( Σi,j Di,j² · Σi,j ‖xi − xj‖² ) )^{1/2}
This ratio inside the parentheses can be interpreted geometrically as the squared cosine between
the “vectors” {Di,j}i,j and {‖xi − xj‖}i,j, which is hence a number between zero and one.
The complete right-hand side is therefore the sine between these two vectors. This is the
form of Stress reported and traced by X/GGvis.
The Stress we actually use is of considerably greater generality, however: it permits
1) power transformations of the dissimilarities in metric mode and isotonic transformations
in nonmetric mode, 2) Minkowski distances in configuration space, 3) powers of the distances,
4) weighting of the dissimilarities, and 5) missing and omitted dissimilarities. We give
the complete formula for Stress as implemented and defer the explanation of details to
Section 4.3:
    STRESSD(x1, ..., xN) = ( 1 − cos² )^{1/2}

    cos² = ( Σ(i,j)∈I wi,j · f(Di,j) · ‖xi − xj‖m^q )²  /  ( Σ(i,j)∈I wi,j · f(Di,j)² · Σ(i,j)∈I wi,j · ‖xi − xj‖m^{2q} )

    f(Di,j) = Di,j^p                                    for metric MDS
            = s · Isotonic(Di,j) + (1 − s) · Di,j^p     for nonmetric MDS

    0 ≤ p ≤ 6, default: p = 1 (no transformation)
    0 ≤ s ≤ 1, default: s = 1 (fully isotonic transformation)
    Isotonic = monotone increasing transformation estimated with isotonic regression

    ‖xi − xj‖m = ( Σν=1,...,k |xi,ν − xj,ν|^m )^{1/m},  Minkowski (Lebesgue) distance

    1 ≤ m ≤ 6, m = 2: Euclidean (default), m = 1: city block
    0 ≤ q ≤ 6, q = 1: common Stress (default), q = 2: so-called SStress
The summation set I and the weights wi,j will be discussed in Section 4.5.
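The Isotonic(·) transformation is estimated with isotonic regression. For readers who want the mechanics, here is a sketch of the pool-adjacent-violators algorithm (our own minimal version; in nonmetric MDS it would be applied to the distances after sorting the pairs by their dissimilarities):

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators: least-squares monotone increasing fit to y."""
    n = len(y)
    w = np.ones(n) if w is None else np.asarray(w, float)
    level, weight, count = [], [], []
    for yi, wi in zip(y, w):
        level.append(float(yi)); weight.append(float(wi)); count.append(1)
        # pool adjacent blocks while the monotonicity constraint is violated
        while len(level) > 1 and level[-2] > level[-1]:
            tot = weight[-2] + weight[-1]
            level[-2] = (weight[-2] * level[-2] + weight[-1] * level[-1]) / tot
            weight[-2] = tot
            count[-2] += count[-1]
            level.pop(); weight.pop(); count.pop()
    return np.repeat(level, count)
```

Each violation is resolved by replacing the offending adjacent values with their weighted mean, which is the least-squares optimal pooling.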
this assumption, one fits inner products ⟨xi, xj⟩ to a transformation of the dissimilarity data.
The derivation of this transformation is based on the following heuristic:
    Di,j² ≈ ‖xi − xj‖² = ‖xi‖² + ‖xj‖² − 2⟨xi, xj⟩

As the matrix (⟨xi, xj⟩)i,j has zero row and column means, one recognizes that the necessary
transformation is simply the removal of row and column means from the matrix

    D̃i,j := −Di,j²/2 ,
a process which is commonly known as “double-centering”:

    Bi,j = D̃i,j − D̃i• − D̃•j + D̃•• ,

where D̃i•, D̃•j and D̃•• are row, column and grand means. The quantities Bi,j are now
plausibly called “inner-product data”:
    Bi,j ≈ ⟨xi, xj⟩
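The double-centering step, and the identity with principal components noted in Section 1.3, are easy to verify numerically. The sketch below is ours (not code from the systems): for exact Euclidean distances of centered data, double-centering recovers the inner products, and the top eigenvectors of B reproduce PCA scores up to column signs:

```python
import numpy as np

def double_center(D):
    """Inner-product data B: remove row, column, and grand means of -D^2/2."""
    Dt = -D ** 2 / 2.0
    return Dt - Dt.mean(axis=0) - Dt.mean(axis=1, keepdims=True) + Dt.mean()

rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 5))
Yc = Y - Y.mean(axis=0)                        # centered data
D = np.linalg.norm(Yc[:, None, :] - Yc[None, :, :], axis=-1)
B = double_center(D)                           # exact distances: B equals Yc Yc^T

# classical scaling solution: top-k eigenvectors of B, scaled by sqrt(eigenvalues)
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:2]
X_cs = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# principal component scores of the same data, for comparison
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
scores = U[:, :2] * s[:2]
```

With noisy or non-Euclidean dissimilarities, B is only approximately a Gram matrix, which is what the Strain cost function then accounts for.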
We call “Strain” any cost function that involves inner products of configuration points and
inner-product data. One form of Strain is a standardized residual sum of squares between
the quantities Bi,j and the inner products:
    StrainD(x1, ..., xN) = ( Σi,j (Bi,j − ⟨xi, xj⟩)²  /  Σi,j ⟨xi, xj⟩² )^{1/2}
Again, this is not the form of Strain we use. Instead, we pose ourselves the problem of finding
a form that can be made nonmetric. Nonmetric classical MDS seems a contradiction in terms
because classical scaling must be metric according to the conventional interpretations. Yet
we may ask whether it would not be possible to transform the dissimilarity data Di,j in such
a way that a better Strain could be obtained, just as the data are transformed in nonmetric
distance scaling to obtain a better Stress. This is indeed possible, and a solution has been
given by Trosset (1998b). We give here an alternative derivation with additional simplifi-
cations that permit us to fit the Strain minimization problem in the existing framework.
Classical nonmetric scaling is not a great practical advance, but it fills a conceptual gap. It
also gives the software a more satisfactory structure by permitting all possible pairings of
{metric, nonmetric} with {classical scaling, distance scaling}. The properties of nonmetric
classical scaling are not well-understood at this point. The implementation in X/GGvis will
hopefully remedy the situation.
We start with the following observations:
The first bullet is almost tautological in view of our introduction of classical scaling. The
second bullet may be new; it is certainly the critical one. To elaborate we need a little linear
algebra:
If A and C are N × N matrices, denote by ⟨A, C⟩F = trace(A^T C) = Σi,j Ai,j Ci,j the
Frobenius inner product, and by ‖A‖F = ⟨A, A⟩F^{1/2} the Frobenius norm. Furthermore, let
e = (1, ..., 1)^T ∈ IR^N and IN be the N × N identity matrix, so that P = IN − e e^T/N is the
centering projection. Then the equation Bi,j = D̃i,j − D̃i• − D̃•j + D̃•• can be re-expressed
as B = P D̃ P. Finally, let X be the N × k configuration matrix whose rows contain the
coordinates of the configuration points, so that (⟨xi, xj⟩)i,j = X X^T. The centering condition
Σi xi = 0 can be re-expressed as P X = X, and the residual sum of squares of Strain as
‖B − X X^T‖F².
Using repeatedly a basic property of traces, trace(A^T C) = trace(C^T A), one derives
⟨B, X X^T⟩F = ⟨B, P X X^T P⟩F and hence:
The former cost function has the property we were looking for: it lends itself to a nonmetric
extension by replacing the original transformation −Di,j²/2 with a general descending
transformation f(−Di,j), where f is a monotone increasing (non-decreasing) function that
can be estimated with isotonic regression of ⟨xi, xj⟩ on −Di,j. In the metric case, an
extension to power transformations is natural: f(−Di,j) = −Di,j^{2p}. Thus we have solved the
problem of finding a natural nonmetric form of classical scaling.
Like Stress, the Strain actually implemented in X/GGvis is a standardized size-optimized
version (to avoid the collapse of f (−Di,j )), and it also has some generalizations such as
weighting and omitting of dissimilarities:
    STRAIND(x1, ..., xN) = ( 1 − cos² )^{1/2}

    cos² = ( Σ(i,j)∈I wi,j · f(−Di,j) · ⟨xi, xj⟩ )²  /  ( Σ(i,j)∈I wi,j · f(−Di,j)² · Σ(i,j)∈I wi,j · ⟨xi, xj⟩² )

    Di,j ∈ IR, ≥ 0, N × N matrix of dissimilarity data

    f(−Di,j) = −Di,j^{2p}                                      for metric MDS
             = s · Isotonic(−Di,j) + (1 − s) · (−Di,j^{2p})    for nonmetric MDS

    0 ≤ p ≤ 6, default: p = 1 (no transformation)
    0 ≤ s ≤ 1, default: s = 1 (fully isotonic transformation)
    Isotonic = monotone increasing transformation estimated with isotonic regression

    ⟨xi, xj⟩ = Σν=1,...,k xi,ν · xj,ν ,  configuration inner products
The summation set I and the weights wi,j will be discussed in Section 4.5.
• The most fundamental “parameters” are the discrete choices of metric versus nonmetric
and distance versus classical scaling. The default is metric distance scaling.
• Next in importance is the choice of the dimension, k. The conventional choice is k = 2,
but with 3-D rotations available it is plausible to choose k = 3 as the default. For k ≥ 4
one can use the so-called “grand tour” or “2-D tour”, a generalization of 3-D rotations
to higher dimensions.
• Both metric and nonmetric scaling are controlled by a parameter that affects the
transformation of the dissimilarities. The two parameters are very different in nature:
below 1 can help a configuration recover when it gets trapped in a degeneracy
(usually clumping of the points in a few locations, and near-zero Stress, see Borg
and Groenen (1997) Sections 13.2-3). (In X/GGvis the power exponent p cannot
be changed in nonmetric mode; it is inherited from the last visit to metric mode.)
• A parameter specific to distance scaling, both metric and nonmetric, is the distance
power q. We introduced this parameter to include so-called SStress, which is obtained
for p = q = 2. That is, SStress fits squared distances to squared dissimilarities. SStress
is used in the influential MDS software “ALSCAL” by Takane et al. In our limited
experience SStress is somewhat less robust than Stress, as the former is even more
strongly influenced by large dissimilarities than Stress.
• There is finally a choice of metric in configuration space for distance scaling. We
allow Minkowski (also called Lebesgue) metrics other than Euclidean by permitting
the Minkowski parameter m to be manipulated. The default is Euclidean, m = 2; the
city block or L1 metric is the limiting case m → 1.
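For reference, the Minkowski distance as a function of the parameter m can be sketched as follows (our own minimal helper, written for a single pair of points):

```python
import numpy as np

def minkowski(xi, xj, m=2.0):
    """Minkowski (Lebesgue) distance; m = 2 is Euclidean, m = 1 is city block."""
    return float((np.abs(xi - xj) ** m).sum() ** (1.0 / m))

xi, xj = np.array([0.0, 0.0]), np.array([3.0, 4.0])
```

For any fixed pair of points, the distance decreases as m grows, with the city block value as the upper end at m = 1.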
4.4 Subsetting
The cost functions can trivially handle missing values in the dissimilarity matrix: Missing
pairs (i, j) are simply dropped from the summations in the cost function and its gradient.
Through the deliberate use of missing values one can implement certain extensions of MDS
such as multidimensional unfolding (see Borg and Groenen (1997), chapter 14, in particular
their Figure 14.1).
Missing values can be coded in the dissimilarities file as “NA”, or they can be introduced
through conditions that are under interactive control. Here is a symbolic description of the
summation set of Stress and Strain:
    I = { (i, j) :  Di,j ≠ NA,  T0 ≤ Di,j ≤ T1,  Runif(i, j) < α,  ... }

    0 ≤ T0 ≤ T1, thresholds, defaults: T0 = 0, T1 = ∞
    Runif(i, j) = uniform random numbers ∈ [0, 1]
    α = selection probability, default: α = 1
    ... = conditions based on color/glyph groups
• Thresholding: The lower and upper threshold parameters T0 and T1 for the conditions
T0 ≤ Di,j ≤ T1 can be interactively controlled (by grips under the histogram in the
X/GGvis control panel, on the left in Figure 3). Thresholding can be used to check the
influence of large and small dissimilarities by removing them. We implemented these
operations based on the received wisdom that the global shape of MDS configurations
is mostly determined by the large dissimilarities. This statement is based on a widely
cited study by Graef and Spence (1979) who ran simulations in which they removed,
respectively, the largest third and the smallest third of the dissimilarities. They found
devastating effects when removing the largest third, but relatively benign effects when
removing the smallest third. With interactive thresholding the degree to which this
behavior holds can be explored for every dataset individually.
• Random selection is implemented by thresholding uniform random numbers Runif(i, j).
The condition Runif(i, j) < α removes the dissimilarity Di,j with probability 1 − α.
The selection probability α can be controlled interactively. Because the selection is
probabilistic, the number of selected dissimilarities is random. Repeatedly generating
new sets of random numbers while optimization is continuously running, one gets a
sense of how (un)stable the configuration is under random removal of dissimilarities.
In our experience classical scaling does not respond well to the removal of even a small
fraction of distances. Distance scaling is considerably more robust in this regard.
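Random subselection amounts to a symmetric boolean mask over the dissimilarity matrix; a minimal sketch in our own notation (X/GGvis regenerates such masks interactively while the optimization runs):

```python
import numpy as np

def random_selection_mask(N, alpha, seed=None):
    """Symmetric mask keeping each dissimilarity with probability alpha."""
    rng = np.random.default_rng(seed)
    runif = rng.uniform(size=(N, N))
    # use the same random draw for (i, j) and (j, i) so the mask stays symmetric
    runif = np.triu(runif) + np.triu(runif, 1).T
    return runif < alpha

mask = random_selection_mask(6, alpha=0.5, seed=1)
```

Pairs where the mask is False are simply dropped from the summation set I.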
Another set of removal conditions for dissimilarities is based on color/glyph groups.
Such groups can be entered from files or they can be generated interactively with brushing
operations. We implemented the following ways of using groups in scaling:
• Subsetting objects: Remove some objects and scale the remaining objects. Removal
is achieved by “hiding” color/glyph groups.
• Within-groups scaling: Remove the dissimilarities that cross color/glyph groups.
This option can be useful for finding and comparing group-internal structure, which is
often obscured in global configurations.
Within-groups scaling has slightly different behavior in classical and distance scaling:
In classical scaling the groups are linked to each other by a common origin, but oth-
erwise they are scaled independently. In distance scaling the groups can be moved
independently of each other.
Note also that nonmetric scaling always introduces a certain dependence between
groups because the isotonic transformation is obtained for the pooled within-groups
dissimilarities, not for each group separately.
• Between-groups scaling: Remove the dissimilarities within the groups. Between-
groups scaling with two groups is called multidimensional unfolding (Borg and Groenen
1997, chapter 14). Two groups is the case that is most prone to degeneracies because
it removes the most dissimilarities. The more groups there are, the more dissimilarities
are retained and hence stability is gained.
• Anchored scaling: The objects are divided into two subsets, which we call the set of
anchors and the set of floaters. We scale the floaters by only using their dissimilarities
with regard to the anchors. Floaters are therefore scaled individually, and their posi-
tions do not affect each other. The anchors affect the floaters but not vice versa. The
configuration of the anchors is dealt with in one of two ways:
– Fixed anchors: The anchors have a priori coordinates that determine their con-
figuration. Such coordinates can be entered in an initial position file, or they
are obtained from previous configurations by manually moving the anchor points
(with mouse dragging).
– Scaled anchors: The anchors have dissimilarities also. Configurations for the
anchors can therefore be found by subjecting them to regular scaling. Internally
scaling the anchors and externally scaling the floaters with regard to the anchors
can be done in a single optimization (Section 6.3).
In our practice we usually start with scaled anchors. Subsequently we switch to fixed
anchors. Then, while the optimization is running, we drag the anchor points into new
locations in order to check the sensitivity and reasonableness of the configuration of
the floaters.
The anchor metaphor is ours. Anchored scaling is called “external unfolding” in the
literature (Borg and Groenen 1997, Section 15.1).
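The subsetting schemes of this section amount to keep/drop patterns over the dissimilarity matrix. As an illustrative sketch (Python with NumPy; `scaling_mask` and its argument names are ours, not part of X/GGvis), the within-groups, between-groups, and anchored variants can be encoded as boolean masks:

```python
import numpy as np

def scaling_mask(groups, mode, anchors=None):
    """Boolean keep/drop masks for the group-based scaling variants of Section 4.4.
    Returns a matrix that is True where D_ij is kept in the cost function.
    Illustrative helper, not X/GGvis API.

    mode: "within"   - drop between-groups dissimilarities
          "between"  - drop within-groups dissimilarities
          "anchored" - keep only the columns of the anchor set, so floaters
                       are pushed by the anchors but not by each other
    """
    groups = np.asarray(groups)
    n = len(groups)
    same = np.equal.outer(groups, groups)   # True for within-group pairs
    off = ~np.eye(n, dtype=bool)            # exclude the diagonal
    if mode == "within":
        keep = same
    elif mode == "between":
        keep = ~same
    elif mode == "anchored":
        keep = np.zeros((n, n), dtype=bool)
        keep[:, list(anchors)] = True       # anchors push everyone (scaled anchors)
    else:
        raise ValueError(mode)
    return keep & off
```

Entries where the mask is False would then be treated like missing dissimilarities and dropped from the Stress or Strain.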
4.5 Weights
The cost functions are easily adapted to weights. We implemented weights that depend on
two parameters, each for a different task:
$$ w_{i,j} \;=\; D_{i,j}^{\,r} \cdot \begin{cases} w, & \text{if the color/glyph of } i \text{ and } j \text{ is the same} \\ 2 - w, & \text{if the color/glyph of } i \text{ and } j \text{ is different} \end{cases} $$
The first factor in the weights can depend on a power r of the dissimilarities. If r > 0, large
dissimilarities are upweighted; if r < 0, large dissimilarities are downweighted. This is a
more gradual form of moving small and large dissimilarities in and out of the cost function
compared to lower and upper thresholding.
For metric distance scaling with r = −1, one obtains Sammon’s (1969) mapping, an
independent rediscovery of a variant of MDS.
The second factor in the weights depends on groups: The parameter w permits continuous
up- and downweighting of dissimilarities depending on whether they link objects in the same
or different groups. This is a gradual form of moving between conventional scaling, within-
groups scaling, and between-groups scaling. The latter are our ideas, while the weight-based
gradual version is due to Priebe and Trosset (personal communication).
Note that weighting is computationally more costly than subsetting. The latter saves
time because some dissimilarities do not need to be looked at, but weights are costly in
terms of memory as we store them to save power operations in each iteration.
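A minimal sketch of this two-parameter weighting scheme (Python with NumPy; the function name is ours, and group labels stand in for the colors/glyphs of the points):

```python
import numpy as np

def mds_weights(D, groups, r=0.0, w=1.0):
    """Two-parameter weights w_ij = D_ij**r * (w or 2 - w), as in Section 4.5.

    D      : (N, N) symmetric dissimilarity matrix
    groups : length-N labels standing in for colors/glyphs
    r      : power on the dissimilarities; r > 0 upweights large D_ij,
             r < 0 downweights them (r = -1 yields Sammon's mapping
             for metric distance scaling)
    w      : in [0, 2]; within-group pairs get factor w, between-group
             pairs get factor 2 - w
    """
    D = np.asarray(D, dtype=float)
    groups = np.asarray(groups)
    Dr = np.ones_like(D)
    nz = D > 0
    Dr[nz] = D[nz] ** r          # leave zero entries (the diagonal) alone
    same = np.equal.outer(groups, groups)
    return Dr * np.where(same, w, 2.0 - w)
```

With r = 0 and w = 1 all weights are 1, recovering the unweighted cost function; moving w toward 2 interpolates toward within-groups scaling, toward 0 toward between-groups scaling.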
[Figure 6 here: two scatterplots of the transformed dissimilarities D^p against the configuration distances d_config, for powers p = 1.0 (left) and p = 6.0 (right); the outlying pair (I, 2) is labeled in the right panel.]
Figure 6: Two Diagnostic Plots for Configurations from Metric Distance Scaling of the Morse
Code Data. Left: raw data, power p = 1; right: transformed data, power p = 6. An outlier
is marked in the right hand plot: the pair of codes (I, 2) ∼ (· ·, · · − − −) has a fitted
distance that is vastly larger than the target dissimilarity.
5 Diagnostics
A standard diagnostic in MDS is the Shepard plot, which is a scatterplot of the dissimilarities
against the fitted distances, usually overlaid with a trace of the isotonic transform. See for
example Borg and Groenen (1997, Sections 3.3 and 4.1) or Cox and Cox (1994, Section 3.2.4).
The plot provides a qualitative assessment of the goodness of fit, beyond the quantitative
assessment given by the Stress value.
The components for a Shepard plot are provided by X/GGvis on demand. The plot is
part of a separate X/GGobi window which goes beyond a mere Shepard plot. The window
contains seven variables:
• d_{i,j}, the fitted quantities, which are ‖x_i − x_j‖ for distance scaling and ⟨x_i, x_j⟩ for
classical scaling (in which case the axis is labeled “b_ij” rather than “d_ij”),
• f(D_{i,j}), the transformation of the dissimilarities, which is a power for metric scaling
and an isotonic transformation for nonmetric scaling,
• D_{i,j}, the dissimilarities,
• r_{i,j} = f(D_{i,j}) − d_{i,j}, the residuals,
• w_{i,j}, the weights, which may be a power of the dissimilarities,
• i, the row index,
• j, the column index.
Selecting the variables d_{i,j} and D_{i,j} yields the Shepard plot for distance scaling, and an obvious
analog for classical scaling. Selecting D_{i,j} and f(D_{i,j}) yields a plot of the transformation
of the dissimilarities. Selecting d_{i,j} and f(D_{i,j}) is possibly even more informative because the
strength of the visible linear correlation is a qualitative measure of the quality of the fit.
The right side of Figure 6 shows an example.
The residuals are provided for users who prefer residual plots over plots of fits. The
weights are occasionally useful for marking up- and down-weighted dissimilarities with col-
ors or glyphs. The row and column indices are useful for those X/GGvis modes in which
dissimilarities have been removed from the Stress or Strain, or if some dissimilarities are
missing. Plotting i versus j provides a graphical view of the missing and removal pattern of
the dissimilarity matrix as it is used in the current cost function.
If the objects are given labels in an input file, the labels of the diagnostics window are
pairs of object labels. In Figure 6, for example, an outlying point is labeled “I..,2..- - -”,
showing two Morse codes, “..” for “I” and “..- - -” for “2”, whose dissimilarity is not fitted
well.
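For metric distance scaling, the main diagnostics variables are easily computed from a configuration and the dissimilarities. A sketch (Python with NumPy; `shepard_quantities` is a hypothetical helper name, and the isotonic transform of nonmetric scaling is omitted in favor of a power transformation):

```python
import numpy as np

def shepard_quantities(X, D, p=1.0):
    """Compute diagnostics-window variables for metric distance scaling:
    fitted distances d_ij = ||x_i - x_j||, transformed dissimilarities
    f(D_ij) = D_ij**p, and residuals r_ij = f(D_ij) - d_ij.
    Illustrative helper, not the X/GGvis implementation.

    X : (N, k) configuration, D : (N, N) dissimilarities, p : power.
    """
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))    # configuration distances
    i, j = np.triu_indices(len(X), k=1)      # one copy of each pair (i < j)
    fD = D[i, j] ** p                        # transformed dissimilarities
    return d[i, j], fD, fD - d[i, j]
```

Plotting the first output against the second gives the plot of fitted distances versus transformed dissimilarities discussed above; the third output supports residual plots.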
What distinguishes the diagnostics window from the rest of X/GGvis is that the size of its data is of
order N² if the size of the configuration is N. A Shepard plot can hence get so large even for moderate N that the
system would slow down to an intolerable degree if animation were attempted. It is, however,
useful to always show the current number of dissimilarities and to update it continuously when
operations are performed that remove or add dissimilarities.
configuration, we are to maximize a ratio of the following form:
$$ \frac{Num}{Denom} \;=\; \frac{\sum_{(i,j)\in I} w_{i,j} \cdot g(D_{i,j}) \cdot s(x_i, x_j)}{\Bigl(\sum_{(i,j)\in I} w_{i,j} \cdot s(x_i, x_j)^2\Bigr)^{1/2}} \,, \qquad (2) $$
where s(xi , xj ) is a Minkowski distance (to the q’th power) for distance scaling, and the
inner product for classical scaling. Both distances and inner products are symmetric in
their arguments; hence s(x_i, x_j) is symmetric in i and j. As D_{i,j} is assumed symmetric in
the subscripts, so is wi,j .
The gradient of the ratio with regard to the configuration X = (x_i)_{i=1..N} is the collection
of partial gradients with regard to the configuration points x_i:

$$ \frac{\partial}{\partial X} \frac{Num}{Denom} \;=\; \left( \frac{\partial}{\partial x_i} \frac{Num}{Denom} \right)_{i=1..N} . $$
Because we determine the size of gradient steps as a fraction of the size of the configuration,
we only need the partial gradients up to a constant factor:
$$ \frac{\partial}{\partial x_i} \frac{Num}{Denom} \;\propto\; \sum_{j \in \{j \mid (i,j) \in I\}} w_{i,j} \left( g(D_{i,j}) - \frac{Num}{Denom^2}\, s(x_i, x_j) \right) \left. \frac{\partial}{\partial x}\, s(x, x_j) \right|_{x = x_i} , \qquad (3) $$
In the derivation of this formula we used symmetry of Di,j , wi,j , s(xi , xj ), and also symmetry
of the set I, that is, (i, j) ∈ I ⇒ (j, i) ∈ I. The summation should really extend over the set
{j | (i, j) ∈ I} .
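The ratio (2) and the reduced-summation gradient (3) can be sketched for the special case of metric distance scaling with Euclidean s(x_i, x_j) = ‖x_i − x_j‖, unit weights, and g the identity (Python with NumPy; a sketch of the formulas, not the X/GGvis implementation):

```python
import numpy as np

def ratio_and_gradient(X, D, I):
    """Evaluate the ratio (2) and the partial gradients (3) for metric
    distance scaling with Euclidean s, unit weights, and g = identity.

    X : (N, k) configuration, D : (N, N) dissimilarities,
    I : (N, N) boolean mask of the summation set (True where D_ij is used).
    """
    diff = X[:, None, :] - X[None, :, :]
    s = np.sqrt((diff ** 2).sum(-1))                 # s_ij = ||x_i - x_j||
    num = (D * s)[I].sum()
    denom = np.sqrt((s ** 2)[I].sum())
    # grad of s(x, x_j) at x = x_i is the unit vector (x_i - x_j)/s_ij
    with np.errstate(invalid="ignore", divide="ignore"):
        unit = np.where(s[..., None] > 0, diff / s[..., None], 0.0)
    # (3): sum_j w_ij (g(D_ij) - (Num/Denom^2) s_ij) * unit vector
    coef = np.where(I, D - (num / denom ** 2) * s, 0.0)
    grad = (coef[..., None] * unit).sum(axis=1)
    return num / denom, grad
```

At a configuration that reproduces the dissimilarities exactly, the coefficient g(D_ij) − (Num/Denom²)·s_ij vanishes and so does the gradient, as it should at a maximum of (2).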
We note:
• The reduced-summation gradient (3) is exactly the plain gradient of (4) with regard
to the ξ_i’s at ξ_i = x_i ∀i.
• Maximizing (4) with regard to the ξ_i’s given fixed x_j’s amounts to anchored scaling
with fixed anchors {x_1, ..., x_N} and floaters {ξ_1, ..., ξ_N}.
The first point assures that gradient steps based on (3) with reduced summation do find
local maxima of the “unimodal” criterion (2) when the summation set I is symmetric. The
second point hints that we should be able to incorporate anchored scaling by searching for
local maxima of the bimodal ratio (4) if we use suitable summation sets I that are not
symmetric.
We illustrate this point with an example. Consider a dissimilarity matrix whose first
column is missing: D_{i,1} = NA ∀i, or equivalently, I = {(i, j) | 1 ≤ i ≤ N, 2 ≤ j ≤ N}. The
missing first column means that x_1 does not contribute to the partial gradient (3) of any
point whatsoever, but its own partial gradient has contributions from those points for which
D_{1,j} is not missing. Intuitively, x_1 does not exert force but it “feels” force from other points;
it does not “push”, but it is being “pushed”. In other words, x_1 is a floater, and its anchors
are {x_j | D_{1,j} ≠ NA}. This example shows that a suitable NA pattern in the dissimilarities
permits us to implement anchored scaling in addition to conventional scaling.
We discuss briefly the NA patterns of the group-based scaling methods of Section 4.4. For
within-groups scaling with two groups (rows and columns ordered as group 1, group 2):

$$ D \;=\; \begin{pmatrix} D_{grp1,grp1} & NA \\ NA & D_{grp2,grp2} \end{pmatrix} $$
Group 1 gets scaled internally, and so does group 2. The forces are confined within
the groups. The summation set I is symmetric.
For between-groups scaling:

$$ D \;=\; \begin{pmatrix} NA & D_{grp1,grp2} \\ D_{grp2,grp1} & NA \end{pmatrix} $$
The two groups exert force on each other, but there is no force within the groups. The
summation set I is again symmetric.
For anchored scaling with scaled anchors (rows and columns ordered as anchors, floaters):

$$ D \;=\; \begin{pmatrix} D_{ancr,ancr} & NA \\ D_{fltr,ancr} & NA \end{pmatrix} $$
Here the summation set I is no longer symmetric. The top left submatrix causes
conventional scaling of the anchors. The bottom left submatrix exerts the push of the
anchors on the floaters. The two blocks of NA’s on the right imply that the columns
for the floaters are absent, that is, the floaters do not push any points.
For anchored scaling with fixed anchors:

$$ D \;=\; \begin{pmatrix} NA & NA \\ D_{fltr,ancr} & NA \end{pmatrix} $$
Again the summation set I is not symmetric. The two top blocks of NA’s imply that
the anchors are fixed: they are not being pushed by anyone. The only push is exerted
by the anchors on the floaters through the matrix on the bottom left.
It becomes clear that NA patterns and the corresponding summation sets I form a lan-
guage for expressing arbitrarily complex constellations of forces. This idea can be formalized
in terms of what we may call a “force graph”, defined as the directed graph with nodes
{1, ..., N} and edges in the summation set

$$ I \;=\; \{(i, j) \mid D_{i,j} \neq NA\} \,. $$
An edge (i, j) stands for “j pushes i”. Conventional MDS is represented by a complete
graph, where every point pushes every other point. For within-groups scaling the force
graph decomposes into disconnected complete subgraphs (cliques). Between-groups scaling
has a complete bi-directional, multi-partite graph, that is, the node set is decomposed into
two or more disjoint partitions, and the edge set is the set of edges from any one partition to
any other. Anchored MDS with fixed anchors has a uni-directional complete bipartite graph,
that is, the two partitions have asymmetric roles in that the edges go only from one of the
partitions to the other. In anchored MDS with scaled anchors, the latter form in addition
a clique. One can obviously conceive of more complex force graphs, such as multi-partite
graphs with layered anchoring, or graphs with selected force cycles, but this is not the place
to pursue the possibilities in detail.
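The force graph is easily materialized from the NA pattern. A small sketch (Python with NumPy; `force_edges` is our illustrative name, not X/GGvis internals) builds the edge set I, detects floaters (points whose column is entirely NA, so they push nobody), and checks whether I is symmetric:

```python
import numpy as np

def force_edges(D):
    """Build the force graph from the NA pattern of D, where the edge set is
    I = {(i, j) | D_ij != NA} and edge (i, j) is read as "j pushes i".
    NAs are encoded as np.nan. Returns the edge list, the floaters, and
    whether I is symmetric. Illustrative helper, not X/GGvis code.
    """
    N = D.shape[0]
    off = ~np.eye(N, dtype=bool)
    present = ~np.isnan(D) & off
    edges = [(i, j) for i in range(N) for j in range(N) if present[i, j]]
    floaters = [j for j in range(N) if not present[:, j].any()]  # push nobody
    edge_set = set(edges)
    symmetric = all((j, i) in edge_set for (i, j) in edges)
    return edges, floaters, symmetric
```

Applied to the fixed-anchors pattern above, the edge set consists only of anchor-to-floater pushes, the floaters are exactly the all-NA columns, and I is asymmetric.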
distance scaling. It is therefore a common strategy to use classical scaling solutions as initial
configurations for distance scaling. For what it is worth, on a laptop computer of model
year 2000, the dimension-reduction example of Section 7 with N = 1926 required about 5
seconds per gradient step for metric distance scaling and under 4 seconds per step for metric
classical scaling.
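Classical scaling is cheap because it reduces to an eigendecomposition; this is the textbook Torgerson construction (sketched here in Python with NumPy, not the X/GGvis code). For a Euclidean distance matrix of k-dimensional data it recovers the configuration exactly (up to rotation and sign), which is one reason classical solutions make good starting points:

```python
import numpy as np

def classical_scaling(D, k=2):
    """Torgerson's classical scaling: double-center -0.5 * D**2 and scale the
    top-k eigenvectors by the square roots of the eigenvalues.

    D : (N, N) matrix of dissimilarities interpreted as distances.
    Returns an (N, k) configuration.
    """
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]           # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

The resulting configuration can then be handed to iterative distance scaling as its initial configuration.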
We mentioned in Section 4.5 that weighted MDS is costly. When N is large, one should
abstain from the use of weights for space reasons. We do not allocate a weight array if the
weights are identical to 1.
The reach of MDS is extended when a substantial number of terms is trimmed from the
Stress function. Such trimming is most promising in anchored MDS (Section 4.4), which
can be applied if an informative set of anchors can be found. The choice of anchors can be
crucial; in particular, a random choice of anchors often does not work. We have, however, had
success with the example of size N = 3648 mentioned earlier: satisfactory configurations were
found with an anchor set of size 100, which reduced the time needed for a single gradient
step from 100 seconds to 6 seconds.
• Dimension reduction: From a multivariate dataset with eight demographic and tele-
phone usage variables for 1926 households we computed a Euclidean distance matrix
after standardizing the variables. MDS was used to reduce the dimension from eight to
two by creating a 2-D map. Figure 8 shows the result, side by side with a 2-D principal
component projection. While MDS squeezes 20% more variance into two dimensions
than PCA, its map shows rounding on the periphery that may be artifactual. This de-
fect, combined with vastly greater computational cost, convinced us that MDS cannot
be generally recommended for dimension reduction.
• Layout of telephone call graphs: The development of XGvis was originally mo-
tivated by the problem of laying out graphs in more than two dimensions and using
XGobi as a display device (Littman et al. 1992). In order to lay out a graph with MDS,
one computes from the graph a distance matrix that can be subjected to MDS, such as
the “shortest-path metric”. X/GGvis accepts graphs as input data, represented as a
list of pairs of object numbers, and it computes the shortest-path metric at start-up. It
also represents the graph visually with lines connecting configuration points. Figure 9
shows an example of a call graph with 110 nodes.
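The shortest-path metric of an unweighted graph can be computed by breadth-first search from every node. A sketch (Python standard library only; `shortest_path_metric` is a hypothetical helper name, as X/GGvis performs this computation internally at start-up):

```python
from collections import deque

def shortest_path_metric(n_nodes, edges):
    """Shortest-path distance matrix of an undirected, unweighted graph given
    as a list of node-number pairs, via BFS from every node. The resulting
    matrix can be handed to MDS for graph layout.
    """
    adj = [[] for _ in range(n_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    INF = float("inf")                 # disconnected pairs stay infinite
    D = [[INF] * n_nodes for _ in range(n_nodes)]
    for s in range(n_nodes):
        D[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if D[s][v] == INF:
                    D[s][v] = D[s][u] + 1
                    queue.append(v)
    return D
```

For disconnected graphs the infinite entries would have to be treated as missing dissimilarities (NAs) in the sense of Section 6.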
8 Conclusions
This article hoped to accomplish the following:
Finally, we refer the interested reader to a companion paper (Buja and Swayne 2002) for
MDS methodology that is supported by the functionality described in this article.
Figure 8: Marketing Segmentation Data. Left: MDS reduction to 2-D; right: largest two
principal components. The glyphs represent four market segments constructed with k-means
clustering using four means.
Figure 9: A Telephone Call Graph, Laid Out in 2-D. Left: classical scaling (Stress=0.34);
right: distance scaling (Stress=0.23). The nodes represent telephone numbers, the edges
represent the existence of a call between two telephone numbers in a given time period.
Figure 10: Nanotube Embedding. One of Asimov’s graphs for a nanotube is rendered with
MDS in 3-D (Stress=0.06). The nodes represent carbon atoms, the lines represent chemical
bonds. The right hand frame shows the cap of the tube only. Some of the pentagons are
clearly visible.
Acknowledgments
We thank Jon Kettenring and Daryl Pregibon, the two managers who made this work pos-
sible. We are indebted to Brian Ripley for his port of XGvis/XGobi to Microsoft
Windows™. We also thank an associate editor and a reviewer for extensive comments
and help with scholarship.
References
[1] Asimov, D. (1998), “Geometry of Capped Nanocylinders,” AT&T Labs Technical
Memorandum, http://www.research.att.com/areas/stat/nano
[2] Borg, I., and Groenen, P. (1997), Modern Multidimensional Scaling: Theory and Ap-
plications, New York: Springer-Verlag.
[3] Buja, A., Cook, D., and Swayne, D. F. (1996), “Interactive high-dimensional data
visualization,” J. of Computational and Graphical Statistics, 5, pp. 78–99.
[4] Buja, A., and Swayne, D. F. (2002), “Visualization Methodology for Multidimensional
Scaling,” J. of Classification, 19, pp. 7-43.
[5] Cox, T. F., and Cox, M. A. A. (1994), Multidimensional Scaling, London: Chapman &
Hall.
[6] Crippen, G. M., and Havel, T. F. (1978), “Stable calculation of coordinates from
distance information,” Acta crystallographica, A34, pp. 282-284.
[7] Di Battista, G., Eades, P., Tamassia, R. and Tollis, I. (1994), “Algorithms For Drawing
Graphs: An Annotated Bibliography,” Computational Geometry, 4, pp. 235-282.
[8] Glunt, W., Hayden, T. L., and Raydan, M. (1993), “Molecular conformation from
distance matrices,” J. of Computational Chemistry, 14(1), pp. 114-120.
[9] Graef, J., and Spence, I. (1979), “Using Distance Information in the Design of Large
Multidimensional Scaling Experiments,” Psychological Bulletin, 86, pp. 60-66.
[10] Havel, T. F. (1991), “An evaluation of computational strategies for use in the deter-
mination of protein structure from distance constraints obtained by nuclear magnetic
resonance,” Progress in Biophysics and Molecular Biology, 56, pp. 43-78.
[11] Kearsley, A. J., Tapia, R. A., and Trosset, M. W. (1998), “The solution of the metric
STRESS and SSTRESS in multidimensional scaling by Newton’s method,” Computa-
tional Statistics, 13, pp. 369–396.
Microsoft Windows is a trademark of Microsoft, Inc.
[12] Kruskal, J. B., and Wish, M. (1978), Multidimensional Scaling, Beverly Hills and
London: Sage Publications.
[15] Kruskal, J. B., and Seery, J. B. (1980), “Designing Network Diagrams,” Proceedings of
the First General Conference on Social Graphics, July 1980, US Dept of the Census,
Washington, DC, pp. 22-50.
[16] Littman, M., Swayne, D. F., Dean, N., and Buja, A. (1992), “Visualizing the embedding
of objects in Euclidean space,” Computing Science and Statistics, 24, pp. 208-217.
[17] McFarlane, M., and Young, F. W. (1994), “Graphical Sensitivity Analysis for Multidi-
mensional Scaling,” J. of Computational and Graphical Statistics, 3, pp. 23-33.
[18] Meulman, J.J. (1992), “The integration of multidimensional scaling and multivariate
analysis with optimal scaling,” Psychometrika, 57, pp. 539-565.
[19] Rothkopf, E. Z. (1957), “A measure of stimulus similarity and errors in some paired-
associate learning tasks,” J. of Experimental Psychology, 53, pp. 94-101.
[20] Sammon, J. W. (1969), “A Non-Linear Mapping for Data Structure Analysis,” IEEE
Trans. on Computers, C-18(5).
[22] Swayne, D. F., Cook, D., and Buja, A. (1998), “XGobi: Interactive Data Visualization
in the X Window System,” J. of Computational and Graphical Statistics, 7, pp. 113-130.
[23] Swayne, D. F., Buja, A., and Temple-Lang, D. (2003), “Exploratory Visual Analysis
of Graphs in GGobi,” Proceedings of the Third Annual Workshop on Distributed Statistical
Computing (DSC 2003), Vienna.
[24] Swayne, D. F., Temple-Lang, D., Buja, A., and Cook, D. (2002), “GGobi: Evolving
from XGobi into an Extensible Framework for Interactive Data Visualization,” Journal
of Computational Statistics and Data Analysis.
[25] Takane, Y., Young, F. W. and De Leeuw, J. (1977), “Nonmetric individual differences
multidimensional scaling: An alternating least-squares method with optimal scaling
features,” Psychometrika, 42, pp. 7-67.
[26] Theus, M. and Schonlau, M., (1998), “Intrusion Detection Based on Structural Zeroes,”
Statistical Computing & Graphics Newsletter, 9, pp. 12-17. Alexandria, VA: American
Statistical Association.