

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Sam T. Roweis(1) and Lawrence K. Saul(2)

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

(1) Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK. (2) AT&T Lab—Research, 180 Park Avenue, Florham Park, NJ 07932, USA. E-mail: [email protected] (S.T.R.); [email protected] (L.K.S.)

How do we judge similarity? Our mental representations of the world are formed by processing large numbers of sensory inputs—including, for example, the pixel intensities of images, the power spectra of sounds, and the joint angles of articulated bodies. While complex stimuli of this form can be represented by points in a high-dimensional vector space, they typically have a much more compact description. Coherent structure in the world leads to strong correlations between inputs (such as between neighboring pixels in images), generating observations that lie on or close to a smooth low-dimensional manifold. To compare and classify such observations—in effect, to reason about the world—depends crucially on modeling the nonlinear geometry of these low-dimensional manifolds.

Scientists interested in exploratory analysis or visualization of multivariate data (1) face a similar problem in dimensionality reduction. The problem, as illustrated in Fig. 1, involves mapping high-dimensional inputs into a low-dimensional "description" space with as many coordinates as observed modes of variability. Previous approaches to this problem, based on multidimensional scaling (MDS) (2), have computed embeddings that attempt to preserve pairwise distances [or generalized disparities (3)] between data points; these distances are measured along straight lines or, in more sophisticated usages of MDS such as Isomap (4), along shortest paths confined to the manifold of observed inputs. Here, we take a different approach, called locally linear embedding (LLE), that eliminates the need to estimate pairwise distances between widely separated data points. Unlike previous methods, LLE recovers global nonlinear structure from locally linear fits.

Fig. 1. The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. Note that mixture models for local dimensionality reduction (29), which cluster the data and perform PCA within each cluster, do not address the problem considered here: namely, how to map high-dimensional data into a single global coordinate system of lower dimensionality.

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of N real-valued vectors X_i, each of dimensionality D, sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function

    \varepsilon(W) = \sum_i \left| \vec{X}_i - \sum_j W_{ij} \vec{X}_j \right|^2    (1)

which adds up the squared distances between all the data points and their reconstructions. The weights W_ij summarize the contribution of the jth data point to the ith reconstruction. To compute the weights W_ij, we minimize the cost function subject to two constraints: first, that each data point X_i is reconstructed only from its neighbors (5), enforcing W_ij = 0 if X_j does not belong to the set of neighbors of X_i; second, that the rows of the weight matrix sum to one: \sum_j W_{ij} = 1. The optimal weights W_ij subject to these constraints (6) are found by solving a least-squares problem (7).
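To make Eq. 1 and its two constraints concrete, here is a minimal numpy sketch (an illustration, not the authors' code); the data matrix X, weight matrix W, and neighbor index lists are assumed inputs.

```python
# Minimal sketch of Eq. 1: X is an N x D data matrix, W an N x N weight
# matrix, and `neighbors` a list giving the neighbor indices of each point.
import numpy as np

def reconstruction_cost(X, W):
    """epsilon(W) = sum_i || X_i - sum_j W_ij X_j ||^2"""
    residuals = X - W @ X            # row i holds X_i - sum_j W_ij X_j
    return float(np.sum(residuals ** 2))

def satisfies_constraints(W, neighbors):
    # Each row sums to one, and W_ij = 0 whenever j is not a neighbor of i.
    rows_ok = np.allclose(W.sum(axis=1), 1.0)
    mask = np.zeros_like(W, dtype=bool)
    for i, nbrs in enumerate(neighbors):
        mask[i, nbrs] = True
    sparsity_ok = np.all(W[~mask] == 0)
    return rows_ok and sparsity_ok
```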


The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. By symmetry, it follows that the reconstruction weights characterize intrinsic geometric properties of each neighborhood, as opposed to properties that depend on a particular frame of reference (8). Note that the invariance to translations is specifically enforced by the sum-to-one constraint on the rows of the weight matrix.

Suppose the data lie on or near a smooth nonlinear manifold of lower dimensionality d << D. To a good approximation then, there exists a linear mapping—consisting of a translation, rotation, and rescaling—that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights W_ij reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights W_ij that reconstruct the ith data point in D dimensions should also reconstruct its embedded manifold coordinates in d dimensions.

Fig. 2. Steps of locally linear embedding: (1) Assign neighbors to each data point X_i (for example by using the K nearest neighbors). (2) Compute the weights W_ij that best linearly reconstruct X_i from its neighbors, solving the constrained least-squares problem in Eq. 1. (3) Compute the low-dimensional embedding vectors Y_i best reconstructed by W_ij, minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights W_ij and vectors Y_i are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.
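The invariance of the weights described above can be checked numerically. The sketch below (an illustration under stated assumptions, not the authors' code) solves the constrained least squares of Eq. 1 for a single point using the local Gram matrix of centered neighbors, one standard way to obtain the weights, and verifies that jointly rotating, rescaling, and translating the point and its neighbors leaves the weights unchanged.

```python
# Numerical check of the symmetry of the constrained reconstruction weights.
import numpy as np

def local_weights(x, nbrs, reg=1e-3):
    # Solve the constrained least squares for one point: form the Gram matrix
    # of centered neighbors, condition it slightly (scaled by its trace so the
    # check stays exact under rescaling), solve G w = 1, and normalize.
    Z = nbrs - x                        # K x D matrix of centered neighbors
    G = Z @ Z.T
    G += reg * np.trace(G) * np.eye(len(nbrs))
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()                  # enforce sum-to-one

rng = np.random.default_rng(0)
x, nbrs = rng.normal(size=3), rng.normal(size=(5, 3))

# Random rotation (QR of a random matrix), rescaling, and translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
s, t = 2.7, rng.normal(size=3)

w_original = local_weights(x, nbrs)
w_transformed = local_weights(s * x @ Q + t, s * nbrs @ Q + t)
print(np.allclose(w_original, w_transformed))   # True, up to numerical precision
```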
LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation X_i is mapped to a low-dimensional vector Y_i representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Y_i to minimize the embedding cost function

    \Phi(Y) = \sum_i \left| \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j \right|^2    (2)

This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights W_ij while optimizing the coordinates Y_i. The embedding cost in Eq. 2 defines a quadratic form in the vectors Y_i. Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse N × N eigenvalue problem (9), whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.

Implementation of the algorithm is straightforward. In our experiments, data points were reconstructed from their K nearest neighbors, as measured by Euclidean distance or normalized dot products. For such implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K. Once neighbors are chosen, the optimal weights W_ij and coordinates Y_i are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. 1 and 2.

Fig. 3. Images of faces (11) mapped into the embedding space described by the first two coordinates of LLE. Representative faces are shown next to circled points in different parts of the space. The bottom images correspond to points along the top-right path (linked by solid line), illustrating one particular mode of variability in pose and expression.
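Putting the three steps of Fig. 2 together, the following sketch is one way the full algorithm can be written in a few dozen lines of numpy (an illustration under the stated assumptions, not the authors' implementation). The values of K, d, and the regularization constant are placeholder choices, and a dense eigendecomposition is used for clarity rather than the sparse solvers discussed in note 9.

```python
# Sketch of the three steps in Fig. 2 (illustrative, not the authors' code).
# X: N x D data matrix; K, d, and reg are assumed parameter choices.
import numpy as np

def lle(X, K=12, d=2, reg=1e-3):
    N = X.shape[0]

    # Step 1: assign K nearest neighbors by Euclidean distance.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, 1:K + 1]      # skip the point itself

    # Step 2: constrained least squares of Eq. 1 for each point.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                          # K x D centered neighbors
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(K)                  # condition G (cf. note 7)
        w = np.linalg.solve(G, np.ones(K))
        W[i, neighbors[i]] = w / w.sum()                    # rows sum to one

    # Step 3: minimize Eq. 2 via the bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, vecs = np.linalg.eigh(M)                             # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                                 # drop the constant bottom eigenvector

# Usage: Y = lle(X, K=12, d=2) assigns each row of X a d-dimensional coordinate.
```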


In addition to the example in Fig. 1, for which the true manifold structure was known (10), we also applied LLE to images of faces (11) and vectors of word-document counts (12). Two-dimensional embeddings of faces and words are shown in Figs. 3 and 4. Note how the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces and the semantic associations of words.

Many popular learning algorithms for nonlinear dimensionality reduction do not share the favorable properties of LLE. Iterative hill-climbing methods for autoencoder neural networks (13, 14), self-organizing maps (15), and latent variable models (16) do not have the same guarantees of global optimality or convergence; they also tend to involve many more free parameters, such as learning rates, convergence criteria, and architectural specifications. Finally, whereas other nonlinear methods rely on deterministic annealing schemes (17) to avoid local minima, the optimizations of LLE are especially tractable.

LLE scales well with the intrinsic manifold dimensionality, d, and does not require a discretized gridding of the embedding space. As more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. Unlike methods such as principal curves and surfaces (18) or additive component models (19), LLE is not limited in practice to manifolds of extremely low dimensionality or codimensionality. Also, the intrinsic value of d can itself be estimated by analyzing a reciprocal cost function, in which reconstruction weights derived from the embedding vectors Y_i are applied to the data points X_i.
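One reading of this reciprocal cost (a sketch under assumptions of this rewrite, not a procedure spelled out in the paper) is to refit local weights in the d-dimensional embedding and measure how well those weights reconstruct the original high-dimensional points; sweeping d and noting where the cost stops improving then suggests the intrinsic dimensionality.

```python
# Sketch of one possible reciprocal cost: weights derived from the embedding
# Y are applied to the data points X. `neighbors` is the N x K index array.
import numpy as np

def reciprocal_cost(X, Y, neighbors, reg=1e-3):
    N, K = neighbors.shape
    cost = 0.0
    for i in range(N):
        Z = Y[neighbors[i]] - Y[i]                  # local fit in embedding space
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(K)
        w = np.linalg.solve(G, np.ones(K))
        w /= w.sum()
        cost += np.sum((X[i] - w @ X[neighbors[i]]) ** 2)
    return cost

# Sweeping d and evaluating reciprocal_cost(X, lle(X, K, d), neighbors), with
# lle() as in the earlier sketch, is one way to look for the dimensionality at
# which the cost stops improving.
```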
LLE illustrates a general principle of manifold learning, elucidated by Martinetz and Schulten (20) and Tenenbaum (4), that overlapping local neighborhoods—collectively analyzed—can provide information about global geometry. Many virtues of LLE are shared by Tenenbaum's algorithm, Isomap, which has been successfully applied to similar problems in nonlinear dimensionality reduction. Isomap's embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE takes a different approach, analyzing local symmetries, linear coefficients, and reconstruction errors instead of global constraints, pairwise distances, and stress functions. It thus avoids the need to solve large dynamic programming problems, and it also tends to accumulate very sparse matrices, whose structure can be exploited for savings in time and space.

LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning. For example, a parametric mapping between the observation and embedding spaces could be learned by supervised neural networks (21) whose target values are generated by LLE. LLE can also be generalized to harder settings, such as the case of disjoint data manifolds (22), and specialized to simpler ones, such as the case of time-ordered observations (23).

Perhaps the greatest potential lies in applying LLE to diverse problems beyond those considered here. Given the broad appeal of traditional methods, such as PCA and MDS, the algorithm should find widespread use in many areas of science.

Fig. 4. Arranging words in a continuous semantic space. Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors (12), resulting in an embedding location for each word. Shown are words from two different bounded regions (A) and (B) of the embedding space discovered by LLE. Each panel shows a two-dimensional projection onto the third and fourth coordinates of LLE; in these two dimensions, the regions (A) and (B) are highly overlapped. The inset in (A) shows a three-dimensional projection onto the third, fourth, and fifth coordinates, revealing an extra dimension along which regions (A) and (B) are more separated. Words that lie in the intersection of both regions are capitalized. Note how LLE co-locates words with similar contexts in this continuous semantic space.
References and Notes
1. M. L. Littman, D. F. Swayne, N. Dean, A. Buja, in Computing Science and Statistics: Proceedings of the 24th Symposium on the Interface, H. J. N. Newton, Ed. (Interface Foundation of North America, Fairfax Station, VA, 1992), pp. 208–217.
2. T. Cox, M. Cox, Multidimensional Scaling (Chapman & Hall, London, 1994).
3. Y. Takane, F. W. Young, Psychometrika 42, 7 (1977).
4. J. Tenenbaum, in Advances in Neural Information Processing 10, M. Jordan, M. Kearns, S. Solla, Eds. (MIT Press, Cambridge, MA, 1998), pp. 682–688.
5. The set of neighbors for each data point can be assigned in a variety of ways: by choosing the K nearest neighbors in Euclidean distance, by considering all data points within a ball of fixed radius, or by using prior knowledge. Note that for a fixed number of neighbors, the maximum number of embedding dimensions LLE can be expected to recover is strictly less than the number of neighbors.
6. For certain applications, one might also constrain the weights to be positive, thus requiring the reconstruction of each data point to lie within the convex hull of its neighbors.
7. Fits: The constrained weights that best reconstruct each data point from its neighbors can be computed in closed form (a numpy transcription of this recipe appears after the reference list). Consider a particular data point x with neighbors η_j and sum-to-one reconstruction weights w_j. The reconstruction error |x − Σ_{j=1}^{K} w_j η_j|² is minimized in three steps. First, evaluate inner products between neighbors to compute the neighborhood correlation matrix, C_jk = η_j · η_k, and its matrix inverse, C⁻¹. Second, compute the Lagrange multiplier, λ = α/β, that enforces the sum-to-one constraint, where α = 1 − Σ_jk C⁻¹_jk (x · η_k) and β = Σ_jk C⁻¹_jk. Third, compute the reconstruction weights: w_j = Σ_k C⁻¹_jk (x · η_k + λ). If the correlation matrix C is nearly singular, it can be conditioned (before inversion) by adding a small multiple of the identity matrix. This amounts to penalizing large weights that exploit correlations beyond some level of precision in the data sampling process.


8. Indeed, LLE does not require the original data to be described in a single coordinate system, only that each data point be located in relation to its neighbors.
9. The embedding vectors Y_i are found by minimizing the cost function Φ(Y) = Σ_i |Y_i − Σ_j W_ij Y_j|² over Y_i with fixed weights W_ij. This optimization is performed subject to constraints that make the problem well posed. It is clear that the coordinates Y_i can be translated by a constant displacement without affecting the cost, Φ(Y). We remove this degree of freedom by requiring the coordinates to be centered on the origin: Σ_i Y_i = 0. Also, to avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with outer products that satisfy (1/N) Σ_i Y_i ⊗ Y_i = I, where I is the d × d identity matrix. Now the cost defines a quadratic form,

    \Phi(Y) = \sum_{ij} M_{ij} \, (\vec{Y}_i \cdot \vec{Y}_j)

involving inner products of the embedding vectors and the symmetric N × N matrix

    M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}    (3)

where δ_ij is 1 if i = j and 0 otherwise. The optimal embedding, up to a global rotation of the embedding space, is found by computing the bottom d + 1 eigenvectors of this matrix (24). The bottom eigenvector of this matrix, which we discard, is the unit vector with all equal components; it represents a free translation mode of eigenvalue zero. (Discarding it enforces the constraint that the embeddings have zero mean.) The remaining d eigenvectors form the d embedding coordinates found by LLE. Note that the matrix M can be stored and manipulated as the sparse matrix (I − W)ᵀ(I − W), giving substantial computational savings for large values of N. Moreover, its bottom d + 1 eigenvectors (those corresponding to its smallest d + 1 eigenvalues) can be found efficiently without performing a full matrix diagonalization (25); see the sparse-solver sketch after the reference list.
10. Manifold: Data points in Fig. 1B (N = 2000) were sampled from the manifold (D = 3) shown in Fig. 1A. Nearest neighbors (K = 20) were determined by Euclidean distance. This particular manifold was introduced by Tenenbaum (4), who showed that its global structure could be learned by the Isomap algorithm.
11. Faces: Multiple photographs (N = 2000) of the same face were digitized as 20 × 28 grayscale images. Each image was treated by LLE as a data vector with D = 560 elements corresponding to raw pixel intensities. Nearest neighbors (K = 12) were determined by Euclidean distance in pixel space.
12. Words: Word-document counts were tabulated for N = 5000 words from D = 31,000 articles in Grolier's Encyclopedia (26). Nearest neighbors (K = 20) were determined by dot products between count vectors normalized to unit length.
13. D. DeMers, G. W. Cottrell, in Advances in Neural Information Processing Systems 5, D. Hanson, J. Cowan, L. Giles, Eds. (Kaufmann, San Mateo, CA, 1993), pp. 580–587.
14. M. Kramer, AIChE J. 37, 233 (1991).
15. T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1988).
16. C. Bishop, M. Svensen, C. Williams, Neural Comput. 10, 215 (1998).
17. H. Klock, J. Buhmann, Pattern Recognition 33, 651 (1999).
18. T. J. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502 (1989).
19. D. J. Donnell, A. Buja, W. Stuetzle, Ann. Stat. 22, 1635 (1994).
20. T. Martinetz, K. Schulten, Neural Networks 7, 507 (1994).
21. D. Beymer, T. Poggio, Science 272, 1905 (1996).
22. Although in all the examples considered here the data had a single connected component, it is possible to formulate LLE for data that lies on several disjoint manifolds, possibly of different underlying dimensionality. Suppose we form a graph by connecting each data point to its neighbors. The number of connected components (27) can be detected by examining powers of its adjacency matrix. Different connected components of the data are essentially decoupled in the eigenvector problem for LLE. Thus, they are best interpreted as lying on distinct manifolds, and are best analyzed separately by LLE.
23. If neighbors correspond to nearby observations in time, then the reconstruction weights can be computed online (as the data itself is being collected) and the embedding can be found by diagonalizing a sparse banded matrix.
24. R. A. Horn, C. R. Johnson, Matrix Analysis (Cambridge Univ. Press, Cambridge, 1990).
25. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Eds., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2000).
26. D. D. Lee, H. S. Seung, Nature 401, 788 (1999).
27. R. Tarjan, Data Structures and Network Algorithms, CBMS 44 (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983).
28. I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1989).
29. N. Kambhatla, T. K. Leen, Neural Comput. 9, 1493 (1997).
30. We thank G. Hinton and M. Revow for sharing their unpublished work (at the University of Toronto) on segmentation and pose estimation that motivated us to "think globally, fit locally"; J. Tenenbaum (Stanford University) for many stimulating discussions about his work (4) and for sharing his code for the Isomap algorithm; D. D. Lee (Bell Labs) and B. Frey (University of Waterloo) for making available word and face data from previous work (26); and C. Brody, A. Buja, P. Dayan, Z. Ghahramani, G. Hinton, T. Jaakkola, D. Lee, F. Pereira, and M. Sahani for helpful comments. S.T.R. acknowledges the support of the Gatsby Charitable Foundation, the U.S. National Science Foundation, and the Natural Sciences and Engineering Research Council of Canada.

7 August 2000; accepted 17 November 2000
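Note 7's closed-form recipe translates directly into numpy. The sketch below is an illustration of that recipe under an assumed regularization scale, not the authors' code; up to the conditioning term it is equivalent to the centered-Gram solver used in the earlier sketches.

```python
# Note 7 in numpy (a sketch; the regularization scale `reg` is an assumption).
# x: data point of dimension D; eta: K x D array of its neighbors.
import numpy as np

def closed_form_weights(x, eta, reg=1e-3):
    K = eta.shape[0]
    C = eta @ eta.T                              # C_jk = eta_j . eta_k
    C = C + reg * np.trace(C) * np.eye(K)        # condition C if nearly singular
    Cinv = np.linalg.inv(C)
    p = eta @ x                                  # p_k = x . eta_k
    alpha = 1.0 - np.sum(Cinv @ p)               # alpha = 1 - sum_jk Cinv_jk (x . eta_k)
    beta = Cinv.sum()                            # beta  = sum_jk Cinv_jk
    lam = alpha / beta                           # Lagrange multiplier for sum-to-one
    return Cinv @ (p + lam)                      # w_j = sum_k Cinv_jk (x . eta_k + lambda)
```

By construction the returned weights sum to one exactly, even with the conditioning term included.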

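Note 9 describes storing M as the sparse product (I − W)ᵀ(I − W) and extracting only its bottom d + 1 eigenvectors. The sketch below uses scipy's sparse shift-invert eigensolver as a stand-in for the methods of (25); that solver choice is an assumption, not part of the paper.

```python
# Sketch of the sparse shortcut in note 9 (scipy's shift-invert eigensolver
# is an assumed stand-in for the methods of reference 25).
# W: N x N scipy.sparse matrix of reconstruction weights; d: embedding dimension.
import numpy as np
from scipy.sparse import identity, csr_matrix
from scipy.sparse.linalg import eigsh

def embedding_from_weights(W, d):
    N = W.shape[0]
    A = identity(N, format='csr') - csr_matrix(W)
    M = (A.T @ A).tocsc()                        # M = (I - W)^T (I - W), kept sparse
    # Shift-invert around zero targets the smallest eigenvalues; this can be
    # delicate when M is numerically singular (a dense eigh is a fallback).
    vals, vecs = eigsh(M, k=d + 1, sigma=0.0, which='LM')
    order = np.argsort(vals)
    return vecs[:, order][:, 1:]                 # drop the constant bottom eigenvector
```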


Nonlinear Dimensionality Reduction by Locally Linear Embedding
Sam T. Roweis and Lawrence K. Saul
Science 290 (5500), 2323–2326 (22 December 2000). doi:10.1126/science.290.5500.2323
