
An Introduction to Locally Linear Embedding

Lawrence K. Saul
AT&T Labs – Research
180 Park Ave, Florham Park, NJ 07932 USA
[email protected]

Sam T. Roweis
Gatsby Computational Neuroscience Unit, UCL
17 Queen Square, London WC1N 3AR, UK
[email protected]

Abstract
Many problems in information processing involve some form of dimension-
ality reduction. Here we describe locally linear embedding (LLE), an unsu-
pervised learning algorithm that computes low dimensional, neighborhood
preserving embeddings of high dimensional data. LLE attempts to discover
nonlinear structure in high dimensional data by exploiting the local symme-
tries of linear reconstructions. Notably, LLE maps its inputs into a single
global coordinate system of lower dimensionality, and its optimizations—
though capable of generating highly nonlinear embeddings—do not involve
local minima. We illustrate the method on images of lips used in audiovisual
speech synthesis.

1 Introduction

Many problems in statistical pattern recognition begin with the preprocessing of multidimensional signals, such as images of faces or spectrograms of speech.
Often, the goal of preprocessing is some form of dimensionality reduction: to com-
press the signals in size and to discover compact representations of their variability.
Two popular forms of dimensionality reduction are the methods of principal com-
ponent analysis (PCA) [1] and multidimensional scaling (MDS) [2]. Both PCA and
MDS are eigenvector methods designed to model linear variabilities in high dimen-
sional data. In PCA, one computes the linear projections of greatest variance from the top eigenvectors of the data covariance matrix. In classical (or metric) MDS,
one computes the low dimensional embedding that best preserves pairwise dis-
tances between data points. If these distances correspond to Euclidean distances,
the results of metric MDS are equivalent to PCA. Both methods are simple to
implement, and their optimizations do not involve local minima. These virtues ac-
count for the widespread use of PCA and MDS, despite their inherent limitations
as linear methods.
Recently, we introduced an eigenvector method—called locally linear embedding
(LLE)—for the problem of nonlinear dimensionality reduction[4]. This problem
is illustrated by the nonlinear manifold in Figure 1. In this example, the dimen-
sionality reduction by LLE succeeds in identifying the underlying structure of the
manifold, while projections of the data by PCA or metric MDS map faraway data
points to nearby points in the plane. Like PCA and MDS, our algorithm is sim-
ple to implement, and its optimizations do not involve local minima. At the same
time, however, it is capable of generating highly nonlinear embeddings. Note that
mixture models for local dimensionality reduction[5, 6], which cluster the data and
perform PCA within each cluster, do not address the problem considered here—
namely, how to map high dimensional data into a single global coordinate system
of lower dimensionality.
In this paper, we review the LLE algorithm in its most basic form and illustrate a
potential application to audiovisual speech synthesis[3].
[Figure 1 appears here, with panels (A), (B), and (C).]

Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three dimensional data (B) sampled from a two dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The shading in (C) illustrates the neighborhood-preserving mapping discovered by LLE.

2 Algorithm

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of $N$ real-valued vectors $\vec{X}_i$, each of dimensionality $D$,
sampled from some smooth underlying manifold. Provided there is sufficient data
(such that the manifold is well-sampled), we expect each data point and its neigh-
bors to lie on or close to a locally linear patch of the manifold.
We can characterize the local geometry of these patches by linear coefficients that
reconstruct each data point from its neighbors. In the simplest formulation of LLE,
one identifies $K$ nearest neighbors per data point, as measured by Euclidean dis-
tance. (Alternatively, one can identify neighbors by choosing all points within a
ball of fixed radius, or by using more sophisticated rules based on local metrics.)
Reconstruction errors are then measured by the cost function:

\[
  \varepsilon(W) \;=\; \sum_i \Big| \vec{X}_i - \sum_j W_{ij}\, \vec{X}_j \Big|^2   \qquad (1)
\]
which adds up the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction. To compute the weights $W_{ij}$, we minimize the cost function subject to two constraints: first, that each data point $\vec{X}_i$ is reconstructed only from its neighbors, enforcing $W_{ij} = 0$ if $\vec{X}_j$ does not belong to this set; second, that the rows of the weight matrix sum to one: $\sum_j W_{ij} = 1$. The reason for the
sum-to-one constraint will become clear shortly. The optimal weights subject
to these constraints are found by solving a least squares problem, as discussed in
Appendix A.
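As a concrete illustration of eq. (1), the following minimal numpy sketch (our own, not part of the original paper; the array names X and W are ours) evaluates the reconstruction cost for a given weight matrix.

```python
import numpy as np

def reconstruction_cost(X, W):
    """Evaluate eq. (1): sum_i | X_i - sum_j W_ij X_j |^2.

    X : (N, D) array of data points.
    W : (N, N) array of reconstruction weights; each row sums to one
        and is nonzero only on the K neighbors of that point.
    """
    residuals = X - W @ X          # each row is X_i minus its reconstruction
    return np.sum(residuals ** 2)  # squared distances, summed over all points
```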
Note that the constrained weights that minimize these reconstruction errors obey an
important symmetry: for any particular data point, they are invariant to rotations,
rescalings, and translations of that data point and its neighbors. The invariance to
rotations and rescalings follows immediately from the form of eq. (1); the invari-
ance to translations is enforced by the sum-to-one constraint on the rows of the
weight matrix. A consequence of this symmetry is that the reconstruction weights
characterize intrinsic geometric properties of each neighborhood, as opposed to
properties that depend on a particular frame of reference.

Suppose the data lie on or near a smooth nonlinear manifold of dimensionality $d \ll D$. To a good approximation, then, there exists a linear mapping—consisting of a translation, rotation, and rescaling—that maps the high dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights $W_{ij}$ reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights $W_{ij}$ that reconstruct the $i$th data point in $D$ dimensions should also reconstruct its embedded manifold coordinates in $d$ dimensions.
(Informally, imagine taking a pair of scissors, cutting out locally linear patches
of the underlying manifold, and placing them in the low dimensional embedding
space. Assume further that this operation is done in a way that preserves the angles
formed by each data point to its nearest neighbors. In this case, the transplantation
of each patch involves no more than a translation, rotation, and rescaling of its
data, exactly the operations to which the weights are invariant. Thus, when the
patch arrives at its low dimensional destination, we expect the same weights to
reconstruct each data point from its neighbors.)

LLE constructs a neighborhood preserving mapping based on the above idea. In the final step of the algorithm, each high dimensional observation $\vec{X}_i$ is mapped to a low dimensional vector $\vec{Y}_i$ representing global internal coordinates on the manifold. This is done by choosing $d$-dimensional coordinates $\vec{Y}_i$ to minimize the embedding cost function:

\[
  \Phi(Y) \;=\; \sum_i \Big| \vec{Y}_i - \sum_j W_{ij}\, \vec{Y}_j \Big|^2   \qquad (2)
\]

This cost function—like the previous one—is based on locally linear reconstruction errors, but here we fix the weights $W_{ij}$ while optimizing the coordinates $\vec{Y}_i$. The embedding cost in Eq. (2) defines a quadratic form in the vectors $\vec{Y}_i$. Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse $N \times N$ eigenvector problem, whose bottom $d$ non-zero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin. Details of this eigenvector problem are discussed in Appendix B.
Note that while the reconstruction weights for each data point are computed from its local neighborhood—independent of the weights for other data points—the embedding coordinates are computed by an $N \times N$ eigensolver, a global operation that
couples all data points in connected components of the graph defined by the weight
matrix. The different dimensions in the embedding space can be computed succes-
sively; this is done simply by computing the bottom eigenvectors from eq. (2) one
at a time. But the computation is always coupled across data points. This is how
the algorithm leverages overlapping local information to discover global structure.
Implementation of the algorithm is fairly straightforward, as the algorithm has only

one free parameter: the number of neighbors per data point, $K$. Once neighbors

LLE ALGORITHM

1. Compute the neighbors of each data point, $\vec{X}_i$.

2. Compute the weights $W_{ij}$ that best reconstruct each data point $\vec{X}_i$ from its neighbors, minimizing the cost in eq. (1) by constrained linear fits.

3. Compute the vectors $\vec{Y}_i$ best reconstructed by the weights $W_{ij}$, minimizing the quadratic form in eq. (2) by its bottom nonzero eigenvectors.

Figure 2: Summary of the LLE algorithm, mapping high dimensional data points, $\vec{X}_i$, to low dimensional embedding vectors, $\vec{Y}_i$.

are chosen, the optimal weights $W_{ij}$ and coordinates $\vec{Y}_i$ are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. (1) and (2). As discussed in Appendix A, in the unusual case where the neighbors outnumber the input dimensionality ($K > D$), the least squares problem
for finding the weights does not have a unique solution, and a regularization term—
for example, one that penalizes the squared magnitudes of the weights—must be
added to the reconstruction cost.
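To make the three steps of Fig. 2 concrete, here is a minimal numpy sketch of the basic algorithm. This is our own illustrative code rather than the authors' implementation: the function and variable names are ours, the neighbor search is brute force, and the local covariance is always conditioned as in eq. (6) of Appendix A with an arbitrary constant.

```python
import numpy as np

def lle(X, K, d, reg=1e-3):
    """Minimal LLE sketch: X is an (N, D) data array, K neighbors, d output dims."""
    N, D = X.shape

    # Step 1: K nearest neighbors by Euclidean distance (brute force, O(D N^2)).
    dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :K]

    # Step 2: reconstruction weights by constrained least squares (Appendix A).
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]            # neighbors shifted to the data point
        C = Z @ Z.T                           # local covariance matrix, eq. (4)
        C += np.eye(K) * reg * np.trace(C)    # conditioning, as in eq. (6)
        w = np.linalg.solve(C, np.ones(K))    # solve C w = 1 ...
        W[i, neighbors[i]] = w / w.sum()      # ... and rescale to sum to one

    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W), see Appendix B.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    # Discard the constant eigenvector (eigenvalue zero); keep the next d,
    # rescaled so the embedding has unit covariance as in eq. (10).
    return eigvecs[:, 1:d + 1] * np.sqrt(N)
```

A call such as Y = lle(X, 10, 2), with the number of neighbors chosen by the user, maps the rows of X to two dimensional coordinates.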


The algorithm, as described in Fig. 2, takes as input the high dimensional vectors, $\vec{X}_i$. In many settings, however, the user may not have access to data of this
form, but only to measurements of dissimilarity or pairwise distance between dif-
ferent data points. A simple variation of LLE, described in Appendix C, can be
applied to input of this form. In this way, matrices of pairwise distances can be
analyzed by LLE just as easily as MDS[2]; in fact only a small fraction of all pos-
sible pairwise distances (representing distances between neighboring points and
their respective neighbors) are required for running LLE.

3 Examples

The embeddings discovered by LLE are easiest to visualize for intrinsically two
dimensional manifolds. In Fig. 1, for example, the input to LLE consisted of $N$ data points sampled off the S-shaped manifold. The resulting embedding shows how the algorithm, using $K$ neighbors per data point, successfully unraveled the underlying two dimensional structure.
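Readers who want to reproduce a picture of this kind without implementing the algorithm themselves can use an off-the-shelf implementation; the snippet below is a rough analogue of this experiment using scikit-learn (the sample size and neighbor count are arbitrary choices of ours, not the values used in the paper).

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# Sample points from an S-shaped two dimensional manifold embedded in 3-D.
X, color = make_s_curve(n_samples=1000, random_state=0)

# Map them to two dimensions with LLE; the neighbor count is illustrative.
lle = LocallyLinearEmbedding(n_neighbors=8, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)  # (1000, 2)
```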

Fig. 3 shows another two dimensional manifold, this one living in a much higher
dimensional space. Here, we generated examples—shown in the middle panel of
the figure—by translating the image of a single face across a larger background
of random noise. The noise was uncorrelated from one example to the next. The
only consistent structure in the resulting images thus described a two-dimensional manifold parameterized by the face's center of mass. The input to LLE consisted of $N$ grayscale images, with each image containing a face superimposed on a background of noise. Note that while simple to visualize, the manifold of translated faces is highly nonlinear in the high dimensional vector space of pixel coordinates (of dimensionality $D$). The bottom portion of Fig. 3 shows the first two components discovered by LLE, with $K$ neighbors per data point.
By contrast, the top portion shows the first two components discovered by PCA. It
is clear that the manifold structure in this example is much better modeled by LLE.
Finally, in addition to these examples, for which the true manifold structure was known, we also applied LLE to images of lips used in the animation of talking heads[3]. Our database contained $N$ color (RGB) images of lips at a fixed resolution; dimensionality reduction of these images (of dimensionality $D$) is useful for faster and more efficient animation. The top and bottom panels of Fig. 4 show the first two components discovered, respectively, by PCA and LLE (with $K$ neighbors per data point).
If the lip images described a nearly linear manifold, these two methods would
yield similar results; thus, the significant differences in these embeddings reveal
the presence of nonlinear structure. Note that while the linear projection by PCA
has a somewhat uniform distribution about its mean, the locally linear embedding
has a distinctly spiny structure, with the tips of the spines corresponding to extremal
configurations of the lips.

4 Discussion

It is worth noting that many popular learning algorithms for nonlinear dimension-
ality reduction do not share the favorable properties of LLE. Iterative hill-climbing
methods for autoencoder neural networks[7, 8], self-organizing maps[9], and latent
variable models[10] do not have the same guarantees of global optimality or con-
vergence; they also tend to involve many more free parameters, such as learning
rates, convergence criteria, and architectural specifications.


The different steps of LLE have the following complexities. In Step 1, computing nearest neighbors scales (in the worst case) as $O(DN^2)$, or linearly in the input dimensionality, $D$, and quadratically in the number of data points, $N$.

Figure 3: The results of PCA (top) and LLE (bottom), applied to images of a single
face translated across a two-dimensional background of noise. Note how LLE
maps the images with corner faces to the corners of its two dimensional embedding,
while PCA fails to preserve the neighborhood structure of nearby images.

Figure 4: Images of lips mapped into the embedding space described by the first
two coordinates of PCA (top) and LLE (bottom). Representative lips are shown
next to circled points in different parts of each space. The differences between the
two embeddings indicate the presence of nonlinear structure in the data.

For many data distributions, however, and especially for data distributed on a thin submanifold of the observation space, constructions such as K-D trees can be used to compute the neighbors in $O(N \log N)$ time[13]. In Step 2, computing the reconstruction weights scales as $O(DNK^3)$; this is the number of operations required to solve a $K \times K$ set of linear equations for each data point. In Step 3, computing the bottom eigenvectors scales as $O(dN^2)$, linearly in the number of embedding dimensions, $d$, and quadratically in the number of data points, $N$. Methods for sparse eigenproblems[14], however, can be used to reduce the complexity to subquadratic in $N$. Note that as more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. The storage requirements of LLE are limited by the weight matrix, which is of size $N \times K$.
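As a sketch of the K-D tree shortcut for Step 1 mentioned above (our own example, using scipy's cKDTree; the data sizes are arbitrary):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))        # N = 5000 points in D = 10 dimensions

tree = cKDTree(X)                      # build the tree once
# Query K+1 neighbors because each point is its own nearest neighbor.
dists, idx = tree.query(X, k=13)
neighbors = idx[:, 1:]                 # drop the self-match; K = 12 neighbors
```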
LLE illustrates a general principle of manifold learning, elucidated by Tenenbaum
et al[11], that overlapping local neighborhoods—collectively analyzed—can pro-
vide information about global geometry. Many virtues of LLE are shared by the
Isomap algorithm[11], which has been successfully applied to similar problems
in nonlinear dimensionality reduction. Isomap is an extension of MDS in which
embeddings are optimized to preserve “geodesic” distances between pairs of data
points; these distances are estimated by computing shortest paths through large
sublattices of data. A virtue of LLE is that it avoids the need to solve large dy-
namic programming problems. LLE also tends to accumulate very sparse matrices,
whose structure can be exploited for savings in time and space.
LLE is likely to be even more useful in combination with other methods in data
analysis and statistical learning. An interesting and important question is how to learn a parametric mapping between the observation and embedding spaces, given the results of LLE. One possible approach is to use $(\vec{X}_i, \vec{Y}_i)$ pairs as labeled ex-
amples for statistical models of supervised learning. The ability to learn such map-
pings should make LLE broadly useful in many areas of information processing.
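One simple instance of this idea, sketched below with stand-in data, is to fit a nonparametric regressor from observations to their LLE coordinates, which can then place previously unseen points in the embedding (the regressor choice and all names here are ours, not a recommendation from the paper):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Stand-ins for the (X_i, Y_i) pairs produced by LLE (shapes are illustrative).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))   # high dimensional observations
Y_train = rng.normal(size=(500, 2))    # their low dimensional embeddings

# Fit a nonparametric map from observation space to embedding space ...
mapping = KNeighborsRegressor(n_neighbors=10).fit(X_train, Y_train)

# ... and use it to embed new points without rerunning LLE.
X_new = rng.normal(size=(5, 20))
Y_new = mapping.predict(X_new)         # shape (5, 2)
```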

A Constrained Least Squares Problem

The constrained weights that best reconstruct each data point from its neighbors can be computed in closed form. Consider a particular data point $\vec{x}$ with $K$ nearest neighbors $\vec{\eta}_j$ and reconstruction weights $w_j$ that sum to one. We can write the reconstruction error as:

\[
  \varepsilon \;=\; \Big| \vec{x} - \sum_j w_j \vec{\eta}_j \Big|^2
  \;=\; \Big| \sum_j w_j \big( \vec{x} - \vec{\eta}_j \big) \Big|^2
  \;=\; \sum_{jk} w_j w_k\, C_{jk}   \qquad (3)
\]
where in the first identity, we have exploited the fact that the weights sum to one,
and in the second identity, we have introduced the local covariance matrix,
\[
  C_{jk} \;=\; \big( \vec{x} - \vec{\eta}_j \big) \cdot \big( \vec{x} - \vec{\eta}_k \big)   \qquad (4)
\]

This error can be minimized in closed form, using a Lagrange multiplier to enforce the constraint that $\sum_j w_j = 1$. In terms of the inverse local covariance matrix, the optimal weights are given by:

\[
  w_j \;=\; \frac{\sum_k C^{-1}_{jk}}{\sum_{lm} C^{-1}_{lm}}   \qquad (5)
\]

The solution, as written in eq. (5), appears to require an explicit inversion of the local covariance matrix. In practice, a more efficient way to minimize the error is simply to solve the linear system of equations, $\sum_k C_{jk} w_k = 1$, and then to rescale the weights so that they sum to one (which yields the same result). By construction, the local covariance matrix in eq. (4) is symmetric and semipositive definite. If the covariance matrix is singular or nearly singular—as arises, for example, when there are more neighbors than input dimensions ($K > D$), or when the data points are not in general position—it can be conditioned (before solving the system) by adding a small multiple of the identity matrix,

\[
  C_{jk} \;\leftarrow\; C_{jk} + \delta_{jk}\, \Delta   \qquad (6)
\]

where $\Delta$ is small compared to the trace of $C$. This amounts to penalizing large
weights that exploit correlations beyond some level of precision in the data sam-
pling process.
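The equivalence of the two routes, the closed form of eq. (5) versus solving the linear system and rescaling, is easy to check numerically; the sketch below uses random data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 6, 10
x = rng.normal(size=D)                 # a data point ...
eta = rng.normal(size=(K, D))          # ... and its K nearest neighbors

# Local covariance matrix, eq. (4).
Z = x - eta
C = Z @ Z.T

# Route 1: closed form of eq. (5) via the inverse covariance matrix.
Cinv = np.linalg.inv(C)
w_closed = Cinv.sum(axis=1) / Cinv.sum()

# Route 2: solve C w = 1, then rescale the weights to sum to one.
w_solve = np.linalg.solve(C, np.ones(K))
w_solve /= w_solve.sum()

print(np.allclose(w_closed, w_solve))  # True
```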

B Eigenvector Problem
The embedding vectors $\vec{Y}_i$ are found by minimizing the cost function, eq. (2), for fixed weights $W_{ij}$:

\[
  \Phi(Y) \;=\; \sum_i \Big| \vec{Y}_i - \sum_j W_{ij}\, \vec{Y}_j \Big|^2   \qquad (7)
\]

Note that the cost defines a quadratic form,

\[
  \Phi(Y) \;=\; \sum_{ij} M_{ij} \, \big( \vec{Y}_i \cdot \vec{Y}_j \big)
\]

involving inner products of the embedding vectors and the $N \times N$ matrix $M$:

\[
  M_{ij} \;=\; \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}   \qquad (8)
\]

where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise.

This optimization is performed subject to constraints that make the problem well posed. It is clear that the coordinates $\vec{Y}_i$ can be translated by a constant displacement without affecting the cost, $\Phi(Y)$. We remove this degree of freedom by requiring the coordinates to be centered on the origin:

\[
  \sum_i \vec{Y}_i \;=\; \vec{0}   \qquad (9)
\]

Also, to avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with outer products that satisfy

\[
  \frac{1}{N} \sum_i \vec{Y}_i \otimes \vec{Y}_i \;=\; I   \qquad (10)
\]

where $I$ is the $d \times d$ identity matrix. Note that there is no loss in generality in constraining the covariance of $\vec{Y}$ to be diagonal and of order unity, since the cost
function in eq. (2) is invariant to rotations and homogeneous rescalings. The further
constraint that the covariance is equal to the identity matrix expresses an assump-
tion that reconstruction errors for different coordinates in the embedding space
should be measured on the same scale.

The optimal embedding—up to a global rotation of the embedding space—is found by computing the bottom $d+1$ eigenvectors of the matrix $M$; this is a version of the Rayleigh-Ritz theorem[12]. The bottom eigenvector of this matrix, which we discard, is the unit vector with all equal components; it represents a free translation mode of eigenvalue zero. Discarding this eigenvector enforces the constraint that the embeddings have zero mean, since the components of other eigenvectors must sum to zero, by virtue of orthogonality. The remaining $d$ eigenvectors form the $d$ embedding coordinates found by LLE.
Note that the bottom $d+1$ eigenvectors of the matrix $M$ (that is, those corresponding to its smallest eigenvalues) can be found without performing a full matrix diagonalization[14]. Moreover, the matrix $M$ can be stored and manipulated as the sparse symmetric matrix

\[
  M \;=\; (I - W)^{\mathsf T} (I - W)   \qquad (11)
\]
giving substantial computational savings for large values of $N$. In particular, left multiplication by $M$ (the subroutine required by most sparse eigensolvers) can be performed as

\[
  M \vec{v} \;=\; \big( \vec{v} - W \vec{v} \big) - W^{\mathsf T} \big( \vec{v} - W \vec{v} \big)   \qquad (12)
\]

requiring just one multiplication by $W$ and one multiplication by $W^{\mathsf T}$, both of which are extremely sparse. Thus, the matrix $M$ never needs to be explicitly created or stored; it is sufficient to store and multiply the matrix $W$.
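A sketch of this strategy using scipy's sparse tools (our own illustration; the solver options are choices of ours): the LinearOperator implements eq. (12), so the matrix M is never formed explicitly.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def bottom_embedding(W, d):
    """Bottom d+1 eigenvectors of M = (I - W)^T (I - W) without forming M.

    W is an (N, N) weight matrix (dense or sparse) whose rows sum to one.
    """
    N = W.shape[0]
    W = sp.csr_matrix(W)                     # store only the N*K nonzero weights

    def matvec(v):
        # Eq. (12): M v = (v - W v) - W^T (v - W v).
        r = v - W @ v
        return r - W.T @ r

    M = LinearOperator((N, N), matvec=matvec, dtype=float)
    vals, vecs = eigsh(M, k=d + 1, which='SM')   # smallest eigenvalues
    order = np.argsort(vals)
    return vecs[:, order[1:]]                # discard the constant eigenvector
```

In practice, building the sparse matrix of eq. (11) explicitly and handing it to a shift-invert eigensolver tends to converge faster; the operator form above is simply the most memory-frugal variant.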

C LLE from Pairwise Distances

LLE can be applied to user input in the form of pairwise distances. In this case,
nearest neighbors are identified by the smallest non-zero elements of each row in
the distance matrix. To derive the reconstruction weights for each data point, we need to compute the local covariance matrix $C_{jk}$ between its nearest neighbors,
as defined by eq. (4) in appendix A. This can be done by exploiting the usual
relation between pairwise distances and dot products that forms the basis of metric
MDS[2]. Thus, for a particular data point, we set:
\[
  C_{jk} \;=\; \frac{1}{2} \big( D_j + D_k - D_{jk} - D_0 \big)   \qquad (13)
\]

where $D_{jk}$ denotes the squared distance between the $j$th and $k$th neighbors, $D_j = \frac{1}{K} \sum_k D_{jk}$, and $D_0 = \frac{1}{K^2} \sum_{jk} D_{jk}$. In terms of this local covariance matrix, the
reconstruction weights for each data point are given by eq. (5). The rest of the
algorithm proceeds as usual.
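A sketch of this conversion (our own code, written against the reconstruction of eq. (13) above; D2 is a name we introduce for the submatrix of squared pairwise distances between one point's neighbors):

```python
import numpy as np

def local_covariance_from_distances(D2):
    """Eq. (13): local covariance matrix from squared neighbor distances.

    D2 : (K, K) matrix of squared pairwise distances between the K
         nearest neighbors of one data point.
    """
    Dj = D2.mean(axis=1)                   # D_j = (1/K)   sum_k  D_jk
    D0 = D2.mean()                         # D_0 = (1/K^2) sum_jk D_jk
    return 0.5 * (Dj[:, None] + Dj[None, :] - D2 - D0)
```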
Note that this variant of LLE requires significantly less user input than the com-
plete matrix of pairwise distances. Instead, for each data point, the user needs
only to specify its nearest neighbors and the submatrix of pairwise distances be-
tween those neighbors. Is it possible to recover manifold structure from even less
user input—say, just the pairwise distances between each data point and its near-
est neighbors? A simple counterexample shows that this is not possible. Consider the square lattice of three dimensional data points whose integer coordinates sum to zero. Imagine that points with even $x$-coordinates are colored black, and that points with odd $x$-coordinates are colored red. The “two point” embedding that maps all
black points to the origin and all red points to one unit away preserves the distance
between each point and its four nearest neighbors. Nevertheless, this embedding
completely fails to preserve the underlying structure of the original manifold.

Acknowledgements

The authors thank E. Cosatto, H.P. Graf, and Y. LeCun (AT&T Labs) and B. Frey
(U. Toronto) for providing data for these experiments. S. Roweis acknowledges
the support of the Gatsby Charitable Foundation, the National Science Foundation,
and the National Sciences and Engineering Research Council of Canada.

References
[1] I.T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1989).
[2] T. Cox and M. Cox. Multidimensional Scaling (Chapman & Hall, London, 1994).
[3] E. Cosatto and H.P. Graf. Sample-Based Synthesis of Photo-Realistic Talking-Heads.
Proceedings of Computer Animation, 103–110. IEEE Computer Society (1998).
[4] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear
embedding. Science 290, 2323-2326 (2000).
[5] K. Fukunaga and D. R. Olsen, An algorithm for finding intrinsic dimensionality of
data. IEEE Transactions on Computers 20(2), 176-193 (1971).
[6] N. Kambhatla and T. K. Leen. Dimension reduction by local principal component
analysis. Neural Computation 9, 1493–1516 (1997).
[7] D. DeMers and G.W. Cottrell. Nonlinear dimensionality reduction. In Advances in
Neural Information Processing Systems 5, D. Hanson, J. Cowan, L. Giles, Eds. (Mor-
gan Kaufmann, San Mateo, CA, 1993), pp. 580–587.
[8] M. Kramer. Nonlinear principal component analysis using autoassociative neural net-
works. AIChE Journal 37, 233–243 (1991).
[9] T. Kohonen. Self-organization and Associative Memory (Springer-Verlag, Berlin,
1988).
[10] C. Bishop, M. Svensen, and C. Williams. GTM: The generative topographic mapping.
Neural Computation 10, 215–234 (1998).
[11] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science 290, 2319-2323 (2000).
[12] R. A. Horn and C. R. Johnson. Matrix Analysis (Cambridge University Press, Cam-
bridge, 1990).
[13] J. H. Friedman, J. L. Bentley and R. A. Finkel. An algorithm for finding best matches
in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3),
209-226 (1977).
[14] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the
Solution of Algebraic Eigenvalue Problems: A Practical Guide (Society for Industrial
and Applied Mathematics, Philadelphia, 2000).
