Charting a manifold
Matthew Brand
We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional
vector space, effectively recovering a Cartesian coordinate system for the manifold from which
the data is sampled. The mapping preserves local geometric relations in the manifold and is
pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from
samples, decompose the sample data into locally linear low-dimensional patches, merge these
patches into a single low-dimensional coordinate system, and compute forward and reverse mappings
between the sample and coordinate spaces. The objective functions are convex and their
solutions are given in closed form.
Presented at NIPS-15, December 2002. Appears in Proceedings, Neural Information Processing Systems, volume 15. MIT Press, March 2003.
[Figure 1 plots; axis label: #points (log scale)]
Figure 1: Point growth processes. LEFT: At the locally linear scale, the number of points
in an r-ball grows as $r^d$; at noise and curvature scales it grows faster. RIGHT: Using the
point-count growth process to find the intrinsic dimensionality of a 2D manifold nonlinearly
embedded in 3-space (see figure 2). Lines of slope 1/3, 1/2, and 1 are fitted to sections of the
$\log r / \log n_r$ curve. For neighborhoods of radius r ≈ 1 with roughly n ≈ 10 points, the slope
peaks at 1/2, indicating a dimensionality of d = 2. Below that, the data appears 3D because
it is dominated by noise (except for n ≤ D points); above, the data appears >2D because of
manifold curvature. As the r-ball expands to cover the entire data set, the dimensionality
appears to drop to 1 as the process begins to track the 1D edges of the 2D sheet.
the manifold. For low-dimensional manifolds such as sheets, the boundary submanifolds
(edges and corners) are very small relative to the full manifold, so the boundary effect is
typically limited to a small rise in c(r) as r approaches the scale of the entire data set. In
practice, our code simply expands an r-ball at every point and looks for the first peak in
c(r), averaged over many nearby r-balls. One can estimate d and r globally or per-point.
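The point-growth estimate can be illustrated with a short sketch. It is a simplification rather than the procedure described above: instead of searching for the first peak of c(r), it fits a single slope to log r versus log n over a range of neighbor counts that is assumed to lie at the locally linear scale. The function name `estimate_dimension` and the parameters `k_min` and `k_max` are hypothetical.

```python
import numpy as np

def estimate_dimension(Y, k_min=10, k_max=100):
    """Estimate intrinsic dimension from the point-count growth rate.

    Simplification (assumption): rather than locating the first peak of c(r),
    fit one slope to log r versus log n over neighbor counts k_min..k_max,
    a range assumed to lie at the locally linear scale.
    """
    Y = np.asarray(Y, dtype=float)
    # All pairwise Euclidean distances (adequate for a few thousand points).
    diff = Y[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Row i, column k-1: radius of the ball around point i that holds k neighbors.
    radii = np.sort(dist, axis=1)[:, 1:]
    ks = np.arange(k_min, k_max + 1)
    # Average the log-radius over all r-balls with the same neighbor count.
    mean_log_r = np.log(radii[:, ks - 1]).mean(axis=0)
    # At the locally linear scale n grows as r^d, so the slope is about 1/d.
    slope = np.polyfit(np.log(ks), mean_log_r, 1)[0]
    return 1.0 / slope
```

On a nearly flat 2D sheet embedded in 3D, the returned value should be close to 2; boundary effects and finite sampling bias it slightly, which is one reason a peak search averaged over many r-balls is preferable in practice.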
Each gaussian component defines a local neighborhood centered around $\mu_j$ with axes defined
by the eigenvectors of $\Sigma_j$. The amount of data variance along each axis is indicated
by the eigenvalues of $\Sigma_j$; if the data manifold is locally linear in the vicinity of $\mu_j$, all
but the d dominant eigenvalues will be near-zero, implying that the associated eigenvectors
constitute the optimal variance-preserving local coordinate system. To some degree
likelihood maximization will naturally realize this property: It requires that the GMM components
shrink in volume to fit the data as tightly as possible, which is best achieved by
positioning the components so that they “pancake” onto locally flat collections of data-points.
However, this state of affairs is easily violated by degenerate (zero-variance) GMM
components or components fitted to overly small locales where the data density off
the manifold is comparable to density on the manifold (e.g., at the noise scale). Consequently
a prior is needed.
Criterion (2) implies that neighboring partitions should have dominant axes that span sim-
ilar subspaces, since disagreement (large subspace angles) would lead to inconsistent pro-
jections of a point and therefore uncertainty about its location in a low-dimensional co-
ordinate space. The principal insight is that criterion (2) is exactly the cost of coding the
location of a point in one neighborhood when it is generated by another neighborhood—the
cross-entropy between the gaussian models defining the two neighborhoods:
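$$
\begin{aligned}
D(\mathcal{N}_1 \,\|\, \mathcal{N}_2) &= \int dy\, \mathcal{N}(y;\mu_1,\Sigma_1) \log \frac{\mathcal{N}(y;\mu_1,\Sigma_1)}{\mathcal{N}(y;\mu_2,\Sigma_2)} \\
&= \left(\log|\Sigma_1^{-1}\Sigma_2| + \operatorname{trace}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^\top \Sigma_2^{-1} (\mu_2-\mu_1) - D\right)/2. \qquad (2)
\end{aligned}
$$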
Roughly speaking, the terms in (2) measure differences in size, orientation, and position,
respectively, of two coordinate frames located at the means µ1 , µ2 with axes specified by
the eigenvectors of Σ1 , Σ2 . All three terms decline to zero as the overlap between the two
frames is maximized. To maximize consistency between adjacent neighborhoods, we form
the prior $p(\mu,\Sigma) = \exp[-\sum_{i \neq j} m_i(\mu_j)\, D(\mathcal{N}_i \| \mathcal{N}_j)]$, where $m_i(\mu_j)$ is a measure of co-locality.
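As a concrete illustration of the closed form in (2), the following sketch evaluates the divergence between two gaussians and keeps the size, orientation, and position terms separate. The function name `gaussian_kl` is hypothetical, and full-rank covariances are assumed.

```python
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """D(N1 || N2) for two D-dimensional gaussians, term by term as in (2).

    Assumes both covariances are full rank.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    D = mu1.shape[0]
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    size = logdet2 - logdet1                             # log|cov1^-1 cov2|
    orientation = np.trace(np.linalg.solve(cov2, cov1))  # trace(cov2^-1 cov1)
    dmu = mu2 - mu1
    position = dmu @ np.linalg.solve(cov2, dmu)          # Mahalanobis term
    return 0.5 * (size + orientation + position - D)
```

The divergence is zero when the two gaussians coincide and grows as their sizes, orientations, or means drift apart.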
Unlike global coordination [8], we are not asking that the dominant axes in neighboring
charts are aligned—only that they span nearly the same subspace. This is a much easier
objective to satisfy, and it contains a useful special case where the posterior
$p(\mu,\Sigma|Y) \propto \sum_i p(y_i|\mu,\Sigma)\,p(\mu,\Sigma)$ is unimodal and can be maximized in closed form: Let us associate a
gaussian neighborhood with each data-point, setting $\mu_i = y_i$; take all neighborhoods to be
a priori equally probable, setting $p_i = 1/N$; and let the co-locality measure be determined
from some local kernel. For example, in this paper we use $m_i(\mu_j) \propto \mathcal{N}(\mu_j; \mu_i, \sigma^2)$, with
the scale parameter σ specifying the expected size of a neighborhood on the manifold in
sample space. A reasonable choice is σ = r/2, so that 2erf(2) > 99.5% of the density of
$m_i(\mu_j)$ is contained in the area around $y_i$ where the manifold is expected to be locally linear.
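The co-locality kernel with σ = r/2 can be sketched as follows. The text specifies the kernel only up to proportionality, so normalizing each row to sum to one is an assumption made here, as are the isotropic form of the kernel and the function name `colocality_weights`.

```python
import numpy as np

def colocality_weights(Y, r):
    """Matrix M with M[i, j] proportional to N(mu_j; mu_i, sigma^2).

    Assumptions: mu_i = y_i, sigma = r / 2, an isotropic kernel, and each
    row rescaled to sum to 1 (the text fixes the kernel only up to scale).
    """
    Y = np.asarray(Y, dtype=float)
    sigma = r / 2.0
    # Squared Euclidean distance between every pair of neighborhood centers.
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    M = np.exp(-0.5 * sq_dist / sigma ** 2)
    return M / M.sum(axis=1, keepdims=True)
```

With this choice, points farther than about r from $y_i$ receive negligible weight, so each neighborhood is informed mainly by data at the locally linear scale.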
With uniform $p_i$, and with $\mu_i$ and $m_i(\mu_j)$ fixed, the MAP estimates of the GMM covariances are
$$
\Sigma_i = \frac{\sum_j m_i(\mu_j)\left((y_j-\mu_i)(y_j-\mu_i)^\top + (\mu_j-\mu_i)(\mu_j-\mu_i)^\top + \Sigma_j\right)}{\sum_j m_i(\mu_j)}. \qquad (3)
$$
Note that each covariance Σi is dependent on all other Σ j . The MAP estimators for all
covariances can be arranged into a set of fully constrained linear equations and solved ex-
actly for their mutually optimal values. This key step brings nonlocal information about
the manifold’s shape into the local description of each neighborhood, ensuring that ad-
joining neighborhoods have similar covariances and small angles between their respective
subspaces. Even if a local subset of data points is dense in a direction perpendicular to
the manifold, the prior encourages the local chart to orient parallel to the manifold as part
of a globally optimal solution, protecting against a pathology noted in [8]. Equation (3) is
easily adapted to give a reduced number of charts and/or charts centered on local centroids.
where $p_{k|y}(y) \propto p_k\,\mathcal{N}(y;\mu_k,\Sigma_k)$, with $\sum_k p_{k|y}(y) = 1$, is the probability that chart k generates
point y. As pointed out in [8], if a point has nonzero probabilities in two charts, then there
should be affine transforms of those two charts that map the point to the same place in a
global coordinate space. We set this up as a weighted least-squares problem:
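$$
G = [G_1, \cdots, G_K] = \arg\min_{G_k, G_j} \sum_i p_{k|y}(y_i)\, p_{j|y}(y_i) \left\| G_k \begin{bmatrix} u_{ki} \\ 1 \end{bmatrix} - G_j \begin{bmatrix} u_{ji} \\ 1 \end{bmatrix} \right\|_F^2. \qquad (5)
$$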
solutions, then equation (8) is solved (up to rotation in coordinate space) by setting $G^\top$ to
the eigenvectors associated with the smallest eigenvalues of $QQ^\top$. The eigenvectors can be
computed efficiently without explicitly forming $QQ^\top$; other numerical efficiencies obtain
by zeroing any vanishingly small probabilities in each $P_k$, yielding a sparse eigenproblem.
A more interesting strategy is to numerically condition the problem by calculating the
trailing eigenvectors of $QQ^\top + 1$. It can be shown that this maximizes the posterior
$p(G|Q) \propto p(Q|G)\,p(G) \propto e^{-\|GQ\|_F} e^{-\|G1\|}$, where the prior p(G) favors a mapping G
whose unit-norm rows are also zero-mean. This maximizes variance in each row of G
and thereby spreads the projected points broadly and evenly over coordinate space.
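One way to obtain the trailing eigenvectors without forming the product of Q with its transpose is through the singular value decomposition of Q. The sketch below is a dense, small-scale illustration under that assumption; it does not include the conditioning term or exploit sparsity, and the function name `trailing_eigenvectors` is hypothetical.

```python
import numpy as np

def trailing_eigenvectors(Q, d):
    """Return G whose d rows are the eigenvectors of Q Q^T with the smallest
    eigenvalues, computed from the SVD of Q so that Q Q^T is never formed."""
    Q = np.asarray(Q, dtype=float)
    # The left singular vectors of Q are eigenvectors of Q Q^T. Singular
    # values are returned in descending order, so the last columns of U
    # belong to the smallest eigenvalues (including any null directions).
    U, _, _ = np.linalg.svd(Q, full_matrices=True)
    return U[:, -d:].T
```

The rows of the returned matrix are orthonormal, and among all such matrices it minimizes the Frobenius norm of GQ, which is the quantity appearing in the likelihood term of the posterior.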
The solutions for MAP charts (equation (3)) and connection (equation (8)) can be applied
to any well-fitted mixture of gaussians/factors¹/PCAs density model; thus large eigenproblems
can be avoided by connecting just a small number of charts that cover the data.
¹ We thank reviewers for calling our attention to Teh & Roweis ([11], in this volume), which
shows how to connect a set of given local dimensionality reducers in a generalized eigenvalue problem
that is related to equation (8).
[Figure 2 panels: original data; embedding, XY view; XYZ view; data (linked), XY view and XZ view; random subset of local charts; LLE, n=5 through n=10; charting (projection onto coordinate space); reconstruction (back-projected coordinate grid); charting; best Isomap; best LLE]
Figure 2: The twisted curl problem. LEFT: Comparison of charting, ISOMAP, & LLE.
400 points are randomly sampled from the manifold with noise. Charting is the only
method that recovers the original space without catastrophes (folding), albeit with some
shear. RIGHT: The manifold is regularly sampled (with noise) to illustrate the forward
and backward projections. Samples are shown linked into lines to help visualize the manifold
structure. Coordinate axes of a random selection of charts are shown as bold lines.
Connecting subsets of charts such as this will also give good mappings. The upper right
quadrant shows various LLE results. At bottom we show the charting solution and the
reconstructed (back-projected) manifold, which smooths out the noise.
Once the connection is solved, equation (4) gives the forward projection of any point y
down into coordinate space. There are several numerically distinct candidates for the back-
projection: posterior mean, mode, or exact inverse. In general, there may not be a unique
posterior mode and the exact inverse is not solvable in closed form (this is also true of [8]).
Note that chart-wise projection defines a complementary density in coordinate space
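$$
p_{x|k}(x) = \mathcal{N}\!\left(x;\; G_k \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; G_k \begin{bmatrix} [I_d, 0]\,\Lambda_k\,[I_d, 0]^\top & 0 \\ 0 & 0 \end{bmatrix} G_k^\top \right). \qquad (9)
$$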
Let p(y|x, k), used to map x into subspace k on the surface of the manifold, be a Dirac delta
function whose mean is a linear function of x. Then the posterior mean back-projection is
obtained by integrating out uncertainty over which chart generates x:
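$$
y|x = \sum_k p_{k|x}(x) \left( \mu_k + W_k \left( G_k \begin{bmatrix} I \\ 0 \end{bmatrix} \right)^{+} \left( x - G_k \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right) \right), \qquad (10)
$$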
where (·)+ denotes pseudo-inverse. In general, a back-projecting map should not recon-
struct the original points. Instead, equation (10) generates a surface that passes through the
weighted average of the $\mu_i$ of all the neighborhoods in which $y_i$ has nonzero probability,
much like a principal curve passes through the center of each local group of points.
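The posterior-mean back-projection can be sketched as follows, under several assumptions that the text does not state in this form: each $G_k$ is stored as a d × (d+1) array whose first d columns multiply the chart-local coordinates and whose last column is the offset; $W_k$ is a D × d matrix of local chart axes; $\Lambda_k$ is a matrix of eigenvalues of which only the leading d × d block is used; and the chart posterior is taken as $p_{k|x}(x) \propto p_k\, p_{x|k}(x)$. The function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a full-rank gaussian at a single point x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.shape[0] * np.log(2 * np.pi) + logdet + maha)

def back_project(x, mus, Ws, Gs, Lams, priors=None):
    """Posterior-mean back-projection of a coordinate-space point x.

    Assumed shapes: mus[k] is (D,), Ws[k] is (D, d), Gs[k] is (d, d+1),
    Lams[k] is at least (d, d).
    """
    x = np.asarray(x, dtype=float)
    K, d = len(mus), x.shape[0]
    priors = np.full(K, 1.0 / K) if priors is None else np.asarray(priors, float)
    log_w = np.empty(K)
    candidates = []
    for k in range(K):
        A, b = Gs[k][:, :d], Gs[k][:, d]      # G_k [I; 0] and G_k [0; 1]
        cov = A @ Lams[k][:d, :d] @ A.T       # chart covariance in coordinate space
        log_w[k] = np.log(priors[k]) + gaussian_logpdf(x, b, cov)
        # Undo the chart's affine map with a pseudo-inverse, then lift to sample space.
        candidates.append(mus[k] + Ws[k] @ np.linalg.pinv(A) @ (x - b))
    w = np.exp(log_w - log_w.max())           # stable normalization of chart posteriors
    w /= w.sum()
    return sum(wk * yk for wk, yk in zip(w, candidates))
```

When neighboring charts agree on where a coordinate-space point lies in sample space, the weighted average reduces to their common prediction; where they disagree, it blends them according to the chart posteriors.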
5 Experiments
Synthetic examples: 400 2D points were randomly sampled from a 2D square and embedded
in 3D via a curl and twist, then contaminated with gaussian noise. Even if noiselessly
sampled, this manifold cannot be “unrolled” without distortion. In addition, the outer curl
is sampled much less densely than the inner curl. With an order of magnitude fewer points,
higher noise levels, no possibility of an isometric mapping, and uneven sampling, this is
arguably a much more challenging problem than the “swiss roll” and “s-curve” problems
featured in [12, 9, 8, 1]. Figure 2 (LEFT) contrasts the (unique) output of charting and the
best outputs obtained from ISOMAP and LLE (considering all neighborhood sizes between
2 and 20 points). ISOMAP and LLE show catastrophic folding; we had to change LLE’s
regularization in order to coax out nondegenerate (>1D) solutions. Although charting is
[Figure 3 panels: a. data, xy view; b. data, yz view; c. local charts; d. 2D embedding; e. 1D embedding (1D ordinate against true manifold arc length)]
Figure 3: Untying a trefoil knot by charting. 900 noisy samples from a 3D-embedded
1D manifold are shown as connected dots in front (a) and side (b) views. A subset of charts
is shown in (c). Solving for the 2D connection gives the “unknot” in (d). After removing
some points to cut the knot, charting gives a 1D embedding which we plot against true
manifold arc length in (e); monotonicity (modulo noise) indicates correctness.
[Figure 4 panel title: images synthesized via backprojection of straight lines in coordinate space]
Figure 4: Modeling the manifold of facial images from raw video. Each row contains
images synthesized by back-projecting an axis-parallel straight line in coordinate space
onto the manifold in image space. Blurry images correspond to points on the manifold
whose neighborhoods contain few if any nearby data points.
not designed for isometry, after affine transform the forward-projected points disagree with
the original points with an RMS error of only 1.0429, lower than the best LLE (3.1423) or
best ISOMAP (1.1424, not shown). Figure 2 (RIGHT) shows the same problem where points
are sampled regularly from a grid, with noise added before and after embedding. Figure 3
shows a similar treatment of a 1D line that was threaded into a 3D trefoil knot, contaminated
with gaussian noise, and then “untied” via charting.
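An error measure of the kind quoted above can be sketched as a least-squares affine fit followed by a root-mean-square residual. The exact convention behind the reported figures is not stated, so this sketch assumes the RMS is taken over per-point Euclidean residuals; the function name `affine_rms_error` is hypothetical.

```python
import numpy as np

def affine_rms_error(X_embed, X_true):
    """RMS disagreement between an embedding and the original coordinates
    after the best least-squares affine alignment.

    Assumption: the RMS is taken over per-point Euclidean residual norms.
    """
    X_embed = np.asarray(X_embed, dtype=float)
    X_true = np.asarray(X_true, dtype=float)
    # Append a column of ones so the fit includes a translation.
    A = np.hstack([X_embed, np.ones((len(X_embed), 1))])
    coef, *_ = np.linalg.lstsq(A, X_true, rcond=None)
    residual = A @ coef - X_true
    return np.sqrt((residual ** 2).sum(axis=1).mean())
```

Because the fit absorbs any shear, scaling, rotation, and translation, an embedding that differs from the original coordinates only by such a map scores an error of zero.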
Video: We obtained a 1965-frame video sequence (courtesy S. Roweis and B. Frey) of
20 × 28-pixel images in which B.F. strikes a variety of poses and expressions. The video
is heavily contaminated with synthetic camera jitters. We used raw images, though image
processing could have removed this and other uninteresting sources of variation. We took a
500-frame subsequence and left-right mirrored it to obtain 1000 points in 20 × 28 = 560D
image space. The point-growth process peaked just above d = 3 dimensions. We solved for
25 charts, each centered on a random point, and a 3D connection. The recovered degrees
of freedom—recognizable as pose, scale, and expression—are visualized in figure 4.
Figure 5: Flattening a fishbowl. From the left: Original 2000×2D points; their stereo-
graphic mapping to a 3D fishbowl; its 2D embedding recovered using 500 charts; and the
stereographic map. Fewer charts lead to isometric mappings that fold the bowl (not shown).
Conformality: Some manifolds can be flattened conformally (preserving local angles) but
not isometrically. Figure 5 shows that if the data is finely charted, the connection behaves
more conformally than isometrically. This problem was suggested by J. Tenenbaum.
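For readers who wish to construct a comparable test surface, the sketch below maps 2D points onto the unit sphere by inverse stereographic projection and back again. Projecting from the north pole of a unit sphere is an assumption; the scale and orientation used for the fishbowl in figure 5 are not specified. Both function names are hypothetical.

```python
import numpy as np

def inverse_stereographic(P):
    """Map 2D points onto the unit sphere (projection from the north pole)."""
    P = np.asarray(P, dtype=float)
    s = (P ** 2).sum(axis=1, keepdims=True)    # squared radius of each point
    return np.hstack([2 * P, s - 1]) / (s + 1)

def stereographic(S):
    """Map points on the unit sphere (north pole excluded) back to the plane."""
    S = np.asarray(S, dtype=float)
    return S[:, :2] / (1 - S[:, 2:3])
```

Stereographic projection preserves local angles but not distances, which is why this surface admits a conformal flattening but not an isometric one.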
6 Discussion
Charting breaks kernel-based NLDR into two subproblems: (1) Finding a set of data-
covering locally linear neighborhoods (“charts”) such that adjoining neighborhoods span
maximally similar subspaces, and (2) computing a minimal-distortion merger (“connec-
tion”) of all charts. The solution to (1) is optimal w.r.t. the estimated scale of local linearity
r; the solution to (2) is optimal w.r.t. the solution to (1) and the desired dimensionality d.
Both problems have Bayesian settings. By offloading the nonlinearity onto the kernels,
we obtain least-squares problems and closed form solutions. This scheme is also attractive
because large eigenproblems can be avoided by using a reduced set of charts.
The dependence on r, as in trusted-set methods, is a potential source of solution instability.
In practice the point-growth estimate seems fairly robust to data perturbations (to be
expected if the data density changes slowly over a manifold of integral Hausdorff dimen-
sion), while the use of a soft neighborhood partitioning appears to make charting solutions
reasonably stable to variations in r. Eigenvalue stability analyses may prove useful here.
Ultimately, we would prefer to integrate r out. In contrast, use of d appears to be a virtue:
Unlike other eigenvector-based methods, the best d-dimensional embedding is not merely
a linear projection of the best (d+1)-dimensional embedding; a unique distortion is found
for each value of d that maximizes the information content of its embedding.
Why does charting perform well on datasets where the signal-to-noise ratio confounds
recent state-of-the-art methods? Two reasons may be adduced: (1) Nonlocal information
is used to construct both the system of local charts and their global connection. (2) The
mapping only preserves the component of local point-to-point distances that project onto
the manifold; relationships perpendicular to the manifold are discarded. Thus charting uses
global shape information to suppress noise in the constraints that determine the mapping.
Thanks to J. Buhmann, S. Makar, S. Roweis, J. Tenenbaum, and anonymous reviewers for
insightful comments and suggested “challenge” problems.
